UTF-16 vs UTF-32

# Shawn Steele (13 years ago)

It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally. (Sure, it'd be nice for conversion types).

What UTF-32 has that UTF-16 doesn't is the ability to walk a string without accidentally chopping up a surrogate pair. However, in practice, stepping over surrogates is pretty much the least of the problems with walking a string. Combining characters and the like cause numerous typographical shapes/glyphs to be represented by more than one Unicode codepoint, even in UTF-32. We don't see that in Latin so much, especially in NFC, but in some scripts most characters require multiple code points.

In other words, if I'm trying to find "safe" places to break a string, append text, or many other operations, then UTF-16 is no more complicated than UTF-32, even when considering surrogates.
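
For illustration, a minimal sketch of that point (the sample strings are arbitrary): a surrogate pair and a combining sequence each span more than one index, so a "break anywhere" assumption is unsafe in either representation.

```js
// Both strings display as a single user-perceived character.
var surrogatePair = '\uD835\uDC9C';  // U+1D49C MATHEMATICAL SCRIPT CAPITAL A
var combiningSeq  = '\u0041\u0308';  // 'A' followed by U+0308 COMBINING DIAERESIS

surrogatePair.length;  // 2 -- breaking at index 1 splits the surrogate pair
combiningSeq.length;   // 2 -- breaking at index 1 orphans the combining mark

// With 32-bit units the second case is still two code points,
// so "safe break point" logic has to look beyond single units either way.
```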

UTF-32 would cause a huge amount of ambiguity though about what happens to all of those UTF-16 sequences that currently sort-of work even though they shouldn't really because ES is nominally UCS-2.

# Allen Wirfs-Brock (13 years ago)

On May 16, 2011, at 5:42 PM, Shawn Steele wrote:

> It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally. (Sure, it'd be nice for conversion types).
>
> What UTF-32 has that UTF-16 doesn't is the ability to walk a string without accidentally chopping up a surrogate pair. However, in practice, stepping over surrogates is pretty much the least of the problems with walking a string. Combining characters and the like cause numerous typographical shapes/glyphs to be represented by more than one Unicode codepoint, even in UTF-32. We don't see that in Latin so much, especially in NFC, but in some scripts most characters require multiple code points.
>
> In other words, if I'm trying to find "safe" places to break a string, append text, or many other operations, then UTF-16 is no more complicated than UTF-32, even when considering surrogates.
>
> UTF-32 would cause a huge amount of ambiguity though about what happens to all of those UTF-16 sequences that currently sort-of work even though they shouldn't really because ES is nominally UCS-2.
>
> -Shawn

One reason is that none of the built-in string methods understand surrogate pairs. If you want to do any string processing that recognizes such pairs you have to either handle such pairs as multi-character sequences or do your own character-by-character processing.
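
For illustration, a rough sketch of what that looks like in practice (the sample string is arbitrary): the existing methods index, slice, and match by 16-bit units, so a supplementary character can be split apart.

```js
// 'x' + U+1F4A9 (a surrogate pair) + 'y'
var s = 'x\uD83D\uDCA9y';

s.length;             // 4, not 3
s.charAt(1);          // '\uD83D' -- an unpaired high surrogate
s.slice(0, 2);        // 'x\uD83D' -- cuts the pair in half
s.indexOf('\uDCA9');  // 2 -- matches the low surrogate on its own
```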

# Mike Samuel (13 years ago)

2011/5/16 Shawn Steele <Shawn.Steele at microsoft.com>:

> It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally. (Sure, it'd be nice for conversion types).

I don't think anyone says that UTF-32 is desirable internally. We're talking about the API of the string type.

I have been operating under the assumption that developers would benefit from a simple way to efficiently iterate by code point. An efficient version of the function below, ideally one that reads as naturally as the current `for (var i = 0, n = str.length; i < n; ++i) ...str[i]...` idiom.

// Calls fn(codePoint, codeUnitIndex, codePointIndex) once per code point,
// combining well-formed surrogate pairs; lone surrogates are passed through as-is.
function iterateByCodePoint(str, fn) {
  str += '';
  for (var i = 0, n = str.length, index = 0; i < n; ++i, ++index) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one supplementary code point.
    if (0xd800 <= unit && unit < 0xdc00 && i + 1 < n) {
      var next = str.charCodeAt(i + 1);
      if (0xdc00 <= next && next < 0xe000) {
        fn(0x10000 + (((unit & 0x3ff) << 10) | (next & 0x3ff)), i, index);
        ++i;  // skip the low surrogate
        continue;
      }
    }
    fn(unit, i, index);
  }
}
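
For illustration, a quick usage sketch of the function above (the sample string is arbitrary), showing how the code unit index and the code point index diverge once a supplementary character appears:

```js
// 'a' + U+1D49C (a surrogate pair) + 'b': three code points, four UTF-16 code units.
iterateByCodePoint('a\uD835\uDC9Cb', function (cp, unitIndex, pointIndex) {
  console.log(cp.toString(16), unitIndex, pointIndex);
});
// logs: 61 0 0
//       1d49c 1 1
//       62 3 2
```
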
# John Tamplin (13 years ago)

On Mon, May 16, 2011 at 8:42 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:

> It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally. (Sure, it'd be nice for conversion types).
>
> What UTF-32 has that UTF-16 doesn't is the ability to walk a string without accidentally chopping up a surrogate pair. However, in practice, stepping over surrogates is pretty much the least of the problems with walking a string. Combining characters and the like cause numerous typographical shapes/glyphs to be represented by more than one Unicode codepoint, even in UTF-32. We don't see that in Latin so much, especially in NFC, but in some scripts most characters require multiple code points.
>
> In other words, if I'm trying to find "safe" places to break a string, append text, or many other operations, then UTF-16 is no more complicated than UTF-32, even when considering surrogates.
>
> UTF-32 would cause a huge amount of ambiguity though about what happens to all of those UTF-16 sequences that currently sort-of work even though they shouldn't really because ES is nominally UCS-2.

Personally, I think UTF16 is more prone to error than either UTF8 or UTF32 -- in UTF32 there is a one-to-one correspondence, while in UTF8 it is obvious you have to deal with multi-byte encodings. With UTF16, most developers only run into BMP characters and just assume that there is a one-to-one correspondence between chars and characters. Then, when their code runs into non-BMP characters, they run into problems like a field whose size was restricted to a number of chars no longer being long enough, etc. The problems arise infrequently, which means many developers assume the problem doesn't exist.
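
For illustration, a small sketch of that pitfall (the helper and limit are hypothetical): a "maximum N characters" check written against .length actually counts UTF-16 code units, so non-BMP characters use up the budget twice as fast.

```js
// Hypothetical validation helper: "chars" here is really UTF-16 code units.
function fitsInField(value, maxChars) {
  return value.length <= maxChars;
}

fitsInField('abcde', 5);             // true: 5 characters, 5 code units
fitsInField('ab\uD835\uDC9Ccd', 5);  // false: 5 characters, but 6 code units
```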

# Shawn Steele (13 years ago)

And why do you care to iterate by code point? I mean, sure, it seems useful, but how useful?

Do you want to stop in the middle of Ä (an A followed by a combining umlaut)? You probably don't want to stop in the middle of Ä.

I have no problem with regular expressions or APIs that use 32-bit (well, 21-bit) values, but I think existing apps already have a preconceived notion of string indexing that it'd be bad to break. Yes, that means += 2 instead of ++ to get past a surrogate pair, but the same thing happens with Ä.
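
For illustration, a minimal sketch of that argument (the helper is hypothetical): code that steps in 16-bit units can special-case surrogate pairs with += 2, but a decomposed Ä needs exactly the same kind of special-casing.

```js
// Advance i past one "unit of text", surrogate-aware only.
function nextIndex(str, i) {
  var c = str.charCodeAt(i);
  if (0xd800 <= c && c < 0xdc00 && i + 1 < str.length) {
    return i + 2;  // step over the surrogate pair
  }
  return i + 1;    // ...but this still lands between 'A' and its combining mark
}

nextIndex('\uD835\uDC9C', 0);  // 2
nextIndex('\u0041\u0308', 0);  // 1 -- the "middle" of a decomposed Ä
```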

# Shawn Steele (13 years ago)

Allen said:

> One reason is that none of the built-in string methods understand surrogate pairs. If you want to do any string processing that recognizes such pairs you have to either handle such pairs as multi-character sequences or do your own character-by-character processing.

(Which I think is false security, see my earlier comment below)

John said:

> Personally, I think UTF16 is more prone to error than either UTF8 or UTF32 -- in UTF32 there is a one-to-one correspondence, while in UTF8 it is obvious you have to deal with multi-byte encodings. With UTF16, most developers only run into BMP characters and just assume that there is a one-to-one correspondence between chars and characters. Then, when their code runs into non-BMP characters, they run into problems like a field whose size was restricted to a number of chars no longer being long enough, etc. The problems arise infrequently, which means many developers assume the problem doesn't exist.

The "developers only run into BMP characters and just assume (it works)" is basically the extended problem with multiple codepoint sequences. Most developers also only think that there's a 1:1 relationship between code points and "glyphs". If I want to restrict a field to a number of characters, then it ends up being too short not only for surrogate characters, but also for scripts that rely heavily on combining sequences. In fact, the latter is probably even more likely since most of the surrogates aren't that common. Once your code can handle that surrogates is a very small part of the problem.

# Boris Zbarsky (13 years ago)

On 5/16/11 9:07 PM, John Tamplin wrote:

> Personally, I think UTF16 is more prone to error than either UTF8 or UTF32 -- in UTF32 there is a one-to-one correspondence

One-to-one correspondence between string code units and Unicode codepoints.

Unfortunately, "Unicode codepoint" is only a useful concept for some scripts... So you run into the same edge-case issues as UTF-16 does, but in somewhat fewer cases.

# Phillips, Addison (13 years ago)

> > Personally, I think UTF16 is more prone to error than either UTF8 or UTF32 -- in UTF32 there is a one-to-one correspondence
>
> One-to-one correspondence between string code units and Unicode codepoints.
>
> Unfortunately, "Unicode codepoint" is only a useful concept for some scripts... So you run into the same edge-case issues as UTF-16 does, but in somewhat fewer cases.

Not exactly. What is true is that, regardless of the Unicode encoding, a glyph on the screen may be comprised of multiple Unicode characters which, in turn, may be comprised of multiple code units.

I generally present this as:

- Glyph == single visual unit of text
- Character == single logical unit of text
- Code point == integer value assigned to a single character (logical unit of text); generally it is better to refer to "Unicode scalar value" here
- Code unit == single logical encoding unit of memory in a Unicode encoding form (where unit == byte, word, etc.)

So in UTF-8, the 'byte' is the code unit, which is used to encode "code points" [Unicode scalar values] (1 to 4 code units per code point). In UTF-16, the 'word' (16-bit unit) is the code unit, which is used to encode code points (1 to 2 code units per code point).
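
As a worked illustration of those counts (the character is arbitrary), here is one supplementary code point as seen from ES:

```js
// U+1F600: one code point / one Unicode scalar value.
var s = '\uD83D\uDE00';
s.length;                      // 2 -- two UTF-16 code units (a surrogate pair)
s.charCodeAt(0).toString(16);  // 'd83d' (high surrogate)
s.charCodeAt(1).toString(16);  // 'de00' (low surrogate)
// The same code point takes four UTF-8 code units (bytes): F0 9F 98 80.
```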

A glyph or "grapheme cluster" may require multiple characters to encode, so no part of Unicode can assume 1 glyph == 1 character == 1 code point == (x) code units. The values in the surrogate range D800-DFFF are valid Unicode code points, but not valid Unicode scalar values. Isolated surrogate code points (not part of a surrogate pair in the UTF-16 encoding) are never "well-formed" and surrogate code points are considered invalid in encoding forms other than UTF-16.

As a result, "Unicode code point" is a marginally useful concept in talking about specific character values in Unicode. But a code point is not and must not be confused with a code unit and is slightly worse than referring to a Unicode scalar value. When talking about a supplementary character, generally it is more useful to talk about the Unicode code point that the surrogate pair of UTF-16 code units encode, for example, than the Unicode code point of each surrogate.

Hope that helps.

Addison

Addison Phillips, Globalization Architect (Lab126), Chair (W3C I18N WG)

Internationalization is not a feature. It is an architecture.