Full Unicode strings strawman

# Allen Wirfs-Brock (14 years ago)

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.

Feedback would be appreciated:

strawman:support_full_unicode_in_strings

# Mike Samuel (14 years ago)

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason. Feedback would be appreciated: strawman:support_full_unicode_in_strings

Will this change the behavior of character groups in regular expressions? Would myString.match(/^.$/)[0].length ever have length 2? Would it ever match a supplemental codepoint?

How would the below, which replaces orphaned surrogates with U+FFFD when strings are viewed as sequences of UTF-16 code units behave?

myString.replace( /\ud800-\udbff/g, "\ufffd") .replace( /(^|[^\ud800-\udbff])([\udc00-\udffff])/g, "\ufffd")

# Shawn Steele (14 years ago)

Thanks for making a strawman

Unicode Escape Sequences Is it possible for U+ to accept either 4, 5, or 6 digit sequences? Typically when I encounter U+ notation the leading zero is omitted, and I see BMP characters quite often. Obviously BMP could use the U notation, however it seems like it'd be annoying to the occasional user to know that U is used for some and U+ for others. Seems like it'd be easier for developers to remember that U+ is "the new way" and U is "the old way that doesn't always work".

String Position It's unclear to me if the string indices can be "changed" from UTF-16 to UTF-32 positions. Although UTF-32 indices are clearly desirable, I think that many implementations currently allow UTF-16 codepoints U+D800 through U+DFFF. In other words, I can already have Javascript strings with full Unicode range data in them. Existing applications would then have indices that pointed to the UTF-16, not UTF-32 index. Changing the definition of the index to UTF-32 would break those applications I think.

You also touch on that with charCodeAt/codepointAt, which resolves the problem with the output type, but doesn't address the problem with the indexing. Similar to the way you differentiated charCode/codepoint, it may be necessary to differentiate charCode/codepoint indices. IMO .fromCharCode doesn't have this problem since it used to fail, but now works, which wouldn't be breaking. Unless we're concerned that now it can return a different UTF-16 length than before.
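For illustration, here is the kind of index mapping being discussed, as a hypothetical helper written against today's 16-bit string semantics (the name and behavior are an assumption, not part of the strawman):

    // Convert a UTF-16 code-unit index into a codepoint index by counting
    // each well-formed surrogate pair as a single codepoint.
    function codeUnitIndexToCodepointIndex(str, cuIndex) {
      var cpIndex = 0;
      for (var i = 0; i < cuIndex; i++, cpIndex++) {
        var c = str.charCodeAt(i);
        var next = (i + 1 < str.length) ? str.charCodeAt(i + 1) : 0;
        if (c >= 0xD800 && c <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
          i++; // skip the trail unit; the pair counts as one codepoint
        }
      }
      return cpIndex;
    }
    // codeUnitIndexToCodepointIndex("\ud800\udc00a", 2) === 1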

I don't like the "21" in the name of decodeURI21. Also, the "trick" I think, is encoding to surrogate pairs (illegally, since UTF8 doesn't allow that) vs decoding to UTF16. It seems like decoding can safely detect input supplementary characters and properly decode them, or is there something about encoding that doesn't make that state detectable?

-Shawn


# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 11:30 AM, Mike Samuel wrote:

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason. Feedback would be appreciated: strawman:support_full_unicode_in_strings

Will this change the behavior of character groups in regular expressions? Would myString.match(/^.$/)[0].length ever have length 2? Would it ever match a supplemental codepoint?

No, supplemental codepoints are single string characters and RegExp matching operates on such characters. A string could, of course, contain character sequences that correspond to UTF-8, UTF-16, or other multi-unit encodings. However, from the perspective of Strings and RegExp those encodings would be multiple character sequences just like they are today. The only ES functions currently proposed that would deal with multi-character encodings of supplemental codepoints are the URI handling functions. However, it may be a good idea to add string-to-string UTF-8 and UTF-16 encode/decode functions that simply do the encode/decode and don't have all the other processing involved in encodeURI/decodeURI.
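For illustration, a minimal sketch of the kind of string-to-string UTF-16 encoder mentioned above, assuming the strawman's per-element codepointAt and a variadic String.fromCodepoint (both assumptions, not settled API):

    // Expand any element > 0xFFFF into a surrogate pair; everything else
    // passes through unchanged.
    function encodeUTF16(s) {
      var out = "";
      for (var i = 0; i < s.length; i++) {
        var cp = s.codepointAt(i);                          // assumed accessor
        if (cp > 0xFFFF) {
          var v = cp - 0x10000;
          out += String.fromCodepoint(0xD800 + (v >> 10),   // lead surrogate
                                      0xDC00 + (v & 0x3FF)); // trail surrogate
        } else {
          out += String.fromCodepoint(cp);
        }
      }
      return out;
    }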

How would the below, which replaces orphaned surrogates with U+FFFD when strings are viewed as sequences of UTF-16 code units behave?

myString.replace( /\ud800-\udbff/g, "\ufffd") .replace( /(^|[^\ud800-\udbff])([\udc00-\udffff])/g, "\ufffd")

Exactly as it currently does, assuming it was applied to a string that didn't contain any codepoints greater than \uffff. If the string contained any codepoints > \uffff, those characters would not match the pattern and so would not be replaced.

The important thing to keep in mind here is that under this proposal, a supplemental codepoint is a single logical character. For example, using a random character that isn't in the BMP:

    "\u+02defc" === "\ud877\udefc";  // this is false
    "\u+02defc".length === 1;        // this is true
    "\ud877\udefc".length === 2;     // this is true

Existing code that manipulates surrogate pairs continues to work unmodified because such code is explicitly manipulating pairs of characters. However, such code might produce unexpected results if handed a string containing a codepoint > \uffff . But that takes an explicit action by someone to introduce such an enhanced character into a string.

# Shawn Steele (14 years ago)

myString.replace( /\ud800-\udbff/g, "\ufffd") .replace( /(^|[^\ud800-\udbff])([\udc00-\udffff])/g, "\ufffd")

Exactly as it currently does, assuming it was applied to a string that didn't contain any codepoints greater than \uffff. If the string contained any codepoints > \uffff, those characters would not match the pattern and so would not be replaced.

Isn't that breaking? I'm not sure how you can treat these characters distinctly as some code point from d800-dfff sometimes and as a codepoint > 0xffff at other times.

# Mike Samuel (14 years ago)

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

On May 16, 2011, at 11:30 AM, Mike Samuel wrote:

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason. Feed back would be appreciated: strawman:support_full_unicode_in_strings

Will this change the behavior of character groups in regular expressions?  Would myString.match(/^.$/)[0].length ever have length 2?   Would it ever match a supplemental codepoint?

No, supplemental codepoints are single string characters and RegExp matching operates on such characters. A string could, of course, contain character sequences that correspond to UTF-8, UTF-16, or other multi-unit encodings. However, from the perspective of Strings and RegExp those encodings would be multiple character sequences just like they are today. The only ES functions currently proposed that would deal with multi-character encodings of supplemental codepoints are the URI handling functions. However, it may be a good idea to add string-to-string UTF-8 and UTF-16 encode/decode functions that simply do the encode/decode and don't have all the other processing involved in encodeURI/decodeURI.

DOMString is defined at www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 thus

Type Definition DOMString
A DOMString is a sequence of 16-bit units.

so how would round tripping a JS string through a DOM string work?

How would

var oneSupplemental = "\U00010000";
alert(oneSupplemental.length);  //  alerts 1
var utf16Encoded = encodeUTF16(oneSupplemental);
alert(utf16Encoded.length);  //  alerts 2
var textNode = document.createTextNode(utf16Encoded);
alert(textNode.nodeValue.length);   // alerts ?

Does the DOM need to represent utf16Encoded internally so that it can report 2 as the length on fetch of nodeValue? If so, how can it represent that for systems that use a UTF-16 internal representation for DOMString?

# Mike Samuel (14 years ago)

2011/5/16 Shawn Steele <Shawn.Steele at microsoft.com>:

myString.replace( /\ud800-\udbff/g, "\ufffd")    .replace( /(^|[^\ud800-\udbff])([\udc00-\udffff])/g, "\ufffd")

My example code has typos. It should have read

myString.replace( /[\ud800-\udbff](?![\udc00-\udfff])/g, "\ufffd")
    .replace( /(^|[^\ud800-\udbff])([\udc00-\udfff])/g, "\ufffd")

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 11:34 AM, Shawn Steele wrote:

Thanks for making a strawman (see my very last sentence below as it may impact the interpretation of some of the rest of these responses)

Unicode Escape Sequences Is it possible for U+ to accept either 4, 5, or 6 digit sequences? Typically when I encounter U+ notation the leading zero is omitted, and I see BMP characters quite often. Obviously BMP could use the U notation, however it seems like it’d be annoying to the occasional user to know that U is used for some and U+ for others. Seems like it’d be easier for developers to remember that U+ is “the new way” and U is “the old way that doesn’t always work”.

The ES string literal notation doesn't really accommodate variable-length subtokens without explicit terminators. What would be the rules for parsing "\u+12345678"? How do we know if the programmer meant "\u1234"+"5678" or "\u0012"+"345678" or ...

There have been past proposals for a syntax like \u{xxxxxx} that could have 1 to 6 hex digits. In the past proposal the assumption was that it would produce UTF-16 surrogate pairs, but in this context we could adopt it instead of \u+ to produce a single character. The disadvantage is that it is a slightly longer sequence for actual large code points. On the other hand perhaps it is more readable? "\u+123456\u+123456" vs. "\u{123456}\u{123456}" ??

String Position It’s unclear to me if the string indices can be “changed” from UTF-16 to UTF-32 positions. Although UTF-32 indices are clearly desirable, I think that many implementations currently allow UTF-16 codepoints U+D800 through U+DFFF. In other words, I can already have Javascript strings with full Unicode range data in them. Existing applications would then have indices that pointed to the UTF-16, not UTF-32 index. Changing the definition of the index to UTF-32 would break those applications I think.

No it wouldn't break anything, at least when applied to existing data. Your existing code is explicitly doing UTF-16 processing. Somebody had to do the processing to create the surrogate pairs in the string. As long as you use that same agent you are still going to get UTF-16 encoded strings. Even though the underlying character values could hold single characters with codepoints > \uffff, the actual string won't unless somebody actually constructed the string to contain such values. That presumably doesn't happen for existing code.

The place where existing code might break is if somebody explicitly constructs a string (using \u+ literals or String.fromCodepoint) that contains non-BMP characters and passes it to routines that only expect 16-bit characters. For this reason, any existing host routines that convert external data resources to ES strings that contain surrogate pairs should probably continue to do so. New routines should be provided that produce single characters instead of pairs for non-BMP codepoints. However, the definition of such routines is outside the scope of the ES specification.

Finally, note that just as current strings can contain 16-bit character values that are not valid Unicode code points, the expanded full Unicode strings can also contain 21-bit character values that are not valid Unicode codepoints.

You also touch on that with charCodeAt/codepointAt, which resolves the problem with the output type, but doesn’t address the problem with the indexing. Similar to the way you differentiated charCode/codepoint, it may be necessary to differentiate charCode/codepoint indices. IMO .fromCharCode doesn’t have this problem since it used to fail, but now works, which wouldn’t be breaking. Unless we’re concerned that now it can return a different UTF-16 length than before.

Again, nothing changes. Code that expects to deal with multi-character encodings can still do so. What "magically" changes is that code that acts as if Unicode codepoints are only 16 bits (i.e., code that doesn't correctly deal with surrogate pairs) will now work with full 21-bit characters.

I don’t like the “21” in the name of decodeURI21.

Suggestions for better names are always welcome.

Also, the “trick” I think, is encoding to surrogate pairs (illegally, since UTF8 doesn’t allow that) vs decoding to UTF16. It seems like decoding can safely detect input supplementary characters and properly decode them, or is there something about encoding that doesn’t make that state detectable?

I think I'm missing the distinction you are making between surrogate pairs and UTF-16. I think I've been using the terms interchangeably. I may be munging up the terminology.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 12:28 PM, Mike Samuel wrote:

DOMString is defined at www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 thus

Type Definition DOMString A DOMString is a sequence of 16-bit units.

so how would round tripping a JS string through a DOM string work?

Because the DOM spec says "Applications must encode DOMString using UTF-16 (defined in [Unicode] and Amendment 1 of [ISO/IEC 10646])", it must continue to do this.

Values returned as DOM strings would be (21-bit char enhanced) ES strings where each string character contained a 16-bit UTF-16 code unit. Just like they do now. Processing of such strings would have to do explicit surrogate pair processing just like they do now. However, such a string could be converted to a non-UTF-16 encoded string by explicit user code or via a new built-in function such as: String.UTF16Decode(aDOMStringValue)

For passing strings from ES to a DOMString we have to do the inverse conversions. If explicit decoding was done as suggested above then explicit UTF-16 encoding probably should be done. But note that the internal representation of the string is likely to know if an actual string contains any characters with codepoints > \uffff. It may be reasonable to assume that strings without such characters are already DOMString encoded but that strings with such characters should be automatically UTF-16 encoded when they are passed as DOMString values.
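For illustration, a minimal sketch of what such a String.UTF16Decode might do, assuming the strawman's String.fromCodepoint (lone surrogates are simply passed through here; other error handling is possible):

    // Collapse well-formed surrogate pairs in a string of 16-bit elements
    // into single 21-bit elements.
    function UTF16Decode(aDOMStringValue) {
      var out = "";
      for (var i = 0; i < aDOMStringValue.length; i++) {
        var lead = aDOMStringValue.charCodeAt(i);
        var trail = (i + 1 < aDOMStringValue.length) ? aDOMStringValue.charCodeAt(i + 1) : 0;
        if (lead >= 0xD800 && lead <= 0xDBFF && trail >= 0xDC00 && trail <= 0xDFFF) {
          out += String.fromCodepoint(0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00));
          i++; // consume the trail unit
        } else {
          out += String.fromCodepoint(lead); // BMP element or lone surrogate
        }
      }
      return out;
    }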

How would

var oneSupplemental = "\U00010000";

I don't think I understand your literal notation. \U is a 32-bit character value? In whose implementation?

alert(oneSupplemental.length); // alerts 1

I'll take your word for this

var utf16Encoded = encodeUTF16(oneSupplemental); alert(utf16Encoded.length); // alerts 2

yes

var textNode = document.createTextNode(utf16Encoded); alert(textNode.nodeValue.length); // alerts ?

2

Does the DOM need to represent utf16Encoded internally so that it can report 2 as the length on fetch of nodeValue?

However the DOM represents DOMString values internally, to conform to the DOM spec it must act as if it is representing them using UTF-16.

If so, how can it represent that for systems that use a UTF-16 internal representation for DOMString?

Let me know if I haven't already answered this.

# Mike Samuel (14 years ago)

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

On May 16, 2011, at 12:28 PM, Mike Samuel wrote:

DOMString is defined at www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 thus

Type Definition DOMString    A DOMString is a sequence of 16-bit units.

so how would round tripping a JS string through a DOM string work?

Because the DOM spec says "Applications must encode DOMString using UTF-16 (defined in [Unicode] and Amendment 1 of [ISO/IEC 10646])", it must continue to do this. Values returned as DOM strings would be (21-bit char enhanced) ES strings where each string character contained a 16-bit UTF-16 code unit. Just like they do now. Processing of such strings would have to do explicit surrogate pair processing just like they do now. However, such a string could be converted to a non-UTF-16 encoded string by explicit user code or via a new built-in function such as: String.UTF16Decode(aDOMStringValue) For passing strings from ES to a DOMString we have to do the inverse conversions. If explicit decoding was done as suggested above then explicit UTF-16 encoding probably should be done. But note that the internal representation of the string is likely to know if an actual string contains any characters with codepoints > \uffff. It may be reasonable to assume that strings without such characters are already DOMString encoded but that strings with such characters should be automatically UTF-16 encoded when they are passed as DOMString values.

How would

var oneSupplemental = "\U00010000";

I don't think I understand your literal notation. \U is a 32-bit character value? In whose implementation?

Sorry, please read this as var oneSupplemental = String.fromCharCode(0x10000);

alert(oneSupplemental.length);  //  alerts 1

I'll take your word for this

If I understand, a string containing the single codepoint U+10000 should have length 1.

"The length of a String is the number of elements (i.e.,

16-bit\b\b\b\b\b\b 21-bit values) within it."

var utf16Encoded = encodeUTF16(oneSupplemental);    alert(utf16Encoded.length);  //  alerts 2

yes

var textNode = document.createTextNode(utf16Encoded);    alert(textNode.nodeValue.length);   // alerts ?

2

Does the DOM need to represent utf16Encoded internally so that it can report 2 as the length on fetch of nodeValue?

However the DOM represents DOMString values internally, to conform to the DOM spec it must act as if it is representing them using UTF-16.

Ok. This seems to present two options: (1) Break the internet by binding DOMStrings to a JavaScript host type and not to the JavaScript string type. (2) DOMStrings never contain supplemental codepoints.

So for

var roundTripped = document.createTextNode(oneSupplemental).nodeValue

either

typeof roundTripped !== "string"

or

roundTripped.length != oneSupplemental.length

If so, how can it represent that for systems that use a UTF-16 internal representation for DOMString?

Let me know if I haven't already answered this.

You might have. If you reject my assertion about option 2 above, then to clarify, The UTF-16 representation of codepoint U+10000 is the code-unit pair U+D8000 U+DC000. The UTF-16 representation of codepoint U+D8000 is the single code-unit U+D8000 and similarly for U+DC00.

How can the codepoints U+D800 U+DC00 be distinguished in a DOMString implementation that uses UTF-16 under the hood from the codepoint U+10000?

# Wes Garland (14 years ago)

Allen;

Thanks for putting this together. We use Unicode data extensively in both our web and server-side applications, and being forced to deal with UTF-16 surrogate pairs directly -- rather than letting the String implementation deal with them -- is a constant source of mild pain. At first blush, this proposal looks like it meets all my needs, and my gut tells me the perf impacts will probably be neutral or good.

Two great things about strings composed of Unicode code points:

  1. .length represents the number of code points, rather than the number of pairs used in UTF-16, even if the underlying representation isn't UTF-16
  2. S.charCodeAt(S.indexOf(X)) always returns the same kind of information (a Unicode code point), regardless of whether X is in the BMP or not

Even though this is a breaking change from ES-5, I support it whole-heartedly.... but I expect breakage to be very limited. Provided that the implementation does not restrict the storage of reserved code points (D800-DFFF), it should be possible for users using String as immutable C-arrays to keep doing so. Users doing surrogate pair decomposition will probably find that their code "just works", as those code points will never appear in legitimate strings of Unicode code points. Users creating Strings with surrogate pairs will need to re-tool, but this is a small burden and these users will be at the upper strata of Unicode-foodom. I suspect that 99.99% of users will find that this change will fix bugs in their code when dealing with non-BMP characters.

Mike Samuel, there would never be a supplemental code unit to match, as the return value of [[Get]] would be a code point.

Shawn Steele, I don't understand this comment:

Also, the “trick” I think, is encoding to surrogate pairs (illegally, since UTF8 doesn’t allow that) vs decoding to UTF16.

Why do we care about the UTF-16 representation of particular codepoints? Why can't the new functions just encode the Unicode string as UTF-8 and URI escape it?

Mike Samuel, can you explain why you are en/decoding UTF-16 when round-tripping through the DOM? Does the DOM specify UTF-16 encoding? If it does, that's silly. Both ES and DOM should specify "Unicode" and let the data interchange format be an implementation detail. It is an unfortunate accident of history that UTF-16 surrogate pairs leak their abstraction into ES Strings, and I believe it is high time we fixed that.

# Boris Zbarsky (14 years ago)

On 5/16/11 4:37 PM, Mike Samuel wrote:

You might have. If you reject my assertion about option 2 above, then to clarify, The UTF-16 representation of codepoint U+10000 is the code-unit pair U+D8000 U+DC000.

No. The UTF-16 representation of codepoint U+10000 is the code-unit pair 0xD800 0xDC00. These are 16-bit unsigned integers, NOT Unicode characters (which is what the U+NNNNN notation means).

The UTF-16 representation of codepoint U+D8000 is the single code-unit U+D8000 and similarly for U+DC00.

I'm assuming you meant U+D800 in the first two code-units there.

There is no Unicode codepoint U+D800 or U+DC00. See www.unicode.org/charts/PDF/UD800.pdf and www.unicode.org/charts/PDF/UDC00.pdf which clearly say that there are no Unicode characters with those codepoints.

How can the codepoints U+D800 U+DC00 be distinguished in a DOMString implementation that uses UTF-16 under the hood from the codepoint U+10000?

They don't have to be; if 0xD800 0xDC00 are present (in that order) then they encode U+10000. If they're present on their own, it's not a valid UTF-16 string, hence not a valid DOMString and some sort of error-handling behavior (which presumably needs defining) needs to take place.

That said, defining JS strings and DOMString differently seems like a recipe for serious author confusion (e.g. actually using JS strings as the DOMString binding in ES might be lossy, assigning from JS strings to DOMString might be lossy, etc). It's a minefield.

# Mike Samuel (14 years ago)

2011/5/16 Wes Garland <wes at page.ca>:

Mike Samuel, can you explain why you are en/decoding UTF-16 when round-tripping through the DOM?

I was UTF-16 encoding it because there will be host objects in browsers that assume a UTF-16 encoding and so a possibility for orphaned surrogates in internal representations based on UTF-16.

I was wondering how those strings round trip across host object boundaries. When a programmer assigns a string to a property, they expect a string with the same length to come out. When it doesn't, hilarity ensues.

Does the DOM specify UTF-16 encoding?

Yes. www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 says

Type Definition DOMString A DOMString is a sequence of 16-bit units.

If it does, that's silly.

Yes, it is. It is also a published standard assumed by a lot of existing code.

Both ES and DOM should specify "Unicode" and let the data interchange format be an implementation detail.

It is an unfortunate accident of history that UTF-16 surrogate pairs leak their abstraction into ES Strings, and I believe it is high time we fixed that.

I agree, but there's a coordination problem. TC39 can't redefine DOMString on their own. The DOMString definition is not maintained by ECMA, and changes to it affect bindings for Java and C#, which use UTF-16, and languages like Python, which is technically agnostic but is normally compiled to treat a Unicode string as a sequence of UTF-16 code-units.

# Mike Samuel (14 years ago)

2011/5/16 Boris Zbarsky <bzbarsky at mit.edu>:

On 5/16/11 4:37 PM, Mike Samuel wrote:

You might have.  If you reject my assertion about option 2 above, then to clarify, The UTF-16 representation of codepoint U+10000 is the code-unit pair U+D8000 U+DC000.

No.  The UTF-16 representation of codepoint U+10000 is the code-unit pair 0xD800 0xDC00.  These are 16-bit unsigned integers, NOT Unicode characters (which is what the U+NNNNN notation means).

My apologies for abusing notation.

The UTF-16 representation of codepoint U+D8000 is the single code-unit U+D8000 and similarly for U+DC00.

I'm assuming you meant U+D800 in the first two code-units there.

yes

There is no Unicode codepoint U+D800 or U+DC00.  See www.unicode.org/charts/PDF/UD800.pdf and www.unicode.org/charts/PDF/UDC00.pdf which clearly say that there are no Unicode characters with those codepoints.

Correct. The strawman says

"The String type is the set of all finite ordered sequences of zero or more 21-bit unsigned integer values (“elements”)."

There is no exclusion for invalid code-points, so I was assuming when Allen talked about an encodeUTF16 function that he was purposely fuzzing the term "codepoint" to include the entire range, and that encodeUTF16(oneSupplemental).charCodeAt(0) === 0xd800.

How can the codepoints U+D800 U+DC00 be distinguished in a DOMString implementation that uses UTF-16 under the hood from the codepoint U+10000?

They don't have to be; if 0xD800 0xDC00 are present (in that order) then they encode U+10000.  If they're present on their own, it's not a valid UTF-16 string, hence not a valid DOMString and some sort of error-handling behavior (which presumably needs defining) needs to take place.

That said, defining JS strings and DOMString differently seems like a recipe for serious author confusion (e.g. actually using JS strings as the DOMString binding in ES might be lossy, assigning from JS strings to DOMString might be lossy, etc).  It's a minefield.

Agreed. It is a minefield and one that could benefit from treatment in the strawman.

# Mark Davis ☕ (14 years ago)

I'm quite sympathetic to the goal, but the proposal does represent a significant breaking change. The problem, as Shawn points out, is with indexing. Before, the strings were defined as UTF16.

Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now, the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a' would be at offset 1. This will definitely cause breakage in existing code; characters are in different positions than they were, even characters that are not supplemental ones. All it takes is one supplemental character before the current position and the offsets will be off for the rest of the string.
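To make the shift concrete (illustrative only; the values on the right assume the supplemental character reaches script as a single element, and the \u{...} form is the notation used above, not current syntax):

    var s = "\ud800\udc00\u0061";  // one supplemental character, then 'a'
    s.length;                      // 3 today (UTF-16 code units)
    s.indexOf("\u0061");           // 2 today
    // Under the proposal, the same text held as "\u{10000}\u{61}" would give
    // length === 2 and indexOf("\u0061") === 1.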

Faced with exactly the same problem, Java took a different approach that allows for handling of the full range of Unicode characters, but maintains backwards compatibility. It may be instructive to look at what they did (although there was definitely room for improvement in their approach!). I can follow up with that if people are interested. Alternatively, perhaps mechanisms can put in place to tell ECMAScript to use new vs old indexing (Perl uses PRAGMAs for that kind of thing, for example), although that has its own ugliness.

Mark

— Il meglio è l’inimico del bene —

# Mike Samuel (14 years ago)

Allen, could you clarify something.

When the strawman says without mentioning "codepoint"

"The String type is the set of all finite ordered sequences of zero or more 16-bit\b\b\b\b\b\b 21-bit unsigned integer values (“elements”)."

does that mean that String.charCodeAt(...) can return any value in the range [0, 1 << 21)?

When the strawman says using "codepoint"

"SourceCharacter :: any Unicode codepoint"

that excludes the blocks reserved for surrogates?

# Shawn Steele (14 years ago)

I'm having some (ok, a great deal of) confusion between the DOM Encoding and the JavaScript encoding and whatever. I'd assumed that if I had a web page in some encoding, that it was converted to UTF-16 (well, UCS-2), and that's what the JavaScript engine did its work on. I confess to not having done much encoding stuff in JS in the last decade.

In UTF-8, individually encoded surrogates are illegal (and a security risk). Eg: you shouldn't be able to encode D800/DC00 as two 3 byte sequences, they should be a single 6 byte sequence. Having not played with the js encoding/decoding in quite some time, I'm not sure what they do in that case, but hopefully it isn't illegal UTF-8. (You also shouldn't be able to have half a surrogate pair in UTF-16, but many things are pretty lax about that.)

-Shawn


# Jungshik Shin (신정식, 申政湜) (14 years ago)

On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ <mark at macchiato.com> wrote:

I'm quite sympathetic to the goal, but the proposal does represent a significant breaking change. The problem, as Shawn points out, is with indexing. Before, the strings were defined as UTF16.

I agree with what Mark wrote, except that the previous spec used UCS-2, which this proposal (and other proposals on the issue) tries to rectify. I think that taking Java's approach would work better with DOMString as well.

See the W3C I18N WG's proposal on the issue (www.w3.org/International/wiki/JavaScriptInternationalization) and Java's approach (java.sun.com/developer/technicalArticles/Intl/Supplementary, linked there).

Jungshik

# Shawn Steele (14 years ago)

I think the problem isn’t so much that the spec used UCS-2, but rather that some implementations used UTF-16 instead as that is more convenient in many cases. To the application developer, it’s difficult to tell the difference between UCS-2 and UTF-16 if I can use a regular expression to find D800, DC00. Indeed, when the rendering engine of whatever host is going to display the glyph for U+10000, it’d be hard to notice the subtlety of UCS-2 vs UTF-16.

-Shawn


# Boris Zbarsky (14 years ago)

On 5/16/11 4:38 PM, Wes Garland wrote:

Two great things about strings composed of Unicode code points: ... Even though this is a breaking change from ES-5, I support it whole-heartedly.... but I expect breakage to be very limited. Provided that the implementation does not restrict the storage of reserved code points (D800-DFFF)

Those aren't code points at all. They're just not Unicode.

If you allow storage of such, then you're allowing mixing Unicode strings and "something else" (whatever the something else is), with most likely bad results.

Most simply, assigning a DOMString containing surrogates to a JS string should collapse the surrogate pairs into the corresponding codepoint if JS strings really contain codepoints...

The only way to make this work is if either DOMString is redefined or DOMString and full Unicode strings are different kinds of objects.

Users doing surrogate pair decomposition will probably find that their code "just works"

How, exactly?

Users creating Strings with surrogate pairs will need to re-tool

Such users would include the DOM, right?

but this is a small burden and these users will be at the upper strata of Unicode-foodom.

You're talking every single web developer here. Or at least every single web developer who wants to work with Devanagari text.

I suspect that 99.99% of users will find that this change will fix bugs in their code when dealing with non-BMP characters.

Not unless DOMString is changed or the interaction between the two very carefully defined in failure-proof ways.

Why do we care about the UTF-16 representation of particular codepoints?

Because of DOMString's use of UTF-16, at least (forced on it by the fact that that's what ES used to do, but here we are).

Mike Samuel, can you explain why you are en/decoding UTF-16 when round-tripping through the DOM? Does the DOM specify UTF-16 encoding?

Yes.

If it does, that's silly.

It needed to specify something, and UTF-16 was the thing that was compatible with how scripts work in ES. Not to mention the Java legacy of the DOM...

Both ES and DOM should specify "Unicode" and let the data interchange format be an implementation detail.

That's fine if both are changed. Changing just one without the other would just cause problems.

It is an unfortunate accident of history that UTF-16 surrogate pairs leak their abstraction into ES Strings, and I believe it is high time we fixed that.

If you can do that without breaking web pages, great. If not, then we need to talk. ;)

# Mark Davis ☕ (14 years ago)

In terms of implementation capabilities, there isn't really a significant practical difference between

  • a UCS-2 implementation, and
  • a UTF-16 implementation that doesn't have supplemental characters in its supported repertoire.

Mark

— Il meglio è l’inimico del bene —

# Boris Zbarsky (14 years ago)

On 5/16/11 5:16 PM, Mike Samuel wrote:

The strawman says

"The String type is the set of all finite ordered sequences of zero or more 21-bit unsigned integer values (“elements”)."

Yeah, that's not the same thing as an actual Unicode string, and requires handling of all sorts of "what if someone sticks non-Unicode in there?" issues...

Of course people actually do use JS strings as immutable arrays of 16-bit unsigned integers right now (not just as byte arrays), so I suspect that we can't easily exclude the surrogate ranges from "strings" without breaking existing content...
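For example, this is legal today and nothing requires the values to be valid Unicode text:

    // A JS string used as an immutable array of arbitrary 16-bit values,
    // including a lone surrogate value.
    var packed = String.fromCharCode(0xDEAD, 0xBEEF, 0xD800);
    packed.length === 3;              // true
    packed.charCodeAt(2) === 0xD800;  // true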

# Shawn Steele (14 years ago)

I’d go further and also say there isn’t really a very big practical difference between:

  • A UCS-2 implementation whose data is rendered by a completely Unicode-aware rendering engine, and
  • A UTF-16 implementation.

In fact I’m unaware of any UCS-2/UTF-16 conversion functionality that causes D800-DFFF to throw an error or change to U+FFFD; most just blindly pass along the input, pretending they’re the same, or at least “close enough.”

-Shawn


# Boris Zbarsky (14 years ago)

On 5/16/11 5:23 PM, Shawn Steele wrote:

I’m having some (ok, a great deal of) confusion between the DOM Encoding and the JavaScript encoding and whatever. I’d assumed that if I had a web page in some encoding, that it was converted to UTF-16 (well, UCS-2), and that’s what the JavaScript engine did its work on.

JS strings are currently defined as arrays of 16-bit unsigned integers. I believe the intent at the time was that these could represent actual Unicode strings encoded as UCS-2, but they can also represent arbitrary arrays of 16-bit unsigned integers.

The DOM just uses JS strings for DOMString and defines DOMString to be UTF-16. That's not quite compatible with UCS-2, but....

JS strings can contain integers that correspond to UTF-16 surrogates. There are no constraints in what comes before/after them in JS strings.

In UTF-8, individually encoded surrogates are illegal (and a security risk). Eg: you shouldn’t be able to encode D800/DC00 as two 3 byte sequences, they should be a single 6 byte sequence

A single 4 byte sequence, actually, last I checked.
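In concrete bytes, using U+10000 as the example: correct UTF-8 is the single four-byte sequence F0 90 80 80, while encoding each surrogate separately (the invalid, CESU-8-style form) would give ED A0 80 followed by ED B0 80. The ES5 URI functions already reflect this, e.g.:

    encodeURIComponent("\ud800\udc00"); // "%F0%90%80%80": the pair becomes one
                                        // 4-byte UTF-8 sequence
    encodeURIComponent("\ud800");       // throws URIError: a lone surrogate has
                                        // no legal UTF-8 form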

Having not played with the js encoding/decoding in quite some time, I’m not sure what they do in that case, but hopefully it isn’t illegal UTF-8.

I'm not sure which "they" and under what conditions we're considering here.

(You also shouldn’t be able to have half a surrogate pair in UTF-16, but many things are pretty lax about that.)

Verily.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 1:37 PM, Mike Samuel wrote:

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

...

How would

var oneSupplemental = "\U00010000";

I don't think I understand your literal notation. \U is a 32-bit character value? In whose implementation?

Sorry, please read this as var oneSupplemental = String.fromCharCode(0x10000);

In my proposal you would have to say String.fromCodepoint(0x10000);

In ES5 String.fromCharCode(0x10000) produced the same string as "\0". That remains the case in my proposal.

alert(oneSupplemental.length); // alerts 1

I'll take your word for this

If I understand, a string containing the single codepoint U+10000 should have length 1.

"The length of a String is the number of elements (i.e., 16-bit\b\b\b\b\b\b 21-bit values) within it."

yes, it's 1. My gruff comment was in reference to not being sure of your literal notation

var utf16Encoded = encodeUTF16(oneSupplemental); alert(utf16Encoded.length); // alerts 2

yes

var textNode = document.createTextNode(utf16Encoded); alert(textNode.nodeValue.length); // alerts ?

2

Does the DOM need to represent utf16Encoded internally so that it can report 2 as the length on fetch of nodeValue?

However the DOM represents DOMString values internally, to conform to the DOM spec it must act as if it is representing them using UTF-16.

Ok. This seems to present two options: (1) Break the internet by binding DOMStrings to a JavaScript host type and not to the JavaScript string type.

Not sure why this would break the internet. At the implementation level, a key point of my proposal is that implementations can (and even today some do) have multiple different internal representations for strings. These internal representation differences simply are not exposed to the JS program, except possibly in terms of measurable performance differences.

(2) DOMStrings never contain supplemental codepoints.

that's how DOMStrings are currently defined and I'm not proposing to change this. Adding full unicode DOMStrings to the DOM spec. seems like a task for W3C.

So for

var roundTripped = document.createTextNode(oneSupplemental).nodeValue

either

typeof roundTripped !== "string"

Not really, I'm perfectly happy to allow the DOM to continue to report the type of DOMString as 'string'. It's no different from a user constructed string that may or may not contain a UTF-16 character sequence depending upon what the user code does.

or

roundTripped.length != oneSupplemental.length

Yes, this may be the case but only for new code that explicitly builds oneSupplemental to contain a supplemental character using \u+xxxxxx or String.fromCodepoint or some other new function. All existing valid code only produces strings with codepoints limited to 0xffff.

If so, how can it represent that for systems that use a UTF-16 internal representation for DOMString?

Let me know if I haven't already answered this.

You might have. If you reject my assertion about option 2 above, then to clarify, The UTF-16 representation of codepoint U+10000 is the code-unit pair U+D8000 U+DC000. The UTF-16 representation of codepoint U+D8000 is the single code-unit U+D8000 and similarly for U+DC00.

How can the codepoints U+D800 U+DC00 be distinguished in a DOMString implementation that uses UTF-16 under the hood from the codepoint U+10000?

I think you have an extra 0 at a couple of places above...

A DOMString is defined by the DOM spec to consist of 16-bit elements that are to be interpreted as a UTF-16 encoding of Unicode characters. It doesn't matter what implementation level representation is used for the string, the indexable positions within a DOMString are restricted to 16-bit values. At the representation level each position could even be represented by a 32-bit cell and it doesn't matter. To be a valid DOMString, element values must be in the range 0-0xffff.

I think you are unnecessarily mixing up the string semantics defined by the language, encodings that might be used in implementing the semantics, and application level processing of those strings.

To simplify things just think of an ES string as if it was an array each element of which could contain an arbitrary integer value. If we have such an array like [0xd800, 0xdc00], at the language semantics level this is a two element array containing two specific values. At the language implementation level there are all sorts of representations that might be used; maybe the implementation Huffman encodes the elements... How the application processes that array is completely up to the application. It may treat the array simply as two integer values. It may treat each element as a 21-bit value encoding a Unicode codepoint and logically consider the array to be a Unicode string of length 2. It may consider each element to be a 16-bit value and that sequences of values are interpreted as UTF-16 string encodings. In that case, it could consider it to represent a string of logical length 1.

This is no different from what people do today with 16-bit char JS strings. Many people just treat them as strings of BMP characters and ignore the possibility of supplemental characters or UTF-16 encodings. Other people (particularly when dealing with DOMStrings) treat strings as code units of a UTF-16 encoding. They need to use more complex string processing algorithms to deal with logical Unicode characters.
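A sketch of the two readings described above, runnable against today's strings (the helper name is made up for illustration):

    var s = String.fromCharCode(0xD800, 0xDC00); // two elements
    s.length === 2;                              // raw view: two integer values

    // UTF-16 view: a well-formed lead+trail pair counts as one logical character.
    function utf16LogicalLength(str) {
      var n = 0;
      for (var i = 0; i < str.length; i++, n++) {
        var c = str.charCodeAt(i);
        var next = (i + 1 < str.length) ? str.charCodeAt(i + 1) : 0;
        if (c >= 0xD800 && c <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
          i++; // the pair is a single logical character
        }
      }
      return n;
    }
    utf16LogicalLength(s) === 1; // UTF-16 view: one character, U+10000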

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 1:38 PM, Wes Garland wrote:

Allen;

Thanks for putting this together. We use Unicode data extensively in both our web and server-side applications, and being forced to deal with UTF-16 surrogate pairs directly -- rather than letting the String implementation deal with them -- is a constant source of mild pain. At first blush, this proposal looks like it meets all my needs, and my gut tells me the perf impacts will probably be neutral or good.

Two great things about strings composed of Unicode code points:

  1. .length represents the number of code points, rather than the number of UTF-16 code units, even if the underlying representation isn't UTF-16
  2. S.charCodeAt(S.indexOf(X)) always returns the same kind of information (a Unicode code point), regardless of whether X is in the BMP or not

Even though this is a breaking change from ES-5, I support it whole-heartedly.... but I expect breakage to be very limited. Provided that the implementation does not restrict the storage of reserved code points (D800-DFFF), it should be possible for users using String as immutable C-arrays to keep doing so. Users doing surrogate pair decomposition will probably find that their code "just works", as those code points will never appear in legitimate strings of Unicode code points. Users creating Strings with surrogate pairs will need to re-tool, but this is a small burden and these users will be at the upper strata of Unicode-foodom. I suspect that 99.99% of users will find that this change will fix bugs in their code when dealing with non-BMP characters.

Thanks, this is exactly my thinking on the subject.

# Mark Davis ☕ (14 years ago)

A correction.

U+D800 is indeed a code point: www.unicode.org/glossary/#Code_Point. It is defined for usage in Unicode Strings (see www.unicode.org/glossary/#Unicode_String) because often it is useful for implementations to be able to allow it in processing.

It does, however, have a special status, and is not representable in well-formed UTF-xx, for general interchange.

A quick note on the intro to the doc, with a bit more history.

ECMAScript currently only directly supports the 16-bit basic multilingual plane (BMP) subset of Unicode which is all that existed when ECMAScript was first designed. Since then Unicode has been extended to require up to 21-bits per code.

  1. Unicode was extended up to 10FFFF in version 2.0, in July of 1996.
  2. ECMAScript, according to Wikipedia, was first issued in 1997. So actually for all of ECMAScript's existence, it has been obsolete in its usage of Unicode.
    • (It isn't quite as bad as that, since we made provision for supplementary characters early-on, but the first actual supplementary characters appeared in 2003.)
  3. In 2003, Markus Scherer proposed support for Unicode in ECMAScript v4:
    1. sites.google.com/site/markusicu/unicode/es/unicode-2003
    2. sites.google.com/site/markusicu/unicode/es/i18n-2003

Mark

— Il meglio è l’inimico del bene —

# Mark Davis ☕ (14 years ago)

In practice, the supplemental code points don't really cause problems in Unicode strings. Most implementations just treat them as if they were unassigned. The only important issue is that when they are converted to UTF-xx for storage or transmission, they need to be handled; typically by converting to FFFD (never just deleted - a bad idea for security).
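
A rough sketch of that kind of handling, assuming a hypothetical pre-interchange step that walks the 16-bit elements and replaces unpaired surrogates with U+FFFD rather than deleting them:

function replaceLoneSurrogates(s) {
  var out = "";
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF &&
        i + 1 < s.length &&
        s.charCodeAt(i + 1) >= 0xDC00 && s.charCodeAt(i + 1) <= 0xDFFF) {
      out += s.charAt(i) + s.charAt(i + 1);   // well-formed pair: keep as-is
      i++;
    } else if (c >= 0xD800 && c <= 0xDFFF) {
      out += "\uFFFD";                        // unpaired surrogate: replace, never delete
    } else {
      out += s.charAt(i);
    }
  }
  return out;
}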

Mark

— Il meglio è l’inimico del bene —

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 2:16 PM, Mike Samuel wrote:

2011/5/16 Boris Zbarsky <bzbarsky at mit.edu>:

On 5/16/11 4:37 PM, Mike Samuel wrote:

There is no Unicode codepoint U+D800 or U+DC00. See www.unicode.org/charts/PDF/UD800.pdf and www.unicode.org/charts/PDF/UDC00.pdf which clearly say that there are no Unicode characters with those codepoints.

Correct. The strawman says

"The String type is the set of all finite ordered sequences of zero or more 21-bit unsigned integer values (“elements”)."

There is no exclusion for invalid code-points, so I was assuming when Allen talked about an encodeUTF16 function that he was purposely fuzzing the term "codepoint" to include the entire range, and that encodeUTF16(oneSupplemental).charCodeAt(0) === 0xd800.

Correct: in my proposal, ES string elements are 21-bit values. All possible values are usable even though some are not valid Unicode code points. We may not have a clear common language yet for referring to such element values. In current ES we call them "character codes", but we need to be careful about moving that terminology forward because it occurs in APIs that depend upon character codes being 16-bit values.

encodeUTF16 is a Unicode domain specific function. It would need to define what it does when encountering a "character code" that is not a valid codepoint.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 2:19 PM, Mark Davis ☕ wrote:

I'm quite sympathetic to the goal, but the proposal does represent a significant breaking change. The problem, as Shawn points out, is with indexing. Before, the strings were defined as UTF16.

Not by the ECMAScript specification

Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now, the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a' would be at offset 1.

If the string is written as "\ud800\udc00\u0061" the 'a' will be at offset 2, even in the new proposal. It would only be at offset 1 if it were written as "\u+010000\u+000061" (using the literal notation from the proposal).

This will definitely cause breakage in existing code;

How does this break existing code? Existing code cannot say "\u+010000\u+000061". As I've pointed out elsewhere on this thread, existing libraries that do UTF-16 encoding/decoding must continue to do so even under this new proposal.
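
To make the offsets concrete, a sketch; the first form runs today, while the \u+ form is the proposal's notation and is shown only in comments since no current engine accepts it:

var s = "\ud800\udc00\u0061";
s.length;         // 3 -- elements 0xD800, 0xDC00, 0x61, today and under the proposal
s.indexOf("a");   // 2

// Hypothetical, using the proposal's literal notation:
//   var t = "\u+010000\u+000061";
//   t.length;        // 2 -- one supplemental character plus 'a'
//   t.indexOf("a");  // 1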

# Shawn Steele (14 years ago)

The problem is that “\UD800\UDC00” === “\U+010000”. And if the internal representation is UTF-32, then they’d have to continue to be the same. And it’s really hard for them to have the same length if one’s 2 code points and the other’s 1 code point.

-Shawn

From: es-discuss-bounces at mozilla.org [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Allen Wirfs-Brock Sent: Monday, May 16, 2011 3:18 PM To: Mark Davis ☕ Cc: Markus Scherer; es-discuss at mozilla.org Subject: Re: Full Unicode strings strawman

On May 16, 2011, at 2:19 PM, Mark Davis ☕ wrote:

I'm quite sympathetic to the goal, but the proposal does represent a significant breaking change. The problem, as Shawn points out, is with indexing. Before, the strings were defined as UTF16.

Not by the ECMAScript specification

Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now, the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a' would be at offset 1.

If the string is written as "\ud800\udc00\u0061" the 'a' will be at offset 2, even in the new proposal. It would only be at offset 1 if it were written as "\u+010000\u+000061" (using the literal notation from the proposal).

This will definitely cause breakage in existing code;

How does this break existing code? Existing code cannot say "\u+010000\u+000061". As I've pointed out elsewhere on this thread, existing libraries that do UTF-16 encoding/decoding must continue to do so even under this new proposal.

characters are in different positions than they were, even characters that are not supplemental ones. All it takes is one supplemental character before the current position and the offsets will be off for the rest of the string.

# Gillam, Richard (14 years ago)

Allen--

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.

Feed back would be appreciated:

strawman:support_full_unicode_in_strings

I was actually on the committee when the language you're proposing to change was adopted and, in fact, I think I actually proposed that wording.

The intent behind the original wording was to extend the standard back then in ES3 to allow the use of the full range of Unicode characters, and to do it in more or less the same way that Java had done it: While the actual choice of an internal string representation would be left up to the implementer, all public interfaces (where it made a difference) would behave exactly as if the internal representation was UTF-16. In particular, you would represent supplementary-plane characters with two \u escape sequences representing a surrogate pair, and interfaces that assigned numeric indexes to characters in strings would do so based on the UTF-16 representation of the string-- a supplementary-plane character would take up two character positions in the string.

I don't have a problem with introducing a new escaping syntax for supplementary-plane characters, but I really don't think you want to go messing with string indexing. It'd be a breaking change for existing implementations. I don't think it actually matters if the existing implementation "supported" UTF-16 or not-- if you read string content that included surrogate pairs from some external source, I doubt anything in the JavaScript implementation was filtering out the surrogate pairs because the implementation "only supported UCS-2". And most things would have worked fine. But the characters would be numbered according to their UTF-16 representation.

If you want to introduce new APIs that index things according to the UTF-32 representation, that'd be okay, but it's more of a burden for implementations that use UTF-16 for their internal representation, and we optimized for that on the assumption that it was the most common choice.

Defining String.fromCharCode() to build a string based on an abstract Unicode code point value might be okay (although it might be better to make that a new function), but when presented with a code point value above 0xFFFF, it'd produce a string of length 2-- the length of the UTF-16 representation. String.charCodeAt() was always defined, and should continue to be defined, based on the UTF-16 representation. If you want to introduce a new API based on the UTF-32 representation, fine.
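
For reference, a sketch of the UTF-16 mapping such a function would have to apply for code points above 0xFFFF; this is just the standard surrogate-pair arithmetic, not an API from the strawman:

function toUTF16(codePoint) {
  if (codePoint <= 0xFFFF) {
    return String.fromCharCode(codePoint);
  }
  var offset = codePoint - 0x10000;
  return String.fromCharCode(0xD800 + (offset >> 10),    // high (lead) surrogate
                             0xDC00 + (offset & 0x3FF)); // low (trail) surrogate
}

toUTF16(0x1D11E).length;   // 2 -- the length of the UTF-16 representation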

I'd also recommend against flogging the 21-bit thing so heavily-- the 21-bit thing is sort of an accident of history, and not all 21-bit values are legal Unicode code point values either. I'd use "32" for the longer forms of things.

I think it's fine to have everything work in terms of abstract Unicode code points, but I don't think you can ignore the backward-compatibility issues with character indexing in the current API.

--Rich Gillam Lab126

# Allen Wirfs-Brock (14 years ago)

See the section of the proposal about String.prototype.charCodeAt

On May 16, 2011, at 2:20 PM, Mike Samuel wrote:

Allen, could you clarify something.

When the strawman says without mentioning "codepoint"

"The String type is the set of all finite ordered sequences of zero or more 16-bit\b\b\b\b\b\b 21-bit unsigned integer values (“elements”)."

does that mean that String.charCodeAt(...) can return any value in the range [0, 1 << 21)?

When the strawman says using "codepoint"

"SourceCharacter :: any Unicode codepoint"

that excludes the blocks reserved for surrogates?

Does the Unicode spec. refer to those surrogate codes as "codepoints"? My understanding is that it does not, but I could be wrong. My intent is that the answer is no.

Note that this section is defining the input alphabet of the grammar. It has nothing to do with the actual character encodings used for source programs. The production essentially says that the input alphabet of ECMAScript is all defined Unicode characters. The actual encoding of source programs (both external and internal) is up to the implementation and the host environment. (The string input to eval is an exception to this.)

# Mike Samuel (14 years ago)

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>: I think you have an extra 0 at a couple of  places above...

Yep. Sorry. The 0x10000 really is supposed to be five digits though.

A DOMString is defined by the DOM spec to consist of 16-bit elements that are to be interpreted as a UTF-16 encoding of Unicode characters. It doesn't matter what implementation-level representation is used for the string; the indexable positions within a DOMString are restricted to 16-bit

Really?

There is existing code out there that uses particular implementations for strings. Should the cost of migrating existing implementations be taken into account when considering this strawman?

values. At the representation level each position could even be represented by a 32-bit cell and it wouldn't matter. To be a valid DOMString, element values must be in the range 0-0xffff. I think you are unnecessarily mixing up the string semantics defined by the language, encodings that might be used in implementing the semantics, and application level processing of those strings.

To simplify things, just think of an ES string as if it were an array, each element of which could contain an arbitrary integer value. If we have such an array like [0xd800, 0xdc00], at the language semantics level this is a two-element array containing two well-specified values. At the language implementation level there are all sorts of representations that might be used; maybe the implementation Huffman encodes the elements... How the application processes that array is completely up to the application. It may treat the array simply as two integer values. It may treat each element as a 21-bit value encoding a Unicode codepoint and logically consider the array to be a Unicode string of length 2. It may consider each element to be a 16-bit value and interpret sequences of values as UTF-16 string encodings. In that case, it could consider it to represent a string of logical length 1.

I think we agree about the implementation/interface split.

If DOMString specifies the semantics of a result from

I'm not sure I understand the bit about how the semantics of DOMString could affect ES programs.

Is it the case that

document.createTextNode('\u+010000').length === 2
'\u+010000' === 1

or are you saying that when DOMStrings are exposed to ES code, ES gets to define the semantics of the "length" and "0" properties.

# Mark Davis ☕ (14 years ago)

Mark

— Il meglio è l’inimico del bene —

On Mon, May 16, 2011 at 15:27, Allen Wirfs-Brock <allen at wirfs-brock.com>wrote:

See the section of the proposal about String.prototype.charCodeAt

On May 16, 2011, at 2:20 PM, Mike Samuel wrote:

Allen, could you clarify something.

When the strawman says without mentioning "codepoint"

"The String type is the set of all finite ordered sequences of zero or more 16-bit\b\b\b\b\b\b 21-bit unsigned integer values (“elements”)."

does that mean that String.charCodeAt(...) can return any value in the range [0, 1 << 21)?

When the strawman says using "codepoint"

"SourceCharacter :: any Unicode codepoint"

that excludes the blocks reserved for surrogates?

Does the Unicode spec. refer to those surrogate codes as "codepoints"? My understanding is that it does not, but I could be wrong. My intent is that the answer is no.

Yes, it does. See my message, with a pointer to the Unicode glossary.

Note that this section is defining the input alphabet of the grammar. It has nothing to do with the actual character encodings used for source programs. The production essentially says that the input alphabet of ECMAScript is all defined Unicode characters.

all defined Unicode characters.

That would also not be correct. The defined characters are only about 109K (more if you consider private use); nowhere near the number of code points, because there are over 800K code points that are reserved for the allocation of future characters. For a breakdown, see www.unicode.org/versions/Unicode6.0.0/#Character_Additions

Sorry to seem picky, but we have found over time that you have to be very careful about the use of terms. The term "character" is especially fraught with ambiguities.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 2:23 PM, Shawn Steele wrote:

I’m having some (ok, a great deal of) confusion between the DOM encoding and the JavaScript encoding and whatever. I’d assumed that if I had a web page in some encoding, it was converted to UTF-16 (well, UCS-2), and that’s what the JavaScript engine did its work on. I confess to not having done much encoding stuff in JS in the last decade.

In UTF-8, individually encoded surrogates are illegal (and a security risk). Eg: you shouldn’t be able to encode D800/DC00 as two 3-byte sequences; they should be a single 4-byte sequence. Having not played with the js encoding/decoding in quite some time, I’m not sure what they do in that case, but hopefully it isn’t illegal UTF-8. (You also shouldn’t be able to have half a surrogate pair in UTF-16, but many things are pretty lax about that.)

I don't know as much about DOM behavior as I probably should, but the implication of the DOMString spec is that if HTML source code contains a text element with supplemental characters (using any encoding recognized by a browser) then if the text of that element is accessed from JavaScript it will see each supplemental character as two JavaScript characters that taken together are the UTF-16 encoding of the original supplemental character. I'm not proposing any changes in that regard.

There is a chicken-and-egg issue here. The DOM will never evolve to directly support non-UTF-16-encoded supplemental characters unless ECMAScript first provides such support. It may take 20 years to get there, but that clock won't even start until ECMAScript provides the necessary support.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 2:42 PM, Boris Zbarsky wrote:

On 5/16/11 4:38 PM, Wes Garland wrote:

Two great things about strings composed of Unicode code points: ... Even though this is a breaking change from ES-5, I support it whole-heartedly.... but I expect breakage to be very limited. Provided that the implementation does not restrict the storage of reserved code points (D800-DFFF)

Those aren't code points at all. They're just not Unicode.

If you allow storage of such, then you're allowing mixing Unicode strings and "something else" (whatever the something else is), with most likely bad results.

Most simply, assigning a DOMString containing surrogates to a JS string should collapse the surrogate pairs into the corresponding codepoint if JS strings really contain codepoints...

No, that would be a breaking change to the web!

The only way to make this work is if either DOMString is redefined or DOMString and full Unicode strings are different kinds of objects.

Not really, you need to make the distinction between what a String can contain and what String contents are valid in specific application domains.

DOMString seems to be quite clearly defined to consist of 16-bit valued elements interpreted as a UTF-16 encoded Unicode string.

All such DOMStrings are valid ES strings according to my proposal, but it isn't the case that all ES Strings are valid DOMStrings. To the depth of my understanding, this is already the case today with 16-bit ES characters. You can create an ES string which does not conform to the UTF-16 encoding rules.
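
For example, all of the following are legal today:

var lone = "\uD800";            // an unpaired high surrogate; length 1
var reversed = "\uDC00\uD800";  // a reversed pair; not a valid UTF-16 sequence
var mixed = "a" + lone + "b";   // still a perfectly usable 3-element ES string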

Users doing surrogate pair decomposition will probably find that their code "just works"

How, exactly?

Because the string will continue to contain surrogate pairs.

Users creating Strings with surrogate pairs will need to re-tool

Such users would include the DOM, right?

No. That would be a breaking change in the context of the browser. Programs creating surrogate pairs that want to be updated to not use surrogate pairs are the only ones that need to retool. More likely we are talking about new code that can be written without having to worry about surrogate pairs. If somebody wants to grab a bunch of text from the DOM and manipulate it without encountering surrogate pairs, they will need to explicitly perform a decodeUTF16 transformation.
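
A sketch of what such a transformation could do. decodeUTF16 here is a stand-in for the proposal's library function; since current strings cannot hold element values above 0xFFFF, this approximation returns an array of code point values instead:

function decodeUTF16ToCodePoints(s) {
  var out = [];
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
      var d = s.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) {
        out.push((c - 0xD800) * 0x400 + (d - 0xDC00) + 0x10000);  // collapse the pair
        i++;
        continue;
      }
    }
    out.push(c);   // BMP element or lone surrogate: passed through unchanged
  }
  return out;
}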

but this is a small burden and these users will be at the upper strata of Unicode-foodom.

You're talking every single web developer here. Or at least every single web developer who wants to work with Devanagari text.

No, they will probably always have a choice for their own internal processing. Deal with logically 16-bit characters that use UTF-16, or deal with logical 21-bit characters. Only when communicating with an external agent (for example the DOM) do you have to adapt to that agent's requirements.

I suspect that 99.99% of users will find that this change will fix bugs in their code when dealing with non-BMP characters.

Not unless DOMString is changed or the interaction between the two very carefully defined in failure-proof ways.

Why do we care about the UTF-16 representation of particular codepoints?

Because of DOMString's use of UTF-16, at least (forced on it by the fact that that's what ES used to do, but here we are).

Mike Samuel, can you explain why you are en/decoding UTF-16 when round-tripping through the DOM? Does the DOM specify UTF-16 encoding?

Yes.

If it does, that's silly.

It needed to specify something, and UTF-16 was the thing that was compatible with how scripts work in ES. Not to mention the Java legacy if the DOM...

Both ES and DOM should specify "Unicode" and let the data interchange format be an implementation detail.

That's fine if both are changed. Changing just one without the other would just cause problems.

Somebody has to go first. I'm saying that it has to be ES that goes first. ES can do this without breaking any existing web code.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 3:22 PM, Shawn Steele wrote:

The problem is that “\UD800\UDC00” === “\U+010000”. And if the internal representation is UTF-32, then they’d have to continue to be the same. And it’s really hard for them to have the same length if one’s 2 code points and the other’s 1 code point.

Not in my proposal! "\ud800\udc00" === "\u+010000" is false in my proposal. One has length 2 and one has length 1. You are confusing the logical interpretation of a UTF-16 encoded character sequence with the actual character encoding. To get an equality in the context of my proposal you would have to say something like:

UTF16Decode("\ud800\udc00") === "\u+010000" or "\ud800\udc00" === UTF16Encode("u+010000")

# Shawn Steele (14 years ago)

Not in my proposal! "\ud800\udc00"=== "\u+010000" is false in my proposal.

That’s exactly my problem. I think the engines (or at least the applications written in JavaScript) are still UTF-16-centric and that they’ll have d800, dc00 === 10000. For example, if they were different, then d800, dc00 should print �� instead of 𐀀, however I’m reasonably sure that any implementation would end up rendering it as 𐀀.

In other words I don’t think you can get the engine to be completely UTF-32. At least not without declaring a page as being UTF-32.

# Mike Samuel (14 years ago)

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

No. That would be a breaking change in the context of the browser. Programs creating surrogate pairs that want to be updated to not use surrogate pairs are the only ones that need to retool. More likely we are talking about new code that can be written without having to worry about surrogate pairs. If somebody wants to grab a bunch of text from the DOM and manipulate it without encountering surrogate pairs, they will need to explicitly perform a decodeUTF16 transformation.

Without this strawman, devs willing to put in the effort can use one mechanism to loop by codepoint. Devs who don't put in the effort don't get easy/correct codepoint iteration.

With this strawman, devs who care about supplemental codepoints have to call decodeUTF16 whenever they access a DOMString property. Devs who don't put in the effort get easy/correct codepoint iteration.

So this strawman will not provide a single way to iterate correctly by codepoint in apps that are not written with supplemental codepoints in mind.

Is that correct?

Knowing where to put decodeUTF16 calls is tough in the presence of reflective property access. Consider a bulk property copy

function copyProperties(properties, src, dest) {
  for (var i = 0, n = properties.length; i < n; ++i) {
    var k = properties[i];
    dest[k] = src[k];
  }
}

A DOM object can have custom properties that are not DOMStrings and regular properties that are. How would an application that wants to make sure all DOMStrings entering the program are properly decoded accomplish that if it uses an idiom like this?

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 3:33 PM, Mike Samuel wrote:

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

Really?

There is existing code out there that uses particular implementations for strings. Should the cost of migrating existing implementations be taken into account when considering this strawman?

If you mean existing ES implementations, then yes, this proposal will impact all existing implementations. So does everything else we add to the ES specification. In my proposal, I have a section that discusses why I think the actual implementation impact will generally not be as great as you might first imagine.

values. At the representation level each position could even be represented by a 32-bit cell and it wouldn't matter. To be a valid DOMString, element values must be in the range 0-0xffff. I think you are unnecessarily mixing up the string semantics defined by the language, encodings that might be used in implementing the semantics, and application level processing of those strings.

To simplify things, just think of an ES string as if it were an array, each element of which could contain an arbitrary integer value. If we have such an array like [0xd800, 0xdc00], at the language semantics level this is a two-element array containing two well-specified values. At the language implementation level there are all sorts of representations that might be used; maybe the implementation Huffman encodes the elements... How the application processes that array is completely up to the application. It may treat the array simply as two integer values. It may treat each element as a 21-bit value encoding a Unicode codepoint and logically consider the array to be a Unicode string of length 2. It may consider each element to be a 16-bit value and interpret sequences of values as UTF-16 string encodings. In that case, it could consider it to represent a string of logical length 1.

I think we agree about the implementation/interface split.

If DOMString specifies the semantics of a result from

I'm not sure I understand the bit about how the semantics of DOMString could affect ES programs.

Is it the case that

document.createTextNode('\u+010000').length === 2

ideally, but...

This is where it gets potentially sticky. This code is passing something that is not a valid UTF-16 string encoding to a DOM routine that is declared to take a DOMString argument. This is a new situation that we need to negotiate with the WebIDL ES binding. There are a couple of possible ways to approach this binding. One is to say it is illegal to pass such a string as a DOMString. However, that isn't very user friendly, as it precludes using such literals as DOM arguments or forces them to be written as document.createTextNode(UTF16Encode('\u+10000')). It would be better to do the encoding automatically as part of the DOM call marshaling.

Note that in either case, a check has to be made to determine whether the string contains characters whose character codes are > 0xffff. My argument is that, perhaps surprisingly, this should be a very cheap test. The reason is that ES strings are immutable and that any reasonable implementation is likely to use an optimized internal representation for strings that only contain 16-bit character codes. Thus, it is likely that such 16-bit-only string values will be immediately identifiable as such without requiring any call-site inspection of the actual individual characters to make that determination.
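
A sketch of the marshaling step being described. Since no current string can hold elements above 0xFFFF, a proposal-era string is modeled here as an array of element values; a real implementation could skip the whole scan whenever the string's internal representation already guarantees 16-bit-only elements:

function marshalToDOMString(elements) {
  var units = [];
  for (var i = 0; i < elements.length; i++) {
    var v = elements[i];
    if (v > 0xFFFF) {                       // supplemental element: expand to a pair
      var offset = v - 0x10000;
      units.push(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
    } else {
      units.push(v);
    }
  }
  return String.fromCharCode.apply(null, units);
}

marshalToDOMString([0x10000, 0x61]);   // "\uD800\uDC00a"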

'\u+010000' === 1

yes, assuming that there is a .length missing above.

or are you saying that when DOMStrings are exposed to ES code, ES gets to defined the semantics of the "length" and "0" properties.

No, for compatibility with the existing web, DOMStrings need to manifest as UTF-16 encodings.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 3:36 PM, Mark Davis ☕ wrote:

all defined Unicode characters.

That would also not be correct. The defined characters are only about 109K (more if you consider private use); nowhere near the number of code points, because there are over 800K code points that are reserved for the allocation of future characters. For a breakdown, see www.unicode.org/versions/Unicode6.0.0/#Character_Additions

Sorry about the terminology issues; I'll work on fixing them.

I actually think "character" is the right term for use in:

SourceCharacter :: any Unicode character

This is defining the alphabet of the grammar. The alphabet is composed of logical characters, not specific encodings. The actual program might be encoded in EBCDIC or Hollerith card codes as long as there is a mapping of the characters actually used in that encoding to Unicode characters.

The intent is that any defined Unicode character can be used. That is the ~109K today, but growing in the future as Unicode adopts additional characters. In practice, there are actually very few places in the grammar where any SourceCharacter is allowed, but in those places we really do mean the valid logical characters defined by the current Unicode standard.

# Brendan Eich (14 years ago)

On May 16, 2011, at 2:07 PM, Boris Zbarsky wrote:

That said, defining JS strings and DOMString differently seems like a recipe for serious author confusion (e.g. actually using JS strings as the DOMString binding in ES might be lossy, assigning from JS strings to DOMString might be lossy, etc). It's a minefield.

Plus, people stuff random data into JS strings, which so far have not been UTF-16 validated or indexed, and they could read back arbitrary uint16s in a row.

Breaking this seems web-breaking to me, from what I remember. It's impossible to detect statically (early error).

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 4:21 PM, Shawn Steele wrote:

Not in my proposal! "\ud800\udc00"=== "\u+010000" is false in my proposal.

That’s exactly my problem. I think the engines (or at least the applications written in JavaScript) are still UTF-16-centric and that they’ll have d800, dc00 === 10000. For example, if they were different, then d800, dc00 should print �� instead of 𐀀, however I’m reasonably sure that any implementation would end up rendering it as 𐀀.

I think you'll find that the actual JS engines are currently UCS-2 centric. The surrounding browser environments are doing the UTF-16 interpretation. That's why you see 𐀀 instead of �� in browser-generated display output.

In other words I don’t think you can get the engine to be completely UTF-32. At least not without declaring a page as being UTF-32.

I agree that application writers will continue for the foreseeable future to have to know whether or not they are dealing with UTF-16 encoded data and/or communicating with other subsystems that expect such data. However, core language support for UTF-32 is a prerequisite for ever moving beyond UTF-16 APIs and libraries and getting back to uniform-sized character processing.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 5:06 PM, Brendan Eich wrote:

On May 16, 2011, at 2:07 PM, Boris Zbarsky wrote:

That said, defining JS strings and DOMString differently seems like a recipe for serious author confusion (e.g. actually using JS strings as the DOMString binding in ES might be lossy, assigning from JS strings to DOMString might be lossy, etc). It's a minefield.

Plus, people stuff random data into JS strings, which so far have not been UTF-16 validated or indexed, and they could read back arbitrary uint16s in a row.

Breaking this seems web-breaking to me, from what I remember. It's impossible to detect statically (early error).

I think I've addressed this in other responses, including in esdiscuss/2011-May/014307 See the part about passing a string with >16-bit chars to a parameter that requires a DOMString

The main thing to add is that putting random >16-bit values into a string requires using new APIs or syntax defined in the proposal that currently are not in ES or browsers. I don't see how that can be called "web-breaking".

# Shawn Steele (14 years ago)

I think you'll find that the actual JS engines are currently UCS-2 centric. The surrounding browser environments are doing the UTF-16 interpretation. That's why you see 𐀀 instead of �� in browser-generated display output.

There’s no difference. I wouldn’t call Windows C++ WCHAR “UCS-2”, however if wc[0] = 0xD800 and wc[1] = 0xDC00, it’s going to act like there’s a U+10000 character at wc[0], and wc[1] is going to be meaningless. Which is exactly how JavaScript behaves today. The JavaScript engines don’t care if it’s UCS-2 or UTF-16 because they aren’t doing anything meaningful with the difference, except not supporting native recognition of code points >= 0x10000.

# Boris Zbarsky (14 years ago)

On 5/16/11 6:18 PM, Allen Wirfs-Brock wrote:

If the string is written as "\ud800\udc00\u0061" the 'a' will be at offset 2, even in the new proposal. It would only be at offset 1 if it were written as "\u+010000\u+000061" (using the literal notation from the proposal).

Ah, so in the proposal strings that happen to be sequences of UTF-16 units won't be automatically converted to Unicode strings?

That seems like it'll make it very easy to introduce strings that are a mix of the two via concatenation....

# Mike Samuel (14 years ago)

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

If the string is written as "\ud800\udc00\u0061" the 'a' will be at offset 2, even in the new proposal. It would only be at offset 1 if it were written as "\u+010000\u+000061" (using the literal notation from the proposal).

Under this scheme,

 eval('  "\\uD834\\uDD1E"  ')  !== JSON.parse('  "\\uD834\\uDD1E"  ')

From RFC 4627 """ To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". """

# Boris Zbarsky (14 years ago)

On 5/16/11 7:05 PM, Allen Wirfs-Brock wrote:

If you allow storage of such, then you're allowing mixing Unicode strings and "something else" (whatever the something else is), with most likely bad results.

Most simply, assigning a DOMString containing surrogates to a JS string should collapse the surrogate pairs into the corresponding codepoint if JS strings really contain codepoints...

No, that would be a breaking change to the web!

OK, we agree there.

The only way to make this work is if either DOMString is redefined or DOMString and full Unicode strings are different kinds of objects.

Not really, you need to make the distinction between what a String can contain and what String contents are valid in specific application domains.

DOMString seems to be quite clearly defined to consist of 16-bit valued elements interpreted as a UTF-16 encoded Unicode string.

Right, because that's all it can be given what ES string is right now.

All such DOMStrings are valid ES strings according to my proposal, but it isn't the case that all ES Strings are valid DOMStrings.

OK, what happens when such an ES string is passed to an interface that takes a DOMString?

What happens when such an ES string is concatenated with a DOMString containing surrogate pairs? I don't mean on the implementation level for the concatenation case; that part is trivial. I mean what can the programmer sanely do with the result?

To the depth of my understanding, this is already the case today with 16-bit ES characters. You can create an ES string which does not conform to the UTF-16 encoding rules.

In practice browsers just let you put non-UTF16 stuff in the DOM and then fake things, as far as I can tell. There are a few bugs around on not allowing that, but the cost is too high (e.g. there are popular browser implementations in which the DOM and JS share the same reference counted string buffer when you pass a string across the boundary).

Users doing surrogate pair decomposition will probably find that their code "just works"

How, exactly?

Because the string will continue to contain surrogate pairs.

Until someone somewhere else in the workflow (say an ad on the page, in the browser context) adds in a string containing non-BMP codepoints, right?

Users creating Strings with surrogate pairs will need to re-tool

Such users would include the DOM, right?

No. That would be a breaking change in the context of the browser.

OK...

Programs creating surrogate pairs that want to be updated to not use surrogate pairs are the only ones that need to retool.

I think this is compartmentalizing programs in ways that don't map to reality.

More likely we are talking about new code that can be written without having to worry about surrogate pairs.

And again here. A lot of JS on the web takes strings from all sorts of sources not necessarily under the control of the JS itself (user input, XMLHttpRequest of XML and JSON, etc), then mashes them together in various ways.

If somebody wants to grab a bunch of text from the DOM and manipulate it without encountering surrogate pairs, they will need to explicitly perform a decodeUTF16 transformation.

What if they don't want to encounter non-BMP characters except in surrogate pair form (i.e. have the environment they have now)?

You're talking every single web developer here. Or at least every single web developer who wants to work with Devanagari text.

No, they will probably always have a choice for their own internal processing. Deal with logically 16-bit characters that use UTF-16, or deal with logical 21-bit characters. Only when communicating with an external agent (for example the DOM) do you have to adapt to that agent's requirements.

Web JS is all about communicating with external agents. That's its purpose in life, for the most part.

Somebody has to go first. I'm saying that it has to be ES that goes first. ES can do this without breaking any existing web code.

I disagree on "somebody has to go first"; it should be possible to coordinate such a change.

I agree that if we impose an ordering then clearly ES has to go first.

I think that if we made this change to ES only today and the new capabilities were completely unused no existing web code would break.

I also think that if we made this change to ES only today and then part but not all of the web got changed to use the new capabilities we would break some web code.

I will posit as an axiom that any changes to the web in terms of adopting the new feature will be incremental (please let me know if there is reason to think this is not the case). A corollary that I believe to be true is that we therefore have to assume that "old strings" and "new strings" will coexist in the set of strings scripts have to handle as things stand. This may be true no matter what the DOM does, but is definitely true if the DOM remains as it is.

# Boris Zbarsky (14 years ago)

On 5/16/11 7:21 PM, Shawn Steele wrote:

In other words I don’t think you can get the engine to be completely UTF-32. At least not without declaring a page as being UTF-32.

For what it's worth, HTML5 does not support declaring a page as UTF-32 at all. We're removing our existing support for this from Gecko in Firefox 5....

# Brendan Eich (14 years ago)

On May 16, 2011, at 5:18 PM, Allen Wirfs-Brock wrote:

On May 16, 2011, at 5:06 PM, Brendan Eich wrote:

On May 16, 2011, at 2:07 PM, Boris Zbarsky wrote:

That said, defining JS strings and DOMString differently seems like a recipe for serious author confusion (e.g. actually using JS strings as the DOMString binding in ES might be lossy, assigning from JS strings to DOMString might be lossy, etc). It's a minefield.

Plus, people stuff random data into JS strings, which so far have not been UTF-16 validated or indexed, and they could read back arbitrary uint16s in a row.

Breaking this seems web-breaking to me, from what I remember. It's impossible to detect statically (early error).

I think I've addressed this in other responses, including in esdiscuss/2011-May/014307 See the part about passing a string with >16-bit chars to a parameter that requires a DOMString

I'm not sure this covers all the cases. Boris mentioned how JS takes strings from many sources, and it can concatenate them, in a data flow that crosses programs. Is it really safe to reason about this in a modular or "local" way?

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 6:41 PM, Boris Zbarsky wrote:

On 5/16/11 6:18 PM, Allen Wirfs-Brock wrote:

If the string is written as "\ud800\udc00\u0061" the 'a' will be at offset 2, even in the new proposal. It would only be at offset 1 if it were written as "\u+010000\u+000061" (using the literal notation from the proposal).

Ah, so in the proposal strings that happen to be sequences of UTF-16 units won't be automatically converted to Unicode strings?

Probably more correct to say UCS-2 units above. Such strings could be internally represented using 16-bit character cells. From a JS perspective, it is not possible to determine the size of the actual cell used to store individual characters of a string. Of course, at the implementation level you have to know the size of the character cell, and if it is 16 bits then you know that the string doesn't contain any raw supplemental characters. (It might contain UTF-16 encoded characters, but that is at a different logical system layer.)

That seems like it'll make it very easy to introduce strings that are a mix of the two via concatenation....

Some implementations already use tree structures to represent strings that are built via concatenation. It would be straightforward to have such a tree string representation where some segments have 16-bit cells and others 32-bit (or even 8-bit) cells. That is probably how I would represent any long string that contained only a few supplemental characters.

# Boris Zbarsky (14 years ago)

On 5/16/11 10:20 PM, Allen Wirfs-Brock wrote:

That seems like it'll make it very easy to introduce strings that are a mix of the two via concatenation....

Some implementations already use tree structures to represent strings that are built via concatenation. It would be straightforward to have such a tree string representation where some segments have 16-bit cells and others 32-bit (or even 8-bit) cells. That is probably how I would represent any long string that contained only a few supplemental characters.

I'm not talking about the implementation end. I can see how I'd implement this stuff, or make Luke implement it or something. What I don't see is how the JS program author can sanely work with the result.

# Allen Wirfs-Brock (14 years ago)

It already isn't the case that eval(x)===JSON.parse(x). See timelessrepo.com/json-isnt-a-javascript-subset

On May 16, 2011, at 6:51 PM, Mike Samuel wrote:

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

If the string is written as "\ud800\udc00\u0061" the 'a' will be at offset 2, even in the new proposal. It would only be at offset 1 if it were written as "\u+010000\u+000061" (using the literal notation from the proposal).

Under this scheme,

eval('  "\\uD834\\uDD1E"  ')  !== JSON.parse('  "\\uD834\\uDD1E"  ')

It is already the case that eval(x)===JSON.parse(x) is not necessarily true for values of x that are valid JSON source strings. See timelessrepo.com/json-isnt-a-javascript-subset

From RFC 4627 """ To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". """

That's what it says, but it's not what JSON parsers do. They don't generate a single-character ES string containing only the single character U+1D11E. They generate a 2-character ES string containing the character pair \uD834 and \uDD1E. Essentially JSON.parse currently generates UCS-2 strings that may be interpreted as UTF-16 by the application layer. Nothing would change in this regard under my proposal.
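
This is observable in any current engine:

var g = JSON.parse('"\\uD834\\uDD1E"');   // the G clef example from RFC 4627
g.length;            // 2
g.charCodeAt(0);     // 0xD834
g.charCodeAt(1);     // 0xDD1E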

Interestingly, RFC 4627 says that JSON text is "Unicode" and may be encoded in UTF-8, 16, or 32, and one of the alternatives for the 'unescaped' production is %x5D-10FFFF. That seems to suggest that supplemental characters are allowed to occur in JSON text without escaping. JSON.parse doesn't support this. In fact, it looks like JSON.parse doesn't conform to the RFC 4627 parsing requirements. To do so, it would either have to be explicitly told whether the encoding used within the JS String argument was UTF-8 or UTF-16, or it should determine the encoding from the first four octets as described in section 3 of 4627...

# Mike Samuel (14 years ago)

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

It already isn't the case that eval(x)===JSON.parse(x). See timelessrepo.com/json-isnt-a-javascript-subset

I'm aware of that hole. That doesn't mean that we should break the relationship for code that doesn't error out in either.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 7:22 PM, Boris Zbarsky wrote:

On 5/16/11 10:20 PM, Allen Wirfs-Brock wrote:

That seems like it'll make it very easy to introduce strings that are a mix of the two via concatenation....

Some implementations already use tree structures to represent strings that are built via concatenation. It would be straightforward to have such a tree string representation where some segments have 16-bit cells and others 32-bit (or even 8-bit) cells. That is probably how I would represent any long string that contained only a few supplemental characters.

I'm not talking about the implementation end. I can see how I'd implement this stuff, or make Luke implement it or something. What I don't see is how the JS program author can sanely work with the result.

In theory, the JS programmer already has to manually keep track of whether or not a string value is UTF-16 or UCS-2. As John Tamplin observed in esdiscuss/2011-May/014319, most JS programmers simply assume they are dealing with the BMP and trip up if they actually have to process a surrogate pair that was unexpectedly handed to them from the DOM.

That said, it would be easy enough to expand the proposal with a JS-programmer-visible property

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 7:53 PM, Mike Samuel wrote:

2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:

It already isn't the case that eval(x)===JSON.parse(x). See timelessrepo.com/json-isnt-a-javascript-subset

I'm aware of that hole. That doesn't mean that we should break the relationship for code that doesn't error out in either.

Like I said in the text after the RFC 4627 quote, nothing would change with this. JSON.parse would continue to produce exactly the same ES strings that it currently does.

# Allen Wirfs-Brock (14 years ago)

On May 16, 2011, at 7:18 PM, Brendan Eich wrote:

On May 16, 2011, at 5:18 PM, Allen Wirfs-Brock wrote:

On May 16, 2011, at 5:06 PM, Brendan Eich wrote:

On May 16, 2011, at 2:07 PM, Boris Zbarsky wrote:

That said, defining JS strings and DOMString differently seems like a recipe for serious author confusion (e.g. actually using JS strings as the DOMString binding in ES might be lossy, assigning from JS strings to DOMString might be lossy, etc). It's a minefield.

Plus, people stuff random data into JS strings, which so far have not been UTF-16 validated or indexed, and they could read back arbitrary uint16s in a row.

Breaking this seems web-breaking to me, from what I remember. It's impossible to detect statically (early error).

I think I've addressed this in other responses, including in esdiscuss/2011-May/014307 See the part about passing a string with >16-bit chars to a parameter that requires a DOMString

I'm not sure this covers all the cases. Boris mentioned how JS takes strings from many sources, and it can concatenate them, in a data flow that crosses programs. Is it really safe to reason about this in a modular or "local" way?

I think it does. In another reply I also mentioned the possibility of tagging, in a JS-visible manner, strings that have gone through a known encoding process.

If the strings you are combining from different sources have not been canonicalized to a common encoding, then you had better be very careful how you combine them. The DOM seems to canonicalize to UTF-16 (with some slop WRT invalid encodings that Boris and others have pointed out). I don't know about other sources such as XMLHttpRequest or the file APIs. However, in the long run JS in the browser is going to have to be able to deal with arbitrary encodings. You can hide such things from many programmers, but not all. After all, people actually have to implement transcoders.

# Norbert Lindenberg (14 years ago)

I have read the discussion so far, but would like to come back to the
strawman itself because I believe that it starts with a problem
statement that's incorrect and misleading the discussion. Correctly
describing the current situation would help in the discussion of
possible changes, in particular their compatibility impact.

The relevant portion of the problem statement:

"ECMAScript currently only directly supports the 16-bit basic
multilingual plane (BMP) subset of Unicode which is all that existed
when ECMAScript was first designed. [...] As currently defined,
characters in this expanded character set cannot be used in the source
code of ECMAScript programs and cannot be directly included in runtime
ECMAScript string values."

My reading of the ECMAScript Language Specification, edition 5.1
(January 2011), is:

  1. ECMAScript allows, but does not require, implementations to support
    the full Unicode character set.

  2. ECMAScript allows source code of ECMAScript programs to contain
    characters from the full Unicode character set.

  3. ECMAScript requires implementations to treat String values as
    sequences of UTF-16 code units, and defines key functionality based on
    an interpretation of String values as sequences of UTF-16 code units,
    not based on an interpretation as sequences of Unicode code points.

  4. ECMAScript prohibits implementations from conforming to the Unicode
    standard with respect to case conversions.

The relevant text portions leading to these statements are:

  1. Section 2, Conformance: "A conforming implementation of this
    Standard shall interpret characters in conformance with the Unicode
    Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2
    or UTF-16 as the adopted encoding form, implementation level 3. If the
    adopted ISO/IEC 10646-1 subset is not otherwise specified, it is
    presumed to be the BMP subset, collection 300. If the adopted encoding
    form is not otherwise specified, it presumed to be the UTF-16 encoding
    form."

To interpret this, note that the Unicode Standard, Version 3.1 was the
first one to encode actual supplementary characters [1], and that the
only difference between UCS-2 and UTF-16 is that UTF-16 supports
supplementary characters while UCS-2 does not [2].

  2. Section 6, Source Text: "ECMAScript source text is represented as a
    sequence of characters in the Unicode character encoding, version 3.0
    or later. [...] ECMAScript source text is assumed to be a sequence of
    16-bit code units for the purposes of this specification. [...] If an
    actual source text is encoded in a form other than 16-bit code units
    it must be processed as if it was first converted to UTF-16."

To interpret this, note again that the Unicode Standard, Version 3.1
was the first one to encode actual supplementary characters, and that
the conversion requirement enables the use of supplementary characters
represented as 4-byte UTF-8 characters in source text. As UTF-8 is now
the most commonly used character encoding on the web [3], the 4-byte
UTF-8 representation, not Unicode escape sequences, should be seen as
the normal representation of supplementary characters in ECMAScript
source text.

  3. Section 6, Source Text: "If an actual source text is encoded in a
    form other than 16-bit code units it must be processed as if it was
    first converted to UTF-16. [...] Throughout the rest of this document,
    the phrase “code unit” and the word “character” will be used to
    refer to a 16-bit unsigned value used to represent a single 16-bit
    unit of text." Section 15.5.4.4, String.prototype.charCodeAt(pos):
    "Returns a Number (a nonnegative integer less than 2**16) representing
    the code unit value of the character at position pos in the String
    resulting from converting this object to a String." Section 15.5.5.1
    length: "The number of characters in the String value represented by
    this String object."

I don't like that the specification redefines a commonly used term
such as "character" to mean something quite different ("code unit"),
and hides that redefinition in a section on source text while applying
it primarily to runtime behavior. But there it is: Thanks to the
redefinition, it's clear that charCodeAt() returns UTF-16 code units,
and that the length property holds the number of UTF-16 code units in
the string.
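
Concretely, given those definitions, current behavior for a supplementary
character is:

"\uD801\uDC00".length;          // 2 -- counts UTF-16 code units, not code points
"\uD801\uDC00".charCodeAt(0);   // 0xD801 -- a code unit value, not the code point 0x10400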

  4. Section 15.5.4.16, String.prototype.toLowerCase(): "For the
    purposes of this operation, the 16-bit code units of the Strings are
    treated as code points in the Unicode Basic Multilingual Plane.
    Surrogate code points are directly transferred from S to L without any
    mapping."

This does not meet Conformance Requirement C8 of the Unicode Standard,
Version 6.0 [4]: "When a process interprets a code unit sequence which
purports to be in a Unicode character encoding form, it shall
interpret that code unit sequence according to the corresponding code
point sequence."
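
A concrete example of the consequence, assuming an implementation that follows
the ES5 text quoted above: DESERET CAPITAL LETTER LONG I (U+10400) has a
Unicode lowercase mapping to U+10428, but because the surrogate code units are
transferred without mapping, toLowerCase() leaves the string unchanged:

"\uD801\uDC00".toLowerCase();   // "\uD801\uDC00" under ES5; a Unicode-conformant
                                // case mapping would give "\uD801\uDC28" (U+10428)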

References:

[1] www.unicode.org/reports/tr27/tr27-4.html
[2] www.unicode.org/glossary/#U
[3] as Mark Davis reported at the Unicode Conference 2010
[4] www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Best, Norbert

# Wes Garland (14 years ago)

On 16 May 2011 17:42, Boris Zbarsky <bzbarsky at mit.edu> wrote:

On 5/16/11 4:38 PM, Wes Garland wrote:

Two great things about strings composed of Unicode code points:

...

Even though this is a breaking change from ES-5, I support it

whole-heartedly.... but I expect breakage to be very limited. Provided that the implementation does not restrict the storage of reserved code points (D800-DFFF)

Those aren't code points at all. They're just not Unicode.

Not quite: code points D800-DFFF are reserved code points which are not representable with UTF-16. Definition D71, Unicode 6.0.

If you allow storage of such, then you're allowing mixing Unicode strings and "something else" (whatever the something else is), with most likely bad results.

I don't believe this is true. We are merely allowing storage of Unicode strings which cannot be converted into UTF-16. That allows us to maintain most of the existing String behaviour (arbitrary array of uint16), although overflowing like this would break:

a = String.fromCharCode(str.charCodeAt(0) + 1)

when str[0] is U+FFFF.

Most simply, assigning a DOMString containing surrogates to a JS string should collapse the surrogate pairs into the corresponding codepoint if JS strings really contain codepoints...

The only way to make this work is if either DOMString is redefined or DOMString and full Unicode strings are different kinds of objects.

Users doing surrogate pair decomposition will probably find that their

code "just works"

How, exactly?

/** Untested and not rigorous */
function unicode_strlen(validUnicodeString) {
  var length = 0;
  for (var i = 0; i < validUnicodeString.length; i++) {
    if (validUnicodeString.charCodeAt(i) >= 0xd800 &&
        validUnicodeString.charCodeAt(i) <= 0xdbff) {
      continue;   // skip the high surrogate; the pair is counted once via its low surrogate
    }
    length++;
  }
  return length;
}

Code like this ^^^^ which looks for surrogate pairs in valid Unicode strings will simply not find them, instead only finding code points, which seem to be the same size as the code unit.

Users creating Strings with surrogate pairs will need to

re-tool

Such users would include the DOM, right?

I am hopeful that most web browsers have one or few interfaces between DOM strings and JS strings. I do not know if my hopes reflect reality.

but this is a small burden and these users will be at the upper

strata of Unicode-foodom.

You're talking every single web developer here. Or at least every single web developer who wants to work with Devanagari text.

I don't think so. I bet if we could survey web developers across the industry (rather than just top-tier people who tend to participate in discussions like this one), we would find that the vast majority of them never bother handling non-BMP cases, and do not test non-BMP cases.

Heck, I don't even know if a non-BMP character can be data-entered into an <input type="text" maxlength="1"> or not. (Do you? What happens?)

I suspect that 99.99% of users will find that this change will fix bugs in their code when dealing with non-BMP characters.

Not unless DOMString is changed or the interaction between the two very carefully defined in failure-proof ways.

Yes, I was dismayed to find out that DOMString defines UTF-16.

We could get away with converting UTF-16 at the DOMString <> JSString transition point. This might mean that it is possible that JSString=>DOMString would throw, as full Unicode Strings could contain code points which are not representable in UTF-16.

If we don't throw on invalid-in-UTF-16 code points, then round-tripping is lossy. If we do, that's silly.

It needed to specify something, and UTF-16 was the thing that was compatible with how scripts work in ES. Not to mention the Java legacy of the DOM...

By this comment, I am inferring then that DOM and JS Strings share their backing store. From an API-cleanliness point of view, that's too bad. From an implementation POV, it makes sense. Actually, it makes even more sense when I recall the discussion we had last week when you explained how external strings etc work in SpiderMonkey/Gecko.

Do all the browsers share JS/DOM String backing stores?

It is an unfortunate accident of history that UTF-16 surrogate pairs leak their abstraction into ES Strings, and I believe it is high time we fixed that.

If you can do that without breaking web pages, great. If not, then we need to talk. ;)

There is no question in my mind that this proposal would break Unicode-aware JS. It is my belief that that doesn't matter if it accompanies other major, opt-in changes.

Resolving DOM String <> JS String interchange is a little trickier, but I think it can be managed if we can allow JS=>DOM to throw when high surrogate code points are encountered in the JS String. It might mean extra copying, or it might not if the DOM implementation already uses UTF-8 internally.

# Boris Zbarsky (14 years ago)

On 5/17/11 10:40 AM, Wes Garland wrote:

On 16 May 2011 17:42, Boris Zbarsky <bzbarsky at mit.edu> wrote:

Those aren't code points at all. They're just not Unicode.

Not quite: code points D800-DFFF are reserved code points which are not representable with UTF-16.

Nor with any other Unicode encoding, really. They don't represent, on their own, Unicode characters.

If you allow storage of such, then you're allowing mixing Unicode
strings and "something else" (whatever the something else is), with
most likely bad results.

I don't believe this is true. We are merely allowing storage of Unicode strings which cannot be converted into UTF-16.

No, you're allowing storage of some sort of number arrays that don't represent Unicode strings at all.

    Users doing surrogate pair decomposition will probably find that
    their code "just works"

How, exactly?

/** Untested and not rigourous */
function unicode_strlen(validUnicodeString) {
  var length = 0;
  for (var i = 0; i < validUnicodeString.length; i++) {
    if (validUnicodeString.charCodeAt(i) >= 0xd800 &&
        validUnicodeString.charCodeAt(i) <= 0xdc00)
      continue;
    length++;
  }
  return length;
}

Code like this ^^^^ which looks for surrogate pairs in valid Unicode strings will simply not find them, instead only finding code points which seem to be the same size as the code unit.

Right, so if it's looking for non-BMP characters in the string, say, instead of computing the length, it won't find them. How the heck is that "just works"?

    Users creating Strings with surrogate pairs will need to
    re-tool

Such users would include the DOM, right?

I am hopeful that most web browsers have one or few interfaces between DOM strings and JS strings.

A number of web browsers have an interface between DOM and JS strings that consists of either "memcpy" or "addref the buffer".

I do not know if my hopes reflect reality.

They probably do, so you're only really talking about at least 10 different places across at least 5 different codebases that have to be fixed, in a coordinated way...

You're talking every single web developer here.  Or at least every
single web developer who wants to work with Devanagari text.

I don't think so. I bet if we could survey web developers across the industry (rather than just top-tier people who tend to participate in discussions like this one), we would find that the vast majority of them never bother handling non-BMP cases, and do not test non-BMP cases.

And how many of them use libraries that handle that for them?

And how many implicitly rely on DOM-to-JS roundtripping without explicitly doing anything with non-BMP chars or surrogate pairs?

Heck, I don't even know if a non-BMP character can be data-entered into an <input type="text" maxlength="1"> or not. (Do you? What happens?)

It cannot in Gecko, as I recall; there maxlength is interpreted as number of UTF-16 code units.

In WebKit, maxlength is interpreted as the number of grapheme clusters based on my look at their code just now.

I don't know offhand about Presto and Trident, for obvious reasons.

We could get away with converting UTF-16 at DOMString <> JSString transition point.

What would that even mean? DOMString is defined to be an ES string in the ES binding right now. Is the proposal to have some other kind of object for DOMString (so that, for example, String.prototype would no longer affect the behavior of DOMString the way it does now)?

This might mean that it is possible that JSString=>DOMString would throw, as full Unicode Strings could contain code points which are not representable in UTF-16.

How is that different from sticking non-UTF-16 into an ES string right now?

If we don't throw on invalid-in-UTF-16 code points, then round-tripping is lossy. If we do, that's silly.

So both options suck, yes? ;)

It needed to specify _something_, and UTF-16 was the thing that was
compatible with how scripts work in ES.  Not to mention the Java
legacy of the DOM...

By this comment, I am inferring then that DOM and JS Strings share their backing store.

That's not what the comment was about, actually. The comment was about API.

But yes, in many cases they do share backing store.

Do all the browsers share JS/DOM String backing stores?

Gecko does in some cases.

WebKit+JSC does in all cases, I believe (or at least a large majority of cases).

I don't know about others.

There is no question in my mind that this proposal would break Unicode-aware JS.

As far as I can tell it would also break Unicode-unaware JS.

It is my belief that that doesn't matter if it accompanies other major, opt-in changes.

If it's opt-in, perhaps.

Resolving DOM String <> JS String interchange is a little trickier, but I think it can be managed if we can allow JS=>DOM to throw when high surrogate code points are encountered in the JS String.

I'm 99% sure this would break sites.

It might mean extra copying, or it might not if the DOM implementation already uses UTF-8 internally.

Uh... what does UTF-8 have to do with this?

(As a note, Gecko and WebKit both use UTF-16 internally; I would be really surprised if Trident does not. No idea about Presto.)

# Allen Wirfs-Brock (14 years ago)

On May 17, 2011, at 1:15 AM, Norbert Lindenberg wrote:

I have read the discussion so far, but would like to come back to the strawman itself because I believe that it starts with a problem statement that's incorrect and misleading the discussion. Correctly describing the current situation would help in the discussion of possible changes, in particular their compatibility impact.

Since I was editor of the ES5 specification and the person who wrote or updated this language I can tell you the actual intent.

Clause 6 of Edition 3 referenced Unicode version 2.1 and used the words "codepoint" and "character" interchangeably to refer to "a 16-bit unsigned value used to represent a single 16-bit unit of UTF-16 text". It also defined "Unicode character" to mean "...a single Unicode scalar value (which may be longer than 16-bits...)", and defined the production SourceCharacter :: any Unicode character

Regardless of these definitions the rest of the Edition 3 specification generally treated SourceCharacter as being equivalent to a 16-bit "character". For example, the elements of a StringLiteral are SourceCharacters but in defining how string values are constructed from these literals each SourceCharacter maps to a single 16-bit character element of the string value. There is no discussion of how to handle SourceCharacters whose Unicode scalar value is larger than 0xffff.

In practice browser JavaScript implementations of this era processed source input as if it was UCS-2. They did not recognize surrogate pairs as representing single Unicode character SourceCharacters.

In drafting the ES5 spec, TC39 had two goals WRT character encoding. We wanted to allow the occurrences of (BMP) characters defined in Unicode versions beyond 2.1 and we wanted to update the specification to reflect actual implementation reality that source was processed as if it was UCS-2. We updated the Unicode reference to "3.0 or later". More importantly, we changed the definition of SourceCharacter to SourceCharacter :: any Unicode code unit and generally changed all uses of the term "codepoint" to "code unit". Other editorial changes were also made to clarify that the alphabet of ES5 is 16-bit code units.

This editorial update was probably incomplete and some vestiges of the old language still remain. In other cases the language may not be technically correct, according to the official Unicode vocabulary. Regardless, that was the intent behind these changes: to make the ES5 spec match the current web reality that JavaScript implementations process source code as if it exists in a UCS-2 world.

The ES5 specification language clearly still has issues WRT Unicode encoding of programs and strings. These need to be fixed in the next edition. However, interpreting the current language as allowing supplemental characters to occur in program text and particularly string literals doesn't match either reality or the intent of the ES5 spec. Changing the specification to allow such occurrences and propagating them as surrogate pairs would seem to raise similar backwards compatibility concerns to those that have been raised about my proposal. The status quo would be to simply further clarify that the ES language processes programs as if they are UCS-2 encoded. I would prefer to move the language in a direction that doesn't have this restriction. The primary area of concern will be the handling of supplemental characters in string literals. Interpreting them either as single elements of a UTF-32 encoded string or as UTF-16 pairs in the existing ES strings would be a change from current practice. The impact of both approaches needs to be better understood.

The relevant portion of the problem statement:

"ECMAScript currently only directly supports the 16-bit basic multilingual plane (BMP) subset of Unicode which is all that existed when ECMAScript was first designed. [...] As currently defined, characters in this expanded character set cannot be used in the source code of ECMAScript programs and cannot be directly included in runtime ECMAScript string values."

My reading of the ECMAScript Language Specification, edition 5.1 (January 2011), is:

  1. ECMAScript allows, but does not require, implementations to support the full Unicode character set.

  2. ECMAScript allows source code of ECMAScript programs to contain characters from the full Unicode character set.

  3. ECMAScript requires implementations to treat String values as sequences of UTF-16 code units, and defines key functionality based on an interpretation of String values as sequences of UTF-16 code units, not based on an interpretation as sequences of Unicode code points.

  4. ECMAScript prohibits implementations from conforming to the Unicode standard with regard to case conversions.

The relevant text portions leading to these statements are:

  1. Section 2, Conformance: "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form."

To interpret this, note that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters [1], and that the only difference between UCS-2 and UTF-16 is that UTF-16 supports supplementary characters while UCS-2 does not [2].

Other than changing "2.1" to "3.0" this text was not updated from Edition 3. It probably should have been. The original text was written at a time where, in practice, UCS-2 and UTF-16 meant the same thing.

The ES5 motivation for changing "2.1" to "3.0" was not to allow for the use of supplementary characters. As described above, ES5 was attempting to clarify that supplementary characters are not recognized.

  2. Section 6, Source Text: "ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16."

To interpret this, note again that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters, and that the conversion requirement enables the use of supplementary characters represented as 4-byte UTF-8 characters in source text. As UTF-8 is now the most commonly used character encoding on the web [3], the 4-byte UTF-8 representation, not Unicode escape sequences, should be seen as the normal representation of supplementary characters in ECMAScript source text.

This was not the intent. Note that SourceCharacter was redefined as "code unit"

  3. Section 6, Source Text: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16. [...] Throughout the rest of this document, the phrase “code unit” and the word “character” will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text." Section 15.5.4.4, String.prototype.charCodeAt(pos): "Returns a Number (a nonnegative integer less than 2**16) representing the code unit value of the character at position pos in the String resulting from converting this object to a String." Section 15.5.5.1 length: "The number of characters in the String value represented by this String object."

I don't like that the specification redefines a commonly used term such as "character" to mean something quite different ("code unit"), and hides that redefinition in a section on source text while applying it primarily to runtime behavior. But there it is: Thanks to the redefinition, it's clear that charCodeAt() returns UTF-16 code units, and that the length property holds the number of UTF-16 code units in the string.
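A small illustration of that observable behaviour, using U+1D306 written as the surrogate pair it must currently be stored as:

```js
var s = "\uD834\uDF06";   // U+1D306 TETRAGRAM FOR CENTRE
s.length;                 // 2 -- counts 16-bit units, not characters
s.charCodeAt(0);          // 0xD834 (high surrogate)
s.charCodeAt(1);          // 0xDF06 (low surrogate)
s.charAt(0);              // "\uD834" -- half a character
```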

Clause 6 defines "code unit", not "UTF-16 code unit". 15.5.4.4 charCodeAt returns the integer encoding of a "code unit". Nothing is said about it having any correspondence to UTF-16.

  4. Section 15.5.4.16, String.prototype.toLowerCase(): "For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from S to L without any mapping."

This does not meet Conformance Requirement C8 of the Unicode Standard, Version 6.0 [4]: "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall interpret that code unit sequence according to the corresponding code point sequence."

This is an explicit reflection of the fact that all character processing in ES5 (except for the URI encode/decode functions) exclusively processes 16-bit code units and does not recognize or process UTF-16 encoded surrogate pairs.

# Brendan Eich (14 years ago)

On May 16, 2011, at 8:13 PM, Allen Wirfs-Brock wrote:

I think it does. In another reply I also mentioned the possibility of tagging in a JS visible manner strings that have gone through a known encoding process.

Saw that, seems helpful. Want to spec it?

If the strings you are combining from different sources have not been canonicalized to a common encoding then you better be damn careful how you combine them.

Programmers miss this as you note, so arguably things are not much worse, at best no worse, with your proposal.

Your strawman does change the game, though, hence the global or cross-cutting (non-modular) concern. I'm warm to it, after digesting. It's about time we get past the 90's!

The DOM seems to canonicalize to UTF-16 (with some slop WRT invalid encoding that Boris and others have pointed out). I don't know about other sources such as XMLHttpRequest or the file APIs. However, in the long run JS in the browser is going to have to be able to deal with arbitrary encodings. You can hide such things from many programmers but not all. After all, people actually have to implement transcoders.

Transcoding to some canonical Unicode representation is often done by the browser upstream of JS, and that's a good thing. Declarative specification by authors, implementation by relative-few browser i18n gurus, sparing the many JS devs the need to worry. This is good, I claim.

That it means JS hackers are careless about Unicode is inevitable, and there are other reasons for that condition anyway. At least with your strawman there will be full Unicode flowing through JS and back into the DOM and layout.

# Boris Zbarsky (14 years ago)

<<< No Message Collected >>>

# Brendan Eich (14 years ago)

On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:

Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that "in the long run JS in the browser is going to have to be able to deal with arbitrary encodings". Having the capability might be nice, but forcing all web authors to think about it seems like a non-starter.

Allen said "be able to", not "forcing". Big difference. I think we three at least are in agreement here.

That it means JS hackers are careless about Unicode is inevitable, and there are other reasons for that condition anyway. At least with your strawman there will be full Unicode flowing through JS and back into the DOM and layout.

See, this is the part I don't follow. What do you mean by "full Unicode" and how do you envision it flowing?

I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough.

With Allen's proposal we'll finally have some new APIs for JS developers to use.

# Boris Zbarsky (14 years ago)

On 5/17/11 1:27 PM, Brendan Eich wrote:

On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:

Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that "in the long run JS in the browser is going to have to be able to deal with arbitrary encodings". Having the capability might be nice, but forcing all web authors to think about it seems like a non-starter.

Allen said "be able to", not "forcing". Big difference. I think we three at least are in agreement here.

I think we're in agreement on the sentiment, but perhaps not on where on the "able to" to "forcing" spectrum this strawman falls.

See, this is the part I don't follow. What do you mean by "full Unicode" and how do you envision it flowing?

I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough.

With Allen's proposal we'll finally have some new APIs for JS developers to use.

That doesn't answer my questions....

# Brendan Eich (14 years ago)

On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote:

On 5/17/11 1:27 PM, Brendan Eich wrote:

On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:

Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that "in the long run JS in the browser is going to have to be able to deal with arbitrary encodings". Having the capability might be nice, but forcing all web authors to think about it seems like a non-starter.

Allen said "be able to", not "forcing". Big difference. I think we three at least are in agreement here.

I think we're in agreement on the sentiment, but perhaps not on where on the "able to" to "forcing" spectrum this strawman falls.

Where do you read "forcing"? Not in the words you cited.

See, this is the part I don't follow. What do you mean by "full Unicode" and how do you envision it flowing?

I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough.

With Allen's proposal we'll finally have some new APIs for JS developers to use.

That doesn't answer my questions....

Ok, full Unicode means non-BMP characters not being wrongly treated as two uint16 units and miscounted, separated or partly deleted by splicing and slicing, etc.

IOW, JS grows to treat strings as "full Unicode", not uint16 vectors. This is a big deal!
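Concretely, the kind of damage I mean today (an untested sketch):

```js
var s = "I \uD83D\uDC99 JS";   // "I 💙 JS", with U+1F499 stored as a surrogate pair
s.length;                      // 7, though a reader sees 6 characters
s.slice(0, 3);                 // "I \uD83D" -- the pair is cut in half
s.split("").length;            // 7 -- the pair becomes two bogus "characters"
```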

Hope this helps,

# Boris Zbarsky (14 years ago)

On 5/17/11 1:40 PM, Brendan Eich wrote:

On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote:

On 5/17/11 1:27 PM, Brendan Eich wrote:

On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:

Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that "in the long run JS in the browser is going to have to be able to deal with arbitrary encodings". Having the capability might be nice, but forcing all web authors to think about it seems like a non-starter.

Allen said "be able to", not "forcing". Big difference. I think we three at least are in agreement here.

I think we're in agreement on the sentiment, but perhaps not on where on the "able to" to "forcing" spectrum this strawman falls.

Where do you read "forcing"? Not in the words you cited.

In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly?

I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough.

With Allen's proposal we'll finally have some new APIs for JS developers to use.

That doesn't answer my questions....

Ok, full Unicode means non-BMP characters not being wrongly treated as two uint16 units and miscounted, separated or partly deleted by splicing and slicing, etc.

IOW, JS grows to treat strings as "full Unicode", not uint16 vectors. This is a big deal!

OK, but still allows sticking non-Unicode gunk into the strings, right? So they're still vectors of "something". Whatever that something is.

Hope this helps,

Halfway. The DOM interaction questions remain unanswered. Seriously, I think we should try to make a list of the issues there, the pitfalls that would arise for web developers as a result, then go through and see how and whether to address them. Then we'll have a good basis for considering the web compat impact....

# Brendan Eich (14 years ago)

On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote:

On 5/17/11 1:40 PM, Brendan Eich wrote:

Where do you read "forcing"? Not in the words you cited.

In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly?

Where in the strawman is anything of that kind observably (to JS authors) proposed?

Ok, full Unicode means non-BMP characters not being wrongly treated as two uint16 units and miscounted, separated or partly deleted by splicing and slicing, etc.

IOW, JS grows to treat strings as "full Unicode", not uint16 vectors. This is a big deal!

OK, but still allows sticking non-Unicode gunk into the strings, right? So they're still vectors of "something". Whatever that something is.

Yes, old APIs for building strings, e.g. String.fromCharCode, still build "gunk strings", aka uint16 data hacked into strings. New APIs for characters. This has to apply to internal JS engine / DOM implementation APIs as needed, too.
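For instance, a code-point-aware builder could be sketched on top of today's primitives; the name fromCodePoint is illustrative, not something the strawman specifies, and on current engines it still has to fall back to emitting a surrogate pair:

```js
function fromCodePoint(cp) {
  if (cp < 0 || cp > 0x10FFFF) throw new RangeError("invalid code point");
  if (cp <= 0xFFFF) return String.fromCharCode(cp);   // BMP (gunk included)
  var v = cp - 0x10000;
  return String.fromCharCode(0xD800 + (v >> 10),      // high surrogate
                             0xDC00 + (v & 0x3FF));   // low surrogate
}
```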

Hope this helps,

Halfway. The DOM interaction questions remain unanswered. Seriously, I think we should try to make a list of the issues there, the pitfalls that would arise for web developers as a result, then go through and see how and whether to address them. Then we'll have a good basis for considering the web compat impact....

Good idea.

# Brendan Eich (14 years ago)

On May 17, 2011, at 10:47 AM, Brendan Eich wrote:

On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote:

On 5/17/11 1:40 PM, Brendan Eich wrote:

Where do you read "forcing"? Not in the words you cited.

In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly?

Where in the strawman is anything of that kind observably (to JS authors) proposed?

The flag idea just mooted in this thread is not addressing a new problem -- we can have such mixing bugs today. True, the odds may go up for such bugs in the future (hard to assess whether or how much).

At least with new APIs for characters not gunk-units, we can detect mixtures dynamically. This still seems a good idea but it is not essential (yet) and it is nowhere near "forcing developers to worry about encodings".

# Shawn Steele (14 years ago)

I would much prefer changing "UCS-2" to "UTF-16", thus formalizing that surrogate pairs are permitted. That'd be very difficult to break any existing code and would still allow representation of everything reasonable in Unicode.

That would enable Unicode, and allow extending string literals and regular expressions for convenience with the U+10FFFF style notation (which would be equivalent to the surrogate pair). The character code manipulation functions could be similarly augmented without breaking anything (and maybe not needing different names?)

You might want to qualify the UTF-16 as allowing, but strongly discouraging, lone surrogates for those people who didn't realize their binary data wasn't a string.

The sole disadvantage would be that iterating through a string would require consideration of surrogates, same as today. The same caution is also necessary to avoid splitting Ä (U+0041 U+0308) into its component A and ̈ parts. I wouldn't be opposed to some sort of helper functions or classes that aided in walking strings, preferably with options to walk the graphemes (or whatever), not just the surrogate pairs. FWIW: we have such a helper for surrogates in .Net and "nobody uses them". The most common feedback is that it's not that helpful because it doesn't deal with the graphemes.
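For reference, a surrogate-pair-only walker is roughly this (untested; the name and callback shape are just illustrative), and it shows exactly why such helpers disappoint -- it walks code points, not graphemes:

```js
function forEachCodePoint(str, fn) {
  for (var i = 0; i < str.length; i++) {
    var hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      var lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        fn(((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000);
        i++;               // consume the low surrogate too
        continue;
      }
    }
    fn(hi);                // BMP code unit, or an unpaired surrogate
  }
}
// Still visits U+0041 and U+0308 ("Ä") as two separate code points.
```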

  • Shawn

Shawn.Steele at Microsoft.com Senior Software Design Engineer Microsoft Windows

# Boris Zbarsky (14 years ago)

On 5/17/11 1:47 PM, Brendan Eich wrote:

On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote:

On 5/17/11 1:40 PM, Brendan Eich wrote:

Where do you read "forcing"? Not in the words you cited.

In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly?

Where in the strawman is anything of that kind observably (to JS authors) proposed?

The strawman is silent on the matter.

It was proposed by Allen in the discussion about how the strawman interacts with the DOM.

# Wes Garland (14 years ago)

On 17 May 2011 12:36, Boris Zbarsky <bzbarsky at mit.edu> wrote:

Not quite: code points D800-DFFF are reserved code points which are not representable with UTF-16.

Nor with any other Unicode encoding, really. They don't represent, on their own, Unicode characters.

Right - but they are still legitimate code points, and they fill out the space required to let us treat String as uint16[] when defining the backing store as "something that maps to the set of all Unicode code points".

That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88.
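Those bytes come from mechanically applying the three-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx) to the code point; whether a conforming encoder may do this for surrogate code points is the point of contention:

```js
var cp = 0xDC08;
var b1 = 0xE0 | (cp >> 12);           // 0xED
var b2 = 0x80 | ((cp >> 6) & 0x3F);   // 0xB0
var b3 = 0x80 | (cp & 0x3F);          // 0x88
```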

No, you're allowing storage of some sort of number arrays that don't represent Unicode strings at all.

No, if I understand Allen's proposal correctly, we're allowing storage of some sort of number arrays that may contain reserved code points, some of which cannot be represented in UTF-16.

This isn't that different from the status quo; it is possible right now to generate JS Strings which are not valid UTF-16 by creating invalid surrogate pairs.

Keep in mind, also, that even a sequence of random bytes is a valid Unicode string. The standard does not require that they be well-formed. (D80)

Right, so if it's looking for non-BMP characters in the string, say, instead of computing the length, it won't find them. How the heck is that "just works"?

My untested hypothesis is that the vast majority of JS code looking for non-BMP characters is looking for them in order to call them out for special processing, because the code unit and code point size are different. When they don't need special processing, they don't need to be found. Since the high-surrogate code points do not appear in well-formed Unicode strings, they will not be found, and the unneeded special processing will not happen. This train of clauses forms the basis for my opinion that, for the majority of folks, things will "just work".

What would that even mean? DOMString is defined to be an ES string in the ES binding right now. Is the proposal to have some other kind of object for DOMString (so that, for example, String.prototype would no longer affect the behavior of DOMString the way it does now)?

Wait, are DOMStrings formally UTF-16, or are they ES Strings?

This might mean that it is possible that JSString=>DOMString would throw, as full Unicode Strings could contain code points which are not representable in UTF-16.

How is that different from sticking non-UTF-16 into an ES string right now?

Currently, JS Strings are effectively arrays of 16-bit code units, which are indistinguishable from 16-bit Unicode strings (D82). This means that a JS application can use JS Strings as arrays of uint16, and expect to be able to round-trip all strings, even those which are not well-formed, through a UTF-16 DOM.

If we redefine JS Strings to be arrays of Unicode code points, then the JS application can use JS Strings as arrays of uint21 -- but round-tripping the high-surrogate code points through a UTF-16 layer would not work.
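The problematic step is the conversion back to 16-bit code units at the JS=>DOM boundary, roughly this untested sketch; whether it throws (as here), replaces, or passes surrogate-range code points through is precisely the open design choice:

```js
function codePointsToUTF16(codePoints) {
  var units = [];
  codePoints.forEach(function (cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      throw new RangeError("surrogate-range code point cannot round-trip");
    }
    if (cp <= 0xFFFF) {
      units.push(cp);
    } else {
      var v = cp - 0x10000;
      units.push(0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
    }
  });
  return units;
}
```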

It might mean extra copying, or it might not if the DOM implementation already uses UTF-8 internally.

Uh... what does UTF-8 have to do with this?

If you're already storing UTF-8 strings internally, then you are already doing something "expensive" (like copying) to get their code units into and out of JS; so no incremental perf impact by not having a common UTF-16 backing store.

(As a note, Gecko and WebKit both use UTF-16 internally; I would be really surprised if Trident does not. No idea about Presto.)

FWIW - last I time I scanned the v8 sources, it appeared to use a three-representation class, which could store either ASCII, UCS2, or UTF-8. Presumably ASCII could also be ISO-Latin-1, as both are exact, naive, byte-sized UCS2/UTF-16 subsets.

# Allen Wirfs-Brock (14 years ago)

On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote:

On 5/17/11 1:40 PM, Brendan Eich wrote:

On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote:

On 5/17/11 1:27 PM, Brendan Eich wrote:

On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:

Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that "in the long run JS in the browser is going to have to be able to deal with arbitrary encodings". Having the capability might be nice, but forcing all web authors to think about it seems like a non-starter.

Allen said "be able to", not "forcing". Big difference. I think we three at least are in agreement here.

I think we're in agreement on the sentiment, but perhaps not on where on the "able to" to "forcing" spectrum this strawman falls.

Where do you read "forcing"? Not in the words you cited.

In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly?

This already occurs in JS. For example, the encodeURI function produces a string whose characters are the UTF-8 encoding of a UTF-16 string (including recognition of surrogate pairs). So we have at least three string encodings that are explicitly dealt with in the ES spec: UCS-2 without surrogate pair recognition (what string literals produce and string methods process), UTF-16 with surrogate pairs (produced by decodeURI and in browser reality also returned as "DOMStrings"), and UTF-8 (produced by encodeURI). Any JS program that uses encodeURI/decodeURI or retrieves strings from the DOM should worry about such encoding differences, particularly if they combine strings produced from these different sources.
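For example (the escaped bytes are the 4-byte UTF-8 form of U+1D306):

```js
encodeURI("\uD834\uDF06");   // "%F0%9D%8C%86" -- the surrogate pair is
                             // recognized and encoded as one character
encodeURI("\uD834");         // throws URIError -- an unpaired surrogate has
                             // no valid UTF-8 encoding
```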

I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough.

With Allen's proposal we'll finally have some new APIs for JS developers to use.

That doesn't answer my questions....

Ok, full Unicode means non-BMP characters not being wrongly treated as two uint16 units and miscounted, separated or partly deleted by splicing and slicing, etc.

IOW, JS grows to treat strings as "full Unicode", not uint16 vectors. This is a big deal!

OK, but still allows sticking non-Unicode gunk into the strings, right? So they're still vectors of "something". Whatever that something is.

Conceptually unsigned 32-bit values. The actual internal representation is likely to be something else. Interpretation of those values is left to the functions (both built-in and application) that operate upon them. Most built-in string methods do not apply any interpretation and will happily process strings as vectors of arbitrary uint32 values. Some built-ins (encodeURI/decodeURI, toUpperCase/toLowerCase) explicitly deal with Unicode characters or various Unicode encodings and these have to be explicitly defined to deal with non-Unicode character values or invalid encodings. These functions already are defined for ES5 in this manner WRT the representation of strings as vectors of arbitrary uint16 values.

# Boris Zbarsky (14 years ago)

On 5/17/11 2:12 PM, Wes Garland wrote:

That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88.

By the same argument, you can encode them in UTF-16. The byte sequence above is not valid UTF-8. See "How do I convert an unpaired UTF-16 surrogate to UTF-8?" at unicode.org/faq/utf_bom.html which says:

A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By represented such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error.

(fwiw, this is the third hit on Google for "utf-8 surrogates" right after the Wikipedia articles on UTF-8 and UTF-16, so it's not like it's hard to find this information).

No, you're allowing storage of some sort of number arrays that don't
represent Unicode strings at all.

No, if I understand Allen's proposal correctly, we're allowing storage of some sort of number arrays that may contain reserved code points, some of which cannot be represented in UTF-16.

See above. You're allowing number arrays that may or may not be interpretable as Unicode strings, period.

This isn't that different from the status quo; it is possible right now to generate JS Strings which are not valid UTF-16 by creating invalid surrogate pairs.

True. However right now no one is pretending that strings are anything other than arrays of 16-bit units.

Keep in mind, also, that even a sequence of random bytes is a valid Unicode string. The standard does not require that they be well-formed. (D80)

Uh... A sequence of bytes is not anything related to Unicode unless you know how it's encoded.

Not sure what "(D80)" is supposed to mean.

Right, so if it's looking for non-BMP characters in the string, say,
instead of computing the length, it won't find them.  How the heck
is that "just works"?

My untested hypothesis is that the vast majority of JS code looking for non-BMP characters is looking for them in order to call them out for special processing, because the code unit and code point size are different. When they don't need special processing, they don't need to be found.

This hypothesis is worth testing before being blindly inflicted on the web.

What would that even mean?  DOMString is defined to be an ES string
in the ES binding right now.  Is the proposal to have some other
kind of object for DOMString (so that, for example, String.prototype
would no longer affect the behavior of DOMString the way it does now)?

Wait, are DOMStrings formally UTF-16, or are they ES Strings?

DOMStrings are formally UTF-16 in the DOM spec.

They are defined to be ES strings in the ES binding for the DOM.

Please be careful to not confuse the DOM and its language bindings.

One could change the ES binding to use a non-ES-string object to preserve the DOM's requirement that strings be sequences of UTF-16 code units. I'd expect this would break the web unless one is really careful doing it...

How is that different from sticking non-UTF-16 into an ES string
right now?

Currently, JS Strings are effectively arrays of 16-bit code units, which are indistinguishable from 16-bit Unicode strings

Yes.

(D82)

?

This means that a JS application can use JS Strings as arrays of uint16, and expect to be able to round-trip all strings, even those which are not well-formed, through a UTF-16 DOM.

Yep. And they do.

If we redefine JS Strings to be arrays of Unicode code points, then the JS application can use JS Strings as arrays of uint21 -- but round-tripping the high-surrogate code points through a UTF-16 layer would not work.

OK, that seems like a breaking change.

    It might mean extra copying, or it might not if the DOM
    implementation already uses
    UTF-8 internally.

Uh... what does UTF-8 have to do with this?

If you're already storing UTF-8 strings internally, then you are already doing something "expensive" (like copying) to get their code units into and out of JS

Maybe, and maybe not. We (Mozilla) have had some proposals to actually use UTF-8 throughout, including in the JS engine; it's quite possible to implement an API that looks like a 16-bit array on top of UTF-8 as long as you allow invalid UTF-8 that's needed to represent surrogates and the like.

(As a note, Gecko and WebKit both use UTF-16 internally; I would be
_really_ surprised if Trident does not.  No idea about Presto.)

FWIW - last I time I scanned the v8 sources, it appeared to use a three-representation class, which could store either ASCII, UCS2, or UTF-8. Presumably ASCII could also be ISO-Latin-1, as both are exact, naive, byte-sized UCS2/UTF-16 subsets.

There's a difference between internal representation and what things look like. For example, Gecko stores DOM text nodes as either ASCII or UTF-16 in practice, but always makes them look like UTF-16 to non-internal consumers....

There's also a possible difference, as you just noted, between what the ES implementation uses and what the DOM uses; certainly in the WebKit+V8 case, but also in the Gecko+Spidermonkey case when textnodes are involved, etc.

I was talking about what the DOM implementations do, not the ES implementations.

# Boris Zbarsky (14 years ago)

On 5/17/11 2:24 PM, Allen Wirfs-Brock wrote:

In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly?

This already occurs in JS. For example, the encodeURI function produces a string whose characters are the UTF-8 encoding of a UTF-16 string (including recognition of surrogate pairs).

Last I checked, encodeURI output a pure ASCII string. Am I just missing something? The ASCII string happens to be the %-escaping of the UTF-8 representation of the Unicode string you get by assuming that the initial JS string is a UTF-16 representation of said Unicode string. But at no point here is the author dealing with UTF-8.

OK, but still allows sticking non-Unicode gunk into the strings, right? So they're still vectors of "something". Whatever that something is.

Conceptually unsigned 32-bit values. The actual internal representation is likely to be something else.

I don't care about the internal representation; I'm interested in the author-observable behavior.

Interpretation of those values is left to the functions (both built-in and application) that operate upon them.

OK. That includes user-written functions, of course, which currently only have to deal with UTF-16 (and maybe UCS-2 if you want to be very pedantic).

Most built-in string methods do not apply any interpretation and will happily process strings as vectors of arbitrary uint32 values. Some built-ins (encodeURI/decodeURI, toUpperCase/toLowerCase) explicitly deal with Unicode characters or various Unicode encodings and these have to be explicitly defined to deal with non-Unicode character values or invalid encodings.

That seems fine. This is not where problems lie.

These functions already are defined for ES5 in this manner WRT the representation of strings as vectors of arbitrary uint16 values.

Yes, sure.

# Phillips, Addison (14 years ago)

Note: The W3C Internationalization Core WG published a set of "requirements" in this area for consideration by ES some time ago. It lives here:

www.w3.org/International/wiki/JavaScriptInternationalization

The section on 'locale related behavior' is being separately addressed.

I think that:

  1. Changing references from UCS-2 to UTF-16 makes sense, although the spec, IIRC, already says UTF-16.
  2. Allowing unpaired surrogates is a requirement. Yes, such a string is "ill-formed", but there are too many cases in which one might wish to have such "broken" strings for scripting purposes.
  3. We should have escape syntax for supplementary characters (such as \U0010000). Looking up the surrogate pair for a given Unicode character is extremely inconvenient and is not self-documenting.

As Shawn notes, basically, there are three ways that one might wish to access strings:

  • as grapheme clusters (visual units of text)
  • as Unicode scalar values (logical units of text, i.e. characters)
  • as code units (encoding units of text)

The example I use in the Unicode conference internationalization tutorial is a box on a Web site with an ES controlled message underneath it saying "You have 200 characters remaining."

I think it is instructive to look at how Java managed this transition. In some cases the "200" represents the number of storage units I have available (as in my backing database), in which case String.length is what I probably want. In some cases I want to know how many Unicode characters there are (Java solves this with the codePointCount(), codePointBefore(), and codePointAt() methods). These are relatively rare operations, but they have occasional utility. Or I may want grapheme clusters (Java attempts to solve this with BreakIterators and I tend to favor doing the same thing in JavaScript---default grapheme clusters are better than nothing, but language-specific grapheme clusters are more useful).
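In JS terms, the first two counts can be sketched as below; countCodePoints is a hypothetical helper, not an existing or proposed API, and grapheme clusters are left out because they need segmentation data along the lines of BreakIterator:

```js
function countCodePoints(str) {
  var n = 0;
  for (var i = 0; i < str.length; i++) {
    var cu = str.charCodeAt(i);
    if (cu >= 0xD800 && cu <= 0xDBFF &&
        (str.charCodeAt(i + 1) & 0xFC00) === 0xDC00) {
      i++;                               // skip the trailing low surrogate
    }
    n++;
  }
  return n;
}

var msg = "na\u00EFve \uD83D\uDE00";     // "naïve" plus U+1F600 as a surrogate pair
msg.length;                              // 8 -- UTF-16 code units
countCodePoints(msg);                    // 7 -- Unicode code points
```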

If we follow the above, providing only minimal additional methods for accessing codepoints when necessary, this also limits the impact of adding supplementary character support to the language. Regex probably works the way one supposes (both \U0010000 and \ud800\udc00 find the surrogate pair \ud800\udc00 and one can still find the low surrogate \udc00 if one wishes too). And existing scripts will continue to function without alteration. However, new scripts can be written that use supplementary characters.


Addison

Addison Phillips Globalization Architect (Lab126) Chair (W3C I18N WG)

Internationalization is not a feature. It is an architecture.

# Shawn Steele (14 years ago)

Right - but they are still legitimate code points, and they fill out the space required to let us treat String as uint16[] when defining the backing store as "something that maps to the set of all Unicode code points".

That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88.

No, you're allowing storage of some sort of number arrays that don't represent Unicode strings at all.

Codepoints != encoding. High and Low surrogates are legal code points, but are only legitimate code points in UTF-16 if they occur in a pair. If they aren’t in a proper pair, they’re illegal. They are always illegal in UTF-32 & UTF-8. There are other code points that shouldn’t be used for interchange in Unicode too: U+xxFFFF/U+xxFFFE for example. It’s orthogonal to the other question, but the documentation should clearly suggest that users don’t pretend binary data is character data when it’s not. That leads to all sorts of crazy stuff, like illegal lone surrogates trying to be illegally encoded in UTF-8.

# Allen Wirfs-Brock (14 years ago)

On May 17, 2011, at 12:00 PM, Phillips, Addison wrote:

Note: The W3C Internationalization Core WG published a set of "requirements" in this area for consideration by ES some time ago. It lives here:

www.w3.org/International/wiki/JavaScriptInternationalization

You might want to formally convey these requests to TC39 via the W3C/Ecma liaison process. That would carry much more weight and visibility. I don't believe this document has shown up

# Phillips, Addison (14 years ago)

We did.

Cf. lists.w3.org/Archives/Public/public-i18n-core/2009OctDec/0102.html

Addison

Addison Phillips Globalization Architect (Lab126) Chair (W3C I18N WG)

Internationalization is not a feature. It is an architecture.

# Wes Garland (14 years ago)

On 17 May 2011 14:39, Boris Zbarsky <bzbarsky at mit.edu> wrote:

On 5/17/11 2:12 PM, Wes Garland wrote:

That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88.

By the same argument, you can encode them in UTF-16. The byte sequence above is not valid UTF-8. See "How do I convert an unpaired UTF-16 surrogate to UTF-8?" at unicode.org/faq/utf_bom.html which says:

You are comparing apples and oranges. Which happen to look a lot alike. So maybe apples and nectarines.

But the point remains, the FAQ entry you quote talks about encoding a lone surrogate, i.e. a code unit, which is not a complete code point. You can only convert complete code points from one encoding to another. Just like you can't represent part of a UTF-8 code sub-sequence in any other encoding. The fact that code point X is not representable in UTF-16 has no bearing on its status as a code point, nor its convertability to UTF-8. The problem is that UTF-16 cannot represent all possible code points.

See above. You're allowing number arrays that may or may not be interpretable as Unicode strings, period.

No, I'm not. Any sequence of Unicode code points is a valid Unicode string. It does not matter whether any of those code points are reserved, nor does it matter if it can be represented in all encodings.

From page 90 of the Unicode 6.0 specification, in the Conformance chapter:

D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form. • In the rawest form, Unicode strings may be implemented simply as arrays of the appropriate integral data type, consisting of a sequence of code units lined up one immediately after the other. • A single Unicode string must contain only code units from a single Unicode encoding form. It is not permissible to mix forms within a string.

Not sure what "(D80)" is supposed to mean.

Sorry, "(D80)" means "per definition D80 of The Unicode Standard, Version 6.0"

This hypothesis is worth testing before being blindly inflicted on the web.

I don't think anybody in this discussion is talking about blindly inflicting anything on the web. I do think this proposal is a good one, and certainly a better way forward than insisting that every JS developer, everywhere, understand and implement (over and over again) the details of encoding Unicode as UTF-16. Allen's point about URI escaping is right on target here.

If we redefine JS Strings to be arrays of Unicode code points, then the JS application can use JS Strings as arrays of uint21 -- but round-tripping the high-surrogate code points through a UTF-16 layer would not work.

OK, that seems like a breaking change.

Yes, I believe it would be, certainly if done naively, but I am hopeful somebody can figure out how to overcome this. Hopeful because I think that fixing the JS Unicode problem is a really big deal. "What happens if the guy types a non-BMP character?" is a question which should not have to be answered over and over again in every code review. And I still maintain that 99.99% of JS developers never give it a first, let alone second, thought.

Maybe, and maybe not. We (Mozilla) have had some proposals to actually use UTF-8 throughout, including in the JS engine; it's quite possible to implement an API that looks like a 16-bit array on top of UTF-8 as long as you allow invalid UTF-8 that's needed to represent surrogates and the like.

I understand by this that in the Moz proposals, you mean that the "invalid" UTF-8 sequences are actually valid UTF-8 Strings which encode code points in the range 0xd800-0xdfff, and that these code points were translated directly (and purposefully incorrectly) as UTF-16 code units when viewed as 16-bit arrays.

If JS Strings were arrays of Unicode code points, this conversion would be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point 0xdc08, with no incorrect conversion taking place. The only problem is if there is an intermediate component somewhere that insists on using UTF-16; at that point we just can't represent code point 0xdc08 at all. But that code point will never appear in text; it will only appear for users using the String to store arbitrary data, and their need has already been met.

# Wes Garland (14 years ago)

On 17 May 2011 15:00, Phillips, Addison <addison at lab126.com> wrote:

  1. Allowing unpaired surrogates is a requirement. Yes, such a string is "ill-formed", but there are too many cases in which one might wish to have such "broken" strings for scripting purposes.
  2. We should have escape syntax for supplementary characters (such as \U0010000). Looking up the surrogate pair for a given Unicode character is extremely inconvenient and is not self-documenting.

...

As Shawn notes, basically, there are three ways that one might wish to access strings:

...

  • as code units (encoding units of text)

I don't understand why (except that it is there by an accident of history) that it is desirable to expose a particular low-level detail about one possible encoding for Unicode characters to end-user programmers.

Your point about database storage only holds if the database happens to store Unicode strings encoded in UTF-16. It could just as easily use UTF-8, UTF-7, or UTF-32. For that matter, the database input routine could filter all characters not in ISO-Latin-1 and store only the lower half of non-surrogate-pair UTF-16 code units.

# Phillips, Addison (14 years ago)

Okay, that example was poorly chosen. However, it is the case that when a given string representation uses a particular code unit you often need to have programmatic access to it--for loops and such that iterate over the text, e.g.

It may be an accident of history, but that doesn't mean that scripters don't need access to it.

Addison

Sent from my iPhone

On May 17, 2011, at 12:52 PM, "Wes Garland" <wes at page.ca> wrote:

On 17 May 2011 15:00, Phillips, Addison <addison at lab126.com> wrote:

  1. Allowing unpaired surrogates is a requirement. Yes, such a string is "ill-formed", but there are too many cases in which one might wish to have such "broken" strings for scripting purposes.
  2. We should have escape syntax for supplementary characters (such as \U0010000). Looking up the surrogate pair for a given Unicode character is extremely inconvenient and is not self-documenting.

...

As Shawn notes, basically, there are three ways that one might wish to access strings:

...
  • as code units (encoding units of text)

I don't understand why (except that it is there by an accident of history) that it is desirable to expose a particular low-level detail about one possible encoding for Unicode characters to end-user programmers.

Your point about database storage only holds if the database happens to store Unicode strings encoded in UTF-16. It could just as easily use UTF-8, UTF-7, or UTF-32. For that matter, the database input routine could filter all characters not in ISO-Latin-1 and store only the lower half of non-surrogate-pair UTF-16 code units.

# Boris Zbarsky (14 years ago)

On 5/17/11 3:29 PM, Wes Garland wrote:

But the point remains, the FAQ entry you quote talks about encoding a lone surrogate, i.e. a code unit, which is not a complete code point. You can only convert complete code points from one encoding to another. Just like you can't represent part of a UTF-8 code sub-sequence in any other encoding. The fact that code point X is not representable in UTF-16 has no bearing on its status as a code point, nor its convertability to UTF-8. The problem is that UTF-16 cannot represent all possible code points.

My point is that neither can UTF-8. Can you name an encoding that can represent the surrogate-range codepoints?

From page 90 of the Unicode 6.0 specification, in the Conformance chapter:

/D80 Unicode string:/ A code unit sequence containing code units of a particular Unicode encoding form.
• In the rawest form, Unicode strings may be implemented simply as arrays of the appropriate integral data type, consisting of a sequence of code units lined up one immediately after the other.
• A single Unicode string must contain only code units from a single Unicode encoding form. It is not permissible to mix forms within a string.



Not sure what "(D80)" is supposed to mean.

Sorry, "(D80)" means "per definition D80 of The Unicode Standard, Version 6.0"

Ah, ok. So the problem there is that this is definition only makes sense when a particular Unicode encoding form has been chosen. Which Unicode encoding form have we chosen here?

But note also that D76 in that same document says:

Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

and D79 says:

A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.

and

To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values.

In particular, this makes it clear (to me, at least) that whatever Unicode encoding form you choose, a "Unicode string" can only consist of code units encoding Unicode scalar values, which does NOT include high and low surrogates.

Therefore I stand by my statement: if you allow what to me looks like arrays "UTF-32 code units and also values that fall into the surrogate ranges" then you don't get Unicode strings. You get a set of arrays that contains Unicode strings as a proper subset.

OK, that seems like a breaking change.

Yes, I believe it would be, certainly if done naively, but I am hopeful somebody can figure out how to overcome this.

As long as we worry about that before enshrining the result in a spec, I'm all for being hopeful.

Maybe, and maybe not. We (Mozilla) have had some proposals to actually use UTF-8 throughout, including in the JS engine; it's quite possible to implement an API that looks like a 16-bit array on top of UTF-8 as long as you allow invalid UTF-8 that's needed to represent surrogates and the like.

I understand by this that in the Moz proposals, you mean that the "invalid" UTF-8 sequences are actually valid UTF-8 Strings which encode code points in the range 0xd800-0xdfff

There are no such valid UTF-8 strings; see spec quotes above. The proposal would have involved having invalid pseudo-UTF-ish strings.

and that these code points were translated directly (and purposefully incorrectly) as UTF-16 code units when viewed as 16-bit arrays.

Yep.

If JS Strings were arrays of Unicode code points, this conversion would be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point 0xdc08, with no incorrect conversion taking place.

Sorry, no. See above.

The only problem is if there is an intermediate component somewhere that insists on using UTF-16... at that point we just can't represent code point 0xdc08 at all.

I just don't get it. You can stick the invalid 16-bit value 0xdc08 into a "UTF-16" string just as easily as you can stick the invalid 24-bit sequence 0xed 0xb0 0x88 into a "UTF-8" string. Can you please, please tell me what made you decide there's any difference between the two cases? They're equally invalid in exactly the same way.

# Mark Davis ☕ (14 years ago)

The wrong conclusion is being drawn. I can say definitively that for the string "a\uD800b":

  • It is a valid Unicode string, according to the Unicode Standard.
  • It cannot be encoded as well-formed in any UTF-x (it is not 'well-formed' in any UTF).
  • When it comes to conversion, the bad code unit \uD800 needs to be handled (eg converted to FFFD, escaped, etc.)

Any programming language using Unicode has the choice of either

  1. allowing strings to be general Unicode strings, or
  2. guaranteeing that they are always well-formed.

There are trade-offs either way, but both are feasible.
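
As a concrete illustration of the conversion point above, here is a minimal sketch (assuming today's engines, where each string element is a UTF-16 code unit) of replacing unpaired surrogates with U+FFFD before handing a string to something that requires a well-formed UTF:

function toWellFormed(s) {
  // Walk the 16-bit code units: copy well-formed pairs, replace lone surrogates.
  var out = "";
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {                  // high surrogate
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {          // properly paired
        out += s.charAt(i) + s.charAt(i + 1);
        i++;
      } else {
        out += "\uFFFD";                               // unpaired high surrogate
      }
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      out += "\uFFFD";                                 // unpaired low surrogate
    } else {
      out += s.charAt(i);
    }
  }
  return out;
}

toWellFormed("a\uD800b");        // "a\uFFFDb"
toWellFormed("a\uD800\uDC00b");  // unchanged: the pair is well formed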

Mark

— Il meglio è l’inimico del bene —

# Wes Garland (14 years ago)

On 17 May 2011 16:03, Boris Zbarsky <bzbarsky at mit.edu> wrote:

On 5/17/11 3:29 PM, Wes Garland wrote:

The problem is that UTF-16 cannot represent all possible code points.

My point is that neither can UTF-8. Can you name an encoding that can represent the surrogate-range codepoints?

UTF-8 and UTF-32. I think UTF-7 can, too, but it is not a standard so it's not really worth discussing. UTF-16 is the odd one out.

Therefore I stand by my statement: if you allow what to me looks like arrays

"UTF-32 code units and also values that fall into the surrogate ranges" then you don't get Unicode strings. You get a set of arrays that contains Unicode strings as a proper subset.

Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct.

There are no such valid UTF-8 strings; see spec quotes above. The proposal would have involved having invalid pseudo-UTF-ish strings.

Yes, you can encode code points d800 - dfff in UTF-8 Strings. These are not well-formed strings, but they are Unicode 8-bit Strings (D81) nonetheless. What you can't do is encode 16-bit code units in UTF-8 Strings. This is because you can only convert from one encoding to another via code points. Code units have no cross-encoding meaning.

Further, you can't encode code points d800 - dfff in UTF-16 Strings, leaving you at a loss when you want to store those values in JS Strings (i.e. when using them as uint16[]) except to generate ill-formed UTF-16. I believe it would be far better to treat those values as Unicode code points, not 16-bit code units, and to allow JS String elements to be able to express the whole 21-bit code point range afforded by Unicode.

In other words, current mis-use of JS Strings which can store "characters" 0-ffff in ill-formed UTF-16 strings would become use of JS Strings to store code points 0-1FFFFF which may use reserved code points d800-dfff, the high surrogates, which cannot be represented in UTF-16. But CAN be represented, without loss, in UTF-8, UTF-32, and proposed-new-JS-Strings.

If JS Strings were arrays of Unicode code points, this conversion would

be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point 0xdc08, with no incorrect conversion taking place.

Sorry, no. See above.

printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x

0000000 0000 dc08
0000004

printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x

0000000 edb0 8800
0000003

I just don't get it. You can stick the invalid 16-bit value 0xdc08 into a

"UTf-16" string just as easily as you can stick the invalid 24-bit sequence 0xed 0xb0 0x88 into a "UTF-8" string. Can you please, please tell me what made you decide there's any difference between the two cases? They're equally invalid in exactly the same way.

The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code point 0xdc08", and in UTF-16 0xdc08 means "Part of some non-BMP code point".

Said another way, 0xed in UTF-8 has nearly the same meaning as 0xdc08 in UTF-16. Both are ill-formed code unit subsequences which do not represent a code point (D84a).

# Shawn Steele (14 years ago)

The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code point 0xdc08",

In UTF-8 0xed 0xb0 0x88 means “Garbage, please replace me with 0xFFFD”. CESU-8 allows this, but that sequence is illegal in UTF-8. The Windows SDK and .Net both disallow ill-formed UTF-8 code points for security reasons. I’m sure you can find other libraries that allow them still, but this sequence is ill-formed and considered a security threat. D92 of unicode 5.0 makes this clear.

and in UTF-16 0xdc08 means "Part of some non-BMP code point".

Only if there was a 0xd800-0xdbff before it. Otherwise it is also ill-formed.

# Boris Zbarsky (14 years ago)

On 5/17/11 5:24 PM, Wes Garland wrote:

UTF-8 and UTF-32. I think UTF-7 can, too, but it is not a standard so it's not really worth discussing. UTF-16 is the odd one out.

That's not what the spec says.

Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct.

Sorry, but no... how much more clear can the spec get?

There are no such valid UTF-8 strings; see spec quotes above.  The
proposal would have involved having invalid pseudo-UTF-ish strings.

Yes, you can encode code points d800 - dfff in UTF-8 Strings. These are not /well-formed/ strings, but they are Unicode 8-bit Strings (D81) nonetheless.

The spec seems to pretty clearly define UTF-8 strings as things that do NOT contain the encoding of those code points. If you think otherwise, cite please.

Further, you can't encode code points d800 - dfff in UTF-16 Strings,

Where does the spec say this? And why does that part of the spec not apply to UTF-8?

printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x

0000000 0000 dc08
0000004

printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x

0000000 edb0 8800
0000003

As far as I can tell, that second conversion is just an implementation bug per the spec. See the part I quoted which explicitly says that an encoder in that situation must stop and return an error.

The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code point 0xdc08"

According to the spec you were citing, that code unit sequence means a UTF-8 decoder should error, no?

# Wes Garland (14 years ago)

On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu> wrote:

On 5/17/11 5:24 PM, Wes Garland wrote:

Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct.

Sorry, but no... how much more clear can the spec get?

In the past, I have read it thus, pseudo BNF:

UnicodeString => CodeUnitSequence // D80

CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78

CodeUnit => <anything in the current encoding form> // D77

Upon careful re-reading of this part of the specification, I see that D79 is also important. It says that "A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.", and further clarifies that "The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is one-to-one."

This means that your original assertion -- that Unicode strings cannot contain the high surrogate code points, regardless of meaning -- is in fact correct.

Which is unfortunate, as it means that we either

  1. Allow non-Unicode strings in JS -- i.e. Strings composed of all values in the set [0x0, 0x1FFFFF]
  2. Keep making programmers pay the raw-UTF-16 representation tax
  3. Break the String-as-uint16 pattern

I still believe that #1 is the way forward, and that problem of round-tripping these values through the DOM is solvable.

# Mark Davis ☕ (14 years ago)

That is incorrect. See below.

Mark

— Il meglio è l’inimico del bene —

On Tue, May 17, 2011 at 18:33, Wes Garland <wes at page.ca> wrote:

On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu> wrote:

On 5/17/11 5:24 PM, Wes Garland wrote:

Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct.

Sorry, but no... how much more clear can the spec get?

In the past, I have read it thus, pseudo BNF:

UnicodeString => CodeUnitSequence // D80 CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78 CodeUnit => <anything in the current encoding form> // D77

So far, so good. In particular, d800 is a code unit for UTF-16, since it is a code unit that can occur in some code unit sequence in UTF-16.

Upon careful re-reading of this part of the specification, I see that D79 is also important. It says that "A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.",

True.

and further clarifies that "The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is one-to-one."

True.

This is all consistent with saying that UTF-16 can't contain an isolated d800.

However, that only shows that a Unicode 16-bit string (D82) is not the same as a UTF-16 String (D89), which has been pointed out previously.

Repeating the note under D89:

A Unicode string consisting of a well-formed UTF-16 code unit sequence is said to be in UTF-16. Such a Unicode string is referred to as a valid UTF-16 string, or a UTF-16 string for short.

That is, every UTF-16 string is a Unicode 16-bit string, but not vice versa.

Examples:

  • "\u0061\ud800\udc00" is both a Unicode 16-bit string and a UTF-16 string.
  • "\u0061\ud800\udc00" is a Unicode 16-bit string, but not a UTF-16 string.

This means that your original assertion -- that Unicode strings cannot contain the high surrogate code points, regardless of meaning -- is in fact correct.

That is incorrect.

# Wes Garland (14 years ago)

Mark;

Are you Dr. Mark E. Davis (born September 13, 1952 (age 58)), co-founder of the Unicode project (en.wikipedia.org/wiki/Unicode) and the president of the Unicode Consortium (en.wikipedia.org/wiki/Unicode_Consortium) since its incorporation in 1991?

(If so, uh, thanks for giving me alternatives to Shift-JIS, GB-2312, Big-5, et al..those gave me lots of hair loss in the late 90s)

On 17 May 2011 21:55, Mark Davis ☕ <mark at macchiato.com> wrote:

In the past, I have read it thus, pseudo BNF:

UnicodeString => CodeUnitSequence // D80
CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
CodeUnit => <anything in the current encoding form> // D77

So far, so good. In particular, d800 is a code unit for UTF-16, since it is a code unit that can occur in some code unit sequence in UTF-16.

head smack - code unit, not code point.

This means that your original assertion -- that Unicode strings cannot contain the high surrogate code points, regardless of meaning -- is in fact correct.

That is incorrect.

Aie, Karumba!

If we have

  • a sequence of code points
  • taking on values between 0 and 0x1FFFFF
  • including high surrogates and other reserved values
  • independent of encoding

..what exactly are we talking about? Can it be represented in UTF-16 without round-trip loss when normalization is not performed, for the code points 0 through 0xFFFF?

Incidentally, I think this discussion underscores nicely why I think we should work hard to figure out a way to hide UTF-16 encoding details from user-end programmers.

# Jungshik Shin (신정식, 申政湜) (14 years ago)

On Tue, May 17, 2011 at 11:09 AM, Shawn Steele <Shawn.Steele at microsoft.com>wrote:

I would much prefer changing "UCS-2" to "UTF-16", thus formalizing that surrogate pairs are permitted. That'd be very difficult to break any existing code and would still allow representation of everything reasonable in Unicode.

That would enable Unicode, and allow extending string literals and regular expressions for convenience with the U+10FFFF style notation (which would be equivalent to the surrogate pair). The character code manipulation functions could be similarly augmented without breaking anything (and maybe not needing different names?)

You might want to qualify the UTF-16 as allowing, but strongly discouraging, lone surrogates for those people who didn't realize their binary data wasn't a string.

The sole disadvantage would be that iterating through a string would require consideration of surrogates, same as today. The same caution is also necessary to avoid splitting Ä (U+0041 U+0308) into its component A and ̈ parts. I wouldn't be opposed to some sort of helper functions or classes that aided in walking strings, preferably with options to walk the graphemes (or whatever), not just the surrogate pairs. FWIW: we have such a helper for surrogates in .Net and "nobody uses them". The most common feedback is that it's not that helpful because it doesn't deal with the graphemes.

Hmm... I proposed break iterators for 'character/grapheme', word, line and sentence as part of the i18n API, but it was "shot down" (at least for version 0.5). Are you open to adding them now? Once this discussion is settled and the proposal to support the full Unicode range is in place, we can revisit the issue.

Jungshik

# Mark Davis ☕ (14 years ago)

On Tue, May 17, 2011 at 20:01, Wes Garland <wes at page.ca> wrote:

Mark;

Are you Dr. Mark E. Davis (born September 13, 1952 (age 58)), co-founder of the Unicode project (en.wikipedia.org/wiki/Unicode) and the president of the Unicode Consortium (en.wikipedia.org/wiki/Unicode_Consortium) since its incorporation in 1991?

Guilty as charged.

(If so, uh, thanks for giving me alternatives to Shift-JIS, GB-2312, Big-5, et al..those gave me lots of hair loss in the late 90s)

You're welcome. We did it to save ourselves from the hair-pulling we had in the 80's over those charsets ;-)

On 17 May 2011 21:55, Mark Davis ☕ <mark at macchiato.com> wrote:

In the past, I have read it thus, pseudo BNF:

UnicodeString => CodeUnitSequence // D80
CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
CodeUnit => <anything in the current encoding form> // D77

So far, so good. In particular, d800 is a code unit for UTF-16, since it is a code unit that can occur in some code unit sequence in UTF-16.

head smack - code unit, not code point.

This means that your original assertion -- that Unicode strings cannot contain the high surrogate code points, regardless of meaning -- is in fact correct.

That is incorrect.

Aie, Karumba!

If we have

  • a sequence of code points
  • taking on values between 0 and 0x1FFFFF

10FFFF

  • including high surrogates and other reserved values
  • independent of encoding

..what exactly are we talking about? Can it be represented in UTF-16 without round-trip loss when normalization is not performed, for the code points 0 through 0xFFFF?

Surrogate code points (U+D800..U+DFFF) can't be represented in any UTF string. They can, however, be represented in Unicode strings (ones that are not valid UTF strings), with the one restriction that in UTF-16 they have to be isolated. In practice, we just don't find that isolated surrogates in Unicode 16-bit strings cause a problem, so I think that issue has derailed the more important issues involved in this discussion, which are in the API.

Incidentally, I think this discussion underscores nicely why I think we should work hard to figure out a way to hide UTF-16 encoding details from user-end programmers.

The biggest issue is the indexing. In Java, for example, iterating through a string has some ugly syntax:

int cp;
for (int i = 0; i < string.length(); i += Character.charCount(cp)) {
    cp = string.codePointAt(i);
    doSomethingWith(cp);
}

But it doesn't have to be that way; they could have supported, with a little bit of semantic sugar, something like:

for (int cp : aString) { doSomethingWith(cp); }

If done well, the complexity doesn't have to show to the user. In many cases, as Shawn pointed out, codepoints are not really the right unit anyway. What the user may actually need are word boundaries, or grapheme cluster boundaries, etc. If, for example, you truncate a string on just code point boundaries, you'll get the wrong answer sometimes.

It is of course simpler, if you are either designing a programming language from scratch or are able to take the compatibility hit, to have the API for strings always index by code points. That is, from the outside, a string always looks like it is a simple sequence of code points. There are a couple of ways to do that efficiently, even where the internal storage is not 32 bit chunks.
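
For comparison, the same iteration written against today's JavaScript strings (a sketch only; aString and doSomethingWith are placeholders carried over from the Java example above):

for (var i = 0; i < aString.length; i++) {
  var cp = aString.charCodeAt(i);
  if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < aString.length) {
    var low = aString.charCodeAt(i + 1);
    if (low >= 0xDC00 && low <= 0xDFFF) {
      cp = (cp - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;  // combine the surrogate pair
      i++;                                                    // skip its low half
    }
  }
  doSomethingWith(cp);
}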

# Shawn Steele (14 years ago)

#1 can’t happen. There’s no way to get legal input, since any input must be encoded in some form, and since Unicode clearly states that lone values like D800 are illegal in any of the encodings.

Also, none of the inputs really like UTF-32. We can munge it from UTF-8 or UTF-16 HTML to something else, but the developer still has it as UTF-8 or UTF-16, so this isn’t much of a burden for them.

But we can still allow code point notation (U+10FFFF), which mitigates most of the problem.

-Shawn

From: es-discuss-bounces at mozilla.org [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Wes Garland Sent: Tuesday, May 17, 2011 6:34 PM To: Boris Zbarsky Cc: es-discuss at mozilla.org Subject: Re: Full Unicode strings strawman

On 17 May 2011 20:09, Boris Zbarsky <bzbarsky at mit.edu<mailto:bzbarsky at mit.edu>> wrote:

On 5/17/11 5:24 PM, Wes Garland wrote: Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct.

Sorry, but no... how much more clear can the spec get?

In the past, I have read it thus, pseudo BNF:

UnicodeString => CodeUnitSequence // D80

CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78

CodeUnit => <anything in the current encoding form> // D77

Upon careful re-reading of this part of the specification, I see that D79 is also important. It says that "A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.", and further clarifies that "The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is one-to-one."

This means that your original assertion -- that Unicode strings cannot contain the high surrogate code points, regardless of meaning -- is in fact correct.

Which is unfortunate, as it means that we either

  1. Allow non-Unicode strings in JS -- i.e. Strings composed of all values in the set [0x0, 0x1FFFFF]
  2. Keep making programmers pay the raw-UTF-16 representation tax
  3. Break the String-as-uint16 pattern

I still believe that #1 is the way forward, and that problem of round-tripping these values through the DOM is solvable.
# Shawn Steele (14 years ago)

Hmm... I proposed break iterators for 'character/grapheme', word, line and sentence as a part of i18n API, but it's "shot down" (at least for version 0.5). Are you open to adding them now ? Once this discussion is settled and the proposal to support the full unicode range is in place, we can revisit the issue.

That’s still a hard problem, particularly for i18n v0.5. However that’s probably a “better” problem to solve than “how to count supplementary characters”.

# Erik Corry (14 years ago)

2011/5/17 Wes Garland <wes at page.ca>:

If you're already storing UTF-8 strings internally, then you are already doing something "expensive" (like copying) to get their code units into and out of JS; so no incremental perf impact by not having a common UTF-16 backing store.

(As a note, Gecko and WebKit both use UTF-16 internally; I would be really surprised if Trident does not.  No idea about Presto.)

FWIW - last I time I scanned the v8 sources, it appeared to use a three-representation class, which could store either ASCII, UCS2, or UTF-8. Presumably ASCII could also be ISO-Latin-1, as both are exact, naive, byte-sized UCS2/UTF-16 subsets.

V8 has ASCII strings and UCS2 strings. There are no Latin1 strings and UTF-8 is only used for IO, never for internal representation. WebKit uses UCS2 throughout and V8 is able to work directly on WebKit UCS2 strings that are on WebKit's C++ heap.

I like Shawn Steele's suggestion.

# Mark Davis ☕ (14 years ago)

Yes, one of the options for the internal storage of the string class is to use different arrays depending on the contents.

  1. uint8's if all the code point values are <= FF
  2. uint16's if all the code point values are <= FFFF
  3. uint32's otherwise

That way the internal storage always corresponds directly to the code point index, which makes random access fast. Case #3 occurs rarely, so it is ok if it takes more storage in that case.
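
A sketch of that selection rule, with a hypothetical pickStorage helper (typed arrays stand in for the engine's internal buffers; a real implementation would do this natively):

// Pick the narrowest storage that still gives one array slot per code point.
function pickStorage(codePoints) {
  var max = 0;
  for (var i = 0; i < codePoints.length; i++) {
    if (codePoints[i] > max) max = codePoints[i];
  }
  if (max <= 0xFF) return new Uint8Array(codePoints);     // case 1
  if (max <= 0xFFFF) return new Uint16Array(codePoints);  // case 2
  return new Uint32Array(codePoints);                     // case 3 (rare)
}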

Mark

— Il meglio è l’inimico del bene —

# Waldemar Horwat (14 years ago)

On 05/16/11 11:11, Allen Wirfs-Brock wrote:

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.

Feed back would be appreciated:

strawman:support_full_unicode_in_strings

Allen

Two different languages made different decisions on how to approach extending their character sets as Unicode evolved:

  • Java kept their strings encoded exactly as they were (a sequence of 16-bit code units) and provided extra APIs for the cases where you want to extract a code point.

  • Perl widened the concept of characters in strings away from bytes to full Unicode characters. Thus a UTF-8 string can be either represented where each byte is one Perl character or where each Unicode character is one Perl character. There are conversion functions provided to move between the two.

My experience is that Java's approach worked, while Perl's has led to an endless shop of horrors. The problem is that different APIs expect different kinds of strings, so I'm still finding places where conversions should be added but weren't (or vice versa) in a lot of code years after it was written.

  1. I would not be in favor of any approach that widens the concept of a string character or introduces two different representations for a non-BMP character. It will suffer from the same problems as Perl, except that they will be harder to find because use of non-BMP characters is relatively rare.

  2. Widening characters to 21 bits doesn't really help much. As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized forms, etc. All of these require the machinery to deal with clumps of code units as though they were single characters/graphemes/etc., and once you have that machinery you can reuse it to support non-BMP characters while keeping string code units 16 bits.

    Waldemar

# Mark Davis ☕ (14 years ago)

Markus isn't on es-discuss, so forwarding....

---------- Forwarded message ----------
From: Markus Scherer <markus.icu at gmail.com>
Date: Wed, May 18, 2011 at 22:18
Subject: Re: Full Unicode strings strawman
To: Allen Wirfs-Brock <allen at wirfs-brock.com>
Cc: Shawn Steele <Shawn.Steele at microsoft.com>, Mark Davis ☕ <mark at macchiato.com>, "es-discuss at mozilla.org" <es-discuss at mozilla.org>

On Mon, May 16, 2011 at 5:07 PM, Allen Wirfs-Brock <allen at wirfs-brock.com>wrote:

I agree that application writers will, for the foreseeable future, have to know whether or not they are dealing with UTF-16 encoded data and/or communicating with other subsystems that expect such data. However, core language support for UTF-32 is a prerequisite for ever moving beyond UTF-16 APIs and libraries and getting back to uniform-sized character processing.

This seems to be based on a misunderstanding. Fixed-width encodings are nice but not required. The majority of Unicode-aware code uses either UTF-8 or UTF-16, and supports the full Unicode code point range without too much trouble. Even with UTF-32 you get "user characters" that require sequences of two or more code points (e.g., base character + diacritic, Han character + variation selector) and there is not always a composite character for such a sequence.

Windows NT uses 16-bit Unicode, started BMP-only and has supported the full Unicode range since Windows 2000. MacOS X uses 16-bit Unicode (coming from NeXT) and supports the full Unicode range. (Ever since MacOS X 10.0 I believe.) Lower-level MacOS APIs use UTF-8 char* and support the full Unicode range. ICU uses 16-bit Unicode, started BMP-only and has supported the full range in most services since the year 2000. Java uses 16-bit Unicode, started BMP-only and has supported the full range since Java 5. KDE uses 16-bit Unicode, started BMP-only and has supported the full range for years. Gnome uses UTF-8 and supports the full range.

JavaScript uses 16-bit Unicode, is still BMP-only although most implementations input and render the full range, and updating its spec and implementations to upgrade compatibly like everyone else seems like the best option.

In a programming language like JavaScript that is heavy on string processing, and interfaces with the UTF-16 DOM and UTF-16 client OSes, a UTF-32 string model might be more trouble than it's worth (and possibly a performance hit).

FYI: I proposed full-Unicode support in JavaScript in 2003, a few months before the committee became practically defunct for a while. sites.google.com/site/markusicu/unicode/es/unicode-2003, sites.google.com/site/markusicu/unicode/es/i18n-2003

Best , markus (Google/ICU/Unicode)

# Brendan Eich (14 years ago)

On May 18, 2011, at 3:46 PM, Waldemar Horwat wrote:

On 05/16/11 11:11, Allen Wirfs-Brock wrote:

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.

Feed back would be appreciated:

strawman:support_full_unicode_in_strings

Allen

Two different languages made different decisions on how to approach extending their character sets as Unicode evolved:

  • Java kept their strings encoded exactly as they were (a sequence of 16-bit code units) and provided extra APIs for the cases where you want to extract a code point.

Bloaty.

  • Perl widened the concept of characters in strings away from bytes to full Unicode characters. Thus a UTF-8 string can be either represented where each byte is one Perl character or where each Unicode character is one Perl character. There are conversion functions provided to move between the two.

This is analogous but different in degree. Going from bytes to Unicode characters is different from going from uint16 to Unicode by ~15 to 5 bits.

My experience is that Java's approach worked, while Perl's has led to an endless shop of horrors. The problem is that different APIs expect different kinds of strings, so I'm still finding places where conversions should be added but weren't (or vice versa) in a lot of code years after it was written.

Won't all the relevant browser APIs expect DOMString, which will be the same as JS string? The encoding or lack of it for unformatted 16-bit data is up to the caller and callee. No conversion functions required.

The significant new APIs in Allen's proposal (apart from transcoding helpers that might be useful already, where JS hackers wrongly assume BMP only), are for generating strings from full Unicode code points: String.fromCode, etc.

  1. I would not be in favor of any approach that widens the concept of a string character or introduces two different representations for a non-BMP character. It will suffer from the same problems as Perl, except that they will be harder to find because use of non-BMP characters is relatively rare.

We have the "non-BMP-chars encoded as pairs" problem already. The proposal does not increase its incidence, it merely adds ways to make strings that don't have this problem.

The other problem, of mixing strings-with-pairs and strings-without, is real. But it is not obvious to me why it's worse than duplicating all the existing string APIs. Developers still have to choose. Without a new "ustring" type they can still mix. Do duplicated full-Unicode APIs really pay their way?

  1. Widening characters to 21 bits doesn't really help much. As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized forms, etc.

Do you? I mean: do web developers?

You can use non-BMP characters in HTML today, pump them through JS, back into the DOM, and render them beautifully on all the latest browsers, IINM. They go as pairs, but the JS code does not care and it can't, unless it hardcodes index and lengths, or does something evil like s.indexOf("\ud800") or whatever.

All of these require the machinery to deal with clumps of code units as though they were single characters/graphemes/etc., and once you have that machinery you can reuse it to support non-BMP characters while keeping string code units 16 bits.

How do you support non-BMP characters without API bloat? Too many APIs by themselves will simply cause developers to stick to the old APIs when they should use the new ones.

The crucial win of Allen's proposal comes down the road, when someone in a certain locale can do s.indexOf(nonBMPChar) and win. That is what Unicode promises and JS fails to deliver. That seems worth considering, rather than s.wideIndexOf(nonBMPChar).
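
To make the contrast concrete (the "today" values are current behavior; the code point model values are what the strawman's indexing would imply):

var s = "x\uD800\uDC00y";  // "x", U+10000 as a surrogate pair, "y"
s.length;                  // 4 today; 3 under a code point model
s.indexOf("y");            // 3 today; 2 under a code point model
s.charAt(1);               // today: an unpaired high surrogate, not a usable character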

# Shawn Steele (14 years ago)
  • Java kept their strings encoded exactly as they were (a sequence of 16-bit code units) and provided extra APIs for the cases where you want to extract a code point.

Bloaty.

? Defining UTF-16 instead of UCS-2 introduces zero bloat. In fact, it pretty much works anyway, it's just not "official". The only "bloat" would be helpers to handle 21-bit forms if desired. OTOH, replacing UCS-2 with 21-bit Unicode would require new functions to avoid breaking stuff that works today. Indeed, I believe that handling 21-bit forms without breaking would require more bloat than helpers for UTF-16 could possibly add.

This is analogous but diferent in degree. Going from bytes to Unicode characters is different from going from uint16 to Unicode by ~15 to 5 bits.

There is no "Unicode". It's UTF-32 or UTF-16. "Unicode" is an abstract collection of code points that have to be encoded in some form to be useful. Unless you also want to propose a UTF-21, which immediately gets really scary to me.

unformatted 16-bit data

I think this is the crux of the problem? A desire to allow binary data that isn't Unicode to impersonate a string? I'd much rather let those hacks continue to use UTF-16 rather than try to legitimize a hacky data store by forcing awkward behavior on the character strings.

The significant new APIs in Allen's proposal (apart from transcoding helpers that might be useful already, where JS hackers wrongly assume BMP only), are for generating strings from full Unicode code points: String.fromCode, etc.

Those are all possible even with UTF-16. In fact, they're probably less breaking because you don't need a new API to get a string from a code point, the existing API works fine, it'd just emit a pair. Only by using UTF-32 are some of these necessary.
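
For reference, today's behavior (the first line shows the one tweak being described, namely emitting a pair for values above FFFF):

String.fromCharCode(0x10000);         // "\u0000" today: the argument is truncated to 16 bits
String.fromCharCode(0xD800, 0xDC00);  // U+10000 today, written as an explicit surrogate pair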

We have the "non-BMP-chars encoded as pairs" problem already. The proposal does not increase its incidence, it merely adds ways to make strings that don't have this problem.

No, we don't. They're already perfectly valid UTF-16, which most software already knows how to handle. We don't need to add another way of describing a supplementary character, which would mean that everything that handled strings would have to test not only U+D800, U+DC00, but also realize that U+10000 is equivalent. That approach would introduce a terrifying number of encoding and conversion bugs. (I own encodings at Microsoft, that kind of thing is truly evil and causes endless amounts of support pain.)

The other problem, of mixing strings-with-pairs and strings-without, is real. But it is not obvious to me why it's worse than duplicating all the existing string APIs. Developers still have to choose. Without a new "ustring" type they can still mix. Do duplicated full-Unicode APIs really pay their way?

Nothing needs to change. At its basic level, "all" that needs to be done is to change from UCS-2 to UTF-16, which, in practice, is pretty much a no-op. Whether or not additional helpers are desirable is orthogonal. Some, I think, are low-hanging and don't cause much duplication (to/from code point). Others are harder (walking a string at boundaries, although graphemes are more interesting than surrogate pairs anyway).

  1. Widening characters to 21 bits doesn't really help much. As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized forms, etc.

Do you? I mean: do web developers?

They "blindly" use UTF-8 or UTF-16 in their HTML and expect it to "just work". My editor is going to show me a pretty glyph, not a surrogate pair, when I type the string anyway.

You can use non-BMP characters in HTML today, pump them through JS, back into the DOM, and render them beautifully on all the latest browsers

Exactly - pretty much everyone treats JavaScript as handling UTF-16 already, and it "just works". The only annoying parts are specifying a code point in a string literal or a regular expression. Those could be easily extended, even exactly as Allen suggested, even if we're officially UTF-16.

How do you support non-BMP characters without API bloat?

There is no bloat.

The crucial win of Allen's proposal comes down the road, when someone in a certain locale can do s.indexOf(nonBMPChar) and win.

s.indexOf("\U+10000"), who cares that it ends up as UTF-16? You can already do it, today, with s.indexOf("𐀀"). It happens that 𐀀 looks like d800 + dc00, but it still works. Today. This is no different than most other languages.

# Brendan Eich (14 years ago)

On May 19, 2011, at 10:27 AM, Shawn Steele wrote:

The crucial win of Allen's proposal comes down the road, when someone in a certain locale can do s.indexOf(nonBMPChar) and win.

s.indexOf("\U+10000"),

Ok, but "\U+..." does not work today.

who cares that it ends up as UTF-16? You can already do it, today, with s.indexOf("𐀀"). It happens that 𐀀 looks like d800 + dc00, but it still works. Today. This is no different than most other languages.

My example was unclear. I meant something like a one-char indexOf where the result would be used to slice that char.

That doesn't work today. That's the point.

But hey, if JS does not need to change then we can avoid trouble and keep on using 16-bit indexing and length. Is this really the best outcome?

# Mark S. Miller (14 years ago)

On Thu, May 19, 2011 at 9:50 AM, Brendan Eich <brendan at mozilla.com> wrote: [...]

That seems worth considering, rather than s.wideIndexOf(nonBMPChar).

Not jumping into the Unicode debate yet. But I did want to nip this terminological possibility in the bud. PLEASE do not refer to non-BMP characters, or char code/units with potentially more that 16 bits of significance, as "wide". C already uses the term "wide" to distinguish between 8 and 16 bits of significance in their char datatype, so we should consider "wide" ruined for any sensible usage.

# Shawn Steele (14 years ago)

The crucial win of Allen's proposal comes down the road, when someone in a certain locale can do s.indexOf(nonBMPChar) and win.

s.indexOf("\U+10000"),

Ok, but "\U+..." does not work today.

Yes, that would be worth adding (IMO) as a convenience, regardless of whether the backend were UTF-16 or UTF-32. Though requiring 6 digits is annoying. I'd prefer something like \U+ffff or \U+10000 or \u+10FFFF being allowed, though you'd have to do something interesting if there were additional 0-9a-f after U+ffff/U+10000. So \U+{ffff} could be explicit if necessary.

who cares that it ends up as UTF-16? You can already do it, today, with s.indexOf("𐀀"). It happens that 𐀀 looks like d800 + dc00, but it still works. Today. This is no different than most other languages.

My example was unclear. I meant something like a one-char indexOf where the result would be used to slice that char. That doesn't work today. That's the point.

I wonder if we could allow "char" to have 21 bits in number context, and be a surrogate pair in string contexts.

But hey, if JS does not need to change then we can avoid trouble and keep on using 16-bit indexing and length. Is this really the best outcome?

IMO we get 99% of what's needed by just changing to UTF-16 from UCS-2, although I'd like to see helpers like the U+10000 thing.

I think there are only 2 "tricky" parts with UTF-16 instead of UCS-2:

  • Fixing the encode/decode URL stuff so that it's UTF-8 instead of CESU-8. (Actually, just encode, since decode would be obvious I think.)
  • Optionally, for convenience, getting a 21-bit number from a string surrogate pair (because the existing API wouldn't know if you wanted just the D800 or the 10000 represented by the D800, DC00 pair). That could be useful for finding out if the pair is like one of the math bold forms (you could just do 1D400 <= x <= 1D433 instead of trying to figure out the pairs); see the sketch below.
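
A sketch of that second helper (hypothetical name codePointAt; today it has to be hand-rolled), using the math bold range from the bullet above:

// Combine a surrogate pair (or pass through a BMP code unit) into a code point.
function codePointAt(s, i) {
  var hi = s.charCodeAt(i);
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
    var lo = s.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi;
}

var x = codePointAt("\uD835\uDC00", 0);  // 0x1D400, MATHEMATICAL BOLD CAPITAL A
x >= 0x1D400 && x <= 0x1D433;            // true: it's one of the math bold letters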

-Shawn

/be

Big-endian? ;-)

# Douglas Crockford (14 years ago)

On 11:59 AM, Brendan Eich wrote:

But hey, if JS does not need to change then we can avoid trouble and keep on using 16-bit indexing and length. Is this really the best outcome?

It may well be. The problem is largely theoretical, and the many offered cures seem to be much worse than the disease. The language works as it has worked for years. It is not ideal in all of its aspects, especially when examined from a critical, abstract perspective. But practically, JS strings work.

I have come around in thinking that we should have \u{HHHHHH} in string and regexp literals (excluding character classes) as a more convenient way of specifying extended characters.
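
For example (the \u{...} form is only the proposal here, not existing syntax; the second line is what has to be written today):

var s = "\uD834\uDF06";           // today: U+1D306 (TETRAGRAM FOR CENTRE) as a surrogate pair
// Proposed: var s = "\u{1D306}"; // the same string, written as a single code point escape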

A more critical need is some form of string.format or quasiliterals. The string operation that is most lacking is the ability to inject correctly encoded material into templates. Having such a mechanism, it will little matter if the inserted characters are basic or extended.

# Brendan Eich (14 years ago)

On May 19, 2011, at 11:18 AM, Douglas Crockford wrote:

A more critical need is some form of string.format or quasiliterals.

Yes, these are important. On the agenda for next week? Which strawmen? I've had trouble sorting through the quasi-variations on the wiki, and I know I'm not alone.

# Mark S. Miller (14 years ago)

On Thu, May 19, 2011 at 12:05 PM, Brendan Eich <brendan at mozilla.com> wrote:

On May 19, 2011, at 11:18 AM, Douglas Crockford wrote:

A more critical need is some form of string.format or quasiliterals.

Yes, these are important. On the agenda for next week? Which strawmen? I've had trouble sorting through the quasi-variations on the wiki, and I know I'm not alone.

It is on the agenda as <strawman:quasis>, to be presented by Mike Samuel.

# Allen Wirfs-Brock (14 years ago)

On May 18, 2011, at 3:46 PM, Waldemar Horwat wrote:

  1. Widening characters to 21 bits doesn't really help much. As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized forms, etc. All of these require the machinery to deal with clumps of code units as though they were single characters/graphemes/etc., and once you have that machinery you can reuse it to support non-BMP characters while keeping string code units 16 bits.

Others have also made this argument, but I don't find it particularly persuasive. It seems to be conflating the programming language concept of a "string" with various application specific interpretations of string values and then saying that there is no reason to generalize the ECMAScript string type in one specific way (expanding the allowed code range of individual string elements) because that specific generalization doesn't address certain application specific use cases (combining diacritic marks, grapheme selection, etc.).

Interpreting sequences of "characters" ("characters" meaning individual string elements) within a string requires apply some sort of semantic interpretation to individual "character" values and how they relate to each other when places in sequence. This semantics is generally application or at least domain specific. In addition to the Unicode specific semantics that have been mentioned on this thread, other semantic interpretations of "character" sequences includes things like recognizing English words and sentences, lexical parsing of ECMAScript or XML tokens, processing DNA nucleotide sequence notations , etc.

I contend that none of these or any other application specific semantics should be hardwired into a language's fundamental string data type. The purpose of the string data type is to provide a foundation abstraction that can be used to implement any higher level application semantics.

In the absence of any semantic interpretation of "characters", language string types are most useful if they present a simple uniform model of "character" sequences. The key issue in designing a language-level string abstraction is the selection of the number of states that can be encoded by each "character". In plainer words, the size of a "character". The "character" size limits the number of individual states that can be processed by built-in string operations. As soon as an application requires more states per unit than can be represented by a "character", the application must apply its own specific multi-"character" encoding semantics to the processing of its logical units. Note that this encoding semantics is an additional semantic layer that lies between the basic string data type semantics and the actual application semantics regarding the interpretation of sequences of its own state units. Such layered semantics can add significant application complexity.

So one way to look at my proposal is that it is addressing the question of whether 16-bit "characters" are enough to allow the vast majority of applications to do application domain specific semantic string processing without having to work with the complexity of intermediate encodings that exist solely to expand the logical character size.

The processing of Unicode text is a key use case of the ECMAScript string datatype. The fundamental unit of the Unicode code space is pretty obviously the 21-bit code point. Unicode provides domain specific encodings to use when 21-bit characters are not available, but all semantics relating to character sequences seem to be expressed in terms of these 21-bit code points. This seems to strongly suggest that 16-bit "characters" are not enough. Whether an application chooses to internally use a UTF-8 or UTF-16 encoding is an independent question that is probably best left to the application designer. But if ECMAScript strings are limited to 16-bit "characters", that designer doesn't have the option of straightforwardly processing unencoded Unicode code points.

# Shawn Steele (14 years ago)

There are several sequences in Unicode which are meaningless if you have only one character and not the other. Eg: any of the variation selectors by themselves are meaningless. So if you break a modified character from its variation selector you've damaged the string. That's pretty much identical to splitting a high surrogate from its low surrogate.

The fallacy that surrogate pairs are somehow more special than other unicode character sequences is a huge problem. Developers think they've solved "the problem" by detecting surrogate pairs, when in truth they haven't even scratched the surface of "the problem".

In the Unicode community it is well known and understood that there is little, if any, practical advantage to using UTF-32 over UTF-16. I realize this seems counterintuitive on the surface.

Microsoft is firmly in favor of UTF-16 representation, though, as I stated before, we'd be happy if there were convenience functions, like for entering string literals. (I can type 10000 and press alt-x in outlook to get 𐀀, which never gets stored as UTF-32, but it sure is convenient).

If there wasn't an existing body of applications using UTF-16 already, I'd be a little more open to changing, though UTF-32 would be inconvenient for us. Since ES already, effectively, uses UTF-16, changing now just opens a can of incompatibility and special case worms.

Things that UTF-32 works for without special cases:

  • Ordinal collation/sorting (eg: non-linguistic (so why is it a string?))

Things that surrogates don't really add complexity to for UTF-16:

  • Linguistic sorting.
    • You want Ä (U+00C4) == Ä (U+0041, U+0308) anyway, so you have to understand sequences.
    • Or maybe both of them equal to AE (German).
    • Many code points have no weight at all.
    • Compressions (dz) and double compressions (ddz) are far more complex.
    • etc.
  • String searching, eg: you can search for A in Ä, but do you really want it to match?
  • SubString. I found A in Äpfel, but is it really smart to break that into "A" and "̈pfel"? (There's a U+0308 before "pfel", which is pretty meaningless and probably doesn't render well in everyone's font. It makes my " look funny.) Probably this isn't very desirable; see the sketch after these lists.
  • And that just scratches the surface

Things that don't change with UTF-16:

  • People shoving binary data into UTF-16 and pretending it's a string aren't broken (assuming we allow irregular sequences for compatibility).

Things that make UTF-32 complicated:

  • We already support UTF-16. There will be an amazing number of problems around U+D800, U+DC00 ?= U+10000 if they can both be represented differently in the same string. Some of those will be security problems.
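
A small illustration of the SubString point above, in plain JavaScript as it works today (the hazard involves combining sequences, so it exists whether indexing is by code unit or by code point):

var precomposed = "\u00C4pfel";  // "Äpfel" with precomposed Ä (U+00C4)
var decomposed = "A\u0308pfel";  // "Äpfel" as A + combining diaeresis (U+0308)
precomposed.indexOf("A");        // -1: no bare "A" here
decomposed.indexOf("A");         // 0: the match found half of a user-perceived character
decomposed.slice(1);             // "\u0308pfel": starts with a stray combining mark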
# Allen Wirfs-Brock (14 years ago)

On May 19, 2011, at 2:06 PM, Shawn Steele wrote:

There are several sequences in Unicode which are meaningless if you have only one character and not the other. Eg: any of the variation selectors by themselves are meaningless. So if you break a modified character from its variation selector you've damaged the string. That's pretty much identical to splitting a high surrogate from its low surrogate.
... Things that UTF-32 works for without special cases:

  • Ordinal collation/sorting (eg: non-linguistic (so why is it a string?))

This is exactly my point. The string data type in ECMAScript is non-linguistic. There is nothing Unicode specific about the fundamental ECMAScript string data type nor about any of the language operations (concatenation, comparison, length determination, character access) upon strings. Similarly, the majority of String methods also have no specific Unicode semantic dependencies (the exceptions are the toUpperCase/toLowerCase methods, and they don't treat surrogate pairs as a unit). The string data type can be used for many purposes that have nothing to do with the linguistic semantics of Unicode. That is why linguistically based arguments seem to be missing the point.

Where there is a potential connection between Unicode semantics and the string data type is in the interpretation of ECMAScript string literals as constructors of string values. ECMAScript is biased towards Unicode in the sense that it only supports a Unicode interpretation of string literals. However, currently, ECMAScript literals can only contain BMP characters and escape sequences that produce BMP code points, and these are directly represented in string values as 16-bit character codes. Given that level of Unicode bias in the language, there is obvious utility in allowing literals to contain any Unicode character and in supporting the generation of string values that use Unicode UTF-16 (and possibly, alternatively, UTF-8) encoding semantics. The utility of such features seems independent of the underlying size of the string type's character codes.

# Shawn Steele (14 years ago)

I'm still not at all convinced :) I don't buy that the linguistic case isn't interesting (though granted JS is really bad about that to date), and I don't buy that non-linguistic uses have any trouble with UTF-16. \UD800\UDC00 == \UD800\UDC00 just as easily in UTF-16. The "only" advantage is that supplementary characters would "sort" > 0xFFFF in UTF-32 ordinally, but since the order is sort of meaningless anyway, it doesn't really matter that they sort in the surrogate range instead.

# Allen Wirfs-Brock (14 years ago)

On May 19, 2011, at 3:35 PM, Shawn Steele wrote:

I'm still not at all convinced :) I don't buy that the linguistic case isn't interesting

Just to be clear, I'm not saying the linguistic case isn't interesting. It's obviously very interesting for a lot of applications. I was trying to say that the ES string value type and String objects are linguistically neutral abstractions, and also that I think this is a good thing. I would have no problem at all with a different object-level String-like abstraction that supported Unicode linguistics. Call it UnicodeString or UTF16String or whatever and give it all the linguistically aware methods you want. The underlying data would still be represented using the linguistically neutral string value type.