Question about the “full Unicode in strings” strawman

# Mathias Bynens (7 years ago)

strawman:support_full_unicode_in_strings#unicode_escape_sequences states:

To address this issue, a new form of UnicodeEscapeSequence is added that is explicitly tagged as containing a variable number (up to 8) of hex digits. The new definition is:

UnicodeEscapeSequence ::

u HexDigit HexDigit HexDigit HexDigit

u{ HexDigit HexDigit_opt HexDigit_opt HexDigit_opt HexDigit_opt HexDigit_opt HexDigit_opt HexDigit_opt }

The \u{ } extended UnicodeEscapeSequence is a syntactic extension that is only recognized after explicit versioning opt-in to the extended “Harmony” syntax.

Why up to 8 hex digits? Shouldn’t 6 hex digits suffice to represent every possible Unicode character (in the range from 0x0 to 0x10ffff)?

Is this a typo or was this done intentionally to be future-compatible with potential Unicode additions?

# Allen Wirfs-Brock (7 years ago)

Note that this proposal isn't currently under consideration for inclusion in ES.next, but the answer to your question is below.

On Jan 22, 2012, at 10:59 PM, Mathias Bynens wrote:

strawman:support_full_unicode_in_strings#unicode_escape_sequences states:

To address this issue, a new form of UnicodeEscapeSequence is added that is explicitly tagged as containing a variable number (up to 8) of hex digits. The new definition is:

UnicodeEscapeSequence ::

u HexDigit HexDigit HexDigit HexDigit

u{ HexDigit HexDigit_opt HexDigit_opt HexDigit_opt HexDigit_opt HexDigit_opt HexDigit_opt HexDigit_opt }

The \u{ } extended UnicodeEscapeSequence is a syntactic extension that is only recognized after explicit versioning opt-in to the extended “Harmony” syntax.

Why up to 8 hex digits? Shouldn’t 6 hex digits suffice to represent every possible Unicode character (in the range from 0x0 to 0x10ffff)?

Is this a typo or was this done intentionally to be future-compatible with potential Unicode additions?

Just as the current definition of string specifies that a String is a sequence of 16-bit unsigned integer values, the proposal would specify that a String is a sequence of 32-bit unsigned integer values. In neither case is it required that the individual String elements be valid Unicode code points or code units. 8 hex digits are required to express the full range of unsigned 32-bit integers.
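The digit counts in the question and in the answer can be checked with a little arithmetic. The following is a minimal sketch; the helper name `hexDigitsNeeded` is introduced here purely for illustration and does not come from the strawman.

```javascript
// Number of hex digits needed to write a non-negative integer.
// (Hypothetical helper, used only for this illustration.)
function hexDigitsNeeded(value) {
  return value.toString(16).length;
}

var maxCodePoint = 0x10FFFF; // largest Unicode code point
var maxUint32 = 0xFFFFFFFF;  // largest unsigned 32-bit integer

var digitsForUnicode = hexDigitsNeeded(maxCodePoint);  // 6 hex digits
var bitsForUnicode = maxCodePoint.toString(2).length;  // 21 bits
var digitsForUint32 = hexDigitsNeeded(maxUint32);      // 8 hex digits

console.log(digitsForUnicode, bitsForUnicode, digitsForUint32);
```

Six digits cover every Unicode code point, while eight are only needed if String elements are treated as arbitrary 32-bit values, which is the position the reply takes.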

# Mark S. Miller (7 years ago)

On Tue, Jan 24, 2012 at 12:33 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Note that this proposal isn't currently under consideration for inclusion in ES.next, but the answer to your question is below

[...]

Just as the current definition of string specifies that a String is a sequence of 16-bit unsigned integer values, the proposal would specify that a String is a sequence of 32-bit unsigned integer values. In neither case is it required that the individual String elements be valid Unicode code points or code units. 8 hex digits are required to express the full range of unsigned 32-bit integers.

Why 32? Unicode has only 21 bits of significance. Since we don't expect strings to be stored naively (taking up 4x the space that would otherwise be allocated), I don't see the payoff from choosing the next power of 2. The other choices I see are a) 21 bits, b) 53 bits, or c) unbounded.

# Allen Wirfs-Brock (7 years ago)

On Jan 24, 2012, at 2:11 PM, Mark S. Miller wrote:

On Tue, Jan 24, 2012 at 12:33 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Note that this proposal isn't currently under consideration for inclusion in ES.next, but the answer to your question is below

[...]

Just as the current definition of string specifies that a String is a sequence of 16-bit unsigned integer values, the proposal would specify that a String is a sequence of 32-bit unsigned integer values. In neither case is it required that the individual String elements be valid Unicode code points or code units. 8 hex digits are required to express the full range of unsigned 32-bit integers.

Why 32? Unicode has only 21 bits of significance. Since we don't expect strings to be stored naively (taking up 4x the space that would otherwise be allocated),

I believe most current implementations actually store 16 bits per character, so it would be 2x rather than 4x.

I don't see the payoff from choosing the next power of 2. The other choices I see are a) 21 bits, b) 53 bits, or c) unbounded.

The current 16-bit character strings are sometimes used to store non-Unicode binary data and can be used with non-Unicode character encodings with up to 16-bit chars. 21 bits is sufficient for Unicode but perhaps is not enough for other useful encodings. 32 bits seems like a plausible unit.

The real controversy that developed over this proposal concerned whether or not every individual Unicode character needs to be uniformly representable as a single element of a String. This proposal took the position that it should. Other voices felt that such uniformity was unnecessary and seemed content to expose UTF-8 or UTF-16. The argument was that applications may have to look at multiple-character logical units anyway, so dealing with UTF encodings isn't much of an added burden.

# Norbert Lindenberg (7 years ago)

I don't see the standard allowing character encodings other than UTF-16 in strings. Section 8.4 says "When a String contains actual textual data, each element is considered to be a single UTF-16 code unit." This aligns with other normative references to UTF-16 in sections 2, 6, and 15.1.3. Section 8.4 does seem to allow the use of strings for non-textual data, but character encodings are by definition for characters, i.e., textual data.

Using a Unicode escape for non-textual data seems like abuse to me - Unicode is a character encoding standard. For Unicode, anything beyond six hex digits is excessive.

Norbert

# Mathias Bynens (7 years ago)

Norbert echoes my thoughts perfectly:

Using a Unicode escape for non-textual data seems like abuse to me - Unicode is a character encoding standard. For Unicode, anything beyond six hex digits is excessive.

Allen, what use cases for using Unicode escapes / strings for non-textual data did you have in mind?

# Tab Atkins Jr. (7 years ago)

On Tue, Jan 24, 2012 at 5:14 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

The current 16-bit character strings are sometimes used to store non-Unicode binary data and can be used with non-Unicode character encodings with up to 16-bit chars. 21 bits is sufficient for Unicode but perhaps is not enough for other useful encodings. 32 bits seems like a plausible unit.

People only use strings to store binary data because they didn't have native binary data types. Now they do. Continuing to optimize strings for this use-case seems unnecessary.

The real controversy that developed over this proposal concerned whether or not every individual Unicode character needs to be uniformly representable as a single element of a String. This proposal took the position that it should. Other voices felt that such uniformity was unnecessary and seemed content to expose UTF-8 or UTF-16. The argument was that applications may have to look at multiple-character logical units anyway, so dealing with UTF encodings isn't much of an added burden.

Anyone who argues that authors should have to deal with multibyte characters spread across >1 elements in a string has never tried to deal with having a non-BMP name on the web. UTF-16 is particularly horrible in this regard, as "most" names authors will see (if they're not serving a CJK audience explicitly) are in the BMP and thus are a single element. UTF-8 at least has the "advantage" that authors are somewhat more likely to encounter problems if they assume 1 character = 1 element.

Making strings more complicated is, unfortunately, user-hostile against people with names outside of ASCII or the BMP.

# Allen Wirfs-Brock (7 years ago)

On Jan 24, 2012, at 11:45 PM, Norbert Lindenberg wrote:

I don't see the standard allowing character encodings other than UTF-16 in strings. Section 8.4 says "When a String contains actual textual data, each element is considered to be a single UTF-16 code unit." This aligns with other normative references to UTF-16 in sections 2, 6, and 15.1.3. Section 8.4 does seem to allow the use of strings for non-textual data, but character encodings are by definition for characters, i.e., textual data.

8.4 definitely allows for non-textual data: "String type is ... sequences of ... 16-bit unsigned integer values...", "The String type is generally used to represent textual data...", "All operations on Strings ... treat them as sequences of undifferentiated 16-bit unsigned integers..."

Arbitrary 16-bit values can be placed in a String using either String.fromCharCode (15.5.3.2) or the \uxxxx notation in string literals. Neither of these enforces a requirement that individual String elements be valid Unicode code units.

The standard always encodes strings expressed as string literals (except for literals containing \u escapes) using Unicode. However, such literals are restricted to containing characters in the BMP, so all such characters are encoded as single 16-bit String elements.

The functions in 15.1.3 do UTF-8 encoding/decoding, but only if the actual string arguments contain well-formed UTF data. They explicitly throw when encountering other data. This is a characteristic of these specific functions, not of strings in general.
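Both halves of this point can be observed in a few lines. The sketch below uses an unpaired high surrogate (0xD800) as the example value; the variable names are illustrative only.

```javascript
// An unpaired high surrogate is not valid Unicode text on its own,
// yet it is a perfectly legal String element.
var lone = String.fromCharCode(0xD800);
var sameViaEscape = '\uD800';

var isSingleElement = lone.length === 1;             // true
var roundTrips = lone.charCodeAt(0) === 0xD800;      // true
var escapeMatches = sameViaEscape === lone;          // true

// The URI functions of 15.1.3, by contrast, reject ill-formed data.
var threwUriError = false;
try {
  encodeURIComponent(lone);
} catch (e) {
  threwUriError = e instanceof URIError;
}

console.log(isSingleElement, roundTrips, escapeMatches, threwUriError);
```

The string itself holds the value without complaint; only the function that has to interpret it as UTF-16 text refuses.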

Using a Unicode escape for non-textual data seems like abuse to me - Unicode is a character encoding standard. For Unicode, anything beyond six hex digits is excessive.

I see no intent in the spec. that \u or String.fromCharCode was to be restricted to valid Unicode character encodings.

Any character encoding is simply a semantic interpretation of binary values. There is no particular reason that "text" encoded using non-Unicode encodings (say, for example, EBCDIC) can't be represented using ES String values, and most of the String methods would work fine with such textual data. You would probably want to do exactly that if you were writing code that had to deal with character set conversions.
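As a rough illustration of that scenario, the sketch below stores a few EBCDIC byte values as String elements and converts them afterwards. The table `EBCDIC_TO_UNICODE` and the function `ebcdicToUnicode` are hypothetical names invented for this example, and the table covers only the four letters it needs; a real converter would need a complete code page.

```javascript
// Deliberately partial, illustrative table: EBCDIC byte values for
// the uppercase letters H, E, L and O.
var EBCDIC_TO_UNICODE = { 0xC8: 'H', 0xC5: 'E', 0xD3: 'L', 0xD6: 'O' };

// "HELLO" held in a String as EBCDIC byte values, one per element.
var ebcdicText = String.fromCharCode(0xC8, 0xC5, 0xD3, 0xD3, 0xD6);

// Ordinary String methods still work on the raw values.
var firstL = ebcdicText.indexOf(String.fromCharCode(0xD3)); // 2

// Convert element by element; unknown values become U+FFFD.
function ebcdicToUnicode(s) {
  var out = '';
  for (var i = 0; i < s.length; i++) {
    var code = s.charCodeAt(i);
    out += EBCDIC_TO_UNICODE.hasOwnProperty(code)
      ? EBCDIC_TO_UNICODE[code]
      : '\uFFFD';
  }
  return out;
}

var converted = ebcdicToUnicode(ebcdicText);
console.log(firstL, converted);
```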

# John Tamplin (7 years ago)

On Wed, Jan 25, 2012 at 12:46 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Arbitrary 16-bit values can be placed in a String using either String.fromCharCode (15.5.3.2) or the \uxxxx notation in string literals. Neither of these enforces a requirement that individual String elements be valid Unicode code units.

You can't really store arbitrary 16-bit values in strings, as they will get corrupted in some browsers. Specifically combining marks and unpaired surrogates are problematic, and some invalid code points get replaced with another character. Even if it is only text, you can't rely on the strings not being mangled -- GWT RPC quotes different ranges of characters on different browsers.

code.google.com/p/google-web-toolkit/source/browse/trunk/user/src/com/google/gwt/user/client/rpc/impl/ClientSerializationStreamWriter.java?spec=svn10146&r=10146#86

(the Android bug mentioned has been fixed long ago, but I haven't gone through any kind of research to see how many of the broken browsers are still in use to see if it is safe to remove).

# Gillam, Richard (7 years ago)

The current 16-bit character strings are sometimes used to store non-Unicode binary data and can be used with non-Unicode character encodings with up to 16-bit chars. 21 bits is sufficient for Unicode but perhaps is not enough for other useful encodings. 32 bits seems like a plausible unit.

How would an eight-digit \u escape sequence work from an implementation standpoint? I'm assuming most implementations right now use 16-bit unsigned values as the individual elements of a String. If we allow arbitrary 32-bit values to be placed into a String, how would you make that work? There seem to only be a few options:

a) Change the implementation to use 32-bit units.

b) Change the implementation to use either 16-bit or 32-bit units as needed, with some sort of internal flag that specifies the unit size for an individual string.

c) Encode the 32-bit values somehow as a sequence of 16-bit values.

If you want to allow full generality, it seems like you'd be stuck with option a or option b. Is there really enough value in doing this?

If, on the other hand, the idea is just to make it easier to include non-BMP Unicode characters in strings, you can accomplish this by making a long \u sequence just be shorthand for the equivalent sequence in UTF-16: \u10ffff would be exactly equivalent to \udbff\udfff. You don't have to change the internal format of the string, the indexes of individual characters stay the same, etc.
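The mapping from a non-BMP code point to its equivalent UTF-16 sequence is the standard surrogate-pair calculation. The following is a minimal sketch; the function name `toSurrogatePair` is an assumption made for this example, not something proposed in the thread.

```javascript
// Convert a supplementary code point (0x10000..0x10FFFF) to the
// two-element UTF-16 string that encodes it.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not a supplementary code point: ' + codePoint);
  }
  var offset = codePoint - 0x10000;        // 20-bit value
  var high = 0xD800 + (offset >> 10);      // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return String.fromCharCode(high, low);
}

var pair = toSurrogatePair(0x10FFFF);
console.log(pair === '\uDBFF\uDFFF'); // true
console.log(pair.length);             // 2
```

Under this reading the string's internal format and indexing are unchanged, which is exactly why the resulting length is 2 rather than 1.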

--Rich Gillam Lab126

# Mark Davis ☕ (7 years ago)

You can't use \u10FFFF as syntax, because that could be \u10FF followed by literal FF. A better syntax is \u{...}, with 1 to 6 digits, values from 0 .. 10FFFF.
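The ambiguity is easy to see in any engine: an unbraced `\u` escape always consumes exactly four hex digits. The sketch below also uses the braced form, which assumes an engine that implements it as later standardized in ECMAScript 2015; in such engines strings remain sequences of 16-bit elements, so the braced escape yields a surrogate pair rather than the single element the strawman aimed for.

```javascript
// Unbraced form: parsed as U+10FF followed by the literal text "FF".
var unbraced = '\u10FFFF';
var unbracedLength = unbraced.length;                 // 3
var firstUnit = unbraced.charCodeAt(0);               // 0x10FF
var remainder = unbraced.slice(1);                    // "FF"

// Braced form (assumes ES2015 support): one escape, unambiguous extent.
var braced = '\u{10FFFF}';
var bracedIsSurrogatePair = braced === '\uDBFF\uDFFF'; // true in ES2015 engines

console.log(unbracedLength, firstUnit.toString(16), remainder, bracedIsSurrogatePair);
```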

Mark — Il meglio è l’inimico del bene — * * * [plus.google.com/114199149796022210033] *

# Mark Davis ☕ (7 years ago)

(oh, and I agree with your other points)

Mark — Il meglio è l’inimico del bene — * * * [plus.google.com/114199149796022210033] *

# Gillam, Richard (7 years ago)

Mark--

Of course. Sorry. That should have been "\U10ffff is equivalent to \udbff\udfff", with a capital U, or "\u{10ffff} is equivalent to \udbff\udfff".

--Rich


# Allen Wirfs-Brock (7 years ago)

On Jan 25, 2012, at 9:54 AM, John Tamplin wrote:

On Wed, Jan 25, 2012 at 12:46 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Arbitrary 16-bit values can be placed in a String using either String.fromCharCode (15.5.3.2) or the \uxxxx notation in string literals. Neither of these enforces a requirement that individual String elements be valid Unicode code units.

You can't really store arbitrary 16-bit values in strings, as they will get corrupted in some browsers. Specifically combining marks and unpaired surrogates are problematic, and some invalid code points get replaced with another character. Even if it is only text, you can't rely on the strings not being mangled -- GWT RPC quotes different ranges of characters on different browsers.

code.google.com/p/google-web-toolkit/source/browse/trunk/user/src/com/google/gwt/user/client/rpc/impl/ClientSerializationStreamWriter.java?spec=svn10146&r=10146#86

(the Android bug mentioned has been fixed long ago, but I haven't gone through any kind of research to see how many of the broken browsers are still in use to see if it is safe to remove).

It isn't clear from your source code what encoding issues you have actually identified. I suspect that you are talking about what happens when an external resource (an application/javascript file), which may be in various UTF encodings, is normalized and passed to the JavaScript parser. If so, that isn't what we are talking about here. We are talking about what values can exist at runtime as the individual elements of a string value.

Any browser JavaScript implementation conforming to either the ES3 or ES5.1 spec should display "passed" for the following test case:

    var hc, s;
    for (var c = 0; c <= 0xffff; c++) {
      // test charCode creation and access
      if (String.fromCharCode(c).charCodeAt(0) !== c) alert("failed for: " + c);
      // test "\uxxxx" using eval; the backslash is doubled so that the
      // escape is interpreted by eval, not by this source text
      hc = '"\\u' + (c < 16 ? '000' : c < 256 ? '00' : c < 4096 ? '0' : '') + c.toString(16) + '"';
      s = eval(hc);
      if (s.length !== 1) alert('failed \\u bad length for ' + c);
      if (s.charCodeAt(0) !== c) alert('failed \\u for ' + c);
    }
    alert("passed");

# John Tamplin (7 years ago)

On Wed, Jan 25, 2012 at 2:33 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

It isn't clear from your source code what encoding issues you have actually identified. I suspect that you are talking about what happens when an external resource (an application/javascript file), which may be in various UTF encodings, is normalized and passed to the JavaScript parser. If so, that isn't what we are talking about here. We are talking about what values can exist at runtime as the individual elements of a string value.

No, I am talking about storing values directly in a string in JS, sending them to a server via XHR, and have them arrive there the same as they existed in JS. There are tests in GWT that verify that, and without replacing certain values with escape sequences (which get reversed on the server), they do not make it unmangled to the server (and inspecting the packets on the wire shows the mangling happens in the browser not the server).

# Allen Wirfs-Brock (7 years ago)

On Jan 25, 2012, at 11:37 AM, John Tamplin wrote:

On Wed, Jan 25, 2012 at 2:33 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

It isn't clear from your source code what encoding issues you have actually identified. I suspect that you are talking about what happens when an external resource (an application/javascript file), which may be in various UTF encodings, is normalized and passed to the JavaScript parser. If so, that isn't what we are talking about here. We are talking about what values can exist at runtime as the individual elements of a string value.

No, I am talking about storing values directly in a string in JS, sending them to a server via XHR, and have them arrive there the same as they existed in JS. There are tests in GWT that verify that, and without replacing certain values with escape sequences (which get reversed on the server), they do not make it unmangled to the server (and inspecting the packets on the wire shows the mangling happens in the browser not the server).

Then that is an issue with XHR or the XHR implementation, not with JavaScript string semantics.

# Allen Wirfs-Brock (7 years ago)

On Jan 25, 2012, at 10:59 AM, Gillam, Richard wrote:

The current 16-bit character strings are sometimes used to store non-Unicode binary data and can be used with non-Unicode character encodings with up to 16-bit chars. 21 bits is sufficient for Unicode but perhaps is not enough for other useful encodings. 32 bits seems like a plausible unit.

How would an eight-digit \u escape sequence work from an implementation standpoint? I'm assuming most implementations right now use 16-bit unsigned values as the individual elements of a String. If we allow arbitrary 32-bit values to be placed into a String, how would you make that work? There seem to only be a few options:

a) Change the implementation to use 32-bit units.

b) Change the implementation to use either 16-bit or 32-bit units as needed, with some sort of internal flag that specifies the unit size for an individual string.

c) Encode the 32-bit values somehow as a sequence of 16-bit values.

If you want to allow full generality, it seems like you'd be stuck with option a or option b. Is there really enough value in doing this?

This issue is somewhat addressed in the implementation impacts section of the proposal: strawman:support_full_unicode_in_strings#possible_implementation_impacts

My assumption is that most implementations would choose b, although the others would all be valid implementation approaches. Note that some implementations already use multiple alternative internal string representations in order to optimize various scenarios.

If, on the other hand, the idea is just to make it easier to include non-BMP Unicode characters in strings, you can accomplish this by making a long \u sequence just be shorthand for the equivalent sequence in UTF-16: \u10ffff would be exactly equivalent to \udbff\udfff. You don't have to change the internal format of the string, the indexes of individual characters stay the same, etc.

The primary intent of the proposal was to extend ES Strings to support a uniform representation of all Unicode characters, including non-BMP ones. That means that any Unicode character should occupy exactly one element position within a String value. Interpreting \u{10ffff} as a UTF-16 encoding does not satisfy that objective. In particular, under that approach "\u{10ffff}".length would be 2, while a uniform character representation should yield a length of 1.
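The gap between the two models is the gap between counting String elements and counting characters. With 16-bit elements, user code has to do the second count itself; the sketch below shows one way, with `countCodePoints` being a hypothetical helper written for this illustration.

```javascript
// Count Unicode code points in a string of 16-bit elements by treating
// each well-formed surrogate pair as one character.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate; the pair is one code point
      }
    }
    count++;
  }
  return count;
}

var nonBmp = '\uDBFF\uDFFF'; // U+10FFFF as a UTF-16 pair
console.log(nonBmp.length);           // 2 (elements)
console.log(countCodePoints(nonBmp)); // 1 (characters)
```

Under the proposal's uniform representation, `length` alone would give the second number and no such helper would be needed.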

When this proposal was originally floated, much of the debate seemed to be about whether such a uniform character representation was desirable or even useful. See the thread starting at esdiscuss/2011-May/014252, also esdiscuss/2011-May/014316 and

# John Tamplin (7 years ago)

On Wed, Jan 25, 2012 at 2:55 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

The primary intent of the proposal was to extend ES Strings to support a uniform representation of all Unicode characters, including non-BMP ones. That means that any Unicode character should occupy exactly one element position within a String value. Interpreting \u{10ffff} as a UTF-16 encoding does not satisfy that objective. In particular, under that approach "\u{10ffff}".length would be 2, while a uniform character representation should yield a length of 1.

When this proposal was originally floated, much of the debate seemed to be about whether such a uniform character representation was desirable or even useful. See the thread starting at esdiscuss/2011-May/014252, also esdiscuss/2011-May/014316 and

That seems highly likely to break existing code that assumes UTF16 representation of JS strings.

# Allen Wirfs-Brock (7 years ago)

On Jan 25, 2012, at 12:25 PM, John Tamplin wrote:

On Wed, Jan 25, 2012 at 2:55 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

The primary intent of the proposal was to extend ES Strings to support a uniform representation of all Unicode characters, including non-BMP ones. That means that any Unicode character should occupy exactly one element position within a String value. Interpreting \u{10ffff} as a UTF-16 encoding does not satisfy that objective. In particular, under that approach "\u{10ffff}".length would be 2, while a uniform character representation should yield a length of 1.

When this proposal was originally floated, much of the debate seemed to be about whether such a uniform character representation was desirable or even useful. See the thread starting at esdiscuss/2011-May/014252, also esdiscuss/2011-May/014316 and

That seems highly likely to break existing code that assumes UTF16 representation of JS strings.

The proposal was designed to not break any existing JavaScript code. Just to be clear, ES5.1 and previous editions do not perform UTF-16 encoding of non-BMP characters in the course of normal string processing. Any UTF encoding of non-BMP characters is being done by user code, the built-in decodeURI functions, or host-provided functions (for example XDR??). None of those should break under my proposal. (For external libraries such as XDR that may depend upon internal implementation data, it is really up to the platform implementation.) Nothing in the proposal prevents application-level UTF-16 string encodings using 32-bit String elements. This is completely analogous to how UTF-8 encodings are sometimes performed using current 16-bit ECMAScript string elements.
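The UTF-8 analogy can be made concrete. One common application-level technique, sketched below, holds the UTF-8 bytes of a text as individual String elements. The function name `toUtf8ByteString` is an assumption made for this example; it relies only on `encodeURIComponent`, which percent-encodes the UTF-8 bytes of its input.

```javascript
// Produce a string whose elements are the UTF-8 byte values (0..255)
// of the input text. Each "%XX" from encodeURIComponent is one byte.
function toUtf8ByteString(text) {
  return encodeURIComponent(text).replace(/%([0-9A-F]{2})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

var eAcute = toUtf8ByteString('\u00E9'); // U+00E9 -> bytes C3 A9
var euro = toUtf8ByteString('\u20AC');   // U+20AC -> bytes E2 82 AC

console.log(eAcute.length, eAcute.charCodeAt(0).toString(16), eAcute.charCodeAt(1).toString(16));
console.log(euro.length);
```

Here each 16-bit element carries an 8-bit encoding unit; the reply's point is that 32-bit elements could carry 16-bit UTF-16 units in the same way if an application chose to.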