Allen Wirfs-Brock (2012-02-27T21:58:47.000Z)
domenic at domenicdenicola.com (2013-08-29T19:38:35.286Z)
Yes, this interpretation is consistent with my understanding of the requirements as expressed in the ES5 spec. ES5 logically only works with UCS-2 characters corresponding to the BMP. Some (probably most) implementations pass UTF-16 encodings of supplemental characters to the JavaScript compiler. According to the spec, these are processed as two UCS-2 characters, neither of which is a member of any of the above character categories. Their use in an identifier context should result in a syntax error. Within a string literal, the two UCS-2 characters generate two string elements.

This is something that I think can be clarified for the ES6 specification, independent of the ongoing discussion of the possibility of 21-bit string elements. My preference for the future is to simply define the input alphabet of ECMAScript as all Unicode characters, independent of the actual encoding. var \ud87e\udc00 would probably still be illegal, because each \uXXXX defines a separate character, but var \u{2f800} = 42; should be fine, as should a direct, non-escaped occurrence of that character.
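
For concreteness, a short sketch of how this would play out in source text (assuming an engine that implements the proposed \u{...} identifier escapes; the comments restate the expectations above, not specified behavior):

    // U+2F800 is encoded in UTF-16 as the surrogate pair \uD87E\uDC00.
    var s = "\uD87E\uDC00";  // in a string literal the two UCS-2 code units
    s.length;                // become two string elements, so length is 2

    var \u{2F800} = 42;      // proposed: legal, the escape denotes one Unicode character
    var \uD87E\uDC00 = 42;   // expected to remain a SyntaxError: each \uXXXX defines a
                             // separate character, and neither lone surrogate is a valid
                             // identifier character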