Question about allowed characters in identifier names
Yes, this interpretation is consistent with my understanding of the requirements as expressed in the ES5 spec. ES5 logically only works with UCS-2 characters corresponding to the BMP.
Some (probably most) implementations pass UTF-16 encodings of supplemental characters to the JavaScript compiler. According to the spec, these are processed as two UCS-2 characters neither of which would be a member of any of the above character categories. Their use in an identifier context should result in a syntax error. Within a string literal, the two UCS-2 characters would generate two string elements.
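To make the "two string elements" point concrete, here is a minimal sketch that should run in any ES5-or-later engine. The variable names are illustrative only and do not come from the spec.

```javascript
// U+2F800 written as a UTF-16 surrogate pair inside a string literal.
var s = '\uD87E\uDC00';

// The string holds two elements (code units), not one character.
var elementCount = s.length;
var high = s.charCodeAt(0); // high surrogate
var low = s.charCodeAt(1);  // low surrogate

console.log(elementCount, high.toString(16), low.toString(16));
// prints: 2 d87e dc00
```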
This is something that I think can be clarified for the ES6 specification, independent of the ongoing discussion of the possibility of 21-bit string elements. My preference for the future is to simply define the input alphabet of ECMAScript as all Unicode characters, independent of actual encoding. var \ud87e\udc00 would probably still be illegal because each \uXXXX defines a separate character, but var \u{2f800} = 42; should be fine, as should the direct, unescaped occurrence of that character.
On 27 Feb 2012, at 22:58, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
This is something that I think can be clarified for the ES6 specification, independent of the on-going discussion of the possibility of 21-bit string elements. My preference for the future is to simply define the input alphabet of ECMAScript as all Unicode characters independent of actual encoding.
That sounds nice.
var \ud87e\udc00 would probably still be illegal because each \uXXXX defines a separate character, but var \u{2f800} = 42; should be fine, as should the direct, unescaped occurrence of that character.
Wouldn’t this be confusing, though?
global['\u{2F800}'] = 42; // would work (compatible with ES5 behavior)
global['\uD87E\uDC00'] = 42; // would work, too, since `'\uD87E\uDC00' == '\u{2F800}'` (compatible with ES5 behavior)
var \uD87E\uDC00 = 42; // would fail (compatible with ES5 behavior)
var \u{2F800} = 42; // would work (as per your comment; incompatible with ES5 behavior)
var 丽 = 42; // would work (as per your comment; incompatible with ES5 behavior)
Using astral symbols in identifiers would be backwards incompatible, even if the raw (unescaped) symbol is used. There’d be no way to use such an identifier in an ES5 environment. Is this a problem?
On 24 Aug 2013, at 11:02, Mathias Bynens <mathias at qiwi.be> wrote:
Using astral symbols in identifiers would be backwards incompatible, even if the raw (unescaped) symbol is used. There’d be no way to use such an identifier in an ES5 environment. Is this a problem?
To clarify: consider what the Identifier Identification strawman or any scripts that emulate similar behavior should do if Allen’s suggestion would be implemented:
String.isIdentifierStart('\uD87E\uDC00'); // should be `false`
String.isIdentifierStart('\u{2F800}'); // should be `true`
// this is impossible, since `'\uD87E\uDC00' === '\u{2F800}'` and there is no way to distinguish these strings
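The indistinguishability claimed in the comment above can be checked directly. This is a small sketch that assumes an engine with ES6 `\u{...}` escapes and `String.prototype.codePointAt`; the variable names are made up for illustration.

```javascript
// Two spellings of the same string value (needs ES6 \u{...} escapes).
const viaSurrogates = '\uD87E\uDC00';
const viaCodePoint = '\u{2F800}';

// Both literals evaluate to the same two code units...
const sameValue = viaSurrogates === viaCodePoint;
// ...so a function receiving either one sees identical input.
const codePoint = viaSurrogates.codePointAt(0);

console.log(sameValue, codePoint.toString(16));
// prints: true 2f800
```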
On Aug 24, 2013, at 2:02 , Mathias Bynens <mathias at qiwi.be> wrote:
Wouldn’t this be confusing, though?
I do think it's confusing that \uD87E\uDC00 is not allowed in the current ES6 spec, and have reported this as part 4 of ecmascript#501
The issue was also discussed, without a conclusion, at the TC 39 meeting in July 2012 - look for "# Unicode support": esdiscuss/2012-July/024207
Using astral symbols in identifiers would be backwards incompatible, even if the raw (unescaped) symbol is used. There’d be no way to use such an identifier in an ES5 environment. Is this a problem?
It's the same problem as when using
let a = 42;
You should not expect that a program using new ES6 features will run on ES5 implementations (although it might run on some that have already added the ES6 features used).
On 24 Aug 2013, at 22:13, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:
I do think it's confusing that \uD87E\uDC00 is not allowed in the current ES6 spec, and have reported this as part 4 of ecmascript#501
Agreed. It would make much more sense to just treat \u{2F800} and \uD87E\uDC00 exactly the same way, even outside of string contexts.
You should not expect that a program using new ES6 features will run on ES5 implementations (although it might run on some that have already added the ES6 features used).
I just want to make sure it’s possible to write a polyfill (in ES5) for the String.isIdentifier{Start,Part} strawman. As long as String.isIdentifierStart('\uD87E\uDC00') and String.isIdentifierStart('\u{2F800}') are expected to return different results (as Allen suggests), this is impossible.
On Aug 24, 2013, at 5:42 , Mathias Bynens <mathias at qiwi.be> wrote:
To clarify: consider what the Identifier Identification strawman or any scripts that emulate similar behavior should do if Allen’s suggestion would be implemented:
String.isIdentifierStart('\uD87E\uDC00'); // should be `false`
String.isIdentifierStart('\u{2F800}'); // should be `true`
// this is impossible, since `'\uD87E\uDC00' === '\u{2F800}'` and there is no way to distinguish these strings
On Aug 24, 2013, at 14:19 , Mathias Bynens <mathias at qiwi.be> wrote:
I just want to make sure it’s possible to write a polyfill (in ES5) for the String.isIdentifier{Start,Part} strawman. As long as String.isIdentifierStart('\uD87E\uDC00') and String.isIdentifierStart('\u{2F800}') are expected to return different results (as Allen suggests), this is impossible.
Allen didn't discuss these functions - the strawman didn't exist during the previous round of this discussion. Your code uses string literals, and in ES6 string literals '\uD87E\uDC00' === '\u{2F800}'. This means the functions proposed in my Identifier Identification strawman cannot tell the difference, but then the specification doesn't require them to.
What Allen suggested, and the current ES6 spec says, is that identifiers in source text using different Unicode escape forms behave differently:
var \uD87E\uDC00;
throws an exception, while
var \u{2F800};
declares a variable.
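One way to observe this difference in a given engine is to compile both declarations and see which one is rejected. The sketch below assumes an engine that follows the behaviour described here; `parses` is a hypothetical helper, not a standard API, and the result depends on the engine under test.

```javascript
// Hypothetical helper: true if the engine accepts `source` as a function
// body, false if compiling it raises a SyntaxError.
function parses(source) {
  try {
    new Function(source); // compiles the body without running it
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

// Backslashes are doubled so the escapes reach the parser as source text.
const surrogateEscapes = parses('var \\uD87E\\uDC00;');
const codePointEscape = parses('var \\u{2F800};');

console.log(surrogateEscapes, codePointEscape);
```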
I don't think that's a technical problem. String.isIdentifier{Start,Part}, as I proposed them, don't deal with actual identifiers in source text; they check individual identifier characters. The functions are intended to be called by a parser, and it's up to the parser to deal with escaping rules, throwing exceptions or unescaping as specified before passing code points to String.isIdentifier{Start,Part}. Calling the functions with string literals doesn't seem like a useful use case.
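The division of labour described here (the parser unescapes, then passes code points to the classifier) could look roughly like the following. This is only a sketch under assumptions: `decodeIdentifierSource` and `isIdentifierStart` are hypothetical names, the classifier is a stand-in built on ES2018 Unicode property escapes rather than the strawman's algorithm, and the decoder uses a sticky regular expression (ES6).

```javascript
// Hypothetical stand-in for the proposed String.isIdentifierStart.
// Takes a single code point, as a parser would supply it.
function isIdentifierStart(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  return ch === '$' || ch === '_' || /^\p{ID_Start}$/u.test(ch);
}

// Hypothetical parser-side step: turn raw identifier source text into
// code points, decoding \uXXXX and \u{...} escapes one at a time.
function decodeIdentifierSource(raw) {
  const escape = /\\u\{([0-9A-Fa-f]+)\}|\\u([0-9A-Fa-f]{4})/y;
  const codePoints = [];
  let i = 0;
  while (i < raw.length) {
    if (raw[i] === '\\') {
      escape.lastIndex = i;
      const m = escape.exec(raw);
      if (!m) throw new SyntaxError('Malformed Unicode escape at index ' + i);
      const cp = parseInt(m[1] || m[2], 16);
      if (cp > 0x10FFFF) throw new SyntaxError('Code point out of range');
      codePoints.push(cp);
      i = escape.lastIndex;
    } else {
      const cp = raw.codePointAt(i);
      codePoints.push(cp);
      i += cp > 0xFFFF ? 2 : 1; // skip both halves of a raw surrogate pair
    }
  }
  return codePoints;
}

// Each \uXXXX escape yields its own code point, so the surrogate spelling
// reaches the classifier as a lone surrogate and is rejected there.
const viaBraces = decodeIdentifierSource('\\u{2F800}');
const viaPair = decodeIdentifierSource('\\uD87E\\uDC00');
console.log(isIdentifierStart(viaBraces[0]), isIdentifierStart(viaPair[0]));
```

With this split, the classifier never needs to know which escape form was used; the different treatment of the two spellings comes entirely from how the parser decodes them.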
I do think it's a problem in learning and understanding the language. Having different rules for \uD87E\uDC00 in string literals and identifiers, and therefore also for identifiers embedded in strings passed to eval(), adds yet another of those random inconsistencies that already litter ECMAScript, and ensures a "wat" moment for everybody who comes across them.
On a side note, the strawman hasn't been discussed by TC39 and hasn't been accepted for either ES6 or ES7, so it may be a bit premature to polyfill it. Informal feedback from some members indicated that they'd rather discuss it in the context of a complete proposal for Unicode character properties support.
On 25 Aug 2013, at 04:17, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:
I don't think that's a technical problem. String.isIdentifier{Start,Part}, as I proposed them, don't deal with actual identifiers in source text; they check individual identifier characters.
The functions are intended to be called by a parser, and it's up to the parser to deal with escaping rules, throwing exceptions or unescaping as specified before passing code points to String.isIdentifier{Start,Part}. Calling the functions with string literals doesn't seem like a useful use case.
Ah, I see. Step 1 of the proposed algorithm in strawman:identifier_identification would convert the string to a single code point. Thanks for clarifying!
I would suggest adding something like String.isIdentifier which accepts a multi-symbol string or an array of code points to the strawman. Seems useful to be able to do String.isIdentifier('foobar').
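For illustration, a whole-string check of this kind might be sketched as follows. `isIdentifierName` is a hypothetical name, not part of the strawman; the sketch assumes ES2018 Unicode property escapes, treats its input as already unescaped, and does not filter reserved words.

```javascript
// Hypothetical whole-string check: first symbol must be an identifier
// start, every following symbol an identifier part.
function isIdentifierName(str) {
  if (str.length === 0) return false;
  let first = true;
  for (const ch of str) { // for...of iterates by code point, not code unit
    const ok = first
      ? /^[$_\p{ID_Start}]$/u.test(ch)
      : /^[$_\u200C\u200D\p{ID_Continue}]$/u.test(ch);
    if (!ok) return false;
    first = false;
  }
  return true;
}

console.log(isIdentifierName('foobar'));
console.log(isIdentifierName('1abc'));
```

Because the function receives a string value, it cannot tell which escape form, if any, produced that value in the caller's source text.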
I do think it's a problem in learning and understanding the language. Having different rules for \uD87E\uDC00 in string literals and identifiers, and therefore also for identifiers embedded in strings passed to eval(), adds yet another of those random inconsistencies that already litter ECMAScript, and ensures a "wat" moment for everybody who comes across them.
Agreed.
On Aug 24, 2013, at 23:43 , Mathias Bynens <mathias at qiwi.be> wrote:
I would suggest adding something like String.isIdentifier which accepts a multi-symbol string or an array of code points to the strawman. Seems useful to be able to do String.isIdentifier('foobar').
What would be the use case(s) for that? Would it accept only an actual identifier or all possible escaped forms of one (i.e., only "𠮷野家" or also "\u{20BB7}野\u5BB6")?
On 26 Aug 2013, at 04:08, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:
On Aug 24, 2013, at 23:43 , Mathias Bynens <mathias at qiwi.be> wrote:
I would suggest adding something like String.isIdentifier which accepts a multi-symbol string or an array of code points to the strawman. Seems useful to be able to do String.isIdentifier('foobar').
What would be the use case(s) for that?
Tools like mothereff.in/js-escapes.
Would it accept only an actual identifier or all possible escaped forms of one (i.e., only "𠮷野家" or also "\u{20BB7}野\u5BB6")?
Both, since "𠮷野家" === "\u{20BB7}野\u5BB6". That string is also equal to "\uD842\uDFB7\u91CE\u5BB6", although it hasn’t been decided if that should be a valid identifier too (since it uses the surrogate code points explicitly): ecmascript#469. That complicates things.
On Sep 5, 2013, at 1:06 , Mathias Bynens <mathias at qiwi.be> wrote:
Tools like mothereff.in/js-escapes.
I see nothing on that page about identifiers.
Both, since "𠮷野家" === "\u{20BB7}野\u5BB6". That string is also equal to "\uD842\uDFB7\u91CE\u5BB6", although it hasn’t been decided if that should be a valid identifier too (since it uses the surrogate code points explicitly): ecmascript#469. That complicates things.
The question is about the purpose of the function: Should isIdentifier just help with character classification, or with parsing and unescaping as well? Without real use cases, we can't decide. Note that String methods in general don't know anything about Unicode escapes - those are handled by the ECMAScript or JSON parsers.
On 5 Sep 2013, at 19:37, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:
On Sep 5, 2013, at 1:06 , Mathias Bynens <mathias at qiwi.be> wrote:
Tools like mothereff.in/js-escapes.
I see nothing on that page about identifiers.
Sorry, wrong link. I meant this one: mothereff.in/js-variables
Note that String methods in general don't know anything about Unicode escapes - those are handled by the ECMAScript or JSON parsers.
Of course.
On Sep 5, 2013, at 10:40 , Mathias Bynens <mathias at qiwi.be> wrote:
Sorry, wrong link. I meant this one: mothereff.in/js-variables
That's a nice page! But I doubt that developers will create such tools often enough to make a convenience function in the standard worthwhile. I proposed isIdentifierStart and isIdentifierPart because recognizing identifier characters across ECMAScript and Unicode versions requires large data tables; implementing the unescaping rules and filtering reserved words isn't all that hard.
For example, U+2F800 CJK COMPATIBILITY IDEOGRAPH-2F800 is a supplementary Unicode character in the [Lo] category, which leads me to believe it should be allowed in identifier names. After all, the spec says:
UnicodeLetter = any character in the Unicode categories “Uppercase letter
However, since JavaScript uses UCS-2 internally, this symbol is represented by a surrogate pair, i.e. two code units: \uD87E\uDC00. The spec, however, defines “character” as follows: es5.github.com/x6.html#x6
Throughout the rest of this document, the phrase “code unit” and the word
So, based on this definition of “character” (code unit), U+2F800 should not be allowed in an identifier name after all.
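For reference, the two code units follow from the standard UTF-16 encoding arithmetic. The sketch below works through it for U+2F800; `toSurrogatePair` is a hypothetical helper name, written in ES5 style.

```javascript
// Encode a supplementary code point (U+10000..U+10FFFF) as two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a supplementary code point');
  }
  var offset = codePoint - 0x10000;     // 20-bit value
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x2F800);
console.log(pair[0].toString(16), pair[1].toString(16));
// prints: d87e dc00
```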
I’m not sure if my interpretation of the spec is correct, though. Could anyone confirm or deny this? Are supplementary (non-BMP) Unicode characters allowed in identifiers or not? For example, is this valid JavaScript or not?