Question about allowed characters in identifier names

# Mathias Bynens (13 years ago)

For example, U+2F800 CJK COMPATIBILITY IDEOGRAPH-2F800 is a supplementary Unicode character in the [Lo] category, which leads me to believe it should be allowed in identifier names. After all, the spec says:

> UnicodeLetter = any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

However, since JavaScript uses UCS-2 internally, this symbol is represented by a surrogate pair, i.e. two code units: \uD87E\uDC00.
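A minimal sketch of the surrogate-pair arithmetic behind that claim. The helper name `toSurrogatePair` is made up for illustration and is not part of any spec or library:

```javascript
// Hypothetical helper: split a supplementary code point (above U+FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not a supplementary code point');
  }
  var offset = codePoint - 0x10000;     // 20-bit value
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x2F800);
var hex = pair.map(function (unit) {
  return '\\u' + unit.toString(16).toUpperCase();
}).join('');
console.log(hex); // \uD87E\uDC00
```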

The spec, however, defines “character” as follows: es5.github.com/x6.html#x6

> Throughout the rest of this document, the phrase “code unit” and the word “character” will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text. The phrase “Unicode character” will be used to refer to the abstract linguistic or typographical unit represented by a single Unicode scalar value (which may be longer than 16 bits and thus may be represented by more than one code unit). The phrase “code point” refers to such a Unicode scalar value. “Unicode character” only refers to entities represented by single Unicode scalar values: the components of a combining character sequence are still individual “Unicode characters,” even though a user might think of the whole sequence as a single character.

So, based on this definition of “character” (code unit), U+2F800 should not be allowed in an identifier name after all.

I’m not sure if my interpretation of the spec is correct, though. Could anyone confirm or deny this? Are supplementary (non-BMP) Unicode characters allowed in identifiers or not? For example, is this valid JavaScript or not?

// Using U+2F800 as an identifier
var \ud87e\udc00 = 42; \ud87e\udc00

Regards,
Mathias

# Allen Wirfs-Brock (13 years ago)

Yes, this interpretation is consistent with my understanding of the requirements as expressed in the ES5 spec. ES5 logically only works with UCS-2 characters corresponding to the BMP.

Some (probably most) implementations pass UTF-16 encodings of supplemental characters to the JavaScript compiler. According to the spec, these are processed as two UCS-2 characters neither of which would be a member of any of the above character categories. Their use in an identifier context should result in a syntax error. Within a string literal, the two UCS-2 characters would generate two string elements.
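A small sketch of the string-literal case, assuming only the ES5 `\uXXXX` escapes. The helper `isSurrogate` is illustrative and not a built-in:

```javascript
// The supplementary character written as two \uXXXX escapes yields
// a string with two elements (code units).
var s = '\uD87E\uDC00';

console.log(s.length);                     // 2
console.log(s.charCodeAt(0).toString(16)); // d87e (high surrogate)
console.log(s.charCodeAt(1).toString(16)); // dc00 (low surrogate)

// Taken on its own, each code unit lies in the surrogate block
// U+D800..U+DFFF, which is not one of the letter categories.
function isSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDFFF;
}
console.log(isSurrogate(s.charCodeAt(0)) && isSurrogate(s.charCodeAt(1))); // true
```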

This is something that I think can be clarified for the ES6 specification, independent of the ongoing discussion of the possibility of 21-bit string elements. My preference for the future is to simply define the input alphabet of ECMAScript as all Unicode characters, independent of actual encoding. `var \ud87e\udc00` would probably still be illegal because each \uXXXX defines a separate character, but `var \u{2f800} = 42;` should be fine, as should the direct unescaped occurrence of that character.

Allen

# Mathias Bynens (12 years ago)

On 27 Feb 2012, at 22:58, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

> This is something that I think can be clarified for the ES6 specification, independent of the on-going discussion of the possibility of 21-bit string elements. My preference for the future is to simply define the input alphabet of ECMAScript as all Unicode characters independent of actual encoding.

That sounds nice.

> `var \ud87e\udc00` would probably still be illegal because each \uXXXX defines a separate character, but `var \u{2f800} = 42;` should be fine, as should the direct unescaped occurrence of that character.

Wouldn’t this be confusing, though?

global['\u{2F800}'] = 42; // would work (compatible with ES5 behavior)
global['\uD87E\uDC00'] = 42; // would work, too, since `'\uD87E\uDC00' == '\u{2F800}'` (compatible with ES5 behavior)
var \uD87E\uDC00 = 42; // would fail (compatible with ES5 behavior)
var \u{2F800} = 42; // would work (as per your comment; incompatible with ES5 behavior)
var 丽 = 42; // would work (as per your comment; incompatible with ES5 behavior)

Using astral symbols in identifiers would be backwards incompatible, even if the raw (unescaped) symbol is used. There’d be no way to use such an identifier in an ES5 environment. Is this a problem?

# Mathias Bynens (12 years ago)

On 24 Aug 2013, at 11:02, Mathias Bynens <mathias at qiwi.be> wrote:

> Using astral symbols in identifiers would be backwards incompatible, even if the raw (unescaped) symbol is used. There’d be no way to use such an identifier in an ES5 environment. Is this a problem?

To clarify: consider what the Identifier Identification strawman (http://wiki.ecmascript.org/doku.php?id=strawman:identifier_identification), or any script that emulates similar behavior, should do if Allen’s suggestion were implemented:

String.isIdentifierStart('\uD87E\uDC00'); // should be `false`
String.isIdentifierStart('\u{2F800}'); // should be `true`
// this is impossible, since `'\uD87E\uDC00' === '\u{2F800}'` and there is no way to distinguish these strings

# Norbert Lindenberg (12 years ago)

On Aug 24, 2013, at 2:02 , Mathias Bynens <mathias at qiwi.be> wrote:

> Wouldn’t this be confusing, though?

I do think it's confusing that \uD87E\uDC00 is not allowed in the current ES6 spec, and have reported this as part 4 of https://bugs.ecmascript.org/show_bug.cgi?id=501

The issue was also discussed, without a conclusion, at the TC 39 meeting in July 2012 - look for "# Unicode support": https://mail.mozilla.org/pipermail/es-discuss/2012-July/024207.html

> Using astral symbols in identifiers would be backwards incompatible, even if the raw (unescaped) symbol is used. There’d be no way to use such an identifier in an ES5 environment. Is this a problem?

It's the same problem as when using

let a = 42;

You should not expect that a program using new ES6 features will run on ES5 implementations (although it might run on some that have already added the ES6 features used).

Norbert

# Mathias Bynens (12 years ago)

On 24 Aug 2013, at 22:13, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:

> I do think it's confusing that \uD87E\uDC00 is not allowed in the current ES6 spec, and have reported this as part 4 of https://bugs.ecmascript.org/show_bug.cgi?id=501

Agreed. It would make much more sense to just treat \u{2F800} and \uD87E\uDC00 exactly the same way, even outside of string contexts.

> You should not expect that a program using new ES6 features will run on ES5 implementations (although it might run on some that have already added the ES6 features used).

I just want to make sure it’s possible to write a polyfill (in ES5) for the String.isIdentifier{Start,Part} strawman. As long as String.isIdentifierStart('\uD87E\uDC00') and String.isIdentifierStart('\u{2F800}') are expected to return different results (as Allen suggests), this is impossible.

# Norbert Lindenberg (12 years ago)

On Aug 24, 2013, at 5:42 , Mathias Bynens <mathias at qiwi.be> wrote:

> To clarify: consider what the Identifier Identification strawman or any scripts that emulate similar behavior should do if Allen’s suggestion would be implemented:
>
>     String.isIdentifierStart('\uD87E\uDC00'); // should be `false`
>     String.isIdentifierStart('\u{2F800}'); // should be `true`
>     // this is impossible, since `'\uD87E\uDC00' === '\u{2F800}'` and there is no way to distinguish these strings

On Aug 24, 2013, at 14:19 , Mathias Bynens <mathias at qiwi.be> wrote:

> I just want to make sure it’s possible to write a polyfill (in ES5) for the String.isIdentifier{Start,Part} strawman. As long as String.isIdentifierStart('\uD87E\uDC00') and String.isIdentifierStart('\u{2F800}') are expected to return different results (as Allen suggests), this is impossible.

Allen didn't discuss these functions - the strawman didn't exist during the previous round of this discussion. Your code uses string literals, and in ES6 string literals '\uD87E\uDC00' === '\u{2F800}'. This means the functions proposed in my Identifier Identification strawman cannot tell the difference, but then the specification doesn't require them to.

What Allen suggested, and the current ES6 spec says, is that identifiers in source text using different Unicode escape forms behave differently:

var \uD87E\uDC00;

throws an exception, while

var \u{2F800};

declares a variable.
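One way to check this in a given engine is to hand both declarations to the `Function` constructor and see which one parses. This is only a sketch, assuming an engine that implements the ES6 rules described above; `parses` is a hypothetical helper, and older engines may answer differently.

```javascript
// Hypothetical helper: returns true if `source` parses as a function
// body, false if the engine rejects it with a SyntaxError.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// The backslashes are doubled so that the escapes reach the parser as
// source text instead of being decoded by the string literal first.
console.log(parses('var \\uD87E\\uDC00;')); // false under the ES6 rules
console.log(parses('var \\u{2F800};'));     // true under the ES6 rules
```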

I don't think that's a technical problem. String.isIdentifier{Start,Part}, as I proposed them, don't deal with actual identifiers in source text; they check individual identifier characters. The functions are intended to be called by a parser, and it's up to the parser to deal with escaping rules, throwing exceptions or unescaping as specified before passing code points to String.isIdentifier{Start,Part}. Calling the functions with string literals doesn't seem like a useful use case.
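The division of labour described here could look roughly like the sketch below. The strawman functions were never standardized, so `isIdentifierStart` is a stand-in built on Unicode property escapes (ES2018), and `firstIdentifierCodePoint` is a hypothetical parser step invented for illustration.

```javascript
// Stand-in for the proposed String.isIdentifierStart: classifies one
// code point. Not the strawman's actual algorithm.
function isIdentifierStart(codePoint) {
  return /^[$_\p{ID_Start}]$/u.test(String.fromCodePoint(codePoint));
}

// Hypothetical parser step: read the first symbol of an identifier from
// source text, decode an escape if present, then classify the result.
function firstIdentifierCodePoint(sourceText) {
  var braced = /^\\u\{([0-9A-Fa-f]+)\}/.exec(sourceText);
  var fourDigit = /^\\u([0-9A-Fa-f]{4})/.exec(sourceText);
  var codePoint;
  if (braced) {
    codePoint = parseInt(braced[1], 16);
  } else if (fourDigit) {
    codePoint = parseInt(fourDigit[1], 16); // one escape, one code point
  } else {
    codePoint = sourceText.codePointAt(0);  // raw symbol, pair combined
  }
  // The range check also catches undefined (empty input).
  if (!(codePoint <= 0x10FFFF) || !isIdentifierStart(codePoint)) {
    throw new SyntaxError('invalid identifier start');
  }
  return codePoint;
}

console.log(firstIdentifierCodePoint('\\u{2F800} = 42').toString(16)); // 2f800
```

With this split, `\uD87E` is rejected by the parser step because a lone surrogate is not an identifier start; the classification function itself never sees escape syntax.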

I do think it's a problem in learning and understanding the language. Having different rules for \uD87E\uDC00 in string literals and identifiers, and therefore also for identifiers embedded in strings passed to eval(), adds yet another of those random inconsistencies that already litter ECMAScript, and ensures a "wat" moment for everybody who comes across them.

On a side note, the strawman hasn't been discussed by TC39 and hasn't been accepted for either ES6 or ES7, so it may be a bit premature to polyfill it. Informal feedback from some members indicated that they'd rather discuss it in the context of a complete proposal for Unicode character properties support.

Norbert

# Mathias Bynens (12 years ago)

On 25 Aug 2013, at 04:17, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:

> I don't think that's a technical problem. String.isIdentifier{Start,Part}, as I proposed them, don't deal with actual identifiers in source text; they check individual identifier characters.
>
> The functions are intended to be called by a parser, and it's up to the parser to deal with escaping rules, throwing exceptions or unescaping as specified before passing code points to String.isIdentifier{Start,Part}. Calling the functions with string literals doesn't seem like a useful use case.

Ah, I see. Step 1 of the proposed algorithm in http://wiki.ecmascript.org/doku.php?id=strawman:identifier_identification would convert the string to a single code point. Thanks for clarifying!

I would suggest adding something like String.isIdentifier which accepts a multi-symbol string or an array of code points to the strawman. Seems useful to be able to do String.isIdentifier('foobar')
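A rough sketch of what such a multi-symbol check might look like for unescaped text. `String.isIdentifier` does not exist, so this uses a free-standing, hypothetical `isIdentifier` built on Unicode property escapes (ES2018), and it deliberately ignores reserved words and escape sequences:

```javascript
// Hypothetical multi-symbol check. Follows the shape of the ES6
// IdentifierName grammar for unescaped text: one IdentifierStart
// symbol, then any number of IdentifierPart symbols.
var identifierName = /^[$_\p{ID_Start}][$\u200C\u200D\p{ID_Continue}]*$/u;

function isIdentifier(string) {
  return identifierName.test(string);
}

console.log(isIdentifier('foobar'));  // true
console.log(isIdentifier('1foo'));    // false
console.log(isIdentifier('foo-bar')); // false
```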

> I do think it's a problem in learning and understanding the language. Having different rules for \uD87E\uDC00 in string literals and identifiers, and therefore also for identifiers embedded in strings passed to eval(), adds yet another of those random inconsistencies that already litter ECMAScript, and ensures a "wat" moment for everybody who comes across them.

Agreed.

# Norbert Lindenberg (12 years ago)

On Aug 24, 2013, at 23:43 , Mathias Bynens <mathias at qiwi.be> wrote:

> I would suggest adding something like String.isIdentifier which accepts a multi-symbol string or an array of code points to the strawman. Seems useful to be able to do String.isIdentifier('foobar')

What would be the use case(s) for that? Would it accept only an actual identifier or all possible escaped forms of one (i.e., only "𠮷野家" or also "\u{20BB7}野\u5BB6")?

Norbert

# Mathias Bynens (12 years ago)

On 26 Aug 2013, at 04:08, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:

> On Aug 24, 2013, at 23:43 , Mathias Bynens <mathias at qiwi.be> wrote:
>
>> I would suggest adding something like String.isIdentifier which accepts a multi-symbol string or an array of code points to the strawman. Seems useful to be able to do String.isIdentifier('foobar')
>
> What would be the use case(s) for that?

Tools like http://mothereff.in/js-escapes.

> Would it accept only an actual identifier or all possible escaped forms of one (i.e., only "𠮷野家" or also "\u{20BB7}野\u5BB6")?

Both, since "𠮷野家" === "\u{20BB7}野\u5BB6". That string is also equal to "\uD842\uDFB7\u91CE\u5BB6" although it hasn’t been decided if that should be a valid identifier too (since it uses the surrogate code points explicitly): ecmascript#469. That complicates things.
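A quick sketch of why a String method cannot tell these spellings apart, assuming an engine with ES6 `\u{...}` escapes (variable names are illustrative only). By the time the value exists, the escapes have already been decoded:

```javascript
var raw = '𠮷野家';
var braced = '\u{20BB7}野\u5BB6';
var surrogates = '\uD842\uDFB7\u91CE\u5BB6';

console.log(raw === braced);             // true
console.log(braced === surrogates);      // true
console.log(braced.length);              // 4 (code units)
console.log(Array.from(braced).length);  // 3 (code points)
```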

# Norbert Lindenberg (12 years ago)

On Sep 5, 2013, at 1:06 , Mathias Bynens <mathias at qiwi.be> wrote:

> Tools like http://mothereff.in/js-escapes.

I see nothing on that page about identifiers.

> Both, since "𠮷野家" === "\u{20BB7}野\u5BB6". That string is also equal to "\uD842\uDFB7\u91CE\u5BB6" although it hasn’t been decided if that should be a valid identifier too (since it uses the surrogate code points explicitly): ecmascript#469. That complicates things.

The question is about the purpose of the function: Should isIdentifier just help with character classification, or with parsing and unescaping as well? Without real use cases, we can't decide. Note that String methods in general don't know anything about Unicode escapes - those are handled by the ECMAScript or JSON parsers.

Norbert

# Mathias Bynens (12 years ago)

On 5 Sep 2013, at 19:37, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:

> On Sep 5, 2013, at 1:06 , Mathias Bynens <mathias at qiwi.be> wrote:
>
>> Tools like http://mothereff.in/js-escapes.
>
> I see nothing on that page about identifiers.

Sorry, wrong link. I meant this one: http://mothereff.in/js-variables

> Note that String methods in general don't know anything about Unicode escapes - those are handled by the ECMAScript or JSON parsers.

Of course.

# Norbert Lindenberg (12 years ago)

On Sep 5, 2013, at 10:40 , Mathias Bynens <mathias at qiwi.be> wrote:

> Sorry, wrong link. I meant this one: http://mothereff.in/js-variables

That's a nice page! But I doubt that developers will create such tools often enough to make a convenience function in the standard worthwhile. I proposed isIdentifierStart and isIdentifierPart because recognizing identifier characters across ECMAScript and Unicode versions requires large data tables; implementing the unescaping rules and filtering reserved words isn't all that hard.
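
As a rough illustration of that split, here is a sketch of how a tool could layer unescaping and reserved-word filtering on top of two classification helpers. Everything in it is an assumption made for illustration: the function signatures are hypothetical rather than those of the strawman, the helpers lean on Unicode property escapes in regular expressions (a feature added to the language well after this thread) in place of hand-maintained data tables, and the reserved-word list is abbreviated.

```javascript
// Hypothetical classification helpers taking a single code point.
function isIdentifierStart(cp) {
  const ch = String.fromCodePoint(cp);
  return ch === "$" || ch === "_" || /^\p{ID_Start}$/u.test(ch);
}

function isIdentifierPart(cp) {
  const ch = String.fromCodePoint(cp);
  // U+200C and U+200D (ZWNJ, ZWJ) are allowed after the first character.
  return ch === "$" || ch === "\u200C" || ch === "\u200D" ||
    /^\p{ID_Continue}$/u.test(ch);
}

// Resolve \uXXXX and \u{X...} escapes found in raw source text.
function unescapeIdentifier(source) {
  return source.replace(
    /\\u\{([0-9a-fA-F]+)\}|\\u([0-9a-fA-F]{4})/g,
    (match, braced, fixed) =>
      String.fromCodePoint(parseInt(braced || fixed, 16))
  );
}

// Abbreviated list, for illustration only.
const RESERVED = new Set([
  "break", "case", "class", "const", "delete", "else", "for", "function",
  "if", "in", "new", "null", "return", "this", "true", "false", "typeof",
  "var", "void", "while", "with",
]);

function isIdentifier(source) {
  let name;
  try {
    name = unescapeIdentifier(source);
  } catch (e) {
    return false; // escape denotes a value above U+10FFFF
  }
  if (name.length === 0 || RESERVED.has(name)) return false;
  // Iterate by code point so surrogate pairs are classified as one character.
  // Note: this also accepts a pair written as two \uXXXX escapes, which is
  // exactly the case the thread leaves undecided.
  const codePoints = Array.from(name, (ch) => ch.codePointAt(0));
  return isIdentifierStart(codePoints[0]) &&
    codePoints.slice(1).every((cp) => isIdentifierPart(cp));
}

console.log(isIdentifier("foobar"), isIdentifier("var"), isIdentifier("𠮷野家"));
```

The unescaping and filtering parts stay short, which matches the point above; the character classification is the part that needs Unicode data, here supplied by the regular expression engine.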

Norbert