Backwards compatibility and U+2E2F in `Identifier`s

# Mathias Bynens (4 years ago)

I wrote a (new) script that generates a regular expression that matches valid JavaScript identifiers as per ECMAScript 5.1 / Unicode v6.2.0: mathiasbynens.be/demo/javascript-identifier-regex

Then, I made it do the same thing according to the latest ECMAScript 6 draft, which refers to Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax (www.unicode.org/reports/tr31).

After comparing the output, I noticed that both regular expressions are identical except for the following: ECMAScript 5 allows U+2E2F VERTICAL TILDE in IdentifierStart and IdentifierPart, but ECMAScript 6 / Unicode TR31 doesn’t.

Was this potentially breaking change intentional? I’m fine with disallowing U+2E2F, but only if we’re sure it doesn’t break any existing code.

# Mathias Bynens (4 years ago)

On 19 Aug 2013, at 11:25, Mathias Bynens <mathias at qiwi.be> wrote:

> After comparing the output, I noticed that both regular expressions are identical except for the following: ECMAScript 5 allows U+2E2F VERTICAL TILDE in IdentifierStart and IdentifierPart, but ECMAScript 6 / Unicode TR31 doesn’t.
>
> Was this potentially breaking change intentional? I’m fine with disallowing U+2E2F, but only if we’re sure it doesn’t break any existing code.

Follow-up: since this thread is being ignored, I filed ecmascript#1802.

# Brendan Eich (4 years ago)

Thanks for filing. I don't recall any reason for this and it seems bad to break compatibility.

It may be that Norbert and Allen just missed your post on Monday; cc'ing them.

# Norbert Lindenberg (4 years ago)

I had no intentions specific to U+2E2F when I proposed relying on UTR 31 - the change is simply the effect of the character properties that the Unicode Technical Committee assigned to this character.

I don't think there's a real problem. U+2E2F was added in Unicode version 5.1. ECMAScript 5.1 requires only support for Unicode 3.0, and warns "If portability is a concern, programmers should only employ identifier characters defined in Unicode 3.0" (section 7.6). IE 10 throws a SyntaxError if the character is used in an identifier.

BTW, if that's the only difference between the regular expressions for ES 5.1 and ES 6, then at least one of them is wrong - ES 6 allows supplementary characters in identifiers, while ES 5.1 doesn't.

Norbert
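Editorial sketch: the SyntaxError report above can be checked in any engine with a small probe. The helper name `parsesAsIdentifier` is hypothetical and not from the thread. It only asks the running engine's parser, so the answer depends on the engine and its Unicode data, and it is not a rigorous validator (reserved words, for instance, are also rejected).

```javascript
// Hypothetical helper: asks the running engine whether `name` parses
// as the binding identifier in a `var` declaration.
function parsesAsIdentifier(name) {
  try {
    // The Function constructor parses its body without running it.
    new Function('var ' + name + ';');
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // e.g. an EvalError under a restrictive content security policy
  }
}

// U+2E2F VERTICAL TILDE: the result differs between ES5-era and later engines.
console.log('U+2E2F accepted:', parsesAsIdentifier('\u2E2F'));

// U+02B0 MODIFIER LETTER SMALL H: also in category Lm, but without the
// Pattern_Syntax property, so it serves as a control case.
console.log('U+02B0 accepted:', parsesAsIdentifier('\u02B0'));
```

Running the same probe in several engines is one way to see how consistently U+2E2F was ever accepted in practice.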

# Mathias Bynens (4 years ago)

> I had no intentions specific to U+2E2F when I proposed relying on UTR 31 - the change is simply the effect of the character properties that the Unicode Technical Committee assigned to this character.
>
> I don't think there's a real problem. U+2E2F was added in Unicode version 5.1. ECMAScript 5.1 requires only support for Unicode 3.0, and warns "If portability is a concern, programmers should only employ identifier characters defined in Unicode 3.0" (section 7.6). IE 10 throws a SyntaxError if the character is used in an identifier.
>
> BTW, if that's the only difference between the regular expressions for ES 5.1 and ES 6, then at least one of them is wrong - ES 6 allows supplementary characters in identifiers, while ES 5.1 doesn't.

It’s the only difference in the BMP range. (Like you said, differences in the astral range are to be expected, since astral symbols weren’t allowed in ES5. No back-compat issues there.)
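Editorial sketch of the BMP versus astral distinction, assuming an engine with Unicode property escapes (ES2018 or later). U+20000 is an arbitrary supplementary code point chosen for illustration, and `\p{ID_Start}` stands in for the generated regular expressions, which are not reproduced here.

```javascript
// U+20000 is a CJK ideograph outside the BMP (an "astral" symbol).
const astral = '\u{20000}';

// In UTF-16 it occupies two code units, a high and a low surrogate.
const codeUnits = astral.split('');
console.log(codeUnits.length);

// Code-unit view (how ES5 saw source text): each surrogate is tested on
// its own, and a lone surrogate is not an ID_Start code point.
const acceptedPerCodeUnit = codeUnits.every(function (unit) {
  return /^\p{ID_Start}$/u.test(unit);
});

// Code-point view (ES6): the surrogate pair is one code point, U+20000.
const acceptedPerCodePoint = /^\p{ID_Start}$/u.test(astral);

console.log({ acceptedPerCodeUnit, acceptedPerCodePoint });
```

Because no ES5 identifier could contain such a character, differences in this range cannot break existing ES5 code.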

# Anne van Kesteren (4 years ago)

On Mon, Aug 19, 2013 at 5:25 AM, Mathias Bynens <mathias at qiwi.be> wrote:

> After comparing the output, I noticed that both regular expressions are identical except for the following: ECMAScript 5 allows U+2E2F VERTICAL TILDE in IdentifierStart and IdentifierPart, but ECMAScript 6 / Unicode TR31 doesn’t.

Per ES6 identifiers start with code points whose category is ID_Start which per www.unicode.org/reports/tr31 includes Lm which per www.unicode.org/Public/UNIDATA/UnicodeData.txt is true for U+2E2F. So why exactly is it disallowed?

Unless I'm missing something, the discussion we had in TC39 yesterday was moot, and these bugs are invalid:

ecmascript#1802, bugzilla.mozilla.org/show_bug.cgi?id=917436, bugs.webkit.org/show_bug.cgi?id=121541, code.google.com/p/v8/issues/detail?id=2892

And IE10 and maybe IE11 have a bug in not allowing it.

# Mathias Bynens (4 years ago)

On 18 Sep 2013, at 21:05, Anne van Kesteren <annevk at annevk.nl> wrote:

> On Mon, Aug 19, 2013 at 5:25 AM, Mathias Bynens <mathias at qiwi.be> wrote:
>
> > After comparing the output, I noticed that both regular expressions are identical except for the following: ECMAScript 5 allows U+2E2F VERTICAL TILDE in IdentifierStart and IdentifierPart, but ECMAScript 6 / Unicode TR31 doesn’t.
>
> Per ES6 identifiers start with code points whose category is ID_Start which per www.unicode.org/reports/tr31 includes Lm which per www.unicode.org/Public/UNIDATA/UnicodeData.txt is true for U+2E2F. So why exactly is it disallowed?

ID_Start includes code points in the Lm category indeed, but then later explicitly disallows Pattern_Syntax and Pattern_White_Space code points. As it says on the page you linked to:

> In set notation, this is `[[:L:][:Nl:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]]` plus stability extensions.

U+2E2F has the Pattern_Syntax property and is thus not a valid ID_Start code point.
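Editorial sketch of the derivation quoted above, assuming an engine with Unicode property escapes (ES2018 or later). The function name `isDerivedIdStart` is hypothetical, and the "stability extensions" part of the definition is deliberately left out, so this approximates ID_Start rather than reimplementing it.

```javascript
// Approximates [[:L:][:Nl:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]]
// for a single code point. Stability extensions are ignored (assumption).
function isDerivedIdStart(ch) {
  const isLetterOrLetterNumber = /^[\p{L}\p{Nl}]$/u.test(ch);
  const isPatternSyntax = /^\p{Pattern_Syntax}$/u.test(ch);
  const isPatternWhiteSpace = /^\p{Pattern_White_Space}$/u.test(ch);
  return isLetterOrLetterNumber && !isPatternSyntax && !isPatternWhiteSpace;
}

const verticalTilde = '\u2E2F'; // U+2E2F VERTICAL TILDE

// General category Lm, so it is in the base set of letters...
const isLm = /^\p{Lm}$/u.test(verticalTilde);
// ...but it also has Pattern_Syntax, which removes it again.
const hasPatternSyntax = /^\p{Pattern_Syntax}$/u.test(verticalTilde);
// The engine's own ID_Start data, for comparison with the derivation.
const builtInIdStart = /^\p{ID_Start}$/u.test(verticalTilde);

console.log({
  isLm,
  hasPatternSyntax,
  derived: isDerivedIdStart(verticalTilde),
  builtInIdStart
});
```

The two subtractions are what reconcile both readings in this thread: the Lm category alone suggests the character is allowed, while the full definition excludes it.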