Making the identifier identification strawman less restrictive

# Mathias Bynens (12 years ago)

This is about the identifier identification strawman: strawman:identifier_identification

For tooling, it’s better to have a false positive than to have a false negative. In the case of identifier identification, it’s more useful to flag an identifier that is permitted as per the latest Unicode version as valid instead of rejecting it, even if it’s perhaps not supported in some engines that use data tables based on older Unicode versions.

In general, tools try to be lenient rather than restrictive in the input they accept. The list of ECMAScript 5 parsers that handle non-ASCII symbols in identifiers in the strawman backs this up: instead of using Unicode 3.0.0 data, more recent Unicode versions are used, in an attempt to handle as many technically valid identifiers as possible.

  • Esprima and Acorn parse identifiers as per Unicode 6.3.0.
  • UglifyJS v1 and v2 use Unicode 6.1.0, which, as far as ECMAScript 5.1 identifiers go, is identical to Unicode 6.3.0.

For these reasons, I’d suggest changing the identifier identification proposal as follows. Step 8 currently says:

If edition is 3 or 5, let unicode be 3.0.

Change that into step 8a:

If edition is 3, let unicode be 3.0.

Then, add a new step 8b:

If edition is 5, let unicode be 6.3.
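
The difference between the current step 8 and the proposed steps 8a/8b can be sketched as two small lookup functions. This is only an illustration: the function names are hypothetical, and only the edition 3 and edition 5 mappings are taken from the text above. The other steps of the strawman are not modelled.

```javascript
// Hypothetical helpers; only the edition 3/5 mappings come from the thread.

// Step 8 as currently written: editions 3 and 5 both map to Unicode 3.0.
function unicodeVersionCurrent(edition) {
  if (edition === 3 || edition === 5) return '3.0';
  return undefined; // remaining steps of the strawman are not modelled here
}

// Proposed split into step 8a (edition 3) and step 8b (edition 5).
function unicodeVersionProposed(edition) {
  if (edition === 3) return '3.0'; // step 8a
  if (edition === 5) return '6.3'; // step 8b
  return undefined; // remaining steps of the strawman are not modelled here
}
```

Under this sketch, only the result for edition 5 changes; edition 3 keeps Unicode 3.0 in both variants.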

P.S. I’ve created an identifier identification prollyfill (mathiasbynens/identifier-identification) based on the current strawman. I’ll happily modify it if the strawman gets updated in any way.

# Norbert Lindenberg (12 years ago)

On Oct 6, 2013, at 6:01, Mathias Bynens <mathias at qiwi.be> wrote:

This is about the identifier identification strawman: strawman:identifier_identification

For tooling, it’s better to have a false positive than to have a false negative. In the case of identifier identification, it’s more useful to flag an identifier that is permitted as per the latest Unicode version as valid instead of rejecting it, even if it’s perhaps not supported in some engines that use data tables based on older Unicode versions.

I think that depends on the kind of tool you're writing:

  • For a code transformation tool, such as CoffeeScript, I agree that you probably don't want to introduce any artificial restrictions, so you want to use the latest Unicode version possible. Step 10 of the proposed algorithm ("let unicode be the Unicode version supported by the implementation in ECMAScript identifiers") is intended to cover that case.

  • For a code checker such as JSHint it's probably useful to be able to verify that code runs on all conforming implementations of a specific ECMAScript edition, and that's only guaranteed for the minimum Unicode version required by that edition. ECMAScript 5 implementations are not required to support Unicode 6.3.0, not even its BMP subset.
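
The distinction between the two kinds of tools can be sketched as a three-way classification. This is a rough illustration, not part of the strawman: `classifyIdentifierChar` and both predicate parameters are hypothetical, and the sets in the usage lines are made-up stand-ins rather than real Unicode data.

```javascript
// Hypothetical classifier. `inMinimum` answers "is this code point allowed
// under the minimum Unicode version the edition requires?", `inLatest`
// answers the same question for the newest Unicode version the tool knows.
function classifyIdentifierChar(cp, inMinimum, inLatest) {
  if (inMinimum(cp)) return 'portable';         // safe on every conforming engine
  if (inLatest(cp)) return 'engine-dependent';  // a checker might warn here
  return 'invalid';                             // rejected under both versions
}

// Made-up stand-in data, for illustration only (not real Unicode tables).
const toyMinimum = new Set([0x61]);
const toyLatest = new Set([0x61, 0x62]);

const results = [0x61, 0x62, 0x63].map((cp) =>
  classifyIdentifierChar(cp, (c) => toyMinimum.has(c), (c) => toyLatest.has(c))
);
```

A transformation tool would presumably accept both of the first two classes, while a checker in the JSHint style would flag the second.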

In general, tools try to be lenient rather than restrictive in the input they accept. The list of ECMAScript 5 parsers that handle non-ASCII symbols in identifiers in the strawman backs this up: instead of using Unicode 3.0.0 data, more recent Unicode versions are used, in an attempt to handle as many technically valid identifiers as possible.

In the case of JSHint, I think that's problematic - see above.

For these reasons, I’d suggest changing the identifier identification proposal as follows. Step 8 currently says:

If edition is 3 or 5, let unicode be 3.0.

Change that into step 8a:

If edition is 3, let unicode be 3.0.

Then, add a new step 8b:

If edition is 5, let unicode be 6.3.

That would create several problems:

  • The Unicode version for ES 5 would be above that for ES 6 (step 9).

  • Tools like JSHint, if they want to ensure compatibility with all ES 5 implementations, would have to lie and specify ES 3.

  • Step 11 would allow all Unicode code points that are matched by the IdentifierStart production, including supplementary code points, which ES 5 does not permit in identifiers. (Note that Unicode 3.0, the version referenced by the ES 3 and ES 5 specs, was the last one that did not define any supplementary characters, so the spec as proposed doesn't have that problem).

  • Implementations that don't support Unicode 6.3 yet, e.g., because they rely on Unicode information provided by the operating system, would not be able to comply with the spec.

  • When the next version of Unicode is published, a spec referencing 6.3 would be obsolete just like one referencing 3.0.

Norbert

# Mathias Bynens (12 years ago)

CC’ing the creators of the tools we’ve been talking about to get their input. Hi guys! Please start reading here: esdiscuss.org/topic/making-the-identifier-identification-strawman-less-restrictive.

On 9 Oct 2013, at 07:48, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:

  • For a code transformation tool, such as CoffeeScript, I agree that you probably don't want to introduce any artificial restrictions, so you want to use the latest Unicode version possible. Step 10 of the proposed algorithm ("let unicode be the Unicode version supported by the implementation in ECMAScript identifiers") is intended to cover that case.

But that makes it an implementation-dependent impure function, which is unacceptable for code transformation tools like CoffeeScript and parsers like Esprima, Acorn, or UglifyJS. They’d support certain identifiers in engine A but not in engine B, without any control over it. If this is how String.isIdentifier{Start,Part} works I think these tools will stick to their custom identifier identification methods, which would defeat the purpose of the entire strawman. (Ariya, Marijn, Mihai: any thoughts?)
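
For contrast, the kind of pure check that such tools tend to bundle could look roughly like the sketch below. The table is deliberately tiny and incomplete (ASCII plus a few Latin-1 letter ranges); a real tool would generate the ranges from the Unicode Character Database for the version it pins. The names `PINNED_START_RANGES` and `isIdentifierStartPinned` are hypothetical.

```javascript
// Illustrative, deliberately incomplete data. A real table would be generated
// from the Unicode Character Database for one pinned Unicode version.
const PINNED_START_RANGES = [
  [0x24, 0x24], // $
  [0x41, 0x5A], // A-Z
  [0x5F, 0x5F], // _
  [0x61, 0x7A], // a-z
  [0xC0, 0xD6], // À-Ö
  [0xD8, 0xF6], // Ø-ö
  [0xF8, 0xFF], // ø-ÿ
];

// Pure function: the answer depends only on the arguments, never on which
// engine happens to run the tool.
function isIdentifierStartPinned(cp, ranges = PINNED_START_RANGES) {
  return ranges.some(([lo, hi]) => cp >= lo && cp <= hi);
}
```

Because the data travels with the tool, the same input gives the same answer in engine A and engine B.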

For these reasons, I’d suggest changing the identifier identification proposal as follows. […]

That would create several problems:

  • The Unicode version for ES 5 would be above that for ES 6 (step 9).

I would love to see that changed too as per javascript.spec.whatwg.org/#unicode-database-version, but that’s an issue with the main ES spec (ecmascript#2071).

  • Tools like JSHint, if they want to ensure compatibility with all ES 5 implementations, would have to lie and specify ES 3.

They don’t at the moment. @Anton, any thoughts?

  • Step 11 would allow all Unicode code points that are matched by the IdentifierStart production, including supplementary code points, which ES 5 does not permit in identifiers. (Note that Unicode 3.0, the version referenced by the ES 3 and ES 5 specs, was the last one that did not define any supplementary characters, so the spec as proposed doesn't have that problem).

Step 11 says “If cp is matched by the IdentifierStart production in edition edition of the ECMAScript Language Specification using Unicode version unicode, then return true” so this is not a problem either way. ES5 IdentifierStart doesn’t include supplementary code points, like you said, because of the way ES5 defines “character”.
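
A small sketch of why this holds: ES5 treats source text as a sequence of 16-bit code units, so a supplementary code point only ever appears as two surrogate code units, and surrogates are not letters. The helper names below are hypothetical; U+1D49C is just an arbitrary supplementary code point chosen for the example.

```javascript
// Split a code point into the UTF-16 code units an ES5 engine would see.
function toUtf16CodeUnits(cp) {
  if (cp <= 0xFFFF) return [cp]; // BMP: a single code unit
  const offset = cp - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Surrogate code units occupy U+D800..U+DFFF.
function isSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDFFF;
}

const units = toUtf16CodeUnits(0x1D49C);   // supplementary code point
const allSurrogates = units.every(isSurrogate);
```

Each "character" that an ES5 IdentifierStart check sees for such a code point is a lone surrogate, which no Unicode version classifies as a letter.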

  • Implementations that don't support Unicode 6.3 yet, e.g., because they rely on Unicode information provided by the operating system, would not be able to comply with the spec.

Which implementations do that? The ones I’ve seen all use custom-generated Unicode data files. Is this really an issue?

# Norbert Lindenberg (12 years ago)

On Oct 9, 2013, at 0:27, Mathias Bynens <mathias at qiwi.be> wrote:

  • Step 11 would allow all Unicode code points that are matched by the IdentifierStart production, including supplementary code points, which ES 5 does not permit in identifiers. (Note that Unicode 3.0, the version referenced by the ES 3 and ES 5 specs, was the last one that did not define any supplementary characters, so the spec as proposed doesn't have that problem).

Step 11 says “If cp is matched by the IdentifierStart production in edition edition of the ECMAScript Language Specification using Unicode version unicode, then return true” so this is not a problem either way. ES5 IdentifierStart doesn’t include supplementary code points, like you said, because of the way ES5 defines “character”.

You're right. In re-checking this, I also found that the proposed algorithm could leave edition undefined, so I fixed that.

  • Implementations that don't support Unicode 6.3 yet, e.g., because they rely on Unicode information provided by the operating system, would not be able to comply with the spec.

Which implementations do that? The ones I’ve seen all use custom-generated Unicode data files. Is this really an issue?

I haven't seen the implementation, but my tests find that IE 10 running on Windows 7 supports the BMP subset of Unicode 5.1 in identifiers, and rejects newer BMP characters. Unicode 5.1 happens to be the version supported in Windows 7 in general. Unicode 5.2 was published in October 2009, IE 10 in October 2012, so there would have been enough time to update an IE-internal table.
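
One way such a probe could be written is sketched below. This is not the test code referred to above, only an assumption about how a per-character check might look: it asks the running engine to compile a `var` declaration whose name is the single character under test. The function name is hypothetical, and `String.fromCodePoint` requires an ES6-capable engine.

```javascript
// Hypothetical probe: does the running engine accept this code point as the
// first (and only) character of an identifier?
function engineAcceptsIdentifierStart(cp) {
  const ch = String.fromCodePoint(cp);
  try {
    // Compiles the declaration without executing it; a SyntaxError means
    // the engine rejected the character in identifier-start position.
    new Function('var ' + ch + ';');
    return true;
  } catch (e) {
    return false;
  }
}
```

Running this over the code points added in each Unicode version would show which version's data an engine appears to use. The result is engine-dependent by design.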

Norbert

# Mathias Bynens (12 years ago)

Forwarding Anton’s message since he’s not subscribed to es-discuss.

Begin forwarded message:

From: Anton Kovalyov <anton at kovalyov.net>

If someone who’s running their code in an ES5 environment could run into problems when using Unicode 6.3, JSHint needs to warn about it. Today it doesn’t, mostly because I’m really fuzzy on the differences between Unicode versions and I don’t have much time to dig into that, so I’m relying on incoming patches.

Hope that helps at all. Let me know if you need more info or if I misunderstood the question.

# Mathias Bynens (12 years ago)

Forwarding Marijn’s message since he’s not subscribed to es-discuss.

Begin forwarded message:

From: Marijn Haverbeke <marijnh at gmail.com>

I have no particular opinion about this. Identifiers with obscure characters tend to be so rare that I don't expect to have any trouble with this except for constructed conformance tests. Since you'll probably be the people who are going to construct such tests, I'll leave you to figure out what's sane.

# Erik Arvidsson (12 years ago)

I'm concerned about the latest version of this on the wiki. The edition parameter requires that we ship 2 tables today. This seems like it might change to 3 in ES7 and n in ES(n+4). I think the only reasonable requirement is that it matches what the engine actually uses. For tools it seems better for them to include this table. I don't want all runtimes to have to pay for something that only tools need.

# Brendan Eich (12 years ago)

Erik Arvidsson wrote:

I'm concerned about the latest version of this on the wiki. The edition parameter requires that we ship 2 tables today. This seems like it might change to 3 in ES7 and n in ES(n+4). I think the only reasonable requirement is that it matches what the engine actually uses. For tools it seems better for them to include this table. I don't want all runtimes to have to pay for something that only tools need.

Good point.

We are still struggling with the download size of ICU on Firefox.

# Mathias Bynens (12 years ago)

On 14 Oct 2013, at 23:21, Erik Arvidsson <erik.arvidsson at gmail.com> wrote:

I'm concerned about the latest version of this on the wiki. The edition parameter requires that we ship 2 tables today. This seems like it might change to 3 in ES7 and n in ES(n+4). I think the only reasonable requirement is that it matches what the engine actually uses. For tools it seems better for them to include this table. I don't want all runtimes to have to pay for something that only tools need.

This strawman is only useful for tools. If tools need to implement this themselves, this basically means the strawman is rejected, right?