Identifying ECMAScript identifiers
You forgot to include MentalJS. I can parse 120k identifiers in 5 ms on Firefox on my crappy machine. My method is much faster than any of the parsers you listed, and I handle Unicode escapes too. businessinfo.co.uk/labs/MentalJS/MentalJS.html
These tools have to be able to recognize ECMAScript identifiers, taking the identifier specification and the underlying Unicode specification into consideration - not quite easy given the ever-growing Unicode character set.
Yeah. We, the Esprima developers, parse UnicodeData.txt to generate identifier-identification functions. I wrote a simple UnicodeData.txt parser and generated a RegExp[1]. These functions are also used in Acorn.
In Esprima and Acorn, for performance reasons, the identifier-identification functions take a code point as a number, not as a string[2][3]. So I suggest accepting a code point number as an argument.
[1] code.google.com/p/esprima/issues/detail?id=110 [2] ariya/esprima/blob/master/esprima.js#L229 [3] marijnh/acorn/blob/master/acorn.js#L421
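For illustration, here is a rough sketch of the number-based approach being described. The function names and the ASCII-only ranges are illustrative assumptions, not Esprima's or Acorn's actual code (their tables are generated from UnicodeData.txt):

```javascript
// Hedged sketch: identifier checks that take a code point as a number,
// in the style suggested above. ASCII-only; real parsers extend this
// with generated Unicode tables.
function isIdentifierStartAscii(code) {
  return (code >= 97 && code <= 122) || // a-z
         (code >= 65 && code <= 90)  || // A-Z
         code === 36 || code === 95;    // $ and _
}

function isIdentifierPartAscii(code) {
  return isIdentifierStartAscii(code) ||
         (code >= 48 && code <= 57);    // 0-9
}
```

A caller that already has a string can pass `str.charCodeAt(i)`, so no intermediate one-character string needs to be allocated.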
On 8 March 2013 10:35, Yusuke SUZUKI <utatane.tea at gmail.com> wrote:
Yeah. We, the Esprima developers, parse UnicodeData.txt to generate identifier-identification functions. I wrote a simple UnicodeData.txt parser and generated a RegExp[1]. These functions are also used in Acorn.
RegEx is slower. I suggest using if statements on char codes, with < and > to check whether a code falls within the a-z range etc., and then separate functions to handle higher, non-ASCII characters only when needed, comparing their char codes against the ranges of allowed identifier characters.
code.google.com/p/mentaljs/source/browse/trunk/MentalJS/javascript/Mental.js#504
I still have to optimize that function further by removing <= and >=, and maybe by separating each identifier range into its own function, since higher, non-alphabetic characters take longer to check because they sit at the end of the if statement.
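The <= / >= removal mentioned above just shifts the inclusive bounds by one to use strict comparisons. A hedged sketch (the helper name is hypothetical, not MentalJS's actual code, and whether this measurably helps depends on the engine):

```javascript
// Inclusive bounds:       code >= 97 && code <= 122
// Strict comparisons:     bounds shifted by one, same result.
function isLowerAlpha(code) {
  return code > 96 && code < 123; // 'a' (97) .. 'z' (122)
}
```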
On Mar 8, 2013, at 2:35 , Yusuke SUZUKI wrote:
In Esprima and Acorn, for performance reasons, the identifier-identification functions take a code point as a number, not as a string[2][3]. So I suggest accepting a code point number as an argument.
The functions I proposed accept both numbers and strings.
strawman:identifier_identification
Norbert
RegEx is slower. I suggest using if statements on char codes, with < and > to check whether a code falls within the a-z range etc ...
If you check Yusuke's links, that is exactly what Esprima is doing. The use of regular expressions is reserved for the slow/uncommon code path only.
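The two-tier strategy being described can be sketched as follows. This is a hedged illustration, not Esprima's actual code: the regex below covers only a small subset of the Latin-1 identifier-start characters, and the function name is hypothetical.

```javascript
// Fast path: plain char-code comparisons for the common ASCII case.
// Slow path: a regular expression, reached only for non-ASCII input.
var NON_ASCII_ID_START = /[\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6]/; // illustrative subset

function isIdentifierStart(code) {
  if (code < 128) {
    return (code >= 97 && code <= 122) || // a-z
           (code >= 65 && code <= 90)  || // A-Z
           code === 36 || code === 95;    // $ and _
  }
  // Uncommon path: build a one-character string and test the regex.
  return NON_ASCII_ID_START.test(String.fromCharCode(code));
}
```

For ASCII-heavy source code, the regex (and the `fromCharCode` allocation it requires) is almost never exercised.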
The functions I proposed accept both numbers and strings. strawman:identifier_identification
Ah, I see. I missed what is intended at step 2. Looks very nice, thanks.
On 9 March 2013 01:59, Ariya Hidayat <ariya.hidayat at gmail.com> wrote:
If you check Yusuke's links, that is exactly what Esprima is doing. The use of regular expressions is reserved for the slow/uncommon code path only.
Yeah, I can see you are converting a char code into a string using fromCharCode and comparing it against a regex, which is slower; I showed you a function that checks non-alphabetic characters using char codes. BTW, your isWhiteSpace function also calls indexOf/fromCharCode when it doesn't need to, and since indexOf returns a 0-based position, you have to check that the result is greater than -1 rather than treating it as a boolean.
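For comparison, a whitespace test built purely on char-code comparisons, of the kind being suggested, might look like this. This is a hedged sketch covering only a handful of the ES5 WhiteSpace/LineTerminator code points, not the full set:

```javascript
// Direct char-code comparisons: no fromCharCode, no indexOf.
function isWhiteSpace(code) {
  return code === 0x20 ||   // space
         code === 0x09 ||   // tab
         code === 0x0B ||   // vertical tab
         code === 0x0C ||   // form feed
         code === 0xA0 ||   // no-break space
         code === 0xFEFF;   // BOM, counted as whitespace in ES5
}
```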
Norbert,
Can you explain why you think these should be functions on String rather than part of a more general character classification facility that might be associated with some more specialized object? The latter approach would seem to be to have modularity advantages at both the implementation and usage level.
Norbert, for the sake of completeness:
ZeParser (qfox/zeparser) does support complete Unicode identifiers.
ZeParser2 (qfox/zeparser2) doesn't (I simply didn't bother).
I added these functions to String because that seems the best place for them in the current arrangement. I'm aware of the proposal to modularize the standard library [1] and can well imagine that these functions will find a better home in that new scheme.
The other character classification scheme I'm looking into is based on Unicode character properties. The reasons why I separated out this proposal are:
- Tools operating on ECMAScript source code need to be aware of the ECMAScript version they use, for syntax, semantics, keywords, and, well, the characters allowed in identifiers. Some tools let their clients specify an ECMAScript version (e.g., "es5" in JSLint and JSHint), others may assume a fixed version. The characters in turn are tied to both Unicode versions and ECMAScript versions - for example, SpiderMonkey currently supports Unicode 6.2 characters, but restricted to the BMP because it hasn't been upgraded to ES6 identifiers yet.
- For Unicode character properties, on the other hand, clients generally need only the properties as of the latest known version, and in the few exceptions that I know of (such as the 2003 version of IDNA) only specific Unicode versions are needed. Requiring that a general API for Unicode character properties provide access to Unicode version-specific information would create a huge burden on implementors, but benefit no-one.
- It's difficult for tools developers to determine the correct set of characters to include as identifier characters. One particular difficulty is that the Unicode general category of a character can change in rare cases, so a character can move into or out of the categories that the ES3/ES5 specifications reference. For compatibility, characters shouldn't move out of the set of characters allowed for identifiers. (It turns out that browsers also get this wrong - all of them). (ES6 solves this problem by basing its identifier definition on Unicode Standard Annex 31, Unicode Identifier and Pattern Syntax, which defines special sets of characters Other_ID_Start and Other_ID_Continue and treats these characters as identifier characters even though their current general categories don't qualify them as such anymore.)
- For general Unicode processing, I think it's important to have support in regular expressions, because that's what many applications use for text processing. For tools operating on ECMAScript source code that seems less important, based on the data I collected [3].
So, rather than having one grand unified character classification API with support for both Unicode versions and regular expressions I think it's better to provide tailored APIs for different purposes.
[1] harmony:modules_standard [2] www.unicode.org/reports/tr31/#Backward_Compatibility [3] strawman:identifier_identification
Norbert
Great proposal, Norbert!
Another tool that uses JavaScript to identify identifiers as per ECMAScript 5.1 / Unicode 6.2 is mothereff.in/js-variables.
For a list of bug reports regarding identifier handling in browsers / JavaScript engines, see mathiasbynens.be/notes/javascript-identifiers (look for “Some of these don’t work in all browsers/environments”).
I’m a bit confused by step 7.2, though: “If edition is not 3, 5, or 6, throw a RangeError exception.” Does this mean only integers are accepted? E.g. you can specify 5 as the ECMAScript version, but not 5.1? I would suggest adding 5.1 to the list (even if it’s just an alias to 5), but perhaps I’m missing something.
Also, how about adding String.isIdentifier(string) as well?
On 12 Mar 2013, at 02:45, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:
So, rather than having one grand unified character classification API with support for both Unicode versions and regular expressions I think it's better to provide tailored APIs for different purposes.
+1
- For general Unicode processing, I think it's important to have support in regular expressions, because that's what many applications use for text processing. For tools operating on ECMAScript source code that seems less important, based on the data I collected [3].
Agreed it would be nice. In the meantime, to polyfill this functionality, tools that take a list of code points / symbols / ranges (like mths.be/regenerate) could be used.
, Mathias
Also, what about the non-reserved words that act like reserved words, i.e. the immutable NaN, Infinity, and undefined properties of the global object, or eval and arguments, which are disallowed as identifiers (see section 12.2.1) in strict mode? IMHO, these are examples of why it would be useful to add a robust String.isIdentifier to the proposal.
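String.isIdentifier is only a suggestion at this point in the thread; as a hedged, ASCII-only sketch of the kind of checks such a function would need, the following may help. All names are hypothetical, the reserved-word list is deliberately incomplete, and a real version would also need the full Unicode identifier categories:

```javascript
// Hedged sketch of a whole-string identifier check, including the
// strict-mode exclusions (eval, arguments) mentioned above.
function isIdentifierSketch(string, strict) {
  // ASCII-only syntax check; a real version uses Unicode categories.
  if (!/^[A-Za-z$_][A-Za-z0-9$_]*$/.test(string)) return false;
  // Deliberately incomplete reserved-word list, for illustration only.
  var reserved = ['var', 'function', 'if', 'else', 'return'];
  if (reserved.indexOf(string) !== -1) return false;
  // eval and arguments are disallowed as identifiers in strict mode.
  if (strict && (string === 'eval' || string === 'arguments')) return false;
  return true;
}
```

Whether NaN, Infinity, and undefined should also be rejected is a design question: they are syntactically valid identifiers, just immutable global properties.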
ECMAScript is used to implement a variety of tools that check code for conformance with the ECMAScript specification, minimize it, perform other transformations, or generate ECMAScript code. These tools have to be able to recognize ECMAScript identifiers, taking the identifier specification and the underlying Unicode specification into consideration - not quite easy given the ever-growing Unicode character set.
While looking at support for Unicode character properties in general, I realized that this use case is shaped differently from others, fundamental to ECMAScript, and amenable to a fairly simple solution, and so there's now a strawman: strawman:identifier_identification
I'd like to discuss this at next week's TC 39 meeting, but also invite earlier comments.
Thanks, Norbert