Identifying ECMAScript identifiers
You forgot to include MentalJS. I can parse 120k identifiers in 5 ms on Firefox on my crappy machine. My method is much faster than any of the parsers you listed, and I handle Unicode escapes too. businessinfo.co.uk/labs/MentalJS/MentalJS.html
These tools have to be able to recognize ECMAScript identifiers, taking the identifier specification and the underlying Unicode specification into consideration - not quite easy given the ever-growing Unicode character set.
Yeah. We, the Esprima developers, parse UnicodeData.txt to generate identifier-identification functions. I wrote a simple UnicodeData.txt parser and generated a RegExp[1]. These functions are also used in Acorn.
In Esprima and Acorn, for performance reasons, the identifier-identification functions take a code point as a number, not as a string[2][3]. So I suggest accepting a code point number as an argument.
[1] code.google.com/p/esprima/issues/detail?id=110 [2] ariya/esprima/blob/master/esprima.js#L229 [3] marijnh/acorn/blob/master/acorn.js#L421
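For illustration, here is a rough sketch of the number-based approach being described. The function names and the ASCII-only ranges are illustrative assumptions, not Esprima's or Acorn's actual code (their tables are generated from UnicodeData.txt):

```javascript
// Hedged sketch: identifier checks that take a code point as a number,
// in the style suggested above. ASCII-only; real parsers extend this
// with generated Unicode tables.
function isIdentifierStartAscii(code) {
  return (code >= 97 && code <= 122) || // a-z
         (code >= 65 && code <= 90)  || // A-Z
         code === 36 || code === 95;    // $ and _
}

function isIdentifierPartAscii(code) {
  return isIdentifierStartAscii(code) ||
         (code >= 48 && code <= 57);    // 0-9
}
```

A caller that already has a string can pass `str.charCodeAt(i)`, so no intermediate one-character string needs to be allocated.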
On 8 March 2013 10:35, Yusuke SUZUKI <utatane.tea at gmail.com> wrote:
Yeah. We, the Esprima developers, parse UnicodeData.txt to generate identifier-identification functions. I wrote a simple UnicodeData.txt parser and generated a RegExp[1]. These functions are also used in Acorn.
RegEx is slower. I suggest using if statements on char codes, with < and > to check whether a code falls within the a-z range etc., and then separate functions to handle higher, non-ASCII characters only when needed, comparing their char codes against the ranges of allowed identifier characters.
code.google.com/p/mentaljs/source/browse/trunk/MentalJS/javascript/Mental.js#504
I still have to optimize that function further by removing <= and >=, and maybe by separating each identifier range into its own function, since higher, non-alphabetic characters take longer to check because they sit at the end of the if statement.
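The <= / >= removal mentioned above just shifts the inclusive bounds by one to use strict comparisons. A hedged sketch (the helper name is hypothetical, not MentalJS's actual code, and whether this measurably helps depends on the engine):

```javascript
// Inclusive bounds:       code >= 97 && code <= 122
// Strict comparisons:     bounds shifted by one, same result.
function isLowerAlpha(code) {
  return code > 96 && code < 123; // 'a' (97) .. 'z' (122)
}
```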
On Mar 8, 2013, at 2:35 , Yusuke SUZUKI wrote:
In Esprima and Acorn, for performance reasons, the identifier-identification functions take a code point as a number, not as a string[2][3]. So I suggest accepting a code point number as an argument.
The functions I proposed accept both numbers and strings.
strawman:identifier_identification
Norbert
RegEx is slower. I suggest using if statements on char codes, with < and > to check whether a code falls within the a-z range etc ...
If you check Yusuke's links, that is exactly what Esprima is doing. The use of regular expressions is reserved for the slow/uncommon code path only.
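The two-tier strategy being described can be sketched as follows. This is a hedged illustration, not Esprima's actual code: the regex below covers only a small subset of the Latin-1 identifier-start characters, and the function name is hypothetical.

```javascript
// Fast path: plain char-code comparisons for the common ASCII case.
// Slow path: a regular expression, reached only for non-ASCII input.
var NON_ASCII_ID_START = /[\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6]/; // illustrative subset

function isIdentifierStart(code) {
  if (code < 128) {
    return (code >= 97 && code <= 122) || // a-z
           (code >= 65 && code <= 90)  || // A-Z
           code === 36 || code === 95;    // $ and _
  }
  // Uncommon path: build a one-character string and test the regex.
  return NON_ASCII_ID_START.test(String.fromCharCode(code));
}
```

For ASCII-heavy source code, the regex (and the `fromCharCode` allocation it requires) is almost never exercised.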
The functions I proposed accept both numbers and strings. strawman:identifier_identification
Ah, I see. I missed what is intended at step 2. Looks very nice, thanks.
On 9 March 2013 01:59, Ariya Hidayat <ariya.hidayat at gmail.com> wrote:
If you check Yusuke's links, that is exactly what Esprima is doing. The use of regular expressions is reserved for the slow/uncommon code path only.
Yeah, I can see you are converting a char code into a string using fromCharCode and comparing it against a regex, which is slower; I showed you a function that checks non-alphabetic characters using char codes. BTW, your isWhiteSpace function also calls indexOf/fromCharCode when it doesn't need to, and since indexOf returns a 0-based position, you have to check that the result is greater than -1 rather than treating it as a boolean.
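For comparison, a whitespace test built purely on char-code comparisons, of the kind being suggested, might look like this. This is a hedged sketch covering only a handful of the ES5 WhiteSpace/LineTerminator code points, not the full set:

```javascript
// Direct char-code comparisons: no fromCharCode, no indexOf.
function isWhiteSpace(code) {
  return code === 0x20 ||   // space
         code === 0x09 ||   // tab
         code === 0x0B ||   // vertical tab
         code === 0x0C ||   // form feed
         code === 0xA0 ||   // no-break space
         code === 0xFEFF;   // BOM, counted as whitespace in ES5
}
```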
Norbert,
Can you explain why you think these should be functions on String rather than part of a more general character classification facility that might be associated with some more specialized object? The latter approach would seem to be to have modularity advantages at both the implementation and usage level.
Norbert, for the sake of completeness:
ZeParser (qfox/zeparser) does support complete Unicode identifiers.
ZeParser2 (qfox/zeparser2) doesn't (I simply didn't bother).
I added these functions to String because that seems the best place for them in the current arrangement. I'm aware of the proposal to modularize the standard library [1] and can well imagine that these functions will find a better home in that new scheme.
The other character classification scheme I'm looking into is based on Unicode character properties. The reasons why I separated out this proposal are:
- Tools operating on ECMAScript source code need to be aware of the ECMAScript version they use, for syntax, semantics, keywords, and, well, the characters allowed in identifiers. Some tools let their clients specify an ECMAScript version (e.g., "es5" in JSLint and JSHint), others may assume a fixed version. The characters in turn are tied to both Unicode versions and ECMAScript versions - for example, SpiderMonkey currently supports Unicode 6.2 characters, but restricted to the BMP because it hasn't been upgraded to ES6 identifiers yet.
- For Unicode character properties, on the other hand, clients generally need only the properties as of the latest known version, and in the few exceptions that I know of (such as the 2003 version of IDNA) only specific Unicode versions are needed. Requiring that a general API for Unicode character properties provide access to Unicode version-specific information would create a huge burden on implementors, but benefit no-one.
- It's difficult for tools developers to determine the correct set of characters to include as identifier characters. One particular difficulty is that the Unicode general category of a character can change in rare cases, so a character can move into or out of the categories that the ES3/ES5 specifications reference. For compatibility, characters shouldn't move out of the set of characters allowed for identifiers. (It turns out that browsers also get this wrong - all of them). (ES6 solves this problem by basing its identifier definition on Unicode Standard Annex 31, Unicode Identifier and Pattern Syntax, which defines special sets of characters Other_ID_Start and Other_ID_Continue and treats these characters as identifier characters even though their current general categories don't qualify them as such anymore.)
- For general Unicode processing, I think it's important to have support in regular expressions, because that's what many applications use for text processing. For tools operating on ECMAScript source code that seems less important, based on the data I collected [3].
So, rather than having one grand unified character classification API with support for both Unicode versions and regular expressions I think it's better to provide tailored APIs for different purposes.
[1] harmony:modules_standard [2] www.unicode.org/reports/tr31/#Backward_Compatibility [3] strawman:identifier_identification
Norbert
Great proposal, Norbert!
Another tool that uses JavaScript to identify identifiers as per ECMAScript 5.1 / Unicode 6.2 is mothereff.in/js-variables.
For a list of bug reports regarding identifier handling in browsers / JavaScript engines, see mathiasbynens.be/notes/javascript-identifiers (look for “Some of these don’t work in all browsers/environments”).
I’m a bit confused by step 7.2, though: “If edition is not 3, 5, or 6, throw a RangeError exception.” Does this mean only integers are accepted? E.g. you can specify 5 as the ECMAScript version, but not 5.1? I would suggest adding 5.1 to the list (even if it’s just an alias to 5), but perhaps I’m missing something.
Also, how about adding String.isIdentifier(string) as well?
On 12 Mar 2013, at 02:45, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:
So, rather than having one grand unified character classification API with support for both Unicode versions and regular expressions I think it's better to provide tailored APIs for different purposes.
+1
- For general Unicode processing, I think it's important to have support in regular expressions, because that's what many applications use for text processing. For tools operating on ECMAScript source code that seems less important, based on the data I collected [3].
Agreed it would be nice. In the meantime, to polyfill this functionality, tools that take a list of code points / symbols / ranges (like mths.be/regenerate) could be used.
, Mathias
Also, what about the non-reserved words that act like reserved words, i.e. the immutable NaN, Infinity, and undefined properties of the global object, or eval and arguments, which are disallowed as identifiers (see section 12.2.1) in strict mode? IMHO, these are examples of why it would be useful to add a robust String.isIdentifier to the proposal.
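String.isIdentifier is only a suggestion at this point in the thread; as a hedged, ASCII-only sketch of the kind of checks such a function would need, the following may help. All names are hypothetical, the reserved-word list is deliberately incomplete, and a real version would also need the full Unicode identifier categories:

```javascript
// Hedged sketch of a whole-string identifier check, including the
// strict-mode exclusions (eval, arguments) mentioned above.
function isIdentifierSketch(string, strict) {
  // ASCII-only syntax check; a real version uses Unicode categories.
  if (!/^[A-Za-z$_][A-Za-z0-9$_]*$/.test(string)) return false;
  // Deliberately incomplete reserved-word list, for illustration only.
  var reserved = ['var', 'function', 'if', 'else', 'return'];
  if (reserved.indexOf(string) !== -1) return false;
  // eval and arguments are disallowed as identifiers in strict mode.
  if (strict && (string === 'eval' || string === 'arguments')) return false;
  return true;
}
```

Whether NaN, Infinity, and undefined should also be rejected is a design question: they are syntactically valid identifiers, just immutable global properties.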
ECMAScript is used to implement a variety of tools that check code for conformance with the ECMAScript specification, minimize it, perform other transformations, or generate ECMAScript code. These tools have to be able to recognize ECMAScript identifiers, taking the identifier specification and the underlying Unicode specification into consideration - not quite easy given the ever-growing Unicode character set.
While looking at support for Unicode character properties in general, I realized that this use case is shaped differently from others, fundamental to ECMAScript, and amenable to a fairly simple solution, and so there's now a strawman: strawman:identifier_identification
I'd like to discuss this at next week's TC 39 meeting, but also invite earlier comments.
Thanks, Norbert