Questions regarding ES6 Unicode regular expressions

# Mathias Bynens (10 years ago)

Norbert’s original proposal for the u flag (norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/#RegExp) mentioned the following:

Possibly the definition of the character classes \d\D\w\W\b\B is extended to their Unicode extensions, such as all characters in the Unicode category “Number, decimal” for \d, as proposed by Steven Levithan. Whether this can be done under the same flag or requires a different one still needs discussion.

Has this been discussed any further? (I couldn’t find any mention of it in the meeting notes repository.) Should I file a bug?

Norbert also suggested replacing ‘characters’ with ‘code points’ in sections like people.mozilla.org/~jorendorff/es6-draft.html#sec-characterclassescape and people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation when the u flag is set. It seems the intent was to make e.g. /\d/u match /[0-9]/, and /\D/u match all Unicode code points except [0-9]. This is different from /\D/ which only matches BMP code points.

It seems like this change has not propagated to the spec draft, though. Is this correct, and if so, what’s the reason for that?

The same goes for /[^a]/u – should this match all Unicode code points except a or should it only match BMP code points?
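A minimal sketch of the difference being asked about, assuming an engine that implements the u flag as it eventually shipped. The sample code point U+1D306 and the `results` object are illustrative choices, not taken from the proposal:

```javascript
// U+1D306 lies outside the BMP, so a JavaScript string stores it as two
// UTF-16 code units (a surrogate pair).
const astral = '\u{1D306}';

const results = {
  codeUnits: astral.length,
  // Without u, \D and [^a] each consume a single code unit, so a pattern
  // anchored around one "character" cannot span the whole surrogate pair.
  nonDigitLegacy: /^\D$/.test(astral),
  negatedClassLegacy: /^[^a]$/.test(astral),
  // With u, the same atoms consume one whole code point.
  nonDigitUnicode: /^\D$/u.test(astral),
  negatedClassUnicode: /^[^a]$/u.test(astral),
};

console.log(results);
```

In this sketch the legacy patterns only succeed if they are written to consume two elements (for example `/^\D\D$/`), which is the BMP-only behaviour the question refers to.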

# Till Schneidereit (10 years ago)

(Forwarding to Norbert as I don't know how closely he follows es-discuss.)

---------- Forwarded message ----------
From: Mathias Bynens <mathias at qiwi.be>
Date: Mon, Aug 25, 2014 at 10:59 AM
Subject: Questions regarding ES6 Unicode regular expressions
To: es-discuss <es-discuss at mozilla.org>

Norbert’s original proposal for the u flag ( norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/#RegExp) mentioned the following:

Possibly the definition of the character classes \d\D\w\W\b\B is extended to their Unicode extensions, such as all characters in the Unicode category “Number, decimal” for \d, as proposed by Steven Levithan. Whether this can be done under the same flag or requires a different one still needs discussion.

Has this been discussed any further? (I couldn’t find any mention of it in the meeting notes repository.) Should I file a bug?

Norbert also suggested replacing ‘characters’ with ‘code points’ in sections like people.mozilla.org/~jorendorff/es6-draft.html#sec-characterclassescape and people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation when the u flag is set. It seems the intent was to make e.g. /\d/u match /[0-9]/, and /\D/u match all Unicode code points except [0-9]. This is different from /\D/ which only matches BMP code points.

It seems like this change has not propagated to the spec draft, though. Is this correct, and if so, what’s the reason for that?

The same goes for /[^a]/u – should this match all Unicode code points except a or should it only match BMP code points?

# Anne van Kesteren (10 years ago)

On Mon, Aug 25, 2014 at 11:44 AM, Till Schneidereit <till at tillschneidereit.net> wrote:

(Forwarding to Norbert as I don't know how closely he follows es-discuss.)

I think regular expression extensions were postponed sometime last year because nobody was interested in working out detailed proposals. As far as I can tell that hasn't changed.

# Norbert Lindenberg (10 years ago)

On Aug 25, 2014, at 1:59 , Mathias Bynens <mathias at qiwi.be> wrote:

Norbert’s original proposal for the u flag (norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/#RegExp) mentioned the following:

Possibly the definition of the character classes \d\D\w\W\b\B is extended to their Unicode extensions, such as all characters in the Unicode category “Number, decimal” for \d, as proposed by Steven Levithan. Whether this can be done under the same flag or requires a different one still needs discussion.

Has this been discussed any further? (I couldn’t find any mention of it in the meeting notes repository.) Should I file a bug?

The “needs discussion” part actually came from the March 2012 TC39 meeting: esdiscuss/2012-March/021919. We subsequently had some discussions about how to go about such a discussion, which petered out because no regular expression expert was available to work with.

I suspect this issue needs a proposal rather than a bug.

Norbert also suggested replacing ‘characters’ with ‘code points’ in sections like people.mozilla.org/~jorendorff/es6-draft.html#sec-characterclassescape and people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation when the u flag is set. It seems the intent was to make e.g. /\d/u match /[0-9]/, and /\D/u match all Unicode code points except [0-9]. This is different from /\D/ which only matches BMP code points.

It seems like this change has not propagated to the spec draft, though. Is this correct, and if so, what’s the reason for that?

Technically that works out as intended because section 21.2.2 defines “character” differently depending on whether the “u” flag is used or not: people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics

It is somewhat confusing though, and it might be better to use a different spec mechanism. Ideas:

  • Since we're processing based on Lists anyway, we could just use “element”.
  • We could map code points to UTF-32 code units (1:1), and then consistently talk about code units, which would just have different sizes in the different modes.

The same goes for /[^a]/u – should this match all Unicode code points except a or should it only match BMP code points?

As above – the definition of CharSet depends on the “u” flag: people.mozilla.org/~jorendorff/es6-draft.html#sec-notation
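The mode-dependent meaning of “character” can be sketched outside the spec. The helper `toElements` below is hypothetical (it is not a spec operation); it only mimics the idea that the input becomes a List of code units without the u flag and a List of code points with it:

```javascript
// Hypothetical helper: decode a string into the list of "elements" a
// matcher would step over, depending on whether the u flag is set.
function toElements(input, unicode) {
  return unicode
    ? Array.from(input, (ch) => ch.codePointAt(0)) // one entry per code point
    : input.split('').map((unit) => unit.charCodeAt(0)); // one entry per code unit
}

const sample = 'a\u{1D306}b';

const legacyElements = toElements(sample, false);
const unicodeElements = toElements(sample, true);

// An "any element" class, matched globally, steps through the same lists.
const legacyMatches = sample.match(/[^]/g);
const unicodeMatches = sample.match(/[^]/gu);

console.log(legacyElements.length, legacyMatches.length);
console.log(unicodeElements.length, unicodeMatches.length);
```

Under that reading, a negated class such as `[^a]` is simply "any element except a", and what counts as an element follows from the flag.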

Norbert

# Mathias Bynens (10 years ago)

On 26 Aug 2014, at 02:16, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:

[…]

Thanks for confirming. Sounds like my “ES6 Unicode regular expressions to ES5” transpiler is working correctly, then: mathiasbynens/regexpu. Demo: mothereff.in/regexpu (bug reports welcome).

On Aug 25, 2014, at 1:59 , Mathias Bynens <mathias at qiwi.be> wrote:

Norbert’s original proposal for the u flag (norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/#RegExp) mentioned the following:

Possibly the definition of the character classes \d\D\w\W\b\B is extended to their Unicode extensions, such as all characters in the Unicode category “Number, decimal” for \d, as proposed by Steven Levithan. Whether this can be done under the same flag or requires a different one still needs discussion.

Has this been discussed any further? (I couldn’t find any mention of it in the meeting notes repository.) Should I file a bug?

The “needs discussion” part actually came from the March 2012 TC39 meeting: esdiscuss/2012-March/021919. We subsequently had some discussions about how to go about such a discussion, which petered out because no regular expression expert was available to work with.

I suspect this issue needs a proposal rather than a bug.

mathiasbynens/es6-unicode-character-class-escape-sets#readme. I’m fairly confident in the proposals for \d and \w, but \b needs work.

@Steven Levithan, would you mind lending your expertise on this? This is your chance to make /na\b/u.test('naïve') return false :)

# Allen Wirfs-Brock (10 years ago)

I've thought about this a bit. I was initially inclined to agree with the idea of extending the existing character classes similar to what Mathias proposes. But I now think that is probably not a very good idea and that what is currently spec'ed (essentially that the /u flag doesn't change the meaning of \w, \d, etc.) is the better path.

The basic issue I see is backwards compatibility and evolving code to using /u patterns.

I suspect that there is plenty of JS code in the world that does something more or less equivalent to parseInt(/\s*(\d+)/.exec(input)[1])

Note that parseInt is only prepared to recognize the digits U+0030-U+0039.

It seems to me that we want programmers to start migrating to full Unicode regular expressions without having to do a major logic rewrite of their code. For example, ideally the above expression could simply be replaced by parseInt(/\s*(\d+)/u.exec(input)[1]) and everything in the application could continue to work unchanged.

That won't be the case if we redefine, as Mathias proposes, /\d/u to be equivalent to /[0-9\u0660-\u0669\u06F0-\u06F9\u07C0-\u07C9\u0966-\u096F\u09E6-\u09EF\u0A66-\u0A6F\u0AE6-\u0AEF\u0B66-\u0B6F\u0BE6-\u0BEF\u0C66-\u0C6F\u0CE6-\u0CEF\u0D66-\u0D6F\u0DE6-\u0DEF\u0E50-\u0E59\u0ED0-\u0ED9\u0F20-\u0F29\u1040-\u1049\u1090-\u1099\u17E0-\u17E9\u1810-\u1819\u1946-\u194F\u19D0-\u19D9\u1A80-\u1A89\u1A90-\u1A99\u1B50-\u1B59\u1BB0-\u1BB9\u1C40-\u1C49\u1C50-\u1C59\uA620-\uA629\uA8D0-\uA8D9\uA900-\uA909\uA9D0-\uA9D9\uA9F0-\uA9F9\uAA50-\uAA59\uABF0-\uABF9\uFF10-\uFF19]|\uD801[\uDCA0-\uDCA9]|\uD804[\uDC66-\uDC6F\uDCF0-\uDCF9\uDD36-\uDD3F\uDDD0-\uDDD9\uDEF0-\uDEF9]|\uD805[\uDCD0-\uDCD9\uDE50-\uDE59\uDEC0-\uDEC9]|\uD806[\uDCE0-\uDCE9]|\uD81A[\uDE60-\uDE69\uDF50-\uDF59]|\uD835[\uDFCE-\uDFFF]/u rather than

/[0-9]/u

We can apply similar logic to \w and even \s. Instead, we should leave the definitions of \d, \w and \s unchanged and plan to adopt the already established convention that \p{<Unicode property>} is the notation for matching Unicode categories. See www.regular-expressions.info/unicode.html
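The compatibility concern can be made concrete with a small sketch. It assumes an engine with Unicode property escapes (an ES2018 feature that postdates this thread) and uses \p{Nd} only as a stand-in for the proposed wider \d; the input strings are made up for illustration:

```javascript
// Three ARABIC-INDIC DIGITs (U+0661..U+0663) preceded by whitespace.
const arabicIndic = '  \u0661\u0662\u0663';
const ascii = '  42';

// As specified, \d under /u still means [0-9]: ASCII input parses as
// before, and non-ASCII digits are simply not matched.
const asciiValue = parseInt(/\s*(\d+)/u.exec(ascii)[1], 10);
const asSpecified = /\s*(\d+)/u.exec(arabicIndic);

// Stand-in for the proposed semantics: match any "Number, decimal" digit.
const widened = /\s*(\p{Nd}+)/u.exec(arabicIndic);

// parseInt only recognizes U+0030..U+0039, so the captured text is not
// something it can parse.
const widenedValue = parseInt(widened[1], 10);

console.log({ asciiValue, asSpecified, captured: widened[1], widenedValue });
```

With the narrow \d, a failed match surfaces immediately as a null result from exec; with the widened class, the match succeeds and the problem only appears later, when parseInt is handed digits it does not understand.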

I think digesting all the \p{} possibilities is too much to do for ES6, so I suggest that for ES6 we simply reserve the \p{<characters>} and \P{<characters>} syntax within /u patterns. A \p proposal can then be developed for ES7.

I see one remaining issue: In ES5 (and ES6), /[a-z]/i does not match U+017F (ſ) or U+212A (K) because the ES canonicalization algorithm excludes mapping code points > 127 that toUpperCase to code points <128.

However, as currently spec'ed, the ES6 canonicalization algorithm for /u RegExps does not include that >127/<128 exclusion. It maps U+017F to "S" which matches.

This is probably a minor variation from the ES5 behavior, but we should probably be sure it is a desirable and tolerable change as we presumably could also apply the >127/<128 filter to /u canonicalization.
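The canonicalization difference can be observed directly. This sketch assumes an engine that implements the u flag with Unicode simple case folding, as the draft described; the `folding` object is just an illustrative container:

```javascript
const longS = '\u017F';  // LATIN SMALL LETTER LONG S
const kelvin = '\u212A'; // KELVIN SIGN

const folding = {
  // Legacy canonicalization goes through toUpperCase, but refuses to map
  // a non-ASCII character onto an ASCII one.
  upperOfLongS: longS.toUpperCase(),
  longSLegacy: /[a-z]/i.test(longS),
  kelvinLegacy: /[a-z]/i.test(kelvin),
  // With u, case folding is used and no such exclusion applies.
  longSUnicode: /[a-z]/iu.test(longS),
  kelvinUnicode: /[a-z]/iu.test(kelvin),
};

console.log(folding);
```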

So, here is a summary of my proposal:

  1. don't change the current definitions of \d, \w, \s when used in /u regular expressions.

  2. Decide whether the current ES6 /u canonicalization algorithm is correct or if it should not translate code points > 127 that map to code points <128.

  3. Reserve within /u RegExp patterns, the syntax \p{<characters>} and \P{<characters>}

  4. Start to develop a \p{ } proposal for ES7.

# Mathias Bynens (10 years ago)

On 26 Aug 2014, at 19:01, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

I've thought about this a bit. I was initially inclined to agree with the idea of extending the existing character classes similar to what Mathias proposes. But I now think that is probably not a very good idea and that what is currently spec'ed (essentially that the /u flag doesn't change the meaning of \w, \d, etc.) is the better path. […] It seems to me that we want programmers to start migrating to full Unicode regular expressions without having to do a major logic rewrite of their code. For example, ideally the above expression could simply be replaced by parseInt(/\s*(\d+)/u.exec(input)[1]) and everything in the application could continue to work unchanged.

I see your point, but I disagree with the notion that we must absolutely maintain backwards compatibility in this case. The fact that the new flag is opt-in gives us an opportunity to improve behavior without obsessing about back-compat, similar to how the strict mode opt-in is used to make all sorts of things better. When evangelizing /u, we can educate developers and tell them to not blindly/needlessly add /u to their existing regular expressions.

Instead, we should leave the definitions of \d, \w and \s unchanged and plan to adopt the already established convention that \p{<Unicode property>} is the notation for matching Unicode categories. See www.regular-expressions.info/unicode.html

We could do both: improve \d and \w now, and add \p{property} and \P{property} later. Anyhow, I’ve filed ecmascript#3157 for reserving \p{…}/\P{…}.

I think digesting all the \p{} possibilities is too much to do for ES6, so I suggest that for ES6 we simply reserve the \p{<characters>} and \P{<characters>} syntax within /u patterns. A \p proposal can then be developed for ES7.

Sounds good to me.

I see one remaining issue: In ES5 (and ES6), /[a-z]/i does not match U+017F (ſ) or U+212A (K) because the ES canonicalization algorithm excludes mapping code points > 127 that toUpperCase to code points <128. However, as currently spec'ed, the ES6 canonicalization algorithm for /u RegExps does not include that >127/<128 exclusion. It maps U+017F to "S" which matches. This is probably a minor variation from the ES5 behavior, but we should probably be sure it is a desirable and tolerable change as we presumably could also apply the >127/<128 filter to /u canonicalization.

This is a useful feature, and the explicit opt-in makes the small back-compat break acceptable IMHO.

# Norbert Lindenberg (10 years ago)

On Aug 26, 2014, at 10:01 , Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

So, here is a summary of my proposal:

  1. Reserve within /u RegExp patterns, the syntax \p{<characters>} and \P{<characters>}

This was already decided by TC39 at the March 2012 meeting, and if I read the spec correctly, it’s already specified:

IdentityEscape[U] ::
  [+U] SyntaxCharacter
  [~U] SourceCharacter but not IdentifierPart
  [~U] <ZWJ>
  [~U] <ZWNJ>

esdiscuss/2012-March/021919, people.mozilla.org/~jorendorff/es6-draft.html#sec-patterns
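The effect of that production can be probed from script. This is a sketch assuming an engine that follows the shipped grammar (including the web-compatibility identity escapes for non-u patterns); `compiles` is a hypothetical helper, and the \p{Nd} case additionally assumes ES2018 property escapes:

```javascript
// Hypothetical helper: report whether a pattern source is accepted
// under the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const escapes = {
  // Without u, an unknown letter escape is an identity escape.
  legacyP: compiles('\\p', ''),
  legacyPMatchesLetter: /\p/.test('p'),
  legacyY: compiles('\\y', ''),
  // With u, only syntax characters may be identity-escaped, so letter
  // escapes without a defined meaning are rejected (that is, reserved).
  unicodeBareP: compiles('\\p', 'u'),
  unicodeY: compiles('\\y', 'u'),
  unicodeDollar: compiles('\\$', 'u'),
  // Assumes an ES2018+ engine, where the reserved syntax gained a meaning.
  unicodeProperty: compiles('\\p{Nd}', 'u'),
};

console.log(escapes);
```

Because a bare letter escape is a syntax error under u, a later edition was free to assign \p{…} a meaning without changing the behaviour of any previously valid /u pattern.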

Norbert

# Claude Pache (10 years ago)

Le 26 août 2014 à 19:01, Allen Wirfs-Brock <allen at wirfs-brock.com> a écrit :

  1. Reserve within /u RegExp patterns, the syntax \p{<characters>} and \P{<characters>}

I'll go even further: when the u flag is on, it shall be verboten for implementations to interpret \ followed by one of 0-9, A-Z or a-z as a literal character. (To be added in the newly created Section 16.1 Forbidden Extensions of the spec.) The current lenient behaviour of most ES engines is just good for hiding bugs and hindering evolution.

# Allen Wirfs-Brock (10 years ago)

On Aug 26, 2014, at 11:16 AM, Norbert Lindenberg wrote:

On Aug 26, 2014, at 10:01 , Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

So, here is a summary of my proposal:

  1. Reserve within /u RegExp patterns, the syntax \p{<characters>} and \P{<characters>}

This was already decided by TC39 at the March 2012 meeting, and if I read the spec correctly, it’s already specified:

IdentityEscape[U] ::
  [+U] SyntaxCharacter
  [~U] SourceCharacter but not IdentifierPart
  [~U] <ZWJ>
  [~U] <ZWNJ>

Yes, that covers it. Perhaps a NOTE explaining would be appropriate. It's a subtle change and I forgot the significance of it.

# Claude Pache (10 years ago)

Le 26 août 2014 à 20:15, Mathias Bynens <mathias at qiwi.be> a écrit :

On 26 Aug 2014, at 19:01, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

I've thought about this a bit. I was initially inclined to agree with the idea of extending the existing character classes similar to what Mathias proposes. But I now think that is probably not a very good idea and that what is currently spec'ed (essentially that the /u flag doesn't change the meaning of \w, \d, etc.) is the better path. […] It seems to me that we want programmers to start migrating to full Unicode regular expressions without having to do a major logic rewrite of their code. For example, ideally the above expression could simply be replaced by parseInt(/\s*(\d+)/u.exec(input)[1]) and everything in the application could continue to work unchanged.

I see your point, but I disagree with the notion that we must absolutely maintain backwards compatibility in this case. The fact that the new flag is opt-in gives us an opportunity to improve behavior without obsessing about back-compat, similar to how the strict mode opt-in is used to make all sorts of things better. When evangelizing /u, we can educate developers and tell them to not blindly/needlessly add /u to their existing regular expressions.

Instead, we should leave the definitions of \d, \w and \s unchanged and plan to adopt the already established convention that \p{<Unicode property>} is the notation for matching Unicode categories. See www.regular-expressions.info/unicode.html

We could do both: improve \d and \w now, and add \p{property} and \P{property} later. Anyhow, I’ve filed ecmascript#3157 for reserving \p{…}/\P{…}.

The meaning of \d should not be changed; it is routinely used as a synonym of [0-9]. Changing its meaning is willfully introducing traps in the language, and it will produce bugs, for very little gain. It is much safer to learn to use \pN in the rare situations where one wants to match numerical characters in any script.

For \w and \b, on the other hand, it can be corrected, because nobody would normally consider that there are two word boundaries in the middle of "fiancée", and those are not useful semantics, especially in Unicode-aware contexts (that is, in situations where you should use the u flag).
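The "fiancée" case can be counted explicitly. The sketch below assumes an ES2018+ engine (property escapes and lookbehind, neither of which existed when this was written), and the `WORD` class is an illustrative guess at a Unicode-aware word character, not the definition from any proposal in this thread:

```javascript
// Illustrative assumption: letters, numbers and underscore count as
// word characters.
const WORD = '[\\p{L}\\p{N}_]';

// A boundary is a position with a word character on exactly one side.
const BOUNDARY = `(?:(?<!${WORD})(?=${WORD})|(?<=${WORD})(?!${WORD}))`;

const word = 'fianc\u00E9e'; // "fiancée" with a precomposed é

// \b keeps its ASCII-only definition under u, so é counts as a non-word
// character and adds two interior boundaries.
const asciiBoundaries = word.match(/\b/gu).length;
const unicodeBoundaries = word.match(new RegExp(BOUNDARY, 'gu')).length;

// The same effect on the earlier /na\b/u example.
const naive = 'na\u00EFve';
const asciiHit = /na\b/u.test(naive);
const unicodeHit = new RegExp(`na${BOUNDARY}`, 'u').test(naive);

console.log({ asciiBoundaries, unicodeBoundaries, asciiHit, unicodeHit });
```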

# Norbert Lindenberg (10 years ago)

On Aug 26, 2014, at 11:15 , Mathias Bynens <mathias at qiwi.be> wrote:

On 26 Aug 2014, at 19:01, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

I see one remaining issue: In ES5 (and ES6), /[a-z]/i does not match U+017F (ſ) or U+212A (K) because the ES canonicalization algorithm excludes mapping code points > 127 that toUpperCase to code points <128. However, as currently spec'ed, the ES6 canonicalization algorithm for /u RegExps does not include that >127/<128 exclusion. It maps U+017F to "S" which matches. This is probably a minor variation from the ES5 behavior, but we should probably be sure it is a desirable and tolerable change as we presumably could also apply the >127/<128 filter to /u canonicalization.

This is a useful feature, and the explicit opt-in makes the small back-compat break acceptable IMHO.

I’d say the explicit opt-in means that there is no backwards compatibility issue.

I removed the exclusion based on input from Erik Corry on es-discuss:

esdiscuss/2012-March/021249, esdiscuss/2012-March/021306

At the March 2012 TC39 meeting shortly after, Waldemar explained the motivation for the exclusion, but Unicode case folding was approved with the “u” flag:

esdiscuss/2012-March/021919

Norbert

# Allen Wirfs-Brock (10 years ago)

On Aug 26, 2014, at 1:45 PM, Norbert Lindenberg wrote:

On Aug 26, 2014, at 11:15 , Mathias Bynens <mathias at qiwi.be> wrote:

On 26 Aug 2014, at 19:01, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

I see one remaining issue: In ES5 (and ES6), /[a-z]/i does not match U+017F (ſ) or U+212A (K) because the ES canonicalization algorithm excludes mapping code points > 127 that toUpperCase to code points <128. However, as currently spec'ed, the ES6 canonicalization algorithm for /u RegExps does not include that >127/<128 exclusion. It maps U+017F to "S" which matches. This is probably a minor variation from the ES5 behavior, but we should probably be sure it is a desirable and tolerable change as we presumably could also apply the >127/<128 filter to /u canonicalization.

This is a useful feature, and the explicit opt-in makes the small back-compat break acceptable IMHO.

I’d say the explicit opt-in means that there is no backwards compatibility issue.

Except, as discussed WRT \d, if a JS programmer updates an existing regexp to use /u simply because they want to allow for "full Unicode", they may not realize that they are also changing the matching semantics in other ways. As Claude said, this will cause bugs.

I removed the exclusion based on input from Erik Corry on es-discuss:

esdiscuss/2012-March/021249, esdiscuss/2012-March/021306

At the March 2012 TC39 meeting shortly after, Waldemar explained the motivation for the exclusion, but Unicode case folding was approved with the “u” flag:

esdiscuss/2012-March/021919

I'm actually not very worried that the canonicalization of U+017F and friends is going to break anything. But, if we are reexamining decisions in this space, it should be on the table.