Full Unicode based on UTF-16 proposal

# Norbert Lindenberg (13 years ago)

Based on my prioritization of goals for support for full Unicode in ECMAScript [1], I've put together a proposal for supporting the full Unicode character set based on the existing representation of text in ECMAScript using UTF-16 code unit sequences: norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html

The detailed proposed spec changes serve to get a good idea of the scope of the changes, but will need some polishing.

Comments?

Thanks, Norbert

[1] esdiscuss/2012-February/020721

# Erik Corry (13 years ago)

This is very useful, and was surely a lot of work. I like the general thrust of it a lot. It has a high level of backwards compatibility, does not rely on the VM having two different string implementations in it, and it seems to fix the issues people are encountering.

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

If we are making a /u modifier for RegExp it would also be nice to get rid of the incorrect case independent matching rules. This is the section that says: "If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch."


# Mark Davis ☕ (13 years ago)

Whew, a lot of work, Norbert. Looks quite good. My one question is whether it is worth having a mechanism for iteration.

OLD CODE

   for (var i = 0; i < s.length; ++i) {
     var x = s.charAt(i);
     // do something with x
   }

Using your mechanism, one would write:

NEW CODE

   for (var i = 0; i < s.length; ++i) {
     var x = s.codePointAt(i);
     // do something with x
     if (x > 0xFFFF) { ++i; }  // skip the trail surrogate
   }

In Java, for example, I really wish you could write:

DESIRED

   for (int codepoint : s) {
     // do something with codepoint
   }

However, maybe this kind of iteration is rare enough in ES that it suffices to document the pattern under NEW CODE.

Thanks for all your work!

proposal for upgrading ECMAScript to a Unicode version released in this century

This was amusing; could have said "this millennium" ;-)

Mark plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene —

# Jonas Höglund (13 years ago)

On Sat, 17 Mar 2012 00:23:25 +0100, Mark Davis ☕ <mark at macchiato.com> wrote:

Whew, a lot of work, Norbert. Looks quite good. My one question is
whether it is worth having a mechanism for iteration.

OLD CODE

   for (var i = 0; i < s.length; ++i) {
     var x = s.charAt(i);
     // do something with x
   }

Using your mechanism, one would write:

NEW CODE

   for (var i = 0; i < s.length; ++i) {
     var x = s.codePointAt(i);
     // do something with x
     if (x > 0xFFFF) { ++i; }  // skip the trail surrogate
   }

In Java, for example, I really wish you could write:

DESIRED

   for (int codepoint : s) {
     // do something with codepoint
   }

However, maybe this kind of iteration is rare enough in ES that it
suffices to document the pattern under NEW CODE.

That's the beauty of ECMAScript; it's extensible. :-)

   String.prototype.forEachCodePoint = function(fun) {
     var s = String(this)  // the string being iterated
     for (var i = 0; i < s.length; i++) {
       var x = s.codePointAt(i)
       fun(x, s)
       if (x > 0xFFFF) { ++i }  // skip the trail surrogate of a pair
     }
   }

   "hello".forEachCodePoint(function(x) {
     // do something with x
   })


Jonas

# Norbert Lindenberg (13 years ago)

Thanks for your comments - a few replies below.

Norbert

On Mar 16, 2012, at 1:55 , Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Before asking developers to add /u, we should really have some evidence that not doing so would cause actual compatibility issues for real applications. Do you know of any examples?

Good point about Harmony code, although it seems opt-in got replaced by being part of a module.

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.

NaN is not a valid code point, so it shouldn't be returned. If we want to indicate access to a trailing surrogate code unit as an error, we should throw an exception.

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

It would make a lot of sense, but is outside the scope of this proposal. One step at a time :-)

If we are making a /u modifier for RegExp it would also be nice to get rid of the incorrect case independent matching rules. This is the section that says: "If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch."

And the exception for "ß" and other characters whose upper case equivalent has more than one code point ("If u does not consist of a single character, return ch." in the Canonicalize algorithm in ES 5.1).
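As a rough illustration of the two Canonicalize rules mentioned here, the sketch below shows how they play out in a current engine. It assumes an ES2015-or-later engine in which the `u` flag exists and switches case-insensitive matching to Unicode simple case folding. That is how the flag was eventually standardized, not something settled in this thread.

```javascript
// Rule 1: without /u, a non-ASCII character never matches an ASCII one.
// U+017F is LATIN SMALL LETTER LONG S, U+212A is KELVIN SIGN.
const longS = "\u017F";
const kelvin = "\u212A";

const legacy = {
  longS: /s/i.test(longS),   // false: the ">= 128 vs < 128" rule blocks it
  kelvin: /k/i.test(kelvin), // false: "k" canonicalizes to "K", not U+212A
};

// With the u flag (as later standardized) simple case folding is used instead.
const unicode = {
  longS: /s/iu.test(longS),   // true
  kelvin: /k/iu.test(kelvin), // true
};

// Rule 2: "ß" upper-cases to the two-character "SS", so it is left alone
// and only matches itself.
const sharpS = {
  upper: "ß".toUpperCase(),   // "SS"
  matchesSS: /ß/i.test("SS"), // false
};

console.log(legacy, unicode, sharpS);
```

Note that even with `u`, only simple (one-to-one) folding applies, so "ß" still does not match "SS".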

# Norbert Lindenberg (13 years ago)

In Harmony we should be able to make this even more beautiful using iterators [1]:

If we add:

   String.prototype.[iterator] = function() {
     var s = this;
     return {
       index: 0,
       next: function() {
         if (this.index >= s.length) {
           throw StopIteration;
         }
         let cp = s.codePointAt(this.index);
         // a supplementary code point occupies two code units
         this.index += cp > 0xFFFF ? 2 : 1;
         return cp;
       }
     }
   }

clients can write:

   for (codePoint of str) {
     // do something with codePoint
   }

Norbert

[1] harmony:iterators
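For comparison, here is a minimal sketch of the same idea written against the iteration protocol as it was later standardized in ES2015, where `next()` returns `{ value, done }` instead of throwing `StopIteration`. The helper name `codePoints` is made up for illustration, and `codePointAt` is assumed to return the full code point when the index hits a lead surrogate that is followed by a trail surrogate.

```javascript
// Hypothetical helper: iterate over the code points of a string.
function codePoints(s) {
  return {
    [Symbol.iterator]() {
      let index = 0;
      return {
        next() {
          if (index >= s.length) {
            return { value: undefined, done: true };
          }
          const cp = s.codePointAt(index);
          // Supplementary code points occupy two UTF-16 code units.
          index += cp > 0xFFFF ? 2 : 1;
          return { value: cp, done: false };
        },
      };
    },
  };
}

const seen = [];
for (const cp of codePoints("a\uD800\uDF30b")) {
  seen.push(cp.toString(16));
}
console.log(seen); // [ '61', '10330', '62' ]
```

In engines that implement ES2015, plain `for (const ch of str)` already walks a string by code point, yielding one-character or two-code-unit strings rather than numbers.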

# Erik Corry (13 years ago)

2012/3/17 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:

Thanks for your comments - a few replies below.

Norbert

On Mar 16, 2012, at 1:55 , Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour.  There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Before asking developers to add /u, we should really have some evidence that not doing so would cause actual compatibility issues for real applications. Do you know of any examples?

No. In general I don't think it is realistic to try to prove that problematic code does not exist, since that requires quantifying over all existing JS code, which is clearly impossible.

Good point about Harmony code, although it seems opt-in got replaced by being part of a module.

That would work too, I think.

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.

NaN is not a valid code point, so it shouldn't be returned. If we want to indicate access to a trailing surrogate code unit as an error, we should throw an exception.

Then you should probably remove the text: "If there is no code unit at that position, the result is NaN" from your proposal :-)

I am wary of using exceptions for non-exceptional data-driven events, since performance is usually terrible and it's arguably an abuse of the mechanism. Your iterator code looks fine to me and needs neither NaN nor exceptions.

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

It would make a lot of sense, but is outside the scope of this proposal. One step at a time :-)

I can see that. But if we are going to have multiple versions of the RegExp syntax we should probably aim to keep the number down.

If we are making a /u modifier for RegExp it would also be nice to get rid of the incorrect case independent matching rules. This is the section that says: "If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch."

And the exception for "ß" and other characters whose upper case equivalent has more than one code point ("If u does not consist of a single character, return ch." in the Canonicalize algorithm in ES 5.1).

Yes.

# Norbert Lindenberg (13 years ago)

On Mar 16, 2012, at 19:57 , Erik Corry wrote:

2012/3/17 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:

Thanks for your comments - a few replies below.

Norbert

On Mar 16, 2012, at 1:55 , Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Before asking developers to add /u, we should really have some evidence that not doing so would cause actual compatibility issues for real applications. Do you know of any examples?

No. In general I don't think it is realistic to try to prove that problematic code does not exist, since that requires quantifying over all existing JS code, which is clearly impossible.

We cannot prove its absence, but we can discuss the likelihood of its existence, and showing an actual example is a quick way to bring that discussion to a conclusion.

I note that you didn't challenge my claim about the (un)likelihood of the existence of applications that depend on Deseret characters not being mapped to lower case while calling toLowerCase...

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.

NaN is not a valid code point, so it shouldn't be returned. If we want to indicate access to a trailing surrogate code unit as an error, we should throw an exception.

Then you should probably remove the text: "If there is no code unit at that position, the result is NaN" from your proposal :-)

I am wary of using exceptions for non-exceptional data-driven events, since performance is usually terrible and it's arguably an abuse of the mechanism. Your iterator code looks fine to me and needs neither NaN nor exceptions.

The iterator or codePointAt?

The latter has the statement you quote, which shows a disconnect between what I wrote a few days ago starting from the charCodeAt spec, and what I think when I don't look at that spec. charCodeAt (and hence the current implementation of codePointAt) returns NaN when given an index < 0 or ≥ length. The normal behavior when accessing elements or properties that don't exist is to return undefined. We can't fix charCodeAt anymore, but I can still fix codePointAt.
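The edge cases under discussion can be checked directly. The sketch below assumes an ES2015-or-later engine, where `codePointAt` shipped returning `undefined` for out-of-range indices and the bare trail surrogate for an index in the middle of a pair. That is the eventual standard behaviour, not necessarily what the draft said at this point in the thread.

```javascript
const text = "\uD800\uDF30"; // U+10330 encoded as a surrogate pair

const results = {
  charCodeOutOfRange: text.charCodeAt(5),   // NaN (cannot be changed anymore)
  codePointOutOfRange: text.codePointAt(5), // undefined
  atLead: text.codePointAt(0),              // 0x10330, the full code point
  atTrail: text.codePointAt(1),             // 0xDF30, the trail surrogate itself
};

console.log(results);
```

So neither NaN nor an exception is used for the trail-surrogate case: the caller simply gets the code unit back.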

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

It would make a lot of sense, but is outside the scope of this proposal. One step at a time :-)

I can see that. But if we are going to have multiple versions of the RegExp syntax we should probably aim to keep the number down.

True. And in the meantime Brendan pointed to some regex proposals that try to address a different set of Unicode-related issues, also with a /u flag. Some coordination is clearly needed. blog.stevenlevithan.com/archives/fixing

# Steven L. (13 years ago)

Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let /./.exec(s)[0].length == 2. Instead, if this is deemed an important enough issue, there are two ways to match any Unicode grapheme that match existing regex library precedent:

From Perl and PCRE:

\X

From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):

\P{M}\p{M}*

Obviously \X is prettier, but because it's fairly rare for people to care about this, IMO the more widely compatible solution that uses Unicode categories is Good Enough if Unicode category syntax is on the table for ES6.

Norbert Lindenberg wrote:

\uxxxx[\uyyyy-\uzzzz] is interpreted as [\uxxxx\uyyyy-\uxxxx\uzzzz]

[\uwwww-\uxxxx][\uyyyy-\uzzzz] is interpreted as [\uwwww\uyyyy-\uxxxx\uzzzz]

This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters.

Yikes! -1! This is unnecessary if the handling of \uhhhh is unmodified and support for \u{h..} and/or \x{h..} is added (the latter is the syntax from Perl and PCRE). Some people will want a way to match arbitrary Unicode code points rather than graphemes anyway, so leaving \uhhhh alone lets that use case continue working. This would still allow modifying the handling of literal astral/supplementary characters in RegExps. If it can be handled sensibly, I'm all for treating literal characters in RegExps as discrete graphemes rather than splitting them into surrogate pairs.

--Steven Levithan
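To make the difference concrete, here is a small sketch using the syntax that ES2015 eventually adopted: `\u{h..}` escapes, which are available in regular expressions only under the `u` flag. Treat it as an illustration of where things ended up rather than of either proposal in this thread.

```javascript
const ahsa = "\uD800\uDF30"; // U+10330 GOTHIC LETTER AHSA

const checks = {
  // Without u, a class holding a surrogate pair is two separate code units,
  // so it can never match the two-unit string as a whole.
  legacyClass: /^[\uD800\uDF30]$/.test(ahsa),    // false
  // With u, the same pair of escapes is read as one code point.
  unicodeClass: /^[\uD800\uDF30]$/u.test(ahsa),  // true
  // Code point escape and a range over supplementary characters.
  escape: /^\u{10330}$/u.test(ahsa),             // true
  range: /^[\u{10330}-\u{1034A}]$/u.test(ahsa),  // true
  // Without u, \uhhhh keeps matching individual code units.
  loneLead: /^\uD800/.test(ahsa),                // true
};

console.log(checks);
```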

# Erik Corry (13 years ago)

2012/3/17 Steven L. <steves_list at hotmail.com>:

Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour.  There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let /./.exec(s)[0].length == 2.

Care to enlighten us with any thinking behind this disagreeing?

Instead, if this is deemed an important enough issue, there are two ways to match any Unicode grapheme that match existing regex library precedent:

From Perl and PCRE:

\X

This doesn't work inside []. Were you envisioning the same restriction in JS?

Also it matches a grapheme cluster, which may be useful but is completely different to what the dot does.

From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):

\P{M}\p{M}*

Obviously \X is prettier, but because it's fairly rare for people to care about this, IMO the more widely compatible solution that uses Unicode categories is Good Enough if Unicode category syntax is on the table for ES6.

Norbert Lindenberg wrote:

\uxxxx[\uyyyy-\uzzzz] is interpreted as [\uxxxx\uyyyy-\uxxxx\uzzzz]

Norbert, this just happens automatically if unmatched surrogates are just treated as if they were normal code units.

[\uwwww-\uxxxx][\uyyyy-\uzzzz] is interpreted as [\uwwww\uyyyy-\uxxxx\uzzzz]

Norbert, this will have different semantics to the current implementations unless the second range is the full trail surrogate range.

I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now.

Some people will want a way to match arbitrary Unicode code points rather than graphemes anyway, so leaving \uhhhh alone lets that use case continue working. This would still allow modifying the handling of literal astral/supplementary characters in RegExps. If it can be handled sensibly, I'm all for treating literal characters in RegExps as discrete graphemes rather than splitting them into surrogate pairs.

You seem to be confusing graphemes and unicode code points. Here is the same text 3 times:

Four UTF-16 code units:

0x0020 0xD800 0xDF30 0x0308

Three Unicode code points:

0x20 0x10330 0x308

Two Graphemes

" " "¨" <-- This is an attempt to show a Gothic Ahsa with an umlaut. My mail program probably screwed it up.

The proposal you are responding to is all about adding Unicode code point handling to regexps. It is not about adding grapheme support, which is a rather different issue.
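The three views of that sample text can be reproduced in a few lines. This is only a sketch: it assumes an engine with the `u` flag and Unicode property escapes (ES2018), and it approximates graphemes with the `\P{M}\p{M}*` pattern quoted above rather than with full grapheme cluster rules.

```javascript
// Space, U+10330 (as a surrogate pair), U+0308 COMBINING DIAERESIS
const sample = "\u0020\uD800\uDF30\u0308";

const counts = {
  codeUnits: sample.length,                        // 4
  codePoints: Array.from(sample).length,           // 3
  graphemes: sample.match(/\P{M}\p{M}*/gu).length, // 2
};

// What "." consumes at the Gothic letter, without and with the u flag.
const dotLegacy = /./.exec(sample.slice(1))[0].length;   // 1 code unit
const dotUnicode = /./u.exec(sample.slice(1))[0].length; // 2 code units

console.log(counts, dotLegacy, dotUnicode);
```

The dot moves from code units to code points under `u`. It never matches a whole grapheme, which is the distinction being drawn here.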

# Steven L. (13 years ago)

Erik Corry wrote:

Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let /./.exec(s)[0].length == 2.

Care to enlighten us with any thinking behind this disagreeing?

Sorry for the rushed and overly ebullient message. I disagreed with /u for switching from code unit to code point mode because in the moment I didn't think a code point mode necessary or particularly beneficial. Upon further reflection, I rushed into this opinion and will be more closely examining the related issues.

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES). Therefore, I think that if a flag is added that only switches from code unit to code point mode, it should not be "u". Presumably, flag /u could simultaneously affect \d\w\b and switch to code point mode. I haven't yet thought enough about combining these two proposals to hold a strong opinion on the matter.

there are two ways to match any Unicode grapheme that match existing regex library precedent:

From Perl and PCRE: \X

This doesn't work inside []. Were you envisioning the same restriction in JS?

Also it matches a grapheme cluster, which may be useful but is completely different to what the dot does.

You are of course correct. And yes, I was envisioning the same restriction within character classes. But I'm not a strong proponent of \X, especially if support for Unicode categories is added.

I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now.

Glad to hear it.

You seem to be confusing graphemes and unicode code points. [...] The proposal you are responding to is all about adding Unicode code point handling to regexps. It is not about adding grapheme support, which is a rather different issue.

Indeed. My response was rushed and poorly formed. My apologies.

--Steven Levithan

# Norbert Lindenberg (13 years ago)

Steven, sorry, I wasn't aware of your proposal for /u when I inserted the note on this flag into my proposal. My proposal was inspired by the use of /u in PHP, where it switches from byte mode to UTF-8 mode. We'll have to see whether it makes sense to combine the two under one flag or use two - fortunately, Unicode still has a few other characters.

Norbert

# Erik Corry (13 years ago)

2012/3/17 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:

Steven, sorry, I wasn't aware of your proposal for /u when I inserted the note on this flag into my proposal. My proposal was inspired by the use of /u in PHP, where it switches from byte mode to UTF-8 mode. We'll have to see whether it makes sense to combine the two under one flag or use two - fortunately, Unicode still has a few other characters.

/foo/☃ // slash-unicode-snowman for the win! :-)

# Erik Corry (13 years ago)

2012/3/17 Steven L. <steves_list at hotmail.com>:

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this. I think "any digit including rods and roman characters but not decimal points/commas" en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space. The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful.

And instead of changing the meaning of \w, which will be confusing, I think that [:alnum:] as in perl would work fine.

\b is a little tougher. The Unicode rewrite would be (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is obviously too verbose. But if we take \b for this then the ASCII version has to be written as (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little annoying. However, often you don't need that if you have negative lookbehind because you can write something like

/(?<!\w)word(?!\w)/ // Negative look-behind for a \w and negative look-ahead for \w at the end.

which isn't too bad, even if it is much worse than

/\bword\b/
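A quick way to see that the lookaround form and the `\b` form agree for a pattern that starts and ends with a word character is to run both over a few inputs. The sketch assumes an engine with lookbehind support (standardized in ES2018, so not available when this thread was written); the sample strings are arbitrary.

```javascript
const boundary = /\bword\b/;
const lookaround = /(?<!\w)word(?!\w)/;

const samples = ["a word here", "swordfish", "word", "words", "pass-word!"];

// Each row: [input, result with \b, result with lookarounds]
const rows = samples.map((s) => [s, boundary.test(s), lookaround.test(s)]);

console.log(rows);
```

The two columns agree only because `word` itself begins and ends with `\w` characters; for patterns that start or end with a non-word character, `\b` and the lookarounds assert different things.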

Indeed. My response was rushed and poorly formed. My apologies.

Gratefully accepted with the hope that my next rushed and poorly formed response will also be forgiven!

# Norbert Lindenberg (13 years ago)

On Mar 17, 2012, at 10:20 , Erik Corry wrote:

2012/3/17 Steven L. <steves_list at hotmail.com>:

Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let /./.exec(s)[0].length == 2.

Care to enlighten us with any thinking behind this disagreeing?

Instead, if this is deemed an important enough issue, there are two ways to match any Unicode grapheme that match existing regex library precedent:

From Perl and PCRE:

\X

This doesn't work inside []. Were you envisioning the same restriction in JS?

Also it matches a grapheme cluster, which may be useful but is completely different to what the dot does.

From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):

\P{M}\p{M}*

Obviously \X is prettier, but because it's fairly rare for people to care about this, IMO the more widely compatible solution that uses Unicode categories is Good Enough if Unicode category syntax is on the table for ES6.

Norbert Lindenberg wrote:

\uxxxx[\uyyyy-\uzzzz] is interpreted as [\uxxxx\uyyyy-\uxxxx\uzzzz]

Norbert, this just happens automatically if unmatched surrogates are just treated as if they were normal code units.

I don't see how. In the actual matching process, the new design only looks at code points, not code units. Without this transformation, it would see surrogate code points in the pattern, but supplementary code points in the text to be matched. Enhancing the matching process to recognize surrogate code points and insert them into the continuation might work, but wouldn't be any prettier than this transformation.

[\uwwww-\uxxxx][\uyyyy-\uzzzz] is interpreted as [\uwwww\uyyyy-\uxxxx\uzzzz]

Norbert, this will have different semantics to the current implementations unless the second range is the full trail surrogate range.

True. I think if we restrict the transformation to that specific case it'll still cover normal usage of this pattern.

I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now.

Some people will want a way to match arbitrary Unicode code points rather than graphemes anyway, so leaving \uhhhh alone lets that use case continue working. This would still allow modifying the handling of literal astral/supplementary characters in RegExps. If it can be handled sensibly, I'm all for treating literal characters in RegExps as discrete graphemes rather than splitting them into surrogate pairs.

You seem to be confusing graphemes and unicode code points. Here is the same text 3 times:

Four UTF-16 code units:

0x0020 0xD800 0xDF30 0x0308

Three Unicode code points:

0x20 0x10330 0x308

Two Graphemes

" " "¨" <-- This is an attempt to show a Gothic Ahsa with an umlaut. My mail program probably screwed it up.

Mac Mail is usually Unicode-friendly, so let's try again: " 𐌰̈"

The proposal you are responding to is all about adding Unicode code point handling to regexps. It is not about adding grapheme support, which is a rather different issue.

Correct - thanks for the explanation!

# Norbert Lindenberg (13 years ago)

On Mar 17, 2012, at 11:58 , Erik Corry wrote:

2012/3/17 Steven L. <steves_list at hotmail.com>:

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this. I think "any digit including rods and roman characters but not decimal points/commas" en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space. The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful.

Looking at that page, it seems \d gives you a reasonable set of digits, the ones in the Unicode general category Nd (number, decimal). These digits come from a variety of writing systems, but are all used decimal-positional, so you can parse at least integers using them with a fairly generic algorithm.

Dealing with roman numerals or counting rods requires specialized algorithms, so you probably don't want to find them in this bucket.

Norbert

# Steven L. (13 years ago)

Erik Corry wrote:

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this. I think "any digit including rods and roman characters but not decimal points/commas" en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space. The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful.

I know from experience that it's common for Arabic speakers to want to match both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari digits, and probably others. Even if it wasn't often useful, IMO this change is necessary for congruity with Unicode-enabled \w and \b (I'll get to that), and would likely never be detrimental since /u would be opt-in and it's easy to explicitly use [0-9] when that's what you want.

For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not /\p{N}/. I.e., it should not match any Unicode number, but rather any Unicode decimal digit (see unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Nd} for the list). And as Norbert noted, that is in fact what Perl's \d matches.
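The difference between `\p{Nd}` and `\p{N}` can be seen with a few sample characters. This sketch assumes an engine with Unicode property escapes (ES2018). Note that in the language as it was eventually standardized, `\d` stayed `[0-9]` even under the `u` flag, so the first check shows current behaviour rather than the behaviour proposed here.

```javascript
const arabicIndic = "\u0663\u0664\u0665"; // ARABIC-INDIC DIGITS THREE, FOUR, FIVE
const devanagari = "\u0969\u096A\u096B";  // DEVANAGARI DIGITS THREE, FOUR, FIVE
const roman = "\u2167";                   // ROMAN NUMERAL EIGHT (category Nl)

const digitChecks = {
  asciiOnly: /^\d+$/u.test(arabicIndic),       // false: \d is still [0-9]
  ndArabic: /^\p{Nd}+$/u.test(arabicIndic),    // true
  ndDevanagari: /^\p{Nd}+$/u.test(devanagari), // true
  ndRoman: /^\p{Nd}+$/u.test(roman),           // false: not a decimal digit
  nRoman: /^\p{N}+$/u.test(roman),             // true: any Unicode number
};

console.log(digitChecks);
```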

Comparison with other regex flavors:

  • \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).

  • \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).

  • \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).

  • \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).

  • \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).

  • \d == \p{Nd} -- .NET, Perl, Python (with (?u)).

  • \s == [\x09-\x0D\x20] -- Java, PCRE, Ruby, Python (default).

  • \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Note that Java's \w and \b are inconsistent.

Unicode-based \w and \b are incredibly useful, and it is very common for users to sometimes want them to be Unicode-based--thus, an opt-in flag offers the best of both worlds. In fact, I'd go so far as to say they are broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which currently returns true.
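The `naïve` example can be run as-is, and a Unicode-aware boundary can be approximated with a lookahead. This is a sketch only: it assumes an engine with the `u` flag and property escapes (ES2018), and the lookahead is a hypothetical stand-in for what a Unicode-based `\b` might do after a word character, not an existing feature.

```javascript
const word = "na\u00EFve"; // "naïve"

// ASCII-only \b sees a word boundary between "a" and "ï".
const asciiBoundary = /a\b/.test(word); // true

// Stand-in for a Unicode-aware "a\b": "a" not followed by a letter,
// decimal digit, or underscore.
const unicodeBoundary = /a(?![\p{L}\p{Nd}_])/u.test(word); // false

// The same stand-in still finds the real end of the word.
const realEnd = /e(?![\p{L}\p{Nd}_])/u.test(word); // true

console.log(asciiBoundary, unicodeBoundary, realEnd);
```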

Unicode-based \d would not only help international users/apps, it is also important because otherwise Unicode-based \w and \b would have to use [\p{L}0-9_] rather than [\p{L}\p{Nd}], which breaks portability with .NET, Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used [\p{L}\p{Nd}] but \d used [0-9], then among other consequences (including user confusion), [^\W\d_] could not be used equivalently to \p{L}.

And instead of changing the meaning of \w, which will be confusing, I think that [:alnum:] as in perl would work fine.

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works only within character classes. IMO, the POSIX-style [[:name:]] syntax is clumsy and confusing, not to mention backward incompatible. It would potentially also be confusing if ES supports only [:alnum:] without adding the rest of the (not-very-useful) POSIX regex class names.

\b is a little tougher. The Unicode rewrite would be (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is obviously too verbose. But if we take \b for this then the ASCII version has to be written as (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little annoying. However, often you don't need that if you have negative lookbehind because you can write something like

/(?<!\w)word(?!\w)/ // Negative look-behind for a \w and negative look-ahead for \w at the end.

which isn't too bad, even if it is much worse than

/\bword\b/

I've already started to explain above why I think Unicode-based \b is important and useful. I'll just add the footnote that relying on lookbehind would in all likelihood perform less efficiently than \b (depending on implementation optimizations).

Indeed. My response was rushed and poorly formed. My apologies.

Gratefully accepted with the hope that my next rushed and poorly formed response will also be forgiven!

Consider it done. ;-P

--Steven Levithan

# Erik Corry (13 years ago)

2012/3/18 Steven L. <steves_list at hotmail.com>:

Erik Corry wrote:

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this.  I think "any digit including rods and roman characters but not decimal points/commas" en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space.  The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes  This suggests to me that it's not very useful.

I know from experience that it's common for Arabic speakers to want to match both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari digits, and probably others. Even if it wasn't often useful, IMO this change is necessary for congruity with Unicode-enabled \w and \b (I'll get to that), and would likely never be detrimental since /u would be opt-in and it's easy to explicitly use [0-9] when that's what you want.

For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not /\p{N}/. I.e., it should not match any Unicode number, but rather any Unicode decimal digit (see unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Nd} for the list). And as Norbert noted, that is in fact what Perl's \d matches.

Ah, that makes much more sense.
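To make the \p{Nd} versus \p{N} distinction concrete, here is a small sketch. It assumes an engine with Unicode property escapes under a `u` flag, which arrived in ECMAScript well after this thread. In the `u` flag as it eventually shipped, `\d` itself still means [0-9], so `\p{Nd}` below only stands in for the proposed /\d/u. The variable names are illustrative.

```javascript
// Sample characters, written as escapes so file encoding does not matter.
const arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE, General_Category Nd
const devanagariFive = "\u096B";   // DEVANAGARI DIGIT FIVE, General_Category Nd
const romanEight = "\u2167";       // ROMAN NUMERAL EIGHT, General_Category Nl

const asciiDigit = /^\d$/;         // current \d, i.e. [0-9]
const decimalDigit = /^\p{Nd}$/u;  // stand-in for the proposed /\d/u
const anyNumber = /^\p{N}$/u;      // the broader "any Unicode number" set

// Print, per character: ASCII digit? decimal digit? any number?
for (const ch of ["7", arabicIndicThree, devanagariFive, romanEight]) {
  console.log(ch, asciiDigit.test(ch), decimalDigit.test(ch), anyNumber.test(ch));
}
```

The Roman numeral is the interesting row: it is a number (\p{N}) but not a decimal digit (\p{Nd}), so the proposed /\d/u would not match it.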

Comparison with other regex flavors:

  • \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).

  • \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).

  • \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).

  • \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).

  • \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).

  • \d == \p{Nd} -- .NET, Perl, Python (with (?u)).

  • \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).

  • \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Note that Java's \w and \b are inconsistent.

Unicode-based \w and \b are incredibly useful, and it is very common for users to sometimes want them to be Unicode-based--thus, an opt-in flag offers the best of both worlds. In fact, I'd go so far as to say they are broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which currently returns true.
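The 'naïve' case can be reproduced in any current engine. This sketch uses an escape for "ï" so the result does not depend on how the source file is encoded; the variable name is illustrative.

```javascript
// "naïve" with a precomposed U+00EF LATIN SMALL LETTER I WITH DIAERESIS.
const naive = "na\u00EFve";

// ASCII-only \w does not include "ï", so \b sees a word boundary after "a".
console.log(/a\b/.test(naive));      // true
console.log(/\w/.test("\u00EF"));    // false

// As a consequence, \w+ splits the word in two.
console.log(naive.match(/\w+/g));    // [ 'na', 've' ]
```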

Unicode-based \d would not only help international users/apps, it is also important because otherwise Unicode-based \w and \b would have to use [\p{L}0-9_] rather than [\p{L}\p{Nd}], which breaks portability with .NET, Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used [\p{L}\p{Nd}] but \d used [0-9], then among other consequences (including user confusion), [^\W\d_] could not be used equivalently to \p{L}.

And instead of changing the meaning of \w, which will be confusing, I think that [:alnum:] as in perl would work fine.

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works

This would be pretty useless and is not true in perl. I tried the following:

perl -e "use utf8; print 'æ' =~ /[[:alnum:]]/ . "\n";"

and it prints 1, indicating a match.

only within character classes. IMO, the POSIX-style [[:name:]] syntax is clumsy and confusing, not to mention backward incompatible. It would potentially also be confusing if ES supports only [:alnum:] without adding the rest of the (not-very-useful) POSIX regex class names.

The implication was to add the rest too. Seeing things like the regexp at the bottom of this page inimino.org/~inimino/blog/javascript_cset is an indication to me that there is a demand.

\b is a little tougher.  The Unicode rewrite would be (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is obviously too verbose.  But if we take \b for this then the ASCII version has to be written as (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little annoying.  However, often you don't need that if you have negative lookbehind because you can write something like

/(?<!\w)word(?!\w)/    // Negative look-behind for a \w and negative look-ahead for \w at the end.

which isn't too bad, even if it is much worse than

/\bword\b/

I've already started to explain above why I think Unicode-based \b is important and useful. I'll just add the footnote that relying on lookbehind would in all likelihood perform less efficiently than \b (depending on implementation optimizations).

OK, I'm convinced that /u should make \d, \b and \w Unicode aware. I don't think the performance will be much different between a lookbehind and a \b though.
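For comparison, here is a sketch of the lookaround rewrite discussed above. It assumes an engine with lookbehind and Unicode property escapes (both later additions to ECMAScript) and uses the [\p{L}\p{Nd}_] word set mentioned earlier in the thread as a stand-in for a Unicode-aware \w. It says nothing about relative performance, which depends on the implementation.

```javascript
// ASCII version: the lookaround form and \b agree on these inputs.
const withBoundary = /\bword\b/;
const withLookaround = /(?<!\w)word(?!\w)/;
console.log(withBoundary.test("a word here"), withLookaround.test("a word here")); // true true
console.log(withBoundary.test("swordfish"), withLookaround.test("swordfish"));     // false false

// Unicode-aware emulation, assuming [\p{L}\p{Nd}_] as the word set.
const asciiBounded = /\bna\b/;
const unicodeBounded = /(?<![\p{L}\p{Nd}_])na(?![\p{L}\p{Nd}_])/u;

const naive = "na\u00EFve"; // "naïve"
console.log(asciiBounded.test(naive));     // true: "ï" looks like a non-word character
console.log(unicodeBounded.test(naive));   // false: "ï" is a letter
console.log(unicodeBounded.test("na na")); // true
```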

# Steven L. (13 years ago)

Steven Levithan wrote:

  • \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).
  • \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Oops. My ASCII-only version of \s is obviously missing space \x20 and no-break space \xA0 (which are included in Unicode's \p{Z}).
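A quick check of the correction, as a sketch. It shows that the range [\x09-\x0D] on its own misses the plain space, while ES's existing \s already covers the space, the no-break space, and other Unicode space separators. The variable name is illustrative.

```javascript
const controlsOnly = /^[\x09-\x0D]$/; // tab, LF, VT, FF, CR only

console.log(controlsOnly.test("\t"));   // true
console.log(controlsOnly.test(" "));    // false: U+0020 is outside the range
console.log(/^\s$/.test(" "));          // true
console.log(/^\s$/.test("\u00A0"));     // true: NO-BREAK SPACE
console.log(/^\s$/.test("\u2003"));     // true: EM SPACE, General_Category Zs
```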

Erik Corry wrote:

Steven Levithan wrote:

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing.

This would be pretty useless and is not true in perl. I tried the following:

perl -e "use utf8; print 'æ' =~ /[[:alnum:]]/ . "\n";"

and it prints 1, indicating a match.

<Updating my mental notes> Roger that. Online docs (including the Perl-specific page you linked to earlier) typically list [:alnum:] as [A-Za-z0-9], but I've just done some quick testing and it seems that regex packages supporting [:alnum:] give it at least three different meanings:

  • [A-Za-z0-9]
  • [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]
  • [\p{Ll}\p{Lu}\p{Lt}\p{Nd}\p{Nl}]

Note that although Java doesn't support POSIX character class syntax, it too supports alnum via \p{Alnum}. Java's alnum matches only [A-Za-z0-9].

Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :)

Erik Corry wrote:

OK, I'm convinced that /u should make \d, \b and \w Unicode aware.

w00t!

--Steven Levithan

# Erik Corry (13 years ago)

2012/3/18 Steven L. <steves_list at hotmail.com>:

Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :)

Please do, they seem quite sensible to me.

In fact \w with Unicode support seems very similar to [:alnum:] to me. If this one is useful are there not other Unicode categories that would be useful?

# Steven L. (13 years ago)

Erik Corry wrote:

Steven Levithan wrote:

Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :)

Please do, they seem quite sensible to me.

My main objections are due to the POSIX character class syntax itself, and my preference for introducing Unicode categories using \p{..} instead. But to get down a little more detail...

  • They're backward incompatible. /[[:name:]]/ is currently equivalent to /[[:aemn]]/ in web-reality. Granted, this probably won't be a big deal for existing code, but because they're not currently an error, their use could cause latent bugs in old browsers that don't support them and treat them as part of a character class's set.

  • They work inside of bracket expressions only. This is clumsy and needlessly confusing. [:alnum:] outside of a bracket expression would probably have to continue to be equivalent to [:almnu], which would lead to at least occasional developer frustration and bugs.

  • Since the exact characters they match differ between regex libraries (beyond just Unicode version variation), they would contribute to the existing landscape of regex features that seem to be portable but actually work slightly differently in different places. We need less of this.

  • They are either rarely useful or only minor conveniences over existing shorthands, explicit character classes, or Unicode categories that could be matched using \p{..} in more standardized fashion.

  • Other implementations, at least, do not allow them to be negated on their own, unlike \p{..} (via \P{..} or \p{^..}). They can be negated by using them in negated bracket expressions, but that may negate more than you want.

  • If ES ever adopts .NET/XPath-style character class subtraction or Java-style character class intersection (the latter was on the cards for ES4), their syntax would become even more confusing.

  • Bonus pompous bullet point: IMO, there are more useful and important new RegExp features to focus on, including support for Unicode categories (which, IMO, are regex's new and improved version of POSIX character classes). My personal wishlist would probably include at least 20 new regex features above POSIX character classes, even if they were introduced using the \p{..} syntax (which is how Java included them).

  • Bonus nitpick: The name of the syntax itself causes confusion. POSIX calls them character classes, and calls their container a bracket expression. JavaScripters already call the container a character class. (Not an objection, per se. Presumably we could call them something like "POSIX shorthands" to avoid confusion.)
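The backward-compatibility point in the first bullet can be seen directly. This sketch shows how a pattern that looks like a POSIX class is read today (without any new flag): as the character class [[:alnum:] followed by a literal "]". The variable name is illustrative.

```javascript
// Parsed as the class  [ [ : a l n u m : ]  followed by a literal "]".
const looksPosix = /^[[:alnum:]]$/;

console.log(looksPosix.test("a]")); // true: "a" is in the class, then "]"
console.log(looksPosix.test(":]")); // true: ":" is in the class too
console.log(looksPosix.test("[]")); // true: so is "["
console.log(looksPosix.test("a"));  // false: the trailing "]" is required
console.log(looksPosix.test("5"));  // false: "5" is not in the class at all
```

Code written against a POSIX-aware engine would therefore run without error in an older one and silently match something else.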

I'd have no actual objections to adding them using the \p{Name} syntax (as Java does), especially if there is demand for them among regex power-users (you're the first person who I've seen strongly advocate for them). However, I'd still have concerns about exactly which names are added, exactly what they match, and their compatibility with other regex flavors.

In fact \w with Unicode support seems very similar to [:alnum:] to me. If this one is useful are there not other Unicode categories that would be useful?

\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).

As you said, though, Unicode categories are indeed quite useful. Unicode scripts, too. I'd advocate for them alongside you. Because of how useful they are, I've even made them usable via my XRegExp JavaScript library (see git.io/xregexp ). That lib has a relatively small but enthusiastic user base and is seeing increasing use in server-side JS, where the overhead of loading long Unicode code point ranges doesn't matter as much. But, so long as a /u flag is added for switching \w\b\d to Unicode-mode, I'd argue that even Unicode categories and scripts are less important than various other features I've mentioned recently on es-discuss, including named capture and atomic groups.

-- Steven Levithan

# Steven L. (13 years ago)

Steven Levithan wrote:

\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).

Although some regex libraries indeed implement the above, I've just looked over UTS#18 Annex C, which requires that \w be equivalent to:

[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]

Note that \p{Alphabetic} should include more than just \p{L}. I'm not clear on whether the differences from \p{L} are fully covered by the inclusion of \p{M} in the above character class. I'm sure there are plenty of people here with greater Unicode expertise than me who could clarify, though.
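As a rough way to see where the two candidate definitions of \w diverge, the sketch below compares them on a few sample characters. It assumes an engine with Unicode property escapes (a later addition to ECMAScript), and the names are illustrative. It only probes a handful of characters and does not settle whether \p{M} covers every difference between \p{Alphabetic} and \p{L}.

```javascript
const simpleWord = /^[\p{L}\p{Nd}_]$/u;                       // the simpler set
const uts18Word = /^[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]$/u;     // the UTS#18 set quoted above

const samples = {
  "e-acute (Ll)": "\u00E9",
  "combining acute (Mn)": "\u0301",
  "Roman numeral eight (Nl)": "\u2167",
  "undertie (Pc)": "\u203F",
  "underscore (Pc)": "_",
};

// Print, per sample: matched by the simple set? matched by the UTS#18 set?
for (const [label, ch] of Object.entries(samples)) {
  console.log(label, simpleWord.test(ch), uts18Word.test(ch));
}
```

In this sample the Roman numeral is picked up by \p{Alphabetic} (letter numbers count as alphabetic) rather than by \p{M}, which suggests the two sets differ by more than combining marks.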

-- Steven Levithan

# Steven L. (13 years ago)

Java SE 7 apparently added flag (?U) to do the same thing as Python's (?u). The new flag also affects Java's POSIX character class definitions such as \p{Alnum}.

Note the difference in casing, and also that Java's (?U)\w follows UTS#18, unlike Python's (?u)\w. Java has long supported a lowercase (?u) flag for Unicode-aware case folding.

-- Steven Levithan


# Norbert Lindenberg (13 years ago)

I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. norbertlindenberg.com/2012/03/ecmascript-supplementary-characters

Norbert

# Steven Levithan (13 years ago)

Norbert Lindenberg wrote:

I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. norbertlindenberg.com/2012/03/ecmascript-supplementary-characters

Cool.

From the proposal's Updates section:

Indicated that "u" may not be the actual character for the flag for code point mode in regular expressions, as a "u" flag has already been proposed for Unicode-aware digit and word character matching.

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

  1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert.

  2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u.

  3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

Item number 3 is inspired by but different than Java's lowercase u flag for Unicode casefolding. In Java, flag u itself enables Unicode casefolding and does not need to be paired with flag i (which is equivalent to ES's /i).

As an aside, merging these three things would likely lead to /u seeing widespread use when dealing with anything more than ASCII, at least in environments where you don't have to worry about backcompat. This would help developers avoid stumbling on code unit issues in the small minority of cases where non-BMP characters are used or encountered. If /u's only purpose was to switch to code point mode, most likely it would be used far less often, and more developers would continue to get bitten by code-unit-based processing.

As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

  1. "[S]ome applications might have processed gunk with regular expressions where neither the 'characters' in the patterns nor the input to be matched are text."

  2. "s.match(/^.$/)[0].length can now be 2." I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6.

  3. /./g.exec(s) can now increment the regex's lastIndex by 2.
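The three effects listed above can be observed by comparing code unit mode with code point mode. This sketch assumes an engine that implements a `u` flag with code point semantics, as engines eventually did; the variable names are illustrative.

```javascript
const tetragram = "\u{1D306}";           // one supplementary code point
console.log(tetragram.length);           // 2: it occupies two UTF-16 code units

// "." matches one code unit without the flag, one code point with it.
console.log(/^.$/.test(tetragram));      // false
console.log(/^.$/u.test(tetragram));     // true
console.log(tetragram.match(/^.$/u)[0].length); // 2

// .{3} can span between 3 and 6 code units in code point mode.
const mixed = "a" + tetragram + "b";
console.log(/.{3}/.exec(mixed)[0].length);  // 3: "a" plus both surrogates
console.log(/.{3}/u.exec(mixed)[0].length); // 4: "a", the tetragram, "b"

// lastIndex advances by code units, so it can jump by 2.
const byCodeUnit = /./g;
const byCodePoint = /./gu;
byCodeUnit.exec(tetragram);
byCodePoint.exec(tetragram);
console.log(byCodeUnit.lastIndex, byCodePoint.lastIndex); // 1 2
```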

-- Steven Levithan

# Lasse Reichstein (13 years ago)

On Fri, Mar 23, 2012 at 2:30 PM, Steven Levithan <steves_list at hotmail.com> wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

...

  3. [New proposal] Makes /i use Unicode casefolding rules.

Yey, I'm for it :) Especially if it means dropping the rather naïve canonicalize function that can't canonicalize an ASCII character with a non-ASCII character.

/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

I think a compliant implementation should (read: ought to) already get that example, since "στιγμας".toUpperCase() == "ΣΤΙΓΜΑΣ".toUpperCase() in the browsers I have checked, and the ignore-case canonicalization is based on toUpperCase. Alas, most of the implementations miss it anyway.

# Phillips, Addison (13 years ago)

Comments follow.

  1. Definition of string. You say:

-- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so it may be ill-formed when interpreted as a UTF-16 code unit sequence.

I know what you mean, but others might not. Perhaps:

-- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so the sequence of code units may contain code units that are not valid in Unicode or sequences that do not represent Unicode code points (such as unpaired surrogates).

  2. In this section, I would define string after code unit and code point. I would also include a definition of surrogates/surrogate pairs.

  3. Under "text interpretation" you say:

-- For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which can never represent characters).

This would (see above) benefit from having a definition in place. As noted, this is slightly incomplete, since surrogate code units are used to form supplementary characters. Perhaps:

-- For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which do not individually represent characters).

  4. 0xFFFE and 0xFFFF are non-characters in Unicode. I do think you do the right thing here. It's just a nit that you never note this ;-).

  5. Editorial unnecessary ;-):

-- This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters.

  6. Under 'details' you suggest a number of renamings. Are these strictly necessary? The term 'character' could be taken to mean 'code point' instead, with an explanatory note.

  7. Skipping down a lot, to "section 6 source text", you propose:

-- The text is expected to have been normalised to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15.

I think this should be removed or modified. Automatic application of NFC is not always desirable, as it can affect presentation or processing. Perhaps:

-- Normalization of the text to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15, is recommended when transcoding from another character encoding.

  8. In "7.6 Identifier Names and Identifiers" you don't actually forbid unpaired surrogates or non-characters in the text (Identifier_Part:: does this by implication). Perhaps state it? Also, ZWJ and ZWNJ are permitted as the last character in an identifier.

  9. "15.5.4.6": you say "(a nonnegative integer less than 0x10FFFF)", whereas it should say: "(a nonnegative integer less than or equal to 0x10FFFF)"

  10. In the section on "what about utf-32", you say: " and the code points start at positions 1, 2, 3.". Of course this should be "... and the code points start at positions 0, 1, 2."

Thanks for this proposal!

Addison

# Roger Andrews (13 years ago)

Concerning UTF-16 surrogate pairs, how about a function like: String.isValid( str ) to discover whether surrogates are used correctly in 'str'?

Something like Array.isArray().

Nb. Already encodeURI throws a URIError exception if 'str' is not a well-formed UTF-16 string.
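A minimal sketch of what such a check might look like, written as a plain function rather than the proposed String.isValid. Both function names here are hypothetical, and "valid" is taken to mean only "contains no unpaired surrogates". The second helper relies on the encodeURI behaviour mentioned above.

```javascript
// Returns true when every lead surrogate is immediately followed by a trail
// surrogate and no trail surrogate appears on its own.
function hasOnlyPairedSurrogates(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return false;
      i++; // skip the trail surrogate that was just checked
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false; // trail surrogate without a preceding lead
    }
  }
  return true;
}

// The same question answered indirectly through encodeURI's URIError.
function encodeURIAccepts(str) {
  try {
    encodeURI(str);
    return true;
  } catch (e) {
    if (e instanceof URIError) return false;
    throw e;
  }
}

console.log(hasOnlyPairedSurrogates("a\uD834\uDF06b"), encodeURIAccepts("a\uD834\uDF06b"));
console.log(hasOnlyPairedSurrogates("\uD834"), encodeURIAccepts("\uD834"));
```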

# David Herman (13 years ago)

On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:

Concerning UTF-16 surrogate pairs, how about a function like: String.isValid( str ) to discover whether surrogates are used correctly in 'str'?

Something like Array.isArray().

No need for it to be a class method, since it only operates on strings. We could simply have String.prototype.isValid(). Note that it would work for primitive strings as well, thanks to JS's automatic promotion semantics.

# David Herman (13 years ago)

On Mar 23, 2012, at 6:30 AM, Steven Levithan wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

+all my internet points

Now you're talking!!

  1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert.

  2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u.

  3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

This is really exciting.

As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

I haven't completely understood this part of the discussion. Looking at /u as a "little red switch" (LRS), i.e., an opportunity to make judicious breaks with compatibility, could we not allow character classes with unescaped non-BMP code points, e.g.:

js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
["𝌆𝌇𝌈𝌉𝌊"]

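I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?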

# David Herman (13 years ago)

Presumably the JS source, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs.

Clarification: the JS source of the regexp literal.

# Wes Garland (13 years ago)

On 24 March 2012 15:25, David Herman <dherman at mozilla.com> wrote:

Presumably the JS source, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs.

Clarification: the JS source of the regexp literal.

We certainly can, although this means that certain Unicode Strings cannot be matched by a regexp with this flag. These strings would be the ones containing reserved code points.

That said, why is the JS source suddenly a sequence of UTF-16 code units? I believe JS source code should be a sequence of Unicode code points (and I think ES5 says something to this effect).

The underlying transport format should not be a concern for the JS lexer. The lexer should receive a series of code points from the network transport, allowing web sites to transmit JS in whatever encoding they see fit, provided the browser and server can both agree on it. I think UTF-8 would make a fine transport format for JS source code. IMHO the transport format between the browser and the JS lexer [i.e. the input program encoding] should be allowed to be implementation-defined and not specified by TC-39.

# David Herman (13 years ago)

On Mar 24, 2012, at 1:11 PM, Wes Garland wrote:

On 24 March 2012 15:25, David Herman <dherman at mozilla.com> wrote:

Presumably the JS source, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs.

Clarification: the JS source of the regexp literal.

We certainly can, although this means that certain Unicode Strings cannot be matched by a regexp with this flag. These strings would be the ones containing reserved code points.

I didn't mean to imply only allowing non-BMP ranges by their unescaped representation, just that if it's possible that would often be nice and readable. I would certainly expect that we should also allow [\u{xxxxx}-\u{yyyyy}].

That said, why is the JS source suddenly a sequence of UTF-16 code units? I believe JS source code should be a sequence of Unicode code points (and I think ES5 says something to this effect).

I'm not 100% clear on this point yet, but e.g. the SourceCharacter production in Annex A.1 is described as "any Unicode code unit."

The underlying transport format should not be a concern for the JS lexer.

eval

# Wes Garland (13 years ago)

On 24 March 2012 17:22, David Herman <dherman at mozilla.com> wrote:

I'm not 100% clear on this point yet, but e.g. the SourceCharacter production in Annex A.1 is described as "any Unicode code unit."

Ugh, IMHO, that's wrong, and should be "any Unicode code point". (let the flames begin?)

The underlying transport format should not be a concern for the JS lexer.

eval

Eval is a red herring: its input is defined as the contents of the given String. So, we come full-circle back to "what's in a String?". I'm still partial to Brendan's BRS idea, because at least it fixes everything all at once.

# Norbert Lindenberg (13 years ago)

On Mar 23, 2012, at 6:30 , Steven Levithan wrote:

Norbert Lindenberg wrote:

I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. norbertlindenberg.com/2012/03/ecmascript-supplementary-characters

Cool.

From the proposal's Updates section:

Indicated that "u" may not be the actual character for the flag for code point mode in regular expressions, as a "u" flag has already been proposed for Unicode-aware digit and word character matching.

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

  1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert.

  2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u.

One concern: I think code point based matching should be the default for regex literals within modules (where we know the code is written for Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full Unicode sets for such literals?

In the other direction it's clear that using /u for \d\D\w\W\b\B has to imply code point mode.

  3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

We probably should review the complete Unicode Technical Standard #18, Unicode Regular Expressions, and see how we can upgrade RegExp for better Unicode support. Maybe on a separate thread...

Item number 3 is inspired by but different than Java's lowercase u flag for Unicode casefolding. In Java, flag u itself enables Unicode casefolding and does not need to be paired with flag i (which is equivalent to ES's /i).

As an aside, merging these three things would likely lead to /u seeing widespread use when dealing with anything more than ASCII, at least in environments where you don't have to worry about backcompat. This would help developers avoid stumbling on code unit issues in the small minority of cases where non-BMP characters are used or encountered. If /u's only purpose was to switch to code point mode, most likely it would be used far less often, and more developers would continue to get bitten by code-unit-based processing.

Good thinking :-)

# Norbert Lindenberg (13 years ago)

On Mar 23, 2012, at 7:12 , Lasse Reichstein wrote:

On Fri, Mar 23, 2012 at 2:30 PM, Steven Levithan <steves_list at hotmail.com> wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

...

  3. [New proposal] Makes /i use Unicode casefolding rules.

Yey, I'm for it :) Especially if it means dropping the rather naïve canonicalize function that can't canonicalize an ASCII character with a non-ASCII character.

/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

I think a compliant implementation should (read: ought to) already get that example, since "στιγμας".toUpperCase() == "ΣΤΙΓΜΑΣ".toUpperCase() in the browsers I have checked, and the ignore-case canonicalization is based on toUpperCase. Alas, most of the implementations miss it anyway.

According to the ES5 spec, /ΣΤΙΓΜΑΣ/i.test("στιγμας") must be true indeed. Chrome and Node (i.e., V8) and IE get this right; Safari, Firefox, and Opera don't.

Note that toUpperCase allows mappings from 1 to multiple code units, while RegExp canonicalization in ES5 doesn't, so /SS/i.test("ß") === false even though "SS".toUpperCase() === "ß".toUpperCase().
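The canonicalization points in this exchange can be checked with a few one-character cases. This is a sketch: the first three groups describe ES5-style /i matching as specified, which (as noted above) not every implementation of the time got right. The last line assumes a `u` flag with Unicode case folding, as proposed in item 3. The variable names are illustrative.

```javascript
const sharpS = "\u00DF";     // LATIN SMALL LETTER SHARP S
const finalSigma = "\u03C2"; // GREEK SMALL LETTER FINAL SIGMA
const longS = "\u017F";      // LATIN SMALL LETTER LONG S

// toUpperCase may map one code unit to several; RegExp canonicalization may not.
console.log("SS".toUpperCase() === sharpS.toUpperCase()); // true
console.log(/SS/i.test(sharpS));                          // false

// One-to-one mappings between non-ASCII characters do work under /i.
console.log(/\u03A3/i.test(finalSigma));                  // true: sigma variants share an uppercase form

// The "non-ASCII must not canonicalize to ASCII" rule blocks this one.
console.log(longS.toUpperCase() === "S");                 // true
console.log(/s/i.test(longS));                            // false

// With Unicode case folding (assumed /iu behaviour) the same match succeeds.
console.log(/s/iu.test(longS));                           // true
```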

Norbert

# Norbert Lindenberg (13 years ago)

Thanks for the detailed comments! Replies below.

Norbert

On Mar 23, 2012, at 9:46 , Phillips, Addison wrote:

Comments follow.

  1. Definition of string. You say:

-- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so it may be ill-formed when interpreted as a UTF-16 code unit sequence.

I know what you mean, but others might not. Perhaps:

-- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so the sequence of code units may contain code units that are not valid in Unicode or sequences that do not represent Unicode code points (such as unpaired surrogates).

I can add a note that ill-formed here means containing unpaired surrogates. If I read chapter 3 of the Unicode Standard correctly, there's no other way for UTF-16 to be ill-formed. UTF-16 code units by themselves cannot be invalid - any 16-bit value can occur in a well-formed UTF-16 string.

  2. In this section, I would define string after code unit and code point. I would also include a definition of surrogates/surrogate pairs.

Makes sense.

  3. Under "text interpretation" you say:

-- For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which can never represent characters).

This would (see above) benefit from having a definition in place. As noted, this is slightly incomplete, since surrogate code units are used to form supplementary characters.

The text is about surrogate code points, not about surrogate code units.

  4. 0xFFFE and 0xFFFF are non-characters in Unicode. I do think you do the right thing here. It's just a nit that you never note this ;-).

  5. Editorial unnecessary ;-):

-- This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters.

  6. Under 'details' you suggest a number of renamings. Are these strictly necessary? The term 'character' could be taken to mean 'code point' instead, with an explanatory note.

Unfortunately, the term "character" is poisoned in ES5 by a redefinition as "code unit" (chapter 6). For ES6, I'd like the spec to be really clear where it means code units and where it means code points. Maybe we can then reintroduce "character" in ES7...

  7. Skipping down a lot, to "section 6 source text", you propose:

-- The text is expected to have been normalised to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15.

I think this should be removed or modified.

This sentence is essentially copied from ES5 (with corrected references), and as I copied it, I made a note to myself that we need to discuss normalization, just not as part of this proposal...

Automatic application of NFC is not always desirable, as it can affect presentation or processing. Perhaps:

-- Normalization of the text to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15, is recommended when transcoding from another character encoding.

  8. In "7.6 Identifier Names and Identifiers" you don't actually forbid unpaired surrogates or non-characters in the text (Identifier_Part:: does this by implication). Perhaps state it? Also, ZWJ and ZWNJ are permitted as the last character in an identifier.

I can add a note about surrogate code points and non-characters, but, as you say, they are already ruled out because they can't have the required Unicode properties ID_Start or ID_Continue.

The use of ZWJ and ZWNJ is unchanged from ES5. UAX 31 has much stricter rules on where they would be allowed, but I'm not sure we have a strong case for changing the rules in ECMAScript. www.unicode.org/reports/tr31/tr31-9.html#Layout_and_Format_Control_Characters

  9. "15.5.4.6": you say "(a nonnegative integer less than 0x10FFFF)", whereas it should say: "(a nonnegative integer less than or equal to 0x10FFFF)"

Will fix.

  10. In the section on "what about utf-32", you say: " and the code points start at positions 1, 2, 3.". Of course this should be "... and the code points start at positions 0, 1, 2."

Of course.

# Norbert Lindenberg (13 years ago)

It's easy to provide this function, but in which situations would it be useful? In most cases that I can think of you're interested in far more constrained definitions of validity:

  • what are valid ECMAScript identifiers?
  • what are valid BCP 47 language tags?
  • what are the characters allowed in a certain protocol?
  • what are the characters that my browser can render?

Thanks, Norbert

# Norbert Lindenberg (13 years ago)

On Mar 24, 2012, at 12:21 , David Herman wrote:

[snip]

As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

I haven't completely understood this part of the discussion. Looking at /u as a "little red switch" (LRS), i.e., an opportunity to make judicious breaks with compatibility, could we not allow character classes with unescaped non-BMP code points, e.g.:

js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)

["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source of the regexp literal, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?

With /u, that's exactly what happens. My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.
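For readers following along today, the two modes can be compared directly, since a /u flag was later standardized in ECMAScript 2015. The following is a minimal sketch, assuming an ES2015-capable engine; the variable names are illustrative and not part of the proposal.

```javascript
const text = "𝌆𝌇𝌈𝌉𝌊"; // five supplementary characters, ten UTF-16 code units

// Code point mode: the class is the single range U+1D306..U+1D356.
const unicodeMatch = text.match(/[𝌆-𝍖]+/u);

// Code unit mode: the same source is read as [\uD834 \uDF06-\uD834 \uDF56].
// The range \uDF06-\uD834 is out of order, so the pattern does not compile.
// It is built at run time here so that the failure can be caught.
let legacyError = null;
try {
  new RegExp("[𝌆-𝍖]+");
} catch (e) {
  legacyError = e;
}

console.log(unicodeMatch[0] === text);           // true
console.log(legacyError instanceof SyntaxError); // true
```

This illustrates why some transformation (or a flag) is needed: without one, the literal does not even parse as intended in code unit mode.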

# David Herman (13 years ago)

On Mar 24, 2012, at 4:32 PM, Norbert Lindenberg wrote:

One concern: I think code point based matching should be the default for regex literals within modules (where we know the code is written for Harmony).

This idea makes me nervous. Partly because I think we should keep the set of semantic changes between non-module code and module code reasonably small, and partly because the idea of your proposal is to continue to treat strings as sequences of 16-bit code units, not Unicode code points -- which means that quietly switching regexps to be closer to operating at the level of code points seems like it creates a kind of impedance mismatch. It feels more appropriate to me to require programmers to declare explicitly that they're dealing with a string at the level of code points, using the (quite concise) /u flag. That way they're saying "yes, I know this string is just a sequence of 16-bit code units, but it may contain non-BMP data, and I would like to match its contents with a regexp that deals with code points."

(Again, I'm still new to the finer points of Unicode, so I'm prepared to be shown I'm thinking about it wrong.)

# David Herman (13 years ago)

On Mar 24, 2012, at 11:23 PM, Norbert Lindenberg wrote:

On Mar 24, 2012, at 12:21 , David Herman wrote:

I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source of the regexp literal, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?

With /u, that's exactly what happens. My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.

Excellent!

Thanks,

# David Herman (13 years ago)

On Mar 24, 2012, at 2:30 PM, Wes Garland wrote:

On 24 March 2012 17:22, David Herman <dherman at mozilla.com> wrote: I'm not 100% clear on this point yet, but e.g. the SourceCharacter production in Annex A.1 is described as "any Unicode code unit."

Ugh, IMHO, that's wrong, and should be "any Unicode code point". (let the flames begin?)

That sounds nice in theory, but we can't change the past. Even with the BRS, there would still be a compatibility mode where it's code units.

The underlying transport format should not be a concern for the JS lexer.

eval

Eval is a red herring: its input is defined as the contents of the given String. So, we come full-circle back to "what's in a String?". I'm still partial to Brendan's BRS idea, because at least it fixes everything all at once.

I share Erik and others' concerns about the BRS. Working at the heap level sounds brittle to me. It seems like a lot of spec and implementation complexity, and it doesn't really have a good story for integrating legacy code and future code. I think the direction that Norbert, Erik, and Steven have been going is very promising.

# Roger Andrews (13 years ago)

I use something like String.isValid functionality in a transcoder that converts Strings to/from UTF-8, HTML Formdata (MIME type application/x-www-form-urlencoded -- not the same as URI encoding!), and Base64.

Admittedly these currently use 'encodeURI' to do the work, or it just drops out naturally when considering UTF-8 sequences.

(I considered testing the regexp /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/ against the input string.)

Maybe the function is too obscure for general use, although its presence does flag up the surrogate-pair issue to developers.
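The regular expression mentioned above can be wrapped into a small well-formedness check. This is only a sketch: the name isWellFormedUTF16 is hypothetical and is not the String.isValid signature discussed in this thread.

```javascript
// Either a non-surrogate code unit, or a lead surrogate immediately
// followed by a trail surrogate, repeated over the whole string.
const WELL_FORMED_UTF16 =
  /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/;

// Hypothetical helper: true if str contains no unpaired surrogates.
function isWellFormedUTF16(str) {
  return WELL_FORMED_UTF16.test(str);
}

console.log(isWellFormedUTF16("abc\uD834\uDF06")); // true  (proper pair)
console.log(isWellFormedUTF16("abc\uD834"));       // false (lone lead surrogate)
console.log(isWellFormedUTF16("\uDF06\uD834"));    // false (pair in the wrong order)
```

The pattern works in plain code unit mode, so it needs none of the proposed regexp changes.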



# Roger Andrews (13 years ago)

Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character literals? The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes.

Then the programmer can describe exactly the required characters without caring about their coding in UTF-16 or whatever.

Could you use this to avoid complicated things in RegExps like [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}], instead have things like [\U0001xxxx-\U0003yyyy] -- naturally expressing the characters of interest?

The same goes for String literals, where the programmer does not really care about the encoding, just specifying the character.

(Sorry if I've missed something in the prior discussion.)



# Roger Andrews (13 years ago)

Just confirmed C/C++ do allow \Uxxxxxxxx escaped characters for non-BMP code points in string literals.

Interesting page at: publib.boulder.ibm.com/infocenter/comphelp/v7v91/topic/com.ibm.vacpp7a.doc/language/ref/clrc02unicode_standard.htm

So C/C++ has:

  • \xNN 8-bit character (U+0000 - U+00FF)
  • \uNNNN 16-bit character
  • \UNNNNNNNN 32-bit character

This naturally expresses any character, without worrying about the UTF-16 or whatever encoding.



# Norbert Lindenberg (13 years ago)

JavaScript source today is a sequence of UTF-16 code units because that's what clause 6 of ES5 says and what most implementations do (V8/Node currently limits to UCS-2, but a fix for that is on the way): "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16."

Actual source code is normally encoded in UTF-8 or some legacy encoding, so it must be converted to UTF-16. The rest of the ES5 spec deals with source text in terms of code units, not in terms of code points.
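The code unit view described here is easy to observe with ES5-level operations alone. A small sketch follows; the helper codePointFromPair is hypothetical and simply applies the standard UTF-16 surrogate arithmetic.

```javascript
const s = "𝌆"; // U+1D306, outside the BMP

// Hypothetical helper: combine a lead and a trail surrogate into a code point.
function codePointFromPair(lead, trail) {
  return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
}

console.log(s.length);                     // 2
console.log(s.charCodeAt(0).toString(16)); // "d834"
console.log(s.charCodeAt(1).toString(16)); // "df06"
console.log(s === "\uD834\uDF06");         // true
console.log(codePointFromPair(s.charCodeAt(0), s.charCodeAt(1)).toString(16)); // "1d306"
```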

The term "code point" is defined in clause 6 of ES5 (in a way that's slightly incompatible with the Unicode definition), but the only normative use is in relation to URI mappings in subclause 15.1.3, never in relation to source code.

Allen, Brendan, and I have proposed several ways to move to code point semantics in ES6, with each proposal representing a different trade-off between compatibility with existing code and ease of future development.

Norbert

# Norbert Lindenberg (13 years ago)

Perfectly valid concerns.

My thinking here is that normally applications want to deal with code points, but we force them to deal with UTF-16 and additional flags because we need them for compatibility. Within modules, where we know that compatibility is not an issue, I'd rather give applications by default what they need.

Looking back at Java, supporting supplementary characters was fairly painless for many applications despite UTF-16 because Java already had a rich API performing all kinds of operations on strings, so many applications had little need to look at individual characters in the first place. We went through the entire Java SE API and fixed all those operations to use code point semantics (look for "under the hood" at [1] for details). We were also able to switch regular expressions to code point semantics without any flags because regular expressions never worked on binary data and developers hadn't created funky workarounds to support supplementary characters yet. JavaScript today has more constraints, but for new development it would still be good to get as close as possible to that experience.

Norbert

[1] java.sun.com/developer/technicalArticles/Intl/Supplementary

# Norbert Lindenberg (13 years ago)

Let's see:

  • Conversion to UTF-8: If the string isn't well-formed, you wouldn't refuse to convert it, so isValid doesn't really help. You still have to look at all code units, and convert unpaired surrogates to the UTF-8 sequence for U+FFFD.

  • Conversion from UTF-8: For security reasons, you have to check for well-formedness before conversion, in particular to catch non-shortest forms [1].

  • HTML form data: Same situation as conversion to UTF-8.

  • Base64 encodes binary data, so UTF-16 well-formedness rules don't apply.

I don't think we'd add API just to flag an issue - that's what documentation is for.
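To make the first bullet concrete, here is a rough sketch of a UTF-16 to UTF-8 converter that looks at every code unit and substitutes U+FFFD for unpaired surrogates. The function name toUTF8Bytes is hypothetical, and the replacement policy is hard-coded purely for illustration.

```javascript
// Hypothetical converter: returns an array of UTF-8 byte values.
function toUTF8Bytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    // Combine a lead surrogate with a following trail surrogate.
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
        i++;
      }
    }
    // Anything still in the surrogate range is unpaired: use U+FFFD.
    if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else {
      bytes.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}

console.log(toUTF8Bytes("\uD834\uDF06")); // [ 240, 157, 140, 134 ]  (F0 9D 8C 86)
console.log(toUTF8Bytes("\uD834"));       // [ 239, 191, 189 ]       (EF BF BD, i.e. U+FFFD)
```

Note that a validity check up front would not save any work here: the loop has to visit every code unit anyway.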

Norbert

[1] www.unicode.org/reports/tr36/#UTF

# Norbert Lindenberg (13 years ago)

There is a strawman for code point escapes: strawman:full_unicode_source_code#unicode_escape_sequences

Note that for references to specific characters it's usually best to just use the characters directly, as Dave did in "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can be useful in cases such as regular expressions where you might have to refer to range limits that aren't actually assigned characters, or in test cases where you might use characters for which your OS doesn't have glyphs yet.

Norbert

# Roger Andrews (13 years ago)

The strawman is for source code characters, and says it has "no implications for string value encodings" (or RegExps). String & regexp literal escape sequences are explicitly defined in ES5 sections 7.8.4 & 7.8.5. Will Strawman style also work in ES6 string & regexp literals? That would make regexp ranges much nicer (see final example below).

As well as describing code points that have not yet been defined as characters, character escapes in string literals and regexps are useful because:

  1. control characters don't have glyphs at all,
  2. the various space glyphs are not readily distinguishable (same for some dash/minus/line glyphs),
  3. breaking/non-breaking versions of characters are not distinguishable,
  4. many other glyphs are hard to distinguish (being tiny adjustments in positioning or form detail),
  5. some characters are "combining" -- which makes for a messy and confusing program if you use them raw.

If you use the raw non-ASCII characters in a program then you need some means of creating them, preferably via a normal keyboard and in your favourite text editor. All program readers need appropriate fonts installed to fully understand the program, and program maintainers also need a Unicode-capable text editor (potentially including non-BMP support). All links/stores that the program travels over or rests in must be Unicode-capable. Whereas using only ASCII chars to write a program is easy to do and always works no matter how basic your computing/transmission infrastructure. (ASCII chars never get silently mangled in transmission or text editors.)

How to represent character escapes in a language.

C/C++ has:

  • \xNN 8-bit char (U+0000 - U+00FF)
  • \uNNNN 16-bit char (U+0000 - U+FFFF)
  • \UNNNNNNNN 32-bit char (i.e. any 21-bit Unicode char)

Strawman for source chars has:

  • \u{N...} 8 to 24-bit char (i.e. any 21-bit Unicode char)

I'm struggling with how non-BMP escapes would be used in practice in strings & regexps -- especially regexp ranges. Will Strawman style be used in string & regexp literals?

Considering U+1D307 (𝌇) as an example (where "𝌇" == "\uD834\uDF07").

To create the string "I like 𝌇" using escapes:

  • in C/C++ you can create a string: "I like \U0001D307"
  • if the Strawman style works in strings, in ES6 presumably you say: "I like \u{1D307}"
  • or do you have to know UTF-16 encoding rules and say: "I like \uD834\uDF07"

To use U+1D307 (𝌇) and U+1D356 (𝍖) as a range in a regexp, i.e. /[𝌇-𝍖]/, should the programmer write:

  • C/C++ style: /[\U0001D307-\U0001D356]/
  • or will Strawman style work in regexps: /[\u{1D307}-\u{1D356}]/
  • or in UTF-16 with {} grouping: /[{\uD834\uDF07}-{\uD834\uDF56}]/

Either C/C++ style or Strawman style escape is readable, natural, doesn't require knowledge of UTF-16 encoding rules, can be created easily with any old keyboard, and won't upset text editors.

It's a bit unfriendly to require programmers to know UTF-16 rules just to put a non-BMP character in a string or regexp using an escape. And in a regexp range it looks ugly and confusing.
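For the string half of the question, the \u{...} form was eventually adopted for string literals in ECMAScript 2015. The sketch below assumes an ES2015-capable engine and shows that the brace escape and the hand-computed surrogate pair produce the same string value.

```javascript
const viaCodePointEscape = "I like \u{1D307}";     // no UTF-16 knowledge needed
const viaSurrogateEscapes = "I like \uD834\uDF07"; // requires computing the pair

console.log(viaCodePointEscape === viaSurrogateEscapes);           // true
console.log(viaCodePointEscape.length);                            // 9 (7 + 2 code units)
console.log(viaCodePointEscape.codePointAt(7).toString(16));       // "1d307"
console.log(String.fromCodePoint(0x1D307) === "\uD834\uDF07");     // true
```

Either spelling yields the same sequence of UTF-16 code units; the escape style only affects how the source is written.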



# Roger Andrews (13 years ago)

Maybe String.isValid is just not generally useful enough. I accept the point that you don't add APIs simply to flag an issue (there has to be weightier justification to carry the trifle).

PS: As for UTF-16 -> UTF-8 or HTML-Formdata, I decided to follow encodeURI / encodeURIComponent's lead and throw an exception. Maybe that's the wrong thing to do?

My UTF-8 -> UTF-16 does check for well-formed UTF-8 because it seemed the right thing to do. Thanks for the link which explains why.

Base64 encodes 8-bit octets, so UTF-16 first gets converted to UTF-8, same issues as above really.



# Steven Levithan (13 years ago)

Sorry for jumping between messages...

Roger Andrews wrote:

Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character literals. The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes.

Python uses the same syntax in regular expressions. But, as Norbert noted, there is already a strawman for \u{X..}. If it were adopted, I think it is clear that it should also be extended to RegExp literals. Of course, this adds some complication when referencing numbers above FFFF unless /u is made the default everywhere, since it implies code-point-based matching. E.g., what does /[^\0-\uFFFF\u{10000}]/ without /u match?

The example above also hints at additional potentially breaking changes for code point matching by default that haven't yet been discussed in this thread: that the meaning of negated character classes and shorthands would change, and that their match length may be 2 (like the dot).

Roger Andrews wrote:

I'm struggling with how non-BMP escapes would be used in practice in strings & regexps -- especially regexp ranges. Will Strawman style be used in string & regexp literals? [...examples snipped]

I'm not sure whether this was already clear, but the curly braces I included in my paraphrasing of Norbert's proposed transformations were not meant to be included literally. I was trying to describe ranges between arbitrary code points, represented by pairs of high and low surrogates. As far as I understand, no existing proposal would allow a character class range written as /[{\uD834\uDF07}-{\uD834\uDF56}]/ to work correctly, with or without the curly braces. To match a range outside the BMP in a literal RegExp, you would have to use [<char>-<char>] (where <char> represents a literal character, and this may require flag /u to work) or [\u{X..}-\u{X..}] (assuming support for this syntax is added, and where X.. represents a hex number between 0 and at least 10FFFF).

Norbert Lindenberg wrote:

[...snip] My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.

Although I've argued the compatibility risk angle, on that point I should defer to implementers and others who might have a better sense of the scope of risk/damage to existing programs. More personally affecting, though, is the negative gut reaction I have to the well-thought-out but ugly and complicated (not so much in implementation, but for devs who have to learn about it) transformations that would otherwise be necessary to avoid breaking current regexes. And like David, I think just requiring /u is not so bad, especially since I'd want to use it for its other meanings anyway.

I'm also nervous about using different default semantics inside and out of ES6 modules, but David Herman has already well articulated my concerns and you've already responded, so I'll leave that discussion to you two except to say that if /u is added as a general "fix Unicode" flag, IMO it's reasonable to automatically turn it on in ES6 modules. That's because I think applying only code unit to code point mode switching in modules by default is too magical and confusing, but if it were described as turning on /u by default, that's easy to understand and explain.

Lasse Reichstein wrote:

Steven Levithan wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag: [...] 3. [New proposal] Makes /i use Unicode casefolding rules.

Yey, I'm for it :) Especially if it means dropping the rather naïve canonicalize function that can't canonicalize an ASCII character with a non-ASCII character.

That would be my hope as well.
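The effect of the canonicalization rule being criticized here can be seen with two non-ASCII characters whose case mappings land in ASCII. The sketch below assumes an engine implementing /iu as it was later specified in ES2015 (simple Unicode case folding); it illustrates that behaviour and is not part of the proposal text.

```javascript
const kelvin = "\u212A"; // KELVIN SIGN, case-folds to "k"
const longS = "\u017F";  // LATIN SMALL LETTER LONG S, case-folds to "s"

// Without /u, a non-ASCII character is never canonicalized to an ASCII one.
console.log(/k/i.test(kelvin));  // false
console.log(/s/i.test(longS));   // false

// With /u, case-insensitive matching uses Unicode case folding.
console.log(/k/iu.test(kelvin)); // true
console.log(/s/iu.test(longS));  // true
```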

Norbert Lindenberg wrote:

One concern: I think code point based matching should be the default for regex literals within modules (where we know the code is written for Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full Unicode sets for such literals?

As I said above, yes, I think it makes sense to apply all semantics of /u by default within modules. Previously in this thread, I detailed what \d\w\b\s mean in various regex flavors. The ones that give Unicode meanings by default are .NET and Perl, so ES would be in excellent regex company. Additionally, Java's \b (only) supports Unicode by default, as does ES's \s.

Norbert Lindenberg wrote:

In the other direction it's clear that using /u for \d\D\w\W\b\B has to imply code point mode.

Not if their meaning was limited to the BMP, which is already true for \D\W\B\S. /\D\D/.test(singleAstralNondigit) == true. With code point matching it would be false. Yet another reason to tie the multiple proposed meanings of /u together.
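The difference described here can be checked in an engine that supports a /u flag as later standardized in ES2015. A brief sketch, with an arbitrary supplementary character standing in for singleAstralNondigit:

```javascript
const singleAstralNondigit = "𝌆"; // one code point, two code units

// Code unit mode: each surrogate is a separate non-digit.
console.log(/\D\D/.test(singleAstralNondigit));    // true
console.log(/^.$/.test(singleAstralNondigit));     // false

// Code point mode: the pair is a single non-digit, so the match length is 2.
console.log(/\D\D/u.test(singleAstralNondigit));   // false
console.log(/^.$/u.test(singleAstralNondigit));    // true
console.log(/^[^a]$/u.test(singleAstralNondigit)); // true
```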

Norbert Lindenberg wrote:

Steven Levithan wrote:

  1. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

We probably should review the complete Unicode Technical Standard #18, Unicode Regular Expressions, and see how we can upgrade RegExp for better Unicode support. Maybe on a separate thread...

Agreed. You may already be thinking this, but IMO if we're going to add /u as a Little Red Switch (as David called it), the priority should be on making sure that /u gets all aspects of Unicode-aware regular expression semantics done right, before looking at new features from UTS#18 like Unicode property matching.

-- Steven Levithan

# Erik Corry (13 years ago)

2012/3/26 Steven Levithan <steves_list at hotmail.com>:

Python uses the same syntax in regular expressions. But, as Norbert noted, there is already a strawman for \u{X..}. If it were adopted, I think it is clear that it should also be extended to RegExp literals. Of course, this adds some complication when referencing numbers above FFFF unless /u is made the default everywhere, since it implies code-point-based matching. E.g., what does /[^\0-\uFFFF\u{10000}]/ without /u match?

Without the /u flag it should behave exactly as it has done until now, for reasons of backwards compatibility. On V8 that means that

/[\u{10000}]/

is the same as

/[u01{}]/
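A quick way to confirm this reading, assuming an engine that keeps the web-compatible treatment of \u without four hex digits (as V8 does) and also supports /u as later standardized in ES2015:

```javascript
const legacy = /[\u{10000}]/;   // read as the class [u{10000}], i.e. [u01{}]
const unicode = /[\u{10000}]/u; // read as the single code point U+10000

console.log(["u", "0", "1", "{", "}"].every(ch => legacy.test(ch))); // true
console.log(legacy.test("\uD800\uDC00"));                            // false
console.log(unicode.test("\uD800\uDC00"));                           // true
console.log(unicode.test("u"));                                      // false
```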

# Gavin Barraclough (13 years ago)

I really like the direction you're going in, but have one minor concern relating to regular expressions.

In your proposal, you currently state: "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

I think this makes sense in the context of your original proposal, which seeks to be backwards compatible with existing regular expressions through the range transformations. But I'm concerned that this might prove problematic, and would suggest that if we're going to make unicode regexp match opt-in through a /u flag then instead it may be better to make unpaired surrogates in unicode expressions a syntax error.

My concern would be expressions such as: /[\uD800\uDC00\uDC00\uD800]/u Under my reading of the current proposal, this could match any of "\uD800\uDC00", "\uD800", or "\uDC00". Allowing this seems to introduce the concept of precedence to character classes (given an input "\uD800\uDC00", should I choose to match "\uD800\uDC00" or "\uD800"?). It may also significantly complicate the implementation of backtracking if we were to allow this (if I have matched "\uD800\uDC00", should I step back by one code unit or two?).
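For comparison, this is how the example class behaves in engines implementing the /u flag as it was eventually specified in ES2015, where an adjacent lead and trail escape are read as one code point. It is a sketch of that later behaviour, not of the proposal text under discussion.

```javascript
// Read as the set { U+10000, U+DC00, U+D800 }.
const re = /^[\uD800\uDC00\uDC00\uD800]$/u;

console.log(re.test("\uD800\uDC00")); // true  (the whole pair, as U+10000)
console.log(re.test("\uD800"));       // true  (lone lead surrogate)
console.log(re.test("\uDC00"));       // true  (lone trail surrogate)

// On the input "\uD800\uDC00" the pair is consumed as one unit of length 2.
const m = "\uD800\uDC00".match(/[\uD800\uDC00\uDC00\uD800]/u);
console.log(m[0].length);             // 2
```

In that design the precedence question is settled at the input side: a well-formed pair in the input is always one code point, so there is nothing to step back into.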

It also just seems much clearer from a user perspective to say that non-unicode regular expressions match code units, unicode regular expressions match code points - mixing the two seems unhelpful.

If opt-in is automatic in modules, programmers will likely want an escape to be able to write non-unicode regular expressions, but I don't think this should warrant an extra flag. I don't think we can automatically change the behaviour of the RegExp constructor (without a "u" flag being passed), so RegExp("\uD800") should still be available to support non-unicode matching within modules.

# Erik Corry (13 years ago)

2012/3/26 Gavin Barraclough <barraclough at apple.com>:

Hi Norbert,

I really like the direction you're going in, but have one minor concern relating to regular expressions.

In your proposal, you currently state:        "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

I think this makes sense in the context of your original proposal, which seeks to be backwards compatible with existing regular expressions through the range transformations.  But I'm concerned that this might prove problematic, and would suggest that if we're going to make unicode regexp match opt-in through a /u flag then instead it may be better to make unpaired surrogates in unicode expressions a syntax error.

My concern would be expressions such as:        /[\uD800\uDC00\uDC00\uD800]/u Under my reading of the current proposal, this could match any of "\uD800\uDC00", "\uD800", or "\uDC00".  Allowing this seems to introduce the concept of precedence to character classes (given an input "\uD800\uDC00", should I choose to match "\uD800\uDC00" or "\uD800"?).  It may also significantly complicate the implementation of backtracking if we were to allow this (if I have matched "\uD800\uDC00", should I step back by one code unit or two?).

It also just seems much clearer from a user perspective to say that non-unicode regular expressions match code units, unicode regular expressions match code points - mixing the two seems unhelpful.

If opt-in is automatic in modules, programmers will likely want an escape to be able to write non-unicode regular expressions, but I don't think this should warrant an extra flag, I don't think we can automatically change the behaviour of the RegExp constructor (without a "u" flag being passed), so RegExp("\uD800") should still be available to support non-unicode matching within modules.

This is too nasty. The regexp constructor should not have to look up the stack to see what behaviour is expected of it.

# Gavin Barraclough (13 years ago)

On Mar 26, 2012, at 2:13 PM, Erik Corry wrote:

This is too nasty. The regexp constructor should not have to look up the stack to see what behaviour is expected of it.

I think you misunderstood me - I think we're saying the same thing. :-)

If we do imply the u flag for regexp literals in modules, and if we do make these regexps unable to match unpaired surrogates, then we may need to provide a method for programmers to create non-unicode aware regexps from within modules.

I was simply stating that since the regexp constructor isn't going to look up the stack to determine where it is being called from (we agree here), then a call to RegExp("\uD800") will create a non-unicode matching regexp, and as such a mechanism to create non-unicode regular expressions from within modules already exists. (If this weren't available we might have wanted to provide a symmetric flag to /u for regexp literals in modules to opt-out of unicode matching, but given that calling the RegExp constructor is a convenient alternative I don't think this is necessary or desirable).
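The constructor-based escape hatch can be sketched as follows, assuming an engine with the "u" flag and the unicode property as later standardized in ES2015. The mode depends only on the flags argument, never on the calling context.

```javascript
const codeUnitRe = new RegExp("\uD800");       // no flag: code unit matching
const codePointRe = new RegExp("\uD800", "u"); // explicit opt-in

console.log(codeUnitRe.unicode);  // false
console.log(codePointRe.unicode); // true

// Only the code unit regexp can match half of a surrogate pair.
console.log(codeUnitRe.test("\uD800\uDC00"));  // true
console.log(codePointRe.test("\uD800\uDC00")); // false
console.log(codePointRe.test("\uD800"));       // true
```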

# Glenn Adams (13 years ago)

On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough <barraclough at apple.com> wrote:

I really like the direction you're going in, but have one minor concern relating to regular expressions.

In your proposal, you currently state: "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

Just as a reminder, this would be in explicit violation of the Unicode conformance clause C1 [1] unless it can be guaranteed that such a code point will not be interpreted as an abstract character:

C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.

[1] www.unicode.org/versions/Unicode6.1.0/ch03.pdf

Given that such a guarantee is likely impractical, this presents a problem for the above proposed language.

# Steven Levithan (13 years ago)

Erik Corry wrote:

[...snip] what does /[^\0-\uFFFF\u{10000}]/ without /u match? Without the /u flag it should behave exactly as it has done until now, for reasons of backwards compatibility. On V8 that means that /[\u{10000}]/ is the same as /[u01{}]/

That sounds good. Not only does it avoid breaking from web-reality, it also means that regexes without /u don't need to use a weird mix of code unit and code point matching semantics, ever. To extend the backward compatible approach you prescribe here, the following should all be true when /u is not used:

  • /\u{10}/ eq /u{10}/ (literal u repeated 10 times).
  • Shorthand classes like \D, \S, and the dot match BMP code units only.
  • [^\0-\uFFFF] eq [] eq (?!) eq \b\B. (All of these are used in real-world regexes.)
  • If ES6 or later adds \p{..} for Unicode property matching, it's limited to matching BMP code units.

In other words, without /u, all matching is restricted to BMP code units. With /u, all matching is code point based and works with full 21-bit Unicode.
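Two of the bullets above can be checked directly, assuming an engine with the web-compatible legacy behaviour and a /u flag as later standardized in ES2015. This is a sketch of those two cases only; \p{..} is left out.

```javascript
// Without /u, \u{10} is a literal "u" with the quantifier {10}.
console.log(/^\u{10}$/.test("u".repeat(10))); // true
console.log(/^\u{10}$/.test("\x10"));         // false

// With /u, \u{10} is the code point U+0010.
console.log(/^\u{10}$/u.test("\x10"));        // true

// Without /u, [^\0-\uFFFF] excludes every code unit and can never match.
console.log(/[^\0-\uFFFF]/.test("𝌆"));        // false

// With /u, it matches exactly the supplementary code points.
console.log(/[^\0-\uFFFF]/u.test("𝌆"));       // true
console.log(/[^\0-\uFFFF]/u.test("abc"));     // false
```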

This also provides another argument in favor of automatically implying /u in ES modules. It would be somewhat obnoxious to not let \u{..} work by default in modules.

-- Steven Levithan

# Roger Andrews (13 years ago)

Steven Levithan wrote:

[snip]

  • /\u{10}/ eq /u{10}/ (literal u repeated 10 times).

A point in favour of \Uxxxxxxxx over \u{x...} as a representation of character escapes? -- to avoid ambiguity in regexps.

# Steven Levithan (13 years ago)

Roger Andrews wrote:

Steven Levithan wrote:

[snip]

  • /\u{10}/ eq /u{10}/ (literal u repeated 10 times).

A point in favour of \Uxxxxxxxx over \u{x...} as a representation of character escapes? -- to avoid ambiguity in regexps.

No. For backcompat, /\Uxxxxxxxx/ must eq /Uxxxxxxxx/, without some kind of mode-based switching.

-- Steven Levithan

# Norbert Lindenberg (13 years ago)

OK, I guess we have to have Unicode code point escapes :-)

I'd expect them to work in identifiers, string literals, and regular expressions (possibly with restrictions coming out of today's emails), but not in JSON source.

Norbert

# Norbert Lindenberg (13 years ago)

I should have said "use appropriate error handling" instead of "convert unpaired surrogates to the UTF-8 sequence for U+FFFD". While using the replacement character is a reasonable default behavior, it's best to let the caller control the behavior. I'd assume that most callers would want as much information to pass through even if there's some stray unpaired surrogate in a string. If your converters just throw exceptions, then many callers will have to go through input strings themselves and remove unpaired surrogates that might have crept in.
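One possible shape for caller-controlled error handling is a callback that decides what each unpaired surrogate becomes. The sketch below is only an illustration; the name scrubUnpairedSurrogates and its callback signature are hypothetical.

```javascript
// Hypothetical helper. onUnpaired(codeUnit, index) returns the replacement
// string for an unpaired surrogate; the default substitutes U+FFFD.
function scrubUnpairedSurrogates(str, onUnpaired = () => "\uFFFD") {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    const isTrail = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isLead && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += str[i] + str[i + 1]; // well-formed pair passes through
        i++;
        continue;
      }
    }
    out += (isLead || isTrail) ? onUnpaired(unit, i) : str[i];
  }
  return out;
}

console.log(scrubUnpairedSurrogates("a\uD834b") === "a\uFFFDb");     // true
console.log(scrubUnpairedSurrogates("a\uD834b", () => "") === "ab"); // true
```

A caller that prefers exceptions can pass a callback that throws, which keeps the strict behaviour available without forcing it on everyone.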

For Base64, you could encode UTF-16 directly; you just have to make sure that encoder and decoder agree on the byte order.

Norbert

# Norbert Lindenberg (13 years ago)

On Mar 26, 2012, at 9:45 , Steven Levithan wrote:

Sorry for jumping between messages...

Roger Andrews wrote:

Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character literals. The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes.

Python uses the same syntax in regular expressions. But, as Norbert noted, there is already a strawman for \u{X..}. If it were adopted, I think it is clear that it should also be extended to RegExp literals. Of course, this adds some complication when referencing numbers above FFFF unless /u is made the default everywhere, since it implies code-point-based matching. E.g., what does /[^\0-\uFFFF\u{10000}]/ without /u match?

As long as the underlying system is UTF-16 based, I'd think \u{10000} is simply a different notation for \uD800\uDC00. But with code unit based matching that will not result in the intended behavior.

The example above also hints at additional potentially breaking changes for code point matching by default that haven't yet been discussed in this thread: that the meaning of negated character classes and shorthands would change, and that their match length may be 2 (like the dot).

Yes.

Roger Andrews wrote:

I'm struggling with how non-BMP escapes would be used in practice in strings & regexps -- especially regexp ranges. Will Strawman style be used in string & regexp literals? [...examples snipped]

I'm not sure whether this was already clear, but the curly braces I included in my paraphrasing of Norbert's proposed transformations were not meant to be included literally. I was trying to describe ranges between arbitrary code points, represented by pairs of high and low surrogates. As far as I understand, no existing proposal would allow a character class range written as /[{\uD834\uDF07}-{\uD834\uDF56}]/ to work correctly, with or without the curly braces. To match a range outside the BMP in a literal RegExp, you would have to use [<char>-<char>] (where <char> represents a literal character, and this may require flag /u to work) or [\u{X..}-\u{X..}] (assuming support for this syntax is added, and where X.. represents a hex number between 0 and at least 10FFFF).

In my proposal the following two regular expressions are equivalent:

/[𝌆-𝍖]+/u
/[\uD834\uDF06-\uD834\uDF56]+/u

They are made equivalent by the first preprocessing step proposed for 15.10.4.1 and the subsequent interpretation of UTF-16 sequences as code points.

I think I'd process Unicode code point escapes by first converting them to equivalent code unit escapes and then following the same path. This would make

/[\u{1D306}-\u{1D356}]+/u

equivalent to the two above.
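As a sanity check, the equivalence can be exercised in an engine that supports the `u` flag and `\u{...}` escapes, as ECMAScript 2015 and later do. This sketch only observes the behaviour of such an engine; it says nothing about how the preprocessing step in the proposal would be specified. The `^` and `$` anchors are added here so that a partial match cannot pass.

```javascript
// Three spellings of the same code point range U+1D306..U+1D356.
const literalChars = /^[𝌆-𝍖]+$/u;
const codeUnitEscapes = /^[\uD834\uDF06-\uD834\uDF56]+$/u;
const codePointEscapes = /^[\u{1D306}-\u{1D356}]+$/u;

const input = "𝌆𝌇𝌈𝌉𝌊"; // five supplementary characters, ten code units

const results = [literalChars, codeUnitEscapes, codePointEscapes]
  .map((re) => re.test(input));
console.log(results); // all three entries are true
```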

Norbert Lindenberg wrote:

[...snip] My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.

Although I've argued the compatibility risk angle, on that point I should defer to implementers and others who might have a better sense of the scope of risk/damage to existing programs. More personally affecting, though, is the negative gut reaction I have to the well-thought-out but ugly and complicated (not so much in implementation, but for devs who have to learn about it) transformations that would otherwise be necessary to avoid breaking current regexes. And like David, I think just requiring /u is not so bad, especially since I'd want to use it for its other meanings anyway.

I'm also nervous about using different default semantics inside and out of ES6 modules, but David Herman has already well articulated my concerns and you've already responded, so I'll leave that discussion to you two except to say that if /u is added as a general "fix Unicode" flag, IMO it's reasonable to automatically turn it on in ES6 modules. That's because I think applying only code unit to code point mode switching in modules by default is too magical and confusing, but if it were described as turning on /u by default, that's easy to understand and explain.

Good input.

# Norbert Lindenberg (13 years ago)

The ugly world of web reality...

Actually, in V8, Firefox, Safari, and IE, /[\u{10000}]/ seems to be the same as /[\u01{}]/ - it matches "\u01{}u01". In Opera, it doesn't seem to match anything, but doesn't throw the specified SyntaxError either.

Do we know of any applications actually relying on these bugs, seeing that browsers don't agree on them?

For string literals, I see that most implementations correctly throw a SyntaxError when given "\u{10}". The exception here is V8.

Norbert

# Norbert Lindenberg (13 years ago)

On Mar 26, 2012, at 13:02 , Gavin Barraclough wrote:

Hi Norbert,

I really like the direction you're going in, but have one minor concern relating to regular expressions.

In your proposal, you currently state: "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

I think this makes sense in the context of your original proposal, which seeks to be backwards compatible with existing regular expressions through the range transformations. But I'm concerned that this might prove problematic, and would suggest that if we're going to make unicode regexp match opt-in through a /u flag then instead it may be better to make unpaired surrogates in unicode expressions a syntax error.

That's worth considering. It seems we're more and more moving towards two separate RegExp versions anyway - a legacy version based on code units and with all kinds of quirks, and an all-around-better version based on code points. It means, however, that you can't easily remove unpaired surrogates with str.replace(/[\u{D800}-\u{DFFF}]/ug, "\u{FFFD}").
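For reference, this is how that replacement behaves in an engine where `/u` patterns may contain surrogate code points rather than rejecting them, which is what ECMAScript 2015 engines do. A small sketch; the sample string is made up.

```javascript
// With the u flag the input is walked by code point, so a well-formed
// surrogate pair is one supplementary code point and falls outside
// the range U+D800..U+DFFF. Only unpaired surrogates are replaced.
const unpaired = /[\u{D800}-\u{DFFF}]/gu;

const sample = "a\uD800b\u{1D306}c\uDC00";
const cleaned = sample.replace(unpaired, "\u{FFFD}");

console.log(cleaned === "a\uFFFDb\u{1D306}c\uFFFD"); // true
```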

My concern would be expressions such as:

/[\uD800\uDC00\uDC00\uD800]/u

Under my reading of the current proposal, this could match any of "\uD800\uDC00", "\uD800", or "\uDC00". Allowing this seems to introduce the concept of precedence to character classes (given an input "\uD800\uDC00", should I choose to match "\uD800\uDC00" or "\uD800"?). It may also significantly complicate the implementation of backtracking if we were to allow this (if I have matched "\uD800\uDC00", should I step back by one code unit or two?).

I think/hope that my specification is clear: a surrogate pair is always treated as one entity, not as two pieces. If the input is "\uD800\uDC00", you match "\uD800\uDC00". If you have to backtrack over "\uD800\uDC00", you step back two code units.
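The rule that a surrogate pair is one entity can be observed in engines that implement code point matching under the `u` flag (ECMAScript 2015 and later). The following is a sketch of that observable behaviour with made-up inputs, not a restatement of the proposed spec text.

```javascript
// The class contains U+10000 (written as a pair), plus the unpaired
// surrogates U+DC00 and U+D800.
const cls = /[\uD800\uDC00\uDC00\uD800]/u;

// The pair in the input is consumed as one code point (two code units).
const pairMatch = "\uD800\uDC00".match(cls);
console.log(pairMatch[0].length); // 2

// An unpaired surrogate in the input is matched as its own code point.
const loneMatch = "x\uD800y".match(cls);
console.log(loneMatch.index, loneMatch[0].length); // 1 1

// The dot behaves the same way: one code point per match under /u.
console.log(/^.$/u.test("\uD800\uDC00")); // true
console.log(/^.$/.test("\uD800\uDC00"));  // false
```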

It also just seems much clearer from a user perspective to say that non-unicode regular expressions match code units, unicode regular expressions match code points - mixing the two seems unhelpful.

If opt-in is automatic in modules, programmers will likely want an escape to be able to write non-unicode regular expressions, but I don't think this should warrant an extra flag. I don't think we can automatically change the behaviour of the RegExp constructor (without a "u" flag being passed), so RegExp("\uD800") should still be available to support non-unicode matching within modules.

Agreed, especially after reading Erik's and your additional emails on this.

# Norbert Lindenberg (13 years ago)

The conformance clause doesn't say anything about the interpretation of (UTF-16) code units as code points. To check conformance with C1, you have to look at how the resulting code points are actually further interpreted.

My proposal interprets the resulting code points in the following ways:

  1. In regular expressions, they can be used in both patterns and input strings to be matched. They may be compared against other code points, or against character classes, some of which will hopefully soon be defined by Unicode properties. In the case of comparing against other code points, they can't match any code points assigned to abstract characters. In the case of Unicode properties, they'll typically fall into the large bucket of have-nots, along with other unassigned code points or, for example, U+FFFD, unless you ask for their general category.

  2. When parsing identifiers, they will not have the ID_Start or ID_Continue properties, so they'll be excluded, just like other unassigned code points or U+FFFD.

  3. In case conversion, they won't have upper case or lower case equivalents defined, and remain as is, as would happen for unassigned code points or U+FFFD.

I don't think either of these amount to interpretation as abstract characters. I mention U+FFFD because the alternative interpretation of unpaired surrogates would be to replace them with U+FFFD, but that doesn't seem to improve anything.
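Two of these interpretations can be observed in current engines, assuming `String.prototype.codePointAt` and the `u` flag are available (both arrived with ECMAScript 2015). This is only an illustration with made-up strings.

```javascript
const lone = "\uDC00"; // unpaired trail surrogate

// Read back as a code point with the same value as the code unit.
console.log(lone.codePointAt(0) === 0xDC00); // true

// Compared against other code points, it matches only itself.
console.log(/^\uDC00$/u.test(lone)); // true
console.log(/^[a-z]$/u.test(lone));  // false

// Case conversion leaves it as is, while its neighbours are converted.
const upper = ("a" + lone + "b").toUpperCase();
console.log(upper === "A" + lone + "B"); // true
```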

Norbert

# Steven Levithan (13 years ago)

Norbert Lindenberg wrote:

The ugly world of web reality...

Actually, in V8, Firefox, Safari, and IE, /[\u{10000}]/ seems to be the same as /[\u01{}]/ - it matches "\u01{}u01". In Opera, it doesn't seem to match anything, but doesn't throw the specified SyntaxError either.

How did you test this? I get consistent results that agree with Erik in IE 9, Firefox 11, Chrome 17, and Safari 5.1:

"\u01{}".match(/[\u{10000}]/g); // ['u','0','1','{','}']
/\u{2}/g.test("uu"); // true

Opera, as you said, returns null and false (tested v11.6 and v10.0).
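The same experiment can be repeated in a current engine, where the legacy behaviour is still what patterns without the `u` flag get, and where the `u` flag switches `\u{...}` to a code point escape. A sketch only: the input is written as "u01{}" because current engines reject the malformed escape in the string literal "\u01{}".

```javascript
// Without the u flag, "\u" not followed by four hex digits is just "u".
const legacyClass = /[\u{10000}]/g;       // same set as /[u{10}]/
console.log("u01{}".match(legacyClass));  // the five units u, 0, 1, {, }

const legacyQuantifier = /\u{2}/;         // "u" repeated exactly twice
console.log(legacyQuantifier.test("uu")); // true

// With the u flag the same source text denotes a code point escape.
console.log(/\u{2}/u.test("uu"));                 // false
console.log(/\u{2}/u.test("\u0002"));             // true
console.log(/^[\u{10000}]$/u.test("\u{10000}"));  // true
```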

Do we know of any applications actually relying on these bugs, seeing that browsers don't agree on them?

Minus Opera, browsers do agree on them. Admirably so. And they aren't bugs--they're intentional breaks from ES for backcompat with earlier implementations that were themselves designed for backcompat with older non-ES regex behavior. The RegExp Match Web Reality proposal at harmony:regexp_match_web_reality says to add them to the spec, and Allen has said the web reality proposal should be the top RegExp priority for ES6.

I'd easily believe it's safe enough to change /[\u{n..}]/ because of the four-part sequence involved in \u + { + n.. + } that is fairly unlikely to appear in that specific order in a character class. But I'd have a harder time believing /\u{n..}/ is safe to change. It would of course be great to have some real data on the risks/damage.

For string literals, I see that most implementations correctly throw a SyntaxError when given "\u{10}". The exception here is V8.

I'm sure it would be safer to allow \u{n..} for string literals even if this fortunate SyntaxError wasn't thrown. Users haven't been trained to think of escaped nonmetacharacters as safe for string literals to the extent that they have for regexes, and you can't programmatically generate such escapes so easily as when passing to the RegExp constructor.

-- Steven Levithan

# Glenn Adams (13 years ago)

On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg < ecmascript at norbertlindenberg.com> wrote:

The conformance clause doesn't say anything about the interpretation of (UTF-16) code units as code points. To check conformance with C1, you have to look at how the resulting code points are actually further interpreted.

True, but if the proposed language

"A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

is adopted, then will this not have the effect of creating unpaired surrogates as code points? If so, then by my estimation, this will increase the likelihood of their being interpreted as abstract characters... e.g., if the unpaired code unit is interpreted as an unpaired surrogate code point, and some process/function performs any predicate or transform on that code point, then that amounts to interpreting it as an abstract character.

I would rather see such an unpaired code unit either (1) be mapped to U+00FFFD, or (2) cause an exception to be raised when performing an operation that requires conversion of the UTF-16 code unit sequence.

# Glenn Adams (13 years ago)

On Tue, Mar 27, 2012 at 12:11 AM, Glenn Adams <glenn at skynav.com> wrote:

I don't think either of these amount to interpretation as abstract characters. I mention U+FFFD because the alternative interpretation of unpaired surrogates would be to replace them with U+FFFD, but that doesn't seem to improve anything.

I would call it an improvement since it reduces spread|contamination of|by unpaired surrogate code points. I'm not sure what other advantage would be great enough to trump the desire to prevent such contamination.

# Norbert Lindenberg (13 years ago)

On Mar 26, 2012, at 22:49 , Steven Levithan wrote:

Norbert Lindenberg wrote:

The ugly world of web reality...

Actually, in V8, Firefox, Safari, and IE, /[\u{10000}]/ seems to be the same as /[\u01{}]/ - it matches "\u01{}u01". In Opera, it doesn't seem to match anything, but doesn't throw the specified SyntaxError either.

How did you test this? I get consistent results that agree with Erik in IE 9, Firefox 11, Chrome 17, and Safari 5.1:

"\u01{}".match(/[\u{10000}]/g); // ['u','0','1','{','}']
/\u{2}/g.test("uu"); // true

Sorry, stupid mistake on my side. It is /[u01{}]/, as Erik said.

Opera, as you said, returns null and false (tested v11.6 and v10.0).

Do we know of any applications actually relying on these bugs, seeing that browsers don't agree on them?

Minus Opera, browsers do agree on them. Admirably so. And they aren't bugs--they're intentional breaks from ES for backcompat with earlier implementations that were themselves designed for backcompat with older non-ES regex behavior. The RegExp Match Web Reality proposal at harmony:regexp_match_web_reality says to add them to the spec, and Allen has said the web reality proposal should be the top RegExp priority for ES6.

Grumble. How about applying the web reality proposal only to old-style regex, not when /u is set or implied?

Norbert

# Steven Levithan (13 years ago)

The idea for /u and the following aspects of it already seem to have some consensus:

  • Switch from code unit to code point matching.
  • Make \d\w\b Unicode-aware.
  • Make /i use proper Unicode casefolding.
  • Enable \u{x..} (break from web reality).

Since /u may be a one-time opportunity to broadly change RegExp semantics, how about adding another change on the pile?

  • Break from web reality for escaped A-Z and a-z. Throw a SyntaxError when any letter not assigned a special meaning is escaped, instead of matching the literal character.

I.e., /\i/u etc. must throw a SyntaxError.
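Engines that implement the `u` flag as standardized in ECMAScript 2015 behave this way, so the difference can be checked directly. A minimal sketch; `throwsSyntaxError` is a helper invented for this example.

```javascript
// Hypothetical helper: returns true when the pattern source is rejected
// with a SyntaxError for the given flags.
function throwsSyntaxError(source, flags) {
  try {
    new RegExp(source, flags);
    return false;
  } catch (err) {
    return err instanceof SyntaxError;
  }
}

console.log(throwsSyntaxError("\\i", ""));  // false: legacy identity escape
console.log(new RegExp("\\i").test("i"));   // true: \i matches a literal "i"
console.log(throwsSyntaxError("\\i", "u")); // true: \i has no assigned meaning
console.log(throwsSyntaxError("\\d", "u")); // false: \d has a meaning
```

The RegExp constructor is used instead of literals so that the rejected pattern surfaces as a catchable exception rather than an early error for the whole script.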

This is relevant to future Unicode support, because without breaking web reality we might never be able to add \p{..} and \P{..} for Unicode properties, \X for graphemes, \N{..} for named characters, etc.

Of course, this change would also make it easier to add any from a host of special escapes in other regex libraries (such as \k<..> for named backreferences) or new ES inventions. It's really ugly that such features might not be able to be added by default everywhere, but them's the breaks, I suppose (I hope I'm wrong).

We could go crazy and start fixing all of ES's RegExp warts when /u is applied, even though such changes would not be related to Unicode support. I'd be happy to pursue that, but I suspect many here would see it as a bridge too far.

Thoughts?

-- Steven Levithan

# Steven Levithan (13 years ago)

Norbert Lindenberg wrote:

Grumble. How about applying the web reality proposal only to old-style regex, not when /u is set or implied?

I'd support that. See my recent email along similar lines, where I suggested using /u as an opportunity to discuss all of ES's RegExp warts. Limiting this to not applying web reality would be a reasonable compromise that would at least allow for future letter escapes and finally kill RegExp octals (which overlap with backreferences, among other problems).

-- Steven Levithan

# Erik Corry (13 years ago)

2012/3/27 Steven Levithan <steves_list at hotmail.com>:

The idea for /u and the following aspects of it already seem to have some consensus:

  • Switch from code unit to code point matching.
  • Make \d\w\b Unicode-aware.

I think we should leave these alone. They are concise and useful and will continue to be so when /u is the default in Harmony code. Instead we should introduce \p{...} immediately which provides the same functionality.

  • Make /i use proper Unicode casefolding.
  • Enable \u{x..} (break from web reality).

Make unpaired surrogates in /u regexps a syntax error.

Add /U to mean old-style regexp literals in Harmony code (analogous to /s and /S which have opposite meanings).

Since /u may be a one-time opportunity to broadly change RegExp semantics, how about adding another change on the pile?

  • Break from web reality for escaped A-Z and a-z. Throw a SyntaxError when any letter not assigned a special meaning is escaped, instead of matching the literal character.

I.e., /\i/u etc. must throw a SyntaxError.

Yes.

Also we should consider the Perl syntax that allows you to switch on and off flags for only part of a regexp, so that case independence does not have to apply to the whole regexp.

# Norbert Lindenberg (13 years ago)

Would you consider the ICU4J method com.ibm.icu.text.UTF16.charAt [1] to be in violation of C1 because it can return surrogate code points?

Or the ICU4J method com.ibm.icu.lang.UCharacter.isUUppercase [2] because it's a predicate that tells you that surrogate code points do not represent upper case characters?

Or the ICU4J method com.ibm.icu.lang.UCharacter.toUpperCase [3] because it's a transform that maps surrogate code points to themselves as their upper case form?

Norbert

[1] icu-project.org/apiref/icu4j/com/ibm/icu/text/UTF16.html#charAt(java.lang.CharSequence, int)

[2] icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html#isUUppercase(int)

[3] icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html#toUpperCase(int)

# Steven Levithan (13 years ago)

Erik Corry wrote:

Steven Levithan wrote:

  • Make \d\w\b Unicode-aware.

I think we should leave these alone. They are concise and useful and will continue to be so when /u is the default in Harmony code. Instead we should introduce \p{...} immediately which provides the same functionality.

\w and \b are broken without Unicode. ASCII \d is concise and useful, but so is [0-9]. Unicode-aware \b can't be emulated using \p{..} unless lookbehind is also added (which is tentatively approved for ES6 but could get delayed). Unicode-aware \w\b\d are required by UTS#18. If \w\b\d are not made Unicode-aware by /u, we won't easily be able to fix them in the future.
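The ASCII-only behaviour of \w and \b is easy to see with a word containing one non-ASCII letter. The last line uses a property escape, which assumes an engine with `\p{...}` support (ECMAScript 2018 or later); the class `[\p{L}\p{N}_]` is only a rough stand-in chosen for this sketch, not the UTS#18 definition of \w.

```javascript
const word = "na\u00EFve"; // "naive" with U+00EF in the middle

// ASCII-only \w stops at the first non-ASCII letter.
console.log(word.match(/\w+/)[0] === "na"); // true

// \b reports word boundaries inside the word, on both sides of U+00EF.
console.log(word.replace(/\b/g, "|") === "|na|\u00EF|ve|"); // true

// A property-based class covers the whole word.
console.log(word.match(/[\p{L}\p{N}_]+/u)[0] === word); // true
```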

We went down this road before, and at the end you agreed that \w\b\d with /u should be Unicode aware. :/

I agree with adding \p{..} as soon as possible, with two caveats:

  • If I recall correctly, mobile browser implementers voiced concerns about overhead during the es4-discuss days.
  • It can easily be pushed down the road to ES7+.

Delaying /u, on the other hand, might mean also having to delay Norbert's work on code point matching, etc. Introducing \p{..} without code point matching would be nonideal. \p{..} might need to be delayed anyway to allow RegExp proposals already approved by TC39 (match web reality, lookbehind, flag /y), the flag /x strawman, and flag /u to be completed in time. For starters, it's not clear which properties \p{..} in ES would support, and there would be a number of other details to discuss, too.

Erik Corry wrote:

Make unpaired surrogates in /u regexps a syntax error.

Sounds good to me.

-- Steven Levithan

# Mark Davis ☕ (13 years ago)

That, as Norbert explained, is not the intention of the standard. Take a look at the discussion of "Unicode 16-bit string" in chapter 3. The committee recognized that fragments may be formed when working with UTF-16, and that destructive changes may do more harm than good.

x = a.substring(0, 5) + b + a.substring(5, a.length());
y = x.substring(0, 5) + x.substring(6, x.length());

After this operation is done, you want y == a, even if 5 is between D800 and DC00.

Or take:

output = "";
for (int i = 0; i < s.length(); ++i) {
  ch = s.charAt(i);
  if (ch.equals('&')) {
    ch = '@';
  }
  output += ch;
}

After this operation is done, you want "a&\u{10000}b" to become "a@\u{10000}b", not "a&\u{FFFD}\u{FFFD}b". It is also an unnecessary burden on lower-level software to always check this stuff.

Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or output, then you do need to either convert to FFFD or take some other action.
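Both examples carry over to ECMAScript directly. The sketch below is a translation with made-up sample strings, assuming one-unit `b` and an engine that accepts `\u{...}` in string literals (ECMAScript 2015 and later); `replaceAmpersand` is a name chosen for this example.

```javascript
// Index 5 falls between the lead and trail surrogate of U+1D306.
const a = "abcd\u{1D306}ef";
const b = "X";

const x = a.substring(0, 5) + b + a.substring(5);
const y = x.substring(0, 5) + x.substring(6);
console.log(y === a); // true: the temporary fragments did no harm

// Code-unit loop that replaces "&" with "@" and copies everything else.
function replaceAmpersand(s) {
  let output = "";
  for (let i = 0; i < s.length; i++) {
    let ch = s.charAt(i);
    if (ch === "&") {
      ch = "@";
    }
    output += ch;
  }
  return output;
}

console.log(replaceAmpersand("a&\u{10000}b") === "a@\u{10000}b"); // true
```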


Mark plus.google.com/114199149796022210033 * * — Il meglio è l’inimico del bene — **

# Glenn Adams (13 years ago)

On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <mark at macchiato.com> wrote:

That, as Norbert explained, is not the intention of the standard. Take a look at the discussion of "Unicode 16-bit string" in chapter 3. The committee recognized that fragments may be formed when working with UTF-16, and that destructive changes may do more harm than good.

x = a.substring(0, 5) + b + a.substring(5, a.length());
y = x.substring(0, 5) + x.substring(6, x.length());

After this operation is done, you want y == a, even if 5 is between D800 and DC00.

Assuming that b.length() == 1 in this example, my interpretation of this is that '=', '+', and 'substring' are operations whose domain and co-domain are (currently defined) ES Strings, namely sequences of UTF-16 code units. Since none of these operations entail interpreting the semantics of a code point (i.e., interpreting abstract characters), then there is no violation of C1 here.

Or take:

output = "";
for (int i = 0; i < s.length(); ++i) {
  ch = s.charAt(i);
  if (ch.equals('&')) {
    ch = '@';
  }
  output += ch;
}

After this operation is done, you want "a&\u{10000}b" to become "a@\u{10000}b", not "a&\u{FFFD}\u{FFFD}b". It is also an unnecessary burden on lower-level software to always check this stuff.

Again, in this example, I assume that the string literal "a&\u{10000}b" maps to the UTF-16 code unit sequence:

0061 0026 D800 DC00 0062
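That mapping can be confirmed by dumping the code units of the literal, assuming an engine that accepts `\u{...}` in string literals. `codeUnitsHex` is a helper written for this illustration.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as
// four-digit uppercase hex values separated by spaces.
function codeUnitsHex(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units.join(" ");
}

console.log(codeUnitsHex("a&\u{10000}b")); // 0061 0026 D800 DC00 0062
```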

Given that 'charAt(i)' is defined on (and is indexing) code units and not code points, and since the 'equals' operator is also defined on code units, this example also does not require interpreting the semantics of code points (i.e., interpreting abstract characters).

However, in Norbert's questions above about isUUppercase(int) and toUpperCase(int), it is clear that the domain of these operations is code points, not code units, and further, that they require interpretation as abstract characters in order to determine the semantics of the corresponding characters.

My conclusion is that the determination of whether C1 is violated or not depends upon the domain, codomain, and operation being considered.

# Mark Davis ☕ (13 years ago)

That would not be practical, nor predictable. And note that the 700K reserved code points are also not to be interpreted as characters; by your logic all of them would need to be converted to FFFD.

And in practice, an unpaired surrogate is best treated just like a reserved (unassigned) code point. For example, a lowercase operation should convert characters with lowercase correspondents to those correspondents, and leave everything else alone: control characters, format characters, reserved code points, surrogates, etc.



# Glenn Adams (13 years ago)

This begs the question of what is the point of C1.

# Mark Davis ☕ (13 years ago)

The point of C1 is that you can't interpret the surrogate code point U+DC00 as a character, like an "a".

Neither can you interpret the reserved code point U+0378 as a character, like a "b".



# Gavin Barraclough (13 years ago)

On Mar 26, 2012, at 11:57 PM, Erik Corry wrote:

Add /U to mean old-style regexp literals in Harmony code (analogous to /s and /S which have opposite meanings).

Are we sure this has enough utility to be worth adding? - it seems unlikely that programmers are going to often have cause to explicitly opt-out of correct unicode support (since little consideration usually seems to be given to this topic), and as discussed previously, a mechanism to do so already exists if they need it (RegExp("foo") will behave the same as the proposed /foo/U). If we do add a 'U' flag, I'd worry that it may end up more commonly being used in error when people intended to append a 'u'!

# Glenn Adams (13 years ago)

So, if as a result of a policy of converting any UTF-16 code unit sequence to a code point sequence one ends up with an unpaired surrogate, e.g., "\u{00DC00}", then performing a predicate on that code point, such as described in D21 (e.g., IsAlphabetic) would entail interpreting it as an abstract character?

I can see that D20 defines code point properties which would not entail interpreting as an abstract character, e.g., IsSurrogate, IsNonCharacter, but where does one draw the line?

# Mark Davis ☕ (13 years ago)

performing a predicate on that code point, such as described in D21 (e.g., IsAlphabetic) would entail interpreting it as an abstract character?

No.

but where does one draw the line?

The line is already drawn by the Unicode consortium, by consulting the Unicode Character Database properties. If you look at the data in the Unicode Character Database for any particular property, say Alphabetic, you'll find that surrogate code points are not included where the property is a true character property. There are a few special cases where reserved code points are provisionally given "anticipatory" character properties, such as in bidi ranges, simply because that makes implementations more forward compatible, but there aren't any cases where a "character" property applies to a surrogate code point (other than by returning "No", or "n/a", or some such).
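In engines with Unicode property escapes (ECMAScript 2018 and later, which postdates this thread), the property data can be queried for a surrogate code point directly. This is a hedged illustration of the data as exposed by such engines, not a statement about conformance clause C1.

```javascript
const loneSurrogate = "\uD800";

// Character properties answer "no" for a surrogate code point.
console.log(/\p{Alphabetic}/u.test(loneSurrogate)); // false
console.log(/\p{Uppercase}/u.test(loneSurrogate));  // false

// A code point property such as the general category still applies.
console.log(/\p{General_Category=Surrogate}/u.test(loneSurrogate)); // true

// For contrast, an ordinary letter is Alphabetic.
console.log(/\p{Alphabetic}/u.test("a")); // true
```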



# Glenn Adams (13 years ago)

ok, i'll accept your position at this point and drop my comment; i suppose it is true that if there are already unpaired surrogates in user data as UTF-16, then having unpaired surrogates as code points is no worse;

however, it would be useful if there were an informative pointer from the spec under consideration to a UTC sanctioned list of operations that constitute "interpreting as abstract characters" and, that, if used on such data would possibly violate C1; to this end, it would be useful if C1 itself included a concrete example of such an operation