Full Unicode based on UTF-16 proposal

# Norbert Lindenberg (12 years ago)

Based on my prioritization of goals for support for full Unicode in ECMAScript [1], I've put together a proposal for supporting the full Unicode character set based on the existing representation of text in ECMAScript using UTF-16 code unit sequences: norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html

The detailed proposed spec changes should give a good idea of the scope of the changes, but they will still need some polishing.

Comments?

Thanks, Norbert

[1] esdiscuss/2012-February/020721

# Erik Corry (12 years ago)

This is very useful, and was surely a lot of work. I like the general thrust of it a lot. It has a high level of backwards compatibility, does not rely on the VM having two different string implementations in it, and it seems to fix the issues people are encountering.

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.
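For concreteness, a sketch of what the proposed codePointAt yields at each index of a surrogate pair:

    var s = "\uD834\uDF06";   // U+1D306 TETRAGRAM FOR CENTRE, stored as a surrogate pair
    s.codePointAt(0);         // 0x1D306: a lead surrogate followed by a trail yields the full code point
    s.codePointAt(1);         // 0xDF06: an index pointing at the trail yields just that code unit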

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

If we are making a /u modifier for RegExp it would also be nice to get rid of the incorrect case-independent matching rules. This is the section that says: "If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch."

2012/3/16 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:

# Mark Davis ☕ (12 years ago)

Whew, a lot of work, Norbert. Looks quite good. My one question is whether it is worth having a mechanism for iteration.

OLD CODE

    for (var i = 0; i < s.length; ++i) {
      var x = s.charAt(i);
      // do something with x
    }

Using your mechanism, one would write:

NEW CODE

    for (var i = 0; i < s.length; ++i) {
      var x = s.codePointAt(i);
      // do something with x
      if (x > 0xFFFF) { ++i; }
    }

In Java, for example, I really wish you could write:

DESIRED

    for (int codepoint : s) {
      // do something with codepoint
    }

However, maybe this kind of iteration is rare enough in ES that it suffices to document the pattern under NEW CODE.

Thanks for all your work!

proposal for upgrading ECMAScript to a Unicode version released in this century

This was amusing; could have said "this millennium" ;-)

Mark plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene ("The best is the enemy of the good")

# Jonas Höglund (12 years ago)

On Sat, 17 Mar 2012 00:23:25 +0100, Mark Davis ☕ <mark at macchiato.com> wrote:

Whew, a lot of work, Norbert. Looks quite good. My one question is whether it is worth having a mechanism for iteration.

OLD CODE

    for (var i = 0; i < s.length; ++i) {
      var x = s.charAt(i);
      // do something with x
    }

Using your mechanism, one would write:

NEW CODE

    for (var i = 0; i < s.length; ++i) {
      var x = s.codePointAt(i);
      // do something with x
      if (x > 0xFFFF) { ++i; }
    }

In Java, for example, I really wish you could write:

DESIRED

    for (int codepoint : s) {
      // do something with codepoint
    }

However, maybe this kind of iteration is rare enough in ES that it suffices to document the pattern under NEW CODE.

That's the beauty of ECMAScript; it's extensible. :-)

    String.prototype.forEachCodePoint = function(fun) {
      var s = this;
      for (var i = 0; i < s.length; i++) {
        var x = s.codePointAt(i);
        fun(x, s);
        if (x > 0xFFFF) { ++i; }
      }
    };

    "hello".forEachCodePoint(function(x) {
      // do something with x
    });


Jonas

# Norbert Lindenberg (12 years ago)

Thanks for your comments - a few replies below.

Norbert

On Mar 16, 2012, at 1:55, Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Before asking developers to add /u, we should really have some evidence that not doing so would cause actual compatibility issues for real applications. Do you know of any examples?

Good point about Harmony code, although it seems opt-in got replaced by being part of a module.

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.

NaN is not a valid code point, so it shouldn't be returned. If we want to indicate access to a trailing surrogate code unit as an error, we should throw an exception.

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

It would make a lot of sense, but is outside the scope of this proposal. One step at a time :-)

If we are making a /u modifier for RegExp it would also be nice to get rid of the incorrect case-independent matching rules. This is the section that says: "If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch."

And the exception for "ß" and other characters whose upper case equivalent has more than one code point ("If u does not consist of a single character, return ch." in the Canonicalize algorithm in ES 5.1).

# Norbert Lindenberg (12 years ago)

In Harmony we should be able to make this even more beautiful using iterators [1]:

If we add:

    String.prototype[iterator] = function() {
      var s = this;
      return {
        index: 0,
        next: function() {
          if (this.index >= s.length) {
            throw StopIteration;
          }
          let cp = s.codePointAt(this.index);
          this.index += cp > 0xFFFF ? 2 : 1;
          return cp;
        }
      };
    };

clients can write:

    for (codePoint of str) {
      // do something with codePoint
    }

Norbert

[1] harmony:iterators

# Erik Corry (12 years ago)

2012/3/17 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:

Thanks for your comments - a few replies below.

Norbert

On Mar 16, 2012, at 1:55, Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour.  There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Before asking developers to add /u, we should really have some evidence that not doing so would cause actual compatibility issues for real applications. Do you know of any examples?

No. In general I don't think it is realistic to try to prove that problematic code does not exist, since that requires quantifying over all existing JS code, which is clearly impossible.

Good point about Harmony code, although it seems opt-in got replaced by being part of a module.

That would work too, I think.

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.

NaN is not a valid code point, so it shouldn't be returned. If we want to indicate access to a trailing surrogate code unit as an error, we should throw an exception.

Then you should probably remove the text: "If there is no code unit at that position, the result is NaN" from your proposal :-)

I am wary of using exceptions for non-exceptional data-driven events, since performance is usually terrible and it's arguably an abuse of the mechanism. Your iterator code looks fine to me and needs neither NaN nor exceptions.

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

It would make a lot of sense, but is outside the scope of this proposal. One step at a time :-)

I can see that. But if we are going to have multiple versions of the RegExp syntax we should probably aim to keep the number down.

If we are making a /u modifier for RegExp it would also be nice to get rid of the incorrect case-independent matching rules. This is the section that says: "If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch."

And the exception for "ß" and other characters whose upper case equivalent has more than one code point ("If u does not consist of a single character, return ch." in the Canonicalize algorithm in ES 5.1).

Yes.

# Norbert Lindenberg (12 years ago)

On Mar 16, 2012, at 19:57, Erik Corry wrote:

2012/3/17 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:

Thanks for your comments - a few replies below.

Norbert

On Mar 16, 2012, at 1:55, Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Before asking developers to add /u, we should really have some evidence that not doing so would cause actual compatibility issues for real applications. Do you know of any examples?

No. In general I don't think it is realistic to try to prove that problematic code does not exist, since that requires quantifying over all existing JS code, which is clearly impossible.

We cannot prove its absence, but we can discuss the likelihood of its existence, and showing an actual example is a quick way to bring that discussion to a conclusion.

I note that you didn't challenge my claim about the (un)likelihood of the existence of applications that depend on Deseret characters not being mapped to lower case while calling toLowerCase...

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.

NaN is not a valid code point, so it shouldn't be returned. If we want to indicate access to a trailing surrogate code unit as an error, we should throw an exception.

Then you should probably remove the text: "If there is no code unit at that position, the result is NaN" from your proposal :-)

I am wary of using exceptions for non-exceptional data-driven events, since performance is usually terrible and it's arguably an abuse of the mechanism. Your iterator code looks fine to me and needs neither NaN nor exceptions.

The iterator or codePointAt?

The latter has the statement you quote, which shows a disconnect between what I wrote a few days ago starting from the charCodeAt spec, and what I think when I don't look at that spec. charCodeAt (and hence the current implementation of codePointAt) returns NaN when given an index < 0 or ≥ length. The normal behavior when accessing elements or properties that don't exist is to return undefined. We can't fix charCodeAt anymore, but I can still fix codePointAt.
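A sketch of the resulting distinction, assuming that fix:

    var s = "a";
    s.charCodeAt(5);    // NaN: can't be changed for compatibility
    s.codePointAt(5);   // undefined under the revised proposal, like other absent-element access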

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

It would make a lot of sense, but is outside the scope of this proposal. One step at a time :-)

I can see that. But if we are going to have multiple versions of the RegExp syntax we should probably aim to keep the number down.

True. And in the meantime Brendan pointed to some regex proposals that try to address a different set of Unicode-related issues, also with a /u flag. Some coordination is clearly needed. blog.stevenlevithan.com/archives/fixing

# Steven L. (12 years ago)

Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let /./.exec(s)[0].length == 2. Instead, if this is deemed an important enough issue, there are two ways to match any Unicode grapheme that follow existing regex library precedent:

From Perl and PCRE:

\X

From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):

\P{M}\p{M}*

Obviously \X is prettier, but because it's fairly rare for people to care about this, IMO the more widely compatible solution that uses Unicode categories is Good Enough if Unicode category syntax is on the table for ES6.
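For illustration, here is the category-based pattern in concrete form. ES has no \p{..} syntax today, so this sketch borrows the Unicode property escape syntax (plus a /u flag) from the flavors above; treat it as an illustration of the idea, not valid ES:

    var grapheme = /\P{M}\p{M}*/gu;   // one non-mark code point plus any following combining marks
    "x\u0308y".match(grapheme);       // ["ẍ", "y"]: the combining diaeresis stays with its base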

Norbert Lindenberg wrote:

\uxxxx[\uyyyy-\uzzzz] is interpreted as [\uxxxx\uyyyy-\uxxxx\uzzzz]

[\uwwww-\uxxxx][\uyyyy-\uzzzz] is interpreted as [\uwwww\uyyyy-\uxxxx\uzzzz]

This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters.

Yikes! -1! This is unnecessary if the handling of \uhhhh is unmodified and support for \u{h..} and/or \x{h..} is added (the latter is the syntax from Perl and PCRE). Some people will want a way to match arbitrary Unicode code points rather than graphemes anyway, so leaving \uhhhh alone lets that use case continue working. This would still allow modifying the handling of literal astral/supplementary characters in RegExps. If it can be handled sensibly, I'm all for treating literal characters in RegExps as discrete graphemes rather than splitting them into surrogate pairs.

--Steven Levithan

# Erik Corry (12 years ago)

2012/3/17 Steven L. <steves_list at hotmail.com>:

Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour.  There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let /./.exec(s)[0].length == 2.

Care to enlighten us with any thinking behind this disagreeing?

Instead, if this is deemed an important enough issue, there are two ways to match any Unicode grapheme that follow existing regex library precedent:

From Perl and PCRE:

\X

This doesn't work inside []. Were you envisioning the same restriction in JS?

Also it matches a grapheme cluster, which may be useful but is completely different to what the dot does.

From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):

\P{M}\p{M}*

Obviously \X is prettier, but because it's fairly rare for people to care about this, IMO the more widely compatible solution that uses Unicode categories is Good Enough if Unicode category syntax is on the table for ES6.

Norbert Lindenberg wrote:

\uxxxx[\uyyyy-\uzzzz] is interpreted as [\uxxxx\uyyyy-\uxxxx\uzzzz]

Norbert, this just happens automatically if unmatched surrogates are just treated as if they were normal code units.

[\uwwww-\uxxxx][\uyyyy-\uzzzz] is interpreted as [\uwwww\uyyyy-\uxxxx\uzzzz]

Norbert, this will have different semantics from the current implementations unless the second range is the full trail surrogate range.

I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now.

Some people will want a way to match arbitrary Unicode code points rather than graphemes anyway, so leaving \uhhhh alone lets that use case continue working. This would still allow modifying the handling of literal astral/supplementary characters in RegExps. If it can be handled sensibly, I'm all for treating literal characters in RegExps as discrete graphemes rather than splitting them into surrogate pairs.

You seem to be confusing graphemes and Unicode code points. Here is the same text 3 times:

Four UTF-16 code units:

0x0020 0xD800 0xDF30 0x0308

Three Unicode code points:

0x20 0x10330 0x308

Two graphemes:

" " "¨" <-- This is an attempt to show a Gothic Ahsa with an umlaut. My mail program probably screwed it up.

The proposal you are responding to is all about adding Unicode code point handling to regexps. It is not about adding grapheme support, which is a rather different issue.

# Steven L. (12 years ago)

Erik Corry wrote:

Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let /./.exec(s)[0].length == 2.

Care to enlighten us with any thinking behind this disagreeing?

Sorry for the rushed and overly ebullient message. I disagreed with /u for switching from code unit to code point mode because in the moment I didn't think a code point mode necessary or particularly beneficial. Upon further reflection, I rushed into this opinion and will be more closely examining the related issues.

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES). Therefore, I think that if a flag is added that only switches from code unit to code point mode, it should not be "u". Presumably, flag /u could simultaneously affect \d\w\b and switch to code point mode. I haven't yet thought enough about combining these two proposals to hold a strong opinion on the matter.

there are two ways to match any Unicode grapheme that follow existing regex library precedent:

From Perl and PCRE: \X

This doesn't work inside []. Were you envisioning the same restriction in JS?

Also it matches a grapheme cluster, which may be useful but is completely different to what the dot does.

You are of course correct. And yes, I was envisioning the same restriction within character classes. But I'm not a strong proponent of \X, especially if support for Unicode categories is added.

I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now.

Glad to hear it.

You seem to be confusing graphemes and unicode code points. [...] The proposal you are responding to is all about adding Unicode code point handling to regexps. It is not about adding grapheme support, which is a rather different issue.

Indeed. My response was rushed and poorly formed. My apologies.

--Steven Levithan

# Norbert Lindenberg (12 years ago)

Steven, sorry, I wasn't aware of your proposal for /u when I inserted the note on this flag into my proposal. My proposal was inspired by the use of /u in PHP, where it switches from byte mode to UTF-8 mode. We'll have to see whether it makes sense to combine the two under one flag or use two - fortunately, Unicode still has a few other characters.

Norbert

# Erik Corry (12 years ago)

2012/3/17 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:

Steven, sorry, I wasn't aware of your proposal for /u when I inserted the note on this flag into my proposal. My proposal was inspired by the use of /u in PHP, where it switches from byte mode to UTF-8 mode. We'll have to see whether it makes sense to combine the two under one flag or use two - fortunately, Unicode still has a few other characters.

/foo/☃ // slash-unicode-snowman for the win! :-)

# Erik Corry (12 years ago)

2012/3/17 Steven L. <steves_list at hotmail.com>:

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this. I think "any digit including rods and roman characters but not decimal points/commas" en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space. The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful.

And instead of changing the meaning of \w, which will be confusing, I think that [:alnum:] as in perl would work fine.

\b is a little tougher. The Unicode rewrite would be (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is obviously too verbose. But if we take \b for this then the ASCII version has to be written as (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little annoying. However, often you don't need that if you have negative lookbehind because you can write something like

/(?<!\w)word(?!\w)/ // Negative look-behind for a \w and negative look-ahead for \w at the end.

which isn't too bad, even if it is much worse than

/\bword\b/

Indeed. My response was rushed and poorly formed. My apologies.

Gratefully accepted with the hope that my next rushed and poorly formed response will also be forgiven!

# Norbert Lindenberg (12 years ago)

On Mar 17, 2012, at 10:20, Erik Corry wrote:

2012/3/17 Steven L. <steves_list at hotmail.com>:

Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let /./.exec(s)[0].length == 2.

Care to enlighten us with any thinking behind this disagreeing?

Instead, if this is deemed an important enough issue, there are two ways to match any Unicode grapheme that follow existing regex library precedent:

From Perl and PCRE:

\X

This doesn't work inside []. Were you envisioning the same restriction in JS?

Also it matches a grapheme cluster, which may be useful but is completely different to what the dot does.

From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):

\P{M}\p{M}*

Obviously \X is prettier, but because it's fairly rare for people to care about this, IMO the more widely compatible solution that uses Unicode categories is Good Enough if Unicode category syntax is on the table for ES6.

Norbert Lindenberg wrote:

\uxxxx[\uyyyy-\uzzzz] is interpreted as [\uxxxx\uyyyy-\uxxxx\uzzzz]

Norbert, this just happens automatically if unmatched surrogates are just treated as if they were normal code units.

I don't see how. In the actual matching process, the new design only looks at code points, not code units. Without this transformation, it would see surrogate code points in the pattern, but supplementary code points in the text to be matched. Enhancing the matching process to recognize surrogate code points and insert them into the continuation might work, but wouldn't be any prettier than this transformation.

[\uwwww-\uxxxx][\uyyyy-\uzzzz] is interpreted as [\uwwww\uyyyy-\uxxxx\uzzzz]

Norbert, this will have different semantics from the current implementations unless the second range is the full trail surrogate range.

True. I think if we restrict the transformation to that specific case it'll still cover normal usage of this pattern.

I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now.

Some people will want a way to match arbitrary Unicode code points rather than graphemes anyway, so leaving \uhhhh alone lets that use case continue working. This would still allow modifying the handling of literal astral/supplementary characters in RegExps. If it can be handled sensibly, I'm all for treating literal characters in RegExps as discrete graphemes rather than splitting them into surrogate pairs.

You seem to be confusing graphemes and Unicode code points. Here is the same text 3 times:

Four UTF-16 code units:

0x0020 0xD800 0xDF30 0x0308

Three Unicode code points:

0x20 0x10330 0x308

Two graphemes:

" " "¨" <-- This is an attempt to show a Gothic Ahsa with an umlaut. My mail program probably screwed it up.

Mac Mail is usually Unicode-friendly, so let's try again: " 𐌰̈"

The proposal you are responding to is all about adding Unicode code point handling to regexps. It is not about adding grapheme support, which is a rather different issue.

Correct - thanks for the explanation!

# Norbert Lindenberg (12 years ago)

On Mar 17, 2012, at 11:58, Erik Corry wrote:

2012/3/17 Steven L. <steves_list at hotmail.com>:

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this. I think "any digit including rods and roman characters but not decimal points/commas" en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space. The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful.

Looking at that page, it seems \d gives you a reasonable set of digits, the ones in the Unicode general category Nd (number, decimal). These digits come from a variety of writing systems, but are all used in decimal-positional notation, so you can parse at least integers using them with a fairly generic algorithm.

Dealing with roman numerals or counting rods requires specialized algorithms, so you probably don't want to find them in this bucket.
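To illustrate the "fairly generic algorithm" point: Unicode encodes each Nd digit set as a contiguous ascending run from its zero, so one routine covers all of them once the zero is known. A sketch (the two zeros shown, ASCII 0x30 and Arabic-Indic 0x0660, are real; a full version would carry a table of all Nd zeros):

    // Parse a nonnegative integer whose digits all come from one Nd block.
    function parseDecimal(s, zero) {
      var value = 0;
      for (var i = 0; i < s.length; i++) {
        var d = s.charCodeAt(i) - zero;
        if (d < 0 || d > 9) throw new Error("not a digit of this block");
        value = value * 10 + d;
      }
      return value;
    }

    parseDecimal("42", 0x30);               // 42
    parseDecimal("\u0664\u0662", 0x0660);   // 42, from Arabic-Indic digits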

Norbert

# Steven L. (12 years ago)

Erik Corry wrote:

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this. I think "any digit including rods and roman characters but not decimal points/commas" en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space. The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful.

I know from experience that it's common for Arabic speakers to want to match both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari digits, and probably others. Even if it wasn't often useful, IMO this change is necessary for congruity with Unicode-enabled \w and \b (I'll get to that), and would likely never be detrimental since /u would be opt-in and it's easy to explicitly use [0-9] when that's what you want.

For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not /\p{N}/. I.e., it should not match any Unicode number, but rather any Unicode decimal digit (see unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Nd} for the list). And as Norbert noted, that is in fact what Perl's \d matches.

Comparison with other regex flavors:

  • \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).

  • \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).

  • \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).

  • \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).

  • \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).

  • \d == \p{Nd} -- .NET, Perl, Python (with (?u)).

  • \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).

  • \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Note that Java's \w and \b are inconsistent.

Unicode-based \w and \b are incredibly useful, and it is very common for users to sometimes want them to be Unicode-based--thus, an opt-in flag offers the best of both worlds. In fact, I'd go so far as to say they are broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which currently returns true.

Unicode-based \d would not only help international users/apps, it is also important because otherwise Unicode-based \w and \b would have to use [\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET, Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used [\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including user confusion), [^\W\d_] could not be used equivalently to \p{L}.

And instead of changing the meaning of \w, which will be confusing, I think that [:alnum:] as in perl would work fine.

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works only within character classes. IMO, the POSIX-style [[:name:]] syntax is clumsy and confusing, not to mention backward incompatible. It would potentially also be confusing if ES supports only [:alnum:] without adding the rest of the (not-very-useful) POSIX regex class names.

\b is a little tougher. The Unicode rewrite would be (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is obviously too verbose. But if we take \b for this then the ASCII version has to be written as (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little annoying. However, often you don't need that if you have negative lookbehind because you can write something like

/(?<!\w)word(?!\w)/ // Negative look-behind for a \w and negative look-ahead for \w at the end.

which isn't too bad, even if it is much worse than

/\bword\b/

I've already started to explain above why I think Unicode-based \b is important and useful. I'll just add the footnote that relying on lookbehind would in all likelihood perform less efficiently than \b (depending on implementation optimizations).

Indeed. My response was rushed and poorly formed. My apologies.

Gratefully accepted with the hope that my next rushed and poorly formed response will also be forgiven!

Consider it done. ;-P

--Steven Levithan

# Erik Corry (12 years ago)

2012/3/18 Steven L. <steves_list at hotmail.com>:

Erik Corry wrote:

I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this.  I think "any digit including rods and roman characters but not decimal points/commas" en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space.  The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes  This suggests to me that it's not very useful.

I know from experience that it's common for Arabic speakers to want to match both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari digits, and probably others. Even if it wasn't often useful, IMO this change is necessary for congruity with Unicode-enabled \w and \b (I'll get to that), and would likely never be detrimental since /u would be opt-in and it's easy to explicitly use [0-9] when that's what you want.

For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not /\p{N}/. I.e., it should not match any Unicode number, but rather any Unicode decimal digit (see unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Nd} for the list). And as Norbert noted, that is in fact what Perl's \d matches.

Ah, that makes much more sense.

Comparison with other regex flavors:

  • \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).

  • \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).

  • \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).

  • \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).

  • \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).

  • \d == \p{Nd} -- .NET, Perl, Python (with (?u)).

  • \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).

  • \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Note that Java's \w and \b are inconsistent.

Unicode-based \w and \b are incredibly useful, and it is very common for users to sometimes want them to be Unicode-based--thus, an opt-in flag offers the best of both worlds. In fact, I'd go so far as to say they are broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which currently returns true.

Unicode-based \d would not only help international users/apps, it is also important because otherwise Unicode-based \w and \b would have to use [\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET, Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used [\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including user confusion), [^\W\d_] could not be used equivalently to \p{L}.

And instead of changing the meaning of \w, which will be confusing, I think that [:alnum:] as in perl would work fine.

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works

This would be pretty useless and is not true in perl. I tried the following:

perl -e "use utf8; print 'æ' =~ /[[:alnum:]]/ . "\n";"

and it prints 1, indicating a match.

only within character classes. IMO, the POSIX-style [[:name:]] syntax is clumsy and confusing, not to mention backward incompatible. It would potentially also be confusing if ES supports only [:alnum:] without adding the rest of the (not-very-useful) POSIX regex class names.

The implication was to add the rest too. Seeing things like the regexp at the bottom of this page inimino.org/~inimino/blog/javascript_cset is an indication to me that there is a demand.

\b is a little tougher.  The Unicode rewrite would be (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is obviously too verbose.  But if we take \b for this then the ASCII version has to be written as (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little annoying.  However, often you don't need that if you have negative lookbehind because you can write something like

/(?<!\w)word(?!\w)/    // Negative look-behind for a \w and negative look-ahead for \w at the end.

which isn't too bad, even if it is much worse than

/\bword\b/

I've already started to explain above why I think Unicode-based \b is important and useful. I'll just add the footnote that relying on lookbehind would in all likelihood perform less efficiently than \b (depending on implementation optimizations).

OK, I'm convinced that /u should make \d, \b and \w Unicode aware. I don't think the performance will be much different between a lookbehind and a \b though.

# Steven L. (12 years ago)

Steven Levithan wrote:

  • \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).
  • \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Oops. My ASCII-only version of \s is obviously missing space \x20 and no-break space \xA0 (which are included in Unicode's \p{Z}).

Erik Corry wrote:

Steven Levithan wrote:

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing.

This would be pretty useless and is not true in perl. I tried the following:

perl -e "use utf8; print 'æ' =~ /[[:alnum:]]/ . "\n";"

and it prints 1, indicating a match.

<Updating my mental notes> Roger that. Online docs (including the Perl-specific page you linked to earlier) typically list [:alnum:] as [A-Za-z0-9], but I've just done some quick testing and it seems that regex packages supporting [:alnum:] give it at least three different meanings:

  • [A-Za-z0-9]
  • [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]
  • [\p{Ll}\p{Lu}\p{Lt}\p{Nd}\p{Nl}]

Note that although Java doesn't support POSIX character class syntax, it too supports alnum via \p{Alnum}. Java's alnum matches only [A-Za-z0-9].

Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :)

Erik Corry wrote:

OK, I'm convinced that /u should make \d, \b and \w Unicode aware.

w00t!

--Steven Levithan

# Erik Corry (12 years ago)

2012/3/18 Steven L. <steves_list at hotmail.com>:

Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :)

Please do, they seem quite sensible to me.

In fact \w with Unicode support seems very similar to [:alnum:] to me. If this one is useful are there not other Unicode categories that would be useful?

# Steven L. (12 years ago)

Erik Corry wrote:

Steven Levithan wrote:

Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :)

Please do, they seem quite sensible to me.

My main objections are due to the POSIX character class syntax itself, and my preference for introducing Unicode categories using \p{..} instead. But to get down a little more detail...

  • They're backward incompatible. /[[:name:]]/ is currently equivalent to /[[:aemn]]/ in web-reality. Granted, this probably won't be a big deal for existing code, but because they're not currently an error, their use could cause latent bugs in old browsers that don't support them and treat them as part of a character class's set.

  • They work inside of bracket expressions only. This is clumsy and needlessly confusing. [:alnum:] outside of a bracket expression would probably have to continue to be equivalent to [:almnu], which would lead to at least occasional developer frustration and bugs.

  • Since the exact characters they match differs between regex libraries (beyond just Unicode version variation), they would contribute to the existing landscape of regex features that seem to be portable but actually work slightly differently in different places. We need less of this.

  • They are either rarely useful or only minor conveniences over existing shorthands, explicit character classes, or Unicode categories that could be matched using \p{..} in more standardized fashion.

  • Other implementations, at least, do not allow them to be negated on their own, unlike \p{..} (via \P{..} or \p{^..}). They can be negated by using them in negated bracket expressions, but that may negate more than you want.

  • If ES ever adopts .NET/XPath-style character class subtraction or Java-style character class intersection (the latter was on the cards for ES4), their syntax would become even more confusing.

  • Bonus pompous bullet point: IMO, there are more useful and important new RegExp features to focus on, including support for Unicode categories (which, IMO, are regex's new and improved version of POSIX character classes). My personal wishlist would probably include at least 20 new regex features above POSIX character classes, even if they were introduced using the \p{..} syntax (which is how Java included them).

  • Bonus nitpick: The name of the syntax itself causes confusion. POSIX calls them character classes, and calls their container a bracket expression. JavaScripters already call the container a character class. (Not an objection, per se. Presumably we could call them something like "POSIX shorthands" to avoid confusion.)

I'd have no actual objections to adding them using the \p{Name} syntax (as Java does), especially if there is demand for them among regex power-users (you're the first person who I've seen strongly advocate for them). However, I'd still have concerns about exactly which names are added, exactly what they match, and their compatibility with other regex flavors.

In fact \w with Unicode support seems very similar to [:alnum:] to me. If this one is useful are there not other Unicode categories that would be useful?

\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).

As you said, though, Unicode categories are indeed quite useful. Unicode scripts, too. I'd advocate for them alongside you. Because of how useful they are, I've even made them usable via my XRegExp JavaScript library (see git.io/xregexp ). That lib has a relatively small but enthusiastic user base and is seeing increasing use in server-side JS, where the overhead of loading long Unicode code point ranges doesn't matter as much. But, so long as a /u flag is added for switching \w\b\d to Unicode-mode, I'd argue that even Unicode categories and scripts are less important than various other features I've mentioned recently on es-discuss, including named capture and atomic groups.

-- Steven Levithan

# Steven L. (12 years ago)

Steven Levithan wrote:

\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).

Although some regex libraries indeed implement the above, I've just looked over UTS #18 Annex C, which requires that \w be equivalent to:

[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]

Note that \p{Alphabetic} should include more than just \p{L}. I'm not clear on whether the differences from \p{L} are fully covered by the inclusion of \p{M} in the above character class. I'm sure there are plenty of people here with greater Unicode expertise than me who could clarify, though.
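For reference, that requirement written out as a pattern, again borrowing the \p{..} property escape syntax as an illustration rather than valid ES:

    var unicodeWord = /[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]/u;
    unicodeWord.test("n");        // true
    unicodeWord.test("\u0308");   // true: combining marks count as word characters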

-- Steven Levithan

# Steven L. (12 years ago)

Java SE 7 apparently added flag (?U) to do the same thing as Python's (?u). The new flag also affects Java's POSIX character class definitions such as \p{Alnum}.

Note the difference in casing, and also that Java's (?U)\w follows UTS#18, unlike Python's (?u)\w. Java has long supported a lowercase (?u) flag for Unicode-aware case folding.

-- Steven Levithan


# Norbert Lindenberg (12 years ago)

I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. norbertlindenberg.com/2012/03/ecmascript-supplementary-characters

Norbert

# Steven Levithan (12 years ago)

Norbert Lindenberg wrote:

I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. norbertlindenberg.com/2012/03/ecmascript-supplementary-characters

Cool.

From the proposal's Updates section:

Indicated that "u" may not be the actual character for the flag for code point mode in regular expressions, as a "u" flag has already been proposed for Unicode-aware digit and word character matching.

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

  1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert.

  2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u.

  3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

Item number 3 is inspired by, but different from, Java's lowercase u flag for Unicode casefolding. In Java, flag u itself enables Unicode casefolding and does not need to be paired with flag i (which is equivalent to ES's /i).

As an aside, merging these three things would likely lead to /u seeing widespread use when dealing with anything more than ASCII, at least in environments where you don't have to worry about backcompat. This would help developers avoid stumbling on code unit issues in the small minority of cases where non-BMP characters are used or encountered. If /u's only purpose was to switch to code point mode, most likely it would be used far less often, and more developers would continue to get bitten by code-unit-based processing.

As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

  1. "[S]ome applications might have processed gunk with regular expressions where neither the 'characters' in the patterns nor the input to be matched are text."

  2. "s.match(/^.$/)[0].length can now be 2." I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6.

  3. /./g.exec(s) can now increment the regex's lastIndex by 2.
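To make the third point concrete, a sketch of how a code-point-mode /u would step through matches (hypothetical at this point, though it is how such a flag would naturally behave):

    var re = /./gu;              // assumed code point mode
    var s = "\uD834\uDF06a";     // one supplementary character, then "a"
    re.exec(s)[0].length;        // 2: the whole surrogate pair is a single match
    re.lastIndex;                // 2: advanced past both code units
    re.exec(s)[0];               // "a"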

-- Steven Levithan

# Lasse Reichstein (12 years ago)

On Fri, Mar 23, 2012 at 2:30 PM, Steven Levithan <steves_list at hotmail.com> wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

...

  3. [New proposal] Makes /i use Unicode casefolding rules.

Yey, I'm for it :) Especially if it means dropping the rather naïve canonicalize function that can't canonicalize an ASCII character with a non-ASCII character.

/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

I think a compliant implementation should (read: ought to) already get that example, since "στιγμας".toUpperCase() == "ΣΤΙΓΜΑΣ".toUpperCase() in the browsers I have checked, and the ignore-case canonicalization is based on toUpperCase. Alas, most of the implementations miss it anyway.

# Phillips, Addison (12 years ago)

Comments follow.

  1. Definition of string. You say:

-- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so it may be ill-formed when interpreted as a UTF-16 code unit sequence.

I know what you mean, but others might not. Perhaps:

-- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so the sequence of code units may contain code units that are not valid in Unicode or sequences that do not represent Unicode code points (such as unpaired surrogates).

  2. In this section, I would define string after code unit and code point. I would also include a definition of surrogates/surrogate pairs.

  2. Under "text interpretation" you say:

-- For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which can never represent characters).

This would (see above) benefit from having a definition in place. As noted, this is slightly incomplete, since surrogate code units are used to form supplementary characters. Perhaps:

-- For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which do not individually represent characters).

  4. 0xFFFE and 0xFFFF are non-characters in Unicode. I do think you do the right thing here. It's just a nit that you never note this ;-).

  5. Editorially unnecessary ;-):

-- This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters.

  6. Under 'details' you suggest a number of renamings. Are these strictly necessary? The term 'character' could be taken to mean 'code point' instead, with an explanatory note.

  7. Skipping down a lot, to "section 6 source text", you propose:

-- The text is expected to have been normalised to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15.

I think this should be removed or modified. Automatic application of NFC is not always desirable, as it can affect presentation or processing. Perhaps:

-- Normalization of the text to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15, is recommended when transcoding from another character encoding.

  1. In "7.6 Identifier Names and Identifiers" you don't actually forbid unpaired surrogates or non-characters in the text (Identifier_Part:: does this by implication). Perhaps state it? Also, ZWJ and ZWNJ are permitted as the last character in an identifier.

  2. "15.5.4.6": you say "(a nonnegative integer less than 0x10FFFF)", whereas it should say: "(a nonnegative integer less than or equal to 0x10FFFF)"

  10. In the section on "what about utf-32", you say: " and the code points start at positions 1, 2, 3.". Of course this should be "... and the code points start at positions 0, 1, 2."

Thanks for this proposal!

Addison

# Roger Andrews (12 years ago)

Concerning UTF-16 surrogate pairs, how about a function like: String.isValid( str ) to discover whether surrogates are used correctly in 'str'?

Something like Array.isArray().

N.b., encodeURI already throws a URIError exception if 'str' is not a well-formed UTF-16 string.

# David Herman (12 years ago)

On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:

Concerning UTF-16 surrogate pairs, how about a function like: String.isValid( str ) to discover whether surrogates are used correctly in 'str'?

Something like Array.isArray().

No need for it to be a class method, since it only operates on strings. We could simply have String.prototype.isValid(). Note that it would work for primitive strings as well, thanks to JS's automatic promotion semantics.
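Neither form exists in the language; a minimal sketch of what such a method could check, using the suggested (hypothetical) name and placement:

    // Hypothetical method: true iff every surrogate code unit is part of a proper pair.
    String.prototype.isValid = function () {
      for (var i = 0; i < this.length; i++) {
        var cu = this.charCodeAt(i);
        if (cu >= 0xD800 && cu <= 0xDBFF) {          // a lead surrogate...
          var next = this.charCodeAt(i + 1);         // NaN at end of string
          if (!(next >= 0xDC00 && next <= 0xDFFF)) return false;
          i++;                                       // ...must be followed by a trail
        } else if (cu >= 0xDC00 && cu <= 0xDFFF) {   // a trail without a lead
          return false;
        }
      }
      return true;
    };

    "\uD834\uDF06".isValid();   // true
    "\uD834".isValid();         // false: unpaired lead surrogate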

# David Herman (12 years ago)

On Mar 23, 2012, at 6:30 AM, Steven Levithan wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

+all my internet points

Now you're talking!!

  1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert.

  2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u.

  3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

This is really exciting.

As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

I haven't completely understood this part of the discussion. Looking at /u as a "little red switch" (LRS), i.e., an opportunity to make judicious breaks with compatibility, could we not allow character classes with unescaped non-BMP code points, e.g.:

js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed

# David Herman (12 years ago)

Presumably the JS source, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs.

Clarification: the JS source of the regexp literal.

# Wes Garland (12 years ago)

On 24 March 2012 15:25, David Herman <dherman at mozilla.com> wrote:

Presumably the JS source, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs.

Clarification: the JS source of the regexp literal.

We certainly can, although this means that certain Unicode Strings cannot be matched by a regexp with this flag. These strings would be the ones containing reserved code points.

That said, why is the JS source suddenly a sequence of UTF-16 code units? I believe JS source code should be a sequence of Unicode code points (and I think ES5 says something to this effect).

The underlying transport format should not be a concern for the JS lexer. The lexer should receive a series of code points from the network transport, allowing web sites to transmit JS in whatever encoding they see fit, provided the browser and server can both agree on it. I think UTF-8 would make a fine transport format for JS source code. IMHO the transport format between the browser and the JS lexer [i.e. the input program encoding] should be allowed to be implementation-defined and not specified by TC-39.

# David Herman (12 years ago)

On Mar 24, 2012, at 1:11 PM, Wes Garland wrote:

On 24 March 2012 15:25, David Herman <dherman at mozilla.com> wrote:

Presumably the JS source, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs.

Clarification: the JS source of the regexp literal.

We certainly can, although this means that certain Unicode Strings cannot be matched by a regexp with this flag. These strings would be the ones containing reserved code points.

I didn't mean to imply only allowing non-BMP ranges by their unescaped representation, just that if it's possible that would often be nice and readable. I would certainly expect that we should also allow [\u{xxxxx}-\u{yyyyy}].
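Side by side, the two forms would look like this (a sketch assuming both the /u flag and \u{..} escapes are accepted):

    /[\u{1D306}-\u{1D356}]+/u.test("\u{1D310}");   // true, with escaped code points
    "𝌆𝌇".match(/[𝌆-𝍖]+/u)[0];                      // "𝌆𝌇", with unescaped characters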

That said, why is the JS source suddenly a sequence of UTF-16 code units? I believe JS source code should be a sequence of Unicode code points (and I think ES5 says something to this effect).

I'm not 100% clear on this point yet, but e.g. the SourceCharacter production in Annex A.1 is described as "any Unicode code unit."

The underlying transport format should not be a concern for the JS lexer.

eval

# Wes Garland (12 years ago)

On 24 March 2012 17:22, David Herman <dherman at mozilla.com> wrote:

I'm not 100% clear on this point yet, but e.g. the SourceCharacter production in Annex A.1 is described as "any Unicode code unit."

Ugh, IMHO, that's wrong, and should be "any Unicode code point". (let the flames begin?)

The underlying transport format should not be a concern for the JS lexer.

eval

Eval is a red herring: its input is defined as the contents of the given String. So, we come full-circle back to "what's in a String?". I'm still partial to Brendan's BRS idea, because at least it fixes everything all at once.

# Norbert Lindenberg (12 years ago)

On Mar 23, 2012, at 6:30, Steven Levithan wrote:

Norbert Lindenberg wrote:

I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. norbertlindenberg.com/2012/03/ecmascript-supplementary-characters

Cool.

From the proposal's Updates section:

Indicated that "u" may not be the actual character for the flag for code point mode in regular expressions, as a "u" flag has already been proposed for Unicode-aware digit and word character matching.

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

  1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert.

  2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u.

One concern: I think code-point-based matching should be the default for regex literals within modules (where we know the code is written for Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full Unicode sets for such literals?

In the other direction it's clear that using /u for \d\D\w\W\b\B has to imply code point mode.

  3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

We probably should review the complete Unicode Technical Standard #18, Unicode Regular Expressions, and see how we can upgrade RegExp for better Unicode support. Maybe on a separate thread...

Item number 3 is inspired by but different than Java's lowercase u flag for Unicode casefolding. In Java, flag u itself enables Unicode casefolding and does not need to be paired with flag i (which is equivalent to ES's /i).

As an aside, merging these three things would likely lead to /u seeing widespread use when dealing with anything more than ASCII, at least in environments where you don't have to worry about backcompat. This would help developers avoid stumbling on code unit issues in the small minority of cases where non-BMP characters are used or encountered. If /u's only purpose was to switch to code point mode, most likely it would be used far less often, and more developers would continue to get bitten by code-unit-based processing.

Good thinking :-)
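
Taken together, the three behaviors proposed for /u would look roughly like this (a sketch of the proposed semantics, not of any existing implementation):

    // 1. Code point mode: the dot consumes a whole surrogate pair.
    /^.$/u.test("\uD834\uDF06");     // true: one code point, U+1D306
    // 2. Unicode-aware shorthands: \d matches non-ASCII decimal digits.
    /^\d$/u.test("\u0663");          // true: ARABIC-INDIC DIGIT THREE
    // 3. Unicode casefolding for /i:
    /ΣΤΙΓΜΑΣ/iu.test("στιγμας");     // true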

# Norbert Lindenberg (12 years ago)

On Mar 23, 2012, at 7:12 , Lasse Reichstein wrote:

On Fri, Mar 23, 2012 at 2:30 PM, Steven Levithan <steves_list at hotmail.com> wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

...

  3. [New proposal] Makes /i use Unicode casefolding rules.

Yey, I'm for it :) Especially if it means dropping the rather naïve canonicalize function that can't canonicalize an ASCII character with a non-ASCII character.

/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

I think a compliant implementation should (read: ought to) already get that example, since "στιγμας".toUpperCase() == "ΣΤΙΓΜΑΣ".toUpperCase() in the browsers I have checked, and the ignore-case canonicalization is based on toUpperCase. Alas, most of the implementations miss it anyway.

According to the ES5 spec, /ΣΤΙΓΜΑΣ/i.test("στιγμας") must be true indeed. Chrome and Node (i.e., V8) and IE get this right; Safari, Firefox, and Opera don't.

Note that toUpperCase allows mappings from 1 to multiple code units, while RegExp canonicalization in ES5 doesn't, so /SS/i.test("ß") === false even though "SS".toUpperCase() === "ß".toUpperCase().

Norbert

# Norbert Lindenberg (12 years ago)

Thanks for the detailed comments! Replies below.

Norbert

On Mar 23, 2012, at 9:46 , Phillips, Addison wrote:

Comments follow.

  1. Definition of string. You say:

-- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so it may be ill-formed when interpreted as a UTF-16 code unit sequence.

I know what you mean, but others might not. Perhaps:

-- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so the sequence of code units may contain code units that are not valid in Unicode or sequences that do not represent Unicode code points (such as unpaired surrogates).

I can add a note that ill-formed here means containing unpaired surrogates. If I read chapter 3 of the Unicode Standard correctly, there's no other way for UTF-16 to be ill-formed. UTF-16 code units by themselves cannot be invalid - any 16-bit value can occur in a well-formed UTF-16 string.
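
A concrete example (plain ES5 string semantics):

    // Any sequence of 16-bit code units is a legal ES String value:
    var wellFormed = "\uD834\uDF06"; // a surrogate pair, encoding U+1D306
    var illFormed = "\uD834a";       // high surrogate not followed by a low one
    illFormed.length;                // 2 - still a perfectly valid String value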

  2. In this section, I would define string after code unit and code point. I would also include a definition of surrogates/surrogate pairs.

Makes sense.

  3. Under "text interpretation" you say:

-- For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which can never represent characters).

This would (see above) benefit from having a definition in place. As noted, this is slightly incomplete, since surrogate code units are used to form supplementary characters.

The text is about surrogate code points, not about surrogate code units.

  4. 0xFFFE and 0xFFFF are non-characters in Unicode. I do think you do the right thing here. It's just a nit that you never note this ;-).

  5. Editorial, unnecessary ;-):

-- This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters.

  6. Under 'details' you suggest a number of renamings. Are these strictly necessary? The term 'character' could be taken to mean 'code point' instead, with an explanatory note.

Unfortunately, the term "character" is poisoned in ES5 by a redefinition as "code unit" (chapter 6). For ES6, I'd like the spec to be really clear where it means code units and where it means code points. Maybe we can then reintroduce "character" in ES7...

  7. Skipping down a lot, to "section 6 source text", you propose:

-- The text is expected to have been normalised to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15.

I think this should be removed or modified.

This sentence is essentially copied from ES5 (with corrected references), and as I copied it, I made a note to myself that we need to discuss normalization, just not as part of this proposal...

Automatic application of NFC is not always desirable, as it can affect presentation or processing. Perhaps:

-- Normalization of the text to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15, is recommended when transcoding from another character encoding.

  1. In "7.6 Identifier Names and Identifiers" you don't actually forbid unpaired surrogates or non-characters in the text (Identifier_Part:: does this by implication). Perhaps state it? Also, ZWJ and ZWNJ are permitted as the last character in an identifier.

I can add a note about surrogate code points and non-characters, but, as you say, they are already ruled out because they can't have the required Unicode properties ID_Start or ID_Continue.

The use of ZWJ and ZWNJ is unchanged from ES5. UAX 31 has much stricter rules on where they would be allowed, but I'm not sure we have a strong case for changing the rules in ECMAScript. www.unicode.org/reports/tr31/tr31-9.html#Layout_and_Format_Control_Characters

  1. "15.5.4.6": you say "(a nonnegative integer less than 0x10FFFF)", whereas it should say: "(a nonnegative integer less than or equal to 0x10FFFF)"

Will fix.

  10. In the section on "what about utf-32", you say: "... and the code points start at positions 1, 2, 3." Of course this should be "... and the code points start at positions 0, 1, 2."

Of course.

# Norbert Lindenberg (12 years ago)

It's easy to provide this function, but in which situations would it be useful? In most cases that I can think of you're interested in far more constrained definitions of validity:

  • what are valid ECMAScript identifiers?
  • what are valid BCP 47 language tags?
  • what are the characters allowed in a certain protocol?
  • what are the characters that my browser can render?

Thanks, Norbert

# Norbert Lindenberg (12 years ago)

On Mar 24, 2012, at 12:21 , David Herman wrote:

[snip]

As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

I haven't completely understood this part of the discussion. Looking at /u as a "little red switch" (LRS), i.e., an opportunity to make judicious breaks with compatibility, could we not allow character classes with unescaped non-BMP code points, e.g.:

js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u) ["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source of the regexp literal, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?

With /u, that's exactly what happens. My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.
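
To sketch the difference (assuming the proposed /u semantics; the no-flag behavior is what ES5 engines do today):

    // Without /u, the class is read as code units: \uD834, then the
    // out-of-order range \uDF06-\uD834 - a SyntaxError in ES5 engines:
    // "𝌆𝌇𝌈".match(/[𝌆-𝍖]+/);     // throws
    // With /u, each surrogate pair is treated as a single code point:
    "𝌆𝌇𝌈".match(/[𝌆-𝍖]+/u);       // ["𝌆𝌇𝌈"]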

# David Herman (12 years ago)

On Mar 24, 2012, at 4:32 PM, Norbert Lindenberg wrote:

One concern: I think code point based matching should be the default for regex literals within modules (where we know the code is written for Harmony).

This idea makes me nervous. Partly because I think we should keep the set of semantic changes between non-module code and module code reasonably small, and partly because the idea of your proposal is to continue to treat strings as sequences of 16-bit code units, not Unicode code points -- which means that quietly switching regexps to be closer to operating at the level of code points seems like it creates a kind of impedance mismatch. It feels more appropriate to me to require programmers to declare explicitly that they're dealing with a string at the level of code points, using the (quite concise) /u flag. That way they're saying "yes, I know this string is just a sequence of 16-bit code units, but it may contain non-BMP data, and I would like to match its contents with a regexp that deals with code points."

(Again, I'm still new to the finer points of Unicode, so I'm prepared to be shown I'm thinking about it wrong.)

# David Herman (12 years ago)

On Mar 24, 2012, at 11:23 PM, Norbert Lindenberg wrote:

On Mar 24, 2012, at 12:21 , David Herman wrote:

I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source of the regexp literal, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?

With /u, that's exactly what happens. My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.

Excellent!

Thanks,

# David Herman (12 years ago)

On Mar 24, 2012, at 2:30 PM, Wes Garland wrote:

On 24 March 2012 17:22, David Herman <dherman at mozilla.com> wrote: I'm not 100% clear on this point yet, but e.g. the SourceCharacter production in Annex A.1 is described as "any Unicode code unit."

Ugh, IMHO, that's wrong, and should be "any Unicode code point". (let the flames begin?)

That sounds nice in theory, but we can't change the past. Even with the BRS, there would still be a compatibility mode where it's code units.

The underlying transport format should not be a concern for the JS lexer.

eval

Eval is a red herring: its input is defined as the contents of the given String. So, we come full-circle back to "what's in a String?". I'm still partial to Brendan's BRS idea, because at least it fixes everything all at once.

I share Erik and others' concerns about the BRS. Working at the heap level sounds brittle to me. It seems like a lot of spec and implementation complexity, and it doesn't really have a good story for integrating legacy code and future code. I think the direction that Norbert, Erik, and Steven have been going is very promising.

# Roger Andrews (12 years ago)

I use something like String.isValid functionality in a transcoder that converts Strings to/from UTF-8, HTML Formdata (MIME type application/x-www-form-urlencoded -- not the same as URI encoding!), and Base64.

Admittedly these currently use 'encodeURI' to do the work, or it just drops out naturally when considering UTF-8 sequences.

(I considered testing the regexp /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/ against the input string.)

Maybe the function is too obscure for general use, although its presence does flag up the surrogate-pair issue to developers.
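
For reference, a minimal sketch of such a check built on the regexp above (String.isValid is only the name under discussion, so a plain function is used here):

    // true if the string is well-formed UTF-16, i.e., contains no
    // unpaired surrogates:
    function isWellFormedUTF16(s) {
        return /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/.test(s);
    }

    isWellFormedUTF16("\uD834\uDF06"); // true: a proper surrogate pair
    isWellFormedUTF16("\uD834");       // false: unpaired high surrogate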


From: "Norbert Lindenberg" <ecmascript at norbertlindenberg.com>

# Roger Andrews (12 years ago)

Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character literals? The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 code units.

Then the programmer can describe exactly the required characters without caring about their coding in UTF-16 or whatever.

Could you use this to avoid complicated things in RegExps like [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}], instead have things like [\U0001xxxx-\U0003yyyy] -- naturally expressing the characters of interest?

The same goes for String literals, where the programmer does not really care about the encoding, just specifying the character.

(Sorry if I've missed something in the prior discussion.)


From: "Norbert Lindenberg" To: "David Herman"

# Roger Andrews (12 years ago)

Just confirmed C/C++ do allow \Uxxxxxxxx escaped characters for non-BMP code points in string literals.

Interesting page at: publib.boulder.ibm.com/infocenter/comphelp/v7v91/topic/com.ibm.vacpp7a.doc/language/ref/clrc02unicode_standard.htm

So C/C++ has:

    \xNN        8-bit character (U+0000 - U+00FF)
    \uNNNN      16-bit character
    \UNNNNNNNN  32-bit character

This naturally expresses any character, without worrying about the UTF-16 or whatever encoding.


From: "Roger Andrews" To: "Norbert Lindenberg"

# Norbert Lindenberg (12 years ago)

JavaScript source today is a sequence of UTF-16 code units because that's what clause 6 of ES5 says and what most implementations do (V8/Node currently limits to UCS-2, but a fix for that is on the way): "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16."

Actual source code is normally encoded in UTF-8 or some legacy encoding, so it must be converted to UTF-16. The rest of the ES5 spec deals with source text in terms of code units, not in terms of code points.

The term "code point" is defined in clause 6 of ES5 (in a way that's slightly incompatible with the Unicode definition), but the only normative use is in relation to URI mappings in subclause 15.1.3, never in relation to source code.

Allen, Brendan, and I have proposed several ways to move to code point semantics in ES6, with each proposal representing a different trade-off between compatibility with existing code and ease of future development.

Norbert

# Norbert Lindenberg (12 years ago)

Perfectly valid concerns.

My thinking here is that normally applications want to deal with code points, but we force them to deal with UTF-16 and additional flags because we need them for compatibility. Within modules, where we know that compatibility is not an issue, I'd rather give applications by default what they need.

Looking back at Java, supporting supplementary characters was fairly painless for many applications despite UTF-16 because Java already had a rich API performing all kinds of operations on strings, so many applications had little need to look at individual characters in the first place. We went through the entire Java SE API and fixed all those operations to use code point semantics (look for "under the hood" at [1] for details). We were also able to switch regular expressions to code point semantics without any flags because regular expressions never worked on binary data and developers hadn't created funky workarounds to support supplementary characters yet. JavaScript today has more constraints, but for new development it would still be good to get as close as possible to that experience.

Norbert

[1] java.sun.com/developer/technicalArticles/Intl/Supplementary

# Norbert Lindenberg (12 years ago)

Let's see:

  • Conversion to UTF-8: If the string isn't well-formed, you wouldn't refuse to convert it, so isValid doesn't really help. You still have to look at all code units, and convert unpaired surrogates to the UTF-8 sequence for U+FFFD (see the sketch after this list).

  • Conversion from UTF-8: For security reasons, you have to check for well-formedness before conversion, in particular to catch non-shortest forms [1].

  • HTML form data: Same situation as conversion to UTF-8.

  • Base64 encodes binary data, so UTF-16 well-formedness rules don't apply.
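
The sketch referenced in the first bullet, implementing the default behavior described there (the helper name is hypothetical):

    // Replace unpaired surrogates with U+FFFD so the result can be
    // handed to any UTF-8 encoder:
    function replaceUnpairedSurrogates(s) {
        var out = "";
        for (var i = 0; i < s.length; i++) {
            var cu = s.charCodeAt(i);
            if (cu >= 0xD800 && cu <= 0xDBFF && i + 1 < s.length &&
                s.charCodeAt(i + 1) >= 0xDC00 && s.charCodeAt(i + 1) <= 0xDFFF) {
                out += s.substring(i, i + 2); // well-formed surrogate pair
                i++;
            } else if (cu >= 0xD800 && cu <= 0xDFFF) {
                out += "\uFFFD";              // unpaired surrogate
            } else {
                out += s.charAt(i);
            }
        }
        return out;
    }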

I don't think we'd add API just to flag an issue - that's what documentation is for.

Norbert

[1] www.unicode.org/reports/tr36/#UTF

# Norbert Lindenberg (12 years ago)

There is a strawman for code point escapes: strawman:full_unicode_source_code#unicode_escape_sequences

Note that for references to specific characters it's usually best to just use the characters directly, as Dave did in "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can be useful in cases such as regular expressions where you might have to refer to range limits that aren't actually assigned characters, or in test cases where you might use characters for which your OS doesn't have glyphs yet.

Norbert

# Roger Andrews (12 years ago)

The strawman is for source code characters, and says it has "no implications for string value encodings" (or RegExps). String & regexp literal escape sequences are explicitly defined in ES5 sections 7.8.4 & 7.8.5. Will Strawman style also work in ES6 string & regexp literals, thus making regexp ranges much nicer (see the final example below)?

As well as describing code points that have not yet been defined as characters, character escapes in string literals and regexps are good:

  1. control characters don't have glyphs at all,
  2. the various space glyphs are not readily distinguishable (same for some dash/minus/line glyphs),
  3. breaking/non-breaking versions of characters are not distinguishable,
  4. many other glyphs are hard to distinguish (being tiny adjustments in positioning or form detail),
  5. some characters are "combining" -- which makes for a messy and confusing program if you use them raw.

If you use the raw non-ASCII characters in a program then you need some means of creating them, preferably via a normal keyboard and in your favourite text editor. All program readers need appropriate fonts installed to fully understand the program, and program maintainers also need a Unicode-capable text editor (potentially including non-BMP support). All links/stores that the program travels over or rests in must be Unicode-capable. Whereas using only ASCII chars to write a program is easy to do and always works no matter how basic your computing/transmission infrastructure. (ASCII chars never get silently mangled in transmission or text editors.)

How to represent character escapes in a language. C/C++ has:

    \xNN        8-bit char (U+0000 - U+00FF)
    \uNNNN      16-bit char (U+0000 - U+FFFF)
    \UNNNNNNNN  32-bit char (i.e. any 21-bit Unicode char)

The Strawman for source chars has:

    \u{N...}    8- to 24-bit char (i.e. any 21-bit Unicode char)

I'm struggling with how non-BMP escapes would be used in practice in strings & regexps -- especially regexp ranges. Will Strawman style be used in string & regexp literals?

Considering U+1D307 (𝌇) as an example (where "𝌇" == "\uD834\uDF07").

To create the string "I like 𝌇" using escapes, in C/C++ you can write:

    "I like \U0001D307"

If the Strawman style works in strings, in ES6 presumably you say:

    "I like \u{1D307}"

or do you have to know UTF-16 encoding rules and say:

    "I like \uD834\uDF07"

To use U+1D307 (𝌇) and U+1D356 (𝍖) as a range in a regexp, i.e. /[𝌇-𝍖]/, should the programmer write C/C++ style:

    /[\U0001D307-\U0001D356]/

or will Strawman style work in regexps:

    /[\u{1D307}-\u{1D356}]/

or UTF-16 with {} grouping:

    /[{\uD834\uDF07}-{\uD834\uDF56}]/

Either C/C++ style or Strawman style escape is readable, natural, doesn't require knowledge of UTF-16 encoding rules, can be created easily with any old keyboard, and won't upset text editors.

It's a bit unfriendly to require programmers to know UTF-16 rules just to put a non-BMP character in a string or regexp using an escape. And in a regexp range it looks ugly and confusing.


From: "Norbert Lindenberg"

# Roger Andrews (12 years ago)

Maybe String.isValid is just not generally useful enough. I accept the point that you don't add APIs simply to flag an issue (there has to be weightier justification to carry the trifle).

PS: As for UTF-16 -> UTF-8 or HTML Formdata, I decided to follow encodeURI / encodeURIComponent's lead and throw an exception. Maybe that's the wrong thing to do?

My UTF-8 -> UTF-16 does check for well-formed UTF-8 because it seemed the right thing to do. Thanks for the link which explains why.

Base64 encodes 8-bit octets, so UTF-16 first gets converted to UTF-8, same issues as above really.


From: "Norbert Lindenberg"

# Steven Levithan (12 years ago)

Sorry for jumping between messages...

Roger Andrews wrote:

Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character literals. The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes.

Python uses the same syntax in regular expressions. But, as Norbert noted, there is already a strawman for \u{X..}. If it were adopted, I think it is clear that it should also be extended to RegExp literals. Of course, this adds some complication when referencing numbers above FFFF unless /u is made the default everywhere, since it implies code-point-based matching. E.g., what does /[^\0-\uFFFF\u{10000}]/ without /u match?

The example above also hints at additional potentially breaking changes for code point matching by default that haven't yet been discussed in this thread: that the meaning of negated character classes and shorthands would change, and that their match length may be 2 (like the dot).
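
For example (under the proposed code point semantics, with the code unit behavior shown for contrast):

    var astral = "\uD800\uDC00";     // U+10000 as two code units
    astral.match(/[^a]/g);           // ["\uD800", "\uDC00"]: two matches
    astral.match(/[^a]/gu);          // ["\uD800\uDC00"]: one match, length 2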

Roger Andrews wrote:

I'm struggling with how non-BMP escapes would be used in practice in strings & regexps -- especially regexp ranges. Will Strawman style be used in string & regexp literals? [...examples snipped]

I'm not sure whether this was already clear, but the curly braces I included in my paraphrasing of Norbert's proposed transformations were not meant to be included literally. I was trying to describe ranges between arbitrary code points, represented by pairs of high and low surrogates. As far as I understand, no existing proposal would allow a character class range written as /[{\uD834\uDF07}-{\uD834\uDF56}]/ to work correctly, with or without the curly braces. To match a range outside the BMP in a literal RegExp, you would have to use [<char>-<char>] (where <char> represents a literal character, and this may require flag /u to work) or [\u{X..}-\u{X..}] (assuming support for this syntax is added, and where X.. represents a hex number between 0 and at least 10FFFF).

Norbert Lindenberg wrote:

[...snip] My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.

Although I've argued the compatibility risk angle, on that point I should defer to implementers and others who might have a better sense of the scope of risk/damage to existing programs. More personally affecting, though, is the negative gut reaction I have to the well-thought-out but ugly and complicated (not so much in implementation, but for devs who have to learn about it) transformations that would otherwise be necessary to avoid breaking current regexes. And like David, I think just requiring /u is not so bad, especially since I'd want to use it for its other meanings anyway.

I'm also nervous about using different default semantics inside and out of ES6 modules, but David Herman has already well articulated my concerns and you've already responded, so I'll leave that discussion to you two except to say that if /u is added as a general "fix Unicode" flag, IMO it's reasonable to automatically turn it on in ES6 modules. That's because I think applying only code unit to code point mode switching in modules by default is too magical and confusing, but if it were described as turning on /u by default, that's easy to understand and explain.

Lasse Reichstein wrote:

Steven Levithan wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag: [...] 3. [New proposal] Makes /i use Unicode casefolding rules.

Yey, I'm for it :) Especially if it means dropping the rather naïve canonicalize function that can't canonicalize an ASCII character with a non-ASCII character.

That would be my hope as well.

Norbert Lindenberg wrote:

One concern: I think code point based matching should be the default for regex literals within modules (where we know the code is written for Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full Unicode sets for such literals?

As I said above, yes, I think it makes sense to apply all semantics of /u by default within modules. Previously in this thread, I detailed what \d\w\b\s mean in various regex flavors. The ones that give Unicode meanings by default are .NET and Perl, so ES would be in excellent regex company. Additionally, Java's \b (only) supports Unicode by default, as does ES's \s.

Norbert Lindenberg wrote:

In the other direction it's clear that using /u for \d\D\w\W\b\B has to imply code point mode.

Not if their meaning was limited to the BMP, which is already true for \D\W\B\S. /\D\D/.test(singleAstralNondigit) == true. With code point matching it would be false. Yet another reason to tie the multiple proposed meanings of /u together.

Norbert Lindenberg wrote:

Steven Levithan wrote:

  3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

We probably should review the complete Unicode Technical Standard #18, Unicode Regular Expressions, and see how we can upgrade RegExp for better Unicode support. Maybe on a separate thread...

Agreed. You may already be thinking this, but IMO if we're going to add /u as a Little Red Switch (as David called it), the priority should be on making sure that /u gets all aspects of Unicode-aware regular expression semantics done right, before looking at new features from UTS#18 like Unicode property matching.

-- Steven Levithan

# Erik Corry (12 years ago)

2012/3/26 Steven Levithan <steves_list at hotmail.com>:

Python uses the same syntax in regular expressions. But, as Norbert noted, there is already a strawman for \u{X..}. If it were adopted, I think it is clear that it should also be extended to RegExp literals. Of course, this adds some complication when referencing numbers above FFFF unless /u is made the default everywhere, since it implies code-point-based matching. E.g., what does /[^\0-\uFFFF\u{10000}]/ without /u match?

Without the /u flag it should behave exactly as it has done until now, for reasons of backwards compatibility. On V8 that means that

/[\u{10000}]/

is the same as

/[u01{}]/

# Gavin Barraclough (12 years ago)

I really like the direction you're going in, but have one minor concern relating to regular expressions.

In your proposal, you currently state: "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

I think this makes sense in the context of your original proposal, which seeks to be backwards compatible with existing regular expressions through the range transformations. But I'm concerned that this might prove problematic, and would suggest that if we're going to make unicode regexp match opt-in through a /u flag then instead it may be better to make unpaired surrogates in unicode expressions a syntax error.

My concern would be expressions such as:

    /[\uD800\uDC00\uDC00\uD800]/u

Under my reading of the current proposal, this could match any of "\uD800\uDC00", "\uD800", or "\uDC00". Allowing this seems to introduce the concept of precedence to character classes (given an input "\uD800\uDC00", should I choose to match "\uD800\uDC00" or "\uD800"?). It may also significantly complicate the implementation of backtracking if we were to allow this (if I have matched "\uD800\uDC00", should I step back by one code unit or two?).

It also just seems much clearer from a user perspective to say that non-unicode regular expressions match code units, unicode regular expressions match code points - mixing the two seems unhelpful.

If opt-in is automatic in modules, programmers will likely want an escape to be able to write non-unicode regular expressions, but I don't think this should warrant an extra flag. I don't think we can automatically change the behaviour of the RegExp constructor (without a "u" flag being passed), so RegExp("\uD800") should still be available to support non-unicode matching within modules.

# Erik Corry (12 years ago)

2012/3/26 Gavin Barraclough <barraclough at apple.com>:

Hi Norbert,

I really like the direction you're going in, but have one minor concern relating to regular expressions.

In your proposal, you currently state:        "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

I think this makes sense in the context of your original proposal, which seeks to be backwards compatible with existing regular expressions through the range transformations.  But I'm concerned that this might prove problematic, and would suggest that if we're going to make unicode regexp match opt-in through a /u flag then instead it may be better to make unpaired surrogates in unicode expressions a syntax error.

My concern would be expressions such as:        /[\uD800\uDC00\uDC00\uD800]/u Under my reading of the current proposal, this could match any of "\uD800\uDC00", "\uD800", or "\uDC00".  Allowing this seems to introduce the concept of precedence to character classes (given an input "\uD800\uDC00", should I choose to match "\uD800\uDC00" or "\uD800"?).  It may also significantly complicate the implementation of backtracking if we were to allow this (if I have matched "\uD800\uDC00", should I step back by one code unit or two?).

It also just seems much clearer from a user perspective to say that non-unicode regular expressions match code units, unicode regular expressions match code points - mixing the two seems unhelpful.

If opt-in is automatic in modules, programmers will likely want an escape to be able to write non-unicode regular expressions, but I don't think this should warrant an extra flag, I don't think we can automatically change the behaviour of the RegExp constructor (without a "u" flag being passed), so RegExp("\uD800") should still be available to support non-unicode matching within modules.

This is too nasty. The regexp constructor should not have to look up the stack to see what behaviour is expected of it.

# Gavin Barraclough (12 years ago)

On Mar 26, 2012, at 2:13 PM, Erik Corry wrote:

This is too nasty. The regexp constructor should not have to look up the stack to see what behaviour is expected of it.

I think you misunderstood me - I think we're saying the same thing. :-)

If we do imply the u flag for regexp literals in modules, and if we do make these regexps unable to match unpaired surrogates, then we may need to provide a method for programmers to create non-unicode aware regexps from within modules.

I was simply stating that since the regexp constructor isn't going to look up the stack to determine where it is being called from (we agree here), then a call to RegExp("\uD800") will create a non-unicode matching regexp, and as such a mechanism to create non-unicode regular expressions from within modules already exists. (If this weren't available we might have wanted to provide a symmetric flag to /u for regexp literals in modules to opt out of unicode matching, but given that calling the RegExp constructor is a convenient alternative I don't think this is necessary or desirable.)

# Glenn Adams (12 years ago)

On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough <barraclough at apple.com>wrote:

I really like the direction you're going in, but have one minor concern relating to regular expressions.

In your proposal, you currently state: "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

Just as a reminder, this would be in explicit violation of the Unicode conformance clause C1 unless it can be guaranteed that such a code point will not be interpreted as an abstract character:

C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.

[1] www.unicode.org/versions/Unicode6.1.0/ch03.pdf

Given that such guarantee is likely impractical, this presents a problem for the above proposed language.

# Steven Levithan (12 years ago)

Erik Corry wrote:

[...snip] what does /[^\0-\uFFFF\u{10000}]/ without /u match?

Without the /u flag it should behave exactly as it has done until now, for reasons of backwards compatibility. On V8 that means that /[\u{10000}]/ is the same as /[u01{}]/.

That sounds good. Not only does it avoid breaking from web-reality, it also means that regexes without /u don't need to use a weird mix of code unit and code point matching semantics, ever. To extend the backward compatible approach you prescribe here, the following should all be true when /u is not used:

  • /\u{10}/ eq /u{10}/ (literal u repeated 10 times).
  • Shorthand classes like \D, \S, and the dot match BMP code units only.
  • [^\0-\uFFFF] eq [] eq (?!) eq \b\B. (All of these are used in real-world regexes.)
  • If ES6 or later adds \p{..} for Unicode property matching, it's limited to matching BMP code units.

In other words, without /u, all matching is restricted to BMP code units. With /u, all matching is code point based and works with full 21-bit Unicode.
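
Spelled out (a sketch of the proposed split, not of any engine of the time):

    var astral = "\uD800\uDC00";       // U+10000 as two code units
    /[^\0-\uFFFF]/.test(astral);       // false: matches nothing without /u
    /[^\0-\uFFFF]/u.test(astral);      // true: full 21-bit Unicode with /u
    /\u{10}/.test("uuuuuuuuuu");       // true: \u{10} is "u" x 10 without /u
    /\u{10}/u.test("\u0010");          // true: a code point escape with /u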

This also provides another argument in favor of automatically implying /u in ES modules. It would be somewhat obnoxious to not let \u{..} work by default in modules.

-- Steven Levithan

# Roger Andrews (12 years ago)

Steven Levithan wrote:

[snip]

  • /\u{10}/ eq /u{10}/ (literal u repeated 10 times).

A point in favour of \Uxxxxxxxx over \u{x...} as a representation of character escapes? -- to avoid ambiguity in regexps.

# Steven Levithan (12 years ago)

Roger Andrews wrote:

Steven Levithan wrote:

[snip]

  • /\u{10}/ eq /u{10}/ (literal u repeated 10 times).

A point in favour of \Uxxxxxxxx over \u{x...} as a representation of character escapes? -- to avoid ambiguity in regexps.

No. For backcompat, /\Uxxxxxxxx/ must eq /Uxxxxxxxx/, without some kind of mode-based switching.

-- Steven Levithan

# Norbert Lindenberg (12 years ago)

OK, I guess we have to have Unicode code point escapes :-)

I'd expect them to work in identifiers, string literals, and regular expressions (possibly with restrictions coming out of today's emails), but not in JSON source.

Norbert

# Norbert Lindenberg (12 years ago)

I should have said "use appropriate error handling" instead of "convert unpaired surrogates to the UTF-8 sequence for U+FFFD". While using the replacement character is a reasonable default behavior, it's best to let the caller control the behavior. I'd assume that most callers would want as much information to pass through even if there's some stray unpaired surrogate in a string. If your converters just throw exceptions, then many callers will have to go through input strings themselves and remove unpaired surrogates that might have crept in.

For Base64, you could encode UTF-16 directly; you just have to make sure that encoder and decoder agree on the byte order.
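
A sketch of that direct encoding (big-endian byte order is assumed here; the helper name is hypothetical):

    // Serialize a string's UTF-16 code units to octets, high byte first.
    // Works for any String value, well-formed or not:
    function utf16beBytes(s) {
        var bytes = [];
        for (var i = 0; i < s.length; i++) {
            var cu = s.charCodeAt(i);
            bytes.push(cu >> 8, cu & 0xFF);
        }
        return bytes; // feed these octets to any Base64 encoder
    }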

Norbert

# Norbert Lindenberg (12 years ago)

On Mar 26, 2012, at 9:45 , Steven Levithan wrote:

Sorry for jumping between messages...

Roger Andrews wrote:

Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character literals. The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes.

Python uses the same syntax in regular expressions. But, as Norbert noted, there is already a strawman for \u{X..}. If it were adopted, I think it is clear that it should also be extended to RegExp literals. Of course, this adds some complication when referencing numbers above FFFF unless /u is made the default everywhere, since it implies code-point-based matching. E.g., what does /[^\0-\uFFFF\u{10000}]/ without /u match?

As long as the underlying system is UTF-16 based, I'd think \u{10000} is simply a different notation for \uD800\uDC00. But with code unit based matching that will not result in the intended behavior.

The example above also hints at additional potentially breaking changes for code point matching by default that haven't yet been discussed in this thread: that the meaning of negated character classes and shorthands would change, and that their match length may be 2 (like the dot).

Yes.

Roger Andrews wrote:

I'm struggling with how non-BMP escapes would be used in practice in strings & regexps -- especially regexp ranges. Will Strawman style be used in string & regexp literals? [...examples snipped]

I'm not sure whether this was already clear, but the curly braces I included in my paraphrasing of Norbert's proposed transformations were not meant to be included literally. I was trying to describe ranges between arbitrary code points, represented by pairs of high and low surrogates. As far as I understand, no existing proposal would allow a character class range written as /[{\uD834\uDF07}-{\uD834\uDF56}]/ to work correctly, with or without the curly braces. To match a range outside the BMP in a literal RegExp, you would have to use [<char>-<char>] (where <char> represents a literal character, and this may require flag /u to work) or [\u{X..}-\u{X..}] (assuming support for this syntax is added, and where X.. represents a hex number between 0 and at least 10FFFF).

In my proposal the following two regular expressions are equivalent:

/[𝌆-𝍖]+/u
/[\uD834\uDF06-\uD834\uDF56]+/u

They are made equivalent by the first preprocessing step proposed for 15.10.4.1 and the subsequent interpretation of UTF-16 sequences as code points.

I think I'd process Unicode code point escapes by first converting them to equivalent code unit escapes and then following the same path. This would make

/[\u{1D306}-\u{1D356}]+/u

equivalent to the two above.

Norbert Lindenberg wrote:

[...snip] My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.

Although I've argued the compatibility risk angle, on that point I should defer to implementers and others who might have a better sense of the scope of risk/damage to existing programs. More personally affecting, though, is the negative gut reaction I have to the well-thought-out but ugly and complicated (not so much in implementation, but for devs who have to learn about it) transformations that would otherwise be necessary to avoid breaking current regexes. And like David, I think just requiring /u is not so bad, especially since I'd want to use it for its other meanings anyway.

I'm also nervous about using different default semantics inside and out of ES6 modules, but David Herman has already well articulated my concerns and you've already responded, so I'll leave that discussion to you two except to say that if /u is added as a general "fix Unicode" flag, IMO it's reasonable to automatically turn it on in ES6 modules. That's because I think applying only code unit to code point mode switching in modules by default is too magical and confusing, but if it were described as turning on /u by default, that's easy to understand and explain.

Good input.

# Norbert Lindenberg (12 years ago)

The ugly world of web reality...

Actually, in V8, Firefox, Safari, and IE, /[\u{10000}]/ seems to be the same as /[\u01{}]/ - it matches "\u01{}u01". In Opera, it doesn't seem to match anything, but doesn't throw the specified SyntaxError either.

Do we know of any applications actually relying on these bugs, seeing that browsers don't agree on them?

For string literals, I see that most implementations correctly throw a SyntaxError when given "\u{10}". The exception here is V8.

Norbert

# Norbert Lindenberg (12 years ago)

On Mar 26, 2012, at 13:02 , Gavin Barraclough wrote:

Hi Norbert,

I really like the direction you're going in, but have one minor concern relating to regular expressions.

In your proposal, you currently state: "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

I think this makes sense in the context of your original proposal, which seeks to be backwards compatible with existing regular expressions through the range transformations. But I'm concerned that this might prove problematic, and would suggest that if we're going to make unicode regexp match opt-in through a /u flag then instead it may be better to make unpaired surrogates in unicode expressions a syntax error.

That's worth considering. It seems we're more and more moving towards two separate RegExp versions anyway - a legacy version based on code units and with all kinds of quirks, and an all-around better version based on code points. It means, however, that you can't easily remove unpaired surrogates by str.replace(/[\u{D800}-\u{DFFF}]/ug, "\u{FFFD}").

My concern would be expressions such as:

    /[\uD800\uDC00\uDC00\uD800]/u

Under my reading of the current proposal, this could match any of "\uD800\uDC00", "\uD800", or "\uDC00". Allowing this seems to introduce the concept of precedence to character classes (given an input "\uD800\uDC00", should I choose to match "\uD800\uDC00" or "\uD800"?). It may also significantly complicate the implementation of backtracking if we were to allow this (if I have matched "\uD800\uDC00", should I step back by one code unit or two?).

I think/hope that my specification is clear: a surrogate pair is always treated as one entity, not as two pieces. If the input is "\uD800\uDC00", you match "\uD800\uDC00". If you have to backtrack over "\uD800\uDC00", you step back two code units.

It also just seems much clearer from a user perspective to say that non-unicode regular expressions match code units, unicode regular expressions match code points - mixing the two seems unhelpful.

If opt-in is automatic in modules, programmers will likely want an escape to be able to write non-unicode regular expressions, but I don't think this should warrant an extra flag, I don't think we can automatically change the behaviour of the RegExp constructor (without a "u" flag being passed), so RegExp("\uD800") should still be available to support non-unicode matching within modules.

Agreed, especially after reading Erik's and your additional emails on this.

# Norbert Lindenberg (12 years ago)

The conformance clause doesn't say anything about the interpretation of (UTF-16) code units as code points. To check conformance with C1, you have to look at how the resulting code points are actually further interpreted.

My proposal interprets the resulting code points in the following ways:

  1. In regular expressions, they can be used in both patterns and input strings to be matched. They may be compared against other code points, or against character classes, some of which will hopefully soon be defined by Unicode properties. In the case of comparing against other code points, they can't match any code points assigned to abstract characters. In the case of Unicode properties, they'll typically fall into the large bucket of have-nots, along with other unassigned code points or, for example, U+FFFD, unless you ask for their general category.

  2. When parsing identifiers, they will not have the ID_Start or ID_Continue properties, so they'll be excluded, just like other unassigned code points or U+FFFD.

  3. In case conversion, they won't have upper case or lower case equivalents defined, and remain as is, as would happen for unassigned code points or U+FFFD.

I don't think any of these amounts to interpretation as abstract characters. I mention U+FFFD because the alternative interpretation of unpaired surrogates would be to replace them with U+FFFD, but that doesn't seem to improve anything.
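
Concretely, for the case conversion point (this is observable in ES5 implementations today):

    // Surrogate code points have no case mappings, so they pass through
    // unchanged, like other unassigned code points:
    "\uD800".toUpperCase() === "\uD800"; // true
    "\uFFFD".toLowerCase() === "\uFFFD"; // true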

Norbert

# Steven Levithan (12 years ago)

Norbert Lindenberg wrote:

The ugly world of web reality...

Actually, in V8, Firefox, Safari, and IE, /[\u{10000}]/ seems to be the same as /[\u01{}]/ - it matches "\u01{}u01". In Opera, it doesn't seem to match anything, but doesn't throw the specified SyntaxError either.

How did you test this? I get consistent results that agree with Erik in IE 9, Firefox 11, Chrome 17, and Safari 5.1:

"\u01{}".match(/[\u{10000}]/g); // ['u','0','1','{','}'] /\u{2}/g.test("uu"); // true

Opera, as you said, returns null and false (tested v11.6 and v10.0).

Do we know of any applications actually relying on these bugs, seeing that browsers don't agree on them?

Minus Opera, browsers do agree on them. Admirably so. And they aren't bugs--they're intentional breaks from ES for backcompat with earlier implementations that were themselves designed for backcompat with older non-ES regex behavior. The RegExp Match Web Reality proposal at harmony:regexp_match_web_reality says to add them to the spec, and Allen has said the web reality proposal should be the top RegExp priority for ES6.

I'd easily believe it's safe enough to change /[\u{n..}]/ because of the four-part sequence involved in \u + { + n.. + } that is fairly unlikely to appear in that specific order in a character class. But I'd have a harder time believing /\u{n..}/ is safe to change. It would of course be great to have some real data on the risks/damage.

For string literals, I see that most implementations correctly throw a SyntaxError when given "\u{10}". The exception here is V8.

I'm sure it would be safer to allow \u{n..} for string literals even if this fortunate SyntaxError wasn't thrown. Users haven't been trained to think of escaped nonmetacharacters as safe for string literals to the extent that they have for regexes, and you can't programmatically generate such escapes so easily as when passing to the RegExp constructor.

-- Steven Levithan

# Glenn Adams (12 years ago)

On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg < ecmascript at norbertlindenberg.com> wrote:

The conformance clause doesn't say anything about the interpretation of (UTF-16) code units as code points. To check conformance with C1, you have to look at how the resulting code points are actually further interpreted.

True, but if the proposed language

"A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value."

is adopted, then will this not have the effect of creating unpaired surrogates as code points? If so, then by my estimation, this will increase the likelihood of their being interpreted as abstract characters... e.g., if the unpaired code unit is interpreted as an unpaired surrogate code point, and some process/function performs any predicate or transform on that code point, then that amounts to interpreting it as an abstract character.

I would rather see such an unpaired code unit either (1) be mapped to U+FFFD, or (2) raise an exception when performing an operation that requires conversion of the UTF-16 code unit sequence.

# Glenn Adams (12 years ago)

On Tue, Mar 27, 2012 at 12:11 AM, Glenn Adams <glenn at skynav.com> wrote:

I don't think any of these amounts to interpretation as abstract characters. I mention U+FFFD because the alternative interpretation of unpaired surrogates would be to replace them with U+FFFD, but that doesn't seem to improve anything.

I would call it an improvement since it reduces the spread/contamination of/by unpaired surrogate code points. I'm not sure what other advantage would be great enough to trump the desire to prevent such contamination.

# Norbert Lindenberg (12 years ago)

On Mar 26, 2012, at 22:49 , Steven Levithan wrote:

Norbert Lindenberg wrote:

The ugly world of web reality...

Actually, in V8, Firefox, Safari, and IE, /[\u{10000}]/ seems to be the same as /[\u01{}]/ - it matches "\u01{}u01". In Opera, it doesn't seem to match anything, but doesn't throw the specified SyntaxError either.

How did you test this. I get consistent results that agree with Erik in IE 9, Firefox 11, Chrome 17, and Safari 5.1:

"\u01{}".match(/[\u{10000}]/g); // ['u','0','1','{','}'] /\u{2}/g.test("uu"); // true

Sorry, stupid mistake on my side. It is /[u01{}]/, as Erik said.

Opera, as you said, returns null and false (tested v11.6 and v10.0).

Do we know of any applications actually relying on these bugs, seeing that browsers don't agree on them?

Minus Opera, browsers do agree on them. Admirably so. And they aren't bugs--they're intentional breaks from ES for backcompat with earlier implementations that were themselves designed for backcompat with older non-ES regex behavior. The RegExp Match Web Reality proposal at harmony:regexp_match_web_reality says to add them to the spec, and Allen has said the web reality proposal should be the top RegExp priority for ES6.

Grumble. How about applying the web reality proposal only to old-style regex, not when /u is set or implied?

Norbert

# Steven Levithan (12 years ago)

The idea for /u and the following aspects of it already seem to have some consensus:

  • Switch from code unit to code point matching.
  • Make \d\w\b Unicode-aware.
  • Make /i use proper Unicode casefolding.
  • Enable \u{x..} (break from web reality).

Since /u may be a one-time opportunity to broadly change RegExp semantics, how about adding another change on the pile?

  • Break from web reality for escaped A-Z and a-z. Throw a SyntaxError when any letter not assigned a special meaning is escaped, instead of matching the literal character.

I.e., /\i/u etc. must throw a SyntaxError.

This is relevant to future Unicode support, because without breaking web reality we might never be able to add \p{..} and \P{..} for Unicode properties, \X for graphemes, \N{..} for named characters, etc.

Of course, this change would also make it easier to add any from a host of special escapes in other regex libraries (such as \k<..> for named backreferences) or new ES inventions. It's really ugly that such features might not be able to be added by default everywhere, but them's the breaks, I suppose (I hope I'm wrong).

We could go crazy and start fixing all of ES's RegExp warts when /u is applied, even though such changes would not be related to Unicode support. I'd be happy to pursue that, but I suspect many here would see it as a bridge too far.

Thoughts?

-- Steven Levithan

# Steven Levithan (12 years ago)

Norbert Lindenberg wrote:

Grumble. How about applying the web reality proposal only to old-style regex, not when /u is set or implied?

I'd support that. See my recent email along similar lines, where I suggested using /u as an opportunity to discuss all of ES's RegExp warts. Limiting this to not applying web reality would be a reasonable compromise that would at least allow for future letter escapes and finally kill RegExp octals (which overlap with backreferences, among other problems).

-- Steven Levithan

# Erik Corry (12 years ago)

2012/3/27 Steven Levithan <steves_list at hotmail.com>:

The idea for /u and the following aspects of it already seem to have some consensus:

  • Switch from code unit to code point matching.
  • Make \d\w\b Unicode-aware.

I think we should leave these alone. They are concise and useful and will continue to be so when /u is the default in Harmony code. Instead we should introduce \p{...} immediately which provides the same functionality.

  • Make /i use proper Unicode casefolding.
  • Enable \u{x..} (break from web reality).

Make unpaired surrogates in /u regexps a syntax error.

Add /U to mean old-style regexp literals in Harmony code (analogous to /s and /S which have opposite meanings).

Since /u may be a one-time opportunity to broadly change RegExp semantics, how about adding another change on the pile?

  • Break from web reality for escaped A-Z and a-z. Throw a SyntaxError when any letter not assigned a special meaning is escaped, instead of matching the literal character.

I.e., /\i/u etc. must throw a SyntaxError.

Yes.

Also we should consider the Perl syntax that allows you to switch on and off flags for only part of a regexp, so that case independence does not have to apply to the whole regexp.
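
For reference, the Perl notation Erik refers to (shown as comments only; none of this is valid ES syntax):

    // (?i)abc      - turns on case independence from this point on
    // (?i:abc)def  - "abc" matched case-insensitively, "def" exactly
    // (?-i:abc)    - turns case independence off within the group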

# Norbert Lindenberg (12 years ago)

Would you consider the ICU4J method com.ibm.icu.text.UTF16.charAt [1] to be in violation of C1 because it can return surrogate code points?

Or the ICU4J method com.ibm.icu.lang.UCharacter.isUUppercase [2] because it's a predicate that tells you that surrogate code points do not represent upper case characters?

Or the ICU4J method com.ibm.icu.lang.UCharacter.toUpperCase [3] because it's a transform that maps surrogate code points to themselves as their upper case form?

Norbert

[1] icu-project.org/apiref/icu4j/com/ibm/icu/text/UTF16.html#charAt(java.lang.CharSequence, int) [2] icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html#isUUppercase(int) [3] icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html#toUpperCase(int)

# Steven Levithan (12 years ago)

Erik Corry wrote:

Steven Levithan wrote:

  • Make \d\w\b Unicode-aware.

I think we should leave these alone. They are concise and useful and will continue to be so when /u is the default in Harmony code. Instead we should introduce \p{...} immediately which provides the same functionality.

\w and \b are broken without Unicode. ASCII \d is concise and useful, but so is [0-9]. Unicode-aware \b can't be emulated using \p{..} unless lookbehind is also added (which is tentatively approved for ES6 but could get delayed). Unicode-aware \w\b\d are required by UTS#18. If \w\b\d are not made Unicode-aware by /u, we won't easily be able to fix them in the future.
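
For illustration, the kind of emulation this refers to, assuming both \p{..} and lookbehind were available (neither exists in ES at this point; "word character" is approximated here as letters, digits, and underscore):

    // A Unicode-aware word boundary: a position where exactly one side
    // is a word character:
    var wordChar = "[\\p{L}\\p{N}_]";
    var uniB = "(?:(?<=" + wordChar + ")(?!" + wordChar + ")" +
               "|(?<!" + wordChar + ")(?=" + wordChar + "))";
    // e.g. new RegExp(uniB + "щ", "u") would match "щ" at a word start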

We went down this road before, and at the end you agreed that \w\b\d with /u should be Unicode aware. :/

I agree with adding \p{..} as soon as possible, with two caveats:

  • If I recall correctly, mobile browser implementers voiced concerns about overhead during the es4-discuss days.
  • It can easily be pushed down the road to ES7+.

Delaying /u, on the other hand, might mean also having to delay Norbert's work on code point matching, etc. Introducing \p{..} without code point matching would be nonideal. \p{..} might need to be delayed anyway to allow RegExp proposals already approved by TC39 (match web reality, lookbehind, flag /y), the flag /x strawman, and flag /u to be completed in time. For starters, it's not clear which properties \p{..} in ES would support, and there would be a number of other details to discuss, too.

Erik Corry wrote:

Make unpaired surrogates in /u regexps a syntax error.

Sounds good to me.

-- Steven Levithan

# Mark Davis ☕ (12 years ago)

That, as Norbert explained, is not the intention of the standard. Take a look at the discussion of "Unicode 16-bit string" in chapter 3. The committee recognized that fragments may be formed when working with UTF-16, and that destructive changes may do more harm than good.

    x = a.substring(0, 5) + b + a.substring(5, a.length);
    y = x.substring(0, 5) + x.substring(6, x.length);

After this operation is done, you want y == a, even if index 5 falls in the middle of a surrogate pair (i.e., between a lead surrogate in D800..DBFF and its trail surrogate in DC00..DFFF).

Or take:

    output = "";
    for (var i = 0; i < s.length; i++) {
      var ch = s.charAt(i);
      if (ch == '&') {
        ch = '@';
      }
      output += ch;
    }

After this operation is done, you want "a&\u{10000}b" to become "a@\u{10000}b", not "a&\u{FFFD}\u{FFFD}b". It is also an unnecessary burden on lower-level software to always check this stuff.

Of course, when you convert to UTF-16 (or UTF-8 or UTF-32) for storage or output, then you do need to either convert unpaired surrogates to FFFD or take some other action.
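
A minimal sketch of that conversion step, assuming it happens in user code just before encoding (the helper name is mine):

    // Hypothetical helper: replace unpaired surrogates with U+FFFD
    // before handing the string to an encoder or output channel.
    function sanitizeForOutput(s) {
      var out = "";
      for (var i = 0; i < s.length; i++) {
        var cu = s.charCodeAt(i);
        var next = (i + 1 < s.length) ? s.charCodeAt(i + 1) : 0;
        if (cu >= 0xD800 && cu <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
          out += s.substr(i, 2);   // well-formed surrogate pair: keep
          i++;
        } else if (cu >= 0xD800 && cu <= 0xDFFF) {
          out += "\uFFFD";         // unpaired surrogate: replace
        } else {
          out += s.charAt(i);      // ordinary BMP code unit: keep
        }
      }
      return out;
    }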


Mark plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene —

# Glenn Adams (12 years ago)

On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <mark at macchiato.com> wrote:

That, as Norbert explained, is not the intention of the standard. Take a look at the discussion of "Unicode 16-bit string" in chapter 3. The committee recognized that fragments may be formed when working with UTF-16, and that destructive changes may do more harm than good.

    x = a.substring(0, 5) + b + a.substring(5, a.length);
    y = x.substring(0, 5) + x.substring(6, x.length);

After this operation is done, you want y == a, even if index 5 falls in the middle of a surrogate pair (i.e., between a lead surrogate in D800..DBFF and its trail surrogate in DC00..DFFF).

Assuming that b.length == 1 in this example, my interpretation is that '=', '+', and 'substring' are operations whose domain and codomain are (currently defined) ES Strings, namely sequences of UTF-16 code units. Since none of these operations entails interpreting the semantics of a code point (i.e., interpreting abstract characters), there is no violation of C1 here.

Or take:

output = ""; for (int i = 0; i < s.length(); ++i) { ch = s.charAt(i); if (ch.equals('&')) { ch = '@'; } output += ch; }

After this operation is done, you want "a&\u{10000}b" to become "a@\u{10000}b", not "a&\u{FFFD}\u{FFFD}b". It is also an unnecessary burden on lower-level software to always check this stuff.

Again, in this example, I assume that the string literal "a&\u{10000}b" maps to the UTF-16 code unit sequence:

0061 0026 D800 DC00 0062

Given that 'charAt(i)' is defined on (and is indexing) code units and not code points, and since the 'equals' operator is also defined on code units, this example also does not require interpreting the semantics of code points (i.e., interpreting abstract characters).

However, in Norbert's questions above about isUUppercase(int) and toUpperCase(int), it is clear that the domain of these operations is code points, not code units, and further, that they require interpreting those code points as abstract characters in order to determine the semantics of the corresponding characters.

My conclusion is that the determination of whether C1 is violated or not depends upon the domain, codomain, and operation being considered.

# Mark Davis ☕ (12 years ago)

That would be neither practical nor predictable. And note that the 700K reserved code points are also not to be interpreted as characters; by your logic, all of them would need to be converted to FFFD.

And in practice, an unpaired surrogate is best treated just like a reserved (unassigned) code point. For example, a lowercase operation should convert characters that have lowercase counterparts to those counterparts, and leave everything else alone: control characters, format characters, reserved code points, surrogates, etc.
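
A sketch of that policy at the single-code-point level; simpleLowercaseMapping here is an assumed UCD-backed lookup, not a real ES API:

    // Hypothetical: lowercase one code point, leaving code points
    // without a lowercase mapping (control characters, format
    // characters, reserved code points, unpaired surrogates) unchanged.
    function lowerCodePoint(cp) {
      var mapped = simpleLowercaseMapping(cp);   // assumed UCD lookup
      return (mapped !== undefined) ? mapped : cp;
    }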


Mark plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene —

# Glenn Adams (12 years ago)

This raises the question of what the point of C1 is.

# Mark Davis ☕ (12 years ago)

The point of C1 is that you can't interpret the surrogate code point U+DC00 as a character, like an "a".

Neither can you interpret the reserved code point U+0378 as a character, like a "b".


Mark plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene —

# Gavin Barraclough (12 years ago)

On Mar 26, 2012, at 11:57 PM, Erik Corry wrote:

Add /U to mean old-style regexp literals in Harmony code (analogous to \s and \S, which have opposite meanings).

Are we sure this has enough utility to be worth adding? It seems unlikely that programmers will often have cause to explicitly opt out of correct Unicode support (since little consideration usually seems to be given to this topic), and as discussed previously, a mechanism to do so already exists if they need it (RegExp("foo") will behave the same as the proposed /foo/U). If we do add a 'U' flag, I'd worry that it may end up more commonly being used in error when people intended to append a 'u'!
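
To spell out the existing opt-out mechanism Gavin mentions (per the proposal's semantics as described earlier in this thread):

    // The RegExp constructor keeps old-style (code unit) semantics,
    // so it already serves as the opt-out that a /U flag would provide.
    var a = /foo/U;              // the proposed explicit opt-out flag
    var b = new RegExp("foo");   // existing opt-out; behaves the same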

# Glenn Adams (12 years ago)

So, if as a result of a policy of converting any UTF-16 code unit sequence to a code point sequence one ends up with an unpaired surrogate, e.g., "\u{00DC00}", then performing a predicate on that code point, such as described in D21 (e.g., IsAlphabetic), would entail interpreting it as an abstract character?

I can see that D20 defines code point properties which would not entail interpreting as an abstract character, e.g., IsSurrogate, IsNonCharacter, but where does one draw the line?

# Mark Davis ☕ (12 years ago)

performing a predicate on that code point, such as described in D21 (e.g., IsAlphabetic), would entail interpreting it as an abstract character?

No.

but where does one draw the line?

The line is already drawn by the Unicode consortium, by consulting the Unicode Character Database properties. If you look at the data in the Unicode Character Database for any particular property, say Alphabetic, you'll find that surrogate code points are not included where the property is a true character property. There are a few special cases where reserved code points are provisionally given "anticipatory" character properties, such as in bidi ranges, simply because that makes implementations more forward-compatible, but there aren't any cases where a "character" property applies to a surrogate code point (other than by returning "No", or "n/a", or some such).
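
For example, the Alphabetic ranges in the UCD's DerivedCoreProperties.txt take the following shape (excerpt; formatting approximate), and no range there covers the surrogate block D800..DFFF:

    0041..005A    ; Alphabetic  # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
    0061..007A    ; Alphabetic  # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z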


Mark plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene —

# Glenn Adams (12 years ago)

OK, I'll accept your position at this point and drop my comment; I suppose it is true that if there are already unpaired surrogates in user data as UTF-16, then having unpaired surrogates as code points is no worse.

However, it would be useful if there were an informative pointer from the spec under consideration to a UTC-sanctioned list of operations that constitute "interpreting as abstract characters" and that, if used on such data, would possibly violate C1. To this end, it would be useful if C1 itself included a concrete example of such an operation.