Unicode support in new ES6 spec draft
Commenting on Norbert's comments…
Rich's comment was on the lack of any version number for ISO 10646, not on the Unicode version number. We can simplify the statement in clause 2 to "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard and ISO/IEC 10646, both in the versions referenced in clause 3."
I like it.
We should stay away from the terms "character", "Unicode character", or "Unicode scalar value" however.
Points taken. I'll withdraw my suggestion of "Unicode scalar value." As for "character," I understand where you're coming from, but I don't think it's really all that bad to use "character," assuming we define precisely what we mean by it. (The fact that Unicode itself doesn't have a formal definition of "character" helps.)
"Unicode code point" and "UTF-16 code unit" are more precise, with full definitions in the Unicode standard, but "character" may make reading easier for the non-Unicode geeks in the audience, at least in situations where the precise meaning isn't important.
This paragraph is really about the fact that some implementations will support Unicode 6.1 or later by the time ES6 becomes a standard, while others will be stuck at Unicode 5.1. Using characters that were introduced in Unicode 6.1 in identifiers would mean that the application only runs on implementations based on Unicode 6.1 or higher, not on those based on Unicode 6.0 or lower.
Fair enough.
Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
What I was getting, perhaps erroneously, from Allen's comments is that \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} are all equivalent, all the time, and in all situations where the implementation assigns meaning to the characters, they all refer to the character U+10000. This seems like the right way to go.
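For reference, the equivalence being discussed is just the standard UTF-16 surrogate-pair decomposition; a small sketch (using the formula from the Unicode standard) shows why \ud800\udc00 and \u{10000} denote the same code point:

```javascript
// Standard surrogate-pair decoding: combine a high and a low surrogate
// into the supplementary code point they encode.
const hi = 0xD800, lo = 0xDC00;
const cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
console.log(cp.toString(16)); // "10000"

// In string literals, the two spellings already produce the same string.
console.log("\ud800\udc00" === "\u{10000}"); // true
```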
fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?
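A small sketch of the redundancy Norbert describes: fromCodePoint accepts lone surrogate code points, so any sequence a hypothetical fromCodeUnit would take produces the same string through fromCodePoint (the emoji code point here is an arbitrary supplementary character):

```javascript
// Passing the two surrogate code points individually...
const viaPair = String.fromCodePoint(0xD83D, 0xDE00);
// ...or the supplementary code point directly...
const direct = String.fromCodePoint(0x1F600);
// ...yields the same UTF-16 string: the conversion erases the
// distinction between surrogate code points and surrogate code units.
console.log(viaPair === direct); // true
```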
codeUnitAt is clearly a better name than charAt (especially in a language that doesn't have a char data type), but since we can't get rid of charAt, I'm not sure it's worth adding codeUnitAt.
I still think it helps.
I'm not aware of any char32At function in Java. Do you mean codePointAt? That's in both Java and the ES6 draft.
I'd have to go back and look at the doc, but yeah: I probably mean codePointAt().
- p. 220, §§15.5.4.17 and 15.5.4.19: Maybe this is a question for Norbert: Are we allowing somewhere for versions of toLocaleUpperCase() and toLocaleLowerCase() that let you specify the locale as a parameter instead of just using the host environment's default locale?
this is covered by the I18N API spec. Right?
It's not in the Internationalization API edition 1, but seems a prime candidate for edition 2.
I agree.
[I]t looks like [String.prototype.codePointAt()] only works right with surrogate pairs if you specify the position of the first surrogate in the pair. I think you want it to work right if you specify the position of either element in the pair.
Norbert proposed this function so we should get his thoughts on the addressing issue. As I wrote this I did think a bit about whether or not we need to provide some support for backward iteration over strings.
Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
Why is it intentional? I don't see the value in restricting it. You've mentioned you're optimizing for the forward-iteration case and want to have a separate API for the backward-iteration case. What about the random-access case? Is there no such case? Worse, it seems like if you use this API for backward iteration or random access, you don't get an error; you just get the wrong answer, and that seems dangerous. [I guess the "wrong answer" is an unpaired surrogate value, which would tip the caller off that he's doing something wrong, but that still seems like extra code he'd need to write.]
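The behavior Rich is worried about can be shown concretely (a sketch assuming the codePointAt semantics under discussion, where only the position of the leading surrogate yields the full code point):

```javascript
const s = "\u{10000}"; // stored internally as the pair \uD800\uDC00

// Correct position (leading surrogate): full code point.
console.log(s.codePointAt(0).toString(16)); // "10000"

// Position of the trailing surrogate: the lone surrogate value comes
// back with no error -- the "wrong answer" Rich describes.
console.log(s.codePointAt(1).toString(16)); // "dc00"
```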
In order to support backwards iteration (which is sometimes used), we should have codePointBefore.
Mark
plus.google.com/114199149796022210033
— Il meglio è l'inimico del bene — ("The best is the enemy of the good")
On Jul 16, 2012, at 2:54 PM, Gillam, Richard wrote:
Commenting on Norbert's comments… ...
Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
What I was getting, perhaps erroneously, from Allen's comments is that \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} are all equivalent, all the time, and in all situations where the implementation assigns meaning to the characters, they all refer to the character U+10000. This seems like the right way to go.
To further clarify my position: I don't currently agree with Norbert's assertion WRT "situations". For more discussion see ecmascript#469, ecmascript#525.
The current spec draft (except where RegExp still needs updating) treats \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} as equivalent in all literal situations. As currently spec'ed, explicit UTF-16 escape sequences such as \ud800\udc00 are not decoded as a single code point in non-literal contexts such as identifiers. Such sequences currently generate errors in existing implementations, so there aren't any backwards compatibility issues.
I'm taking this position because I want to discourage programmers from hand-encoding UTF-16 and instead have them use \u{} to express actual code points for supplementary characters. For backwards compat, sequences such as \uDnnn\uDnnn need to be recognized in literals, but there is no need to allow them where they have not been allowed in existing implementations.
fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?
Depends upon how it is spec'ed. The legacy String.fromCharCode clamps values using ToUint16 but does not reject them. The current spec for fromCodePoint throws for values > 0x10FFFF.
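A sketch of the difference Allen describes, contrasting the legacy clamping behavior with the range check in the draft:

```javascript
// Legacy behavior: fromCharCode reduces its argument modulo 2^16
// (the ToUint16 abstract operation) rather than rejecting it.
const clamped = String.fromCharCode(0x10031); // 0x10031 mod 0x10000 = 0x31
console.log(clamped === "1"); // true

// Draft behavior: fromCodePoint rejects values outside the Unicode range.
let threw = false;
try {
  String.fromCodePoint(0x110000); // one past the last code point
} catch (e) {
  threw = e instanceof RangeError;
}
console.log(threw); // true
```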
codeUnitAt is clearly a better name than charAt (especially in a language that doesn't have a char data type), but since we can't get rid of charAt, I'm not sure it's worth adding codeUnitAt.
I still think it helps.
Note that charAt returns a string value, not a numeric code unit value. charCodeAt is used to retrieve a numeric code unit value at a specific string index position. The proposed codePointAt also returns a numeric code point. What is missing from this discussion is a method that returns a string value and which correctly interprets surrogate pairs.
...
On Jul 16, 2012, at 2:57 PM, Mark Davis ☕ wrote:
In order to support backwards iteration (which is sometimes used), we should have codePointBefore.
or we can provide a backwards iterator that knows how to parse surrogate pairs: for (let c of str.backwards) ...
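A minimal sketch of what such a backwards iterator might look like ("str.backwards" is hypothetical; this standalone generator just shows the surrogate-aware reverse walk):

```javascript
// Yield the code points of a string in reverse order, keeping
// surrogate pairs intact.
function* backwards(str) {
  for (let i = str.length; i > 0; ) {
    let code = str.charCodeAt(--i);
    // On a trailing surrogate, check for a leading surrogate before it;
    // if found, back up one more unit and combine the pair.
    if (code >= 0xDC00 && code <= 0xDFFF && i > 0) {
      const hi = str.charCodeAt(i - 1);
      if (hi >= 0xD800 && hi <= 0xDBFF) {
        i--;
        code = 0x10000 + ((hi - 0xD800) << 10) + (code - 0xDC00);
      }
    }
    yield String.fromCodePoint(code);
  }
}

console.log([...backwards("a\u{10000}b")]); // ["b", "\u{10000}", "a"]
```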
And more comments…
On Jul 16, 2012, at 14:54 , Gillam, Richard wrote:
Commenting on Norbert's comments…
[…]
We should stay away from the terms "character", "Unicode character", or "Unicode scalar value" however.
Points taken. I'll withdraw my suggestion of "Unicode scalar value." As for "character," I understand where you're coming from, but I don't think it's really all that bad to use "character," assuming we define precisely what we mean by it. (The fact that Unicode itself doesn't have a formal definition of "character" helps.)
"Unicode code point" and "UTF-16 code unit" are more precise, with full definitions in the Unicode standard, but "character" may make reading easier for the non-Unicode geeks in the audience, at least in situations where the precise meaning isn't important.
In bug 524, I softened this a bit: "The term "Unicode character" can be used when only assigned characters are meant, e.g., when referring to individual characters such as "comma" or "reverse solidus", or to the characters that can be used in identifiers." ecmascript#524
fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?
True. Is this important enough to application developers to warrant a separate method?
[I]t looks like [String.prototype.codePointAt()] only works right with surrogate pairs if you specify the position of the first surrogate in the pair. I think you want it to work right if you specify the position of either element in the pair.
Norbert proposed this function so we should get his thoughts on the addressing issue. As I wrote this I did think a bit about whether or not we need to provide some support for backward iteration over strings.
Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
Why is it intentional? I don't see the value in restricting it. You've mentioned you're optimizing for the forward-iteration case and want to have a separate API for the backward-iteration case. What about the random-access case? Is there no such case? Worse, it seems like if you use this API for backward iteration or random access, you don't get an error; you just get the wrong answer, and that seems dangerous. [I guess the "wrong answer" is an unpaired surrogate value, which would tip the caller off that he's doing something wrong, but that still seems like extra code he'd need to write.]
I think the question is the other way round: Is there a valid and common use case that requires random access?
What's the use case for backwards iteration, in the sense of looking at all code points in a string in reverse order?
Looking at an individual code point before a given index is necessary, for example, when checking word boundaries, so I can see the need for codePointBefore.
Norbert
On Jul 16, 2012, at 16:41 , Allen Wirfs-Brock wrote:
On Jul 16, 2012, at 2:54 PM, Gillam, Richard wrote:
Commenting on Norbert's comments… ...
Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
What I was getting, perhaps erroneously, from Allen's comments is that \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} are all equivalent, all the time, and in all situations where the implementation assigns meaning to the characters, they all refer to the character U+10000. This seems like the right way to go.
To further clarify my position: I don't currently agree with Norbert's assertion WRT "situations". For more discussion see ecmascript#469, ecmascript#525.
The current spec draft (except where RegExp still needs updating) treats \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} as equivalent in all literal situations. As currently spec'ed, explicit UTF-16 escape sequences such as \ud800\udc00 are not decoded as a single code point in non-literal contexts such as identifiers. Such sequences currently generate errors in existing implementations, so there aren't any backwards compatibility issues.
I'm taking this position because I want to discourage programmers from hand-encoding UTF-16 and instead have them use \u{} to express actual code points for supplementary characters. For backwards compat, sequences such as \uDnnn\uDnnn need to be recognized in literals, but there is no need to allow them where they have not been allowed in existing implementations.
It's not a backwards compatibility issue.
It's an issue with having to explain to developers that sometimes \u{10000} and \ud800\udc00 are equivalent, and sometimes they're not; the kind of inconsistency that makes it more difficult to understand the language.
And it may be an issue with tools that convert non-ASCII characters into (old-style) Unicode escapes.
Allen Wirfs-Brock wrote:
On Jul 16, 2012, at 2:57 PM, Mark Davis ☕ wrote:
In order to support backwards iteration (which is sometimes used), we should have codePointBefore.
or we can provide a backwards iterator that knows how to parse surrogate pairs: for (let c of str.backwards) ...
Allen
Kind of a spin-off, but I think a String.prototype.reverse that avoids
s.split('').reverse().join('')
overhead and the ES6 Unicode hazard of splitting on a code unit boundary would be swell. It's tiny and matches Array.prototype.reverse, but of course without observable in-place mutation.
It wouldn't relieve all use-cases for reverse iteration, but we have iterators and for-of in ES6, we should use 'em.
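The hazard Brendan mentions can be illustrated with a sketch; note that even a code-point-correct reversal still mangles multi-code-point grapheme clusters (combining marks and the like):

```javascript
const s = "a\u{1D15F}"; // "a" + musical quarter note (a surrogate pair)

// split('') splits on code units, so the reversal puts the surrogate
// halves in the wrong order, corrupting the pair.
const broken = s.split('').reverse().join('');

// Spreading the string uses the ES6 string iterator, which walks code
// points, so pairs survive the reversal.
const ok = [...s].reverse().join('');

console.log(ok === "\u{1D15F}a"); // true
console.log(broken === ok);       // false
```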
We agreed in November not to add String.prototype.reverse because there was no compelling use case for it. Is there now? esdiscuss/2011-November/018581
Norbert
Norbert Lindenberg wrote:
We agreed in November not to add String.prototype.reverse because there was no compelling use case for it. Is there now?
Thanks, how soon I forgot!
It may be that no more compelling a case exists now than existed then. However, Allen mentioned forward Unicode-aware iteration over the result of (I'll call it) s.reverse() as potentially taking away some of the use-case-based need for a Unicode-aware backward iterator.
How common is backward iteration in post-"UCS2" Java, anyone know?
A string reversal is not exactly a high-runner API, and the simple codepoint reversal will have pretty bad results where grapheme-cluster ≠ single code point.
On Tue, Jul 17, 2012 at 10:23 PM, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
To further clarify my position: I don't currently agree with Norbert's assertion WRT "situations". For more discussion see ecmascript#469, ecmascript#525.
The current spec draft (except where RegExp still needs updating) treats \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} as equivalent in all literal situations. As currently spec'ed, explicit UTF-16 escape sequences such as \ud800\udc00 are not decoded as a single code point in non-literal contexts such as identifiers. Such sequences currently generate errors in existing implementations, so there aren't any backwards compatibility issues.
I'm taking this position because I want to discourage programmers from hand-encoding UTF-16 and instead have them use \u{} to express actual code points for supplementary characters. For backwards compat, sequences such as \uDnnn\uDnnn need to be recognized in literals, but there is no need to allow them where they have not been allowed in existing implementations.
It's not a backwards compatibility issue.
It's an issue with having to explain to developers that sometimes \u{10000} and \ud800\udc00 are equivalent, and sometimes they're not; the kind of inconsistency that makes it more difficult to understand the language.
And it may be an issue with tools that convert non-ASCII characters into (old-style) Unicode escapes.
As stated in ecmascript#469, I agree with Norbert’s sentiments here. With ES 5.1 it’s perfectly possible to make a list of all Unicode characters that are allowed/disallowed in IdentifierStart or IdentifierPart. So, you could say:
The non-BMP symbol U+2F800 CJK Compatibility Ideograph (`丽`) is disallowed in identifier names, even though it’s in the [Lo] category:

var \uD87E\uDC00; // SyntaxError, as the surrogate halves don’t match any of the allowed Unicode categories
var 丽; // SyntaxError, as this is equivalent to the above
The latest ES 6 draft makes this far more complicated:
The non-BMP symbol U+2F800 CJK Compatibility Ideograph (`丽`) is disallowed in identifier names only if it’s written out using simple Unicode escape sequences for the separate surrogate halves:

var \uD87E\uDC00; // SyntaxError, as the surrogate halves don’t match any of the allowed Unicode categories
var 丽; // allowed
var \u{2F800}; // allowed
It’s no longer possible to say “this symbol is allowed/disallowed” as it depends on the way the symbol is represented.
Please consider fixing this inconsistency.
On Jul 18, 2012, at 19:42 , Gillam, Richard wrote:
But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?
True. Is this important enough to application developers to warrant a separate method?
I tend to think so. It seems like I ought to be able to pass some function the value 0x1d15f and get back a string containing the quarter-note character; otherwise, we're still privileging BMP characters.
Well, the function that does that is String.fromCodePoint(). The question is, is there enough value to warrant a separate fromCodeUnit() method.
Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
Why is it intentional? I don't see the value in restricting it. You've mentioned you're optimizing for the forward-iteration case and want to have a separate API for the backward-iteration case. What about the random-access case? Is there no such case? Worse, it seems like if you use this API for backward iteration or random access, you don't get an error; you just get the wrong answer, and that seems dangerous. [I guess the "wrong answer" is an unpaired surrogate value, which would tip the caller off that he's doing something wrong, but that still seems like extra code he'd need to write.]
I think the question is the other way round: Is there a valid and common use case that requires random access?
If there isn't, then I don't think we should have an API for random access; we should just have iterators. This API looks like it supports random access, but it really doesn't.
I misunderstood how you meant random access - I thought you meant truly random positions for which you don't know whether they're for the first or the second code unit of a code point. It's this kind of randomness that I don't expect to see in practice.
You meant random as opposed to fully sequential, that is, functions that operate on a small section of a larger string. That does happen a lot - examples: a function to find the next token in program source code; a function that has to find word breaks or hyphenation points around the position where a line overflows; a subexpression matcher of a regular expression. But in these cases it's reasonable to require all code involved to maintain the invariant that positions passed around are always those of the first code unit of a code point, never the second of a supplementary code point.
Come to think of it, I'm not quite sure how to use it for forward iteration-- wouldn't you want some kind of "codePointAfter" or "indexOfNextChar" function? What am I missing here?
Iteration has to increment the position by 2 if the code point is greater than 0xFFFF - that's what the iterator does internally: norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#String
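The forward-iteration step Norbert describes can be sketched as a small generator (essentially what the linked iterator does internally):

```javascript
// Walk the code points of a string forward: read the code point at the
// current index, then advance by 2 code units if it was supplementary.
function* codePoints(str) {
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    yield cp;
    i += cp > 0xFFFF ? 2 : 1;
  }
}

console.log([...codePoints("a\u{10000}b")].map(c => c.toString(16)));
// ["61", "10000", "62"]
```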
Norbert--
I tend to think so. It seems like I ought to be able to pass some function the value 0x1d15f and get back a string containing the quarter-note character; otherwise, we're still privileging BMP characters.
Well, the function that does that is String.fromCodePoint(). The question is, is there enough value to warrant a separate fromCodeUnit() method.
Sorry, my mistake. In that case, yes, you're right-- there probably isn't much value in a separate fromCodeUnit() method.
I misunderstood how you meant random access - I thought you meant truly random positions for which you don't know whether they're for the first or the second code unit of a code point. It's this kind of randomness that I don't expect to see in practice.
You meant random as opposed to fully sequential, that is, functions that operate on a small section of a larger string. That does happen a lot - examples: a function to find the next token in program source code; a function that has to find word breaks or hyphenation points around the position where a line overflows; a subexpression matcher of a regular expression. But in these cases it's reasonable to require all code involved to maintain the invariant that positions passed around are always those of the first code unit of a code point, never the second of a supplementary code point.
No, we're talking past each other. I did mean truly "random" access. And while there may indeed not be a use case for it, you're proposing an API that lets the caller pass an arbitrary offset into a string as input and get back the code value at that position, but it doesn't actually do that. If the position I happen to pick is a trailing surrogate, I just get that value back, not the value of the underlying character. This feels wrong to me.
If I know what's going on, I can use the unpaired surrogate value as an indication I should back up one space and try again, but that requires I know what's going on. I guess what I'm questioning is whether most developers will understand the encoding well enough to know to do this. If we think anyone using this API is likely to understand Unicode well enough to know this, and that it's also highly unlikely anyone would be using this API that way in the first place, then maybe it's okay, but it still kind of feels like a hole to me.
Come to think of it, I'm not quite sure how to use it for forward iteration-- wouldn't you want some kind of "codePointAfter" or "indexOfNextChar" function? What am I missing here?
Iteration has to increment the position by 2 if the code point is greater than 0xFFFF - that's what the iterator does internally:
That's what I thought. Again, this requires that the caller know how the encoding works. And again, maybe that's okay, but it makes me a little uncomfortable.
On Jul 19, 2012, at 16:26 , Gillam, Richard wrote:
I misunderstood how you meant random access - I thought you meant truly random positions for which you don't know whether they're for the first or the second code unit of a code point. It's this kind of randomness that I don't expect to see in practice.
You meant random as opposed to fully sequential, that is, functions that operate on a small section of a larger string. That does happen a lot - examples: a function to find the next token in program source code; a function that has to find word breaks or hyphenation points around the position where a line overflows; a subexpression matcher of a regular expression. But in these cases it's reasonable to require all code involved to maintain the invariant that positions passed around are always those of the first code unit of a code point, never the second of a supplementary code point.
No, we're talking past each other. I did mean truly "random" access. And while there may indeed not be a use case for it, you're proposing an API that lets the caller pass an arbitrary offset into a string as input and get back the code value at that position, but it doesn't actually do that. If the position I happen to pick is a trailing surrogate, I just get that value back, not the value of the underlying character. This feels wrong to me.
If I know what's going on, I can use the unpaired surrogate value as an indication I should back up one space and try again, but that requires I know what's going on. I guess what I'm questioning is whether most developers will understand the encoding well enough to know to do this. If we think anyone using this API is likely to understand Unicode well enough to know this, and that it's also highly unlikely anyone would be using this API that way in the first place, then maybe it's okay, but it still kind of feels like a hole to me.
Come to think of it, I'm not quite sure how to use it for forward iteration-- wouldn't you want some kind of "codePointAfter" or "indexOfNextChar" function? What am I missing here?
Iteration has to increment the position by 2 if the code point is greater than 0xFFFF - that's what the iterator does internally:
That's what I thought. Again, this requires that the caller know how the encoding works. And again, maybe that's okay, but it makes me a little uncomfortable.
On both of the above: If an application uses indices into UTF-16 strings, its developers have to understand UTF-16. That's why Allen was pushing for UTF-32, but in the choice between making it easier for developers and maintaining compatibility with existing code, compatibility won.
Hopefully we can simplify things at a higher level of abstraction, e.g., by providing a richer API for iterators.
Norbert
I haven't reviewed the new spec draft in detail yet, but have some comments on the comments from Rich and Allen - see below.
Norbert
On Jul 10, 2012, at 20:53 , Allen Wirfs-Brock wrote:
Rich's comment was on the lack of any version number for ISO 10646, not on the Unicode version number. We can simplify the statement in clause 2 to "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard and ISO/IEC 10646, both in the versions referenced in clause 3."
I think basing the specification on UTF-16 code units as source code would be easier, but using Unicode code points as the basis isn't wrong either.
We should stay away from the terms "character", "Unicode character", or "Unicode scalar value" however.
For "character", people have different ideas of what the term means, and redefining it, as ES5 did, would just add to the confusion.
"Unicode character" is not defined in the Unicode standard, as far as I can tell, but seems to be used in the sense of "code point assigned to abstract character" or possibly "designated code point". With either definition, it would exclude code points reserved for future assignment, such as characters that were added in Unicode 6.1 if your implementation was based on Unicode 5.1. Such a restriction would be a constant source of interoperability problems.
"Unicode scalar value" is defined in the Unicode standard as "Any Unicode code point except high-surrogate and low-surrogate code points." We cannot exclude surrogate code points from source code, as this would break compatibility with existing code.
"Unicode code point" and "UTF-16 code unit" are the terms we have to use most of the time.
I agree with Rich that we should limit the discussion to what's relevant to the spec.
I disagree. Unicode 5.1 support is part of ES6 just like the "let" and "class" keywords. I assume we're not going to tell programmers to stay away from "let" and "class". Why should we tell them to stay away from Unicode 5.1?
This paragraph is really about the fact that some implementations will support Unicode 6.1 or later by the time ES6 becomes a standard, while others will be stuck at Unicode 5.1. Using characters that were introduced in Unicode 6.1 in identifiers would mean that the application only runs on implementations based on Unicode 6.1 or higher, not on those based on Unicode 6.0 or lower.
Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
fromCodePoints would be fine. There are a few more copy-and-paste references to codeUnits, and a "codePoint" missing an "s".
codeUnitAt is clearly a better name than charAt (especially in a language that doesn't have a char data type), but since we can't get rid of charAt, I'm not sure it's worth adding codeUnitAt.
I'm not aware of any char32At function in Java. Do you mean codePointAt? That's in both Java and the ES6 draft.
It's not in the Internationalization API edition 1, but seems a prime candidate for edition 2.
At some point we have to give chapter 15 a logical structure again, rather than just offering sediment layers.
Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
There are more issues with this function, which I'll comment on separately.