Unicode support in new ES6 spec draft
Commenting on Norbert's comments…
Rich's comment was on the lack of any version number for ISO 10646, not on the Unicode version number. We can simplify the statement in clause 2 to "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard and ISO/IEC 10646, both in the versions referenced in clause 3."
I like it.
We should stay away from the terms "character", "Unicode character", or "Unicode scalar value" however.
Points taken. I'll withdraw my suggestion of "Unicode scalar value." As for "character," I understand where you're coming from, but I don't think it's really all that bad to use "character," assuming we define precisely what we mean by it. (The fact that Unicode itself doesn't have a formal definition of "character" helps.)
"Unicode code point" and "UTF-16 code unit" are more precise, with full definitions in the Unicode standard, but "character" may make reading easier for the non-Unicode geeks in the audience, at least in situations where the precise meaning isn't important.
This paragraph is really about the fact that some implementations will support Unicode 6.1 or later by the time ES6 becomes a standard, while others will be stuck at Unicode 5.1. Using characters that were introduced in Unicode 6.1 in identifiers would mean that the application only runs on implementations based on Unicode 6.1 or higher, not on those based on Unicode 6.0 or lower.
Fair enough.
Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
What I was getting, perhaps erroneously, from Allen's comments is that \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} are all equivalent, all the time, and in all situations where the implementation assigns meaning to the characters, they all refer to the character U+10000. This seems like the right way to go.
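For reference, the equivalence being discussed is just the standard UTF-16 surrogate-pair decomposition; a small sketch (using the formula from the Unicode standard) shows why \ud800\udc00 and \u{10000} denote the same code point:

```javascript
// Standard surrogate-pair decoding: combine a high and a low surrogate
// into the supplementary code point they encode.
const hi = 0xD800, lo = 0xDC00;
const cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
console.log(cp.toString(16)); // "10000"

// In string literals, the two spellings already produce the same string.
console.log("\ud800\udc00" === "\u{10000}"); // true
```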
fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?
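A small sketch of the redundancy Norbert describes: fromCodePoint accepts lone surrogate code points, so any sequence a hypothetical fromCodeUnit would take produces the same string through fromCodePoint (the emoji code point here is an arbitrary supplementary character):

```javascript
// Passing the two surrogate code points individually...
const viaPair = String.fromCodePoint(0xD83D, 0xDE00);
// ...or the supplementary code point directly...
const direct = String.fromCodePoint(0x1F600);
// ...yields the same UTF-16 string: the conversion erases the
// distinction between surrogate code points and surrogate code units.
console.log(viaPair === direct); // true
```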
codeUnitAt is clearly a better name than charAt (especially in a language that doesn't have a char data type), but since we can't get rid of charAt, I'm not sure it's worth adding codeUnitAt.
I still think it helps.
I'm not aware of any char32At function in Java. Do you mean codePointAt? That's in both Java and the ES6 draft.
I'd have to go back and look at the doc, but yeah: I probably mean codePointAt().
- p. 220, §§15.5.4.17 and 15.5.4.19: Maybe this is a question for Norbert: Are we allowing somewhere for versions of toLocaleUpperCase() and toLocaleLowerCase() that let you specify the locale as a parameter instead of just using the host environment's default locale?
this is covered by the I18N API spec. Right?
It's not in the Internationalization API edition 1, but seems a prime candidate for edition 2.
I agree.
[I]t looks like [String.prototype.codePointAt()] only works right with surrogate pairs if you specify the position of the first surrogate in the pair. I think you want it to work right if you specify the position of either element in the pair.
Norbert proposed this function so we should get his thoughts on the addressing issue. As I wrote this I did think a bit about whether or not we need to provide some support for backward iteration over strings.
Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
Why is it intentional? I don't see the value in restricting it. You've mentioned you're optimizing for the forward-iteration case and want to have a separate API for the backward-iteration case. What about the random-access case? Is there no such case? Worse, it seems like if you use this API for backward iteration or random access, you don't get an error; you just get the wrong answer, and that seems dangerous. [I guess the "wrong answer" is an unpaired surrogate value, which would tip the caller off that he's doing something wrong, but that still seems like extra code he'd need to write.]
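The behavior Rich is worried about can be shown concretely (a sketch assuming the codePointAt semantics under discussion, where only the position of the leading surrogate yields the full code point):

```javascript
const s = "\u{10000}"; // stored internally as the pair \uD800\uDC00

// Correct position (leading surrogate): full code point.
console.log(s.codePointAt(0).toString(16)); // "10000"

// Position of the trailing surrogate: the lone surrogate value comes
// back with no error -- the "wrong answer" Rich describes.
console.log(s.codePointAt(1).toString(16)); // "dc00"
```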
In order to support backwards iteration (which is sometimes used), we should have codePointBefore.
Mark
plus.google.com/114199149796022210033
— Il meglio è l'inimico del bene — ("The best is the enemy of the good")
On Jul 16, 2012, at 2:54 PM, Gillam, Richard wrote:
Commenting on Norbert's comments… ...
Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
What I was getting, perhaps erroneously, from Allen's comments is that \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} are all equivalent, all the time, and in all situations where the implementation assigns meaning to the characters, they all refer to the character U+10000. This seems like the right way to go.
To further clarify my position: I don't currently agree with Norbert's assertion WRT "situations". For more discussion see ecmascript#469, ecmascript#525.
The current spec draft (except where RegExp still needs updating) treats \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} as equivalent in all literal situations. As currently spec'ed, explicit UTF-16 escape sequences such as \ud800\udc00 are not decoded as a single code point in non-literal contexts such as identifiers. Such sequences currently generate errors in existing implementations, so there aren't any backwards compatibility issues.
I'm taking this position because I want to discourage programmers from hand-encoding UTF-16 and instead have them use \u{} to express actual code points for supplementary characters. For backwards compat, sequences such as \uDnnn\uDnnn need to be recognized in literals, but there is no need to allow them where they have not been allowed in existing implementations.
fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?
Depends upon how it is spec'ed. The legacy String.fromCharCode clamps values using ToUint16 but does not reject them. The current spec for fromCodePoint throws for values > 0x10FFFF.
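A sketch of the difference Allen describes, contrasting the legacy clamping behavior with the range check in the draft:

```javascript
// Legacy behavior: fromCharCode reduces its argument modulo 2^16
// (the ToUint16 abstract operation) rather than rejecting it.
const clamped = String.fromCharCode(0x10031); // 0x10031 mod 0x10000 = 0x31
console.log(clamped === "1"); // true

// Draft behavior: fromCodePoint rejects values outside the Unicode range.
let threw = false;
try {
  String.fromCodePoint(0x110000); // one past the last code point
} catch (e) {
  threw = e instanceof RangeError;
}
console.log(threw); // true
```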
codeUnitAt is clearly a better name than charAt (especially in a language that doesn't have a char data type), but since we can't get rid of charAt, I'm not sure it's worth adding codeUnitAt.
I still think it helps.
Note that charAt returns a string value, not a numeric code unit value. charCodeAt is used to retrieve a numeric code unit value at a specific string index position. The proposed codePointAt also returns a numeric code point. What is missing from this discussion is a method that returns a string value and which correctly interprets surrogate pairs.
...
On Jul 16, 2012, at 2:57 PM, Mark Davis ☕ wrote:
In order to support backwards iteration (which is sometimes used), we should have codePointBefore.
or we can provide a backwards iterator that knows how to parse surrogate pairs: for (let c of str.backwards) ...
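A minimal sketch of what such a backwards iterator might look like ("str.backwards" is hypothetical; this standalone generator just shows the surrogate-aware reverse walk):

```javascript
// Yield the code points of a string in reverse order, keeping
// surrogate pairs intact.
function* backwards(str) {
  for (let i = str.length; i > 0; ) {
    let code = str.charCodeAt(--i);
    // On a trailing surrogate, check for a leading surrogate before it;
    // if found, back up one more unit and combine the pair.
    if (code >= 0xDC00 && code <= 0xDFFF && i > 0) {
      const hi = str.charCodeAt(i - 1);
      if (hi >= 0xD800 && hi <= 0xDBFF) {
        i--;
        code = 0x10000 + ((hi - 0xD800) << 10) + (code - 0xDC00);
      }
    }
    yield String.fromCodePoint(code);
  }
}

console.log([...backwards("a\u{10000}b")]); // ["b", "\u{10000}", "a"]
```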
And more comments…
On Jul 16, 2012, at 14:54 , Gillam, Richard wrote:
Commenting on Norbert's comments…
[…]
We should stay away from the terms "character", "Unicode character", or "Unicode scalar value" however.
Points taken. I'll withdraw my suggestion of "Unicode scalar value." As for "character," I understand where you're coming from, but I don't think it's really all that bad to use "character," assuming we define precisely what we mean by it. (The fact that Unicode itself doesn't have a formal definition of "character" helps.)
"Unicode code point" and "UTF-16 code unit" are more precise, with full definitions in the Unicode standard, but "character" may make reading easier for the non-Unicode geeks in the audience, at least in situations where the precise meaning isn't important.
In bug 524, I softened this a bit: "The term "Unicode character" can be used when only assigned characters are meant, e.g., when referring to individual characters such as "comma" or "reverse solidus", or to the characters that can be used in identifiers." ecmascript#524
fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?
True. Is this important enough to application developers to warrant a separate method?
[I]t looks like [String.prototype.codePointAt()] only works right with surrogate pairs if you specify the position of the first surrogate in the pair. I think you want it to work right if you specify the position of either element in the pair.
Norbert proposed this function so we should get his thoughts on the addressing issue. As I wrote this I did think a bit about whether or not we need to provide some support for backward iteration over strings.
Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
Why is it intentional? I don't see the value in restricting it. You've mentioned you're optimizing for the forward-iteration case and want to have a separate API for the backward-iteration case. What about the random-access case? Is there no such case? Worse, it seems like if you use this API for backward iteration or random access, you don't get an error; you just get the wrong answer, and that seems dangerous. [I guess the "wrong answer" is an unpaired surrogate value, which would tip the caller off that he's doing something wrong, but that still seems like extra code he'd need to write.]
I think the question is the other way round: Is there a valid and common use case that requires random access?
What's the use case for backwards iteration, in the sense of looking at all code points in a string in reverse order?
Looking at an individual code point before a given index is necessary, for example, when checking word boundaries, so I can see the need for codePointBefore.
Norbert
On Jul 16, 2012, at 16:41 , Allen Wirfs-Brock wrote:
On Jul 16, 2012, at 2:54 PM, Gillam, Richard wrote:
Commenting on Norbert's comments… ...
Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
What I was getting, perhaps erroneously, from Allen's comments is that \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} are all equivalent, all the time, and in all situations where the implementation assigns meaning to the characters, they all refer to the character U+10000. This seems like the right way to go.
To further clarify my position: I don't currently agree with Norbert's assertion WRT "situations". For more discussion see ecmascript#469, ecmascript#525.
The current spec draft (except where RegExp still needs updating) treats \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} as equivalent in all literal situations. As currently spec'ed, explicit UTF-16 escape sequences such as \ud800\udc00 are not decoded as a single code point in non-literal contexts such as identifiers. Such sequences currently generate errors in existing implementations, so there aren't any backwards compatibility issues.
I'm taking this position because I want to discourage programmers from hand-encoding UTF-16 and instead have them use \u{} to express actual code points for supplementary characters. For backwards compat, sequences such as \uDnnn\uDnnn need to be recognized in literals, but there is no need to allow them where they have not been allowed in existing implementations.
It's not a backwards compatibility issue.
It's an issue with having to explain to developers that sometimes \u{10000} and \ud800\udc00 are equivalent, and sometimes they're not; the kind of inconsistency that makes it more difficult to understand the language.
And it may be an issue with tools that convert non-ASCII characters into (old-style) Unicode escapes.
Allen Wirfs-Brock wrote:
On Jul 16, 2012, at 2:57 PM, Mark Davis ☕ wrote:
In order to support backwards iteration (which is sometimes used), we should have codePointBefore.
or we can provide a backwards iterator that knows how to parse surrogate pairs: for (let c of str.backwards) ...
Allen
Kind of a spin-off, but I think a String.prototype.reverse that avoids
s.split('').reverse().join('')
overhead and the ES6 Unicode hazard of splitting on a code unit boundary would be swell. It's tiny and matches Array.prototype.reverse, but of course without observable in-place mutation.
It wouldn't relieve all use-cases for reverse iteration, but we have iterators and for-of in ES6, we should use 'em.
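The hazard Brendan mentions can be illustrated with a sketch; note that even a code-point-correct reversal still mangles multi-code-point grapheme clusters (combining marks and the like):

```javascript
const s = "a\u{1D15F}"; // "a" + musical quarter note (a surrogate pair)

// split('') splits on code units, so the reversal puts the surrogate
// halves in the wrong order, corrupting the pair.
const broken = s.split('').reverse().join('');

// Spreading the string uses the ES6 string iterator, which walks code
// points, so pairs survive the reversal.
const ok = [...s].reverse().join('');

console.log(ok === "\u{1D15F}a"); // true
console.log(broken === ok);       // false
```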
We agreed in November not to add String.prototype.reverse because there was no compelling use case for it. Is there now? esdiscuss/2011-November/018581
Norbert
Norbert Lindenberg wrote:
We agreed in November not to add String.prototype.reverse because there was no compelling use case for it. Is there now?
Thanks, how soon I forgot!
It may be that no more compelling a case exists now than existed then. However, Allen mentioned forward Unicode-aware iteration over the result of (I'll call it) s.reverse() as potentially taking away some of the use-case-based need for a Unicode-aware backward iterator.
How common is backward iteration in post-"UCS2" Java, anyone know?
A string reversal is not exactly a high-runner API, and the simple codepoint reversal will have pretty bad results where grapheme-cluster ≠ single code point.
On Tue, Jul 17, 2012 at 10:23 PM, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
To further clarify my position: I don't currently agree with Norbert's assertion WRT "situations". For more discussion see ecmascript#469, ecmascript#525.
The current spec draft (except where RegExp still needs updating) treats \ud800\udc00, \u{d800}\u{dc00}, and \u{10000} as equivalent in all literal situations. As currently spec'ed, explicit UTF-16 escape sequences such as \ud800\udc00 are not decoded as a single code point in non-literal contexts such as identifiers. Such sequences currently generate errors in existing implementations, so there aren't any backwards compatibility issues.
I'm taking this position because I want to discourage programmers from hand-encoding UTF-16 and instead have them use \u{} to express actual code points for supplementary characters. For backwards compat, sequences such as \uDnnn\uDnnn need to be recognized in literals, but there is no need to allow them where they have not been allowed in existing implementations.
It's not a backwards compatibility issue.
It's an issue with having to explain to developers that sometimes \u{10000} and \ud800\udc00 are equivalent, and sometimes they're not; the kind of inconsistency that makes it more difficult to understand the language.
And it may be an issue with tools that convert non-ASCII characters into (old-style) Unicode escapes.
As stated in ecmascript#469, I agree with Norbert’s sentiments here. With ES 5.1 it’s perfectly possible to make a list of all Unicode characters that are allowed/disallowed in IdentifierStart or IdentifierPart. So, you could say:
The non-BMP symbol U+2F800 CJK Compatibility Ideograph (`丽`) is disallowed in identifier names, even though it’s in the [Lo] category:

var \uD87E\uDC00; // SyntaxError, as the surrogate halves don’t match any of the allowed Unicode categories
var 丽; // SyntaxError, as this is equivalent to the above
The latest ES 6 draft makes this far more complicated:
The non-BMP symbol U+2F800 CJK Compatibility Ideograph (`丽`) is disallowed in identifier names only if it’s written out using simple Unicode escape sequences for the separate surrogate halves:

var \uD87E\uDC00; // SyntaxError, as the surrogate halves don’t match any of the allowed Unicode categories
var 丽; // allowed
var \u{2F800}; // allowed
It’s no longer possible to say “this symbol is allowed/disallowed” as it depends on the way the symbol is represented.
Please consider fixing this inconsistency.
On Jul 18, 2012, at 19:42 , Gillam, Richard wrote:
But fromCodeUnit() wouldn't let you pass in values above \xffff, would it?
True. Is this important enough to application developers to warrant a separate method?
I tend to think so. It seems like I ought to be able to pass some function the value 0x1d15f and get back a string containing the quarter-note character; otherwise, we're still privileging BMP characters.
Well, the function that does that is String.fromCodePoint(). The question is, is there enough value to warrant a separate fromCodeUnit() method.
Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
Why is it intentional? I don't see the value in restricting it. You've mentioned you're optimizing for the forward-iteration case and want to have a separate API for the backward-iteration case. What about the random-access case? Is there no such case? Worse, it seems like if you use this API for backward iteration or random access, you don't get an error; you just get the wrong answer, and that seems dangerous. [I guess the "wrong answer" is an unpaired surrogate value, which would tip the caller off that he's doing something wrong, but that still seems like extra code he'd need to write.]
I think the question is the other way round: Is there a valid and common use case that requires random access?
If there isn't, then I don't think we should have an API for random access; we should just have iterators. This API looks like it supports random access, but it really doesn't.
I misunderstood how you meant random access - I thought you meant truly random positions for which you don't know whether they're for the first or the second code unit of a code point. It's this kind of randomness that I don't expect to see in practice.
You meant random as opposed to fully sequential, that is, functions that operate on a small section of a larger string. That does happen a lot - examples: a function to find the next token in program source code; a function that has to find word breaks or hyphenation points around the position where a line overflows; a subexpression matcher of a regular expression. But in these cases it's reasonable to require all code involved to maintain the invariant that positions passed around are always those of the first code unit of a code point, never the second of a supplementary code point.
Come to think of it, I'm not quite sure how to use it for forward iteration-- wouldn't you want some kind of "codePointAfter" or "indexOfNextChar" function? What am I missing here?
Iteration has to increment the position by 2 if the code point is greater than 0xFFFF - that's what the iterator does internally: norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#String
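The forward-iteration step Norbert describes can be sketched as a small generator (essentially what the linked iterator does internally):

```javascript
// Walk the code points of a string forward: read the code point at the
// current index, then advance by 2 code units if it was supplementary.
function* codePoints(str) {
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    yield cp;
    i += cp > 0xFFFF ? 2 : 1;
  }
}

console.log([...codePoints("a\u{10000}b")].map(c => c.toString(16)));
// ["61", "10000", "62"]
```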
Norbert--
I tend to think so. It seems like I ought to be able to pass some function the value 0x1d15f and get back a string containing the quarter-note character; otherwise, we're still privileging BMP characters.
Well, the function that does that is String.fromCodePoint(). The question is, is there enough value to warrant a separate fromCodeUnit() method.
Sorry, my mistake. In that case, yes, you're right-- there probably isn't much value in a separate fromCodeUnit() method.
I misunderstood how you meant random access - I thought you meant truly random positions for which you don't know whether they're for the first or the second code unit of a code point. It's this kind of randomness that I don't expect to see in practice.
You meant random as opposed to fully sequential, that is, functions that operate on a small section of a larger string. That does happen a lot - examples: a function to find the next token in program source code; a function that has to find word breaks or hyphenation points around the position where a line overflows; a subexpression matcher of a regular expression. But in these cases it's reasonable to require all code involved to maintain the invariant that positions passed around are always those of the first code unit of a code point, never the second of a supplementary code point.
No, we're talking past each other. I did mean truly "random" access. And while there may indeed not be a use case for it, you're proposing an API that lets the caller pass an arbitrary offset into a string as input and get back the code value at that position, but it doesn't actually do that. If the position I happen to pick is a trailing surrogate, I just get that value back, not the value of the underlying character. This feels wrong to me.
If I know what's going on, I can use the unpaired surrogate value as an indication I should back up one space and try again, but that requires I know what's going on. I guess what I'm questioning is whether most developers will understand the encoding well enough to know to do this. If we think anyone using this API is likely to understand Unicode well enough to know this, and that it's also highly unlikely anyone would be using this API that way in the first place, then maybe it's okay, but it still kind of feels like a hole to me.
Come to think of it, I'm not quite sure how to use it for forward iteration-- wouldn't you want some kind of "codePointAfter" or "indexOfNextChar" function? What am I missing here?
Iteration has to increment the position by 2 if the code point is greater than 0xFFFF - that's what the iterator does internally:
That's what I thought. Again, this requires that the caller know how the encoding works. And again, maybe that's okay, but it makes me a little uncomfortable.
On Jul 19, 2012, at 16:26 , Gillam, Richard wrote:
I misunderstood how you meant random access - I thought you meant truly random positions for which you don't know whether they're for the first or the second code unit of a code point. It's this kind of randomness that I don't expect to see in practice.
You meant random as opposed to fully sequential, that is, functions that operate on a small section of a larger string. That does happen a lot - examples: a function to find the next token in program source code; a function that has to find word breaks or hyphenation points around the position where a line overflows; a subexpression matcher of a regular expression. But in these cases it's reasonable to require all code involved to maintain the invariant that positions passed around are always those of the first code unit of a code point, never the second of a supplementary code point.
No, we're talking past each other. I did mean truly "random" access. And while there may indeed not be a use case for it, you're proposing an API that lets the caller pass an arbitrary offset into a string as input and get back the code value at that position, but it doesn't actually do that. If the position I happen to pick is a trailing surrogate, I just get that value back, not the value of the underlying character. This feels wrong to me.
If I know what's going on, I can use the unpaired surrogate value as an indication I should back up one space and try again, but that requires I know what's going on. I guess what I'm questioning is whether most developers will understand the encoding well enough to know to do this. If we think anyone using this API is likely to understand Unicode well enough to know this, and that it's also highly unlikely anyone would be using this API that way in the first place, then maybe it's okay, but it still kind of feels like a hole to me.
Come to think of it, I'm not quite sure how to use it for forward iteration-- wouldn't you want some kind of "codePointAfter" or "indexOfNextChar" function? What am I missing here?
Iteration has to increment the position by 2 if the code point is greater than 0xFFFF - that's what the iterator does internally:
That's what I thought. Again, this requires that the caller know how the encoding works. And again, maybe that's okay, but it makes me a little uncomfortable.
On both of the above: If an application uses indices into UTF-16 strings, its developers have to understand UTF-16. That's why Allen was pushing for UTF-32, but in the choice between making it easier for developers and maintaining compatibility with existing code, compatibility won.
Hopefully we can simplify things at a higher level of abstraction, e.g., by providing a richer API for iterators.
Norbert
I haven't reviewed the new spec draft in detail yet, but have some comments on the comments from Rich and Allen - see below.
Norbert
On Jul 10, 2012, at 20:53 , Allen Wirfs-Brock wrote:
Rich's comment was on the lack of any version number for ISO 10646, not on the Unicode version number. We can simplify the statement in clause 2 to "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard and ISO/IEC 10646, both in the versions referenced in clause 3."
I think basing the specification on UTF-16 code units as source code would be easier, but using Unicode code points as the basis isn't wrong either.
We should stay away from the terms "character", "Unicode character", or "Unicode scalar value" however.
For "character", people have different ideas of what the term means, and redefining it, as ES5 did, would just add to the confusion.
"Unicode character" is not defined in the Unicode standard, as far as I can tell, but seems to be used in the sense of "code point assigned to abstract character" or possibly "designated code point". With either definition, it would exclude code points reserved for future assignment, such as characters that were added in Unicode 6.1 if your implementation was based on Unicode 5.1. Such a restriction would be a constant source of interoperability problems.
"Unicode scalar value" is defined in the Unicode standard as "Any Unicode code point except high-surrogate and low-surrogate code points." We cannot exclude surrogate code points from source code, as this would break compatibility with existing code.
"Unicode code point" and "UTF-16 code unit" are the terms we have to use most of the time.
I agree with Rich that we should limit the discussion to what's relevant to the spec.
I disagree. Unicode 5.1 support is part of ES6 just like the "let" and "class" keywords. I assume we're not going to tell programmers to stay away from "let" and "class". Why should we tell them to stay away from Unicode 5.1?
This paragraph is really about the fact that some implementations will support Unicode 6.1 or later by the time ES6 becomes a standard, while others will be stuck at Unicode 5.1. Using characters that were introduced in Unicode 6.1 in identifiers would mean that the application only runs on implementations based on Unicode 6.1 or higher, not on those based on Unicode 6.0 or lower.
Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes.
fromCodeUnit seems rather redundant. Note that any code unit sequence it accepts would be equally accepted, with the same result, by fromCodePoints, as that function accepts surrogate code points and then, in the conversion to UTF-16, erases the distinction between surrogate code points and surrogate code units.
fromCodePoints would be fine. There are a few more copy-and-paste references to codeUnits, and a "codePoint" missing an "s".
codeUnitAt is clearly a better name than charAt (especially in a language that doesn't have a char data type), but since we can't get rid of charAt, I'm not sure it's worth adding codeUnitAt.
I'm not aware of any char32At function in Java. Do you mean codePointAt? That's in both Java and the ES6 draft.
It's not in the Internationalization API edition 1, but seems a prime candidate for edition 2.
At some point we have to give chapter 15 a logical structure again, rather than just offering sediment layers.
Requiring the correct position is intentional; it's the same in java.lang.String.codePointAt. If we want to support backwards iteration, we could add codePointBefore.
There are more issues with this function, which I'll comment on separately.