Code points vs Unicode scalar values
Anne van Kesteren <annevk at annevk.nl> September 4, 2013 7:48 AM
ES6 introduces String.prototype.codePointAt() and String.codePointFrom()
String.fromCodePoint, rather.
as well as an iterator (not defined). It struck me this is the only place in the platform where we'd expose code point as a concept to developers.
Nowadays strings are either 16-bit code units (JavaScript, DOM, etc.) or Unicode scalar values (anytime you hit the network and use utf-8).
I'm not sure I'm a big fan of having all three concepts around.
You can't avoid it: UTF-8 is a transfer format that can be observed via serialization. String.prototype.charCodeAt and String.fromCharCode are required for backward compatibility. And ES6 wants to expose code points as well, so three.
We could have String.prototype.unicodeAt() and String.unicodeFrom() instead, and have them translate lone surrogates into U+FFFD. Lone surrogates are a bug and I don't see a reason to expose them in more places than just the 16-bit code units.
Sorry, I missed this: how else (other than the charCodeAt/fromCharCode legacy) are lone surrogates exposed?
On Wed, Sep 4, 2013 at 4:58 PM, Brendan Eich <brendan at mozilla.com> wrote:
String.fromCodePoint, rather.
Oops. Any reason this is not just String.from() btw? Give the better method a nice short name?
I'm not sure I'm a big fan of having all three concepts around.
You can't avoid it: UTF-8 is a transfer format that can be observed via serialization.
Yes, but it cannot encode lone surrogates. It can only deal in Unicode scalar values.
String.prototype.charCodeAt and String.fromCharCode are required for backward compatibility. And ES6 wants to expose code points as well, so three.
Unicode scalar values are code points sans surrogates, i.e. completely compatible with what a utf-8 encoder/decoder pair can handle.
Why do you want to expose surrogates?
Sorry, I missed this: how else (other than the charCodeAt/fromCharCode legacy) are lone surrogates exposed?
"\udfff".codePointAt(0) == "\udfff"
It seems better if that returns "\ufffd", as you'd get with utf-8 (assuming it accepts code points as input rather than just Unicode scalar values, in which case it'd throw).
The indexing of codePointAt() is also kind of sad as it just passes through to charCodeAt(), which means for any serious usage you need to use the iterator anyway. What's the reason codePointAt() exists?
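To make that indexing point concrete, here is a quick sketch (assuming an ES6 engine; the values are only illustrative):

var s = "\u{1f4a9}a"  // one astral character followed by "a"; s.length is 3
s.codePointAt(0)      // 0x1F4A9 - reads the whole surrogate pair
s.codePointAt(1)      // 0xDCA9  - index 1 lands on the trail surrogate
s.codePointAt(2)      // 0x61    - "a" sits at code unit index 2, not code point index 1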
Anne van Kesteren <annevk at annevk.nl> September 4, 2013 9:06 AM
Oops. Any reason this is not just String.from() btw? Give the better method a nice short name?
Because of String.fromCharCode precedent. Balanced names with noun phrases that distinguish the "from" domains are better than longAndPortly vs. tiny.
Yes, but it cannot encode lone surrogates. It can only deal in Unicode scalar values.
Sure, but you wanted to reduce "three concepts" and I don't see how to do that. Most developers can ignore UTF-8, for sure.
Probably I just misunderstood what you meant, and you were simply pointing out that lone surrogates arise only from legacy APIs?
Unicode scalar values are code points sans surrogates, i.e. completely compatible with what a utf-8 encoder/decoder pair can handle.
Why do you want to expose surrogates?
I'm not sure I do! Sounds scandalous. :-P
Here, from the latest ES6 draft, is 15.5.2.3 String.fromCodePoint (...codePoints):
The String.fromCodePoint function may be called with a variable number of arguments which form the rest parameter codePoints. The following steps are taken:
1. Assert: codePoints is a well-formed rest parameter object.
2. Let length be the result of Get(codePoints, "length").
3. Let elements be a new List.
4. Let nextIndex be 0.
5. Repeat while nextIndex < length
a. Let next be the result of Get(codePoints, ToString(nextIndex)).
b. Let nextCP be ToNumber(next).
c. ReturnIfAbrupt(nextCP).
d. If SameValue(nextCP, ToInteger(nextCP)) is false, then throw a RangeError exception.
e. If nextCP < 0 or nextCP > 0x10FFFF, then throw a RangeError exception.
f. Append the elements of the UTF-16 Encoding (clause 6) of nextCP to the end of elements.
g. Let nextIndex be nextIndex + 1.
6. Return the String value whose elements are, in order, the elements in the List elements. If length is 0, the empty string is returned.
No exposed surrogates here!
Here's the spec for String.prototype.codePointAt:
When the codePointAt method is called with one argument pos, the following steps are taken:
1. Let O be CheckObjectCoercible(this value).
2. Let S be ToString(O).
3. ReturnIfAbrupt(S).
4. Let position be ToInteger(pos).
5. ReturnIfAbrupt(position).
6. Let size be the number of elements in S.
7. If position < 0 or position ≥ size, return undefined.
8. Let first be the code unit value of the element at index position in the String S.
9. If first < 0xD800 or first > 0xDBFF or position+1 = size, then return first.
10. Let second be the code unit value of the element at index position+1 in the String S.
11. If second < 0xDC00 or second > 0xDFFF, then return first.
12. Return ((first – 0xD800) × 1024) + (second – 0xDC00) + 0x10000.
NOTE The codePointAt function is intentionally generic; it does not require that its this value be a String object. Therefore it can be transferred to other kinds of objects for use as a method.
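For readers skimming the steps, here is a rough, non-normative JS approximation of that algorithm (a sketch only):

function codePointAt(s, pos) {
  s = String(s)
  var position = Math.trunc(Number(pos)) || 0  // roughly ToInteger
  var size = s.length
  if (position < 0 || position >= size) return undefined
  var first = s.charCodeAt(position)
  if (first < 0xD800 || first > 0xDBFF || position + 1 === size) return first
  var second = s.charCodeAt(position + 1)
  if (second < 0xDC00 || second > 0xDFFF) return first  // step 11: the lone surrogate comes back as-is
  return (first - 0xD800) * 1024 + (second - 0xDC00) + 0x10000
}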
I take it you are objecting to step 11?
"\udfff".codePointAt(0) == "\udfff"
It seems better if that returns "\ufffd", as you'd get with utf-8 (assuming it accepts code points as input rather than just Unicode scalar values, in which case it'd throw).
Maybe. Allen and Norbert should weigh in.
The indexing of codePointAt() is also kind of sad as it just passes through to charCodeAt(),
I don't see that in the spec cited above.
On 4 Sep 2013, at 18:34, Brendan Eich <brendan at mozilla.com> wrote:
No exposed surrogates here!
I think what Anne means to say is that String.fromCodePoint(0xD800)
returns '\uD800'
as per that algorithm, which is a lone surrogate (and not a scalar value).
Mathias Bynens wrote:
I think what Anne means to say is that
String.fromCodePoint(0xD800)
returns '\uD800' as per that algorithm, which is a lone surrogate (and not a scalar value).
Gotcha. Yes, the new APIs seem to let you write and read lone surrogates. But the legacy APIs won't go away, and IIRC the reasoning is that we're better off exposing the data than trying to abstract away from it in the new APIs. Allen?
On Sep 4, 2013, at 9:46 AM, Brendan Eich wrote:
Mathias Bynens wrote:
I think what Anne means to say is that
String.fromCodePoint(0xD800)
returns '\uD800' as per that algorithm, which is a lone surrogate (and not a scalar value).
Gotcha. Yes, the new APIs seem to let you write and read lone surrogates. But the legacy APIs won't go away, and IIRC the reasoning is that we're better off exposing the data than trying to abstract away from it in the new APIs. Allen?
First, a couple of meta points:
- this stuff is mostly Norbert's design so he may be able to provide better rationale for some of the decisions.
- there are a number of open bugs on the current spec WRT Unicode handling. We'll get around to those soon.
WRT the larger issue, these APIs are for people who need to deal with text at the encoding level. They might be writing their own encoders/decoders/translators. At that level, surrogates really are valid code points even though they are not valid Unicode scalar values. People programming at that level in some cases have to deal with malformed encodings. For example, they might be intentionally generating invalid UTF-16 encodings as part of a test driver.
Note that the behavior of String.fromCodePoint parallels that of string literals:
String.fromCodePoint(0x1d11e)
String.fromCodePoint(0xd834,0xdd1e)
"\u{1d11e}"
"\ud834\udd1e"
all produce the same string value.
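A quick console check of that equivalence (assuming an engine with String.fromCodePoint and \u{} escapes):

var s = String.fromCodePoint(0x1d11e)
s === String.fromCodePoint(0xd834, 0xdd1e)  // true - the surrogate code points pair up into the same code units
s === "\u{1d11e}"                           // true
s === "\ud834\udd1e"                        // true
s.length                                    // 2 - still two 16-bit code units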
On Wed, Sep 4, 2013 at 5:34 PM, Brendan Eich <brendan at mozilla.com> wrote:
Because of String.fromCharCode precedent. Balanced names with noun phrases that distinguish the "from" domains are better than longAndPortly vs. tiny.
I kinda liked it as an analogue to what exists for Array and because developers should probably move away from fromCharCode, so the precedent does not matter that much.
Sure, but you wanted to reduce "three concepts" and I don't see how to do that. Most developers can ignore UTF-8, for sure.
The three concepts are: 16-bit code units, code points, and Unicode scalar values. JavaScript, DOM, etc. deal with 16-bit code units. utf-8 et al deal with Unicode scalar values. Nothing, apart from this API, does code points at the moment.
Probably I just misunderstood what you meant, and you were simply pointing out that lone surrogates arise only from legacy APIs?
No, they arise from this API.
Here, from the latest ES6 draft, is 15.5.2.3 String.fromCodePoint (...codePoints):
No exposed surrogates here!
Mathias covered this.
Here's the spec for String.prototype.codePointAt:
8. Let first be the code unit value of the element at index position in the String S.
11. If second < 0xDC00 or second > 0xDFFF, then return first.
I take it you are objecting to step 11?
And step 8. The indexing is based on code units so you cannot actually do indexing easily. You'd need to use the iterator to iterate over a string getting only code points out.
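For illustration, with the ES6 string iterator you get code points out without doing any index math yourself (a sketch; the iterator is not widely implemented yet):

var s = "a\u{1d11e}b"  // s.length is 4, but only 3 code points
for (var ch of s) {
  // ch is "a", then the two-code-unit "\u{1d11e}", then "b"
  console.log(ch.codePointAt(0).toString(16))
}
// logs: 61, 1d11e, 62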
The indexing of codePointAt() is also kind of sad as it just passes through to charCodeAt(),
I don't see that in the spec cited above.
How do you read step 8?
On Wed, Sep 4, 2013 at 6:22 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
WRT the larger issue, these API are for people who need to deal with text at the encoding level.
At that level you want to deal with bytes. And we have an API for this: encoding.spec.whatwg.org/#api I'd hope people would be smart enough not to add more encoding cruft, but we can't stop them and I don't think this API should be designed for them.
For example, they might be intentionally generating invalid UTF-16 encodings as part of a test driver.
Generate what though? If you want to generate surrogates you can always go back to using 16-bit code units. There's no need for this to leak through to the higher level abstraction.
Note that the behavior of String.fromCodePoint parallels that of string literals:
String.fromCodePoint(0x1d11e) String.fromCodePoint(0xd834,0xdd1e) "\u{1d11e}" "\ud834\udd1e"
all produce the same string value.
If "\u{...}"
is new, it'd be great if that banned surrogates too.
I learned from Simon today that Rust is doing the same thing for its char type. (Rust has some other issues where you can assign arbitrary byte values to a string even in safe mode, but it's still early days in that language.)
Anne van Kesteren wrote:
How do you read step 8?
8. Let first be the code unit value of the element at index position in the String S.
This does not "[pass] through to charCodeAt()" literally, which would mean a call to S.charCodeAt(position). I thought that's what you meant.
So you want a code point index, not a code unit index. That would not be useful for the lower-level purposes Allen identified. Again it seems you're trying to abstract away from all the details that probably will matter for string hackers using these APIs. But I summon Norbert at this point!
Previous discussion of allowing surrogate code points:
- esdiscuss/2012-December/thread.html#27057
- esdiscuss/2013-January/thread.html#28086
- www.unicode.org/mail-arch/unicode-ml/y2013-m01/thread.html#29
Essentially, ECMAScript strings are Unicode strings as defined in The Unicode Standard section 2.7, and thus may contain unpaired surrogate code units in their 16-bit form or surrogate code points when interpreted as 32-bit sequences. String.fromCodePoint and String.prototype.codePointAt just convert between 16-bit and 32-bit forms; they're not meant to interpret the code points beyond that, and some processing (such as test cases) may depend on them being preserved. This is different from encoding for communication over networks, where the use of valid UTF-8 or UTF-16 (which cannot contain surrogate code points) is generally required.
The indexing issue was first discussed in the form "why can't we just use UTF-32"? See norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32 for pointers to that. It would have been great to use UTF-8, but it's unfortunately not compatible with the past and the DOM.
Adding code point indexing to 16-bit code unit strings would add significant performance overhead. In reality, whether an index is for 16-bit or 32-bit units matters only for some relatively low-level software that needs to process code point by code point. A lot of software deals with complete strings without ever looking inside, or is fine processing code unit by code unit (e.g., String.prototype.indexOf).
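To make the "preserved" point concrete, here is a sketch of a 16-bit to 32-bit round trip that keeps a lone surrogate intact (assuming the ES6 methods):

var s = "a\uD800b\u{1d11e}"  // an unpaired lead surrogate plus a real astral character
var cps = []
for (var i = 0; i < s.length; ) {
  var cp = s.codePointAt(i)
  cps.push(cp)
  i += cp > 0xFFFF ? 2 : 1  // advance by the number of code units consumed
}
String.fromCodePoint.apply(null, cps) === s  // true - the lone surrogate survives the round trip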
Thanks for the reminders -- we've been over this.
On Thu, Sep 5, 2013 at 8:07 PM, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:
... they're not meant to interpret the code points beyond that, and some processing (such as test cases) may depend on them being preserved.
Since when are test cases a use case? And why can't a test case use the more difficult route?
I think ideally a string in a language can only represent Unicode scalar values. I.e. it perfectly maps back and forth to utf-8 (or indeed utf-16, although people shouldn't use that).
In ECMAScript a string is 16-bit code units with some sort of utf-16 layer on top in various scenarios.
Now we're adding another layer on top of strings, but instead of exposing ideal strings (Unicode scalar values) we go with some kludge to serve edge cases (whose scenarios have not been fully explained thus far) that are better served using the "low-level" 16-bit code unit API.
In esdiscuss/2012-December/027109 you suggest this is a policy matter, but I do not think it is at all. Unicode scalar values are the code points of Unicode that can be represented in any environment, this is not true for Unicode code points. This is not about policy at all, but rather about what a string ought to be.
Adding code point indexing to 16-bit code unit strings would add significant performance overhead.
Agreed. I don't think we need the *At method for now. Use the iterator.
FWIW, here’s a real-world example of a case where this behavior is annoying/unexpected to developers: cirw.in/blog/node-unicode
On Sep 10, 2013, at 12:14 AM, Mathias Bynens wrote:
FWIW, here’s a real-world example of a case where this behavior is annoying/unexpected to developers: cirw.in/blog/node-unicode
This suggests to me that the problem is in JSON.stringify's Quote operation. I can see an argument that Quote should convert all unpaired surrogates to \uXXXX escapes. I wonder if changing Quote to do this would break anything...
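A rough sketch of that kind of escaping as a post-processing step (escapeLoneSurrogates is a hypothetical helper, not anything in the spec):

// Rewrite unpaired surrogates in JSON.stringify output as \uXXXX escapes.
function escapeLoneSurrogates(json) {
  return json.replace(/[\uD800-\uDFFF]/g, function (ch, offset, str) {
    var code = ch.charCodeAt(0)
    var isLead = code <= 0xDBFF
    if (isLead && /^[\uDC00-\uDFFF]/.test(str.slice(offset + 1))) return ch  // lead of a valid pair
    if (!isLead && /[\uD800-\uDBFF]$/.test(str.slice(0, offset))) return ch  // trail of a valid pair
    return "\\u" + code.toString(16)
  })
}
escapeLoneSurrogates(JSON.stringify("a\uD800b"))  // '"a\\ud800b"'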
On Tue, Sep 10, 2013 at 8:14 AM, Mathias Bynens <mathias at qiwi.be> wrote:
FWIW, here’s a real-world example of a case where this behavior is annoying/unexpected to developers: cirw.in/blog/node-unicode
That seems like a serious bug in V8 though. A utf-8 encoder should never ever generate CESU-8 byte sequences.
On Thu, Sep 5, 2013 at 10:08 PM, Brendan Eich <brendan at mozilla.com> wrote:
Thanks for the reminders -- we've been over this.
It's not clear the arguments were carefully considered though. Shawn Steele raised the same concerns I did. The unicode.org thread also suggests that the ideal value space for a string is Unicode scalar values (i.e. what utf-8 can do) and not code points. It did indeed indicate they have code points because of legacy, but JavaScript has 16-bit code units due to legacy. If we're going to offer a higher level of abstraction over the basic string type, we can very well make that a utf-8 safe layer.
If you need anything for tests, you can just ignore the higher level of abstraction and operate on the 16-bit code units directly.
On 10 Sep 2013, at 18:30, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
On Sep 10, 2013, at 12:14 AM, Mathias Bynens wrote:
FWIW, here’s a real-world example of a case where this behavior is annoying/unexpected to developers: cirw.in/blog/node-unicode
This suggests to me that the problem is in JSON.stringify's Quote operation. I can see an argument that Quote should convert all unpaired surrogates to \uXXXX escapes. I wonder if changing Quote to do this would break anything…
If this turns out to be a non-breaking change, it would make sense to have JSON.stringify escape any non-ASCII symbols, as well as any non-printable ASCII symbols, similar to jsesc's json option [1]. This would improve portability of the serialized data in case it was saved to a misconfigured database, saved to a file with a non-UTF-8 encoding, served to a browser without charset=utf-8 in the Content-Type header, et cetera.
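A sketch of that idea as a wrapper around JSON.stringify (toAsciiJSON is just a hypothetical name):

// Escape everything outside printable ASCII in JSON.stringify output.
// This also turns lone surrogates into \uXXXX escapes, so the result is pure ASCII.
function toAsciiJSON(value) {
  return JSON.stringify(value).replace(/[\u007F-\uFFFF]/g, function (ch) {
    return "\\u" + ("0000" + ch.charCodeAt(0).toString(16)).slice(-4)
  })
}
toAsciiJSON({ note: "\u2603 \uD834\uDD1E \uDC00" })
// '{"note":"\\u2603 \\ud834\\udd1e \\udc00"}'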
Anne van Kesteren <annevk at annevk.nl> September 11, 2013 3:43 AM
It's not clear the arguments were carefully considered though. Shawn Steele raised the same concerns I did. The unicode.org thread also suggests that the ideal value space for a string is Unicode scalar values (i.e. what utf-8 can do) and not code points. It did indeed indicate they have code points because of legacy, but JavaScript has 16-bit code units due to legacy. If we're going to offer a higher level of abstraction over the basic string type, we can very well make that a utf-8 safe layer.
You could be right, but this is a deep topic, not sorted out by programming language developers, in my view. It came up recently here:
www.haskell.org/pipermail/haskell-cafe/2013-September/108654.html
That thread continues. The point about C winning because it doesn't have an abstract String type, only char[], is winning in my view. Yes, it's low level and you have to cope with multiple encodings, but any attempt at a more abstract view would have made a badly leaky abstraction, which would have been more of a boat anchor.
On Wed, Sep 11, 2013 at 7:51 PM, Brendan Eich <brendan at mozilla.com> wrote:
You could be right, but this is a deep topic, not sorted out by programming language developers, in my view. It came up recently here:
www.haskell.org/pipermail/haskell-cafe/2013-September/108654.html
That thread continues. The point about C winning because it doesn't have an abstract String type, only char[], is winning in my view. Yes, it's low level and you have to cope with multiple encodings, but any attempt at a more abstract view would have made a badly leaky abstraction, which would have been more of a boat anchor.
I would be okay with not doing these additions until we are more confident about the correct solution. The polyfills for these are relatively straightforward and documented on MDN.
Or is your argument that this code point layer on top of 16-bit code units is not an abstraction? (And only a Unicode scalar value layer would be an abstraction by implication.)
Anne van Kesteren <annevk at annevk.nl> September 12, 2013 11:39 AM
I would be okay with not doing these additions until we are more confident about the correct solution. The polyfills for these are relatively straightforward and documented on MDN.
Or is your argument that this code point layer on top of 16-bit code units is not an abstraction? (And only a Unicode scalar value layer would be an abstraction by implication.)
I didn't write anything about "not an abstraction" -- do you mean "non-leaky abstraction"? Those are rare birds.
Iterators forward and (if needed backward) over Unicode characters (scalar values; I'm allowed to call those "characters", no?) would be good. Github beats TC39 as usual, prollyfill FTW.
On Thu, Sep 12, 2013 at 6:42 PM, Brendan Eich <brendan at mozilla.com> wrote:
Iterators forward and (if needed backward) over Unicode characters (scalar values; I'm allowed to call those "characters", no?) would be good. Github beats TC39 as usual, prollyfill FTW.
No, there are non-characters that are Unicode scalar values and can (therefore) be expressed using utf-8, such as U+FFFF.
This should do what you asked for, although it's late and it's not an iterator as those don't really work in browsers yet, but should be easy enough to convert:
function toUnicode(str) {
  var output = ""
  for (var i = 0, l = str.length; i < l; i++) {
    var c = str.charCodeAt(i)
    if (0xD800 <= c && c <= 0xDBFF) {
      // Lead surrogate: only valid if the next code unit is a trail surrogate.
      var nextC = str.charCodeAt(i + 1) // NaN past the end of the string
      if (0xDC00 <= nextC && nextC <= 0xDFFF) {
        // Well-formed pair: keep both code units and skip the trail unit.
        output += str[i] + str[++i]
      } else {
        // Lone lead surrogate (including at the end of the string).
        output += "\uFFFD"
      }
    } else if (0xDC00 <= c && c <= 0xDFFF) {
      // Lone trail surrogate.
      output += "\uFFFD"
    } else {
      output += str[i]
    }
  }
  return output
}
toUnicode("\ud800a")
toUnicode("\ud800\udc01")
toUnicode("\udc00a")
On Wed, Sep 11, 2013 at 12:40 PM, Anne van Kesteren <annevk at annevk.nl> wrote:
On Tue, Sep 10, 2013 at 8:14 AM, Mathias Bynens <mathias at qiwi.be> wrote:
FWIW, here’s a real-world example of a case where this behavior is annoying/unexpected to developers: cirw.in/blog/node-unicode
That seems like a serious bug in V8 though. A utf-8 encoder should never ever generate CESU-8 byte sequences.
Just to be clear, V8 does not generate CESU-8 if you give it well formed UTF-16.
If you give it broken UTF-16 with unpaired surrogates you can either break the data or emit CESU-8. In the first case, you overwrite the unpaired surrogates with some sort of error character code. In the second case you can generate three-byte UTF-8 sequences that are not strictly legal. The second option will preserve the data if you round-trip it into V8 again (or feed it to other apps that are liberal in what they accept), so that's what V8 currently does.
On Fri, Sep 20, 2013 at 6:28 AM, Erik Corry <erik.corry at gmail.com> wrote:
Just to be clear, V8 does not generate CESU-8 if you give it well formed UTF-16.
Sure.
If you give it broken UTF-16 with unpaired surrogates you can either break the data or emit CESU-8. In the first case, you overwrite the unpaired surrogates with some sort of error character code. In the second case you can generate three-byte UTF-8 sequences that are not strictly legal. The second option will preserve the data if you round-trip it into V8 again (or feed it to other apps that are liberal in what they accept), so that's what V8 currently does.
That's a bug. A utf-8 encoder should never emit byte sequences that are not valid utf-8. You should emit U+FFFD as a byte sequence instead for lone surrogates or terminate processing. Lone surrogates should not round-trip through the encoding layer as you can create down-level security bugs in unsuspecting decoders.
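For comparison, this is what the Encoding API mentioned earlier does (a quick illustration, assuming a TextEncoder implementation):

// A conforming utf-8 encoder never emits surrogate byte sequences;
// a lone surrogate comes out as U+FFFD (bytes EF BF BD).
var bytes = new TextEncoder().encode("a\uD800b")
Array.prototype.map.call(bytes, function (b) { return b.toString(16) })
// ["61", "ef", "bf", "bd", "62"]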
- Anne van Kesteren wrote:
ES6 introduces String.prototype.codePointAt() and String.codePointFrom() as well as an iterator (not defined). It struck me this is the only place in the platform where we'd expose code point as a concept to developers.
Nowadays strings are either 16-bit code units (JavaScript, DOM, etc.) or Unicode scalar values (anytime you hit the network and use utf-8).
I'm not sure I'm a big fan of having all three concepts around. We could have String.prototype.unicodeAt() and String.unicodeFrom() instead, and have them translate lone surrogates into U+FFFD. Lone surrogates are a bug and I don't see a reason to expose them in more places than just the 16-bit code units.
I would regard that as silent data corruption which has the odd habit of causing hazardous anomalies in code and makes reasoning about it harder.
This is akin to adding edge cases. There are many desirable properties a function or its implementation can have, like purity and idempotence or reflexivity. When functions and relations have such properties 99.99% of the time, people tend to write code as if it had them without exception.
An example I came across today is this:
var parsed = JSON.parse("-0");
1 / parsed === -Infinity; // true
1 / JSON.parse(JSON.stringify(parsed)) === -Infinity; // false
That is, JSON.parse preserves negative zero, but JSON.stringify does not. Well, in Firefox and Webkit; in Opera 12.x both comparisons are false. If JSON.stringify did not silently corrupt negative zero into positive zero, we would probably have one less bug to contend with.
If you look at codePointAt over the domain of strings of .length 1 at the first position, then it is injective; in fact, it's the identity function. And if you apply the fromCodePoint method to the output of codePointAt in this case, the data roundtrips nicely. If instead the functions would silently corrupt data, if codePointAt returned 0xFFFD when the input was 0xFFFD but also when hitting a lone surrogate, these properties would be lost.
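Concretely (assuming the ES6 methods):

var s = "\uD800"                              // a lone surrogate, .length 1
String.fromCodePoint(s.codePointAt(0)) === s  // true - replacing either side with U+FFFD would break this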
Relatedly, if codePointAt would throw an exception when hitting a lone surrogate, you may very well end up with a bug that breaks your whole application because someone accidentally put an emoji character at the wrong position in a string in a database and there is some unfortunate freak combination of code unit oriented API calls, like .substring or a regular expression, that splits the emoji in the middle. Returning an error code, like a negative number or undefined, might have the same effect, depending on what happens if you pass those values to other string-related functions.
Note that emitting a replacement character when encountering character encoding errors in bitstreams is a well-known form of hazardous silent data corruption and systems that require integrity forbid doing that. As an example, the WebSocket protocol requires implementations to consider a WebSocket connection fatally broken upon encountering a malformed UTF-8 sequence in a text frame. That is the right thing to do because when the sender of those bytes sends the wrong bytes, it may also send the wrong byte count, meaning payload data might be misinterpreted as frame and message meta data (useful only to attackers); and on the receiving end, emitting replacement characters might change the byte length of a string, but some code accidentally uses the unmodified byte length in further processing which quickly leads to memory corruption bugs, which are very bad.
Unfortunately ECMAScript makes it very difficult to ensure you do not generate strings with unpaired surrogate code points somewhere in them; it's as easy as taking the first 157 .length units from a string and perhaps appending "..." to abbreviate it. And it's a freak accident if that actually happens in practise because non-BMP characters are rare. We should be very reluctant to introduce hazards hoping to improve our Unicode hygiene.
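For example (assuming an ES6 engine for the \u{} escape):

var msg = "ab\u{1F4A9}cd"                  // .length is 6, but only 5 characters
var abbreviated = msg.slice(0, 3) + "..."  // "ab\uD83D..." - the astral character was split, leaving a lone surrogate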