String.fromCodePoint and surrogate pairs?

# Erik Arvidsson (13 years ago)

It was suggested to me that we could probably extend String.fromCodePoint to be aware of UTF-16 code units too. It seems doable since the lead surrogate is not a valid code point.

The question is whether it is worth it. It seems like we are going down a slippery slope if we start to do things like this. Should we also handle UTF-8 code units? Maybe it is better not to do this and instead try to move people away from UTF-16 code units and towards code points.

-- 
erik

# Norbert Lindenberg (13 years ago)

Do you know what the people who talked to you mean by "aware of UTF-16 code units"?

As specified, String.fromCodePoint accepts all UTF-16 code units because they use a subset of the integers allowed as code points (0 to 0xFFFF versus 0 to 0x10FFFF). For non-surrogate values, you get exactly what you expect. Surrogate values are interpreted as surrogate code points, which are valid code points in Unicode (their use makes a string ill-formed in Unicode terminology, but the proposed ECMAScript spec ignores issues of well-formedness for compatibility with ES5). Since in conversion to UTF-16 a surrogate code point just becomes the corresponding code unit, it can happen that two surrogate code points (an ill-formed sequence) become a well-formed surrogate pair: String.fromCodePoint(0xD83D, 0xDE04) => "\uD83D\uDE04" = "😄".

The story for UTF-8 is very different: Of course all UTF-8 code units would be accepted by String.fromCodePoint, but they would turn into a completely different character sequence. E.g., the UTF-8 byte sequence for 😄: String.fromCodePoint(0xF0, 0x9F, 0x98, 0x84) => "\u00F0\u009F\u0098\u0084" = "ð\u009F\u0098\u0084" (the last three are control characters).

Handling UTF-8 would require a way to identify the character encoding to convert from, which would be the beginning of an encoding conversion API, and the internationalization ad hoc group decided not to work on one within ECMAScript. There is an API being defined as part of the Encoding Standard project at WHATWG.

Norbert
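
A minimal sketch of the two cases Norbert describes, assuming an engine that ships String.fromCodePoint and the WHATWG Encoding API's TextDecoder (both became widely available only after this thread). The variable names are illustrative and not taken from the thread.

```javascript
// Two surrogate code points passed separately come out as adjacent
// UTF-16 code units, which together form a well-formed surrogate pair.
const fromSurrogates = String.fromCodePoint(0xD83D, 0xDE04);
const fromScalar = String.fromCodePoint(0x1F604); // U+1F604, the same emoji

console.log(fromSurrogates === fromScalar); // true
console.log(fromSurrogates.length);         // 2 (code units)

// UTF-8 bytes are accepted too, but each byte is treated as its own
// code point, so the result is four unrelated characters, not the emoji.
const utf8Bytes = [0xF0, 0x9F, 0x98, 0x84];
const misread = String.fromCodePoint(...utf8Bytes);
console.log(misread === "\u00F0\u009F\u0098\u0084"); // true

// Interpreting the bytes as UTF-8 needs an explicit decoding step.
const decoded = new TextDecoder("utf-8").decode(new Uint8Array(utf8Bytes));
console.log(decoded === fromScalar); // true
```

The last step is the kind of encoding conversion that the message says was left to the WHATWG API rather than to ECMAScript itself.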

# Shawn Steele (13 years ago)

IMO String.fromCodePoint should disallow U+D800-U+DFFF.

There's already fromCharCode that does that, and according to The Unicode Standard, isolated surrogates have no meaning on their own; it goes on to compare them to illegal UTF-8 sequences. IMO "CodePoint" is a 21-bit Unicode code point, and explicitly not UTF-16, so it shouldn't confuse things by allowing UTF-16 (or other encoding) forms.

-Shawn
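
For comparison, a small sketch of how the two functions treat their arguments in engines that implement String.fromCodePoint as it was eventually standardized. The variable names are illustrative only.

```javascript
// String.fromCharCode works on 16-bit code units: arguments are truncated
// to 16 bits, and surrogate values pass through untouched.
const loneViaCharCode = String.fromCharCode(0xD83D);
const truncated = String.fromCharCode(0x1F604); // keeps only the low 16 bits: 0xF604

// String.fromCodePoint works on code points: supplementary values are
// encoded as a surrogate pair, and surrogate code points are accepted.
const emoji = String.fromCodePoint(0x1F604);
const loneViaCodePoint = String.fromCodePoint(0xD83D);

console.log(truncated === "\uF604");               // true
console.log(emoji.length);                         // 2
console.log(loneViaCharCode === loneViaCodePoint); // true

// Values beyond U+10FFFF are rejected with a RangeError.
let rejected = false;
try {
  String.fromCodePoint(0x110000);
} catch (e) {
  rejected = e instanceof RangeError;
}
console.log(rejected); // true
```

So the only range check String.fromCodePoint performs is against the code point range as a whole; the surrogate block is not treated specially.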

# Norbert Lindenberg (13 years ago)

The Unicode standard defines "code point" as any value in the range of integers from 0 to 0x10FFFF - see definitions D9 and D10 of chapter 3 [1].

Once you exclude surrogate code points, you have Unicode scalar values (definition D76), so you're basically proposing a String.fromScalarValue function. But then, why not also exclude code points that Unicode has defined as non-characters (chapter 16.7 [2])? It seems we're getting into policy-setting here, and so far ECMAScript has avoided setting policy for how you can use strings.

Norbert

[1] http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
[2] http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
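
To make the distinction concrete, here is a sketch of what a scalar-value-only function could look like. `fromScalarValue` and `isNoncharacter` are hypothetical helpers written for illustration; neither is part of ECMAScript, and the sketch assumes an engine with String.fromCodePoint.

```javascript
// Hypothetical helper, not part of ECMAScript: behaves like
// String.fromCodePoint but accepts only Unicode scalar values,
// i.e. code points outside the surrogate range U+D800..U+DFFF.
function fromScalarValue(...values) {
  for (const value of values) {
    if (value >= 0xD800 && value <= 0xDFFF) {
      throw new RangeError(
        "Surrogate code point U+" + value.toString(16).toUpperCase()
      );
    }
  }
  // String.fromCodePoint itself rejects non-integers and values > 0x10FFFF.
  return String.fromCodePoint(...values);
}

// Hypothetical helper: the noncharacters are U+FDD0..U+FDEF plus the
// last two code points of every plane (U+nFFFE and U+nFFFF).
function isNoncharacter(codePoint) {
  return (
    (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) ||
    (codePoint & 0xFFFE) === 0xFFFE
  );
}

console.log(fromScalarValue(0x1F604) === "\uD83D\uDE04"); // true
console.log(isNoncharacter(0xFFFE));                      // true
console.log(isNoncharacter(0x41));                        // false
```

Whether a built-in should also apply a filter like `isNoncharacter` is exactly the policy question raised in the message; the sketch only shows that each restriction is a separate, explicit choice.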

# Shawn Steele (13 years ago)

I was looking at D75 in section 3.8, "Surrogates".

My point is that there's no "legal" scenario for converting what is basically UTF-32 input to an isolated surrogate. No valid Unicode string could contain that. So why support it?

-Shawn

# Norbert Lindenberg (12 years ago)

I don't have a good scenario at hand either that would require support for surrogate code points, but in ECMAScript the question is often asked the other way around: Why reject it? And given that there are several ways already to construct strings that are ill-formed UTF-16 (e.g., "\uD800", String.fromCharCode(0xD800)), it's not clear why this particular path should be blocked.

(Sorry for letting this sit in my outbox for such a long time.)

Norbert
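
A short sketch of the existing routes to an ill-formed string, using the escape sequence from the message, the standard code-unit function String.fromCharCode, and ordinary code-unit slicing. The variable names are illustrative only.

```javascript
// Three ways to obtain the same lone lead surrogate.
const viaEscape = "\uD800";
const viaCharCode = String.fromCharCode(0xD800);
const viaCodePoint = String.fromCodePoint(0xD800);

console.log(viaEscape === viaCharCode);  // true
console.log(viaEscape === viaCodePoint); // true

// Code-unit based operations can also split a well-formed pair.
const half = "\uD83D\uDE04".slice(0, 1);
console.log(half === "\uD83D"); // true
console.log(half.length);       // 1
```

Blocking only the String.fromCodePoint route would leave every other line of this sketch working as before.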

# Shawn Steele (12 years ago)

It doesn't make sense and is illegal Unicode, i.e., it's corrupt data. So the only reason to accept it is to allow corrupt data, perhaps as a way of passing off non-Unicode data in a Unicode context. That inevitably leads to problems, particularly on the web, where developers do whatever sneaky thing they think works.

If a use case for illegal Unicode were ever found, support could be added later.

-Shawn
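
One way to see the kind of problem meant here is to pass a lone surrogate to APIs that require well-formed text. This is a sketch assuming an environment with the WHATWG TextEncoder; the variable names are illustrative only.

```javascript
const lone = String.fromCodePoint(0xD800);

// TextEncoder cannot represent a lone surrogate in UTF-8, so it
// silently substitutes U+FFFD (bytes EF BF BD).
const bytes = Array.from(new TextEncoder().encode(lone));
console.log(bytes); // [ 239, 191, 189 ]

// encodeURIComponent refuses the string outright.
let threw = false;
try {
  encodeURIComponent(lone);
} catch (e) {
  threw = e instanceof URIError;
}
console.log(threw); // true
```

In both cases the string was created without complaint and the failure, or silent data change, only shows up later at an encoding boundary.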

# Mark Davis ☕ (12 years ago)

There was a long discussion of this on the Unicode list recently. A surrogate code point is not illegal Unicode. It is illegal in a UTF string, but is not illegal in a Unicode String (http://www.unicode.org/glossary/#unicode_string).

I don't want to repeat that whole long discussion here.

Mark
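
In ECMAScript terms, every string value is a sequence of 16-bit code units, but only some of those sequences are well-formed UTF-16. The following sketch tests for that; `isWellFormedUtf16` is a hypothetical helper written for illustration, and it relies on string iteration proceeding by code point.

```javascript
// Hypothetical helper: returns true when the string contains no unpaired
// surrogate. Iterating a string yields one code point per step, so a
// properly paired surrogate shows up as a single value above 0xFFFF,
// while an unpaired one shows up in the range 0xD800..0xDFFF.
function isWellFormedUtf16(s) {
  for (const ch of s) {
    const codePoint = ch.codePointAt(0);
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUtf16("\uD83D\uDE04")); // true  (lead + trail)
console.log(isWellFormedUtf16("\uD83D"));       // false (lone lead)
console.log(isWellFormedUtf16("\uDE04\uD83D")); // false (wrong order)
```

All three inputs are valid string values; the helper only reports which of them would also be valid as UTF-16 text.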
