String.fromCodePoint and surrogate pairs?
Do you know what the people who talked to you mean by "aware of UTF-16 code units"?
As specified, String.fromCodePoint accepts all UTF-16 code units, because they use a subset of the integers allowed as code points (0 to 0xFFFF versus 0 to 0x10FFFF). For non-surrogate values, you get exactly what you expect. Surrogate values are interpreted as surrogate code points, which are valid code points in Unicode (their use makes a string ill-formed in Unicode terminology, but the proposed ECMAScript spec ignores issues of well-formedness for compatibility with ES5). Since in conversion to UTF-16 a surrogate code point simply becomes the corresponding code unit, two surrogate code points (an ill-formed sequence) can become a well-formed surrogate pair: String.fromCodePoint(0xD83D, 0xDE04) => "\uD83D\uDE04" = "😄".
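Concretely (a quick sketch, assuming the semantics of the current String.fromCodePoint proposal):

    // Two surrogate code points yield the same code unit sequence as the
    // single supplementary code point they would encode in UTF-16.
    var fromSurrogates = String.fromCodePoint(0xD83D, 0xDE04); // "\uD83D\uDE04"
    var fromScalar = String.fromCodePoint(0x1F604);            // "\uD83D\uDE04"
    fromSurrogates === fromScalar; // true
    fromSurrogates.length;         // 2 code units, forming one well-formed pair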
The story for UTF-8 is very different: of course all UTF-8 code units would be accepted by String.fromCodePoint, but they would turn into a completely different character sequence. E.g., the UTF-8 byte sequence for 😄: String.fromCodePoint(0xF0, 0x9F, 0x98, 0x84) => "\u00F0\u009F\u0098\u0084" = "ð\u009F\u0098\u0084" (the last three are control characters).
Handling UTF-8 would require a way to identify the character encoding to convert from, which would be the beginning of an encoding conversion API, and the internationalization ad hoc group decided not to work on one within ECMAScript. There is an API being defined as part of the Encoding Standard project at the WHATWG.
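For comparison, decoding those bytes properly would look roughly like this with the TextDecoder interface from the WHATWG Encoding Standard (a sketch, assuming that API is available; it is not part of ECMAScript):

    // Decoding UTF-8 requires naming the encoding to convert from.
    var bytes = new Uint8Array([0xF0, 0x9F, 0x98, 0x84]);  // UTF-8 for U+1F604
    var text = new TextDecoder("utf-8").decode(bytes);
    text === String.fromCodePoint(0x1F604);                 // true
    text === String.fromCodePoint(0xF0, 0x9F, 0x98, 0x84);  // false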
Norbert
IMO String.fromCodePoint should disallow U+D800-U+DFFF.
There's already fromCharCode that does that, and according to The Unicode Standard, isolated surrogates have no meaning on their own; the standard goes on to compare them to illegal UTF-8 sequences. IMO "CodePoint" is a 21-bit Unicode code point, explicitly not UTF-16, so it shouldn't confuse things by allowing UTF-16 (or other encoding) forms.
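To make the contrast concrete (a quick sketch; String.fromCodePoint here assumes the proposed semantics):

    // fromCharCode already takes UTF-16 code units, surrogates included:
    String.fromCharCode(0xD83D, 0xDE04); // "\uD83D\uDE04", i.e. U+1F604
    // which leaves a code-point API free to deal only in scalar values:
    String.fromCodePoint(0x1F604);       // "\uD83D\uDE04" as well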
The Unicode standard defines "code point" as any value in the range of integers from 0 to 0x10FFFF - see definitions D9 and D10 of chapter 3 [1].
Once you exclude surrogate code points, you have Unicode scalar values (definition D76), so you're basically proposing a String.fromScalarValue function. But then, why not also exclude code points that Unicode has defined as non-characters (chapter 16.7 [2])? It seems we're getting into policy-setting here, and so far ECMAScript has avoided setting policy for how you can use strings.
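Purely to illustrate the distinction (a hypothetical sketch; nothing like this is actually proposed), such a function might look like:

    // Hypothetical String.fromScalarValue: rejects surrogate code points.
    function fromScalarValue() {
      var result = "";
      for (var i = 0; i < arguments.length; i++) {
        var cp = arguments[i];
        if (cp >= 0xD800 && cp <= 0xDFFF) {
          throw new RangeError("surrogate code point 0x" + cp.toString(16).toUpperCase());
        }
        result += String.fromCodePoint(cp); // still allows noncharacters, per the above
      }
      return result;
    }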
Norbert
[1] http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
[2] http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
I was looking at D75 of 3.8, "Surrogates".
My point is that there's no "legal" scenario for converting basically a UTF-32 input to an isolated surrogate pair. No valid Unicode string could contain that. So why support it?
I don't have a good scenario at hand either that would require support for surrogate code points, but in ECMAScript the question is often asked the other way around: Why reject it? And given that there are several ways already to construct strings that are ill-formed UTF-16 (e.g., "\uD800", String.fromCodeUnit(0xD800)), it's not clear why this particular path should be blocked.
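For instance (a quick sketch of paths that exist today, independent of String.fromCodePoint):

    // Each of these yields a one-code-unit string containing a lone surrogate,
    // i.e. a string that is ill-formed as UTF-16:
    var a = "\uD800";                    // escape in a string literal
    var b = String.fromCharCode(0xD800); // fromCharCode passes code units through
    var c = "\uD83D\uDE04".charAt(0);    // indexing can split a surrogate pair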
(Sorry for letting this sit in my outbox for such a long time.)
Norbert
It doesn't make sense and is illegal Unicode, i.e., it's corrupt data. So the only reason to accept it is to allow corrupt data, perhaps as a way of passing non-Unicode data off in a Unicode context, which inevitably leads to problems, particularly on the web, where people do whatever sneaky things a developer thinks will work.
Assuming a use case for illegal Unicode were ever found, it could be added later.
There was a long discussion of this on the Unicode list recently. A surrogate code point is not illegal Unicode. It is illegal in a UTF string, but it is not illegal in a Unicode string (http://www.unicode.org/glossary/#unicode_string).
I don't want to repeat that whole long discussion here.
Mark
https://plus.google.com/114199149796022210033
"Il meglio è l'inimico del bene" ("The best is the enemy of the good")
It was suggested to me that we could probably extend String.fromCodePoint to be aware of UTF-16 code units too. It seems doable since the lead surrogate is not a valid code point.
The question is whether it is worth it. It seems like we are going down a slippery slope if we start to do things like this. Should we also handle UTF-8 code units? Maybe it is better not to do this, and instead try to get people to move away from UTF-16 code units and towards code points.