Unicode normalization problem
On Wed, Apr 1, 2015 at 2:59 PM monolithed <monolithed at gmail.com> wrote:
var text = 'ЙйЁё';
text.split(''); // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]
Possible solutions:
1. text.normalize().split('') // ["Й", "й", "Ё", "ё"]
   I like it, but it's not very convenient.
2. Array.from(text) // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]
3. [...text] // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]
Should Array.from and ...text work as the first example, and why?
Why would they imply calling normalize()? What if that wasn't desired?
Since #1 calls normalize before split(), the actual equivalents would look like this:
Array.from(text.normalize()) // ["Й", "й", "Ё", "ё"]
[...text.normalize()]        // ["Й", "й", "Ё", "ё"]
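To make the distinction concrete, here is a small sketch (assuming an ES6 engine with String.prototype.normalize and the string iterator; the escape sequences spell out the decomposed input):

// Decomposed input: И (U+0418) + combining breve (U+0306), and so on.
var text = '\u0418\u0306\u0438\u0306\u0415\u0308\u0435\u0308'; // 'ЙйЁё'

Array.from(text).length;             // 8: iteration walks code points, so the marks stay separate
Array.from(text.normalize()).length; // 4: NFC first composes each base + mark pair
text.normalize() === '\u0419\u0439\u0401\u0451'; // true (the precomposed Й й Ё ё)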
On Wed, Apr 1, 2015 at 9:17 PM, Alexander Guinness <monolithed at gmail.com> wrote:
My reasoning is based on the following example:
var text = '𝐀'; text.length; // 2 Array.from(text).length // 1
What you’re seeing there is not normalization, but rather the string iterator that automatically accounts for surrogate pairs (treating them as a single unit).
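For example, the two code units of '𝐀' are the two halves of a surrogate pair, which the iterator reassembles (a small sketch; the hex values are the standard UTF-16 encoding of U+1D400):

var text = '𝐀';                   // U+1D400, outside the Basic Multilingual Plane
text.length;                      // 2: two UTF-16 code units
text.charCodeAt(0).toString(16);  // 'd835' (high surrogate)
text.charCodeAt(1).toString(16);  // 'dc00' (low surrogate)
Array.from(text);                 // ['𝐀']: the string iterator rejoins the pair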
What you’re seeing there is not normalization, but rather the string iterator that automatically accounts for surrogate pairs (treating them as a single unit).
var foo = '𝐀';
var bar = 'Й';
foo.length; // 2
Array.from(foo).length // 1
bar.length; // 2
Array.from(foo).length // 2
I think this is strange. How to safely work with strings?
On Wed, Apr 1, 2015 at 10:30 PM, monolithed <monolithed at gmail.com> wrote:
I think this is strange. How to safely work with strings?
It depends on your use case. FWIW, I’ve outlined some examples here: mathiasbynens.be/notes/javascript
I think the concern about how people can understand, from JS, what they are seeing is more than valid ...
var foo = '𝐀';
var bar = 'Й';
foo.length; // 2
Array.from(foo).length // 1
bar.length; // 2
Array.from(foo).length // 2
Why is that and how to solve?
On 4/1/15 6:56 PM, Andrea Giammarchi wrote:
Why is that
Because those are different things. The first is a single Unicode character that happens to be represented by 2 UTF-16 code units. The second is a pair of Unicode characters that are each represented by one UTF-16 code unit, but also happen to form a single grapheme cluster (because one of them is a combining character). To complicate things further, there is also a single Unicode character that represents that same grapheme cluster....
String length shows the number of UTF-16 code units.
Array.from works on Unicode characters. That explains the foo.length and Array.from(foo).length results.
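To spell out the three cases just described (a sketch using escape sequences so the encodings are visible; only standard ES6 behaviour is assumed):

var astral      = '\uD835\uDC00'; // '𝐀': one code point, two UTF-16 code units
var decomposed  = '\u0418\u0306'; // 'Й' as И + combining breve: two code points, one grapheme cluster
var precomposed = '\u0419';       // 'Й' as a single precomposed code point

astral.length;                    // 2
Array.from(astral).length;        // 1
decomposed.length;                // 2
Array.from(decomposed).length;    // 2
precomposed.length;               // 1
Array.from(precomposed).length;   // 1

decomposed.normalize() === precomposed; // true: NFC composes the pair into U+0419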
and how to solve?
Can you clearly explain what problem you are trying to solve?
foo.length; // 2
Array.from(foo).length // 1
bar.length; // 2
Array.from(bar).length // 2
I already know everything you wrote ... now, how to explain it to JS users out there, and how to solve it?

And now I also hope that Array.from(foo).length // 2 wasn't there by accident, instead of bar ...
Unfortunately we don't have a String#codepoints or something that would return the number of code points as opposed to the number of characters (that "length" returns) - something like that imo would greatly simplify explaining the differences to people.
For the time being, I've been explaining that some characters are actually made up of two, and the 💩 character (it's a fun example to use) is an example of two characters combining to make one "code point". It's not a quick or trivial thing to explain but people do seem to grasp it eventually.
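In the meantime, such a counter is easy to build in user land on an ES6 engine (codePointCount is a hypothetical name used here for illustration, not a built-in or a proposal):

// Hypothetical helper: counts code points rather than UTF-16 code units.
function codePointCount(string) {
  return Array.from(string).length; // the string iterator walks code points
}

codePointCount('💩'); // 1: U+1F4A9 is a single code point stored as two code units
'💩'.length;          // 2: length counts UTF-16 code units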
Jordan, the purpose of Array.from is to iterate over the string, and the point of iterating instead of splitting is to get code points automagically. This, unless I've misunderstood Mathias's presentation (might be).

So, here there is a different problem: there are code points that have no real visual representation of their own ... or maybe the real problem is a broken Array.from polyfill?

I wouldn't be surprised in such case ;-)
On Thu, Apr 2, 2015 at 1:39 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:
Jordan, the purpose of Array.from is to iterate over the string, and the point of iterating instead of splitting is to get code points automagically. This, unless I've misunderstood Mathias's presentation (might be). So, here there is a different problem: there are code points that have no real visual representation of their own ...
Those are called grapheme clusters or just “graphemes”, as Boris mentioned. And here’s how to deal with them: mathiasbynens.be/notes/javascript-unicode#other-grapheme-clusters
“Unicode Standard Annex #29 describes an algorithm for determining grapheme cluster boundaries. For a completely accurate solution that works for all Unicode scripts, implement this algorithm in JavaScript, and then count each grapheme cluster as a single symbol.”
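As a rough illustration only (this is not the UAX #29 algorithm; it merely keeps combining diacritical marks, U+0300 through U+036F, attached to the preceding code point, which happens to cover the Cyrillic example above):

// Rough sketch, not a full UAX #29 implementation: it only reattaches
// combining diacritical marks (U+0300-U+036F) to the preceding code point.
function splitGraphemesApprox(string) {
  var codePoints = Array.from(string); // surrogate pairs are already handled here
  var clusters = [];
  for (var i = 0; i < codePoints.length; i++) {
    var cluster = codePoints[i];
    while (i + 1 < codePoints.length && /^[\u0300-\u036F]$/.test(codePoints[i + 1])) {
      cluster += codePoints[++i]; // glue the combining mark onto its base
    }
    clusters.push(cluster);
  }
  return clusters;
}

splitGraphemesApprox('\u0418\u0306\u0438\u0306').length; // 2 ('Йй' written with combining marks)
splitGraphemesApprox('\uD835\uDC00').length;             // 1 ('𝐀')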
or maybe the real problem is a broken Array.from polyfill?
Array.from just uses String.prototype[Symbol.iterator] internally, and that is defined to deal with code points, not grapheme clusters. Either choice would have confused some developers. IIRC, Perl 6 has built-in capabilities to deal with grapheme clusters, but until ES does, this use case must be addressed in user-land.
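A quick way to see what that iterator does and does not regroup (a small sketch; the values in the comments follow the ES6 spec behaviour):

// Array.from and spread both consume String.prototype[Symbol.iterator],
// which yields code points, not grapheme clusters.
var cluster = '\u0418\u0306';              // 'Й' written as И + combining breve
var iterator = cluster[Symbol.iterator]();

iterator.next(); // { value: 'И', done: false }
iterator.next(); // { value: '̆', done: false } (the combining mark is its own code point)
iterator.next(); // { value: undefined, done: true }

Array.from(cluster).length; // 2
[...cluster].length;        // 2 (same iterator underneath)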
On Apr 2, 2015, at 01:22, Jordan Harband <ljharb at gmail.com> wrote:
Unfortunately we don't have a String#codepoints or something that would return the number of code points as opposed to the number of characters (that "length" returns) - something like that imo would greatly simplify explaining the differences to people.
For the time being, I've been explaining that some characters are actually made up of two, and the 💩 character (it's a fun example to use) is an example of two characters combining to make one "code point". It's not a quick or trivial thing to explain but people do seem to grasp it eventually.
And when they think to have understood, they are in fact still in great trouble, because they will confuse it with other unrelated issues like grapheme clusters and/or precomposed characters.
The issue here is specific to the UTF-16 encoding, where some Unicode code points are encoded as a sequence of two 16-bit units; and ES strings are (by an accident of history) sequences of 16-bit units, not Unicode code points. I think it is important to stress that it is an issue of encoding, at least in order to have a chance to distinguish it from the other aforementioned issues.
(So, taking your example, the 💩 character is internally represented as a sequence of two 16-bit-units, not “characters”. And, very confusingly, the String methods that contain “char” in their name have nothing to do with “characters”.)
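For instance (all standard ES5/ES6 methods; the hex values are the UTF-16 encoding of U+1F4A9):

var poo = '💩'; // U+1F4A9: one code point, stored as two 16-bit code units

poo.length;                       // 2 (counts code units, not characters)
poo.charCodeAt(0).toString(16);   // 'd83d' (high surrogate, a code unit)
poo.charCodeAt(1).toString(16);   // 'dca9' (low surrogate)
poo.codePointAt(0).toString(16);  // '1f4a9' (the actual code point, ES6)
String.fromCodePoint(0x1F4A9) === poo; // true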
It was the 90s, when 16 bits seemed enough. Wish we could go back. Even in 1995 this was obviously going to fail, but the die had been cast years earlier in Windows and Java APIs and language/implementation designs.