Unicode normalization problem

# monolithed (10 years ago)
var text = 'ЙйЁё';

text.split(''); // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]

Possible solutions:

text.normalize().split('') // ["Й", "й", "Ё", "ё"]

I like it, but it is not so convenient

Array.from(text) // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]
[...text] // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]

Should the Array.from and ...text work as the first example and why?

Test example

# Rick Waldron (10 years ago)

On Wed, Apr 1, 2015 at 2:59 PM monolithed <monolithed at gmail.com> wrote:

var text = 'ЙйЁё';

text.split(''); // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]

Possible solutions:

text.normalize().split('') // ["Й", "й", "Ё", "ё"]

I like it, but it is not so convenient

Array.from(text) // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]
[...text] // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]

Should the Array.from and ...text work as the first example and why?

Why would they imply calling normalize()? What if that wasn't desired?

Since #1 calls normalize before split(), the actual equivalents would look like this:

Array.from(text.normalize()) // [ "Й", "й", "Ё", "ё" ]
[...text.normalize()] // [ "Й", "й", "Ё", "ё" ]
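A minimal sketch of the same point with explicit escapes, so the result does not depend on how the literal was typed. It assumes the original string was in decomposed form (base letter followed by a combining mark), which the split('') output suggests; the variable names are only illustrative.

```javascript
// Assumption: 'Й' stored decomposed, U+0418 followed by U+0306 (combining breve).
var decomposed = '\u0418\u0306';

// normalize() defaults to NFC, which composes the pair into U+0419.
var composed = decomposed.normalize();

var before = Array.from(decomposed); // base letter and combining mark, separately
var after = Array.from(composed);    // the single precomposed letter

console.log(before.length, after.length); // 2 1
console.log(composed === '\u0419');       // true
```

Neither Array.from nor spread changes the string's contents; any composition happens only in the explicit normalize() call.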

# Alexander Guinness (10 years ago)

My reasoning is based on the following example:

var text = '𝐀';

text.length; // 2

Array.from(text).length // 1

# Mathias Bynens (10 years ago)

On Wed, Apr 1, 2015 at 9:17 PM, Alexander Guinness <monolithed at gmail.com> wrote:

My reasoning is based on the following example:

var text = '𝐀';

text.length; // 2

Array.from(text).length // 1

What you’re seeing there is not normalization, but rather the string iterator that automatically accounts for surrogate pairs (treating them as a single unit).

# monolithed (10 years ago)

What you’re seeing there is not normalization, but rather the string iterator that automatically accounts for surrogate pairs (treating them as a single unit).

var foo = '𝐀';
var bar = 'Й';
foo.length; // 2
Array.from(foo).length // 1

bar.length; // 2
Array.from(foo).length // 2

I think this is strange. How to safely work with strings?

# Mathias Bynens (10 years ago)

On Wed, Apr 1, 2015 at 10:30 PM, monolithed <monolithed at gmail.com> wrote:

What you’re seeing there is not normalization, but rather the string iterator that automatically accounts for surrogate pairs (treating them as a single unit).

var foo = '𝐀';
var bar = 'Й';
foo.length; // 2
Array.from(foo).length // 1

bar.length; // 2
Array.from(foo).length // 2

I think this is strange. How to safely work with strings?

It depends on your use case. FWIW, I’ve outlined some examples here: mathiasbynens.be/notes/javascript

# Andrea Giammarchi (10 years ago)

I think the concern about how people can make sense of what they see in JS is more than valid ...

var foo = '𝐀';
var bar = 'Й';
foo.length; // 2
Array.from(foo).length // 1

bar.length; // 2
Array.from(foo).length // 2

Why is that and how to solve?

# Boris Zbarsky (10 years ago)

On 4/1/15 6:56 PM, Andrea Giammarchi wrote:

Why is that

Because those are different things. The first is a single Unicode character that happens to be represented by 2 UTF-16 code units. The second is a pair of Unicode characters that are each represented by one UTF-16 code unit, but also happen to form a single grapheme cluster (because one of them is a combining character). To complicate things further, there is also a single Unicode character that represents that same grapheme cluster....

String length shows the number of UTF-16 code units.

Array.from works on Unicode characters. That explains the foo.length and Array.from(foo).length results.
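As a rough illustration of the three levels described above, the sketch below counts code units, code points, and code points after NFC composition. It assumes bar is in decomposed form, and describe is a hypothetical helper introduced only for this example.

```javascript
var foo = '\uD835\uDC00'; // one code point (U+1D400), two UTF-16 code units
var bar = '\u0418\u0306'; // two code points: U+0418 plus combining U+0306

// Hypothetical helper: reports the same string measured three ways.
function describe(s) {
  return {
    codeUnits: s.length,                                      // UTF-16 code units
    codePoints: Array.from(s).length,                         // via the string iterator
    codePointsAfterNFC: Array.from(s.normalize('NFC')).length // after composition
  };
}

console.log(describe(foo)); // codeUnits 2, codePoints 1, codePointsAfterNFC 1
console.log(describe(bar)); // codeUnits 2, codePoints 2, codePointsAfterNFC 1
```

The two strings agree on length for unrelated reasons: foo is an encoding matter (a surrogate pair), while bar is a composition matter (a combining sequence that happens to have a precomposed equivalent).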

and how to solve?

Can you clearly explain what problem you are trying to solve?

# Andrea Giammarchi (10 years ago)

foo.length; // 2
Array.from(foo).length // 1

bar.length; // 2
Array.from(bar).length // 2

I already know everything you wrote ... now, how do we explain it to JS users out there, and how do we solve it?

# Andrea Giammarchi (10 years ago)

and now I'm also going to hope that Array.from(foo).length // 2 wasn't by accident, instead of bar ...

# Jordan Harband (10 years ago)

Unfortunately we don't have a String#codepoints or something that would return the number of code points as opposed to the number of characters (that "length" returns) - something like that imo would greatly simplify explaining the differences to people.

For the time being, I've been explaining that some characters are actually made up of two, and the 💩 character (it's a fun example to use) is an example of two characters combining to make one "code point". It's not a quick or trivial thing to explain but people do seem to grasp it eventually.

# Andrea Giammarchi (10 years ago)

Jordan, the purpose of Array.from is to iterate over the string, and the point of iterating instead of splitting is to get code points automagically. This, unless I've misunderstood Mathias' presentation (might be)

So, here there is a different problem: there are code-points that do not represent real visual representation ... or maybe, the real problem, is about broken Array.from polyfill?

I wouldn't be surprised in that case ;-)

# Mathias Bynens (10 years ago)

On Thu, Apr 2, 2015 at 1:39 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:

Jordan, the purpose of Array.from is to iterate over the string, and the point of iterating instead of splitting is to get code points automagically. This, unless I've misunderstood Mathias' presentation (might be)

So, here there is a different problem: there are code-points that do not represent real visual representation ...

Those are called grapheme clusters or just “graphemes”, as Boris mentioned. And here’s how to deal with them: mathiasbynens.be/notes/javascript-unicode#other-grapheme-clusters

“Unicode Standard Annex #29 describes an algorithm for determining grapheme cluster boundaries. For a completely accurate solution that works for all Unicode scripts, implement this algorithm in JavaScript, and then count each grapheme cluster as a single symbol.”

or maybe, the real problem, is about broken Array.from polyfill?

Array.from just uses String.prototype[Symbol.iterator] internally, and that is defined to deal with code points, not grapheme clusters. Either choice would have confused some developers. IIRC, Perl 6 has built-in capabilities to deal with grapheme clusters, but until ES does, this use case must be addressed in user-land.
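For readers coming to this thread later: a hedged sketch of grapheme-aware splitting using Intl.Segmenter, an API that did not exist when this was written and is assumed here to be available in the runtime. The graphemes helper is hypothetical, and this is one possible approach rather than the user-land implementation referred to above.

```javascript
// Assumption: the runtime provides Intl.Segmenter.
// graphemes() is a hypothetical helper, not part of any standard library.
function graphemes(text) {
  var segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, input } records.
  return Array.from(segmenter.segment(text), function (part) {
    return part.segment;
  });
}

var bar = '\u0418\u0306'; // decomposed 'Й': base letter plus combining breve

console.log(Array.from(bar).length); // 2 (code points)
console.log(graphemes(bar).length);  // 1 (grapheme cluster)
```

Unlike normalize(), this leaves the string's code points untouched and only changes where the boundaries are drawn.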

# Claude Pache (10 years ago)

Le 2 avr. 2015 à 01:22, Jordan Harband <ljharb at gmail.com> a écrit :

Unfortunately we don't have a String#codepoints or something that would return the number of code points as opposed to the number of characters (that "length" returns) - something like that imo would greatly simplify explaining the differences to people.

For the time being, I've been explaining that some characters are actually made up of two, and the 💩 character (it's a fun example to use) is an example of two characters combining to make one "code point". It's not a quick or trivial thing to explain but people do seem to grasp it eventually.

And when they think they have understood, they are in fact still in great trouble, because they will confuse it with other unrelated issues like grapheme clusters and/or precomposed characters.

The issue here is specific to the UTF-16 encoding, where some Unicode code points are encoded as a sequence of two 16-bit units; and ES strings are (by an accident of history) sequences of 16-bit units, not Unicode code points. I think it is important to stress that it is an issue of encoding, at least in order to have a chance to distinguish it from the other aforementioned issues.

(So, taking your example, the 💩 character is internally represented as a sequence of two 16-bit-units, not “characters”. And, very confusingly, the String methods that contain “char” in their name have nothing to do with “characters”.)
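A small sketch of that last point, with the character written as escapes; the variable name is only illustrative. The "char" methods index 16-bit units, while codePointAt and String.fromCodePoint work with whole code points.

```javascript
var pile = '\uD83D\uDCA9'; // U+1F4A9 written as its UTF-16 surrogate pair

// "char" methods operate on single 16-bit units, not characters.
console.log(pile.charAt(0) === '\uD83D');     // true: a lone high surrogate
console.log(pile.charCodeAt(0).toString(16)); // "d83d"
console.log(pile.charCodeAt(1).toString(16)); // "dca9"

// Code-point-aware counterparts added in ES2015.
console.log(pile.codePointAt(0).toString(16));       // "1f4a9"
console.log(String.fromCodePoint(0x1F4A9) === pile); // true
```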

# Brendan Eich (10 years ago)

It was the 90s, when 16 bits seemed enough. Wish we could go back. Even in 1995 this was obviously going to fail, but the die had been cast years earlier in Windows and Java APIs and language/implementation designs.