Unicode normalization problem

# monolithed (10 years ago)

```js
var text = 'ЙйЁё';

text.split(''); // ["И", "\u0306", "и", "\u0306", "Е", "\u0308", "е", "\u0308"]
```

Possible solutions:

1.

```js
text.normalize().split('') // ["Й", "й", "Ё", "ё"]
```

I like it, but it is not very convenient.

2.

```js
Array.from(text) // ["И", "\u0306", "и", "\u0306", "Е", "\u0308", "е", "\u0308"]
```

3.

```js
[...text] // ["И", "\u0306", "и", "\u0306", "Е", "\u0308", "е", "\u0308"]
```


Should `Array.from` and `...text` work like the first solution, and why?

[Test example](http://jsbin.com/baguhiguja/1/edit?js,output)

# Rick Waldron (10 years ago)

On Wed, Apr 1, 2015 at 2:59 PM monolithed <monolithed at gmail.com> wrote:

> Should `Array.from` and `...text` work like the first solution, and why?
>

Why would they imply calling `normalize()`? What if that wasn't desired?

Since #1 calls `normalize()` before `split()`, the actual equivalents would look
like this:

    Array.from(text.normalize()) // [ "Й", "й", "Ё", "ё" ]
    [...text.normalize()] // [ "Й", "й", "Ё", "ё" ]

Rick

# Alexander Guinness (10 years ago)

My reasoning is based on the following example:

```js
var text = '𝐀';

text.length; // 2

Array.from(text).length // 1
```


# Mathias Bynens (10 years ago)

On Wed, Apr 1, 2015 at 9:17 PM, Alexander Guinness <monolithed at gmail.com> wrote:
> My reasoning is based on the following example:
>
> ```js
> var text = '𝐀';
>
> text.length; // 2
>
> Array.from(text).length // 1
> ```

What you’re seeing there is not normalization, but rather the string
iterator that automatically accounts for surrogate pairs (treating
them as a single unit).
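
A minimal sketch of that difference, assuming the test string is U+1D400 (MATHEMATICAL BOLD CAPITAL A) written with an escape; the variable names are illustrative only:

```js
const text = '\u{1D400}'; // one code point outside the BMP

// split('') cuts between UTF-16 code units, leaving two lone surrogates.
const units = text.split('');

// The string iterator (used by Array.from, spread and for...of)
// keeps the surrogate pair together.
const points = Array.from(text);

console.log(units.length);          // 2
console.log(points.length);         // 1
console.log(units[0] === '\uD835'); // true (high surrogate)
console.log(units[1] === '\uDC00'); // true (low surrogate)
console.log(points[0] === text);    // true
```

No normalization is involved in either call; only the unit of iteration differs.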

# monolithed (10 years ago)

> What you’re seeing there is not normalization, but rather the string
> iterator that automatically accounts for surrogate pairs (treating them as
> a single unit).

```js
var foo = '𝐀';
var bar = 'Й';
foo.length; // 2
Array.from(foo).length // 1

bar.length; // 2
Array.from(foo).length // 2
```

I think this is strange.
How to safely work with strings?



# Mathias Bynens (10 years ago)

On Wed, Apr 1, 2015 at 10:30 PM, monolithed <monolithed at gmail.com> wrote:
>> What you’re seeing there is not normalization, but rather the string
>> iterator that automatically accounts for surrogate pairs (treating them as a
>> single unit).
>
> ```js
> var foo = '𝐀';
> var bar = 'Й';
> foo.length; // 2
> Array.from(foo).length // 1
>
> bar.length; // 2
> Array.from(foo).length // 2
> ```
>
> I think this is strange.
> How to safely work with strings?

It depends on your use case. FWIW, I’ve outlined some examples here:
https://mathiasbynens.be/notes/javascript-unicode

# Andrea Giammarchi (10 years ago)

I think the concern about how people can make sense, from within JS, of what
they are seeing is more than valid ...

```js
var foo = '𝐀';
var bar = 'Й';
foo.length; // 2
Array.from(foo).length // 1

bar.length; // 2
Array.from(foo).length // 2
```

Why is that and how to solve?



# Boris Zbarsky (10 years ago)

On 4/1/15 6:56 PM, Andrea Giammarchi wrote:
> Why is that

Because those are different things.  The first is a single Unicode 
character that happens to be represented by 2 UTF-16 code units.  The 
second is a pair of Unicode characters that are each represented by one 
UTF-16 code unit, but also happen to form a single grapheme cluster 
(because one of them is a combining character).  To complicate things 
further, there is also a single Unicode character that represents that 
same grapheme cluster....

String length shows the number of UTF-16 code units.

Array.from works on Unicode characters.  That explains the foo.length 
and Array.from(foo).length results.
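
The three cases described above can be laid side by side. This is only a sketch: the escapes assume the thread's examples are U+1D400 and the letter Й in its decomposed (U+0418 U+0306) and precomposed (U+0419) spellings, and `counts` is a hypothetical helper.

```js
// Hypothetical helper: UTF-16 code units vs. code points for a string.
function counts(str) {
  return { codeUnits: str.length, codePoints: Array.from(str).length };
}

const astral = '\u{1D400}';        // one code point, two code units
const decomposed = '\u0418\u0306'; // two code points, one grapheme cluster
const precomposed = '\u0419';      // one code point, same grapheme cluster

console.log(counts(astral));      // { codeUnits: 2, codePoints: 1 }
console.log(counts(decomposed));  // { codeUnits: 2, codePoints: 2 }
console.log(counts(precomposed)); // { codeUnits: 1, codePoints: 1 }

// The two spellings differ as strings but agree after normalization.
console.log(decomposed === precomposed);                  // false
console.log(decomposed.normalize('NFC') === precomposed); // true
console.log(precomposed.normalize('NFD') === decomposed); // true
```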

> and how to solve?

Can you clearly explain what problem you are trying to solve?

-Boris

# Andrea Giammarchi (10 years ago)

```js
foo.length; // 2
Array.from(foo).length // 1

bar.length; // 2
Array.from(bar).length // 2
```

I already know everything you wrote ... now, how do we explain it to JS users
out there, and how do we solve it?


# Andrea Giammarchi (10 years ago)

And now I'm also going to hope that `Array.from(foo).length // 2` wasn't by
accident, instead of `bar` ...


# Jordan Harband (10 years ago)

Unfortunately we don't have a String#codepoints or something that would
return the number of code points as opposed to the number of characters
(that "length" returns) - something like that imo would greatly simplify
explaining the differences to people.

For the time being, I've been explaining that some characters are actually
made up of two, and the 💩 character (it's a fun example to use) is an
example of two characters combining to make one "code point". It's not a
quick or trivial thing to explain but people do seem to grasp it eventually.
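
In the absence of such a built-in, a code point count can be sketched on top of the string iterator. `codePoints` and `countCodePoints` are hypothetical names, not existing or proposed methods:

```js
// Hypothetical generator: yields the numeric code point of each iterated element.
function* codePoints(str) {
  for (const ch of str) {
    yield ch.codePointAt(0);
  }
}

// Hypothetical helper: number of code points (not UTF-16 code units).
function countCodePoints(str) {
  let n = 0;
  for (const _ of str) n += 1;
  return n;
}

const poo = '\u{1F4A9}'; // PILE OF POO
console.log(poo.length);                 // 2 (UTF-16 code units)
console.log(countCodePoints(poo));       // 1
console.log([...codePoints('a' + poo)]); // [ 97, 128169 ]
```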


# Andrea Giammarchi (10 years ago)

Jordan, the purpose of `Array.from` is to iterate over the string, and the
point of iterating instead of splitting is to get code points automagically.
That is, unless I've misunderstood Mathias' presentation (I might have).

So here there is a different problem: there are code points that do not
match what is visually displayed as one character ... or maybe the real
problem is a broken `Array.from` polyfill?

I wouldn't be surprised in that case ;-)


# Mathias Bynens (10 years ago)

On Thu, Apr 2, 2015 at 1:39 AM, Andrea Giammarchi
<andrea.giammarchi at gmail.com> wrote:
> Jordan, the purpose of `Array.from` is to iterate over the string, and the point of iterating instead of splitting is to get code points automagically. That is, unless I've misunderstood Mathias' presentation (I might have).
>
> So here there is a different problem: there are code points that do not match what is visually displayed as one character ...

Those are called grapheme clusters or just “graphemes”, as Boris
mentioned. And here’s how to deal with them:
https://mathiasbynens.be/notes/javascript-unicode#other-grapheme-clusters

“Unicode Standard Annex #29 describes [an algorithm for determining
grapheme cluster
boundaries](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
For a _completely_ accurate solution that works for all Unicode
scripts, implement this algorithm in JavaScript, and then count each
grapheme cluster as a single symbol.”

> or maybe the real problem is a broken `Array.from` polyfill?

`Array.from` just uses `String.prototype[Symbol.iterator]` internally,
and that is defined to deal with code points, not grapheme clusters.
Either choice would have confused some developers. IIRC, Perl 6 has
built-in capabilities to deal with grapheme clusters, but until ES
does, this use case must be addressed in user-land.
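
For what it's worth, engines that ship `Intl.Segmenter` (part of ECMA-402, and newer than this thread) expose grapheme segmentation directly. A minimal sketch, assuming such an engine; `graphemes` is a hypothetical helper name:

```js
// Hypothetical helper: split a string into grapheme clusters.
// Assumes Intl.Segmenter is available in the running engine.
function graphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

const text = '\u0418\u0306\u{1D400}'; // decomposed Й followed by U+1D400

console.log(text.length);            // 4 code units
console.log([...text].length);       // 3 code points
console.log(graphemes(text).length); // 2 grapheme clusters
```

Where `Intl.Segmenter` is missing, a user-land implementation of the UAX #29 rules is still needed.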

# Claude Pache (10 years ago)

> On 2 Apr 2015, at 01:22, Jordan Harband <ljharb at gmail.com> wrote:
> 
> Unfortunately we don't have a String#codepoints or something that would return the number of code points as opposed to the number of characters (that "length" returns) - something like that imo would greatly simplify explaining the differences to people.
> 
> For the time being, I've been explaining that some characters are actually made up of two, and the 💩 character (it's a fun example to use) is an example of two characters combining to make one "code point". It's not a quick or trivial thing to explain but people do seem to grasp it eventually.

And when they think they have understood, they are in fact still in great trouble, because they will confuse it with other unrelated issues like grapheme clusters and/or precomposed characters.

The issue here is specific to the UTF-16 encoding, where some Unicode code points are encoded as a sequence of two 16-bit units; and ES strings are (by an accident of history) sequences of 16-bit units, not Unicode code points. I think it is important to stress that it is an issue of encoding, at least in order to have a chance to distinguish it from the other aforementioned issues.

(So, taking your example, the 💩 character is internally represented as a sequence of two 16-bit-units, not “characters”. And, very confusingly, the String methods that contain “char” in their name have nothing to do with “characters”.)
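
The naming trap is easy to see in a console. A small sketch, assuming the example character is U+1F4A9:

```js
const poo = '\u{1F4A9}';

// The "char" methods operate on 16-bit code units, so they see surrogates.
console.log(poo.charCodeAt(0).toString(16)); // d83d (high surrogate)
console.log(poo.charCodeAt(1).toString(16)); // dca9 (low surrogate)
console.log(poo.charAt(0) === poo);          // false: a lone surrogate

// The code point methods see the whole character.
console.log(poo.codePointAt(0).toString(16));             // 1f4a9
console.log(String.fromCodePoint(0x1F4A9) === poo);       // true
console.log(String.fromCharCode(0xD83D, 0xDCA9) === poo); // true
```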

—Claude

# Brendan Eich (10 years ago)

It was the 90s, when 16 bits seemed enough. Wish we could go back. Even 
in 1995 this was obviously going to fail, but the die had been cast 
years earlier in Windows and Java APIs and language/implementation designs.

/be

Claude Pache wrote:
> (So, taking your example, the 💩 character is internally represented as a sequence of two 16-bit-units, not “characters”. And, very confusingly, the String methods that contain “char” in their name have nothing to do with “characters”.)
>
> —Claude