How to count the number of symbols in a string?

# Mathias Bynens (13 years ago)

ECMAScript 6 introduces some useful new features that make working with astral Unicode symbols easier.

One thing that is still missing though (AFAIK) is an easy way to count the number of symbols / code points in a given string. As you know, we can’t rely on `String.prototype.length` here, as a string containing nothing but an astral symbol has a length of `2` instead of `1`:

> var poo = '\u{1F4A9}'; // U+1F4A9 PILE OF POO
> poo.length
2

Of course it’s possible to write some code yourself to loop over all the code units in the string, handle surrogate pairs, and increment a counter manually for each full code point, but that’s a pain.
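
For reference, a minimal sketch of that manual loop. The name `countCodePoints` is made up for illustration, and lone surrogates are counted as one symbol each, which is an assumption rather than anything the thread settles.

```javascript
// Count code points by walking UTF-16 code units and pairing surrogates.
// `countCodePoints` is an illustrative name, not a proposed API.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A high surrogate (0xD800-0xDBFF) followed by a low surrogate
    // (0xDC00-0xDFFF) encodes a single astral code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate; the pair counts once
      }
    }
    count++;
  }
  return count;
}

var poo = '\uD83D\uDCA9'; // U+1F4A9 written as a surrogate pair
console.log(poo.length);           // 2
console.log(countCodePoints(poo)); // 1
```

It works, but every project that needs the count has to carry a helper like this around.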

It would be useful to have a new property on `String.prototype` that would return the number of Unicode symbols in the string. Something like `realLength` (of course, it needs a better name, but you get the idea):

> poo.realLength
1

Another possible solution is to add something like `String.prototype.codePoints` which would be an array of the numerical code point values in the string. That way, getting the length is only a matter of accessing the `length` property of the array:

> poo.codePoints
[ 0x1F4A9 ]
> poo.codePoints.length
1

Or perhaps this would be better suited as a method?

> poo.getCodePoints()
[ 0x1F4A9 ]
> poo.getCodePoints().length
1
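
A rough sketch of what such a method could return, written as a standalone function. It assumes the string iterator and `String.prototype.codePointAt` as they later shipped in ES2015; `getCodePoints` itself remains hypothetical and is not part of any specification.

```javascript
// Hypothetical stand-in for the proposed `poo.getCodePoints()`.
function getCodePoints(str) {
  const points = [];
  // for...of walks the string by code point, so each `symbol` is a
  // one- or two-code-unit string holding exactly one code point.
  for (const symbol of str) {
    points.push(symbol.codePointAt(0));
  }
  return points;
}

const poo = '\u{1F4A9}';
console.log(getCodePoints(poo));        // [ 128169 ], i.e. 0x1F4A9
console.log(getCodePoints(poo).length); // 1
```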

Has anything like this been considered/discussed here yet?

# Andrea Giammarchi (13 years ago)

Already raised a while ago ...

https://jp.twitter.com/WebReflection/status/260479508912685056

No answer. If I remember correctly, Brendan said that `.size()` or `.size` is not a good name, but I have suggested `.points` too and nobody came back on this.

# Yusuke Suzuki (13 years ago)

I remember that the String object iterator produces the sequence of Unicode characters: http://wiki.ecmascript.org/doku.php?id=harmony:iterators#string_iterators

So I think we can get the code points by using an array comprehension:

var points = [ch for ch of string];

Is that right?
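
Array comprehensions did not make it into the final ES2015 specification, so the line above will not run in current engines. The string iterator did ship, though, and `Array.from` gives roughly the same result. Note that the elements are one-symbol strings, not numeric code point values.

```javascript
const string = 'a\u{1F4A9}b';

// Array.from consumes the string iterator, which yields one string per
// code point rather than one per UTF-16 code unit.
const points = Array.from(string);

console.log(points.length); // 3
console.log(string.length); // 4
```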

# Andrea Giammarchi (13 years ago)

good point(s), still this should be a native String.prototype or String method, IMHO

# Mathias Bynens (13 years ago)

Thanks for the useful info, Andrea and Yusuke!

A `{code}points` property/getter on the `String.prototype` like Andrea suggested earlier sure sounds good to me.

IMHO a solution to the “string length” problem should definitely be added to ES6, as even major sites like Twitter.com are doing it wrong currently. Here’s a screenshot of the tweet textarea containing only a single astral symbol: http://i.imgur.com/IlFxj.png

Note how the counter below says “138 [characters left]” instead of 139, as it should.

# Phillips, Addison (13 years ago)

One question would be what you’d want that specific number for? The number of code points in a string is only marginally interesting in a script. It doesn’t, for example, tell you how many screen positions the text consumes (that’s the grapheme count).

Norbert’s proposal [1] includes an iterator over the code points (so counting the code points is straightforward, but not a property of the string itself, or at least I don’t see it anywhere).
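
To make the code point versus grapheme distinction concrete, here is a small sketch. The grapheme count uses `Intl.Segmenter`, an API that arrived long after this thread and is assumed to be present; the helper `countGraphemes` is an illustrative name and returns `null` where the API is missing.

```javascript
// 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as one symbol.
const text = 'e\u0301';

console.log(text.length);      // 2 code units
console.log([...text].length); // 2 code points

// Intl.Segmenter is assumed to be available; it is not in every engine.
function countGraphemes(str) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return null; // no grapheme segmentation available
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

console.log(countGraphemes(text)); // 1 where Intl.Segmenter exists
```

So even a correct code point count overstates the number of on-screen symbols as soon as combining marks are involved.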

Addison

[1] http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

# Norbert Lindenberg (13 years ago)

There's nothing in the proposal yet because I intentionally kept it small. It's always possible to add functionality, but we need some evidence that it will be widely used. Pointing at Twitter doesn't quite help - it's possible that the number that Twitter shows reflects some limitation in their back-end systems.

Thanks for bringing the issue to this list, btw - tweets aren't as effective in getting TC39 attention.

Norbert

# Andrea Giammarchi (13 years ago)

Sanitizing, I would say, is the very first use case: if `str.length != str.points`, something might require a fix.

A UTF-8-friendly "number of allowed chars", as in the Twitter case, is another example.

A split that works on code points rather than chars would need the number of points too. The fact that developers are already asking for a way to obtain these code points should also indicate that the feature might be needed.
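
A sketch of how those use cases might look with today's string iterator. All three helper names are hypothetical, and `str.points` from the message is modelled as a function because no such property exists.

```javascript
// Hypothetical helpers; `str.points` from the message is modelled as
// a function because no such property exists.
function points(str) {
  return [...str].length;
}

// True when the string contains at least one surrogate pair, i.e. when
// the code unit count and the code point count disagree.
function hasAstralSymbols(str) {
  return str.length !== points(str);
}

// Keep at most `max` code points without cutting a surrogate pair in half.
function truncateToPoints(str, max) {
  return [...str].slice(0, max).join('');
}

const input = 'ab\u{1F4A9}cd';
console.log(hasAstralSymbols(input));    // true
console.log(truncateToPoints(input, 3)); // 'ab' followed by U+1F4A9
console.log(input.slice(0, 3).length);   // 3, but the last unit is a lone high surrogate
```

The last line shows why a plain `slice` is risky for an "allowed chars" limit: it can leave half a surrogate pair behind.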

Thoughts?

# Jussi Kalliokoski (13 years ago)

On Fri, Nov 30, 2012 at 10:39 PM, Yusuke Suzuki <utatane.tea at gmail.com> wrote:

> I remember that String object iterator produces the sequence of Unicode
> characters.
> http://wiki.ecmascript.org/doku.php?id=harmony:iterators#string_iterators
>
> So I think we can get code points by using array comprehension,
> var points = [ch for ch of string];
>
> Is it right? > all
>

So verbose, ugh.

var points = [...string]

Partly kidding here. :D
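
Spread syntax did ship in ES2015 and goes through the same string iterator, so the short form can be wrapped into a one-line counter. `countSymbols` is an illustrative name only.

```javascript
// Spread expands the string one code point at a time.
const countSymbols = (string) => [...string].length;

console.log(countSymbols('\u{1F4A9}'));    // 1
console.log(countSymbols('abc'));          // 3
console.log(countSymbols('\u{1F4A9}abc')); // 4
```

The price is a temporary array per call, which may matter for very long strings.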

Cheers,
Jussi


# Phillips, Addison (13 years ago)

Andrea wrote:

> to sanitize, I would say, is the very first use case where if str.length != str.points something might require a fix.
>
> A utf-8 friendly "number of allowed chars", as it would be the twitter case, is another example.
>
> A split able to represent codePoints rather than chars would need points number too ... the fact developers are already asking for a way to obtain these codePoints should also indicate the feature might be needed.
>
> Thoughts?

I'm not saying that there is no utility at all. I'm just pointing out that the code point count doesn't represent, e.g., the "number of symbols" in the string, and that adding it for that purpose might not achieve the aim of someone who wants to know how many symbols will appear.

The Twitter case is an interesting one. Ditto string splitting. 

Addison

# Mathias Bynens (13 years ago)

On 30 Nov 2012, at 22:50, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:

> There's nothing in the proposal yet because I intentionally kept it small. It's always possible to add functionality, but we need some evidence that it will be widely used.

My guess would be that in 99% of all cases where `String.prototype.length` is used the intention is to count the code points, not the UCS-2/UTF-16 code units. As for evidence:

> Pointing at Twitter doesn't quite help - it's possible that the number that Twitter shows reflects some limitation in their back-end systems.

That’s the thing — the Twitter back-end does the right thing and doesn’t discriminate between BMP and astral symbols. Each symbol counts as a single “character” towards the 140 character limit. You can post a tweet consisting of 140 astral symbols just fine as long as you use a Twitter client that supports it.

The behavior you’re seeing in the Twitter web client is a bug. They’re simply getting the `length` of the input string rather than accounting for surrogate halves and counting the actual full code points.
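
The difference between the two counters can be sketched as follows. The function names are illustrative, and the first one only mirrors the behavior described above; it is not taken from Twitter's actual code.

```javascript
const LIMIT = 140; // the tweet limit mentioned in the thread

// What the web client appears to do, according to the message above.
function remainingByCodeUnits(text) {
  return LIMIT - text.length;
}

// Counting full code points instead.
function remainingByCodePoints(text) {
  return LIMIT - [...text].length;
}

const tweet = '\u{1F4A9}';
console.log(remainingByCodeUnits(tweet));  // 138
console.log(remainingByCodePoints(tweet)); // 139
```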

I feel adding this functionality to ES6 would 1) help raise awareness of the issue, and 2) give developers an easy way to work around ECMAScript’s UCS-2/UTF-16-ish behavior.

# Jason Orendorff (13 years ago)

On Sat, Dec 1, 2012 at 2:09 AM, Mathias Bynens <mathias at qiwi.be> wrote:

> On 30 Nov 2012, at 22:50, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
>
> > There's nothing in the proposal yet because I intentionally kept it small. It's always possible to add functionality, but we need some evidence that it will be widely used.
>
> My guess would be that in 99% of all cases where `String.prototype.length` is used the intention is to count the code points, not the UCS-2/UTF-16 code units.

I don't think this is right. My guess is that in most cases where it
matters either way, the intention is to get a count that's consistent with
.charAt(), .indexOf(), .slice(), RegExp match.index, and every other place
where string indexes are used.

That said, of course this is a sensible feature to add; but calling it
".realLength" wouldn't help anyone understand the rather fine distinction
at issue.

-j
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20121204/87e5e447/attachment.html>

# David Bruant (13 years ago)

On 04/12/2012 20:25, Jason Orendorff wrote:

> On Sat, Dec 1, 2012 at 2:09 AM, Mathias Bynens <mathias at qiwi.be> wrote:
>
> > On 30 Nov 2012, at 22:50, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
> >
> > > There's nothing in the proposal yet because I intentionally kept it small. It's always possible to add functionality, but we need some evidence that it will be widely used.
> >
> > My guess would be that in 99% of all cases where `String.prototype.length` is used the intention is to count the code points, not the UCS-2/UTF-16 code units.
>
> I don't think this is right. My guess is that in most cases where it matters either way, the intention is to get a count that's consistent with .charAt(), .indexOf(), .slice(), RegExp match.index, and every other place where string indexes are used.

I think Twitter has a bug as mentioned earlier in the thread and that's unrelated to consistency with the method you're mentioning. I however agree that if something is added to get the actual length, a whole set of methods needs to be added too.

> That said, of course this is a sensible feature to add; but calling it ".realLength" wouldn't help anyone understand the rather fine distinction at issue.

Maybe the solution lies in finding the right prefix to define .*length, .*charAt(), .*indexOf(), etc. Maybe "CP" for "code points" .CPlength? .cpLength/cpCharAt/cpIndexOf... ?

While you're talking about regexps, I think there is an issue with current RegExps. Mathias will know better. Could a new flag solve the issue?

David

# Norbert Lindenberg (13 years ago)

On Dec 4, 2012, at 11:43, David Bruant wrote:

> On 04/12/2012 20:25, Jason Orendorff wrote:
>
> > On Sat, Dec 1, 2012 at 2:09 AM, Mathias Bynens <mathias at qiwi.be> wrote:
> >
> > > My guess would be that in 99% of all cases where `String.prototype.length` is used the intention is to count the code points, not the UCS-2/UTF-16 code units.
> >
> > I don't think this is right. My guess is that in most cases where it matters either way, the intention is to get a count that's consistent with .charAt(), .indexOf(), .slice(), RegExp match.index, and every other place where string indexes are used.
>
> I think Twitter has a bug as mentioned earlier in the thread and that's unrelated to consistency with the method you're mentioning.

One example isn't enough to support a "99% of all cases" claim. And I agree with Jason - many uses of String.length are related to some sort of iteration over the code units of the String, and then consistency with indices is critical. Showing the length of a string to the user is a rare (although important) case.

> I however agree that if something is added to get the actual length, a whole set of methods needs to be added too.

Which proposal are you referring to and agreeing with?

> > That said, of course this is a sensible feature to add; but calling it ".realLength" wouldn't help anyone understand the rather fine distinction at issue.
>
> Maybe the solution lies in finding the right prefix to define .*length, .*charAt(), .*indexOf(), etc. Maybe "CP" for "code points" .CPlength? .cpLength/cpCharAt/cpIndexOf... ?

"cp" to indicate code point indices? I think using two parallel index systems would only create confusion. Most string processing, including indexOf, works fine with supplementary characters without doing anything special for them. We need to provide a foundation that lets developers easily support supplementary characters in functionality that needs to be aware of them, but in many applications few changes will be required.

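As a rough illustration of string processing that needs no special handling, here is a sketch using an assumed sample string; only existing String methods are called.

```javascript
var text = 'x\uD835\uDC00y\uD83D\uDCA9z'; // x, U+1D400, y, U+1F4A9, z
var poo = '\uD83D\uDCA9';                 // U+1F4A9 as a surrogate pair

var at = text.indexOf(poo);               // 4
text.slice(at, at + poo.length);          // '\uD83D\uDCA9', the whole symbol
text.split(poo);                          // ['x\uD835\uDC00y', 'z']
text.replace(poo, '!');                   // 'x\uD835\uDC00y!z'
```

Searching, slicing, splitting, and replacing all work here because the search string and the offsets are both expressed in code units, so the pair is never cut in half.
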
> While you're talking about regexps, I think there is an issue with current RegExps. Mathias will know better. Could a new flag solve the issue?

RegExp does require major changes to support supplementary characters. The proposal accepted for ES6 (although not integrated into the spec yet) is at http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#RegExp

Are you aware of issues not addressed there?
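
For readers unfamiliar with the kind of problem in question, a brief sketch. It assumes an engine that supports the `u` flag as standardized in ES2015; engines without it will reject the flagged literals.

```javascript
var poo = '\uD83D\uDCA9'; // U+1F4A9 PILE OF POO

// Without the flag, `.` matches a single code unit, so a lone astral symbol
// does not match a pattern that expects exactly one character.
/^.$/.test(poo);               // false

// With the `u` flag, `.` matches one full code point.
/^.$/u.test(poo);              // true

// Character classes show the same split.
/^[\uD83D\uDCA9]$/.test(poo);  // false: the class holds two separate code units
/^[\u{1F4A9}]$/u.test(poo);    // true
```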

Norbert
