How to count the number of symbols in a string?

# Mathias Bynens (12 years ago)

ECMAScript 6 introduces some useful new features that make working with astral Unicode symbols easier.

One thing that is still missing though (AFAIK) is an easy way to count the number of symbols / code points in a given string. As you know, we can’t rely on String.prototype.length here, as a string containing nothing but an astral symbol has a length of 2 instead of 1:

    var poo = '\u{1F4A9}'; // U+1F4A9 PILE OF POO
    poo.length
    2

Of course it’s possible to write some code yourself to loop over all the code units in the string, handle surrogate pairs, and increment a counter manually for each full code point, but that’s a pain.
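For illustration, here is a minimal sketch of the manual loop described above. The helper name `countCodePoints` is hypothetical, and the sketch assumes a lone (unpaired) surrogate should count as one symbol on its own.

```javascript
// Hypothetical helper, not part of any spec: counts code points by walking
// the UTF-16 code units and treating each valid surrogate pair as one symbol.
function countCodePoints(string) {
  var count = 0;
  for (var i = 0; i < string.length; i++) {
    var unit = string.charCodeAt(i);
    // A high surrogate (0xD800-0xDBFF) followed by a low surrogate
    // (0xDC00-0xDFFF) encodes a single astral code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < string.length) {
      var next = string.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate so the pair is counted once
      }
    }
    count++;
  }
  return count;
}

var pooCount = countCodePoints('\uD83D\uDCA9'); // surrogate pair for U+1F4A9
console.log(pooCount); // 1
```

It works, but every caller has to get the surrogate ranges right, which is the pain point.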

It would be useful to have a new property on String.prototype that would return the number of Unicode symbols in the string. Something like realLength (of course, it needs a better name, but you get the idea):

    poo.realLength
    1

Another possible solution is to add something like String.prototype.codePoints which would be an array of the numerical code point values in the string. That way, getting the length is only a matter of accessing the length property of the array:

    poo.codePoints
    [ 0x1F4A9 ]
    poo.codePoints.length
    1

Or perhaps this would be better suited as a method?

    poo.getCodePoints()
    [ 0x1F4A9 ]
    poo.getCodePoints().length
    1
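As a rough userland approximation of the idea, the sketch below uses a standalone function rather than a `String.prototype` member. The name `getCodePoints` is only illustrative. It assumes `Array.from` and `String.prototype.codePointAt` are available, as they are in the final ES2015 standard.

```javascript
// Hypothetical stand-in for the proposed accessor/method; not a real built-in.
function getCodePoints(string) {
  // Array.from iterates the string by code point, so each `symbol` is one
  // full symbol; codePointAt(0) then reads its numeric value.
  return Array.from(string, function (symbol) {
    return symbol.codePointAt(0);
  });
}

var poo = '\u{1F4A9}';
var points = getCodePoints(poo);
console.log(points.length);          // 1
console.log(points[0] === 0x1F4A9);  // true
```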

Has anything like this been considered/discussed here yet?

# Andrea Giammarchi (12 years ago)

Already raised a while ago ...

jp.twitter.com/WebReflection/status/260479508912685056

No answer. If I remember correctly, Brendan said that .size() or .size is not a good name, but I also suggested .points and nobody came back on this.

# Yusuke Suzuki (12 years ago)

I remember that String object iterator produces the sequence of Unicode characters. harmony:iterators#string_iterators

So I think we can get code points by using array comprehension, var points = [ch for ch of string];

Is it right? > all
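One caveat for later readers: array comprehensions were dropped before ES2015 was finalized, so the snippet above will not run in standard engines. The string iterator did ship, so the same idea can be sketched with `for...of`. The name `countSymbols` is hypothetical.

```javascript
// Hypothetical helper: counts the symbols produced by the string iterator,
// which yields one string per code point rather than per code unit.
function countSymbols(string) {
  let count = 0;
  for (const symbol of string) {
    count++; // `symbol` is a one-code-point string such as 'a' or '\u{1F4A9}'
  }
  return count;
}

const total = countSymbols('a\u{1F4A9}b');
console.log(total); // 3
```

Note that the iterator yields strings, not numeric code point values, so this gives the count but not the array of numbers discussed earlier in the thread.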

# Andrea Giammarchi (12 years ago)

good point(s), still this should be a native String.prototype or String method, IMHO

# Mathias Bynens (12 years ago)

Thanks for the useful info, Andrea and Yusuke!

A {code}points property/getter on the String.prototype like Andrea suggested earlier sure sounds good to me.

IMHO a solution to the “string length” problem should definitely be added to ES6, as even major sites like Twitter.com are doing it wrong currently. Here’s a screenshot of the tweet textarea containing only a single astral symbol: i.imgur.com/IlFxj.png Note how the counter below says “138 [characters left]” instead of 139, as it should.

# Phillips, Addison (12 years ago)

One question would be what you’d want that specific number for? The number of code points in a string is only marginally interesting in a script. It doesn’t, for example, tell you how many screen positions the text consumes (that’s the grapheme count).
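To make the distinction concrete, here is a small sketch comparing the code point count with the grapheme count. It assumes a runtime that provides `Intl.Segmenter`, which is much newer than this thread and not available everywhere. The name `countGraphemes` is hypothetical.

```javascript
// 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single
// visible symbol but consists of two code points.
var text = 'e\u0301';
var codePointCount = Array.from(text).length; // 2

// Hypothetical helper; assumes Intl.Segmenter exists in this runtime.
function countGraphemes(string) {
  var segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  var count = 0;
  for (var segment of segmenter.segment(string)) {
    count++; // one iteration per grapheme cluster
  }
  return count;
}

if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
  console.log(codePointCount, countGraphemes(text)); // 2 1
}
```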

Norbert’s proposal [1] includes an iterator over the code points (so counting the code points is straightforward, but not a property of the string itself, or at least I don’t see it anywhere).

Addison

[1] norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

# Norbert Lindenberg (12 years ago)

There's nothing in the proposal yet because I intentionally kept it small. It's always possible to add functionality, but we need some evidence that it will be widely used. Pointing at Twitter doesn't quite help - it's possible that the number that Twitter shows reflects some limitation in their back-end systems.

Thanks for bringing the issue to this list, btw - tweets aren't as effective in getting TC 39 attention.

Norbert

# Andrea Giammarchi (12 years ago)

to sanitize, I would say, is the very first use case where if str.length != str.points something might require a fix.

A utf-8 friendly "number of allowed chars", as it would be the twitter case, is another example.

A split able to represent codePoints rather than chars would need points number too ... the fact developers are already asking for a way to obtain these codePoints should also indicate the feature might be needed.

Thoughts?

# Jussi Kalliokoski (12 years ago)

On Fri, Nov 30, 2012 at 10:39 PM, Yusuke Suzuki <utatane.tea at gmail.com> wrote:

> I remember that String object iterator produces the sequence of Unicode characters. harmony:iterators#string_iterators
>
> So I think we can get code points by using array comprehension, var points = [ch for ch of string];
>
> Is it right? > all

So verbose, ugh.

var points = [...string]

Partly kidding here. :D

# Phillips, Addison (12 years ago)

Andrea wrote:

> to sanitize, I would say, is the very first use case where if str.length != str.points something might require a fix.
>
> A utf-8 friendly "number of allowed chars", as it would be the twitter case, is another example.
>
> A split able to represent codePoints rather than chars would need points number too ... the fact developers are already asking for a way to obtain these codePoints should also indicate the feature might be needed.
>
> Thoughts?

I'm not saying that there is no utility at all. Just pointing out that the codepoint count doesn't represent e.g. the "number of symbols" in the string and that adding it for that purpose might not achieve the aim of someone who wants to know how many symbols will appear.

The Twitter case is an interesting one. Ditto string splitting.

Addison

# Mathias Bynens (12 years ago)

On 30 Nov 2012, at 22:50, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:

> There's nothing in the proposal yet because I intentionally kept it small. It's always possible to add functionality, but we need some evidence that it will be widely used.

My guess would be that in 99% of all cases where String.prototype.length is used the intention is to count the code points, not the UCS-2/UTF-16 code units. As for evidence:

> Pointing at Twitter doesn't quite help - it's possible that the number that Twitter shows reflects some limitation in their back-end systems.

That’s the thing — the Twitter back-end does the right thing and doesn’t discriminate between BMP and astral symbols. Each symbol counts as a single “character” towards the 140 character limit. You can post a tweet consisting of 140 astral symbols just fine as long as you use a Twitter client that supports it.

The behavior you’re seeing in the Twitter web client is a bug. They’re simply getting the length of the input string rather than accounting for surrogate halves and counting the actual full code points.
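A sketch of the difference, using the 140-symbol limit mentioned in this thread. Both function names are hypothetical, and this is not Twitter's actual code.

```javascript
var LIMIT = 140; // the limit discussed in this thread

// What the web client appears to do: count UTF-16 code units.
function remainingByLength(text) {
  return LIMIT - text.length;
}

// What the thread argues it should do: count full code points.
function remainingByCodePoints(text) {
  return LIMIT - Array.from(text).length;
}

var tweet = '\uD83D\uDCA9'; // a single astral symbol, U+1F4A9
console.log(remainingByLength(tweet));     // 138
console.log(remainingByCodePoints(tweet)); // 139
```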

I feel adding this functionality to ES6 would 1) help raise awareness of the issue, and 2) give developers an easy way to work around ECMAScript’s UCS-2/UTF-16-ish behavior.

# Jason Orendorff (12 years ago)

On Sat, Dec 1, 2012 at 2:09 AM, Mathias Bynens <mathias at qiwi.be> wrote:

> On 30 Nov 2012, at 22:50, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
>
>> There's nothing in the proposal yet because I intentionally kept it small. It's always possible to add functionality, but we need some evidence that it will be widely used.
>
> My guess would be that in 99% of all cases where String.prototype.length is used the intention is to count the code points, not the UCS-2/UTF-16 code units.

I don't think this is right. My guess is that in most cases where it matters either way, the intention is to get a count that's consistent with .charAt(), .indexOf(), .slice(), RegExp match.index, and every other place where string indexes are used.
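A short sketch of why that consistency matters: indexes returned by the existing methods are code unit indexes, and they only compose correctly with other code-unit-based methods. The variable names are illustrative.

```javascript
var text = '\uD83D\uDCA9!'; // U+1F4A9 followed by '!'

// Code unit indexes are consistent across the existing String methods.
var index = text.indexOf('!');      // 2
var before = text.slice(0, index);  // the complete astral symbol

// A code point index fed into the same method splits the surrogate pair.
var codePointIndex = Array.from(text).indexOf('!'); // 1
var broken = text.slice(0, codePointIndex);         // lone high surrogate

console.log(before === '\uD83D\uDCA9'); // true
console.log(broken === '\uD83D');       // true
```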

That said, of course this is a sensible feature to add; but calling it ".realLength" wouldn't help anyone understand the rather fine distinction at issue.

# David Bruant (12 years ago)

On 04/12/2012 at 20:25, Jason Orendorff wrote:

> On Sat, Dec 1, 2012 at 2:09 AM, Mathias Bynens <mathias at qiwi.be> wrote:
>
>> On 30 Nov 2012, at 22:50, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
>>
>>> There's nothing in the proposal yet because I intentionally kept it small. It's always possible to add functionality, but we need some evidence that it will be widely used.
>>
>> My guess would be that in 99% of all cases where `String.prototype.length` is used the intention is to count the code points, not the UCS-2/UTF-16 code units.
>
> I don't think this is right. My guess is that in most cases where it matters either way, the intention is to get a count that's consistent with .charAt(), .indexOf(), .slice(), RegExp match.index, and every other place where string indexes are used.

I think Twitter has a bug as mentioned earlier in the thread and that's unrelated to consistency with the method you're mentioning. I however agree that if something is added to get the actual length, a whole set of methods needs to be added too.

> That said, of course this is a sensible feature to add; but calling it ".realLength" wouldn't help anyone understand the rather fine distinction at issue.

Maybe the solution lies in finding the right prefix to define .*length, .*charAt(), .*indexOf(), etc. Maybe "CP" for "code points" .CPlength? .cpLength/cpCharAt/cpIndexOf... ?

While you're talking about regexps, I think there is an issue with current RegExps. Mathias will know better. Could a new flag solve the issue?

# Norbert Lindenberg (12 years ago)

On Dec 4, 2012, at 11:43, David Bruant wrote:

> On 04/12/2012 at 20:25, Jason Orendorff wrote:
>
>> On Sat, Dec 1, 2012 at 2:09 AM, Mathias Bynens <mathias at qiwi.be> wrote:
>>
>>> My guess would be that in 99% of all cases where String.prototype.length is used the intention is to count the code points, not the UCS-2/UTF-16 code units.
>>
>> I don't think this is right. My guess is that in most cases where it matters either way, the intention is to get a count that's consistent with .charAt(), .indexOf(), .slice(), RegExp match.index, and every other place where string indexes are used.
>
> I think Twitter has a bug as mentioned earlier in the thread and that's unrelated to consistency with the method you're mentioning.

One example isn't enough to support a "99% of all cases" claim. And I agree with Jason - many uses of String.length are related to some sort of iteration over the code units of the String, and then consistency with indices is critical. Showing the length of a string to the user is a rare (although important) case.

> I however agree that if something is added to get the actual length, a whole set of methods needs to be added too.

Which proposal are you referring to and agreeing with?

>> That said, of course this is a sensible feature to add; but calling it ".realLength" wouldn't help anyone understand the rather fine distinction at issue.
>
> Maybe the solution lies in finding the right prefix to define .*length, .*charAt(), .*indexOf(), etc. Maybe "CP" for "code points" .CPlength? .cpLength/cpCharAt/cpIndexOf... ?

"cp" to indicate code point indices? I think using two parallel index systems would only create confusion. Most string processing, including indexOf, works fine with supplementary characters without doing anything special for them. We need to provide a foundation that lets developers easily support supplementary characters in functionality that needs to be aware of them, but in many applications few changes will be required.

> While you're talking about regexps, I think there is an issue with current RegExps. Mathias will know better. Could a new flag solve the issue?

RegExp does require major changes to support supplementary characters. The proposal accepted for ES6 (although not integrated into the spec yet) is at norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#RegExp

Are you aware of issues not addressed there?
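For later readers, here is a small sketch of the kind of RegExp problem in question. It uses the `u` flag as it eventually shipped in ES2015 and does not assume that the linked proposal text matches the final standard in every detail.

```javascript
var poo = '\uD83D\uDCA9'; // U+1F4A9

// Without the flag, `.` matches a single code unit, so one astral symbol
// (two code units) does not match /^.$/.
var withoutFlag = /^.$/.test(poo);  // false

// With the `u` flag, `.` matches a full code point.
var withFlag = /^.$/u.test(poo);    // true

// The flag also enables \u{...} escapes and astral ranges in character classes.
var inRange = /^[\u{1F4A0}-\u{1F4AF}]$/u.test(poo); // true

console.log(withoutFlag, withFlag, inRange); // false true true
```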

Norbert