Proposal: `String.prototype.codePointCount`

# fanerge (6 years ago)

I would like to propose a property on String.prototype that returns the number of code points in a string, so that it reflects the actual number of characters rather than the number of code units.

Definition of String.prototype.length

This property returns the number of code units in the string. UTF-16, the string format used by JavaScript, uses a single 16-bit code unit to represent the most common characters, but needs two code units for less commonly used characters, so the value returned by length may not match the actual number of characters in the string.

For reference, the String class in Java

The String class in Java also uses UTF-16 encoding.
String.length(): returns the number of char values (UTF-16 code units) in the string;
String.codePointCount(): returns the number of Unicode code points in the specified range of the string.

I would like ECMA to add a property or method to String.prototype that returns the number of code points in the string. For example, String.prototype.codePointCount could return the actual number of code points instead of the number of code units.

```js
const str1 = '1111';
str1.length; // 4
str1.codePointCount; // 4 (proposed property, does not exist today)
// '1'.codePointAt(0); // 49

const str2 = '𠮷𠮷𠮷𠮷';
str2.length; // 8
str2.codePointCount; // 4 (proposed)
// '𠮷'.codePointAt(0); // 134071

const str3 = '😯😯😯😯';
str3.length; // 8
str3.codePointCount; // 4 (proposed)
// '😯'.codePointAt(0); // 128559
```

 

I believe that most developers need such a method or property to get the number of code points in a string. I sincerely hope that you will consider my proposal. Thanks.
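Until something like this exists natively, the proposed behaviour can be approximated in user code. The following is only a minimal sketch, not part of any specification: `codePointCount` here is a hypothetical standalone helper (deliberately not installed on `String.prototype`), and it relies on string iteration stepping by code point rather than by code unit.

```js
// Hypothetical helper: counts code points without building an intermediate array.
function codePointCount(str) {
  let count = 0;
  // for...of uses String.prototype[Symbol.iterator], which yields one
  // code point per step (a surrogate pair is yielded as a single item).
  for (const _ of str) {
    count += 1;
  }
  return count;
}

console.log(codePointCount('1111'));      // 4
console.log(codePointCount('𠮷𠮷𠮷𠮷'));  // 4
console.log('𠮷𠮷𠮷𠮷'.length);           // 8
```

Note that an unpaired surrogate is yielded as an item of its own, so it counts as one code point in this sketch.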

# kdex (6 years ago)

So what's wrong with `Array.from(str).length`?


# Claude Pache (6 years ago)

> On 8 August 2019 at 04:37, fanerge <fanerge at qq.com> wrote:
> 
> I expect to be able to add an attribute to String.prototype that returns the number of codePoints of the string to reflect the actual number of characters instead of the code unit.

Note however that “the number of code points” is not the same thing as “the actual number of characters” for what a human usually perceives as “character”. For example:

```js
Object.defineProperty(String.prototype, "codePointCount", {
    get() { return [...this].length }
})
"🇨🇦".codePointCount // 2
"n̈".codePointCount // 2
"é".codePointCount // 1
"é".normalize("NFD").codePointCount // 2
```
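For comparison, the same strings can be counted by grapheme cluster, which is closer to what a reader perceives as a "character". This is a rough sketch that assumes a runtime implementing `Intl.Segmenter` (ECMA-402); `graphemeCount` is a hypothetical helper name, and the strings are written with escapes so that the exact code points are unambiguous.

```js
// Hypothetical helper: counts extended grapheme clusters.
// Assumes the runtime implements Intl.Segmenter.
function graphemeCount(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _ of segmenter.segment(str)) count += 1;
  return count;
}

const flag = '\u{1F1E8}\u{1F1E6}';            // 🇨🇦: two regional indicator symbols
const nDiaeresis = 'n\u0308';                 // "n" + combining diaeresis
const eAcuteNFD = '\u00E9'.normalize('NFD');  // "e" + combining acute accent

// [code points, grapheme clusters]
console.log([...flag].length, graphemeCount(flag));             // 2 1
console.log([...nDiaeresis].length, graphemeCount(nDiaeresis)); // 2 1
console.log([...eAcuteNFD].length, graphemeCount(eAcuteNFD));   // 2 1
```

In each case the code point count is 2 while a reader would see a single character.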

> 
> I believe that most developers need such a method and property to get the number of codePoints in a string.
> 

For what purposes do “most developers” need the number of code points?

—Claude


# fanerge (6 years ago)

Thank you for your reply. I know that there are ways to get the right result, but I still think there should be such a method on String.prototype itself, rather than having to go through other means (mostly converting the string to an array and reading its length). I hope that the members of ECMA can consider it.

Having JavaScript itself support such a method is definitely better than leaving each developer to implement it alone.

There are many such requirements in real development scenarios, such as limiting how many characters a user is allowed to enter, where we also have to consider characters outside the Unicode Basic Multilingual Plane. A built-in would unify what users and developers understand by the length of a string.

For example:

```js
let str = '𠮷𠮷𠮷𠮷';

[...str].length; // 4

// or

Array.from(str).length; // 4
```


# Claude Pache (6 years ago)

> On 8 August 2019 at 11:07, fanerge <fanerge at qq.com> wrote:
> 
> There are many such requirements in a real development scenario, such as how many characters are allowed to be entered by the user, which is something we should consider not in Unicode for Basic Multilingual Plane.


I have cases where I want to limit the length of user input, for which I just use `<input maxlength>`, although it gives inconsistent results across the three browsers I have tested: two of them limit the number of UTF-16 code units, one limits the number of grapheme clusters, and none of them limits the number of code points.

In fact, for my purpose, I have no reason to impose a limit on a precise number of *code points* (as opposed to other possible definitions of “length” such as *UTF-16 code units* or *grapheme clusters*). Technically, I am usually limited by the size of a column in the database, where the “size” typically corresponds to the number of bytes in a UTF-8 encoded string. From a user's point of view, the number of “characters” is better approximated by the number of grapheme clusters. Neither of those two notions of “length” corresponds to the number of code points.
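To make the four notions of "length" concrete, here is an illustrative sketch that measures one string each way. The helper name `lengths` and its property names are made up for this example; it assumes a runtime that provides `TextEncoder` and `Intl.Segmenter`.

```js
// Hypothetical helper: reports a string's size under four definitions of "length".
function lengths(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    utf16CodeUnits: str.length,                           // what String.prototype.length returns
    codePoints: [...str].length,                          // string iteration steps by code point
    utf8Bytes: new TextEncoder().encode(str).length,      // e.g. a byte-sized database column
    graphemeClusters: [...segmenter.segment(str)].length, // closest to user-perceived characters
  };
}

console.log(lengths('\u{20BB7}'));
// 𠮷: utf16CodeUnits 2, codePoints 1, utf8Bytes 4, graphemeClusters 1

console.log(lengths('e\u0301'));
// "e" + combining acute: utf16CodeUnits 2, codePoints 2, utf8Bytes 3, graphemeClusters 1

console.log(lengths('\u{1F1E8}\u{1F1E6}'));
// 🇨🇦: utf16CodeUnits 4, codePoints 2, utf8Bytes 8, graphemeClusters 1
```

In these three samples the code point count never lines up with both the storage size and the perceived character count at once.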

—Claude

# Tab Atkins Jr. (6 years ago)

On Thu, Aug 8, 2019 at 2:45 AM Claude Pache <claude.pache at gmail.com> wrote:
> In fact, for my purpose, I have no reason to impose a limit for a precise number of *code points* (as opposed to other possible definitions of “length” such as *UTF-16 code units* or *grapheme clusters*). Technically, I am usually limited by the size of a column in the database, for which the “size” corresponds typically to the number of bytes in a UTF-8 encoded string. From a user point-of-view, the number of “characters” is better approximated by the number of grapheme clusters. None of those two notions of “length” correspond to the number of code points.

Yup, code points, while a useful specification concept to work with,
are in fact very rarely what you actually need to care about for
anything in real use-cases! Bytes or grapheme clusters are almost
always what you want.
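
As an illustration of combining the two, the sketch below trims input to a UTF-8 byte budget (such as a database column size) while only cutting at grapheme cluster boundaries. `truncateToUtf8Bytes` is a hypothetical helper, not an existing API, and it assumes `Intl.Segmenter` and `TextEncoder` are available.

```js
// Hypothetical helper: keeps as many whole grapheme clusters as fit in
// maxBytes once the string is encoded as UTF-8.
function truncateToUtf8Bytes(str, maxBytes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const encoder = new TextEncoder();
  let result = '';
  let used = 0;
  for (const { segment } of segmenter.segment(str)) {
    const size = encoder.encode(segment).length;
    if (used + size > maxBytes) break; // never split a cluster
    result += segment;
    used += size;
  }
  return result;
}

const input = 'ab\u{1F1E8}\u{1F1E6}cd'; // "ab🇨🇦cd": the flag alone takes 8 UTF-8 bytes

console.log(truncateToUtf8Bytes(input, 9));  // "ab" (the flag does not fit whole)
console.log(truncateToUtf8Bytes(input, 10)); // "ab🇨🇦"
console.log(truncateToUtf8Bytes(input, 12)); // "ab🇨🇦cd"
```

A limit expressed in code points would not help here: the flag is two code points but eight bytes and one visible character.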

~TJ

# Mathias Bynens (6 years ago)

Prior discussion from 7 years ago:
https://esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string

`[...string].length` does what you want. But it's definitely not always what you need:
<https://mathiasbynens.be/notes/javascript-unicode#other-grapheme-clusters>


# Bob Myers (6 years ago)

Consider a language such as Kannada, spoken in southern India, and the 25th most widely spoken language in the world, with 60M speakers. "Characters" in the written language are represented in Unicode as elements (sometimes called "letters") which are then composed at the rendering level to produce what native speakers would consider "characters" (for clarity, sometimes called "compound characters", or ottakshara). The portion of the composition algorithm which figures out which sequences of elements belong to the same "character" is found only in rendering engines, and is itself so complicated that it has resulted in many bugs, including one (in a different but related language) which caused Macs to crash. The actual positional composition, which involves figuring out not only how to arrange the elements but also how to adjust their size and other details, is even more complicated.

In any case, for Kannada, what kind of characters do you want to count with your new string prototype method? If you're interested in knowing this to make sure that your user does not enter a string longer than will fit in some fixed-length database field, are you going to tell the user that "Name can contain no more than 25 letters", which will mean nothing to them? If you want to make sure some name can fit in some space on the screen, you are going to have to count compound characters (which are fixed width, for all practical purposes), but how are you going to do that without including a huge library to analyze the Kannada strings, a library which is not even publicly available?

-- Bob
