Proposal: `String.prototype.codePointCount`

# fanerge (5 years ago)

I would like a property added to String.prototype that returns the number of code points in the string, so that it reflects the actual number of characters rather than the number of code units.

Definition of String.prototype.length

This property returns the number of code units in the string. UTF-16, the string format used by JavaScript, uses a single 16-bit code unit to represent the most common characters, but needs to use two code units for less commonly-used characters, so it's possible for the value returned by length to not match the actual number of characters in the string.

We refer to the String class in Java

The String class in Java also uses UTF-16 encoding. String.length() returns the number of char values (UTF-16 code units) in the string; String.codePointCount(beginIndex, endIndex) returns the number of Unicode code points in the given range of the string.

I would like ECMA to add a property or method to String.prototype that returns the number of code points in the string. For example, String.prototype.codePointCount could return the actual number of code points rather than the number of code units.


const str1 = '1111';
str1.length; // 4
str1.codePointCount; // 4
// '1'.codePointAt(0); // 49

const str2 = '𠮷𠮷𠮷𠮷';
str2.length; // 8
str2.codePointCount; // 4
// '𠮷'.codePointAt(0); // 134071

const str3 = '😯😯😯😯';
str3.length; // 8
str3.codePointCount; // 4
// '😯'.codePointAt(0); // 128559
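The counts shown above can already be approximated with string iteration, which steps through a string one code point at a time. The following is only a minimal sketch; `countCodePoints` is a hypothetical helper name, not an existing or proposed API.

```javascript
// Hypothetical helper (not part of any standard): counts code points
// by iterating the string, without building an intermediate array.
function countCodePoints(str) {
  let count = 0;
  // The string iterator yields one code point per step; a lone
  // surrogate is yielded on its own and therefore counts as one.
  for (const _ of str) {
    count++;
  }
  return count;
}

console.log(countCodePoints('1111'));     // 4
console.log(countCodePoints('𠮷𠮷𠮷𠮷')); // 4
console.log('𠮷𠮷𠮷𠮷'.length);           // 8
```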

I believe that most developers need such a method and property to get the number of codePoints in a string. I sincerely hope that you can accept my proposal, thanks.

# kdex (5 years ago)

So what's wrong with Array.from(str).length?

On Thursday, August 8, 2019 4:37:07 AM CEST fanerge wrote:

I would like a property added to String.prototype that returns the number of code points in the string, so that it reflects the actual number of characters rather than the number of code units.


# Claude Pache (5 years ago)

On 8 August 2019 at 04:37, fanerge <fanerge at qq.com> wrote:

I would like a property added to String.prototype that returns the number of code points in the string, so that it reflects the actual number of characters rather than the number of code units.

Note however that “the number of code points” is not the same thing as “the actual number of characters” for what a human usually perceives as “character”. For example:

Object.defineProperty(String.prototype, "codePointCount", {
    get() { return [...this].length }
})
"🇨🇦".codePointCount // 2
"n̈".codePointCount // 2
"é".codePointCount // 1
"é".normalize("NFD").codePointCount // 2

I believe that most developers need such a method and property to get the number of codePoints in a string.

For what purposes do “most developers” need the number of code points?

# fanerge (5 years ago)

Thank you for your reply. I know that there are ways to get the right result, but I still think there should be such a method on String.prototype itself, rather than having to go through other means (mostly converting to an array and taking its length). I hope the members of ECMA will consider it.

Having JavaScript itself support such a method is definitely better than leaving it to each developer.

There are many such requirements in real development scenarios, such as limiting how many characters a user is allowed to enter, where we need to account for characters outside Unicode's Basic Multilingual Plane. This would unify how users and developers perceive length.


eg.

let str = '𠮷𠮷𠮷𠮷';

[...str].length; // 4

or

Array.from(str).length // 4
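For the input-limit scenario mentioned above, a limit expressed in code points could be enforced roughly as follows. This is a sketch under assumptions: `truncateCodePoints` is a hypothetical helper name, and it only guarantees that a surrogate pair is never split; it does nothing special for combining marks or other multi-code-point sequences.

```javascript
// Hypothetical helper: keeps at most `max` code points of `str`.
function truncateCodePoints(str, max) {
  let out = '';
  let count = 0;
  // Iterating a string yields whole code points, so a surrogate
  // pair is either kept entirely or dropped entirely.
  for (const ch of str) {
    if (count >= max) break;
    out += ch;
    count++;
  }
  return out;
}

const limited = truncateCodePoints('𠮷𠮷𠮷𠮷', 2);
console.log(limited.length);               // 4 (two surrogate pairs)
console.log([...limited].length);          // 2
// By contrast, slicing by code units can cut a pair in half:
console.log('𠮷𠮷𠮷𠮷'.slice(0, 3).length); // 3, ending in a lone surrogate
```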
 

------------------ Original Message ------------------ From: "Claude Pache"<claude.pache at gmail.com>; Sent: Thursday, August 8, 2019, 4:45 PM; To: "fanerge"<fanerge at qq.com>; Cc: "es-discuss"<es-discuss at mozilla.org>; Subject: Re: Proposal: String.prototype.codePointCount


# Claude Pache (5 years ago)

On 8 August 2019 at 11:07, fanerge <fanerge at qq.com> wrote:

There are many such requirements in real development scenarios, such as limiting how many characters a user is allowed to enter, where we need to account for characters outside Unicode's Basic Multilingual Plane.

I have cases where I want to limit the length of user input, for which purpose I just use <input maxlength>, although it gives inconsistent results across the three browsers I have tested: two of them limit the number of UTF-16 code units, one of them limits the number of grapheme clusters (and none of them limit the number of code points).

In fact, for my purpose, I have no reason to impose a limit for a precise number of code points (as opposed to other possible definitions of “length” such as UTF-16 code units or grapheme clusters). Technically, I am usually limited by the size of a column in the database, for which the “size” typically corresponds to the number of bytes in a UTF-8 encoded string. From a user's point of view, the number of “characters” is better approximated by the number of grapheme clusters. Neither of those two notions of “length” corresponds to the number of code points.
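The different notions of “length” discussed here can be compared side by side. The following is a sketch only: `lengths` is a hypothetical helper name, and it assumes an environment that provides `TextEncoder` and `Intl.Segmenter` (the latter became widely available only after this thread).

```javascript
// Hypothetical helper: reports four different "lengths" of one string.
function lengths(str) {
  // Assumption: Intl.Segmenter exists in the runtime (e.g. recent Node or browsers).
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    utf16CodeUnits: str.length,                       // what String.prototype.length returns
    codePoints: [...str].length,                      // what codePointCount would return
    utf8Bytes: new TextEncoder().encode(str).length,  // typical database column size
    graphemeClusters: [...segmenter.segment(str)].length, // closest to perceived characters
  };
}

// Regional-indicator pair for the Canadian flag.
console.log(lengths('\u{1F1E8}\u{1F1E6}'));
// { utf16CodeUnits: 4, codePoints: 2, utf8Bytes: 8, graphemeClusters: 1 }

// "e" followed by a combining acute accent (decomposed é).
console.log(lengths('e\u0301'));
// { utf16CodeUnits: 2, codePoints: 2, utf8Bytes: 3, graphemeClusters: 1 }
```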

# Tab Atkins Jr. (5 years ago)

On Thu, Aug 8, 2019 at 2:45 AM Claude Pache <claude.pache at gmail.com> wrote:

In fact, for my purpose, I have no reason to impose a limit for a precise number of code points (as opposed to other possible definitions of “length” such as UTF-16 code units or grapheme clusters). Technically, I am usually limited by the size of a column in the database, for which the “size” typically corresponds to the number of bytes in a UTF-8 encoded string. From a user's point of view, the number of “characters” is better approximated by the number of grapheme clusters. Neither of those two notions of “length” corresponds to the number of code points.

Yup, code points, while a useful specification concept to work with, are in fact very rarely what you actually need to care about for anything in real use-cases! Bytes or grapheme clusters are almost always what you want.

# Mathias Bynens (5 years ago)

Prior discussion from 7 years ago: esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string

[...string].length does what you want. But it's definitely not always what you need: mathiasbynens.be/notes/javascript-unicode#other-grapheme-clusters

# Bob Myers (5 years ago)

Consider a language such as Kannada, spoken in southern India and the 25th most widely spoken language in the world, with 60M speakers. "Characters" in the written language are represented in Unicode as elements (sometimes called "letters"), which are then composed at the rendering level to produce what native speakers would consider "characters" (for clarity, sometimes called "compound characters", or ottakshara). The portion of the composition algorithm which figures out which sequences of elements belong to the same "character" is found only in rendering engines, and is itself so complicated that it has resulted in many bugs, including one (in a different but related language) which caused Macs to crash. The actual positional composition, which involves figuring out not only how to arrange the elements but also how to adjust their size and other details, is even more complicated.

In any case, for Kannada, what kind of characters do you want to count with your new string prototype method? If you're interested in knowing this to make sure that your user does not enter a string longer than will fit in some fixed-length database field, are you going to tell the user that "Name can contain no more than 25 'letters'", which will mean nothing to them? If you want to make sure some name can fit in some space on the screen, you are going to have to count compound characters (which are fixed width, for all practical purposes), but how are you going to do that without including a huge library to analyze the Kannada strings--a library which is not even publicly available?

-- Bob