On `String.prototype.codePointAt` and `String.fromCodePoint`

# Mathias Bynens (12 years ago)

Patches implementing `String.prototype.codePointAt` and `String.fromCodePoint` are available for both SpiderMonkey (https://bugzilla.mozilla.org/show_bug.cgi?id=918879) and V8 (https://code.google.com/p/v8/issues/detail?id=2840).

One spec bug remains to be fixed, though: <https://bugs.ecmascript.org/show_bug.cgi?id=1153>. It seems pretty clear the intent is to return `undefined` and not `NaN` (the algorithms in both the proposal and the ES6 draft agree on it), but it would be good to have this confirmed.
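For reference, the distinction at issue in that bug can be sketched as follows. This is a minimal illustration that assumes an engine implementing `codePointAt` as it was eventually standardized in ES2015, where an out-of-range position yields `undefined`. `charCodeAt` returns `NaN` in the same situation.

```javascript
// Assumption: an ES2015-compliant engine (for example a current Node.js).
const str = 'abc';

// Position 5 is past the end of the three-code-unit string.
const outOfRangeCodeUnit = str.charCodeAt(5);
const outOfRangeCodePoint = str.codePointAt(5);

console.log(Number.isNaN(outOfRangeCodeUnit));   // charCodeAt gives NaN
console.log(outOfRangeCodePoint === undefined);  // codePointAt gives undefined
```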

Is it a good idea for engines to start implementing these methods, or is their design still being discussed? The definitions of these methods have been in the ES6 draft for a long time (since July 2012) without any changes. Does that indicate stability? How sure are we that they will end up in the final ES6 spec?

Mathias  
http://mathiasbynens.be/

# Anne van Kesteren (12 years ago)

On Tue, Sep 24, 2013 at 12:15 PM, Mathias Bynens <mathias at qiwi.be> wrote:
> Is it a good idea for engines to start implementing these methods, or is their design still being discussed? The definitions of these methods have been in the ES6 draft for a long time (since July 2012) without any changes. Does that indicate stability? How sure are we that they will end up in the final ES6 spec?

I think I'm convinced that String.fromCodePoint()'s design is correct,
especially since the rendering subsystem deals with code points too.
String.prototype.codePointAt() however still feels wrong since you
always need to iterate from the start to get the correct code *unit*
offset anyway so why would you use it rather than the code *point*
iterator that is planned for inclusion?
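
The concern can be sketched roughly as follows. `codePointAt` takes a code unit index, so reaching the n-th code point means walking the string from the start anyway. The helper name `nthCodePoint` is hypothetical and only serves the illustration.

```javascript
// Hypothetical helper: returns the code point at a given *code point* index,
// or undefined if the string has fewer code points than that.
function nthCodePoint(str, n) {
  let i = 0;
  // String iteration (for...of) yields one substring per code point.
  for (const ch of str) {
    if (i === n) return ch.codePointAt(0);
    i += 1;
  }
  return undefined;
}

// 'a', U+1F4A9 (stored as a surrogate pair), 'b': four code units in total.
const s = 'a\uD83D\uDCA9b';

console.log(nthCodePoint(s, 2) === 0x62);  // third code point is 'b'
console.log(s.codePointAt(2) === 0xDCA9);  // code unit index 2 is the trail surrogate
```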


-- 
http://annevankesteren.nl/

# Mathias Bynens (12 years ago)

> I think I'm convinced that String.fromCodePoint()'s design is correct,
> especially since the rendering subsystem deals with code points too.

Glad to hear.

> String.prototype.codePointAt() however still feels wrong since you
> always need to iterate from the start to get the correct code *unit*
> offset anyway so why would you use it rather than the code *point*
> iterator that is planned for inclusion?

I think there are valid use cases for both.

For example, `String.prototype.codePointAt()` makes it easy to get only the code point at the first position, i.e. `str.codePointAt(0)`. `for…of` iterates over all code points in the string by default.

One key difference is that `String.prototype.codePointAt` is polyfillable in ES3/ES5, while `for…of` isn’t. This makes it easier to switch to `String.prototype.codePointAt` in existing code that is (incorrectly) using `String.prototype.charCodeAt` to loop over all code points in a string.
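
A small sketch of the difference between the two approaches, assuming an engine that supports the ES6 string methods and `for…of`. The sample string is an arbitrary supplementary character chosen for illustration.

```javascript
// U+1F4A9 is stored as the surrogate pair D83D DCA9.
const str = '\uD83D\uDCA9';

// charCodeAt only sees the lead surrogate; codePointAt combines the pair.
const firstCodeUnit = str.charCodeAt(0);
const firstCodePoint = str.codePointAt(0);

// for...of visits the string one code point at a time.
const codePoints = [];
for (const ch of str) {
  codePoints.push(ch.codePointAt(0));
}

console.log(firstCodeUnit.toString(16));   // lead surrogate only
console.log(firstCodePoint.toString(16));  // full code point
console.log(codePoints.length);            // one entry for the whole pair
console.log(String.fromCodePoint(firstCodePoint) === str);
```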

# Erik Arvidsson (12 years ago)

My concern is similar to Anne's. codePointAt will most likely not give
the right behavior and I'm concerned adding this without working
raising the bar significantly.

Since this is already implementable in ES3 I don't see why we should rush this?

I think we should apply the post ES6 process to this. Let's ship it
when we feel confident that we got this right.

-- 
erik

# Allen Wirfs-Brock (12 years ago)

On Sep 24, 2013, at 9:59 PM, Erik Arvidsson wrote:

> My concern is similar to Anne's. codePointAt will most likely not give
> the right behavior and I'm concerned adding this without working
> raising the bar significantly.
> 
> Since this is already implementable in ES3 I don't see why we should rush this?
> 
> I think we should apply the post ES6 process to this. Let's ship it
> when we feel confident that we got this right.

codePointAt is part of a larger comprehensive proposal [1] and was discussed in the context of that proposal at the March 2012 TC39 meeting [2].  There isn't a lot of detail in the notes but I'm pretty sure we talked about why the proposed definition of codePointAt makes sense.  We changed several things about the proposal at that meeting, but not codePointAt. I don't see anything new in this thread that should cause us to revisit the consensus we already have.

Allen

[1] http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html 
[2] https://mail.mozilla.org/pipermail/es-discuss/2012-March/021919.html 


# Bjoern Hoehrmann (12 years ago)

* Anne van Kesteren wrote:
>I think I'm convinced that String.fromCodePoint()'s design is correct,
>especially since the rendering subsystem deals with code points too.
>String.prototype.codePointAt() however still feels wrong since you
>always need to iterate from the start to get the correct code *unit*
>offset anyway so why would you use it rather than the code *point*
>iterator that is planned for inclusion?

UTF-16 is a self-synchronizing code and you need to move at most one
`.length` unit to get to a proper `.codePointAt` index in a properly
formed string. You only need to start from the beginning if you care
about what is between the start and the given index position. If you
want to treat proper surrogate pairs as one unit for counting, then
`.codePointAt` lets you do

  while (ix < s.length) {
    ix += s.codePointAt(ix) > 0xFFFF;
    ix += 1;
  }
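
A self-contained version of that loop might look like the sketch below. The function name `countCodePoints` and the explicit counter are additions for illustration, and the original's reliance on boolean-to-number coercion is spelled out as a conditional.

```javascript
// Hypothetical helper: counts well-formed surrogate pairs as one unit and
// every other code unit (including unpaired surrogates) as one unit each.
function countCodePoints(s) {
  let count = 0;
  let ix = 0;
  while (ix < s.length) {
    // A value above 0xFFFF means a well-formed surrogate pair starts at ix,
    // so advance past both of its code units.
    ix += s.codePointAt(ix) > 0xFFFF ? 2 : 1;
    count += 1;
  }
  return count;
}

console.log(countCodePoints('a\uD83D\uDCA9b'));  // pair counted once
console.log(countCodePoints('\uD83D'));          // unpaired lead surrogate
```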

That perhaps also illustrates why making the method return a replacement
character for unpaired surrogates is a bad idea: you may violate

  count_unicode(s1 + s2) === count_unicode(s1) + count_unicode(s2)

if this concatenates two halves of a surrogate pair. The `.codePointAt`
method is for random indexing, iterators are for sequential access.
Random indexing into strings is rare except for a few special positions,
but it happens through user input for instance (give me the Unicode
scalar value of the first character of the current text selection).
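
The random-indexing case can be sketched as below. Given an arbitrary code unit index (for instance a selection offset), stepping back at most one code unit is enough to land on the start of a code point in a well-formed string. The helper name `codePointCovering` is hypothetical.

```javascript
// Hypothetical helper: returns the code point that covers code unit index ix.
// If ix lands on a trail surrogate preceded by a lead surrogate, it steps
// back one code unit so that codePointAt sees the whole pair.
function codePointCovering(s, ix) {
  const unit = s.charCodeAt(ix);
  const isTrail = unit >= 0xDC00 && unit <= 0xDFFF;
  if (isTrail && ix > 0) {
    const prev = s.charCodeAt(ix - 1);
    const prevIsLead = prev >= 0xD800 && prev <= 0xDBFF;
    if (prevIsLead) ix -= 1;
  }
  return s.codePointAt(ix);
}

const text = 'a\uD83D\uDCA9b';
console.log(codePointCovering(text, 1) === 0x1F4A9);  // index on the lead surrogate
console.log(codePointCovering(text, 2) === 0x1F4A9);  // index on the trail surrogate
```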
-- 
Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/