`String.prototype.symbolAt()` (improved `String.prototype.charAt()`)

# Mathias Bynens (11 months ago)

ES6 fixes String.fromCharCode by introducing String.fromCodePoint.

Similarly, String.prototype.charCodeAt is fixed by String.prototype.codePointAt.

Should there be a method that is like String.prototype.charAt except it deals with astral Unicode symbols wherever possible?

>> '𝌆'.charAt(0) // U+1D306
'\uD834' // the first surrogate half for U+1D306

>> '𝌆'.symbolAt(0) // U+1D306
'𝌆' // U+1D306

Has this been discussed before? If there’s any interest I’d be happy to create a strawman.
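
A minimal sketch of the gap being described, using only methods that already exist (charAt, codePointAt, String.fromCodePoint). The variable names are illustrative and not part of any proposal:

```javascript
// U+1D306 is an astral symbol, stored as the surrogate pair D834 DF06.
const tetragram = '\u{1D306}';

// charAt works on code units, so it returns only the lead surrogate.
const half = tetragram.charAt(0);

// Round-tripping through the code point yields the whole symbol.
const whole = String.fromCodePoint(tetragram.codePointAt(0));

console.log(half === '\uD834');   // true
console.log(whole === tetragram); // true
console.log(tetragram.length);    // 2
```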

# Rick Waldron (11 months ago)

I think the idea is good, but the name may be confusing with regard to Symbols (maybe not?)

# Mathias Bynens (11 months ago)

Yeah, I thought about that, but couldn’t figure out a better name. “Glyph” or “Grapheme” wouldn’t be accurate. Any suggestions?

Anyway, if everyone agrees this is a good idea I’ll get started on fleshing out a proposal. We can then use this thread to bikeshed about the name.

# Benjamin (Inglor) Gruenbaum (11 months ago)

I also noticed the naming similarity to ES6 Symbols.

I've seen people polyfill String.prototype.getFullChar, and similarly things like String.prototype.fromFullCharCode, for dealing with surrogates. I like String.prototype.signAt but I haven't seen it used before.

I'm eager to hear what Allen has to say about this given his work on Unicode in ECMAScript, especially how it fits with strawman:support_full_unicode_in_strings

I also think that this is important enough to be there.

# Rick Waldron (11 months ago)

On Fri, Oct 18, 2013 at 10:47 AM, Mathias Bynens <mathias at qiwi.be> wrote:

Anyway, if everyone agrees this is a good idea I’ll get started on fleshing out a proposal. We can then use this thread to bikeshed about the name.

I think it's worthwhile to write up a proposal.

And the shed should always be pink ;)

# Mathias Bynens (11 months ago)

Here’s my proposal. Feedback welcome, as well as suggestions for a better name (if any).

String.prototype.symbolAt(pos)

NOTE: Returns a single-element String containing the code point at element position pos in the String value resulting from converting the this object to a String. If there is no element at that position, the result is the empty String. The result is a String value, not a String object.

When the symbolAt method is called with one argument pos, the following steps are taken:

  1. Let O be CheckObjectCoercible(this value).
  2. Let S be ToString(O).
  3. ReturnIfAbrupt(S).
  4. Let position be ToInteger(pos).
  5. ReturnIfAbrupt(position).
  6. Let size be the number of elements in S.
  7. If position < 0 or position ≥ size, return the empty String.
  8. Let first be the code unit at index position in the String S.
  9. Let cuFirst be the code unit value of the element at index 0 in the String first.
  10. If cuFirst < 0xD800 or cuFirst > 0xDBFF or position + 1 = size, then return first.
  11. Let cuSecond be the code unit value of the element at index position + 1 in the String S.
  12. If cuSecond < 0xDC00 or cuSecond > 0xDFFF, then return first.
  13. Let second be the code unit at index position + 1 in the String S.
  14. Let cp be (cuFirst – 0xD800) × 0x400 + (cuSecond – 0xDC00) + 0x10000.
  15. Return the elements of the UTF-16 Encoding (clause 6) of cp.

NOTE: The symbolAt function is intentionally generic; it does not require that its this value be a String object. Therefore it can be transferred to other kinds of objects for use as a method.
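
The steps above can be sketched as a plain function. This is only an approximation under stated assumptions: symbolAt here is a hypothetical standalone helper (not a built-in), String() stands in for ToString, and Math.trunc stands in for ToInteger.

```javascript
// Hypothetical standalone version of the proposed steps; not a built-in.
function symbolAt(value, pos) {
  if (value == null) {                               // step 1
    throw new TypeError('symbolAt called on null or undefined');
  }
  const S = String(value);                           // steps 2-3
  let position = Math.trunc(Number(pos));            // steps 4-5 (rough ToInteger)
  if (Number.isNaN(position)) position = 0;
  const size = S.length;                             // step 6
  if (position < 0 || position >= size) return '';   // step 7
  const first = S.charAt(position);                  // step 8
  const cuFirst = first.charCodeAt(0);               // step 9
  if (cuFirst < 0xD800 || cuFirst > 0xDBFF || position + 1 === size) {
    return first;                                    // step 10
  }
  const cuSecond = S.charCodeAt(position + 1);       // step 11
  if (cuSecond < 0xDC00 || cuSecond > 0xDFFF) return first; // step 12
  // Steps 13-15: combine the surrogate pair into one code point.
  const cp = (cuFirst - 0xD800) * 0x400 + (cuSecond - 0xDC00) + 0x10000;
  return String.fromCodePoint(cp);
}

console.log(symbolAt('\u{1D306}', 0) === '\u{1D306}'); // true
console.log(symbolAt('\u{1D306}', 1) === '\uDF06');    // true
console.log(symbolAt('abc', 5) === '');                // true
```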

# Rick Waldron (11 months ago)

On Fri, Oct 18, 2013 at 11:15 AM, Mathias Bynens <mathias at qiwi.be> wrote:

Here’s my proposal. Feedback welcome, as well as suggestions for a better name (if any).

String.prototype.symbolAt(pos)

Here goes...

String.prototype.elementAt?

# Domenic Denicola (11 months ago)

Doesn't Unicode have some name for "visual representation of a code point"? Maybe it's "symbol"?

# Anne van Kesteren (11 months ago)

On Fri, Oct 18, 2013 at 1:46 PM, Mathias Bynens <mathias at qiwi.be> wrote:

Similarly, String.prototype.charCodeAt is fixed by String.prototype.codePointAt.

When you phrase it like that, I see another problem with codePointAt(). You can't just replace existing usage of charCodeAt() with codePointAt() as that would fail for input with paired surrogates. E.g. a simple loop over a string that prints code points would print both the code point and the trail surrogate code point for a surrogate pair.

The same goes for this new method. I still think that only offering a better way to iterate strings (as planned) seems like a much safer start into this brave new code point-based world.
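
A small sketch of the failure mode described here, assuming a loop that was written for charCodeAt and had codePointAt swapped in mechanically:

```javascript
const text = 'a\u{1F4A9}b'; // 'a', U+1F4A9 (stored as D83D DCA9), 'b'

// The index still advances one code unit at a time.
const seen = [];
for (let i = 0; i < text.length; i++) {
  seen.push(text.codePointAt(i).toString(16));
}

// The astral code point is reported, followed by its own trail surrogate.
console.log(seen); // [ '61', '1f4a9', 'dca9', '62' ]
```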

# Mathias Bynens (11 months ago)

On 18 Oct 2013, at 10:39, Domenic Denicola <domenic at domenicdenicola.com> wrote:

Doesn't Unicode have some name for "visual representation of a code point"? Maybe it's "symbol"?

Not that I know of. I guess “Character” (www.unicode.org/glossary/#character) comes close, but we can’t really use that because String.prototype.charAt already exists. FWIW, I always use the term “symbol” to refer to a string that represents a single code point.

IMHO it’s not really confusing to name this new method symbolAt because it’s defined on String.prototype, which indicates that it acts on strings and has nothing to do with ES6 Symbols. That said, I welcome better suggestions :)

# Mathias Bynens (11 months ago)

On 18 Oct 2013, at 10:25, Rick Waldron <waldron.rick at gmail.com> wrote:

String.prototype.elementAt?

This may be confusing too, since the spec refers to elements as code units, not code points.

# Mathias Bynens (11 months ago)

On 18 Oct 2013, at 10:48, Anne van Kesteren <annevk at annevk.nl> wrote:

On Fri, Oct 18, 2013 at 1:46 PM, Mathias Bynens <mathias at qiwi.be> wrote:

Similarly, String.prototype.charCodeAt is fixed by String.prototype.codePointAt.

When you phrase it like that, I see another problem with codePointAt(). You can't just replace existing usage of charCodeAt() with codePointAt() as that would fail for input with paired surrogates. E.g. a simple loop over a string that prints code points would print both the code point and the trail surrogate code point for a surrogate pair.

I disagree. In those situations you should just iterate over the string using for…of.

.symbolAt() can be a useful replacement for .charAt() in case you only need to get the first symbol in the string. The same goes for .codePointAt() vs. .charCodeAt().

# Rick Waldron (11 months ago)

On Fri, Oct 18, 2013 at 11:53 AM, Mathias Bynens <mathias at qiwi.be> wrote:

On 18 Oct 2013, at 10:25, Rick Waldron <waldron.rick at gmail.com> wrote:

String.prototype.elementAt?

This may be confusing too, since the spec refers to elements as code units, not code points.

Yes, slight misreading of your proposal. Thanks for clarifying.

# Anne van Kesteren (11 months ago)

On Fri, Oct 18, 2013 at 4:58 PM, Mathias Bynens <mathias at qiwi.be> wrote:

On 18 Oct 2013, at 10:48, Anne van Kesteren <annevk at annevk.nl> wrote:

When you phrase it like that, I see another problem with codePointAt(). You can't just replace existing usage of charCodeAt() with codePointAt() as that would fail for input with paired surrogates. E.g. a simple loop over a string that prints code points would print both the code point and the trail surrogate code point for a surrogate pair.

I disagree. In those situations you should just iterate over the string using for…of.

That seems to iterate over code units as far as I can tell.

for (var x of "💩")
  print(x.charCodeAt(0))

invokes print() twice in Gecko.

# André Bargull (11 months ago)

SpiderMonkey does not implement the (yet to be) spec'ed String.prototype.@@iterator function, instead it simply aliases String.prototype["@@iterator"] to Array.prototype["@@iterator"]:

js> String.prototype["@@iterator"] === Array.prototype["@@iterator"]
true

# Allen Wirfs-Brock (11 months ago)

On Oct 18, 2013, at 7:21 AM, Rick Waldron wrote:

I think the idea is good, but the name may be confusing with regard to Symbols (maybe not?)

Given that we have charAt, charCodeAt and codePointAt, I think the most appropriate name for such a method would be 'at':

 '𝌆'.at(0)

The issue when this sort of method has been discussed in the past has been what to do when you index at a trailing surrogate position:

'𝌆'.at(1)

do you still get '𝌆' or do you get the equivalent of String.fromCharCode('𝌆'[1])?

# Allen Wirfs-Brock (11 months ago)

On Oct 18, 2013, at 9:05 AM, Anne van Kesteren wrote:

That seems to iterate over code units as far as I can tell.

for (var x of "💩")
  print(x.charCodeAt(0))

invokes print() twice in Gecko.

No, that's not correct. The @@iterator method of String.prototype is supposed to return an iterator that iterates code points and returns single code point strings.

The spec. for this will be in the next draft that I release.
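
A short sketch of the behavior described here, assuming an engine whose string iterator follows that design (iterating by code point rather than by code unit):

```javascript
const pile = '\u{1F4A9}'; // one code point, two code units

const pieces = [];
for (const symbol of pile) {
  pieces.push(symbol);
}

console.log(pieces.length);      // 1
console.log(pieces[0] === pile); // true
console.log(pieces[0].length);   // 2
```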

# Andrea Giammarchi (11 months ago)

+1 for the simplified at(symbolIndex)

I would expect '𝌆'.at(1) to fail same way 'a'.charAt(1) or 'a'.charCodeAt(1) would.

I would expect '𝌆'.at(symbolIndex) to behave as length does based on unique symbol (unicode extra) so that everyone, except RAM and CPU, will have life easier with strings.

Long story short: there's no symbol at 1, the symbol is at 0 because the size of that unicode string is 1

That said, I am sure the discussion went through this already ^_^

# Andrea Giammarchi (11 months ago)

"the size of that unicode string is 1" ... meaning the virtual size for human eyes

# Andrea Giammarchi (11 months ago)

if this is true then .at(symbolIndex) should be a no-brainer ?

var virtualLength = 0;
for (var x of "💩") {
  virtualLength++;
}

// equivalent of
for(var i = 0; i < virtualLength; i++) {
  "💩".at(i);
}

Am I missing something ?

# Allen Wirfs-Brock (11 months ago)

On Oct 18, 2013, at 10:06 AM, Andrea Giammarchi wrote:

+1 for the simplified at(symbolIndex)

I would expect '𝌆'.at(1) to fail same way 'a'.charAt(1) or 'a'.charCodeAt(1) would.

They aren't comparable, as the 'a' examples are "index out of bounds" errors. We only use code unit indices with strings, so '𝌆'[1] is valid (and so presumably should be '𝌆'.at(1), with 1 having the same meaning in each case).

The most consistent way to define String.prototype.at would be:

String.prototype.at = function(pos) {
    let cp = this.codePointAt(pos);
    return cp===undefined ? undefined : String.fromCodePoint(cp)
}

# Mathias Bynens (11 months ago)

On 18 Oct 2013, at 11:05, Anne van Kesteren <annevk at annevk.nl> wrote:

That seems to iterate over code units as far as I can tell.

for (var x of "💩")
  print(x.charCodeAt(0))

invokes print() twice in Gecko.

Woah, that doesn’t seem very useful. Is that a bug, or the way it’s supposed to work? I thought it was supposed to only iterate over whole code points (i.e. only print once for each code point, not once for each surrogate half).

# Allen Wirfs-Brock (11 months ago)

On Oct 18, 2013, at 10:18 AM, Andrea Giammarchi wrote:

Am I missing something ?

Yes, we don't want to introduce code point based direct indexing, which always requires scanning from the front of the string. We already made that decision in the context of codePointAt, which only uses code unit indices.

# Jason Orendorff (11 months ago)

On Fri, Oct 18, 2013 at 12:03 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

for (var x of "💩")
  print(x.charCodeAt(0))

invokes print() twice in Gecko.

No, that's not correct. The @@iterator method of String.prototype is supposed to return an iterator that iterates code points and returns single code point strings.

Filed: bugzilla.mozilla.org/show_bug.cgi?id=928508

# Andrea Giammarchi (11 months ago)

fair enough, that was my point about

except for RAM and CPU, life is going to be easier for devs

so my counter-question would be: is there any way to do that in core so that we can “💩💩💩”.split() it so that we can have an ArrayLike that with [1] gives back the single “💩” and not the whole thing ?

Or does Mathias already have a RegExp able to split like that with reasonable performance ?

P.S. I am in Chrome and Safari and I had no idea until I've seen that on twitter what kind of “💩” we were talking about :D

# Allen Wirfs-Brock (11 months ago)

On Oct 18, 2013, at 1:12 PM, Andrea Giammarchi wrote:

fair enough, that was my point about

except for RAM and CPU, life is going to be easier for devs

so my counter-question would be: is there any way to do that in core so that we can “💩💩💩”.split() it so that we can have an ArrayLike that with [1] gives back the single “💩” and not the whole thing ?

Array.from('𝌆𝌆𝌆')[1]
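
A brief sketch of what this gives, assuming Array.from consumes the string through its code point iterator:

```javascript
const tripled = '\u{1D306}\u{1D306}\u{1D306}';
const symbols = Array.from(tripled);

console.log(symbols.length);             // 3
console.log(symbols[1] === '\u{1D306}'); // true

// Plain indexing on the string still works in code units.
console.log(tripled.length);             // 6
console.log(tripled[1] === '\uDF06');    // true (a lone trail surrogate)
```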

# Mathias Bynens (11 months ago)

Please ignore my previous email; it has been answered already. (It was a draft I wrote up this morning before I lost my internet connection.)

On 18 Oct 2013, at 11:57, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Given that we have charAt, charCodeAt and codePointAt, I think the most appropriate name for such a method would be 'at': '𝌆'.at(0)

Love it!

The issue when this sort of method has been discussed in the past has been what to do when you index at a trailing surrogate position:

'𝌆'.at(1)

do you still get '𝌆' or do you get the equivalent of String.fromCharCode('𝌆'[1]) ?

In my proposal it would return the equivalent of String.fromCharCode('𝌆'[1]). I think that’s the most sane behavior in that case. This also mimics the way String.codePointAt works in such a case.

Here’s a prollyfill for String.prototype.at based on my earlier proposal: mathiasbynens/String.prototype.at Tests: mathiasbynens/String.prototype.at/blob/master/tests/tests.js
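
A rough sketch of the behavior described above. symbolAtIndex is a hypothetical standalone helper written for illustration (it is not the linked prollyfill), built on codePointAt as suggested earlier in the thread and returning the empty string when out of range, as in the proposal text:

```javascript
// Hypothetical helper; pos is a code unit index, as with codePointAt.
function symbolAtIndex(str, pos) {
  const cp = str.codePointAt(pos);
  return cp === undefined ? '' : String.fromCodePoint(cp);
}

const s = '\u{1D306}';
console.log(symbolAtIndex(s, 0) === s);        // true: the whole symbol
console.log(symbolAtIndex(s, 1) === '\uDF06'); // true: the lone trail surrogate
console.log(s.codePointAt(1).toString(16));    // 'df06'
console.log(symbolAtIndex(s, 2) === '');       // true: out of range
```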

# Mathias Bynens (11 months ago)

On 18 Oct 2013, at 15:12, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:

so my counter-question would be: is there any way to do that in core so that we can “💩💩💩”.split() it so that we can have an ArrayLike that with [1] gives back the single “💩” and not the whole thing ?

This brings us back to the earlier discussion of whether something like String.prototype.codePoints should be added: esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string I think it would be useful

# Andrea Giammarchi (11 months ago)

If I understand Allen's answer, it looks like Array.from(“💩&💩”).length would do, being 3, making the operation straightforward?

# Joshua Bell (11 months ago)

Given that you can only use the proposed String.prototype.at() properly for indexes > 0 if you know the index of a non-BMP character or lead surrogate by some other means, or if you will test the return value for a trailing surrogate, is it really an advantage over using codePointAt / fromCodePoint?

The name "at" is so tempting I'm imagining naive scripts of the form for (i = 0; i < s.length; ++i) { r += s.at(i); } which will work fine until they get a non-BMP input at which point they're suddenly duplicating the trailing surrogates.

Pushing people towards for-of iteration and even Allen's Array.from('𝌆𝌆𝌆')[1] seems safer; users who need more subtle things have codePointAt / fromCodePoint available and hopefully the knowledge to use them.
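
The naive loop can be sketched as follows, assuming a hypothetical symbolAtIndex helper that behaves like the proposed method (code unit index in, whole symbol or lone surrogate out):

```javascript
// Hypothetical stand-in for the proposed method.
function symbolAtIndex(str, pos) {
  const cp = str.codePointAt(pos);
  return cp === undefined ? '' : String.fromCodePoint(cp);
}

const input = 'x\u{1D306}y';

// Naive copy: steps by one code unit, so the trail surrogate is appended twice.
let r = '';
for (let i = 0; i < input.length; ++i) {
  r += symbolAtIndex(input, i);
}

// for-of copy: one step per code point.
let safe = '';
for (const symbol of input) {
  safe += symbol;
}

console.log(input.length);                  // 4
console.log(r.length);                      // 5
console.log(r === 'x\u{1D306}\uDF06y');     // true
console.log(safe === input);                // true
```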

# Allen Wirfs-Brock (11 months ago)

On Oct 18, 2013, at 1:29 PM, Allen Wirfs-Brock wrote:

Array.from('𝌆𝌆𝌆')[1]

maybe even better:

Uint32Array.from('𝌆𝌆𝌆')[1]

# Allen Wirfs-Brock (11 months ago)

err...maybe not if you want a string value:

String.fromCodePoint(Uint32Array.from('𝌆𝌆𝌆')[1])

# André Bargull (11 months ago)

On Oct 18, 2013, at 4:01 PM, Allen Wirfs-Brock wrote:

String.fromCodePoint(Uint32Array.from('𝌆𝌆𝌆')[1])

That does not seem to be too useful:

js> String.fromCodePoint(Uint32Array.from("\u{1d306}\u{1d306}\u{1d306}")[1])
"\u0000"

According to norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#String, String.prototype[@@iterator] does not return plain code points, but the String value for the code point.

# Mathias Bynens (11 months ago)

On 18 Oct 2013, at 17:51, Joshua Bell <jsbell at google.com> wrote:

Given that you can only use the proposed String.prototype.at() properly for indexes > 0 if you know the index of a non-BMP character or lead surrogate by some other means, or if you will test the return value for a trailing surrogate, is it really an advantage over using codePointAt / fromCodePoint?

The name "at" is so tempting I'm imagining naive scripts of the form for (i = 0; i < s.length; ++i) { r += s.at(i); } which will work fine until they get a non-BMP input at which point they're suddenly duplicating the trailing surrogates.

Pushing people towards for-of iteration and even Allen's Array.from('𝌆𝌆𝌆')[1] seems safer; users who need more subtle things have codePointAt / fromCodePoint available and hopefully the knowledge to use them.

Just because new features can be used incorrectly doesn’t mean the feature isn’t useful. for…of on strings and String.prototype.at are two very different things for two very different use cases. It’s a matter of using the right tool for the job, IMHO.

In your example (iterating over all code points in a string), for…of should be used.

String.prototype.codePointAt or String.prototype.at come in handy in case you only need to get the first code point or symbol in a string, for example.

# Domenic Denicola (11 months ago)

On 19 Oct 2013, at 01:12, "Mathias Bynens" <mathias at qiwi.be> wrote:

String.prototype.codePointAt or String.prototype.at come in handy in case you only need to get the first code point or symbol in a string, for example.

Are they useful for anything else, though? For example, if I wanted to get the second symbol in a string, how would I do that?

# Andrea Giammarchi (11 months ago)

so it's a for/of with a break when it finds a code point? If that's the only use case, I'd like to have an example of how convenient it is. I am just wondering, not saying it is not useful (trying to understand when/where/why I'd like to use .at())

# Allen Wirfs-Brock (11 months ago)

On Oct 18, 2013, at 10:53 PM, Domenic Denicola wrote:

On 19 Oct 2013, at 01:12, "Mathias Bynens" <mathias at qiwi.be> wrote:

String.prototype.codePointAt or String.prototype.at come in handy in case you only need to get the first code point or symbol in a string, for example.

Are they useful for anything else, though? For example, if I wanted to get the second symbol in a string, how would I do that?

We discussed the utility of 'codePointAt' in the context of Norbert's full Unicode support proposal. At that time we concluded that it was something we needed. I don't see any new evidence that suggests that we need to reopen that decision at this point in the process.

The utility of a hypothetical 'at' method is presumably exactly that of 'codePointAt'.

str.at(p)

would just be a convenience for expressing

String.fromCodePoint(str.codePointAt(p))

So the real question is probably, how common is that use case.

It's relatively easy to write a for loop over the characters of a string using 'at'. Something like:

let c = '';
for (let p=0; p<str.length; p+=c.length) {
   c = str.at(p);
   ...
}

although, a for-of would be better in most cases:

for (let c of str)

The use case that we don't support well is any sort of backwards iteration of the characters of a string. We don't currently have an iterator specified to do it, nor do we have a one stop way to test whether we are looking at the trailing surrogate of a surrogate pair.
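
For illustration, backwards iteration currently has to be hand-written along these lines. This is only a sketch: symbolsReversed is a hypothetical helper, and the surrogate range checks are the manual test that the paragraph above says has no one stop equivalent.

```javascript
// Hypothetical helper: walk a string from the end, one code point at a time.
function* symbolsReversed(str) {
  let end = str.length;
  while (end > 0) {
    let start = end - 1;
    const unit = str.charCodeAt(start);
    // A trail surrogate preceded by a lead surrogate belongs to one symbol.
    if (unit >= 0xDC00 && unit <= 0xDFFF && start > 0) {
      const prev = str.charCodeAt(start - 1);
      if (prev >= 0xD800 && prev <= 0xDBFF) start -= 1;
    }
    yield str.slice(start, end);
    end = start;
  }
}

console.log([...symbolsReversed('a\u{1D306}b')]); // [ 'b', '𝌆', 'a' ]
```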

# Allen Wirfs-Brock (11 months ago)

On Oct 18, 2013, at 4:22 PM, André Bargull wrote:

That does not seem to be too useful:

js> String.fromCodePoint(Uint32Array.from("\u{1d306}\u{1d306}\u{1d306}")[1])
"\u0000"

right, it would need to be

String.fromCodePoint(Uint32Array.from( '𝌆𝌆𝌆', s=>s.codePointAt(0))[1])

According to norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#String, String.prototype[@@iterator] does not return plain code points, but the String value for the code point.

yes, that's correct and how I have it spec'ed in rev20
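
A short sketch of both variants side by side, assuming an engine with TypedArray.from and the code point string iterator. Without a map function each one-symbol string is coerced to a number (NaN, stored as 0); with the map function the code points survive:

```javascript
const text = '\u{1D306}\u{1D306}\u{1D306}';

// Each iterated value is a string such as '𝌆', which converts to NaN and is stored as 0.
const zeros = Uint32Array.from(text);

// Mapping each symbol to its code point first gives usable numbers.
const codePoints = Uint32Array.from(text, s => s.codePointAt(0));

console.log(Array.from(zeros));                                   // [ 0, 0, 0 ]
console.log(codePoints[1].toString(16));                          // '1d306'
console.log(String.fromCodePoint(codePoints[1]) === '\u{1D306}'); // true
```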

# Bjoern Hoehrmann (11 months ago)

Allen Wirfs-Brock wrote:

The utility of a hypothetical 'at' method is presumably exactly that of 'codePointAt'.

str.at(p)

would just be a convenience for expressing

String.fromCodePoint(str.codePointAt(p))

So the real question is probably, how common is that use case.

Certainly not common enough to warrant a two-character method on the native string type. Odds are people will use it incorrectly in an attempt to make their code look concise, not understanding that it'll retrieve a substring of .length 1 or 2, possibly consisting of a lone surrogate, based on a 16 bit index that might fall in the middle of a character; the problematic cases are fairly rare, so it's hard to notice improper use of .at in automated testing or in code review.

# Mathias Bynens (11 months ago)

On 19 Oct 2013, at 12:15, Bjoern Hoehrmann <derhoermi at gmx.net> wrote:

Certainly not common enough to warrant a two-character method on the native string type. Odds are people will use it incorrectly in an attempt to make their code look concise […]

Are you saying that changing the name to something that is longer than at would solve this problem?

[…] not understanding that it'll retrieve a substring of .length 1 or 2, possibly consisting of a lone surrogate, based on a 16 bit index that might fall in the middle of a character; the problematic cases are fairly rare, so it's hard to notice improper use of .at in automated testing or in code review.

People are using String.prototype.charAt() incorrectly too, expecting it to return whole symbols instead of surrogate halves wherever possible. How would not introducing a method that avoids this problem help?

# Mathias Bynens (11 months ago)

On 19 Oct 2013, at 00:53, Domenic Denicola <domenic at domenicdenicola.com> wrote:

On 19 Oct 2013, at 01:12, "Mathias Bynens" <mathias at qiwi.be> wrote:

String.prototype.codePointAt or String.prototype.at come in handy in case you only need to get the first code point or symbol in a string, for example.

Are they useful for anything else, though? For example, if I wanted to get the second symbol in a string, how would I do that?

Yeah, that’s the problem with these methods. Additional user code is required to handle non-zero position arguments, unless you’re sure the position is actually the start of a code point (and not in the middle of a surrogate pair). I guess there are situations where that’s a certainty, for example when you’re dealing with a string in which the user selected some text.

This brings us back to the earlier discussion of whether something like String.prototype.codePoints should be added: esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string It could be a getter or a generator… Or does for…of iteration handle this use case adequately?

# Domenic Denicola (11 months ago)

From: Mathias Bynens [mailto:mathias at qiwi.be]

This brings us back to the earlier discussion of whether something like String.prototype.codePoints should be added: esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string It could be a getter or a generator… Or does for…of iteration handle this use case adequately?

It sounds like you are proposing a second name for String.prototype[Symbol.iterator], which does not sound very useful.

A property for the string's "real length" does seem somewhat useful, as does a method that does random-access on "real characters." Certainly more useful than the proposed symbolAt/at. But I suppose we can pave whatever cowpaths arise.

My proposed cowpaths:

Object.mixin(String.prototype, {
  realCharacterAt(i) {
    let index = 0;
    for (var c of this) {
      if (index++ === i) {
        return c;
      }
    }
  },
  get realLength() {
    let counter = 0;
    for (var c of this) {
      ++counter;
    }
    return counter;
  }
});

This would allow you to e.g. find the character in the "real" middle of a string with code like

var middleIndex = Math.floor(theString.realLength / 2);
var middleRealCharacter = theString.realCharacterAt(middleIndex);
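
The same idea can be tried without patching String.prototype. This is a sketch under assumptions: Object.mixin was only a draft API, so plain functions are used here, keeping the realCharacterAt and realLength names from the message above.

```javascript
// Free-function variants of the sketch above; each call is O(n).
function realCharacterAt(str, i) {
  let index = 0;
  for (const c of str) {
    if (index++ === i) return c;
  }
  return undefined; // i is past the last code point
}

function realLength(str) {
  let counter = 0;
  for (const _ of str) ++counter;
  return counter;
}

const theString = 'ab\u{1F4A9}cd';
const middleIndex = Math.floor(realLength(theString) / 2);

console.log(theString.length);      // 6 code units
console.log(realLength(theString)); // 5 code points
console.log(realCharacterAt(theString, middleIndex) === '\u{1F4A9}'); // true
```
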
# Andrea Giammarchi (11 months ago)

AFAIK that's also what Allen said he didn't want to implement in core: an expensive operation on each invocation, due to a stateless loop over arbitrary indexes.

Although, strings are immutable in JS so I'd implement that logic creating a snapshot once and use that as if it was an Array ... something like the following:


!function(dict){

  function getOrCreate(str) {
    if (!(str in dict)) {
      dict[str] = {
        i: 0,
        l: 0,
        v: (Array.from || function(){
          // miserable callback
          return str.split('')
        })(str)
        // or the for/of loop
      };
    }
    // times it's used
    dict[str].i++;
    return dict[str].v;
  }

  setInterval(function () {
    var key, value;
    for(key in dict) {
      value = dict[key];
      value.l = value.i - value.l;
      // used only once or never used again
      if (value.l < 2) {
        // free all the RAM
        delete dict[key];
      }
    }
  }, 5000); // 5 seconds should be enough ?
            // incremental works better with
            // slower timeout though
            // 500 might be good too

  Object.defineProperties(
    String.prototype,
    {
      at: {
        configurable: true,
        writable: true,
        value: function at(i) {
          return getOrCreate(this)[i];
        }
      },
      // or any meaningful name
      size: {
        configurable: true,
        get: function () {
          return getOrCreate(this).length;
        }
      }
    }
  );

}(Object.create(null));


// @example
var str = 'abc';
alert([
  str.size, // 3
  str.at(1) // b
]);

# Andrea Giammarchi (11 months ago)

a more readable example, with some typos fixed, is on GitHub: gist.github.com/WebReflection/7059536

license wtfpl v2 www.wtfpl.net/txt/copying

# Bjoern Hoehrmann (11 months ago)

Mathias Bynens wrote:

Are you saying that changing the name to something that is longer than at would solve this problem?

If it was .getOneOrTwoCodepointLongSubstringAtUcs2CodeUnitIndex(...) I am sure people would be reluctant to use it because it's unreasonably long compared to String.fromCodePoint(str.codePointAt(p)) and harder to understand than the combination of those two primitives.

People are using String.prototype.charAt() incorrectly too, expecting it to return whole symbols instead of surrogate halves wherever possible. How would not introducing a method that avoids this problem help?

Right now people do not have much of a choice other than writing code that does not do the right thing when faced with malformed strings or non-BMP characters; it's unreasonable to call a method like substr and then manually smooth it up around the edges and perhaps scan the interior for lone surrogates to ensure that at least your code doesn't do the wrong thing. That gives you "well-known bad" code, which is a good thing to have, better than more complicated code that might have unknown bugs. Allen's loop, for (let p=0; p<str.length; p+=c.length), for instance, is just waiting for someone to improve or replace it with code that increments by 1 instead of .length because that's simpler.

The methods fromCodePoint and codePointAt can be used to get ugly constants out of code that tries to do the right thing, and they will offer some insight into how developers might go from UCS-only code to something more proper, but for the moment duplicating all the UCS-based methods strikes me as premature, especially when giving them seductive names. How would a somewhat-surrogate-aware substring method work and what would it be called, for instance? If it is omitted, we would be back to square one, someone in need of substring functionality has to jump through overly complicated hoops to make it work "correctly" and ends up mixing surrogate-pair-aware with -unaware code.

# Brendan Eich (11 months ago)

Allen Wirfs-Brock wrote:

The use case that we don't support well is any sort of backwards iteration of the characters of a string. We don't currently have an iterator specified to do it, nor do we have a one stop way to test whether we are looking at the trailing surrogate of a surrogate pair.

What do you mean by "one stop"? O(1)? We aren't going to mandate implementations make such tests (or backward iteration) that cheap.

Is there yet a real world (from the field, not a testcase) use-case for backward iteration?

# Andrea Giammarchi (11 months ago)

a nested loop might be a concrete case where O(n) happens ... not so common with strings but quite possibly used in many parsers implemented in JS itself.

# Mathias Bynens (11 months ago)

On 19 Oct 2013, at 12:54, Domenic Denicola <domenic at domenicdenicola.com> wrote:

My proposed cowpaths:

Object.mixin(String.prototype, {
 realCharacterAt(i) {
   let index = 0;
   for (var c of this) {
     if (index++ === i) {
       return c;
     }
   }
 },
 get realLength() {
   let counter = 0;
   for (var c of this) {
     ++counter;
   }
   return counter;
 }
});

Good stuff!

To account for lookalike symbols due to combining marks, just add a call to String.prototype.normalize:

Object.mixin(String.prototype, {
  get realLength() {
    let counter = 0;
    for (var c of this.normalize('NFC')) {
      ++counter;
    }
    return counter;
  }
});

assert('ma\xF1ana'.realLength == 'man\u0303ana'.realLength);
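
A small sketch of what normalization does and does not buy here, assuming an engine with String.prototype.normalize. The count helper is illustrative. NFC only merges sequences that have a precomposed form, so some visually single symbols still count as more than one code point:

```javascript
const precomposed = 'ma\xF1ana';   // ñ as the single code point U+00F1
const decomposed = 'man\u0303ana'; // n followed by U+0303 COMBINING TILDE

// Illustrative helper: number of code points in a string.
const count = s => Array.from(s).length;

console.log(count(precomposed));                 // 6
console.log(count(decomposed));                  // 7
console.log(count(decomposed.normalize('NFC'))); // 6

// q with a combining tilde has no precomposed form, so NFC leaves two code points.
console.log(count('q\u0303'.normalize('NFC')));  // 2
```
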
# Mathias Bynens (7 months ago)

Allen mentioned that String#at might not make it to ES6 because nobody in TC39 is championing it. I’ve now asked Rick if he would be the champion for this, and he agreed. (Thanks again!)

Looking over the ‘TC39 progress’ document at docs.google.com/a/chromium.org/document/d/1QbEE0BsO4lvl7NFTn5WXWeiEIBfaVUF7Dk0hpPpPDzU, it seems most of the work is already taken care of: the use case was discussed in this thread, the proposal has a complete spec text, and there’s an example implementation/polyfill with unit tests. See mths.be/at.

Is there anything else I can do to help get this included as a non-TC39-member?

# Domenic Denicola (7 months ago)

This was the method that was only useful if you pass 0 to it?

# C. Scott Ananian (7 months ago)

Note that Array.from(str) and str[Symbol.iterator] overlap significantly. In particular, it's somewhat awkward to iterate over code points using String#symbolAt; it's much easier to use substr() and then use the StringIterator.

ps. I see that Domenic has said something similar.

# Mathias Bynens (7 months ago)

On 14 Feb 2014, at 11:11, Domenic Denicola <domenic at domenicdenicola.com> wrote:

This was the method that was only useful if you pass 0 to it?

I’ll just avoid the infinite loop here by pointing to earlier posts in this thread where this was discussed before: esdiscuss.org/topic/string-prototype-symbolat-improved-string-prototype-charat#content-34 and esdiscuss.org/topic/string-prototype-symbolat-improved-string-prototype-charat#content-40.

This method is just as useful as String.prototype.codePointAt. If that method is included, so should String.prototype.at. If String.prototype.at is found not to be useful, String.prototype.codePointAt should be removed too.

# Mathias Bynens (7 months ago)

On 14 Feb 2014, at 11:14, C. Scott Ananian <ecmascript at cscott.net> wrote:

Note that Array.from(str) and str[Symbol.iterator] overlap significantly. In particular, it's somewhat awkward to iterate over code points using String#symbolAt; it's much easier to use substr() and then use the StringIterator.

String#at is not meant for iterating over code points – that’s what the StringIterator is for.

String#at is exactly like String#codePointAt except it returns strings (containing the symbol) instead of numbers (representing the code point value). It can be used to get the symbol at a given code unit position in a string (similar to how String#codePointAt can be used to get the code point at a given code unit position in a string).

# C. Scott Ananian (7 months ago)

Yes, I know what String#at is supposed to do.

I was pointing out that String#at makes it easy to do the wrong thing. If you do Array.from(str) then you suddenly have a complete random-access data structure where you can find out the number of code points in the String, iterate it in reverse from the end to the start, slice it, find the midpoint, etc. Array.from looks like an O(n) operation, and it is -- so it encourages developers to cache the value and reuse it.
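A short sketch of the operations listed above, assuming an ES6 environment where `Array.from` uses the string iterator; the variable names are illustrative only.

```javascript
const text = 'a\uD834\uDF06b'; // 'a', U+1D306, 'b'

// One O(n) pass; each array element is a whole symbol (code point), not a code unit.
const symbols = Array.from(text);

const codeUnitCount = text.length;        // counts UTF-16 code units
const codePointCount = symbols.length;    // counts code points

// Reverse without splitting the surrogate pair.
const reversed = symbols.slice().reverse().join('');

// Random access by code point index, e.g. the midpoint.
const middle = symbols[Math.floor(symbols.length / 2)];

// Slice by code point index.
const firstTwo = symbols.slice(0, 2).join('');
```

Because the array is built once, it can be cached and reused for all of these lookups.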

That said, I can see where a lexer might want to use String#at, being careful to do the correct index bump based on result.length. However, the fastest JS lexers don't create String objects, they operate directly on the code point (see marijnhaverbeke.nl/acorn/#section-58). So I'm -0, mostly because the name isn't great. But I have exactly zero say in the matter anyway. So I'll shut up now.
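The index bump mentioned above could look roughly like this. It is only a sketch: `symbolAt` is a hypothetical stand-in for the proposed method, and `scanSymbols` is an invented name for the example.

```javascript
// Hypothetical stand-in for the proposed method (assumption: '' when out of range).
function symbolAt(str, position) {
  const codePoint = str.codePointAt(position);
  return codePoint === undefined ? '' : String.fromCodePoint(codePoint);
}

// Walk a string symbol by symbol, recording the code unit index of each symbol.
function scanSymbols(str) {
  const tokens = [];
  let index = 0;
  while (index < str.length) {
    const symbol = symbolAt(str, index);
    tokens.push({ index, symbol });
    // Advance by 1 for a BMP symbol or lone surrogate, by 2 for a surrogate pair.
    index += symbol.length;
  }
  return tokens;
}

const tokens = scanSymbols('a\uD834\uDF06b');
const indices = tokens.map((token) => token.index);
```

Advancing by a fixed 1 instead would land inside the surrogate pair on the next iteration, which is the "wrong thing" the method makes easy.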

# Domenic Denicola (7 months ago)

I think Mathias's point, that it is exactly as useful or useless as codePointAt, is a reasonable one. However,

This method is just as useful as String.prototype.codePointAt. If that method is included, so should String.prototype.at. If String.prototype.at is found not to be useful, String.prototype.codePointAt should be removed too.

This does not follow. The choice is not between adding two useless methods and adding zero. There is no reason to exclude the possibility of adding only one useless method.

But anyway, as some people seem to think that both methods are in fact useful---including Rick, who has agreed to champion---I agree with Scott that after having said our piece it's time to exit the thread.

# Rick Waldron (7 months ago)

On Fri, Feb 14, 2014 at 1:34 AM, Mathias Bynens <mathias at qiwi.be> wrote:

Allen mentioned that String#at might not make it to ES6 because nobody in TC39 is championing it. I've now asked Rick if he would be the champion for this, and he agreed. (Thanks again!)

Published to wiki here: strawman:string_at

# Allen Wirfs-Brock (7 months ago)

On Feb 14, 2014, at 1:34 AM, Mathias Bynens wrote:

Allen mentioned that String#at might not make it to ES6 because nobody in TC39 is championing it. I’ve now asked Rick if he would be the champion for this, and he agreed. (Thanks again!)

Looking over the ‘TC39 progress’ document at docs.google.com/a/chromium.org/document/d/1QbEE0BsO4lvl7NFTn5WXWeiEIBfaVUF7Dk0hpPpPDzU, it seems most of the work is already taken care of: the use case was discussed in this thread, the proposal has a complete spec text, and there’s an example implementation/polyfill with unit tests. See mths.be/at.

Is there anything else I can do to help get this included as a non-TC39-member?

But just to be even clearer, the new-feature gate for ES6 is officially closed.

It's a really high bar to get over that closed gate. Unless the exclusion of a feature was a mistake, or the feature fixes a bug or is somehow essential to supporting something that is already in ES6, I don't think we should be talking about adding it to ES6.

I don't think String.prototype.at fits any of those criteria. We've talked about it several times, including in the context of Norbert's original ES6 full unicode support proposal, and never achieved consensus on including it. Personally, I think it should be there but it's time to start talking about it for ES7 not ES6.

# Rick Waldron (7 months ago)

Yes, I absolutely agree, apologies as I realize that was not addressed in my previous message.

# C. Scott Ananian (7 months ago)

I'm excited to start working on es7-shim once we get to that point! (String.prototype.at has a particularly simple shim, thankfully...)

# Rick Waldron (7 months ago)

On Fri, Feb 14, 2014 at 12:23 PM, C. Scott Ananian <ecmascript at cscott.net> wrote:

I'm excited to start working on es7-shim once we get to that point! (String.prototype.at has a particularly simple shim, thankfully...)

Have you seen: mathiasbynens/String.prototype.at ?

# C. Scott Ananian (7 months ago)

Yes, of course. es6-shim is a large-ish collection of such shims.

However, it would be much better to use an implementation of String#at which used substr and thus avoided creating and appending a new string object.
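A rough sketch of what a substring-based variant might look like, as opposed to building the result from a code point. The function name is hypothetical, `slice` is used here in place of the legacy `substr`, and integer positions are assumed. Whether an engine actually avoids an allocation this way is an implementation detail that cannot be observed from script.

```javascript
// Hypothetical substring-based variant; assumes position is an integer.
function symbolAtViaSlice(str, position) {
  if (position < 0 || position >= str.length) {
    return ''; // assumption: mirror charAt for out-of-range positions
  }
  const first = str.charCodeAt(position);
  // A lead surrogate followed by a trail surrogate forms one astral symbol.
  if (first >= 0xD800 && first <= 0xDBFF && position + 1 < str.length) {
    const second = str.charCodeAt(position + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return str.slice(position, position + 2);
    }
  }
  // BMP symbol or unpaired surrogate: a single code unit.
  return str.charAt(position);
}

const pair = symbolAtViaSlice('a\uD834\uDF06b', 1);
const loneLead = symbolAtViaSlice('x\uD834', 1);
```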

# Brendan Eich (7 months ago)

Aside: "ECMASpeak" is neither accurate (we don't work for Ecma, it's JS not ES :-P), nor euphonious. But here's a pointer:

C. Scott Ananian wrote:

new string object.

"new string primitive", because "string object" (especially with "new" in front) suggests new String('hi').

# Mathias Bynens (7 months ago)

On 14 Feb 2014, at 19:59, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

It's a really high bar to get over that closed gate. Unless the exclusion of a feature was a mistake […] I don't think we should be talking about adding it to ES6.

It does feel like a mistake to me to introduce String.prototype.codePointAt, but no similar function that returns the symbol instead.

# C. Scott Ananian (7 months ago)

On Feb 15, 2014 9:13 AM, "Brendan Eich" <brendan at mozilla.com> wrote:

Aside: "ECMASpeak" is neither accurate (we don't work for Ecma, it's JS not ES :-P), nor euphonious.

I'm learning all sorts of things! I guess there are two names here; what's your preferred phrase for "the language used to write algorithms in the ES6 spec" (JS6?), and, if it differs, "the language used by members of the TC39 committee among themselves when describing language primitives in a very precise way"?

"new string primitive", because "string object" (especially with "new" in front) suggests new String('hi').

I wrestled with the phrasing there. I think what I really mean is "avoid allocating new backing storage", since there are "new string primitives" returned regardless. If there's a better phrase for "string backing storage" I'd be glad to add that to my dictionary.

# Brendan Eich (7 months ago)

C. Scott Ananian wrote:

I'm learning all sorts of things! I guess there are two names here; what's your preferred phrase for "the language used to write algorithms in the ES6 spec" (JS6?), and, if it differs, "the language used by members of the TC39 committee among themselves when describing language primitives in a very precise way"?

When I'm in a bad mood, I call it VisualCobol. It's painfully low-level and verbose, yet hard to verify. Let's hope that the JSCert work will help, and Allen has been common'ing subroutines. Whatever we call it, the spec language ain't great.

Using "-Speak" as a stem conjures Orwell. Not good.

The definition of array-like -- an informal bit of jargon, useful (e.g., "array-like vs. iterable" in context in larger discussions about Array.from) until it's time to get precise -- is a spec matter. I agree we need a common definition that we use consistently.

I wrestled with the phrasing there. I think what I really mean is "avoid allocating new backing storage", since there are "new string primitives" returned regardless. If there's a better phrase for "string backing storage" I'd be glad to add that to my dictionary.

What does "backing storage" mean? There are no new String objects in any event. There may be ropes or dependent strings under the hood, but that's all unobservable (apart from performance) implementation-land.

# Allen Wirfs-Brock (7 months ago)

On Feb 15, 2014, at 11:47 AM, Brendan Eich wrote:

When I'm in a bad mood, I call it VisualCobol. It's painfully low-level and verbose, yet hard to verify. Let's hope that the JSCert work will help, and Allen has been common'ing subroutines. Whatever we call it, the spec language ain't great.

But remember, prior to ES5, it was closer to Cobolish machine language. No structured control, goto's targeting numeric step numbers, intermediate results referenced by step number (sorta SSA with numeric ids), etc.

There has never been a complete redo, just incremental improvements and refactorings. But we've definitely advanced from the early 1950s to the late 1970s.

# Andreas Rossberg (7 months ago)

Well, Algol-60 already was more structured a language than our spec-speak. Let alone how far the Algol-68 spec was ahead of us. :)

# Andreas Rossberg (7 months ago)

On 15 February 2014 20:47, Brendan Eich <brendan at mozilla.com> wrote:

Using "-Speak" as a stem conjures Orwell. Not good.

Ah, relax. Gilad Bracha even named his own language Newspeak. Self-mockery is good.

# Allen Wirfs-Brock (7 months ago)

On Feb 17, 2014, at 4:38 AM, Andreas Rossberg wrote:

Well, Algol-60 already was more structured a language than our spec-speak. Let alone how far the Algol-68 spec was ahead of us. :)

We were discussing the nature of the ES spec pseudocode, not comparing pseudocode to a complete programming language.

Structured programming styles weren't widely adopted until the mid to late 1970s.

The Algol 60 Report used English prose to describe its semantics. The ES specs are closer in style to the Algol 68 Report, although less formal and arguably more approachable.

# Brendan Eich (7 months ago)

Andreas Rossberg wrote:

Ah, relax. Gilad Bracha even named his own language Newspeak.

Yeah, but no "ECMA" -- the double-whammy.

Self-mockery is good.

I pay my dues (see "wat" played with commentary at Fluent 2012 and narrated with tech details at Strange Loop 2012).

# C. Scott Ananian (7 months ago)

Are recordings available?

# Brendan Eich (7 months ago)

C. Scott Ananian wrote:

Are recordings available?

www.infoq.com/presentations/State-JavaScript starting at 1:50

YouTube has more.