`String.prototype.symbolAt()` (improved `String.prototype.charAt()`)
I think the idea is good, but the name may be confusing with regard to Symbols (maybe not?)
Yeah, I thought about that, but couldn’t figure out a better name. “Glyph” or “Grapheme” wouldn’t be accurate. Any suggestions?
Anyway, if everyone agrees this is a good idea I’ll get started on fleshing out a proposal. We can then use this thread to bikeshed about the name.
I also noticed the naming similarity to ES6 Symbols.
I've seen people fill String.prototype.getFullChar
before and similarly
things like String.prototype.fromFullCharCode
for dealing with surrogates
before. I like String.prototype.signAt
but I haven't seen it used before.
I'm eager to hear what Allen has to say about this given his work on unicode in ecmascript. Especially how it settles with this strawman:support_full_unicode_in_strings
I also think that this is important enough to be there.
On Fri, Oct 18, 2013 at 10:47 AM, Mathias Bynens <mathias at qiwi.be> wrote:
Anyway, if everyone agrees this is a good idea I’ll get started on fleshing out a proposal. We can then use this thread to bikeshed about the name.
I think it's worthwhile to write up a proposal.
And the shed should always be pink ;)
Here’s my proposal. Feedback welcome, as well as suggestions for a better name (if any).
String.prototype.symbolAt(pos)
NOTE: Returns a single-element String containing the code point at element position pos in the String value resulting from converting the this object to a String. If there is no element at that position, the result is the empty String. The result is a String value, not a String object.
When the symbolAt method is called with one argument pos, the following steps are taken:
- Let O be CheckObjectCoercible(this value).
- Let S be ToString(O).
- ReturnIfAbrupt(S).
- Let position be ToInteger(pos).
- ReturnIfAbrupt(position).
- Let size be the number of elements in S.
- If position < 0 or position ≥ size, return the empty String.
- Let first be the code unit at index position in the String S.
- Let cuFirst be the code unit value of the element at index 0 in the String first.
- If cuFirst < 0xD800 or cuFirst > 0xDBFF or position + 1 = size, then return first.
- Let cuSecond be the code unit value of the element at index position + 1 in the String S.
- If cuSecond < 0xDC00 or cuSecond > 0xDFFF, then return first.
- Let second be the code unit at index position + 1 in the string S.
- Let cp be (first – 0xD800) × 0x400 + (second – 0xDC00) + 0x10000.
- Return the elements of the UTF-16 Encoding (clause 6) of cp.
NOTE: The symbolAt function is intentionally generic; it does not require that its this value be a String object. Therefore it can be transferred to other kinds of objects for use as a method.
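For readers who prefer code to spec prose, the steps above can be sketched roughly as follows. This is only an illustration, not the proposal's reference implementation: the standalone function name symbolAt(value, pos) is hypothetical, the ToInteger step is approximated with Math.trunc, and the final arithmetic is assumed to operate on the numeric code unit values (cuFirst and cuSecond).

```javascript
// Hypothetical standalone sketch of the proposed algorithm (not the official text).
function symbolAt(value, pos) {
  // CheckObjectCoercible: reject null and undefined.
  if (value === null || value === undefined) {
    throw new TypeError('symbolAt called on null or undefined');
  }
  const S = String(value);
  // Approximation of ToInteger: NaN becomes 0, fractions truncate toward zero.
  const position = Math.trunc(Number(pos)) || 0;
  const size = S.length;
  if (position < 0 || position >= size) return '';
  const first = S.charAt(position);
  const cuFirst = first.charCodeAt(0);
  // Not a lead surrogate, or nothing follows it: return the single code unit.
  if (cuFirst < 0xD800 || cuFirst > 0xDBFF || position + 1 === size) return first;
  const cuSecond = S.charCodeAt(position + 1);
  // Lead surrogate not followed by a trail surrogate: return it on its own.
  if (cuSecond < 0xDC00 || cuSecond > 0xDFFF) return first;
  // Assumption: the code point is computed from the numeric code unit values.
  const cp = (cuFirst - 0xD800) * 0x400 + (cuSecond - 0xDC00) + 0x10000;
  return String.fromCodePoint(cp);
}
```

With this sketch, symbolAt('𝌆', 0) yields the whole symbol, while symbolAt('𝌆', 1) yields the lone trail surrogate.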
Doesn't Unicode have some name for "visual representation of a code point"? Maybe it's "symbol"?
On Fri, Oct 18, 2013 at 1:46 PM, Mathias Bynens <mathias at qiwi.be> wrote:
Similarly, String.prototype.charCodeAt is fixed by String.prototype.codePointAt.
When you phrase it like that, I see another problem with codePointAt(). You can't just replace existing usage of charCodeAt() with codePointAt() as that would fail for input with paired surrogates. E.g. a simple loop over a string that prints code points would print both the code point and the trail surrogate code point for a surrogate pair.
The same goes for this new method. I still think that only offering a better way to iterate strings (as planned) seems like a much safer start into this brave new code point-based world.
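A small sketch of the concern raised here, assuming an engine that implements codePointAt and code point based string iteration. The sample string is made up for illustration.

```javascript
// "a𝌆b": U+1D306 is stored as the surrogate pair D834 DF06.
const s = 'a\u{1D306}b';

// Naive port of a charCodeAt loop: one step per code unit.
const byIndex = [];
for (let i = 0; i < s.length; i++) {
  byIndex.push(s.codePointAt(i).toString(16));
}

// Iterating the string visits each code point exactly once.
const byIterator = [];
for (const symbol of s) {
  byIterator.push(symbol.codePointAt(0).toString(16));
}

console.log(byIndex.join(' '));    // 61 1d306 df06 62
console.log(byIterator.join(' ')); // 61 1d306 62
```

The index-based loop reports the trail surrogate df06 as if it were a separate code point, which is the failure described above.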
On 18 Oct 2013, at 10:39, Domenic Denicola <domenic at domenicdenicola.com> wrote:
Doesn't Unicode have some name for "visual representation of a code point"? Maybe it's "symbol"?
Not that I know of. I guess “Character” (www.unicode.org/glossary/#character) comes close, but we can’t really use that because String.prototype.charAt already exists. FWIW, I always use the term “symbol” to refer to a string that represents a single code point.
IMHO it’s not really confusing to name this new method symbolAt because it’s defined on String.prototype, which indicates that it acts on strings and has nothing to do with ES6 Symbols. That said, I welcome better suggestions :)
On 18 Oct 2013, at 10:48, Anne van Kesteren <annevk at annevk.nl> wrote:
On Fri, Oct 18, 2013 at 1:46 PM, Mathias Bynens <mathias at qiwi.be> wrote:
Similarly, String.prototype.charCodeAt is fixed by String.prototype.codePointAt.
When you phrase it like that, I see another problem with codePointAt(). You can't just replace existing usage of charCodeAt() with codePointAt() as that would fail for input with paired surrogates. E.g. a simple loop over a string that prints code points would print both the code point and the trail surrogate code point for a surrogate pair.
I disagree. In those situations you should just iterate over the string using for…of.
.symbolAt() can be a useful replacement for .charAt() in case you only need to get the first symbol in the string. The same goes for .codePointAt() vs. .charCodeAt().
On Fri, Oct 18, 2013 at 11:53 AM, Mathias Bynens <mathias at qiwi.be> wrote:
On 18 Oct 2013, at 10:25, Rick Waldron <waldron.rick at gmail.com> wrote:
String.prototype.elementAt?
This may be confusing too, since the spec refers to
elements
as code units, not code points.
Yes, slight mis-reading of your proposal—thanks for clarifying
On Fri, Oct 18, 2013 at 4:58 PM, Mathias Bynens <mathias at qiwi.be> wrote:
On 18 Oct 2013, at 10:48, Anne van Kesteren <annevk at annevk.nl> wrote:
When you phrase it like that, I see another problem with codePointAt(). You can't just replace existing usage of charCodeAt() with codePointAt() as that would fail for input with paired surrogates. E.g. a simple loop over a string that prints code points would print both the code point and the trail surrogate code point for a surrogate pair.
I disagree. In those situations you should just iterate over the string using for…of.
That seems to iterate over code units as far as I can tell.
for (var x of "💩")
print(x.charCodeAt(0))
invokes print() twice in Gecko.
SpiderMonkey does not implement the (yet to be) spec'ed String.prototype.@@iterator function, instead it simply aliases String.prototype["@@iterator"] to Array.prototype["@@iterator"]:
js> String.prototype["@@iterator"] === Array.prototype["@@iterator"]
true
On Oct 18, 2013, at 7:21 AM, Rick Waldron wrote:
I think the idea is good, but the name may be confusing with regard to Symbols (maybe not?)
Given that we have charAt, charCodeAt and codePointAt, I think the most appropriate name for such a method would be 'at':
'𝌆'.at(0)
The issue when this sort of method has been discussed in the past has been what to do when you index at a trailing surrogate position:
'𝌆'.at(1)
do you still get '𝌆' or do you get the equivalent of String.fromCharCode('𝌆'[1])?
On Oct 18, 2013, at 9:05 AM, Anne van Kesteren wrote:
That seems to iterate over code units as far as I can tell.
for (var x of "💩") print(x.charCodeAt(0))
invokes print() twice in Gecko.
No, that's not correct: the @@iterator method of String.prototype is supposed to return an iterator that iterates code points and returns single code point strings.
The spec. for this will be in the next draft that I release.
+1 for the simplified at(symbolIndex)
I would expect '𝌆'.at(1)
to fail same way 'a'.charAt(1)
or
'a'.charCodeAt(1)
would.
I would expect '𝌆'.at(symbolIndex)
to behave as length
does based on
unique symbol (unicode extra) so that everyone, except RAM and CPU, will
have life easier with strings.
Long story short: there's no symbol at 1, the symbol is at 0 because the size of that unicode string is 1
That said, I am sure the discussion went through this already ^_^
"the size of that unicode string is 1" ... meaning the virtual size for human eyes
if this is true then .at(symbolIndex) should be a no-brainer ?
var virtualLength = 0;
for (var x of "💩") {
  virtualLength++;
}
// equivalent of
for (var i = 0; i < virtualLength; i++) {
  "💩".at(i);
}
Am I missing something ?
On Oct 18, 2013, at 10:06 AM, Andrea Giammarchi wrote:
+1 for the simplified
at(symbolIndex)
I would expect '𝌆'.at(1) to fail same way 'a'.charAt(1) or 'a'.charCodeAt(1) would.
They are not comparable, as the 'a' examples are "index out of bounds" errors. We only use code unit indices with strings, so '𝌆'[1] is valid (and so presumably should be '𝌆'.at(1)), with 1 having the same meaning in each case.
The most consistent way to define String.prototype.at would be:
String.prototype.at = function(pos) {
  let cp = this.codePointAt(pos);
  return cp === undefined ? undefined : String.fromCodePoint(cp);
};
On 18 Oct 2013, at 11:05, Anne van Kesteren <annevk at annevk.nl> wrote:
That seems to iterate over code units as far as I can tell.
for (var x of "💩") print(x.charCodeAt(0))
invokes print() twice in Gecko.
Woah, that doesn’t seem very useful. Is that a bug, or the way it’s supposed to work? I thought it was supposed to only iterate over whole code points (i.e. only print once for each code point, not once for each surrogate half).
On Oct 18, 2013, at 10:18 AM, Andrea Giammarchi wrote:
Am I missing something ?
Yes, we don't want to introduce code point based direct indexing, which always requires scanning from the front of the string. We already made that decision in the context of codePointAt, which only uses code unit indices.
On Fri, Oct 18, 2013 at 12:03 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
for (var x of "💩") print(x.charCodeAt(0))
invokes print() twice in Gecko.
No that's not correct, the @@iterator method of
String.prototype
is supposed to returns an iterator the iterates code points and returns single codepoint strings.
fair enough, that was my point about
except for RAM and CPU, life is going to be easier for devs
so my counter-question would be: is there any way to do that in core so that we can “💩💩💩”.split() it so that we can have an ArrayLike that with [1] gives back the single “💩” and not the whole thing ?
Or does Mathias already have a RegExp able to split like that with reasonable performance ?
P.S. I am in Chrome and Safari and I had no idea until I've seen that on twitter what kind of “💩” we were talking about :D
On Oct 18, 2013, at 1:12 PM, Andrea Giammarchi wrote:
fair enough, that was my point about
except for RAM and CPU, life is going to be easier for devs
so my counter-question would be: is there any way to do that in core so that we can “💩💩💩”.split() it so that we can have an ArrayLike that with [1] gives back the single “💩” and not the whole thing ?
Array.from('𝌆𝌆𝌆')[1]
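To make the suggestion concrete, here is a short sketch, assuming an engine where Array.from consumes the code point based string iterator. The variable names are illustrative only.

```javascript
// Three copies of U+1D306, each stored as two UTF-16 code units.
const s = '\u{1D306}\u{1D306}\u{1D306}';

// Array.from walks the string iterator, so each element is one whole symbol.
const symbols = Array.from(s);

console.log(s.length);       // 6 (code units)
console.log(symbols.length); // 3 (code points)
console.log(symbols[1] === '\u{1D306}'); // true: the second symbol
console.log(s[1] === '\uDF06');          // true: plain indexing hits a trail surrogate
```

The array is built in one O(n) pass; after that, indexing by symbol position is cheap.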
Please ignore my previous email; it has been answered already. (It was a draft I wrote up this morning before I lost my internet connection.)
On 18 Oct 2013, at 11:57, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
Given that we have charAt, charCodeAt and codePointAt, I think the most appropriate name for such a method would be 'at': '𝌆'.at(0)
Love it!
The issue when this sort of method has been discussed in the past has been what to do when you index at a trailing surrogate position:
'𝌆'.at(1)
do you still get '𝌆' or do you get the equivalent of String.fromCharCode('𝌆'[1]) ?
In my proposal it would return the equivalent of String.fromCharCode('𝌆'[1]). I think that’s the most sane behavior in that case. This also mimics the way String.prototype.codePointAt works in such a case.
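The behavior described here can be illustrated with a small helper. The function at(str, pos) below is a hypothetical stand-in written for this example (it is not the linked prollyfill), built from codePointAt and fromCodePoint and returning the empty string when the position is out of range.

```javascript
// Hypothetical helper: the symbol starting at code unit index `pos`.
function at(str, pos) {
  const cp = str.codePointAt(pos);
  return cp === undefined ? '' : String.fromCodePoint(cp);
}

const s = '\u{1D306}'; // "𝌆", stored as D834 DF06

console.log(at(s, 0) === s);           // true: the whole symbol
console.log(at(s, 1) === s.charAt(1)); // true: only the trail surrogate
console.log(s.codePointAt(1).toString(16)); // df06
```

Indexing into the middle of a pair therefore yields a lone surrogate, mirroring what codePointAt reports for the same index.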
Here’s a prollyfill for String.prototype.at
based on my earlier proposal: mathiasbynens/String.prototype.at Tests: mathiasbynens/String.prototype.at/blob/master/tests/tests.js
On 18 Oct 2013, at 15:12, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:
so my counter-question would be: is there any way to do that in core so that we can “💩💩💩”.split() it so that we can have an ArrayLike that with [1] gives back the single “💩” and not the whole thing ?
This brings us back to the earlier discussion of whether something like String.prototype.codePoints
should be added: esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string I think it would be useful
If I understand Allen's answer, it looks like Array.from(“💩&💩”).length would do, being 3, and making the operation straightforward?
Given that you can only use the proposed String.prototype.at() properly for indexes > 0 if you know the index of a non-BMP character or lead surrogate by some other means, or if you will test the return value for a trailing surrogate, is it really an advantage over using codePointAt / fromCodePoint?
The name "at" is so tempting I'm imagining naive scripts of the form
for (i = 0; i < s.length; ++i) { r += s.at(i); }
which will work fine until they get a non-BMP input at which point they're suddenly duplicating the trailing surrogates.
Pushing people towards for-of iteration and even Allen's Array.from('𝌆𝌆𝌆')[1] seems safer; users who need more subtle things have codePointAt / fromCodePoint available and hopefully the knowledge to use them.
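The failure mode sketched in this message can be reproduced with a few lines. The helper at(str, pos) is again a hypothetical stand-in for the proposed method, and the input string is invented for the example.

```javascript
// Hypothetical stand-in for the proposed String.prototype.at.
function at(str, pos) {
  const cp = str.codePointAt(pos);
  return cp === undefined ? '' : String.fromCodePoint(cp);
}

const s = 'x\u{1D306}y'; // "x𝌆y", 4 code units

// Naive loop: one call per code unit index.
let naive = '';
for (let i = 0; i < s.length; ++i) {
  naive += at(s, i);
}

// for-of visits each code point once.
let iterated = '';
for (const symbol of s) {
  iterated += symbol;
}

console.log(naive.length);   // 5: the trail surrogate was appended twice
console.log(iterated === s); // true
```

For BMP-only input both loops give the same result, which is why the bug can go unnoticed.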
On Oct 18, 2013, at 1:29 PM, Allen Wirfs-Brock wrote:
Array.from('𝌆𝌆𝌆')[1]
maybe even better:
Uint32Array.from('𝌆𝌆𝌆')[1]
err...maybe not if you want a string value:
String.fromCodePoint(Uint32Array.from('𝌆𝌆𝌆')[1])
On Oct 18, 2013, at 4:01 PM, Allen Wirfs-Brock wrote:
String.fromCodePoint(Uint32Array.from('𝌆𝌆𝌆')[1])
That does not seem to be too useful:
js> String.fromCodePoint(Uint32Array.from("\u{1d306}\u{1d306}\u{1d306}")[1])
"\u0000"
According to
norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#String,
String.prototype[@@iterator]
does not return plain code points, but the
String value for the code point.
On 18 Oct 2013, at 17:51, Joshua Bell <jsbell at google.com> wrote:
Given that you can only use the proposed String.prototype.at() properly for indexes > 0 if you know the index of a non-BMP character or lead surrogate by some other means, or if you will test the return value for a trailing surrogate, is it really an advantage over using codePointAt / fromCodePoint?
The name "at" is so tempting I'm imagining naive scripts of the form
for (i = 0; i < s.length; ++i) { r += s.at(i); }
which will work fine until they get a non-BMP input at which point they're suddenly duplicating the trailing surrogates.
Pushing people towards for-of iteration and even Allen's Array.from('𝌆𝌆𝌆')[1] seems safer; users who need more subtle things have codePointAt / fromCodePoint available and hopefully the knowledge to use them.
Just because new features can be used incorrectly doesn’t mean the feature isn’t useful. for…of
on strings and String.prototype.at
are two very different things for two very different use cases. It’s a matter of using the right tool for the job, IMHO.
In your example (iterating over all code points in a string), for…of
should be used.
String.prototype.codePointAt
or String.prototype.at
come in handy in case you only need to get the first code point or symbol in a string, for example.
On 19 Oct 2013, at 01:12, "Mathias Bynens" <mathias at qiwi.be> wrote:
String.prototype.codePointAt or String.prototype.at come in handy in case you only need to get the first code point or symbol in a string, for example.
Are they useful for anything else, though? For example, if I wanted to get the second symbol in a string, how would I do that?
so it's a for/of with a break when it finds a code point? if that's the only use case I'd like to have an example of how convenient it is. I am just wondering, not saying is not useful (trying to understand when/where/why I'd like to use .at())
On Oct 18, 2013, at 10:53 PM, Domenic Denicola wrote:
On 19 Oct 2013, at 01:12, "Mathias Bynens" <mathias at qiwi.be> wrote:
String.prototype.codePointAt or String.prototype.at come in handy in case you only need to get the first code point or symbol in a string, for example.
Are they useful for anything else, though? For example, if I wanted to get the second symbol in a string, how would I do that?
We discussed the utility of 'codePointAt' in the context of Norbert's full Unicode support proposal. At that time we concluded that it was something we needed. I don't see any new evidence that suggests that we need to reopen that decision at this point in the process.
The utility of a hypothetical 'at' method is presumably exactly that of 'codePointAt'.
str.at(p)
would just be a convenience for expressing
String.fromCodePoint(str.codePointAt(p))
So the real question is probably, how common is that use case.
It's relatively easy to do a for loop over the characters of a string using 'at'. Something like:
let c = '';
for (let p = 0; p < str.length; p += c.length) {
  c = str.at(p);
  ...
}
although, a for-of would be better in most cases:
for (let c of str)
The use case that we don't support well is any sort of backwards iteration of the characters of a string. We don't currently have an iterator specified to do it, nor do we have a one stop way to test whether we are looking at the trailing surrogate of a surrogate pair.
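As an illustration of what user code has to do today for the backwards case, here is a rough sketch. The generator name symbolsReversed is made up for this example and is not part of any proposal; it checks by hand whether the current code unit is the trail half of a well-formed pair.

```javascript
// Hypothetical sketch: yield the symbols of a string from last to first.
function* symbolsReversed(str) {
  let i = str.length - 1;
  while (i >= 0) {
    const cu = str.charCodeAt(i);
    const isTrail = cu >= 0xDC00 && cu <= 0xDFFF;
    if (isTrail && i > 0) {
      const prev = str.charCodeAt(i - 1);
      if (prev >= 0xD800 && prev <= 0xDBFF) {
        // Well-formed pair: emit both code units together.
        yield str.slice(i - 1, i + 1);
        i -= 2;
        continue;
      }
    }
    // BMP character or lone surrogate: emit the single code unit.
    yield str.charAt(i);
    i -= 1;
  }
}

console.log(Array.from(symbolsReversed('a\u{1D306}b')).join('') === 'b\u{1D306}a'); // true
```

Each step looks at no more than two code units, so the per-symbol cost is constant; the awkward part is that the surrogate range checks have to be written out manually.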
On Oct 18, 2013, at 4:22 PM, André Bargull wrote:
That does not seem to be too useful:
js> String.fromCodePoint(Uint32Array.from("\u{1d306}\u{1d306}\u{1d306}")[1]) "\u0000"
right, it would need to be
String.fromCodePoint(Uint32Array.from( '𝌆𝌆𝌆', s=>s.codePointAt(0))[1])
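The difference between the two variants can be checked directly. This sketch assumes an engine with typed array from() support; the variable names are illustrative.

```javascript
const s = '\u{1D306}\u{1D306}\u{1D306}';

// Without a map function each iterated symbol (a string) is converted to a
// number, which gives NaN and is stored as 0.
const withoutMap = Uint32Array.from(s);

// With a map function each symbol is turned into its code point first.
const withMap = Uint32Array.from(s, (symbol) => symbol.codePointAt(0));

console.log(withoutMap[1]);              // 0
console.log(withMap[1].toString(16));    // 1d306
console.log(String.fromCodePoint(withMap[1]) === '\u{1D306}'); // true
```

This matches the "\u0000" result reported earlier in the thread for the version without the mapping step.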
According to norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#String,
String.prototype[@@iterator]
does not return plain code points, but the String value for the code point.
yes, that's correct and how I have it spec'ed in rev20
Allen Wirfs-Brock wrote:
The utility of a hypothetical 'at' method is presumably exactly that of 'codePointAt'.
str.at(p)
would just be a convenience for expressing
String.fromCodePoint(str.codePointAt(p))
So the real question is probably, how common is that use case.
Certainly not common enough to warrant a two-character method on the
native string type. Odds are people will use it incorrectly in an
attempt to make their code look concise, not understanding that it'll
retrieve a substring of .length 1 or 2, possibly consisting of a lone
surrogate, based on a 16 bit index that might fall in the middle of a
character; the problematic cases are fairly rare, so it's hard to
notice improper use of .at
in automated testing or in code review.
On 19 Oct 2013, at 12:15, Bjoern Hoehrmann <derhoermi at gmx.net> wrote:
Certainly not common enough to warrant a two-character method on the native string type. Odds are people will use it incorrectly in an attempt to make their code look concise […]
Are you saying that changing the name to something that is longer than at
would solve this problem?
[…] not understanding that it'll retrieve a substring of .length 1 or 2, possibly consisting of a lone surrogate, based on a 16 bit index that might fall in the middle of a character; the problematic cases are fairly rare, so it's hard to notice improper use of
.at
in automated testing or in code review.
People are using String.prototype.charAt()
incorrectly too, expecting it to return whole symbols instead of surrogate halves wherever possible. How would not introducing a method that avoids this problem help?
On 19 Oct 2013, at 00:53, Domenic Denicola <domenic at domenicdenicola.com> wrote:
On 19 Oct 2013, at 01:12, "Mathias Bynens" <mathias at qiwi.be> wrote:
String.prototype.codePointAt or String.prototype.at come in handy in case you only need to get the first code point or symbol in a string, for example.
Are they useful for anything else, though? For example, if I wanted to get the second symbol in a string, how would I do that?
Yeah, that’s the problem with these methods. Additional user code is required to handle non-zero position
arguments, unless you’re sure the position
is actually the start of a code point (and not in the middle of a surrogate pair). I guess there are situations where that’s a certainty, for example when you’re dealing with a string in which the user selected some text.
This brings us back to the earlier discussion of whether something like String.prototype.codePoints
should be added: esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string It could be a getter or a generator… Or does for…of
iteration handle this use case adequately?
From: Mathias Bynens [mailto:mathias at qiwi.be]
This brings us back to the earlier discussion of whether something like
String.prototype.codePoints
should be added: esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string It could be a getter or a generator… Or does for…of
iteration handle this use case adequately?
It sounds like you are proposing a second name for String.prototype[Symbol.iterator], which does not sound very useful.
A property for the string's "real length" does seem somewhat useful, as does a method that does random-access on "real characters." Certainly more useful than the proposed symbolAt/at. But I suppose we can pave whatever cowpaths arise.
My proposed cowpaths:
Object.mixin(String.prototype, {
  realCharacterAt(i) {
    let index = 0;
    for (var c of this) {
      if (index++ === i) {
        return c;
      }
    }
  },
  get realLength() {
    let counter = 0;
    for (var c of this) {
      ++counter;
    }
    return counter;
  }
});
This would allow you to e.g. find the character in the "real" middle of a string with code like
var middleIndex = Math.floor(theString.realLength / 2);
var middleRealCharacter = theString.realCharacterAt(middleIndex);
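Object.mixin was a draft API at the time and is not available in shipping engines, so the following is a rough equivalent written as standalone functions. The function names mirror the ones above but the free-function form is an assumption made for this sketch.

```javascript
// Count code points by walking the string iterator (O(n) per call).
function realLength(str) {
  let counter = 0;
  for (const symbol of str) {
    ++counter;
  }
  return counter;
}

// Return the i-th code point as a string, or undefined if out of range.
function realCharacterAt(str, i) {
  let index = 0;
  for (const symbol of str) {
    if (index++ === i) {
      return symbol;
    }
  }
  return undefined;
}

const theString = 'ab\u{1D306}cd'; // 5 symbols, 6 code units
const middleIndex = Math.floor(realLength(theString) / 2);
console.log(middleIndex);                                           // 2
console.log(realCharacterAt(theString, middleIndex) === '\u{1D306}'); // true
```

Both helpers rescan from the start of the string on every call, which is the cost that the replies below are concerned about.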
AFAIK that's also what Allen said he didn't want to implement in core: an expensive operation on each invocation, due to a stateless loop over arbitrary indexes.
Although, strings are immutable in JS so I'd implement that logic creating a snapshot once and use that as if it was an Array ... something like the following:
!function(dict){
  function getOrCreate(str) {
    if (!(str in dict)) {
      dict[str] = {
        i: 0,
        l: 0,
        v: (Array.from || function(){
          // miserable callback
          return str.split('')
        })(str)
        // or the for/of loop
      };
    }
    // times it's used
    dict[str].i++;
    return dict[str].v;
  }
  setInterval(function () {
    var key, value;
    for(key in dict) {
      value = dict[key];
      value.l = value.i - value.l;
      // used only once or never used again
      if (value.l < 2) {
        // free all the RAM
        delete dict[key];
      }
    }
  }, 5000); // 5 seconds should be enough ?
  // incremental works better with
  // slower timeout though
  // 500 might be good too
  Object.defineProperties(
    String.prototype,
    {
      at: {
        configurable: true,
        writable: true,
        value: function at(i) {
          return getOrCreate(this)[i];
        }
      },
      // or any meaningful name
      size: {
        configurable: true,
        get: function () {
          return getOrCreate(this).length;
        }
      }
    }
  );
}(Object.create(null));
// @example
var str = 'abc';
alert([
str.size, // 3
str.at(1) // b
]);
example more readable and with some typos fixed on GitHub: gist.github.com/WebReflection/7059536
license wtfpl v2 www.wtfpl.net/txt/copying
Mathias Bynens wrote:
Are you saying that changing the name to something that is longer than
at
would solve this problem?
If it was .getOneOrTwoCodepointLongSubstringAtUcs2CodeUnitIndex(...)
I am sure people would be reluctant using it because it's unreasonably
long compared to String.fromCodePoint(str.codePointAt(p))
and harder
to understand than the combination of those two primitives.
People are using
String.prototype.charAt()
incorrectly too, expecting it to return whole symbols instead of surrogate halves wherever possible. How would not introducing a method that avoids this problem help?
Right now people do not have much of a choice other than writing code
that does not do the right thing when faced with malformed strings or
non-BMP characters, it's unreasonable to call a method like substr
and then manually smooth it up around the edges and perhaps scan the
interior for lone surrogates to ensure that at least your code doesn't
do the wrong thing. That gives you "well-known bad" code, which is a
good thing to have, better than more complicated code that might have
unknown bugs. Allen's loop for (let p=0; p<str.length; p+=c.length)
for instance is just waiting for someone to improve or replace it with
code that increments by 1
instead of .length
because that's simpler.
The methods fromCodePoint
and codePointAt
can be used to get ugly
constants out of code that tries to do the right thing, and they will
offer some insight into how developers might go from UCS-only code to
something more proper, but for the moment duplicating all the UCS-based
methods strikes me as premature, especially when giving them seductive
names. How would a somewhat-surrogate-aware substring
method work and
what would it be called, for instance? If it is omitted, we would be
back to square one, someone in need of substring functionality has to
jump through overly complicated hoops to make it work "correctly" and
ends up mixing surrogate-pair-aware with -unaware code.
Allen Wirfs-Brock wrote:
The use case that we don't support well is any sort of backwards iteration of the characters of a string. We don't currently have an iterator specified to do it, nor do we have a one stop way to test whether we are looking at the trailing surrogate of a surrogate pair.
What do you mean by "one stop"? O(1)? We aren't going to mandate implementations make such tests (or backward iteration) that cheap.
Is there yet a real world (from the field, not a testcase) use-case for backward iteration?
a nested loop might be a concrete case where O(n)
happens ... not so
common with strings but quite possibly used in many parsers implemented in
JS itself.
On 19 Oct 2013, at 12:54, Domenic Denicola <domenic at domenicdenicola.com> wrote:
My proposed cowpaths:
Object.mixin(String.prototype, { realCharacterAt(i) { let index = 0; for (var c of this) { if (index++ === i) { return c; } } }, get realLength() { let counter = 0; for (var c of this) { ++counter; } return counter; } });
Good stuff!
To account for lookalike symbols due to combining marks, just add a call to String.prototype.normalize:
Object.mixin(String.prototype, {
  get realLength() {
    let counter = 0;
    for (var c of this.normalize('NFC')) {
      ++counter;
    }
    return counter;
  }
});
assert('ma\xF1ana'.realLength == 'man\u0303ana'.realLength);
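A runnable version of the same idea, written as a free function since Object.mixin is not available in shipping engines. The name normalizedLength is hypothetical.

```javascript
// Count code points after NFC normalization.
function normalizedLength(str) {
  let counter = 0;
  for (const symbol of str.normalize('NFC')) {
    ++counter;
  }
  return counter;
}

const precomposed = 'ma\xF1ana';   // ñ as a single code point
const decomposed = 'man\u0303ana'; // n followed by a combining tilde

console.log(precomposed.length, decomposed.length); // 6 7
console.log(normalizedLength(precomposed));         // 6
console.log(normalizedLength(decomposed));          // 6
```

Note that NFC only merges a base character and its marks when Unicode defines a precomposed form for that combination; other combining sequences still count as several code points with this approach.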
Allen mentioned that String#at
might not make it to ES6 because nobody in TC39 is championing it. I’ve now asked Rick if he would be the champion for this, and he agreed. (Thanks again!)
Looking over the ‘TC39 progress’ document at docs.google.com/a/chromium.org/document/d/1QbEE0BsO4lvl7NFTn5WXWeiEIBfaVUF7Dk0hpPpPDzU, it seems most of the work is already taken care of: the use case was discussed in this thread, the proposal has a complete spec text, and there’s an example implementation/polyfill with unit tests. See mths.be/at.
Is there anything else I can do to help get this included as a non-TC39-member?
This was the method that was only useful if you pass 0
to it?
Note that Array.from(str)
and str[Symbol.iterator]
overlap
significantly. In particular, it's somewhat awkward to iterate over
code points using String#symbolAt
; it's much easier to use
substr()
and then use the StringIterator.
ps. I see that Domenic has said something similar.
On 14 Feb 2014, at 11:11, Domenic Denicola <domenic at domenicdenicola.com> wrote:
This was the method that was only useful if you pass
0
to it?
I’ll just avoid the infinite loop here by pointing to earlier posts in this thread where this was discussed before: esdiscuss.org/topic/string-prototype-symbolat-improved-string-prototype-charat#content-34 and esdiscuss.org/topic/string-prototype-symbolat-improved-string-prototype-charat#content-40.
This method is just as useful as String.prototype.codePointAt. If that method is included, so should String.prototype.at. If String.prototype.at is found not to be useful, String.prototype.codePointAt should be removed too.
On 14 Feb 2014, at 11:14, C. Scott Ananian <ecmascript at cscott.net> wrote:
Note that Array.from(str) and str[Symbol.iterator] overlap significantly. In particular, it's somewhat awkward to iterate over code points using String#symbolAt; it's much easier to use substr() and then use the StringIterator.
String#at
is not meant for iterating over code points – that’s what the StringIterator
is for.
String#at
is exactly like String#codePointAt
except it returns strings (containing the symbol) instead of numbers (representing the code point value). It can be used to get the symbol at a given code unit position in a string (similar to how String#codePointAt
can be used to get the code point at a given code unit position in a string).
Yes, I know what String#at
is supposed to do.
I was pointing out that String#at
makes it easy to do the wrong
thing. If you do Array.from(str)
then you suddenly have a complete
random-access data structure where you can find out the number of code
points in the String, iterate it in reverse from the end to the start,
slice it, find the midpoint, etc. Array.from
looks like an O(n)
operation, and it is -- so it encourages developers to cache the value
and reuse it.
That said, I can see where a lexer might want to use String#at,
being careful to do the correct index bump based on result.length.
However, the fastest JS lexers don't create String objects, they
operate directly on the code point (see
marijnhaverbeke.nl/acorn/#section-58). So I'm -0, mostly
because the name isn't great. But I have exactly zero say in the
matter anyway. So I'll shut up now.
I think Mathias's point, that it is exactly as useful or useless as codePointAt, is a reasonable one. However,
This method is just as useful as String.prototype.codePointAt. If that method is included, so should String.prototype.at. If String.prototype.at is found not to be useful, String.prototype.codePointAt should be removed too.
This does not follow. The choice is not between adding two useless methods and adding zero. There is no reason to exclude the possibility of adding only one useless method.
But anyway, as some people seem to think that both methods are in fact useful---including Rick, who has agreed to champion---I agree with Scott that after having said our piece it's time to exit the thread.
On Fri, Feb 14, 2014 at 1:34 AM, Mathias Bynens <mathias at qiwi.be> wrote:
Allen mentioned that
String#at
might not make it to ES6 because nobody in TC39 is championing it. I've now asked Rick if he would be the champion for this, and he agreed. (Thanks again!)
Published to wiki here: strawman:string_at
On Feb 14, 2014, at 1:34 AM, Mathias Bynens wrote:
Allen mentioned that
String#at
might not make it to ES6 because nobody in TC39 is championing it. I’ve now asked Rick if he would be the champion for this, and he agreed. (Thanks again!)
Looking over the ‘TC39 progress’ document at docs.google.com/a/chromium.org/document/d/1QbEE0BsO4lvl7NFTn5WXWeiEIBfaVUF7Dk0hpPpPDzU, it seems most of the work is already taken care of: the use case was discussed in this thread, the proposal has a complete spec text, and there’s an example implementation/polyfill with unit tests. See mths.be/at.
Is there anything else I can do to help get this included as a non-TC39-member?
But just to be even clear, the new feature gate for ES6 is officially closed.
It's a really high bar to get over that closed gate. Unless the exclusion of a feature was a mistake, or the feature fixes a bug or is somehow essential to supporting something that is already in ES6, I don't think we should be talking about adding it to ES6.
I don't think String.prototype.at fits any of those criteria. We've talked about it several times, including in the context of Norbert's original ES6 full unicode support proposal, and never achieved consensus on including it. Personally, I think it should be there but it's time to start talking about it for ES7 not ES6.
Yes, I absolutely agree, apologies as I realize that was not addressed in my previous message.
I'm excited to start working on es7-shim once we get to that point! (String.prototype.at has a particularly simple shim, thankfully...)
On Fri, Feb 14, 2014 at 12:23 PM, C. Scott Ananian <ecmascript at cscott.net> wrote:
I'm excited to start working on es7-shim once we get to that point! (String.prototype.at has a particularly simple shim, thankfully...)
Have you seen: mathiasbynens/String.prototype.at ?
yes, of course. es6-shim is a large-ish collection of such.
However, it would be much better to use an implementation of String#at which used substr and thus avoided creating and appending a new string object.
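A minimal sketch of the substring-based approach suggested here, written as a standalone function so it does not touch String.prototype. The name `symbolAtSlice` is hypothetical; this is neither the proposal's spec text nor the mths.be/at polyfill, and it uses `slice` where the message mentions substr.

```javascript
// Hypothetical helper, for illustration only: returns the symbol that starts
// at code unit index `pos`, taken as a substring of the input rather than
// rebuilt by concatenating individual code units.
function symbolAtSlice(value, pos) {
  const str = String(value);
  // Approximates ToInteger: NaN becomes 0, fractions are truncated.
  const position = Math.trunc(Number(pos)) || 0;
  if (position < 0 || position >= str.length) {
    return '';
  }
  const first = str.charCodeAt(position);
  // A lead surrogate followed by a trail surrogate forms one astral symbol.
  if (first >= 0xD800 && first <= 0xDBFF && position + 1 < str.length) {
    const second = str.charCodeAt(position + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return str.slice(position, position + 2);
    }
  }
  // BMP symbol or lone surrogate: a single code unit.
  return str.slice(position, position + 1);
}

console.log(symbolAtSlice('a\uD834\uDF06b', 1) === '\uD834\uDF06'); // true
console.log(symbolAtSlice('abc', 5) === ''); // true
```

Whether a substring actually avoids a copy depends on the engine, so any saving is an implementation detail and not something the language guarantees.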
Aside: "ECMASpeak" is neither accurate (we don't work for Ecma, it's JS not ES :-P), nor euphonious. But here's a pointer:
C. Scott Ananian wrote:
new string object.
"new string primitive", because "string object" (especially with "new" in front) suggests new String('hi').
On 14 Feb 2014, at 19:59, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
It's a really high bar to get over that closed gate. Unless the exclusion of a feature was a mistake […] I don't think we should be talking about adding it to ES6.
It does feel like a mistake to me to introduce String.prototype.codePointAt, but no similar function that returns the symbol instead.
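To illustrate the gap being described, here is a short sketch using only String.prototype.codePointAt and String.fromCodePoint. The sample string is an assumption chosen for the example.

```javascript
// 'a', then U+1D306 (stored as a surrogate pair), then 'b'.
const text = 'a\uD834\uDF06b';

// codePointAt reads the whole astral code point, but returns a number.
const codePoint = text.codePointAt(1);

// With no symbol-returning counterpart, the number has to be converted back.
const symbol = String.fromCodePoint(codePoint);

console.log(codePoint.toString(16));       // "1d306"
console.log(symbol === '\uD834\uDF06');    // true
console.log(text.charAt(1) === '\uD834');  // true: charAt yields only the lead surrogate
```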
On Feb 15, 2014 9:13 AM, "Brendan Eich" <brendan at mozilla.com> wrote:
Aside: "ECMASpeak" is neither accurate (we don't work for Ecma, it's JS not ES :-P), nor euphonious.
I'm learning all sorts of things! I guess there are two names here; what's your preferred phrase for "the language used to write algorithms in the ES6 spec" (JS6?), and, if it differs, "the language used by members of the TC39 committee among themselves when describing language primitives in a very precise way"?
"new string primitive", because "string object" (especially with "new" in front) suggests new String('hi').
I wrestled with the phrasing there. I think what I really mean is "avoid allocating new backing storage", since there are "new string primitives" returned regardless. If there's a better phrase for "string backing storage" I'd be glad to add that to my dictionary.
C. Scott Ananian wrote:
I'm learning all sorts of things! I guess there are two names here; what's your preferred phrase for "the language used to write algorithms in the ES6 spec" (JS6?), and, if it differs, "the language used by members of the TC39 committee among themselves when describing language primitives in a very precise way"?
When I'm in a bad mood, I call it VisualCobol. It's painfully low-level and verbose, yet hard to verify. Let's hope that the JSCert work will help, and Allen has been common'ing subroutines. Whatever we call it, the spec language ain't great.
Using "-Speak" as a stem conjures Orwell. Not good.
The definition of array-like -- an informal bit of jargon, useful (e.g., "array-like vs. iterable" in context in larger discussions about Array.from) until it's time to get precise -- is a spec matter. I agree we need a common definition that we use consistently.
I wrestled with the phrasing there. I think what I really mean is "avoid allocating new backing storage", since there are "new string primitives" returned regardless. If there's a better phrase for "string backing storage" I'd be glad to add that to my dictionary.
What does "backing storage" mean? There are no new String objects in any event. There may be ropes or dependent strings under the hood, but that's all unobservable (apart from performance) implementation-land.
On Feb 15, 2014, at 11:47 AM, Brendan Eich wrote:
When I'm in a bad mood, I call it VisualCobol. It's painfully low-level and verbose, yet hard to verify. Let's hope that the JSCert work will help, and Allen has been common'ing subroutines. Whatever we call it, the spec language ain't great.
But remember, prior to ES5, it was closer to Cobolish machine language. No structured control, goto's targeting numeric step numbers, intermediate results referenced by step number (sorta SSA with numeric ids), etc.
There has never been a complete redo, just incremental improvements and refactorings. But we've definitely advanced from the early 1950s to the late 1970s.
Well, Algol-60 already was more structured a language than our spec-speak. Let alone how far the Algol-68 spec was ahead of us. :)
On 15 February 2014 20:47, Brendan Eich <brendan at mozilla.com> wrote:
Using "-Speak" as a stem conjures Orwell. Not good.
Ah, relax. Gilad Bracha even named his own language Newspeak. Self-mockery is good.
On Feb 17, 2014, at 4:38 AM, Andreas Rossberg wrote:
Well, Algol-60 already was more structured a language than our spec-speak. Let alone how far the Algol-68 spec was ahead of us. :)
We were discussing the nature of the ES spec. pseudo code, not comparing pseudo code to a complete programming language.
Structured programming styles weren't widely adopted until the mid to late 1970's.
The Algol 60 Report used English prose to describe its semantics. The ES specs are closer in style to the Algol 68 Report, although less formal and arguably more approachable.
Andreas Rossberg wrote:
Ah, relax. Gilad Bracha even named his own language Newspeak.
Yeah, but no "ECMA" -- the double-whammy.
Self-mockery is good.
I pay my dues (see "wat" played with commentary at Fluent 2012 and narrated with tech details at Strange Loop 2012).
Are recordings available?
C. Scott Ananian wrote:
Are recordings available?
www.infoq.com/presentations/State-JavaScript starting at 1:50
Youtube has more.
ES6 fixes String.fromCharCode by introducing String.fromCodePoint.
Similarly, String.prototype.charCodeAt is fixed by String.prototype.codePointAt.
Should there be a method that is like String.prototype.charAt except it deals with astral Unicode symbols wherever possible?
Has this been discussed before? If there’s any interest I’d be happy to create a strawman.
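As a rough illustration of the two ES6 fixes named in this message, the sketch below compares each older method with its code-point-aware replacement. The code point U+1D306 is an arbitrary astral example, not one taken from the thread.

```javascript
// U+1D306 lies outside the Basic Multilingual Plane, so it needs two
// UTF-16 code units (a surrogate pair).
const astral = 0x1D306;

// fromCharCode keeps only the low 16 bits: 0x1D306 becomes 0xD306.
const viaCharCode = String.fromCharCode(astral);
// fromCodePoint produces the full surrogate pair.
const viaCodePoint = String.fromCodePoint(astral);

console.log(viaCharCode.length);                             // 1
console.log(viaCharCode.charCodeAt(0).toString(16));         // "d306"
console.log(viaCodePoint.length);                            // 2

// charCodeAt sees only the lead surrogate; codePointAt sees the code point.
console.log(viaCodePoint.charCodeAt(0).toString(16));        // "d834"
console.log(viaCodePoint.codePointAt(0).toString(16));       // "1d306"
```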