Wiki updates for String, Number and Math libraries

# Luke Hoban (12 years ago)

The Harmony proposals page contains several additions to the core String, Number and Math libraries, adding some of the most commonly created and requested helper operations to these libraries for ES6. I've uploaded to the wiki a first draft of candidate spec text for these library additions:

There are a few remaining open issues, which I've listed in the corresponding proposal pages:

Feedback welcome.

Luke

# Norbert Lindenberg (12 years ago)

A few comments on the proposed String functions:

  1. String.prototype.reverse(), as proposed, corrupts supplementary characters. Clause 6 of Ecma-262 redefines the word "character" as "a 16-bit unsigned value used to represent a single 16-bit unit of text", that is, a UTF-16 code unit. In contrast, the phrase "Unicode character" is used for Unicode code points. For reverse(), this means that the proposed spec will reverse the sequence of the two UTF-16 code units representing a supplementary character, resulting in corruption. If this function is really needed (is it? for what?), it should preserve the order of surrogate pairs, as does java.lang.StringBuilder.reverse: download.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html#reverse()

  2. String.prototype.toArray(), as proposed, breaks up the surrogate pairs representing supplementary characters and returns each UTF-16 code unit separately. This follows an unfortunate precedent set by String.prototype.split() and other functions. The function should be named to clearly indicate that it returns an array of UTF-16 code units. This also allows us to offer a parallel function that returns an array of code points.

  3. String.prototype.toArray() was originally proposed as an equivalent of this.split(''), that is, returning an array of String values. The proposed specification fills the array with elements defined as "the character at position n in S", which according to clause 6 would mean 16-bit unsigned values. It seems there needs to be a conversion to the intended type of the elements.

Norbert
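The corruption described in item 1 can be reproduced with a minimal sketch. Both function names below are hypothetical and are not part of the proposal; `Array.from`, which iterates a string by code point, was standardized in ES2015, after this thread, and is used here only for illustration.

```javascript
// Naive reversal: swaps individual UTF-16 code units, as the draft text would.
function naiveReverse(s) {
  return s.split('').reverse().join('');
}

// Hypothetical alternative: reverse by code point so surrogate pairs stay intact.
// Array.from iterates a string by code point (ES2015).
function reverseByCodePoint(s) {
  return Array.from(s).reverse().join('');
}

// 'a', U+1D11E (MUSICAL SYMBOL G CLEF, encoded as the pair D834 DD1E), 'b'
const sample = 'a\uD834\uDD1Eb';

// The pair comes out as DD1E D834: two lone surrogates, no longer U+1D11E.
const broken = naiveReverse(sample);

// The pair comes out as D834 DD1E: still U+1D11E.
const intact = reverseByCodePoint(sample);
```

The second function mirrors the behaviour attributed above to java.lang.StringBuilder.reverse only as far as surrogate pairs are concerned; it makes no attempt to handle anything beyond that.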

# gaz Heyes (12 years ago)

On 16 November 2011 01:37, Luke Hoban <lukeh at microsoft.com> wrote:

Those pretty much don't add anything useful to String IMO :( I think a String function everyone has been crying out for is HTML entity encode/decode.

# Luke Hoban (12 years ago)
  1. String.prototype.reverse(), as proposed, corrupts supplementary characters.

It was agreed at the meeting yesterday that this concern is significant enough, and that reverse lacks sufficiently compelling use cases, so it should not be included.

  2. String.prototype.toArray(), as proposed, breaks up the surrogate pairs representing supplementary characters and returns each UTF-16 code unit separately.

This behaviour is consistent with the rest of the String.prototype functions, and we shouldn't diverge on a case-by-case basis. We may separately want to consider a set of String APIs that do recognize Unicode characters instead of code units, but that would be a separate Strawman to pursue.

  3. String.prototype.toArray() ... fills the array with elements defined as "the character at position n in S", which according to clause 6 would mean 16-bit unsigned values. It seems there needs to be a conversion to the intended type of the elements.

The resulting array will contain those same "characters", which will each be length 1 strings with a single 16-bit unsigned value.

Luke
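The element type described in the last answer can be sketched as follows. The `toArray` function here is a hypothetical stand-in written for illustration, assuming the proposal behaves like `this.split('')`; it is not the draft spec text.

```javascript
// Hypothetical stand-in for the proposed String.prototype.toArray().
function toArray(s) {
  const result = [];
  for (let n = 0; n < s.length; n++) {
    // charAt returns a length-1 String holding one UTF-16 code unit, not a number.
    result.push(s.charAt(n));
  }
  return result;
}

// 'x' followed by U+1D11E, which occupies two code units (D834 DD1E).
const elements = toArray('x\uD834\uDD1E');

// Every element is a String of length 1, even the two halves of the pair.
const allLengthOneStrings = elements.every(
  (e) => typeof e === 'string' && e.length === 1
);

// Under the stated assumption, the result matches split('').
const sameAsSplit = elements.join('|') === 'x\uD834\uDD1E'.split('').join('|');
```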

# Gillam, Richard (12 years ago)
  1. String.prototype.reverse(), as proposed, corrupts supplementary characters. Clause 6 of Ecma-262 redefines the word "character" as "a 16-bit unsigned value used to represent a single 16-bit unit of text", that is, a UTF-16 code unit. In contrast, the phrase "Unicode character" is used for Unicode code points. For reverse(), this means that the proposed spec will reverse the sequence of the two UTF-16 code units representing a supplementary character, resulting in corruption. If this function is really needed (is it? for what?), it should preserve the order of surrogate pairs, as does java.lang.StringBuilder.reverse: download.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html#reverse()

It's actually worse than this: it'll also reverse the order of combining character sequences, causing any combining characters to attach to a different base character than they did in the original string: a-accent-e, when "accent" is a combining accent, means the accent is on the a; reversing the string would put the accent on the e.

--Rich Gillam
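The a-accent-e case can be checked with a short sketch. It assumes U+0301 (COMBINING ACUTE ACCENT) as the accent and uses `Array.from` and `String.prototype.normalize`, both standardized in ES2015 after this thread, purely to make the effect visible. Note that the reversal here is already code point aware, so preserving surrogate pairs alone does not avoid the problem.

```javascript
// 'a' + COMBINING ACUTE ACCENT + 'e': the accent belongs to the 'a'.
const original = 'a\u0301e';

// Reverse by code point; the combining mark now follows the 'e'.
const reversed = Array.from(original).reverse().join('');

// NFC composes each base character with the mark that follows it,
// which shows which base the accent is attached to.
const before = original.normalize('NFC'); // U+00E1 (a with acute), then 'e'
const after = reversed.normalize('NFC');  // U+00E9 (e with acute), then 'a'
```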

# Norbert Lindenberg (12 years ago)

Fortunately we dropped reverse in the TC39 meeting yesterday; nobody had any idea who would use it.

I brought up combining character sequences as a concern for the other proposed functions (startsWith etc.). There the majority opinion was that the model of the existing String functions, which ignore the semantics of Unicode characters, should be followed. It was suggested that eventually there should be a parallel set of Unicode-aware functions; Mark Miller suggested "WString".

Norbert
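The concern about startsWith can be illustrated with a small sketch. It assumes `startsWith` behaves as it was later standardized in ES2015 (a code unit comparison), which may differ in detail from the draft discussed here; `normalize` is likewise an ES2015 addition and not part of this proposal.

```javascript
// The same word spelled two ways.
const decomposed = 'e\u0301cole';  // 'e' + COMBINING ACUTE ACCENT + "cole"
const precomposed = '\u00E9cole';  // U+00E9 (e with acute) + "cole"

// Code unit comparison: the decomposed form "starts with" a plain 'e',
// although a reader sees an accented letter in first position.
const decomposedStartsWithE = decomposed.startsWith('e');

// The precomposed form does not, because its first code unit is U+00E9.
const precomposedStartsWithE = precomposed.startsWith('e');

// The two spellings are canonically equivalent and agree after normalization.
const equivalent = decomposed.normalize('NFC') === precomposed;
```

Normalizing both operands before comparing removes this particular discrepancy, but it is only one aspect of what a Unicode-aware function would have to consider.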

# Norbert Lindenberg (12 years ago)

For String.prototype.toArray, I didn't propose different behavior. The part of my message that you omitted continues "The function should be named to clearly indicate that it returns an array of UTF-16 code units. This also allows us to offer a parallel function that returns an array of code points."

How about toCharArray, seeing that we already have String.prototype.charAt, which returns a string with a UTF-16 code unit?

As to the element type, "length 1 string with a single 16-bit unsigned value" would be clearer than (and, at least to me, different from) "character".

Thanks, Norbert
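The naming idea, a code unit function with a parallel code point function, might look like the sketch below. `toCharArray` follows the name suggested above; `toCodePointArray` is a hypothetical name chosen for this example, and `codePointAt` and `String.fromCodePoint` are ES2015 additions that did not exist when this was written.

```javascript
// Returns length-1 strings, one per UTF-16 code unit (same granularity as charAt).
function toCharArray(s) {
  return s.split('');
}

// Hypothetical parallel function: returns one string per Unicode code point.
function toCodePointArray(s) {
  const result = [];
  for (let i = 0; i < s.length; ) {
    const cp = s.codePointAt(i);
    result.push(String.fromCodePoint(cp));
    // Supplementary code points occupy two code units, all others one.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}

// 'a' followed by U+1D11E (encoded as the surrogate pair D834 DD1E).
const text = 'a\uD834\uDD1E';
const codeUnits = toCharArray(text);       // three elements
const codePoints = toCodePointArray(text); // two elements
```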