Flexible String Representation - full Unicode for ES6?

# Chris Angelico (13 years ago)

I was directed here from the V8 discussion list, hope this is the right place to raise this.

I've read norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html and some of the related discussion (of which there is a considerable amount!). The problem with UTF-16 encodings has been biting me in a project where we allow untrusted users to configure our application by providing a script from which we call functions. The script is manipulating text, so it makes good sense to support full Unicode; and compatibility with older ECMAScript engines/interpreters is not a significant point. I'm fully aware that this is a major barrier to change in most situations, though; I am inclined toward some form of BRS as proposed by Brendan Eich.

Some worthwhile reading: unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings

If the language provides a string type that's UTF-16 and then has a few functions that count code points (as described in the norbertlindenberg page), the temptation will be strong for programmers to ignore non-BMP characters, and then to quietly still be buggy in the face of surrogates. To truly support full Unicode, the language has to expose to its programmers only Unicode, not some encoding used to represent Unicode characters in memory. The easiest way to do this is to store strings as UTF-32, allowing O(1) indexing etc, but that's really wasteful.

There is an alternative. Python (as of version 3.3) has implemented a new Flexible String Representation, aka PEP-393; the same has existed in Pike for some time. A string is stored in memory with a fixed number of bytes per character, based on the highest codepoint in that string - if there are any non-BMP characters, 4 bytes; if any U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings being immutable (otherwise there'd be an annoying string-copy operation when a too-large character gets put in), which is true of ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but with the leading 0 bytes elided when they're not needed.

Most scripts are going to have a large number of pure-ASCII strings in them - variable names, identifiers, HTML tags, etc. These would benefit from a switch to Pike-strings. And any strings that don't actually have astral characters in them would suffer no penalty. Only strings that are actually affected need pay the price. And we could then trust that no surrogates ever get separated during transmission.

Chris Angelico

Hi! I was directed here from the V8 discussion list, hope this is the
right place to raise this.

I've read http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html
and some of the related discussion (of which there is a considerable
amount!). The problem with UTF-16 encodings has been biting me in a
project where we allow untrusted users to configure our application by
providing a script from which we call functions. The script is
manipulating text, so it makes good sense to support full Unicode; and
compatibility with older ECMAScript engines/interpreters is not a
significant point. I'm fully aware that this is a major barrier to
change in most situations, though; I am inclined toward some form of
BRS as proposed by Brendan Eich.

Some worthwhile reading:
http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

If the language provides a string type that's UTF-16 and then has a
few functions that count code points (as described in the
norbertlindenberg page), the temptation will be strong for programmers
to ignore non-BMP characters, and then to quietly still be buggy in
the face of surrogates. To truly support full Unicode, the language
has to expose to its programmers *only* Unicode, not some encoding
used to represent Unicode characters in memory. The easiest way to do
this is to store strings as UTF-32, allowing O(1) indexing etc, but
that's really wasteful.

There is an alternative. Python (as of version 3.3) has implemented a
new Flexible String Representation, aka PEP-393; the same has existed
in Pike for some time. A string is stored in memory with a fixed
number of bytes per character, based on the highest codepoint in that
string - if there are any non-BMP characters, 4 bytes; if any
U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings
being immutable (otherwise there'd be an annoying string-copy
operation when a too-large character gets put in), which is true of
ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but
with the leading 0 bytes elided when they're not needed.

Most scripts are going to have a large number of pure-ASCII strings in
them - variable names, identifiers, HTML tags, etc. These would
benefit from a switch to Pike-strings. And any strings that don't
actually have astral characters in them would suffer no penalty. Only
strings that are actually affected need pay the price. And we could
then trust that no surrogates ever get separated during transmission.

Chris Angelico

# Erik Arvidsson (13 years ago)

On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com> wrote:

There is an alternative. Python (as of version 3.3) has implemented a new Flexible String Representation, aka PEP-393; the same has existed in Pike for some time. A string is stored in memory with a fixed number of bytes per character, based on the highest codepoint in that string - if there are any non-BMP characters, 4 bytes; if any U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings being immutable (otherwise there'd be an annoying string-copy operation when a too-large character gets put in), which is true of ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but with the leading 0 bytes elided when they're not needed.

This is how most VMs already work.

I agree with you that it would be a better world if this was the case but I don't hear you suggesting how we might be able to change this without breaking the web?

-- erik

On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com> wrote:

> There is an alternative. Python (as of version 3.3) has implemented a
> new Flexible String Representation, aka PEP-393; the same has existed
> in Pike for some time. A string is stored in memory with a fixed
> number of bytes per character, based on the highest codepoint in that
> string - if there are any non-BMP characters, 4 bytes; if any
> U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings
> being immutable (otherwise there'd be an annoying string-copy
> operation when a too-large character gets put in), which is true of
> ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but
> with the leading 0 bytes elided when they're not needed.

This is how most VMs already work.

I agree with you that it would be a better world if this was the case
but I don't hear you suggesting how we might be able to change this
without breaking the web?

--
erik

# Chris Angelico (13 years ago)

On Sat, Dec 22, 2012 at 4:09 PM, Erik Arvidsson <erik.arvidsson at gmail.com> wrote:

On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com> wrote:

There is an alternative. Python (as of version 3.3) has implemented a new Flexible String Representation, aka PEP-393; the same has existed in Pike for some time. A string is stored in memory with a fixed number of bytes per character, based on the highest codepoint in that string - if there are any non-BMP characters, 4 bytes; if any U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings being immutable (otherwise there'd be an annoying string-copy operation when a too-large character gets put in), which is true of ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but with the leading 0 bytes elided when they're not needed.

This is how most VMs already work.

I agree with you that it would be a better world if this was the case but I don't hear you suggesting how we might be able to change this without breaking the web?

Why, if that's how it's already being done, can't there be an easy way to expose it to the script that way? Just flip the Big Red Switch and suddenly be fully Unicode-safe? Yes, it's backward-incompatible, but if the script can have some kind of marker (like "use strict") to show that it's compliant, or if the engine can simply be told "be compliant", we could begin to move forward. Otherwise, we're stuck where we are.

Chris Angelico

On Sat, Dec 22, 2012 at 4:09 PM, Erik Arvidsson
<erik.arvidsson at gmail.com> wrote:
> On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com> wrote:
>
>> There is an alternative. Python (as of version 3.3) has implemented a
>> new Flexible String Representation, aka PEP-393; the same has existed
>> in Pike for some time. A string is stored in memory with a fixed
>> number of bytes per character, based on the highest codepoint in that
>> string - if there are any non-BMP characters, 4 bytes; if any
>> U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings
>> being immutable (otherwise there'd be an annoying string-copy
>> operation when a too-large character gets put in), which is true of
>> ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but
>> with the leading 0 bytes elided when they're not needed.
>
> This is how most VMs already work.
>
> I agree with you that it would be a better world if this was the case
> but I don't hear you suggesting how we might be able to change this
> without breaking the web?

Why, if that's how it's already being done, can't there be an easy way
to expose it to the script that way? Just flip the Big Red Switch and
suddenly be fully Unicode-safe? Yes, it's backward-incompatible, but
if the script can have some kind of marker (like "use strict") to show
that it's compliant, or if the engine can simply be told "be
compliant", we could begin to move forward. Otherwise, we're stuck
where we are.

Chris Angelico

# Mark Davis ☕ (13 years ago)

The man main complication for compatibility is indexing.

See macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html

If you look back about a year in this list's archive you'll find a long discussion.

{phone}

The man main complication for compatibility is indexing.

See
http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html

If you look back about a year in this list's archive you'll find a long
discussion.

{phone}
On Dec 21, 2012 9:34 PM, "Chris Angelico" <rosuav at gmail.com> wrote:

> On Sat, Dec 22, 2012 at 4:09 PM, Erik Arvidsson
> <erik.arvidsson at gmail.com> wrote:
> > On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com>
> wrote:
> >
> >> There is an alternative. Python (as of version 3.3) has implemented a
> >> new Flexible String Representation, aka PEP-393; the same has existed
> >> in Pike for some time. A string is stored in memory with a fixed
> >> number of bytes per character, based on the highest codepoint in that
> >> string - if there are any non-BMP characters, 4 bytes; if any
> >> U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings
> >> being immutable (otherwise there'd be an annoying string-copy
> >> operation when a too-large character gets put in), which is true of
> >> ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but
> >> with the leading 0 bytes elided when they're not needed.
> >
> > This is how most VMs already work.
> >
> > I agree with you that it would be a better world if this was the case
> > but I don't hear you suggesting how we might be able to change this
> > without breaking the web?
>
> Why, if that's how it's already being done, can't there be an easy way
> to expose it to the script that way? Just flip the Big Red Switch and
> suddenly be fully Unicode-safe? Yes, it's backward-incompatible, but
> if the script can have some kind of marker (like "use strict") to show
> that it's compliant, or if the engine can simply be told "be
> compliant", we could begin to move forward. Otherwise, we're stuck
> where we are.
>
> Chris Angelico
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20121221/cd52b7ad/attachment.html>

# Chris Angelico (13 years ago)

On Sat, Dec 22, 2012 at 5:20 PM, Mark Davis ☕ <mark at macchiato.com> wrote:

The man main complication for compatibility is indexing.

See macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html

Right, and that's the exact issue. If a programming language had a bug wherein division would sometimes give completely wrong results (and not counting old Pentiums), would that be considered a bug to be fixed, or a critical backward-compatibility issue? Or, closer to the issue: If the presence of an asterisk at the beginning of a string caused its characters to be indexed from 1 instead of from 0, that would be considered a bug, right? And if code happened to be depending on the bug, it would simply be broken on the broken interpreter(s) - similarly to what happens when code has to run on old versions of Internet Explorer and needs to have special compatibility handlers. There's no changing the old interpreters, but at least new interpreters can start getting things right.

If you look back about a year in this list's archive you'll find a long discussion.

Very long, yes... I've read some of the posts, but not all.

Chris Angelico

On Sat, Dec 22, 2012 at 5:20 PM, Mark Davis ☕ <mark at macchiato.com> wrote:
> The man main complication for compatibility is indexing.
>
> See
> http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html

Right, and that's the exact issue. If a programming language had a bug
wherein division would sometimes give completely wrong results (and
not counting old Pentiums), would that be considered a bug to be
fixed, or a critical backward-compatibility issue? Or, closer to the
issue: If the presence of an asterisk at the beginning of a string
caused its characters to be indexed from 1 instead of from 0, that
would be considered a bug, right? And if code happened to be depending
on the bug, it would simply be broken on the broken
interpreter(s) - similarly to what happens when code has to run on old
versions of Internet Explorer and needs to have special compatibility
handlers. There's no changing the old interpreters, but at least new
interpreters can start getting things right.

> If you look back about a year in this list's archive you'll find a long
> discussion.

Very long, yes... I've read some of the posts, but not all.

Chris Angelico

# Rick Waldron (13 years ago)

On Saturday, December 22, 2012 at 12:34 AM, Chris Angelico wrote:

On Sat, Dec 22, 2012 at 4:09 PM, Erik Arvidsson <erik.arvidsson at gmail.com> wrote:

On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com> wrote:

There is an alternative. Python (as of version 3.3) has implemented a new Flexible String Representation, aka PEP-393; the same has existed in Pike for some time. A string is stored in memory with a fixed number of bytes per character, based on the highest codepoint in that string - if there are any non-BMP characters, 4 bytes; if any U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings being immutable (otherwise there'd be an annoying string-copy operation when a too-large character gets put in), which is true of ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but with the leading 0 bytes elided when they're not needed.

This is how most VMs already work.

I agree with you that it would be a better world if this was the case but I don't hear you suggesting how we might be able to change this without breaking the web?

Why, if that's how it's already being done, can't there be an easy way to expose it to the script that way? Just flip the Big Red Switch and suddenly be fully Unicode-safe? Yes, it's backward-incompatible, but if the script can have some kind of marker (like "use strict") to show that it's compliant,

That would effectively be a second version of the language, which violates the 1JS promise (no opt-ins or unlockables)

On Saturday, December 22, 2012 at 12:34 AM, Chris Angelico wrote:
> On Sat, Dec 22, 2012 at 4:09 PM, Erik Arvidsson
> <erik.arvidsson at gmail.com> wrote:
> > On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com> wrote:
> > 
> > > There is an alternative. Python (as of version 3.3) has implemented a
> > > new Flexible String Representation, aka PEP-393; the same has existed
> > > in Pike for some time. A string is stored in memory with a fixed
> > > number of bytes per character, based on the highest codepoint in that
> > > string - if there are any non-BMP characters, 4 bytes; if any
> > > U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings
> > > being immutable (otherwise there'd be an annoying string-copy
> > > operation when a too-large character gets put in), which is true of
> > > ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but
> > > with the leading 0 bytes elided when they're not needed.
> > > 
> > 
> > 
> > This is how most VMs already work.
> > 
> > I agree with you that it would be a better world if this was the case
> > but I don't hear you suggesting how we might be able to change this
> > without breaking the web?
> > 
> 
> 
> Why, if that's how it's already being done, can't there be an easy way
> to expose it to the script that way? Just flip the Big Red Switch and
> suddenly be fully Unicode-safe? Yes, it's backward-incompatible, but
> if the script can have some kind of marker (like "use strict") to show
> that it's compliant, 
> 
> 

That would effectively be a second version of the language, which violates the 1JS promise (no opt-ins or unlockables)

Rick
 
> or if the engine can simply be told "be
> compliant", we could begin to move forward. Otherwise, we're stuck
> where we are.
> 
> Chris Angelico
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
> 
> 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20121223/8e43d191/attachment.html>