Flexible String Representation - full Unicode for ES6?
On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com> wrote:
There is an alternative. Python (as of version 3.3) has implemented a new Flexible String Representation, aka PEP-393; the same has existed in Pike for some time. A string is stored in memory with a fixed number of bytes per character, based on the highest codepoint in that string - if there are any non-BMP characters, 4 bytes; if any U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings being immutable (otherwise there'd be an annoying string-copy operation when a too-large character gets put in), which is true of ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but with the leading 0 bytes elided when they're not needed.
This is how most VMs already work.
I agree with you that it would be a better world if this were the case, but I don't hear you suggesting how we might change this without breaking the web.
-- erik
On Sat, Dec 22, 2012 at 4:09 PM, Erik Arvidsson <erik.arvidsson at gmail.com> wrote:
On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com> wrote:
There is an alternative. Python (as of version 3.3) has implemented a new Flexible String Representation, aka PEP-393; the same has existed in Pike for some time. A string is stored in memory with a fixed number of bytes per character, based on the highest codepoint in that string - if there are any non-BMP characters, 4 bytes; if any U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings being immutable (otherwise there'd be an annoying string-copy operation when a too-large character gets put in), which is true of ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but with the leading 0 bytes elided when they're not needed.
This is how most VMs already work.
I agree with you that it would be a better world if this were the case, but I don't hear you suggesting how we might change this without breaking the web.
Why, if that's how it's already being done, can't there be an easy way to expose it to the script that way? Just flip the Big Red Switch and suddenly be fully Unicode-safe? Yes, it's backward-incompatible, but if the script can have some kind of marker (like "use strict") to show that it's compliant, or if the engine can simply be told "be compliant", we could begin to move forward. Otherwise, we're stuck where we are.
Chris Angelico
The main complication for compatibility is indexing.
See macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html
If you look back about a year in this list's archive you'll find a long discussion.
{phone}
On Sat, Dec 22, 2012 at 5:20 PM, Mark Davis ☕ <mark at macchiato.com> wrote:
The main complication for compatibility is indexing.
See macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html
Right, and that's the exact issue. If a programming language had a bug wherein division sometimes gave completely wrong results (old Pentium FDIV flaws aside), would that be considered a bug to be fixed, or a critical backward-compatibility issue? Or, closer to the point: if the presence of an asterisk at the beginning of a string caused its characters to be indexed from 1 instead of 0, that would be considered a bug, right? And code that happened to depend on the bug would simply be broken on the broken interpreters - much as code that has to run on old versions of Internet Explorer needs special compatibility handlers. There's no changing the old interpreters, but at least new interpreters can start getting things right.
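For concreteness, here is what every current engine does with a single astral character (everything below is standard ES5 behavior, nothing proposed or hypothetical):

    var s = "a\uD83D\uDE00b";  // "a" + U+1F600 GRINNING FACE (a surrogate pair) + "b"
    s.length;                  // 4 - but a user sees three characters
    s.charAt(1);               // "\uD83D" - half of the emoji
    s.charCodeAt(1);           // 0xD83D - a lone lead surrogate, not a code point
    s.slice(0, 2);             // "a\uD83D" - the pair has been split in two

Any of those results can silently corrupt text the moment a user pastes in an emoji or a rare CJK character.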
If you look back about a year in this list's archive you'll find a long discussion.
Very long, yes... I've read some of the posts, but not all.
Chris Angelico
On Saturday, December 22, 2012 at 12:34 AM, Chris Angelico wrote:
On Sat, Dec 22, 2012 at 4:09 PM, Erik Arvidsson <erik.arvidsson at gmail.com> wrote:
On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico <rosuav at gmail.com> wrote:
There is an alternative. Python (as of version 3.3) has implemented a new Flexible String Representation, aka PEP-393; the same has existed in Pike for some time. A string is stored in memory with a fixed number of bytes per character, based on the highest codepoint in that string - if there are any non-BMP characters, 4 bytes; if any U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings being immutable (otherwise there'd be an annoying string-copy operation when a too-large character gets put in), which is true of ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but with the leading 0 bytes elided when they're not needed.
This is how most VMs already work.
I agree with you that it would be a better world if this were the case, but I don't hear you suggesting how we might change this without breaking the web.
Why, if that's how it's already being done, can't there be an easy way to expose it to the script that way? Just flip the Big Red Switch and suddenly be fully Unicode-safe? Yes, it's backward-incompatible, but if the script can have some kind of marker (like "use strict") to show that it's compliant,
That would effectively be a second version of the language, which violates the 1JS promise (no opt-ins or unlockables).
I was directed here from the V8 discussion list; I hope this is the right place to raise this.
I've read norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html and some of the related discussion (of which there is a considerable amount!). The problem with UTF-16 encoding has been biting me in a project where we allow untrusted users to configure our application by providing a script from which we call functions. The script manipulates text, so it makes good sense to support full Unicode, and compatibility with older ECMAScript engines/interpreters is not a significant concern for us. I'm fully aware that such compatibility is a major barrier to change in most situations, though; I am inclined toward some form of BRS (Big Red Switch) as proposed by Brendan Eich.
Some worthwhile reading: unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings
If the language provides a string type that's UTF-16 and then adds a few functions that count code points (as described in the Lindenberg proposal above), the temptation will be strong for programmers to ignore non-BMP characters and quietly remain buggy in the face of surrogates. To truly support full Unicode, the language has to expose to its programmers only Unicode, not some encoding used to represent Unicode characters in memory. The easiest way to do this is to store strings as UTF-32, allowing O(1) indexing and so on, but that's really wasteful.
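For instance, such a counting helper might look like the following sketch (the name codePointLength is illustrative, not part of any proposal):

    // Counts code points rather than UTF-16 code units: a properly
    // paired lead+trail surrogate contributes one to the count.
    function codePointLength(s) {
        var count = 0;
        for (var i = 0; i < s.length; i++) {
            var c = s.charCodeAt(i);
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
                var d = s.charCodeAt(i + 1);
                if (d >= 0xDC00 && d <= 0xDFFF) i++;  // skip the trail half
            }
            count++;
        }
        return count;
    }
    codePointLength("a\uD83D\uDE00b");  // 3, where .length still says 4

As long as that stays an opt-in helper while s[i], .length and .charAt keep answering in code units, most code will never call it - which is precisely the temptation I mean.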
There is an alternative. Python (as of version 3.3) has implemented a new Flexible String Representation, aka PEP-393; the same has existed in Pike for some time. A string is stored in memory with a fixed number of bytes per character, based on the highest codepoint in that string - if there are any non-BMP characters, 4 bytes; if any U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings being immutable (otherwise there'd be an annoying string-copy operation when a too-large character gets put in), which is true of ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but with the leading 0 bytes elided when they're not needed.
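To illustrate the width selection, here is a sketch in JavaScript of the decision an engine would make natively, once, when an immutable string is created (bytesPerChar is an illustrative name, and well-formed surrogate pairing is assumed):

    // Returns the per-character width a PEP-393-style representation
    // would pick: 4 bytes if any non-BMP character is present, else
    // 2 bytes if anything is above U+00FF, else 1 byte.
    function bytesPerChar(s) {
        var width = 1;
        for (var i = 0; i < s.length; i++) {
            var c = s.charCodeAt(i);
            if (c >= 0xD800 && c <= 0xDFFF) return 4;  // part of an astral pair
            if (c > 0xFF) width = 2;
        }
        return width;
    }
    bytesPerChar("hello");          // 1
    bytesPerChar("caf\u00E9");      // 1 - U+00E9 still fits in one byte
    bytesPerChar("\u0100");         // 2
    bytesPerChar("a\uD83D\uDE00");  // 4 - contains U+1F600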
Most scripts are going to have a large number of pure-ASCII strings in them - variable names, identifiers, HTML tags, etc. These would benefit from a switch to Pike-style strings. And any strings that don't actually contain astral characters would suffer no penalty; only strings that are actually affected pay the price. And we could then trust that no surrogates ever get separated during transmission.
Chris Angelico