New full Unicode for ES6 idea

# Brendan Eich (12 years ago)

Once more unto the breach, dear friends!

ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say ;-).

Clearly that was a while ago. These days, we would like full 21-bit Unicode character support in JS. Some (mranney at Voxer) contend that it is a requirement.

Full 21-bit Unicode support means all of:

  • indexing by characters, not uint16 storage units;
  • counting length as one greater than the last index; and
  • supporting escapes with (up to) six hexadecimal digits.
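To make these three points concrete, here is a sketch (mine, using the non-BMP character U+1F638 that also appears in Allen's example below) of today's behavior next to the proposed behavior:

// Today: one non-BMP character occupies two uint16 storage units.
var s = "\ud83d\ude38";        // U+1F638 as a surrogate pair
s.length == 2;                 // length counts uint16 units
s.charCodeAt(0) == 0xd83d;     // indexing exposes half a pair

// Proposed (illustrative; nothing like this has shipped):
// var t = "\u{1F638}";        // escape with up to six hex digits
// t.length == 1;              // one greater than the last index, 0
// t.charCodeAt(0) == 0x1f638; // indexing by characters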

ES4 saw bold proposals including Lars Hansen's, to allow implementations to change string indexing and length incompatibly, and let Darwin sort it out. I recall that was when we agreed to support "\u{XXXXXX}" as an extension for spelling non-BMP characters.

Allen's strawman from last year, strawman:support_full_unicode_in_strings, proposed a brute-force change to support full Unicode (albeit with too many hex digits allowed in "\u{...}"), observing that "There are very few places where the ECMAScript specification has actual dependencies upon the size of individual characters so the compatibility impact of supporting full Unicode is quite small." But two problems remained:

P1. As Allen wrote, "There is a larger impact on actual implementations", and no implementors that I can recall were satisfied that the cost was acceptable. It might be, we just didn't know, and there are enough signs of high cost to create this concern.

P2. The change is not backward compatible. In JS today, one can read a string s from somewhere and hard-code, e.g., s.indexOf("\ud800") to find part of a surrogate pair, then advance to the next-indexed uint16 unit and read the other half, then combine the two to compute some result. Such usage would break.
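For concreteness, here is a sketch (my reconstruction, not code quoted from any real site) of the surrogate-combining pattern P2 describes; its index arithmetic assumes uint16 units throughout, so character-based indexing would break it:

// Hypothetical pre-opt-in code combining a surrogate pair by hand.
function decodeAt(s, i) {
  var hi = s.charCodeAt(i);
  if (hi >= 0xd800 && hi <= 0xdbff) {
    var lo = s.charCodeAt(i + 1);  // read the next-indexed uint16 unit
    return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;  // combine
  }
  return hi;
}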

Example from Allen:

var c = "😸" // where the single character between the quotes is the Unicode character U+1F638

c.length == 2;
c === "\ud83d\ude38"; // the two-element UTF-16 encoding of 0x1f638
c.charCodeAt(0) == 0xd83d;
c.charCodeAt(1) == 0xde38;

(Allen points out how browsers, node.js, and other environments blindly handle UTF-8 or whatever incoming format recoding to UTF-16 upstream of the JS engine, so the above actually works without any spec-language in ECMA-262 saying it should.)

So based on a recent twitter/github exchange, gist recorded at gist.github.com/1850768, I would like to propose a variation on Allen's proposal that resolves both of these problems. Here are resolutions in reverse order:

R2. No incompatible change without opt-in. If you hardcode as in Allen's example, don't opt in without changing your index, length, and char/code-at assumptions.

Such opt-in cannot be a pragma since those have lexical scope and affect code, not the heap where strings and String.prototype methods live.

We also wish to avoid exposing a "full Unicode" representation type and duplicated suite of the String static and prototype methods, as Java did. (We may well want UTF-N transcoding helpers; we certainly want ByteArray <-> UTF-8 transcoding APIs.)

True, R2 implies there are two string primitive representations at most, or more likely "1.x" for some fraction .x. Say, a flag bit in the string header to distinguish JS's uint16-based indexing ("UCS-2") from non-O(1)-indexing UTF-16. Lots of non-observable implementation options here.
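To show why UTF-16 indexing is not O(1), here is a sketch (assumptions mine, written against today's uint16-indexed charCodeAt) of the transcoding such an implementation needs: finding the uint16 offset of the i-th character requires a scan, absent extra index structures.

// Map a character (code point) index to a uint16 offset in the backing store.
function unitOffset(s, charIndex) {
  var u = 0;
  for (var i = 0; i < charIndex; i++) {
    var c = s.charCodeAt(u);
    u += (c >= 0xd800 && c <= 0xdbff) ? 2 : 1;  // a surrogate pair spans two units
  }
  return u;
}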

Instead of any such big new observables, I propose a so-called "Big Red [opt-in] Switch" (BRS) on the side of a unit of VM isolation: specifically the global object.

Why the global object? Because for many VMs, each global has its own heap or sub-heap ("compartment"), and all references outside that heap are to local proxies that copy from, or in the case of immutable data, reference the remote heap. Also because inter-compartment traffic is (we conjecture) infrequent enough to tolerate the proxy/copy overhead.

For strings and String objects, such proxies would consult the remote heap's BRS setting and transcode indexed access, and .length gets, accordingly. It doesn't matter if the BRS is in the global or its String constructor or String.prototype, as the latter are unforgeably linked to the global.

This means a script intent on comparing strings from two globals with different BRS settings could indeed tell that one discloses non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This is the small new observable I claim we can live with, because someone opted into it at least in one of the related global objects.
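A sketch of that observable (hypothetical, since the BRS never existed to throw):

// In a BRS-off global, U+1F638 occupies two uint16 units:
var sB = "\ud83d\ude38";
sB.length == 2;               // true today
sB.charCodeAt(0) == 0xd83d;   // true today
// In a BRS-on global, the same character would disclose:
//   sA.length == 1;
//   sA.charCodeAt(0) == 0x1f638;  // >= 0x10000: the new observable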

Note that implementations such as Node.js can pre-set the BRS to "full Unicode" at startup. Embeddings that fully isolate each global and its reachable objects and strings pay no string-proxy or -copy overhead.

R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls from JS to (typically) C++ would have to proxy or copy any strings containing non-BMP characters. Strings with only BMP characters would work as today.

Note that we are dealing only in spec observables here. It doesn't matter whether the JS engine uses UTF-8 and the DOM UCS-2 (in which case there is already a transcoding penalty; IIRC WebKit libxml and libxslt use UTF-8 and so must transcode to interface with WebKit's DOM). The only issue at this boundary, I believe, is how indexing and .length work.

Ok, there you have it: resolutions for both problems that killed the last assault on Castle '90s-JS.

Implementations that use uint16 vectors as the character data representation type for both "UCS-2" and "UTF-16" string variants would probably want another flag bit per string header indicating whether, for the UTF-16 case, the string indeed contained any non-BMP characters. If not, no proxy/copy needed.

Such implementations probably would benefit from string (primitive value) proxies not just copies, since the underlying uint16 vector could be shared by two different string headers with whatever metadata flag bits, etc., are needed to disclose different length values, access different methods from distinct globals' String.prototype objects, etc.

We could certainly also work with the W3C to revise the DOM to check the BRS setting, if that is possible, to avoid this non-BMP-string proxy/copy overhead.

How is the BRS configured? Again, not via a pragma, and not by imperative state update inside the language (mutating hidden BRS state at a given program point could leave strings created before mutation observably different from those created after, unless the implementation in effect scanned the local heap and wrapped or copied any non-BMP-char-bearing ones created before).

The obvious way to express the BRS in HTML is a <meta> tag in document <head>, but I don't want to get hung up on this point. I do welcome expert guidance. Here is another W3C/WHATWG interaction point. For this reason I'm cc'ing public-script-coord.

The upshot of this proposal is to get JS out of the '90s without a mandatory breaking change. With simple-enough opt-in expressed at coarse-enough boundaries so as not to impose high cost or unintended string type confusion bugs, the complexity is mostly borne by implementors, and at less than a 2x cost comparing string implementations (I think -- demonstration required of course).

In particular, Node.js can get modern at startup, and perhaps engines such as V8 as used in Node could even support compile-time (#ifdef) configury by which to support only full Unicode.

Comments welcome.

# Jussi Kalliokoski (12 years ago)

I'm not sure what to think about this, being a big fan of UTF-8's simplicity. :) But anyhow, I like the idea of opt-in, actually so much that I started thinking: why not make JS encoding-agnostic?

What I mean here is that maybe we could have multi-charset Strings in JS? This would be useful especially in server-side JS. So, what I'm suggesting is an extension to the String class, maybe defined as follows (just the first thing off the top of my head).

Let's say we have a loadFile function that takes a filename and reads its contents to a string it returns.

loadFile('my-utf8-file').charset === 'UTF-8'
String(loadFile('my-file'), 'UTF-16').charset === 'UTF-16'
loadFile('my-file').toString('UTF-9').charset === 'UTF-9'
32..toString(10, 'UTF-8').charset === 'UTF-8'

// And hence, we could add easy sugar to the function as well:
loadFile('my-utf8-file', 'UTF-16').charset === 'UTF-16';

What do you think?

Obviously this creates a lot of problems, but backwards compatibility could (maybe) be preserved without an opt-in, if the default charset would stay the same.

# Axel Rauschmayer (12 years ago)

On Feb 19, 2012, at 9:33, Brendan Eich wrote:

Instead of any such big new observables, I propose a so-called "Big Red [opt-in] Switch" (BRS) on the side of a unit of VM isolation: specifically the global object.

es-discuss-only idea: could that BRS be made to carry more weight? Could it be a switch for all breaking ES.next changes?

# Peter van der Zee (12 years ago)

Do we know how many scripts actually rely on \uXXXXXX to produce a string length of 3? Might it make more sense to put the new unicode escape under a different escape? Something like \e for "extended unicode", for example. Or is this "acceptable migration tax"...
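For reference, \u consumes exactly four hex digits today, so any further digits are literal characters; that is the behavior such a script would be relying on:

"\u123456".length == 3;   // parses as "\u1234" + "5" + "6"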

On a side note, if we're going to do this, can we also have aliases in regex to parse certain Unicode categories? For instance, the ES spec defines "Uppercase letter" (Lu), "Lowercase letter" (Ll), "Titlecase letter" (Lt), "Modifier letter" (Lm), "Other letter" (Lo), "Letter number" (Nl), "Non-spacing mark" (Mn), "Combining spacing mark" (Mc), "Decimal number" (Nd) and "Connector punctuation" (Pc) as possible identifier parts. But right now I have to go very far out of my way (qfox.nl/notes/90) and end up with a 56k script that's almost pure regex.

This works and performance is amazingly fair, but it'd make more sense to be able to do \pLt or something, to parse any character in the "Titlecase letter" category. As far as I understand, these categories have to be known and supported anyway, so these switches shouldn't cause too much trouble in that regard, at least.
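Here is a sketch of the suggested shorthand next to today's workaround (the \p syntax and the Lt ranges below are illustrative only, not a worked-out proposal):

// Proposed shorthand (illustrative syntax):
//   /\pLt/   -- match any character in the "Titlecase letter" category
// Today the same test needs explicit, generated ranges (partial list):
var Lt = /[\u01C5\u01C8\u01CB\u01F2\u1F88-\u1F8F\u1F98-\u1F9F\u1FA8-\u1FAF\u1FBC\u1FCC\u1FFC]/;
Lt.test("\u01C5");   // true -- U+01C5 is in category Lt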

# Mathias Bynens (12 years ago)

On a side note, if we're going to do this, can we also have aliases in regex to parse certain Unicode categories? For instance, the ES spec defines "Uppercase letter" (Lu), "Lowercase letter" (Ll), "Titlecase letter" (Lt), "Modifier letter" (Lm), "Other letter" (Lo), "Letter number" (Nl), "Non-spacing mark" (Mn), "Combining spacing mark" (Mc), "Decimal number" (Nd) and "Connector punctuation" (Pc) as possible identifier parts. But right now I have to go very far out of my way (qfox.nl/notes/90) and end up with a 56k script that's almost pure regex.

FWIW, it can be done in “just” 11,335 characters:

/^(?!(?:do|if|in|for|let|new|try|var|case|else|enum|false|null|this|true|void|with|break|catch|class|const|super|throw|while|yield|delete|export|import|public|return|static|switch|typeof|default|extends|finally|package|private|continue|debugger|function|interface|protected|implements|instanceof)$)[$A-Z_a-z\xaa\xb5\xba\xc0-\xd6\xd8-\xf6\xf8-\u02c1\u02c6-\u02d1\u02e0-\u02e4\u02ec\u02ee\u0370-\u0374\u0376-\u0377\u037a-\u037d\u0386\u0388-\u038a\u038c\u038e-\u03a1\u03a3-\u03f5\u03f7-\u0481\u048a-\u0527\u0531-\u0556\u0559\u0561-\u0587\u05d0-\u05ea\u05f0-\u05f2\u0620-\u064a\u066e-\u066f\u0671-\u06d3\u06d5\u06e5-\u06e6\u06ee-\u06ef\u06fa-\u06fc\u06ff\u0710\u0712-\u072f\u074d-\u07a5\u07b1\u07ca-\u07ea\u07f4-\u07f5\u07fa\u0800-\u0815\u081a\u0824\u0828\u0840-\u0858\u08a0\u08a2-\u08ac\u0904-\u0939\u093d\u0950\u0958-\u0961\u0971-\u0977\u0979-\u097f\u0985-\u098c\u098f-\u0990\u0993-\u09a8\u09aa-\u09b0\u09b2\u09b6-\u09b9\u09bd\u09ce\u09dc-\u09dd\u09df-\u09e1\u09f0-\u09f1\u0a05-\u0a0a\u0a0f-\u0a10\u0a13-\u0a28\u0a2a-\u0a30\u0a32-\u0a33\u0a35-\u0a36\u0a38-\u0a39\u0a59-\u0a5c\u0a5e\u0a72-\u0a74\u0a85-\u0a8d\u0a8f-\u0a91\u0a93-\u0aa8\u0aaa-\u0ab0\u0ab2-\u0ab3\u0ab5-\u0ab9\u0abd\u0ad0\u0ae0-\u0ae1\u0b05-\u0b0c\u0b0f-\u0b10\u0b13-\u0b28\u0b2a-\u0b30\u0b32-\u0b33\u0b35-\u0b39\u0b3d\u0b5c-\u0b5d\u0b5f-\u0b61\u0b71\u0b83\u0b85-\u0b8a\u0b8e-\u0b90\u0b92-\u0b95\u0b99-\u0b9a\u0b9c\u0b9e-\u0b9f\u0ba3-\u0ba4\u0ba8-\u0baa\u0bae-\u0bb9\u0bd0\u0c05-\u0c0c\u0c0e-\u0c10\u0c12-\u0c28\u0c2a-\u0c33\u0c35-\u0c39\u0c3d\u0c58-\u0c59\u0c60-\u0c61\u0c85-\u0c8c\u0c8e-\u0c90\u0c92-\u0ca8\u0caa-\u0cb3\u0cb5-\u0cb9\u0cbd\u0cde\u0ce0-\u0ce1\u0cf1-\u0cf2\u0d05-\u0d0c\u0d0e-\u0d10\u0d12-\u0d3a\u0d3d\u0d4e\u0d60-\u0d61\u0d7a-\u0d7f\u0d85-\u0d96\u0d9a-\u0db1\u0db3-\u0dbb\u0dbd\u0dc0-\u0dc6\u0e01-\u0e30\u0e32-\u0e33\u0e40-\u0e46\u0e81-\u0e82\u0e84\u0e87-\u0e88\u0e8a\u0e8d\u0e94-\u0e97\u0e99-\u0e9f\u0ea1-\u0ea3\u0ea5\u0ea7\u0eaa-\u0eab\u0ead-\u0eb0\u0eb2-\u0eb3\u0ebd\u0ec0-\u0ec4\u0ec6\u0edc-\u0edf\u0f00\u0f40-\u0f47\u0f49-\u0f6c\u0f88-\u0f8c\u1000-\u102a\u103f\u1050-\u1055\u105a-\u105d\u1061\u1065-\u1066\u106e-\u1070\u1075-\u1081\u108e\u10a0-\u10c5\u10c7\u10cd\u10d0-\u10fa\u10fc-\u1248\u124a-\u124d\u1250-\u1256\u1258\u125a-\u125d\u1260-\u1288\u128a-\u128d\u1290-\u12b0\u12b2-\u12b5\u12b8-\u12be\u12c0\u12c2-\u12c5\u12c8-\u12d6\u12d8-\u1310\u1312-\u1315\u1318-\u135a\u1380-\u138f\u13a0-\u13f4\u1401-\u166c\u166f-\u167f\u1681-\u169a\u16a0-\u16ea\u16ee-\u16f0\u1700-\u170c\u170e-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176c\u176e-\u1770\u1780-\u17b3\u17d7\u17dc\u1820-\u1877\u1880-\u18a8\u18aa\u18b0-\u18f5\u1900-\u191c\u1950-\u196d\u1970-\u1974\u1980-\u19ab\u19c1-\u19c7\u1a00-\u1a16\u1a20-\u1a54\u1aa7\u1b05-\u1b33\u1b45-\u1b4b\u1b83-\u1ba0\u1bae-\u1baf\u1bba-\u1be5\u1c00-\u1c23\u1c4d-\u1c4f\u1c5a-\u1c7d\u1ce9-\u1cec\u1cee-\u1cf1\u1cf5-\u1cf6\u1d00-\u1dbf\u1e00-\u1f15\u1f18-\u1f1d\u1f20-\u1f45\u1f48-\u1f4d\u1f50-\u1f57\u1f59\u1f5b\u1f5d\u1f5f-\u1f7d\u1f80-\u1fb4\u1fb6-\u1fbc\u1fbe\u1fc2-\u1fc4\u1fc6-\u1fcc\u1fd0-\u1fd3\u1fd6-\u1fdb\u1fe0-\u1fec\u1ff2-\u1ff4\u1ff6-\u1ffc\u2071\u207f\u2090-\u209c\u2102\u2107\u210a-\u2113\u2115\u2119-\u211d\u2124\u2126\u2128\u212a-\u212d\u212f-\u2139\u213c-\u213f\u2145-\u2149\u214e\u2160-\u2188\u2c00-\u2c2e\u2c30-\u2c5e\u2c60-\u2ce4\u2ceb-\u2cee\u2cf2-\u2cf3\u2d00-\u2d25\u2d27\u2d2d\u2d30-\u2d67\u2d6f\u2d80-\u2d96\u2da0-\u2da6\u2da8-\u2dae\u2db0-\u2db6\u2db8-\u2dbe\u2dc0-\u2dc6\u2dc8-\u2dce\u2dd0-\u2dd6\u2dd8-\u2dde\u2e2f\u3005-\u3007\u3021-\u3029\u3031-\u3035\u3038-\u303c\u3041-\u3096\u309d-\u309f\u30a1-\u30fa\u30fc-\u30
ff\u3105-\u312d\u3131-\u318e\u31a0-\u31ba\u31f0-\u31ff\u3400-\u4db5\u4e00-\u9fcc\ua000-\ua48c\ua4d0-\ua4fd\ua500-\ua60c\ua610-\ua61f\ua62a-\ua62b\ua640-\ua66e\ua67f-\ua697\ua6a0-\ua6ef\ua717-\ua71f\ua722-\ua788\ua78b-\ua78e\ua790-\ua793\ua7a0-\ua7aa\ua7f8-\ua801\ua803-\ua805\ua807-\ua80a\ua80c-\ua822\ua840-\ua873\ua882-\ua8b3\ua8f2-\ua8f7\ua8fb\ua90a-\ua925\ua930-\ua946\ua960-\ua97c\ua984-\ua9b2\ua9cf\uaa00-\uaa28\uaa40-\uaa42\uaa44-\uaa4b\uaa60-\uaa76\uaa7a\uaa80-\uaaaf\uaab1\uaab5-\uaab6\uaab9-\uaabd\uaac0\uaac2\uaadb-\uaadd\uaae0-\uaaea\uaaf2-\uaaf4\uab01-\uab06\uab09-\uab0e\uab11-\uab16\uab20-\uab26\uab28-\uab2e\uabc0-\uabe2\uac00-\ud7a3\ud7b0-\ud7c6\ud7cb-\ud7fb\uf900-\ufa6d\ufa70-\ufad9\ufb00-\ufb06\ufb13-\ufb17\ufb1d\ufb1f-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb3e\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\uff21-\uff3a\uff41-\uff5a\uff66-\uffbe\uffc2-\uffc7\uffca-\uffcf\uffd2-\uffd7\uffda-\uffdc][$A-Z_a-z\xaa\xb5\xba\xc0-\xd6\xd8-\xf6\xf8-\u02c1\u02c6-\u02d1\u02e0-\u02e4\u02ec\u02ee\u0370-\u0374\u0376-\u0377\u037a-\u037d\u0386\u0388-\u038a\u038c\u038e-\u03a1\u03a3-\u03f5\u03f7-\u0481\u048a-\u0527\u0531-\u0556\u0559\u0561-\u0587\u05d0-\u05ea\u05f0-\u05f2\u0620-\u064a\u066e-\u066f\u0671-\u06d3\u06d5\u06e5-\u06e6\u06ee-\u06ef\u06fa-\u06fc\u06ff\u0710\u0712-\u072f\u074d-\u07a5\u07b1\u07ca-\u07ea\u07f4-\u07f5\u07fa\u0800-\u0815\u081a\u0824\u0828\u0840-\u0858\u08a0\u08a2-\u08ac\u0904-\u0939\u093d\u0950\u0958-\u0961\u0971-\u0977\u0979-\u097f\u0985-\u098c\u098f-\u0990\u0993-\u09a8\u09aa-\u09b0\u09b2\u09b6-\u09b9\u09bd\u09ce\u09dc-\u09dd\u09df-\u09e1\u09f0-\u09f1\u0a05-\u0a0a\u0a0f-\u0a10\u0a13-\u0a28\u0a2a-\u0a30\u0a32-\u0a33\u0a35-\u0a36\u0a38-\u0a39\u0a59-\u0a5c\u0a5e\u0a72-\u0a74\u0a85-\u0a8d\u0a8f-\u0a91\u0a93-\u0aa8\u0aaa-\u0ab0\u0ab2-\u0ab3\u0ab5-\u0ab9\u0abd\u0ad0\u0ae0-\u0ae1\u0b05-\u0b0c\u0b0f-\u0b10\u0b13-\u0b28\u0b2a-\u0b30\u0b32-\u0b33\u0b35-\u0b39\u0b3d\u0b5c-\u0b5d\u0b5f-\u0b61\u0b71\u0b83\u0b85-\u0b8a\u0b8e-\u0b90\u0b92-\u0b95\u0b99-\u0b9a\u0b9c\u0b9e-\u0b9f\u0ba3-\u0ba4\u0ba8-\u0baa\u0bae-\u0bb9\u0bd0\u0c05-\u0c0c\u0c0e-\u0c10\u0c12-\u0c28\u0c2a-\u0c33\u0c35-\u0c39\u0c3d\u0c58-\u0c59\u0c60-\u0c61\u0c85-\u0c8c\u0c8e-\u0c90\u0c92-\u0ca8\u0caa-\u0cb3\u0cb5-\u0cb9\u0cbd\u0cde\u0ce0-\u0ce1\u0cf1-\u0cf2\u0d05-\u0d0c\u0d0e-\u0d10\u0d12-\u0d3a\u0d3d\u0d4e\u0d60-\u0d61\u0d7a-\u0d7f\u0d85-\u0d96\u0d9a-\u0db1\u0db3-\u0dbb\u0dbd\u0dc0-\u0dc6\u0e01-\u0e30\u0e32-\u0e33\u0e40-\u0e46\u0e81-\u0e82\u0e84\u0e87-\u0e88\u0e8a\u0e8d\u0e94-\u0e97\u0e99-\u0e9f\u0ea1-\u0ea3\u0ea5\u0ea7\u0eaa-\u0eab\u0ead-\u0eb0\u0eb2-\u0eb3\u0ebd\u0ec0-\u0ec4\u0ec6\u0edc-\u0edf\u0f00\u0f40-\u0f47\u0f49-\u0f6c\u0f88-\u0f8c\u1000-\u102a\u103f\u1050-\u1055\u105a-\u105d\u1061\u1065-\u1066\u106e-\u1070\u1075-\u1081\u108e\u10a0-\u10c5\u10c7\u10cd\u10d0-\u10fa\u10fc-\u1248\u124a-\u124d\u1250-\u1256\u1258\u125a-\u125d\u1260-\u1288\u128a-\u128d\u1290-\u12b0\u12b2-\u12b5\u12b8-\u12be\u12c0\u12c2-\u12c5\u12c8-\u12d6\u12d8-\u1310\u1312-\u1315\u1318-\u135a\u1380-\u138f\u13a0-\u13f4\u1401-\u166c\u166f-\u167f\u1681-\u169a\u16a0-\u16ea\u16ee-\u16f0\u1700-\u170c\u170e-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176c\u176e-\u1770\u1780-\u17b3\u17d7\u17dc\u1820-\u1877\u1880-\u18a8\u18aa\u18b0-\u18f5\u1900-\u191c\u1950-\u196d\u1970-\u1974\u1980-\u19ab\u19c1-\u19c7\u1a00-\u1a16\u1a20-\u1a54\u1aa7\u1b05-\u1b33\u1b45-\u1b4b\u1b83-\u1ba0\u1bae-\u1baf\u1bba-\u1be5\u1c00-\u1c23\u1c4d-\u1c4f\u1c5a-\u1c7d\u1ce9-\u1cec\u1cee-\u1cf1\u1cf5-\u1cf6\u1d00-\u1dbf\u1e00-\u1f15\u1f18-\u
1f1d\u1f20-\u1f45\u1f48-\u1f4d\u1f50-\u1f57\u1f59\u1f5b\u1f5d\u1f5f-\u1f7d\u1f80-\u1fb4\u1fb6-\u1fbc\u1fbe\u1fc2-\u1fc4\u1fc6-\u1fcc\u1fd0-\u1fd3\u1fd6-\u1fdb\u1fe0-\u1fec\u1ff2-\u1ff4\u1ff6-\u1ffc\u2071\u207f\u2090-\u209c\u2102\u2107\u210a-\u2113\u2115\u2119-\u211d\u2124\u2126\u2128\u212a-\u212d\u212f-\u2139\u213c-\u213f\u2145-\u2149\u214e\u2160-\u2188\u2c00-\u2c2e\u2c30-\u2c5e\u2c60-\u2ce4\u2ceb-\u2cee\u2cf2-\u2cf3\u2d00-\u2d25\u2d27\u2d2d\u2d30-\u2d67\u2d6f\u2d80-\u2d96\u2da0-\u2da6\u2da8-\u2dae\u2db0-\u2db6\u2db8-\u2dbe\u2dc0-\u2dc6\u2dc8-\u2dce\u2dd0-\u2dd6\u2dd8-\u2dde\u2e2f\u3005-\u3007\u3021-\u3029\u3031-\u3035\u3038-\u303c\u3041-\u3096\u309d-\u309f\u30a1-\u30fa\u30fc-\u30ff\u3105-\u312d\u3131-\u318e\u31a0-\u31ba\u31f0-\u31ff\u3400-\u4db5\u4e00-\u9fcc\ua000-\ua48c\ua4d0-\ua4fd\ua500-\ua60c\ua610-\ua61f\ua62a-\ua62b\ua640-\ua66e\ua67f-\ua697\ua6a0-\ua6ef\ua717-\ua71f\ua722-\ua788\ua78b-\ua78e\ua790-\ua793\ua7a0-\ua7aa\ua7f8-\ua801\ua803-\ua805\ua807-\ua80a\ua80c-\ua822\ua840-\ua873\ua882-\ua8b3\ua8f2-\ua8f7\ua8fb\ua90a-\ua925\ua930-\ua946\ua960-\ua97c\ua984-\ua9b2\ua9cf\uaa00-\uaa28\uaa40-\uaa42\uaa44-\uaa4b\uaa60-\uaa76\uaa7a\uaa80-\uaaaf\uaab1\uaab5-\uaab6\uaab9-\uaabd\uaac0\uaac2\uaadb-\uaadd\uaae0-\uaaea\uaaf2-\uaaf4\uab01-\uab06\uab09-\uab0e\uab11-\uab16\uab20-\uab26\uab28-\uab2e\uabc0-\uabe2\uac00-\ud7a3\ud7b0-\ud7c6\ud7cb-\ud7fb\uf900-\ufa6d\ufa70-\ufad9\ufb00-\ufb06\ufb13-\ufb17\ufb1d\ufb1f-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb3e\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\uff21-\uff3a\uff41-\uff5a\uff66-\uffbe\uffc2-\uffc7\uffca-\uffcf\uffd2-\uffd7\uffda-\uffdc0-9\u0300-\u036f\u0483-\u0487\u0591-\u05bd\u05bf\u05c1-\u05c2\u05c4-\u05c5\u05c7\u0610-\u061a\u064b-\u0669\u0670\u06d6-\u06dc\u06df-\u06e4\u06e7-\u06e8\u06ea-\u06ed\u06f0-\u06f9\u0711\u0730-\u074a\u07a6-\u07b0\u07c0-\u07c9\u07eb-\u07f3\u0816-\u0819\u081b-\u0823\u0825-\u0827\u0829-\u082d\u0859-\u085b\u08e4-\u08fe\u0900-\u0903\u093a-\u093c\u093e-\u094f\u0951-\u0957\u0962-\u0963\u0966-\u096f\u0981-\u0983\u09bc\u09be-\u09c4\u09c7-\u09c8\u09cb-\u09cd\u09d7\u09e2-\u09e3\u09e6-\u09ef\u0a01-\u0a03\u0a3c\u0a3e-\u0a42\u0a47-\u0a48\u0a4b-\u0a4d\u0a51\u0a66-\u0a71\u0a75\u0a81-\u0a83\u0abc\u0abe-\u0ac5\u0ac7-\u0ac9\u0acb-\u0acd\u0ae2-\u0ae3\u0ae6-\u0aef\u0b01-\u0b03\u0b3c\u0b3e-\u0b44\u0b47-\u0b48\u0b4b-\u0b4d\u0b56-\u0b57\u0b62-\u0b63\u0b66-\u0b6f\u0b82\u0bbe-\u0bc2\u0bc6-\u0bc8\u0bca-\u0bcd\u0bd7\u0be6-\u0bef\u0c01-\u0c03\u0c3e-\u0c44\u0c46-\u0c48\u0c4a-\u0c4d\u0c55-\u0c56\u0c62-\u0c63\u0c66-\u0c6f\u0c82-\u0c83\u0cbc\u0cbe-\u0cc4\u0cc6-\u0cc8\u0cca-\u0ccd\u0cd5-\u0cd6\u0ce2-\u0ce3\u0ce6-\u0cef\u0d02-\u0d03\u0d3e-\u0d44\u0d46-\u0d48\u0d4a-\u0d4d\u0d57\u0d62-\u0d63\u0d66-\u0d6f\u0d82-\u0d83\u0dca\u0dcf-\u0dd4\u0dd6\u0dd8-\u0ddf\u0df2-\u0df3\u0e31\u0e34-\u0e3a\u0e47-\u0e4e\u0e50-\u0e59\u0eb1\u0eb4-\u0eb9\u0ebb-\u0ebc\u0ec8-\u0ecd\u0ed0-\u0ed9\u0f18-\u0f19\u0f20-\u0f29\u0f35\u0f37\u0f39\u0f3e-\u0f3f\u0f71-\u0f84\u0f86-\u0f87\u0f8d-\u0f97\u0f99-\u0fbc\u0fc6\u102b-\u103e\u1040-\u1049\u1056-\u1059\u105e-\u1060\u1062-\u1064\u1067-\u106d\u1071-\u1074\u1082-\u108d\u108f-\u109d\u135d-\u135f\u1712-\u1714\u1732-\u1734\u1752-\u1753\u1772-\u1773\u17b4-\u17d3\u17dd\u17e0-\u17e9\u180b-\u180d\u1810-\u1819\u18a9\u1920-\u192b\u1930-\u193b\u1946-\u194f\u19b0-\u19c0\u19c8-\u19c9\u19d0-\u19d9\u1a17-\u1a1b\u1a55-\u1a5e\u1a60-\u1a7c\u1a7f-\u1a89\u1a90-\u1a99\u1b00-\u1b04\u1b34-\u1b44\u1b50-\u1b59\u1b6b-\u1b73\u1b80-\u1b82\u1ba1-\u1bad\u1bb0-\u1bb9\u1be6-\u1bf3\u1c24-\u1c37\u1c40-\u1c
49\u1c50-\u1c59\u1cd0-\u1cd2\u1cd4-\u1ce8\u1ced\u1cf2-\u1cf4\u1dc0-\u1de6\u1dfc-\u1dff\u200c-\u200d\u203f-\u2040\u2054\u20d0-\u20dc\u20e1\u20e5-\u20f0\u2cef-\u2cf1\u2d7f\u2de0-\u2dff\u302a-\u302f\u3099-\u309a\ua620-\ua629\ua66f\ua674-\ua67d\ua69f\ua6f0-\ua6f1\ua802\ua806\ua80b\ua823-\ua827\ua880-\ua881\ua8b4-\ua8c4\ua8d0-\ua8d9\ua8e0-\ua8f1\ua900-\ua909\ua926-\ua92d\ua947-\ua953\ua980-\ua983\ua9b3-\ua9c0\ua9d0-\ua9d9\uaa29-\uaa36\uaa43\uaa4c-\uaa4d\uaa50-\uaa59\uaa7b\uaab0\uaab2-\uaab4\uaab7-\uaab8\uaabe-\uaabf\uaac1\uaaeb-\uaaef\uaaf5-\uaaf6\uabe3-\uabea\uabec-\uabed\uabf0-\uabf9\ufb1e\ufe00-\ufe0f\ufe20-\ufe26\ufe33-\ufe34\ufe4d-\ufe4f\uff10-\uff19\uff3f]*$/

Note that this regex includes a check for ES5.1 reserved words, which aren’t allowed as identifiers. (I wrote this while working on mathiasbynens.be/notes/javascript-identifiers and mothereff.in/js-variables.)

But yeah, supporting Unicode categories in JS regex would be a very useful addition.

# Mark S. Miller (12 years ago)

On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich <brendan at mozilla.com> wrote: [...]

Why the global object? Because for many VMs, each global has its own heap or sub-heap ("compartment"), and all references outside that heap are to local proxies that copy from, or in the case of immutable data, reference the remote heap.

[...]

Is this true for same origin iframes? I have always assumed that mixing heaps between same origin iframes results in unmediated direct object-to-object access. If these are already mediated, what was the issue that drove us to that?

# Lasse Reichstein (12 years ago)

On Sun, Feb 19, 2012 at 12:12 PM, Mark S. Miller <erights at google.com> wrote:

On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich <brendan at mozilla.com> wrote: [...]

Why the global object? Because for many VMs, each global has its own heap or sub-heap ("compartment"), and all references outside that heap are to local proxies that copy from, or in the case of immutable data, reference the remote heap.

[...]

Is this true for same origin iframes? I have always assumed that mixing heaps between same origin iframes results in unmediated direct object-to-object access. If these are already mediated, what was the issue that drove us to that?

In V8, same origin contexts (or really, any contexts that might communicate in any way) live in the same heap. Originally, that meant anything running in the same process was in the same heap, but with isolates, there can be more heaps in the same process. You can still determine the origin of an object, to do any necessary security checks, but references to "foreign" objects are always plain pointers into the same heap.

If I have understood the description correctly, I believe Opera merges heaps from different frames if they start communicating, effectively putting them in the same heap. my.opera.com/core/blog/2009/12/22/carakan-revisited

# Wes Garland (12 years ago)

On 19 February 2012 03:33, Brendan Eich <brendan at mozilla.com> wrote:

ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say ;-).

Say, is that an onion on your belt?

  • indexing by characters, not uint16 storage units;
  • counting length as one greater than the last index; and

These are the two items that IME trip up developers who are either not careful or not aware of UTF-16 encoding details and don't test with non-BMP input. Frankly, JS developers should not have to be aware of character encodings. Strings should "just work".

I think that explicitly making strings Unicode and applying the fix above would solve a lot of problems. If I had this option, I would go so far as to throw the BRS in my build processes, hg grep all our source code for strings like D800 and eliminate all the extra UTF-16 machinations.

Another option might be to make ES.next have full Unicode strings; fix .length and .charCodeAt etc when we are in ES.next context, leaving them "broken" otherwise. I'm not fond of this option, though: since there would be no BRS, developers might often find themselves unsure of just what the heck it is they are working with.

So, I like per-global BRS.

  • supporting escapes with (up to) six hexadecimal digits.

This is necessary too; developers should be thinking about code points, not encoding details.

P2. The change is not backward compatible. In JS today, one can read a string s from somewhere and hard-code, e.g., s.indexOf("\ud800") to find part of a surrogate pair, then advance to the next-indexed uint16 unit and read the other half, then combine the two to compute some result. Such usage would break.

While that is true in the general case, there are many specific cases where that would not break. I'm thinking I have an implementation of UnicodeStrlen around here somewhere which works by subtracting the number of 0xD800 characters from .length. In this case, that code would continue to generate correct length counts because it would never find a 0xD800 in a valid Unicode string (it's a reserved code point).
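A sketch of that idea (my reconstruction of it): count one character per high surrogate rather than per uint16 unit.

// Count Unicode characters in a valid UTF-16-encoded JS string.
function UnicodeStrlen(s) {
  var n = s.length;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) n--;   // each pair counts once, not twice
  }
  return n;
}
UnicodeStrlen("\ud83d\ude38") == 1;   // true today

With the BRS thrown, charCodeAt of a valid string never yields a lone surrogate, so the loop subtracts nothing and the result equals .length, which under full Unicode is already the character count.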

We also wish to avoid exposing a "full Unicode" representation type and duplicated suite of the String static and prototype methods, as Java did. (We may well want UTF-N transcoding helpers; we certainly want ByteArray <-> UTF-8 transcoding APIs.)

These are both good goals, in particular, avoiding a "full Unicode" type means reducing bug counts in the long term.

Is there a proposal for interaction with JSON?

Also because inter-compartment traffic is (we conjecture) infrequent enough to tolerate the proxy/copy overhead.

Not to mention that the only thing you'd have to do is to tweak [[get]], charCodeAt and .length when crossing boundaries; you can keep the same backing store.

You might not even need to do this if the engine keeps the same backing store for both kinds of strings.

This means a script intent on comparing strings from two globals with different BRS settings could indeed tell that one discloses non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This is the small new observable I claim we can live with, because someone opted into it at least in one of the related global objects.

Funny question, if I have two strings, both "hello", from two globals with different BRS settings, are they ==? How about ===?

R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls from JS to (typically) C++ would have to proxy or copy any strings containing non-BMP characters. Strings with only BMP characters would work as today.

Is that true if the "full unicode" backing store is 16-bit code units using UTF-16 encoding? (Anyway, it's an implementation detail.)

In particular, Node.js can get modern at startup, and perhaps engines such as V8 as used in Node could even support compile-time (#ifdef) configury by which to support only full Unicode.

Sure, this is analogous to how SpiderMonkey deals with UTF-8 C Strings. Flip a BRS before creating the runtime. :)

# Brendan Eich (12 years ago)

Jussi Kalliokoski wrote:

I'm not sure what to think about this, being a big fan of UTF-8's simplicity. :)

UTF-8 is great, but it's a transfer format, perfect for C and other such systems languages (especially ones that use byte-wide char from old days). It is not appropriate for JS, which gives users a "One True String" (sorry for caps) primitive type that has higher-level "just Unicode" semantics. Alas, JS's "just Unicode" was from '96.

There are lots of transfer formats and character set encodings. Implementations could use many, depending on what chars a given string uses. E.g. ASCII + UTF-16, UTF-8 only as you suggest, other combinations. But this would all be under the hood, and at some cost to the engine as well as some potential (space, mostly) savings.

But anyhow, I like the idea of opt-in, actually so much that I started thinking, why not make JS be encoding-agnostic?

That is precisely the idea. Setting the BRS to "full Unicode" gives the appearance of 21 bits per character via indexing and length accounting. You'd have to spell non-BMP literal escapes via "\u{...}", no big deal.

What I mean here is that maybe we could have multi-charset Strings in JS?

Now you're saying something else. Having one agnostic higher-level "just Unicode" string type is one thing. That's JS's design goal, always has been. It does not imply adding multiple observable CSEs or UTFs that break the "just Unicode" abstraction.

If you can put a JS string in memory for low-level systems languages such as C to view, of course there are abstraction breaks. Engine APIs may or may not allow such views for optimizations. This is an issue, for sure, when embedding (e.g. V8 in Node). It's not a language design issue, though, and I'm focused on observables in the language because that is where JS currently fails by livin' in the '90s.

# Brendan Eich (12 years ago)

Axel Rauschmayer wrote:

es-discuss-only idea: could that BRS be made to carry more weight? Could it be a switch for all breaking ES.next changes?

What do you have in mind? It had better be important. We just had the breakthrough championed by dherman for "One JavaScript". Why make trouble by adding runtime semantic changes unduly?

# Axel Rauschmayer (12 years ago)

es-discuss-only idea: could that BRS be made to carry more weight? Could it be a switch for all breaking ES.next changes?

What do you have in mind? It had better be important. We just had the breakthrough championed by dherman for "One JavaScript". Why make trouble by adding runtime semantic changes unduly?

Two points:

  • IIRC, attributes such as onclick are not yet solved; the ideas proposed sounded like a BRS, so maybe the two solutions can be combined.
  • If we keep in the back of our minds the possibility of using the BRS as ES6 opt-in, while going forward with "One JavaScript" (1JS), both approaches can be tested in the wild. I’m mostly sold on 1JS, but a few doubts remain; trying out the BRS "clean cut" solution for Unicode will either allay or confirm those doubts.

# Brendan Eich (12 years ago)

Axel Rauschmayer wrote:

es-discuss-only idea: could that BRS be made to carry more weight? Could it be a switch for all breaking ES.next changes?

What do you have in mind? It had better be important. We just had the breakthrough championed by dherman for "One JavaScript". Why make trouble by adding runtime semantic changes unduly?

Two points:

  • IIRC, attributes such as onclick are not yet solved; the ideas proposed sounded like a BRS, so maybe the two solutions can be combined.

"One JavaScript" means there's nothing to solve for event handlers. (Previously, with version opt-in, the solution was Content-Script- Type [1], via a HTTP header or <meta http-equiv>).

Could you state the problem with an example?

Perhaps you mean the issue of 'let' at top level in prior scripts (or 'const'). I think we're all agreed that such bindings (as well as 'module' bindings at top level) must be visible in event handlers.

  • If we keep in the back of our minds the possibility of using the BRS as ES6 opt-in, while going forward with "One JavaScript" (1JS), both approaches can be tested in the wild. I’m mostly sold on 1JS, but a few doubts remain,

Namely?

We have to test whether extensions such as const and function-in-block can be broken, but Apple and Mozilla (at least) are covering that.

We shouldn't over-hedge or try doing two things less well, instead of one thing well.

trying out the BRS "clean cut" solution for Unicode will either allay or confirm those doubts.

Adding a BRS and then starting to hang other hats on it is a design mistake. When in doubt, leave it out. This proposal is only for Unicode. Of course we can consider other uses if the need truly arises, but we should not go looking for them right now.

/be

[1] www.w3.org/TR/html4/interact/scripts.html#h-18.2.2.1

# Brendan Eich (12 years ago)

Mark S. Miller wrote:

On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote: [...]

Why the global object? Because for many VMs, each global has its
own heap or sub-heap ("compartment"), and all references outside
that heap are to local proxies that copy from, or in the case of
immutable data, reference the remote heap. 

[...]

Is this true for same origin iframes? I have always assumed that mixing heaps between same origin iframes results in unmediated direct object-to-object access. If these are already mediated, what was the issue that drove us to that?

Not all engines mediate cross-same-origin-window accesses. I hear IE9+ may, indeed rumor is it remotes to another process sometimes (breaking run-to-completion a bit; something we should explore breaking in the future for window=vat). SpiderMonkey just recently (not sure if this is in a Firefox channel yet) went to compartment per global, for good savings once things were refactored to maximize sharing of internal immutables.

My R2 resolution is not specific to any engine, but I have hopes it can be accepted. It is concrete enough to help overcome large-yet-vague doubts about implementation impact (at least IMHO). Recall that document.domain setting may have to split a merged same-origin window/frame graph, at any time. Again implementation solutions vary, but this suggests cross-window mediation can be interposed lazily.

Another point: HTML5 WindowProxy (vs. Window, the global object on the scope chain) exists to solve navigation-away-from-same-origin security problems. Any JS that passes strings from one window to another must be using a WindowProxy to reference the other. There's a mediation point too.

# Brendan Eich (12 years ago)

Brendan Eich wrote:

My R2 resolution is not specific to any engine, but I have hopes it can be accepted. It is concrete enough to help overcome large-yet-vague doubts about implementation impact (at least IMHO). Recall that document.domain setting may have to split a merged same-origin window/frame graph, at any time. Again implementation solutions vary, but this suggests cross-window mediation can be interposed lazily.

Another point: HTML5 WindowProxy (vs. Window, the global object on the scope chain) exists to solve navigation-away-from-same-origin security problems. Any JS that passes strings from one window to another must be using a WindowProxy to reference the other. There's a mediation point too.

IOW, whether there's a heap/compartment per global is not critical (but if there is, then strings are already copied and could be transcoded as to their meta-data distinguishing non-BMP+UTF16-aka-full-Unicode from bad-ol'-90s-UCS2). The cross-window mediation via WindowProxy is.

The flag bits I mentioned could be combined: 1 flag bit for both non-BMP-chars-in-this-string-and-full-Unicode-BRS-setting-in-effect-for-its-global.

# David Bruant (12 years ago)

On 19/02/2012 09:33, Brendan Eich wrote:

(...)

How is the BRS configured? Again, not via a pragma, and not by imperative state update inside the language (mutating hidden BRS state at a given program point could leave strings created before mutation observably different from those created after, unless the implementation in effect scanned the local heap and wrapped or copied any non-BMP-char-bearing ones created before).

The obvious way to express the BRS in HTML is a <meta> tag in document <head>, but I don't want to get hung up on this point. I do welcome expert guidance. Here is another W3C/WHATWG interaction point. For this reason I'm cc'ing public-script-coord.

I'm not sure a <meta> is that obvious of a choice.

A bit of related experience about <meta>s:

At the end of 2011, I started a thread [1] about the meta referrer [2]. One use case for it would be to set the value to "never", declaring that, document-wise, no HTTP "referer" header should be sent when downloading a script/stylesheet/image or clicking a link/posting a form. An intervention by Simon Pieters [3] mentioned speculative parsing and the fact that resources may be fetched before the meta is read, hence leaking the referer even though the author's intention may have been that there be no leak. Since there seems to be no satisfying HTML-based solution for this, I suggested [4] that the server should express, when delivering the document, how the document should behave regarding sending referer headers. The discussion ended with Adam Barth agreeing [5] and planning to propose this for CSP 1.1. (That's how I learned about CSP [6].)

Unless I'm missing something, I think the same discussion can be had about the BRS being declared as a <meta>. Consider:

<script>
// some code that can observe the difference between BRS mode and non-BRS
</script>
<meta BRS>

Should the browser read all <meta>s before executing any script? Worse: what if an inline script does "document.write('<meta BRS>')"?

I think a CSP-like solution should be explored.

As a side note, after some time studying the Content Security Policy (CSP), I came to realize that it doesn't have to be related to security (though that's what motivated it in the first place) and could be considered a "Content Delivery Policy", offering space to break semantics and repair things that would be worth the cost of the opt-in (like the script execution rules or when the referer header is sent).

Worth exploring for the BRS or certainly a lot of other things.

David

[1] lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034275.html [2] wiki.whatwg.org/index.php?title=Meta_referrer [3] lists.whatwg.org/pipermail/whatwg-whatwg.org/2012-January/034283.html [4] lists.whatwg.org/pipermail/whatwg-whatwg.org/2012-January/034522.html [5] lists.whatwg.org/pipermail/whatwg-whatwg.org/2012-January/034523.html [6] dvcs.w3.org/hg/content-security-policy/raw-file/tip/csp-specification.dev.html

# Mark S. Miller (12 years ago)

On Sun, Feb 19, 2012 at 11:49 AM, Brendan Eich <brendan at mozilla.com> wrote: [...]

Not all engines mediate cross-same-origin-window accesses. I hear IE9+ may, indeed rumor is it remotes to another process sometimes (breaking run-to-completion a bit; something we should explore breaking in the future for window=vat). SpiderMonkey just recently (not sure if this is in a Firefox channel yet) went to compartment per global, for good savings once things were refactored to maximize sharing of internal immutables.

Other than the origin truncation issue that I am still confused about, what other benefits are there to mediating interframe access within the same origin?

My R2 resolution is not specific to any engine, but I have hopes it can be accepted. It is concrete enough to help overcome large-yet-vague doubts about implementation impact (at least IMHO). Recall that document.domain setting may have to split a merged same-origin window/frame graph, at any time. Again implementation solutions vary, but this suggests cross-window mediation can be interposed lazily.

How? By doing a full walk of the object graph and doing surgery on it? This sounds more painful than imposing mediation up front. But I'm still hoping that objects in same-origin iframes can communicate directly, without mediation.

# Brendan Eich (12 years ago)

Wes Garland wrote:

Is there a proposal for interaction with JSON?

From www.ietf.org/rfc/rfc4627, 2.5:

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair.  So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
Also because inter-compartment traffic is (we conjecture)
infrequent enough to tolerate the proxy/copy overhead.

Not to mention that the only thing you'd have to do is to tweak [[get]], charCodeAt and .length when crossing boundaries; you can keep the same backing store.

String methods are not generally self-hosted, so internal C++ vector access would need to change depending on the string's flag bit, in this implementation approach.

You might not even need to do this is the engine keeps the same backing store for both kinds of strings.

Yes, sharing the uint16 vector is good. But string methods would have to index and .length differently (if I can verb .length ;-).

This means a script intent on comparing strings from two globals
with different BRS settings could indeed tell that one discloses
non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This
is the *small* new observable I claim we can live with, because
someone opted into it at least in one of the related global objects.

Funny question, if I have two strings, both "hello", from two globals with different BRS settings, are they ==? How about ===?

Of course, strings with the same characters are == and ===. Strings appear to be values. If you think of them as immutable reference types there's still an obligation to compare characters for strings because computed strings are not intern'ed.
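A minimal illustration:

var a = "he" + "llo";   // computed at runtime, not necessarily interned
var b = "hello";
a == b && a === b;      // true -- equality compares characters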

R1. To keep compatibility with DOM APIs, the DOM glue used to
mediate calls from JS to (typically) C++ would have to proxy or
copy any strings containing non-BMP characters. Strings with only
BMP characters would work as today.

Is that true if the "full unicode" backing store is 16-bit code units using UTF-16 encoding? (Any way, it's an implementation detail)

Yes, because DOMString has intrinsic length and indexing notions and these must (pending any coordination with w3c) remain ignorant of the BRS and livin' in the '90s (DOM too emerged in the UCS-2 era).

# Brendan Eich (12 years ago)

Mark S. Miller wrote:

On Sun, Feb 19, 2012 at 11:49 AM, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote: [...]

Not all engines mediate cross-same-origin-window accesses. I hear
IE9+ may, indeed rumor is it remotes to another process sometimes
(breaking run-to-completion a bit; something we should explore
breaking in the future for window=vat). SpiderMonkey just recently
(not sure if this is in a Firefox channel yet) went to compartment
per global, for good savings once things were refactored to
maximize sharing of internal immutables.

Other than the origin truncation issue that I am still confused about,

Do you mean document.domain setting? That allows code in an origin to join its origin's super-domain (but not a dotless top level). See

www.w3.org/TR/2009/WD-html5-20090212/browsers.html#dom-document-domain

and

www.w3.org/TR/2009/WD-html5-20090212/browsers.html#effective-script-origin

what other benefits are there to mediating interframe access within the same origin?

The WindowProxy in HTML5 reflects a de-facto standard developed by browser implementors to avoid closure-survives-navigation-to-other-origin attacks. See

www.w3.org/TR/html5/browsers.html#the-windowproxy-object

Demons from the First Age included attacks that loaded a document containing a script defining a closure from evil.org into a subframe, then stuck a ref to the closure in the super-frame, then navigated the sub-frame to victim.com. Guess whose scope the closure saw, with only Window objects and no WindowProxy wrappers for the named (not implicit in identifier resolution) window/frame objects?

My R2 resolution is not specific to any engine, but I have hopes
it can be accepted. It is concrete enough to help overcome
large-yet-vague doubts about implementation impact (at least
IMHO). Recall that document.domain setting may have to split a
merged same-origin window/frame graph, at any time. Again
implementation solutions vary, but this suggests cross-window
mediation can be interposed lazily.

How? By doing a full walk of the object graph and doing surgery on it? This sounds more painful than imposing mediation up front.

No, by indirection, of course ;-). The details vary among browsers.

But I'm still hoping that objects in same-origin iframes can communicate directly, without mediation.

Why? Anyway, it's unsafe, wherefore WindowProxy. No big deal. There's no mediation for identifier resolution (i.e., scope chain lookup) and indeed JITting VMs optimize the heck out of local global accesses already.

# Brendan Eich (12 years ago)

David Bruant wrote:

On 19/02/2012 09:33, Brendan Eich wrote:

(...)

How is the BRS configured? Again, not via a pragma, and not by imperative state update inside the language (mutating hidden BRS state at a given program point could leave strings created before mutation observably different from those created after, unless the implementation in effect scanned the local heap and wrapped or copied any non-BMP-char-bearing ones creatd before).

The obvious way to express the BRS in HTML is a <meta> tag in document <head>, but I don't want to get hung up on this point. I do welcome expert guidance. Here is another W3C/WHATWG interaction point. For this reason I'm cc'ing public-script-coord.

I'm not sure a <meta> is that obvious of a choice.

Sure, guidance welcome as noted. I probably should have started with an HTTP header, but then authors may prefer to set it with <meta http-equiv...> which is verbose:

<meta http-equiv="ECMAScript-Full-Unicode" content="1" />

We can't have

<meta http-equiv="BRS" content="1" />

as BRS is too short and obscure. It's a good joke (should s/switch/button/ -- the big red button was the button Elmer Fudd warned Daffy Duck never to press in "Design for Leaving": www.youtube.com/watch?v=gms_NKzNLUs). Anyway, whatever the header name it will be a pain to type the full <meta> tag.

Unless I'm missing something, I think the same discussion can be had about the BRS being declared as a <meta>. Consider:

 <script>
 // some code that can observe the difference between BRS mode and non-BRS
 </script>
 <meta BRS>

Should the browser read all <meta>s before executing any script? Worse: what if an inline script does "document.write('<meta BRS>')"?

Since I was thinking of <meta http-equiv> (possibly with a short-hand), your example simply puts the <meta> out of order. It can't work, so it should not work (console warning traffic appropriate).

In mentioning <meta> I did not mean to exclude better ideas. Obviously a multi-window/frame app might want a Really Big Red Switch expressed in one place only. Ignoring Web Apps with manifest files, where would that place be? Hmm, CSP...

I think a CSP-like solution should be explored.

Good suggestion. I hope others on the lists are up-to-date on CSP.

# Brendan Eich (12 years ago)

Brendan Eich wrote:

the big red button was the button Elmer Fudd warned Daffy Duck never to press in "Design for Leaving": www.youtube.com/watch?v=gms_NKzNLUs

Got Elmer and Daffy reversed there -- getting old!

# Phillips, Addison (12 years ago)

Why would converting the existing UCS-2 support to be UTF-16 not be a good idea? There is nothing intrinsically wrong that I can see with that approach and it would be the most compatible with existing scripts, with no special "modes", "flags", or interactions.

Yes, the complexity of supplementary characters (i.e. non-BMP characters) represented as surrogate pairs must still be dealt with. It would also expose the possibility of invalid strings (with unpaired surrogates). But this would not be unlike other programming languages--or even ES as it exists today. The purity of a "Unicode string" would be watered down, but perhaps not fatally. The Java language went through this (yeah, I know, I know...) and seems to have emerged unscathed. Norbert has a lovely doc about the choices that led to this, which seems useful to consider: [1]. The W3C I18N Core WG has a wiki page, shared with TC39 a while ago, here: [2].
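As to unpaired surrogates, such invalid strings are already expressible in ES today, for example:

var lone = "\ud800";          // unpaired high surrogate
lone.length == 1;             // accepted; nothing enforces pairing
"\udc00\ud800".length == 2;   // a reversed pair is accepted too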

To me, switching to UTF-16 seems like a relatively small, containable, non-destructive change to allow supplementary character support. It's not a pure as a true code-point based "Unicode string" solution. But purity isn't everything.

What am I missing?

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG) <--- hat is OFF in this message

Internationalization is not a feature. It is an architecture.

[1] java.sun.com/developer/technicalArticles/Intl/Supplementary [2] www.w3.org/International/wiki/JavaScriptInternationalization

# Brendan Eich (12 years ago)

Anne van Kesteren wrote:

On Sun, 19 Feb 2012 21:29:48 +0100, David Bruant <bruant.d at gmail.com> wrote:

I think a CSP-like solution should be explored.

FWIW, the feedback on CORS (CSP-like) thus far has been that it's quite hard to set up custom headers.

I've heard this for years, can believe it in old-school big-company settings, but have a not-to-be-shattered hope that with Node.js etc. it is easier for content authors to configure headers. Go on, break my heart!

So for something as commonly used as JavaScript I'm not sure we'd want to require that. And although more difficult, if we want <meta> it can be made to work; it's just more complicated than simply defining a name and a value. But maybe it should be something simpler, e.g.

<html unicode>

in the top-level browsing context's document.

That's pretty but is it misleading? This is the big-red-switch-for-JS, not for the whole doc. In particular what is the Content-Type, with what charset parameter, and how does this attribute interact? Perhaps it's just misnamed.

What are libraries supposed to do by the way, check the length of "😁" and adjust code accordingly?

Most JS libraries (I'd love to see counterexamples) do not process surrogate pairs at all. They too live in the '90s.

As far as the DOM and Web IDL are concerned, I think we would need two definitions for "code unit". One that means 16-bit code unit and one that means "Unicode code unit"

I'm not a Unicode expert but I believe the latter is called "character".

or some such. Looking at dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#characterdata the rest should follow quite naturally.

What happens with surrogate code points in these new strings? I think we do not want to change that each unit is an integer of some kind and can be set to any value. And if that is the case, will it hold values greater than U+10FFFF?

JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.

The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.

# David Bruant (12 years ago)

On 19/02/2012 22:57, Anne van Kesteren wrote:

On Sun, 19 Feb 2012 21:29:48 +0100, David Bruant <bruant.d at gmail.com> wrote:

I think a CSP-like solution should be explored.

FWIW, the feedback on CORS (CSP-like) thus far has been that it's quite hard to set up custom headers.

Do you have a reference for this feedback? Under which circumstances is it hard? One major annoyance I see with HTTP headers is that I have never heard of a hosting service that lets you choose the HTTP headers your service is served with, and that's problematic. <meta http-equiv> is of some help to provide the feature without control over the HTTP response, but in some cases we want the HTTP header to mean something document-wise, and a <meta> can be too late.

So for something as commonly used as JavaScript I'm not sure we'd want to require that. And although more difficult, if we want <meta> it can be made to work, it's just more complicated than simply defining a name and a value. But maybe it should be something simpler, e.g.

<html unicode>

in the top-level browsing context's document.

I'm not sure it solves anything since a script could be the first thing an HTML renderer comes across, even before a doctype, even before an <html> starting tag.

My guess would be that the HTML spec defines that this script should be executed even if the "<script>" opening tag constitutes the first bytes of the document.

# Boris Zbarsky (12 years ago)

On 2/19/12 3:31 PM, Mark S. Miller wrote:

Other than the origin truncation issue that I am still confused about, what other benefits are there to mediating interframe access within the same origin?

In Gecko's case, at least, there are certain benefits to garbage collection, memory locality, memory accounting, faster determination of an object's effective origin, etc.

The important part being the separate per-frame heaps; the mediation is just a consequence.

# Brendan Eich (12 years ago)

Phillips, Addison wrote:

Why would converting the existing UCS-2 support to be UTF-16 not be a good idea? There is nothing intrinsically wrong that I can see with that approach and it would be the most compatible with existing scripts, with no special "modes", "flags", or interactions.

Allen proposed this, essentially, last year (some confusion surrounded the discussion, mixing observable-in-language concerns with encoding/format/serialization issues, leading to talk of 32-bit characters). As I wrote in the o.p., this led to two objections: big implementation hit; incompatible change.

I tackled the second with the BRS and (in detail) mediation across DOM window boundaries. This I believe takes the sting out of the first (lesser implementation change in light of existing mediation at those boundaries).

Yes, the complexity of supplementary characters (i.e. non-BMP characters) represented as surrogate pairs must still be dealt with.

I'm not sure what you mean. JS today allows (ignoring invalid pairs) such surrogates but they count as two indexes and add two to length, not one. That is the first problem to fix (ignoring literal escape-notation expressiveness).

It would also expose the possibility of invalid strings (with unpaired surrogates).

That problem exists today.

But this would not be unlike other programming languages--or even ES as it exists today.

Right! We should do better. As I noted, Node.js heavy hitters (mranney of Voxer) testify that they want full Unicode, not what's specified today with indexing and length-accounting by uint16 storage units.

The purity of a "Unicode string" would be watered down, but perhaps not fatally. The Java language went through this (yeah, I know, I know...) and seems to have emerged unscathed.

Java's dead on the client. It is used by botnets (bugzilla.mozilla.org recently suffered a DDOS from one, the bad guys didn't even bother changing the user-agent from the default one for the Java runtime). See Brian Krebs' blog.

Norbert has a lovely doc here about the choices that lead to this, which seems useful to consider: [1]. W3C I18N Core WG has a wiki page shared with TC39 awhile ago here: [2].

To me, switching to UTF-16 seems like a relatively small, containable, non-destructive change to allow supplementary character support.

I still don't know what you mean. How would what you call "switching to UTF-16" differ from today, where one can inject surrogates into literals by transcoding from an HTML document or .js file CSE?

In particular, what do string indexing and .length count, uint16 units or characters?

# Allen Wirfs-Brock (12 years ago)

On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:

Anne van Kesteren wrote:

...

As far as the DOM and Web IDL are concerned, I think we would need two definitions for "code unit". One that means 16-bit code unit and one that means "Unicode code unit"

I'm not a Unicode expert but I believe the latter is called "character".

Me neither, but I believe the correct term is "code point" which refers to the full 21-bit code while "Unicode character" is the logical entity corresponding to that code point. That usage of "character" is different from the current usage within ECMAScript where "character" is what we call the elements of the vector of 16-bit numbers that are used to represent a String value. You can access them as string values of length 1 via [ ] or as numeric values via the charCodeAt method.

or some such. Looking at dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#characterdata the rest should follow quite naturally.

What happens with surrogate code points in these new strings? I think we do not want to change that each unit is an integer of some kind and can be set to any value. And if that is the case, will it hold values greater than U+10FFFF?

JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.

The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.

I think your names for the BRS modes are misleading. What you call "UTF-16" actually manifests itself to the ES programmer as UTF-32, as each index position within a string corresponds to an unencoded Unicode code point. There are no visible UTF-16 surrogate pairs, even if the implementation is internally using a UTF-16 encoding.

Similarly, "UCS-2" as currently implemented actually manifests itself to the ES programmer as UTF-16 because implementations turn non-BMP string literal characters into UTF-16 surrogate pairs that visibly occupy two index positions.

# Brendan Eich (12 years ago)

Allen Wirfs-Brock wrote:

On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:

I'm not a Unicode expert but I believe the latter is called "character".

Me neither, but I believe the correct term is "code point" which refers to the full 21-bit code while "Unicode character" is the logical entity corresponding to that code point. That usage of "character" is different from the current usage within ECMAScript where "character" is what we call the elements of the vector of 16-bit numbers that are used to represent a String value. You can access them as string values of length 1 via [ ] or as numeric values via the charCodeAt method.

Thanks. We have a confusing transposition of terms between Unicode and ECMA-262, it seems. Should we fix?

JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.

The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.

I think your names for the BRS modes are misleading.

You got me, in fact I used "full Unicode" for the BRS-thrown setting elsewhere.

My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.

# Brendan Eich (12 years ago)

Brendan Eich wrote:

Mark S. Miller wrote:

On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote: [...]

Why the global object? Because for many VMs, each global has its
own heap or sub-heap ("compartment"), and all references outside
that heap are to local proxies that copy from, or in the case of
immutable data, reference the remote heap.

[...]

Is this true for same origin iframes? I have always assumed that mixing heaps between same origin iframes results in unmediated direct object-to-object access. If these are already mediated, what was the issue that drove us to that?

Not all engines mediate cross-same-origin-window accesses.

Sorry, I misused "mediate" here to mean heap/compartment isolation. All engines in browsers that conform to HTML5 must mediate cross-frame Window (global object) accesses via WindowProxy, as discussed in other followups.

I hear IE9+ may; indeed, rumor is it remotes to another process sometimes (breaking run-to-completion a bit; something we should explore breaking in the future for window=vat).

(Hope that parenthetical aside has you charged up -- we need a fresh thread on that topic, though... ;-)

# Allen Wirfs-Brock (12 years ago)

On Feb 19, 2012, at 2:44 PM, Brendan Eich wrote:

Allen Wirfs-Brock wrote:

On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:

I'm not a Unicode expert but I believe the latter is called "character".

Me neither, but I believe the correct term is "code point" which refers to the full 21-bit code while "Unicode character" is the logical entity corresponding to that code point. That usage of "character" is different from the current usage within ECMAScript where "character" is what we call the elements of the vector of 16-bit numbers that are used to represent a String value. You can access them as string values of length 1 via [ ] or as numeric values via the charCodeAt method.

Thanks. We have a confusing transposition of terms between Unicode and ECMA-262, it seems. Should we fix?

The ES5.1 spec is ok because it always uses (as defined in section 6) the term "Unicode character" when it means exactly that and uses "character" when talking about the elements of String values. It says that both "code unit" and "character" refer to a 16-bit unsigned value.

Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an ES "character" corresponds to a "code unit" or a "code point".

JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.

The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.

I think your names for the BRS modes are misleading.

You got me, in fact I used "full Unicode" for the BRS-thrown setting elsewhere.

My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.

A fine implementation, but not observable. Another implementation approach that would preserve O(1) indexing would be to simply have two or three different internal string representations with 1, 2, or 4 byte internal characters. (You can automatically pick the needed character size when the string is created because strings are immutable and created with their value.) A not-quite O(1) approach would segment strings into substring spans using such a representation. Representation choice probably depends a lot on what you think are the most common use cases. If it is string processing in JS then a fast representation is probably what you want to choose. If it is just passing text that is already UTF-8 or UTF-16 encoded from inputs to outputs then a representation that minimizes transcoding would probably be a higher priority.
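A minimal sketch of the width-selection idea (illustrative only; the function name is made up):

    // Pick the narrowest element width (in bytes) for an immutable string's
    // backing store, given its code points. This works because strings are
    // created whole and never mutated afterward.
    function pickElementWidth(codePoints) {
      var width = 1;
      for (var i = 0; i < codePoints.length; i++) {
        if (codePoints[i] > 0xFFFF) return 4; // non-BMP forces 4-byte elements
        if (codePoints[i] > 0xFF) width = 2;  // beyond Latin-1 needs 2 bytes
      }
      return width;
    }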

# Brendan Eich (12 years ago)

Allen Wirfs-Brock wrote:

On Feb 19, 2012, at 2:44 PM, Brendan Eich wrote:

Thanks. We have a confusing transposition of terms between Unicode and ECMA-262, it seems. Should we fix?

The ES5.1 spec is ok because it always uses (as defined in section 6) the term "Unicode character" when it means exactly that and uses "character" when talking about the elements of String values. It says that both "code unit" and "character" refer to a 16-bit unsigned value.

That is still pretty confusing. I hope we can stop abusing "character" by overloading it in ECMA-262 in this way.

Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an ES "character" corresponds to a "code unit" or a "code point".

Yes, and we might rather have a different word on that basis too.

How about "character element"? "Element" to capture indexing as the means of accessing the thing in question.

# Brendan Eich (12 years ago)

Trimming to es-discuss.

Brendan Eich wrote:

How about "character element"? "Element" to capture indexing as the means of accessing the thing in question.

Or avoid the c-word altogether via "string element" or "string indexed property"? Latter's too long but you see what I mean.

# Allen Wirfs-Brock (12 years ago)

On Feb 19, 2012, at 3:18 PM, Brendan Eich wrote:

Allen Wirfs-Brock wrote: ...

Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an ES "character" corresponds to a "code unit" or a "code point".

Yes, and we might rather have a different word on that basis too.

How about "character element"? "Element" to capture indexing as the means of accessing the thing in question.

I generally try to use "element" in informal contexts for exactly that reason. However, shouldn't it be "string element" and we could let "character" and "Unicode character" mean the same thing?

# Mark S. Miller (12 years ago)

On Sun, Feb 19, 2012 at 1:52 PM, Brendan Eich <brendan at mozilla.com> wrote: [...]

How? By doing a full walk of the object graph and doing surgery on it?

This sounds more painful than imposing mediation up front.

No, by indirection, of course ;-). The details vary among browsers.

I think we're just having a terminology problem. To me, such indirection is mediation.

# Brendan Eich (12 years ago)

Mark S. Miller wrote:

On Sun, Feb 19, 2012 at 1:52 PM, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote: [...]

    How? By doing a full walk of the object graph and doing
    surgery on it? This sounds more painful than imposing
    mediation up front.


No, by indirection, of course ;-). The details vary among browsers.

I think we're just having a terminology problem. To me, such indirection is mediation.

Definitely I was unclear. The (different) mediation by WindowProxy is good because local global (oxymoronic, ugh -- let's say "this global") accesses are unmediated and can be super-optimized.

The mediation by trust-label indirection I was referring to above is for-all-accesses. That is painful.

A compartment per global makes the process of finding the trust-label (called a "Principal" in Gecko) significantly faster.

# Brendan Eich (12 years ago)

Allen Wirfs-Brock wrote:

On Feb 19, 2012, at 3:18 PM, Brendan Eich wrote:

Allen Wirfs-Brock wrote: ...

Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an ES "character" corresponds to a "code unit" or a "code point".

Yes, and we might rather have a different word on that basis too.

How about "character element"? "Element" to capture indexing as the means of accessing the thing in question.

I generally try to use "element" in informal contexts for exactly that reason. However, shouldn't it be "string element" and we could let "character" and "Unicode character" mean the same thing?

Yes, see my es-discuss+you followup -- should have measured twice and cut once.

I like this much better than anything overloading "character".

To hope to make this sideshow beneficial to all the cc: list, what do DOM specs use to talk about uint16 units vs. code points?

# Cameron McCormack (12 years ago)

Brendan Eich:

To hope to make this sideshow beneficial to all the cc: list, what do DOM specs use to talk about uint16 units vs. code points?

I say "code unit" as a shorter way of saying "16 bit unsigned integer code unit"

dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit

(which DOM4 also links to) and then just "code point" to refer to 21 bit numbers that might correspond to a Unicode character, which you can see used in

dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode

# Allen Wirfs-Brock (12 years ago)

On Feb 19, 2012, at 1:34 PM, Brendan Eich wrote:

Wes Garland wrote:

Is there a proposal for interaction with JSON?

From www.ietf.org/rfc/rfc4627, 2.5:

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

I think it is actually more complex than just the above. 2.5 also says:

"All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)." (emphasis added)

and 3. says:

"JSON text SHALL be encoded in Unicode. The default encoding is UTF-8." and then goes on to talk about how to detect UTF-8, 16, and 32 LE and BE encodings. So all those are legal.

It is presumably up to a JSON parser to decide how non-BMP characters in strings are encoded for whatever internal representation it is targeting. Currently JS JSON.parse takes its input from a JavaScript string that is composed of 16-bit UCS-2 elements so there are no unencoded non-BMP characters in the string. However, according to the ES5.1 spec, JSON.parse (and JSON.stringify) will just pass through any UTF-16 surrogate pairs that are encountered.

With the BRS, JSON.parse and JSON.stringify could encounter non-BMP characters in the JS string it is processing and those also would presumably pass through transparently. The one requirement of rfc 4627 that would be impacted by the BRS would be the 12-character escape sequences mentioned above. Currently JSON.parse implementations encode those as UTF-16 surrogate pairs in the generated strings. If the BRS is flipped, the rfc seems to require that they generate a single string element. Because the JSON.stringify spec does not escape anything other than control characters, any non-BMP characters it encounters would pass through unencoded. This implies that JSON.parse input of the form "\uD834\uDD1E" would probably round trip back out via JSON.stringify as a JSON string containing the single unencoded G clef character. Logically equivalent but not the identical JSON text.
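The round trip Allen describes can be checked against today's (pre-BRS) behavior; only the lengths would differ with the BRS flipped:

    // ES5 behavior: the 12-character escape parses to a surrogate pair.
    var clef = JSON.parse('"\\uD834\\uDD1E"'); // U+1D11E, the G clef
    clef.length === 2;    // true today; would be 1 with the BRS flipped
    JSON.stringify(clef); // the clef passes through unescaped, so the
                          // original 12-character escape does not round-trip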

# Mark Davis ☕ (12 years ago)

First, it would be great to get full Unicode support in JS. I know that's been a problem for us at Google.

Secondly, while I agree with Addison that the approach that Java took is workable, it does cause problems. Ideally someone would be able to loop (a very common construct) with:

for (codepoint cp : someString) { doSomethingWith(cp); }

In Java, you have to do:

    int cp;
    for (int i = 0; i < someString.length(); i += Character.charCount(cp)) {
        cp = someString.codePointAt(i);
        doSomethingWith(cp);
    }
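(For comparison, a sketch of the same loop in ES, using the codePointAt-style accessor discussed later in this thread -- hypothetical at this point, so treat it as pseudocode:)

    for (var i = 0; i < someString.length; ) {
      var cp = someString.codePointAt(i); // hypothetical accessor
      doSomethingWith(cp);
      i += cp > 0xFFFF ? 2 : 1;           // skip both halves of a pair
    }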

There are good reasons for why Java did what it did, basically for compatibility. But if there is some way that JS can work around those, that'd be great.

  1. There's some confusion about the Unicode terminology. Here's a quick clarification:

code point: number from 0 to 0x10FFFF

character: a code point that is assigned. Eg, 0x61 represents 'a' and is a character. 0x378 is a code point, but not (yet) a character.

code unit: an encoding 'chunk'. UTF-8 represents a code point as 1-4 8-bit code units; UTF-16 represents a code point as 1 or 2 16-bit code units; UTF-32 represents a code point as 1 32-bit code unit.


Mark plus.google.com/114199149796022210033 — Il meglio è l'inimico del bene ("The best is the enemy of the good")

# Phillips, Addison (12 years ago)

Mark wrote:

First, it would be great to get full Unicode support in JS. I know that's been a problem for us at Google.

AP> +1: I think we’ve waited for supplementary character support long enough!

Secondly, while I agree with Addison that the approach that Java took is workable, it does cause problems.

AP> The tension is between “compatibility” and “ease of use” here, I think. The question is whether very many scripts depend on the ‘uint16’ nature of a character in ES, use surrogates to effect supplementary character support, or are otherwise tied to the existing encoding model and are broken as a result of changes. In its ideal form, an ES string would logically be a sequence of Unicode characters (code points) and only the internal representation would worry about whatever character encoding scheme made the most sense (in many cases, this might actually be UTF-16).

AP> … but what I think is hard to deal with are different modes of processing scripts depending on “fullness of the Unicode inside”. Admittedly, the approach I favor is rather conservative and presents a number of challenges, most notably in adapting regex or for users who want to work strictly in terms of character values.

There are good reasons for why Java did what it did, basically for compatibility. But if there is some way that JS can work around those, that'd be great.

AP> Yes, it would.

~Addison

# Gavin Barraclough (12 years ago)

On Feb 19, 2012, at 3:13 PM, Allen Wirfs-Brock wrote:

My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.

A fine implementation, but not observable. Another implementation approach that would preserve O(1) indexing would be to simply have two or three different internal string representations with 1, 2, or 4 byte internal characters. (You can automatically pick the needed character size when the string is created because strings are immutable and created with their value.) A not-quite O(1) approach would segment strings into substring spans using such a representation. Representation choice probably depends a lot on what you think are the most common use cases. If it is string processing in JS then a fast representation is probably what you want to choose. If it is just passing text that is already UTF-8 or UTF-16 encoded from inputs to outputs then a representation that minimizes transcoding would probably be a higher priority.

One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element. How do we want to handle unpaired UTF-16 surrogates in a full-unicode string? I can see three options:

  1. Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
  2. Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
  3. Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.

It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner. Option 3 would seem to violate that, exposing the underlying UTF-16 implementation. It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings, in that currently for all strings s1 & s2: s1.length + s2.length == (s1 + s2).length. However, if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
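This is easy to check under today's semantics, which behave like option 2:

    var s1 = "\uD800", s2 = "\uDC00";
    (s1 + s2).length === s1.length + s2.length; // true today (option 2)
    // Under option 3, the pair would fuse and (s1 + s2).length would be 1.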

I guess I wonder if it's worth considering either options 1) or 2) – either prohibiting invalid unicode characters in strings, or consider something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).

# Brendan Eich (12 years ago)

Cameron McCormack wrote:

Brendan Eich:

To hope to make this sideshow beneficial to all the cc: list, what do DOM specs use to talk about uint16 units vs. code points?

I say "code unit" as a shorter way of saying "16 bit unsigned integer code unit"

dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit

(which DOM4 also links to) and then just "code point" to refer to 21 bit numbers that might correspond to a Unicode character, which you can see used in

dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode

Well then, you are one up on ECMA-262, and from Mark Davis's message using canonical Unicode terms. We shall strive to align terms.

Here's another q for the DOM folks and others using WebIDL: is extending the DOM and other specs to respect the BRS and support full Unicode conceivable from where you sit? Desirable? Thanks.

# Brendan Eich (12 years ago)

Gavin Barraclough wrote:

One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element. How do we want to handle unpaired UTF-16 surrogates in a full-unicode string? I can see three options:

  1. Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
  2. Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
  3. Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.

It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner. Option 3 would seem to violate that, exposing the underlying UTF-16 implementation. It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings, in that currently for all strings s1 & s2: s1.length + s2.length == (s1 + s2).length. However, if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.

I guess I wonder if it's worth considering either options 1) or 2) – either prohibiting invalid unicode characters in strings, or consider something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).

Great post. I agree 3 is not good. I was thinking based on today's exchanges that the BRS being set to "full Unicode" could mean that "\uXXXX" is illegal and you must use "\u{...}" to write Unicode code points (not code units).

Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?
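For concreteness, the two escape forms differ like so (a sketch; the pair expansion under the default setting is how ES6 eventually specified "\u{...}"):

    // Default ("UCS-2") setting: a non-BMP "\u{...}" escape expands to a
    // surrogate pair, indistinguishable from writing the two code units.
    "\u{1D11E}" === "\uD834\uDD1E"; // U+1D11E, the G clef
    "\u{1D11E}".length === 2;       // two uint16 storage units
    // With the BRS set, the proposal is that "\uD834\uDD1E" be rejected
    // and "\u{1D11E}" denote a single string element of length 1.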

# Allen Wirfs-Brock (12 years ago)

On Feb 19, 2012, at 6:54 PM, Gavin Barraclough wrote:

On Feb 19, 2012, at 3:13 PM, Allen Wirfs-Brock wrote:

My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.

A fine implementation, but not observable. Another implementation approach that would preserve O(1) indexing would be to simply have two or three different internal string representations with 1, 2, or 4 byte internal characters. (You can automatically pick the needed character size when the string is created because strings are immutable and created with their value.) A not-quite O(1) approach would segment strings into substring spans using such a representation. Representation choice probably depends a lot on what you think are the most common use cases. If it is string processing in JS then a fast representation is probably what you want to choose. If it is just passing text that is already UTF-8 or UTF-16 encoded from inputs to outputs then a representation that minimizes transcoding would probably be a higher priority.

One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element. How do we want to handle unpaired UTF-16 surrogates in a full-unicode string? I can see three options:

  1. Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
  2. Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
  3. Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.

It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner. Option 3 would seem to violate that, exposing the underlying UTF-16 implementation. It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings, in that currently for all strings s1 & s2: s1.length + s2.length == (s1 + s2).length. However, if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.

I guess I wonder if it's worth considering either options 1) or 2) – either prohibiting invalid unicode characters in strings, or consider something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).

I think 2) is the only reasonable alternative.

I don't think 1) would be a very good choice, if for no other reason than that the set of valid unicode characters is a moving target that you wouldn't want to hardwire into either the ES specification or implementations.

More importantly, some applications require "string processing" of strings containing invalid unicode characters. In particular, any sort of transcoder between character sets requires this. If you want to take a full unicode string, convert it to UTF-16, and then output it, you may generate intermediate strings with elements that contain individual high and low surrogate codes. If you were transcoding to a non-Unicode character set, any value might be possible.

I really don't think any Unicode semantics should be built into the basic string representation. We need to decide on a max element size, and Unicode motivates 21 bits, but it could be 32 bits. Personally, I've lived through enough address space exhaustion episodes in my career to be skeptical of "small" values like 2^21 being good enough for the long term.

# Bill Frantz (12 years ago)

On 2/19/12 at 21:45, allen at wirfs-brock.com (Allen Wirfs-Brock) wrote:

I really don't think any Unicode semantics should be built into the basic string representation. We need to decide on a max element size, and Unicode motivates 21 bits, but it could be 32 bits. Personally, I've lived through enough address space exhaustion episodes in my career to be skeptical of "small" values like 2^21 being good enough for the long term.

Can we future-proof any limit an implementation may choose by saying that all characters whose code point is too large for a particular implementation must be replaced by an "invalid character" code point (which fits into the implementation's representation size) on input? An implementation which chooses 21 bits as the size will become obsolete when Unicode characters that need 22 bits are defined. However, it will still work with characters that fit in 21 bits, and will do something rational with ones that do not. Users who need characters in the over-21-bit set will be encouraged to upgrade.
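A sketch of Bill's suggestion, assuming a hypothetical implementation whose element size tops out at 21 bits:

    // Replace code points too large for this implementation with U+FFFD,
    // the Unicode replacement character, at input boundaries.
    function clampToImplementationLimit(cp) {
      return cp > 0x10FFFF ? 0xFFFD : cp;
    }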

Cheers - Bill



# Allen Wirfs-Brock (12 years ago)

On Feb 19, 2012, at 7:52 PM, Brendan Eich wrote:

Gavin Barraclough wrote:

One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element. How do we want to handle unpaired UTF-16 surrogates in a full-unicode string? I can see three options:

  1. Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
  2. Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
  3. Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.

It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner. Option 3 would seem to violate that, exposing the underlying UTF-16 implementation. It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings, in that currently for all strings s1 & s2: s1.length + s2.length == (s1 + s2).length. However, if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.

I guess I wonder if it's worth considering either options 1) or 2) – either prohibiting invalid unicode characters in strings, or consider something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).

Great post. I agree 3 is not good. I was thinking based on today's exchanges that the BRS being set to "full Unicode" could mean that "\uXXXX" is illegal and you must use "\u{...}" to write Unicode code points (not code units).

Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?

I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements. Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents, you may end up with string elements containing upper or lower half surrogates. Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.

What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code. If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode. That might be a good thing.

# Gavin Barraclough (12 years ago)

On Feb 19, 2012, at 10:05 PM, Allen Wirfs-Brock wrote:

Great post. I agree 3 is not good. I was thinking based on today's exchanges that the BRS being set to "full Unicode" could mean that "\uXXXX" is illegal and you must use "\u{...}" to write Unicode code points (not code units).

Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?

I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements. Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents, you may end up with string elements containing upper or lower half surrogates. Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.

What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code. If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode. That might be a good thing.

Ah, this is a good point. I was going to ask whether it would be inconsistent to deprecate \uXXXX but not \xXX, since both could just be considered shorthand for \u{...}, but this is a good practical reason why it matters more for \uXXXX (and I can imagine there may be complaints if we take \xXX away!).

So, just to clarify:

    var s1 = "\u{0d800}\u{0dc00}";
    var s2 = String.fromCharCode(0xd800) + String.fromCharCode(0xdc00);
    s1.length === 2; // true
    s2.length === 2; // true
    s1 === s2; // true

Does this sound like the expected behavior?

Also, what would happen to String.fromCharCode?

  1. Leave this unchanged, it would continue to truncate the input with ToUint16?

  2. Change its behavior to allow any code point (maybe switch to ToUint32, or ToInteger, and throw a RangeError for input > 0x10FFFF?).

  3. Make it sensitive to the state of the corresponding global object's BRS.

If we were to leave it unchanged, using ToUint16, then I guess we would need a new String.fromCodePoint function, to be able to create strings for non-BMP characters? Presumably we would then want a new String.codePointAt function, for symmetry? This would also raise a question of what String.charCodeAt should return for code points outside of the Uint16 range – should it return the actual value, or ToUint16 of the code point to mirror the truncation performed by fromCharCode?

I guess my preference here would be to go with option 3 – tie the potentially breaking change to the BRS, but no need for new interface.

# Wes Garland (12 years ago)

On 20 February 2012 00:45, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

  2. Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.

I think 2) is the only reasonable alternative.

I think so, too -- especially as any sequence of Unicode code points -- including invalid and reserved code points -- constitutes a valid Unicode string, according to my recollection of the Unicode specification.

In addition to the reasons you listed, it should also be noted that

    1. it is cheaper to implement; and
    2. it keeps more old code working; ignoring the examples where developers use String as uint16[], there are also the cases where developers scan strings for 0xD800, which is a reserved code point.

I don't think 1) would be a very good choice, if for no other reason than that the set of valid unicode characters is a moving target that you wouldn't want to hardwire into either the ES specification or implementations.

To play the devil's advocate, I could point out that the spec language could say something about reserved code points. Those code points are reserved because, IIRC, they are not representable in UTF-16; they include the ranges for the surrogate pairs.

# Wes Garland (12 years ago)

On 19 February 2012 16:34, Brendan Eich <brendan at mozilla.com> wrote:

Wes Garland wrote:

Is there a proposal for interaction with JSON?

From www.ietf.org/rfc/rfc4627, 2.5

snip - so the proposal is to keep encoding JSON in UTF-16. What happens if the BRS is set to Unicode and we want to encode the string "\uD834\uDD1E" -- the Unicode string which contains two reserved code points? We do not want to deserialize this as U+1D11E.

I think we should consider that BRS-on should mean six-character escapes in JSON for non-BMP characters. It might even be possible to add matching support for JSON.parse() when BRS-off. The one caveat is that might make JSON interchange fragile between BRS-on systems and ES5 engines.

Yes, sharing the uint16 vector is good. But string methods would have to index and .length differently (if I can verb .length ;-).

.lengthing is easy; cost is about the same as strlen() and can be cached. Indexed access is something I have thought about from the implementor's POV for a while [but not heavily]. I haven't come up with a ground-breaking technique; I keep coming up with something that looks like a lookup table for surrogate pairs, degrading to an extra uint32[] when there are many of them. Anyhow, implementation detail.

Of course, strings with the same characters are == and ===. Strings appear to be values. If you think of them as immutable reference types there's still an obligation to compare characters for strings because computed strings are not intern'ed.

What about strings with the same sequence of code units but different code points? They would have identical backing stores if the backing store were either UTF-8 or uint32. This can happen if we have BRS-on Strings which contain non-BMP code points. (Actually, does BRS-on mean that we have to abandon UTF-16 to store Unicode strings containing invalid code points? Mark Davis, are you reading?)

How about strings which are considered equal by Unicode but which do not share the same representation? Will Unicode normalization be performed when Strings are created/parsed? On comparison? If on compare, would we skip normalization for ===?

I assume normalizing to NFC form, similar to what W3C does, is the target?

www.macchiato.com/unicode/nfc-faq (Mark Davis) unicode.org/faq/normalization.html

# Andrew Oakley (12 years ago)

Most content actually only tries to access characters of a string like this:

for (var i = 0; i < str.length; i++) { str[i]; }

While a naive implementation using UTF-8-encoded strings would be O(n^2), if the previous lookup result were cached it would be possible to achieve reasonably fast O(n) behaviour on such a loop. It feels like some kind of iterator would be more efficient, but I don't think iterators would "feel right" in ECMAScript.
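A rough sketch of the caching idea (a hypothetical helper; shown over today's UTF-16 storage, but the same trick applies to UTF-8): remember the last (character index, storage offset) pair so a sequential loop costs O(n) overall:

    function CodePointCursor(str) {
      this.str = str;
      this.lastIndex = 0;  // last character index looked up
      this.lastOffset = 0; // storage offset of that character
    }
    CodePointCursor.prototype.at = function (i) {
      var s = this.str, idx = this.lastIndex, off = this.lastOffset;
      if (i < idx) { idx = 0; off = 0; } // the cache only helps forward walks
      while (idx < i && off < s.length) {
        var c = s.charCodeAt(off), d;
        if (c >= 0xD800 && c <= 0xDBFF && off + 1 < s.length) {
          d = s.charCodeAt(off + 1);
          off += (d >= 0xDC00 && d <= 0xDFFF) ? 2 : 1; // skip a whole pair
        } else {
          off += 1;
        }
        idx++;
      }
      this.lastIndex = idx;
      this.lastOffset = off;
      var c0 = s.charCodeAt(off);
      if (c0 >= 0xD800 && c0 <= 0xDBFF && off + 1 < s.length) {
        var c1 = s.charCodeAt(off + 1);
        if (c1 >= 0xDC00 && c1 <= 0xDFFF)
          return (c0 - 0xD800) * 0x400 + (c1 - 0xDC00) + 0x10000;
      }
      return c0; // BMP character, or an unpaired surrogate
    };

Calling cursor.at(i) for i = 0, 1, 2, ... resumes each lookup from the previous offset, so the whole loop stays linear.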

You can encode unmatched surrogates in UTF-8 (although they may have to be removed before the string is passed to the browser DOM code), so it may be possible to simply always encode strings in UTF-8, allowing for much simpler sharing of strings between code that wants UTF-8 support and code that is using the old model, at the expense of more complex behaviour where UTF-16 surrogates are referenced.

Issues only arise in code that tries to treat a string as an array of 16-bit integers, and I don't think we should be particularly bothered by performance of code which misuses strings in this fashion (but clearly this should still work without opt-in to new string handling).

I think this is a nicer and more flexible model than string representations being dependent on which heap they came from - all issues related to encoding can be contained in the String object implementation.

While this is being discussed, for any new string handling I think we should make any invalid strings (according to the rules in Unicode) cause some kind of exception on creation.

# Wes Garland (12 years ago)

On 20 February 2012 09:56, Andrew Oakley <andrew at ado.is-a-geek.net> wrote:

While this is being discussed, for any new string handling I think we should make any invalid strings (according to the rules in Unicode) cause some kind of exception on creation.

Can you clarify which definition in the Unicode standard you are proposing for "invalid string"?

Most content actually only tries to access characters of a string like this:

for (var i = 0; i < str.length; i++) { str[i]; }

Does anybody have any data on this? I'm genuinely curious about how much code "on the web" does any kind of character access on strings; the only common use-case that comes to mind (other than wanting uint16[]) is users who are doing UTF-16 on top of UCS-2.

# Brendan Eich (12 years ago)

Allen Wirfs-Brock wrote:

Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?

I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements.

I agree, which is why I'm saying with the BRS set, we should forbid "\uXXXX" since that is not a code point but rather a code unit.

Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents, you may end up with string elements containing upper or lower half surrogates.

I don't agree in the case of "\u{00d800}". That's simply an illegal code point, not a code unit (upper or lower half). We can reject it statically.

 Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.

True, but not my point!

What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code.

And arising from concatenations, avoiding the loss of Gavin's distributive .length property.

If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode. That might be a good thing.

My point! ;-)

# Brendan Eich (12 years ago)

Gavin Barraclough wrote:

What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code. If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode. That might be a good thing.

Ah, this is a good point. I was going to ask whether it would be inconsistent to deprecate \uXXXX but not \xXX, since both could just be considered shorthand for \u{...}, but this is a good practical reason why it matters more for \uXXXX (and I can imagine there may be complaints if we take \xXX away!).

Yes. "\xXX" is innocuous, since ISO 8859-1 is a proper subset of Unicode and can't be used to forge surrogate pair halves.

So, just to clarify, var s1 = "\u{0d800}\u{0dc00}"; var s2 = String.fromCharCode(0xd800) + String.fromCharCode(0xdc00); s1.length === 2; // true s2.length === 2; // true s1 === s2; // true Does this sound like the expected behavior?

Rather, I'd statically reject the invalid code points.

Also, what would happen to String.fromCharCode?

BRS makes 21-bit chars, so just as String.prototype.charCodeAt returns a code point, String.fromCharCode takes actual code point arguments.

Again I'd reject (dynamically in the case of String.fromCharCode) any in [0xd800, 0xdfff]. Other code points that are not characters I'd let through to future-proof, but not these reserved ones. Also any > 0x10ffff.
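A minimal sketch of those checks (a hypothetical BRS-on fromCharCode; the construction step approximates the result with today's surrogate pairs, since no engine implements the BRS):

    function checkedFromCharCode(cp) {
      if (typeof cp !== "number" || cp !== Math.floor(cp) ||
          cp < 0 || cp > 0x10FFFF)
        throw new RangeError("not a code point: " + cp);
      if (cp >= 0xD800 && cp <= 0xDFFF)
        throw new RangeError("reserved surrogate code point");
      if (cp <= 0xFFFF) return String.fromCharCode(cp);
      cp -= 0x10000; // pair encoding, only because today's engines require it
      return String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    }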

  1. Leave this unchanged, it would continue to truncate the input with ToUint16?

No, that violates the BRS intent.

  2. Change its behavior to allow any code point (maybe switch to ToUint32, or ToInteger, and throw a RangeError for input > 0x10FFFF?).

The last.

  3. Make it sensitive to the state of the corresponding global object's BRS.

In any event, yes: this. The BRS is a switch: you can think of it as swapping in the other String implementation, or as a flag tested within one shared String implementation whose methods use if statements (which could be messy but would work).

We should specify carefully the identity or lack of identity of myGlobal.String and otherGlobalWithPossiblyDifferentBRSSetting.String, etc. Consider this one-line .html file:

<iframe src="javascript:alert(parent.String === String)"/>

I get false from Chrome, Firefox and Safari, as expected. So the BRS could swap in another String, or simply mutate hidden state associated with the global in question (as mentioned in my previous post, globals keep track of the original values of their built-ins' prototypes, so implementations could put the BRS in String or String.prototype too, and use random logic instead of separate objects).

If we were to leave it unchanged, using ToUInt16, then I guess we would need a new String.fromCodePoint function, to be able to create strings for non-BMP characters?

This goes against the BRS design and falls down the Java slippery slope. We want one set of standard methods, extended from 16- to 21-bit chars, er, code points.

I guess my preference here would be to go with option 3 – tie the potentially breaking change to the BRS, but no need for new interface.

Definitely! That's perhaps unclear in my o.p. but I made a to-do out of rejecting Java and keeping the duplicate methods or hidden if statements under the "implementation hood" ("bonnet" for you ;-).

# Brendan Eich (12 years ago)

Andrew Oakley wrote:

Issues only arise in code that tries to treat a string as an array of 16-bit integers, and I don't think we should be particularly bothered by performance of code which misuses strings in this fashion (but clearly this should still work without opt-in to new string handling).

This is all strings in JS and the DOM, today.

That is, we do not have any measure of code that treats strings as uint16s, forges strings using "\uXXXX", etc., but the ES and DOM specs have allowed this for > 14 years. Based on bitter experience, it's likely that if we change by fiat to 21-bit code points from 16-bit code units, some code on the Web will break.

And as noted in the o.p. and in the thread based on Allen's proposal last year, browser implementations definitely count on representation via array of 16-bit integers, with length property or method counting same.

Breaking the Web is off the table. Breaking implementations, less so. I'm not sure why you bring up UTF-8. It's good for encoding and decoding, but for JS, unlike C, we want strings to be a high-level "full Unicode" abstraction. Not bytes with bits optionally set indicating more bytes follow to spell code points.

I think this is a nicer and more flexible model than string representations being dependent on which heap they came from - all issues related to encoding can be contained in the String object implementation.

You're ignoring the compatibility break here. Browser vendors can't afford to do that.

While this is being discussed, for any new string handling I think we should make any invalid strings (according to the rules in Unicode) cause some kind of exception on creation.

This is future-hostile if done for all code points. If done only for the code points in [D800,DFFF] both for literals using "\u{...}" and for constructive methods such as String.fromCharCode, then I agree.

# Allen Wirfs-Brock (12 years ago)

On Feb 20, 2012, at 8:20 AM, Brendan Eich wrote:

Allen Wirfs-Brock wrote:

Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?

I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements.

I agree, which is why I'm saying with the BRS set, we should forbid "\uXXXX" since that is not a code point but rather a code unit.

Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents you may end up with string elements containing upper or lower half surrogates.

I don't agree in the case of "\u{00d800}". That's simply an illegal code point, not a code unit (upper or lower half). We can reject it statically.

quoting:

On Feb 20, 2012, at 4:19 AM, Wes Garland wrote:

I think so, too -- especially as any sequence of Unicode code points -- including invalid and reserved code points -- constitutes a valid Unicode string, according to my recollection of the Unicode specification.

For the moment, I'll simply take Wes' word for the above, as it logically makes sense. For some uses, you want to process all possible code points (for example, when validating data from an external source). At this lowest level you don't want to impose higher level Unicode semantic constraints:

   if (stringFromElseWhere.indexOf("\u{d800}") >= 0) ....

Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.

True, but not my point!

but elsewhere you said you would reject String.fromCharCode(0xd800)

so it sounds to me like you are trying to actually ban the occurrence of 0xd800 as the value of a string element.

What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code.

And arising from concatenations, avoiding the loss of Gavin's distributive .length property.

These aren't the same thing.

"\ud8000\udc00" is a specific syntactic construct where there must have been a specific user intent in writing it. Our legacy problem is that the intent becomes ambiguous when that same sequence might be interpreted under different BRS settings.

str1 + str2 is much less specific and all we know at runtime (assuming either str1 or str2 is a string) is that the user wants to concatenate them. The values might be:

    str1 = String.fromCharCode(0xd800);
    str2 = String.fromCharCode(0xdc00);

and the user might be intentionally constructing a string containing an explicit UTF-16 encoding that is going to be passed off to an external agent that specifically requires UTF-16.

Another way to express what I see as the problem with what you are proposing about imposing such string semantics:

Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggesting for ES strings? My sense is that if we went down the path you are suggesting, such an implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.

# Brendan Eich (12 years ago)

Allen Wirfs-Brock wrote:

For the moment, I'll simply take Wes' word for the above, as it logically makes sense. For some uses, you want to process all possible code points (for example, when validating data from an external source). At this lowest level you don't want to impose higher level Unicode semantic constraints:

    if (stringFromElseWhere.indexOf("\u{d800}") >= 0) ....

Sorry, I disagree. We have a chance to keep Strings consistent with "full Unicode", or broken into uint16 pieces. There is no self-consistent third way that has 21-bit code points but allows one to jam what up until now have been both code points and code units into code points, where they will be misinterpreted.

If someone wants to do data hacking, Binary Data (Typed Arrays) are there (even in IE10pp).

 Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.

True, but not my point!

but elsewhere you said you would reject String.fromCharCode(0xd800)

I'm being consistent (I hope!). I'd reject "\uXXXX" altogether with the BRS set. It's ambiguous at best, or (I argue, and you argue some of the time) it means code units, not code points. We're doing points now, not units, with the BRS set, so it has to go.

Same goes for constructive APIs taking (with the BRS set) code points. I see nothing but mischief arising from allowing [D800-DFFF]. Unicode gurus should school us if there's a use-case that can be sanely composed with "full Unicode" and "code points, not units" iteration.

so it sounds to me like you are trying to actually ban the occurrence of 0xd800 as the value of a string element.

Under the BRS set to "full Unicode", as a code point, yes.

What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code. And arising from concatenations, avoiding the loss of Gavin's distributive .length property.

These aren't the same thing.

"\ud8000\udc00" is a specific syntactic construct where there must have been a specific user intent in writing it.

(One too many 0s there.)

We do not want to guess. All I know is that "\ud800\udc00" means what it means today in ECMA-262 and conforming implementations. With the BRS set to "full Unicode", it could be taken to mean two code points, but that results in invalid Unicode and is not backward compatible. It could be read as one code point but that is what "\u{...}" is for and we want anyone migrating such "hardcoded" code into the BRS to check and choose.

Our legacy problem is that the intent becomes ambiguous when that same sequence might be interpreted under different BRS settings.

I propose to solve that by forbidding "\uXXXX" when the BRS is set.

str1 + str2 is much less specific and all we know at runtime (assuming either str1 or str2 is a string) is that the user wants to concatenate them. The values might be:
    str1 = String.fromCharCode(0xd800);
    str2 = String.fromCharCode(0xdc00);

and the user might be intentionally constructing a string containing an explicit UTF-16 encoding that is going to be passed off to an external agent that specifically requires UTF-16.

Nope, cuz I'm proposing String.fromCharCode calls such as those throw.

We should not be making more type-confusion hazards just to play a guessing game that might (but probably won't) preserve some edge-case "hardcoded" surrogate hacking that exists in code on the Web or behind a firewall today. Such code can do what it has always done, unless and until its maintainer throws the BRS. At that point early and runtime errors will provoke rewrite to "\u{...}", and with fromCharCode etc., 21-bit code points that are not reserved for surrogates.

Another way to express what I see as the problem with what you are proposing about imposing such string semantics:

Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggesting for ES strings? My sense is that if we went down the path you are suggesting, such an implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.

If you mean a metacircular evaluator, I don't think so. Can you show a counterexample?

If you mean a UTF-transcoder, then yes: binary data / typed arrays are required. That's the right answer.
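For illustration, a tiny UTF-8 encoder over typed arrays (one code point to bytes; a sketch, not a full transcoder):

    function utf8Encode(cp) {
      if (cp < 0x80)    return new Uint8Array([cp]);
      if (cp < 0x800)   return new Uint8Array([0xC0 | cp >> 6,
                                               0x80 | cp & 0x3F]);
      if (cp < 0x10000) return new Uint8Array([0xE0 | cp >> 12,
                                               0x80 | cp >> 6 & 0x3F,
                                               0x80 | cp & 0x3F]);
      return new Uint8Array([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                             0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F]);
    }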

# Brendan Eich (12 years ago)

Allen Wirfs-Brock wrote:

I really don't think any Unicode semantics should be built into the basic string representation. We need to decide on a max element size, and Unicode motivates 21 bits, but it could be 32 bits. Personally, I've lived through enough address space exhaustion episodes in my career to be skeptical of "small" values like 2^21 being good enough for the long term.

This does not seem justified to me as a future-proofing step. Instead, it invites my corollary to Postel's Law:

"If you are liberal in what you accept, others will utterly fail to be conservative in what they send."

to bite us, hard.

We do not want implementations today to accept non-Unicode code points under the BRS (also [D800-DFFF], IMHO). If tomorrow or on April 5, 2063 when Vulcans arrive to make first contact, we need 32 bits, we can be liberal then. Old implementations will choke on Vulcan, Klingon, etc., but so they should! They cannot do better, and simply need to be upgraded.

OTOH if we are too liberal now, people will stuff non-Unicode code points into strings and it will be up to a receiving peer on the Internet to make it right (or wrong). Receiver-makes-it-wrong failed in the 80s RPC wars.

Postel's law is not about allowing unknown new bits to flow into containers. It is about unexpected combinations at higher message and header/field levels. Note that the IP protocol had to pick 4-byte addresses, and IPv6 could not be foreseen or usefully future-proofed by using wider fields without specific rules governing the use of the extra bytes.

# Allen Wirfs-Brock (12 years ago)

On Feb 20, 2012, at 10:52 AM, Brendan Eich wrote:

Allen Wirfs-Brock wrote: ...

Another way to express what I see as the problem with what you are proposing about imposing such string semantics:

Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggesting for ES strings? My sense is that if we went down the path you are suggesting, such an implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.

If you mean a metacircular evaluator, I don't think so. Can you show a counterexample?

If you mean a UTF-transcoder, then yes: binary data / typed arrays are required. That's the right answer.

Not necessarily metacircular... it could be support for any language that imposes different semantic rules on string elements.

You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data) could not use ES strings as the target representation of its string data type. It also could not use the built-in ES string functions in the implementation of language X's built-in functions. It could not leverage any optimizations that an ES engine may apply to strings and string functions. Also, values of X's string type cannot be directly passed in foreign calls to ES functions. Etc.

# Gavin Barraclough (12 years ago)

On Feb 20, 2012, at 8:37 AM, Brendan Eich wrote:

BRS makes 21-bit chars, so just as String.prototype.charCodeAt returns a code point, String.fromCharCode takes actual code point arguments.

Again I'd reject (dynamically in the case of String.fromCharCode) any in [0xd800, 0xdfff]. Other code points that are not characters I'd let through to future-proof, but not these reserved ones. Also any > 0x10ffff.

Okay, gotcha – so to clarify, once the BRS is thrown, it should be impossible to create a string in which any individual element is an unassigned code point (e.g. an unpaired UTF-16 surrogate) – all string elements should be valid Unicode characters, right? (or maybe a slightly weaker form of this, all string elements must be code points in the ranges 0...0xD7FF or 0xE000...0x10FFFF?).

Implementations that use uint16 vectors as the character data representation type for both "UCS-2" and "UTF-16" string variants would probably want another flag bit per string header indicating whether, for the UTF-16 case, the string indeed contained any non-BMP characters. If not, no proxy/copy needed.

If I understand your original proposal, you propose that UCS-2 strings coming from other sources be proxied to be iterated by Unicode characters (e.g. if the DOM returns a string containing the code units "\uD800\uDC00" then JS code executing in a context with the BRS set will see this as having length 1, right?) If so, do you propose any special handling for access to unassigned Unicode code points in UCS-2 strings returned from the DOM (or accessed from another global object, where the BRS is not set)?

e.g.

var ucs2d800 = foo(); // get a string containing "\uD800" from the DOM, or another global object in BRS=off mode
var ucs2dc00 = bar(); // get a string containing "\uDC00" from the DOM, or another global object in BRS=off mode
var a = ucs2d800[0];
var b = ucs2d800.charCodeAt(0);
var c = ucs2d800 + ucs2dc00;
var c0 = c.charCodeAt(0);
var c1 = c.charCodeAt(1);

If the proxy is to behave as if the UCS-2 string has been converted to a valid Unicode string, then I'm guessing that conversion should have converted the unmatched surrogates in the UCS-2 into Unicode replacement characters? If so, the length of c in the above example would be 2, and the values c0 & c1 would be 0xFFFD?
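
In code form, the conversion I'm guessing at would give (hypothetical values, assuming lone surrogates become U+FFFD; not confirmed behavior):

var c = ucs2d800 + ucs2dc00; // each lone surrogate proxied as U+FFFD
c.length == 2;
c.charCodeAt(0) == 0xFFFD;
c.charCodeAt(1) == 0xFFFD;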

# Brendan Eich (12 years ago)

Allen Wirfs-Brock wrote:

On Feb 20, 2012, at 10:52 AM, Brendan Eich wrote:

Allen Wirfs-Brock wrote: ...

Another way to express what I see as the problem with the string semantics you are proposing:

Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggesting for ES strings? My sense is that if we went down the path you are suggesting, such an implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.

If you mean a metacircular evaluator, I don't think so. Can you show a counterexample?

If you mean a UTF-transcoder, then yes: binary data / typed arrays are required. That's the right answer.

Not necessarily, metacircular...it could be support for any language that imposes different semantic rules on string elements.

In that case, binary data / typed arrays, definitely.

You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data)

First, as a point of order: yes, JS strings as full Unicode does not want stray surrogate pair-halves. Does anyone disagree?

Second, binary data / typed arrays stand ready for any such not-full-Unicode use-cases.

could not use ES strings as the target representation of its string data type. It also could not use the built-in ES string functions in the implementation of language X's built-in functions.

Not if this hypothetical source language being compiled to JS wants other than full Unicode, no.

Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.

It could not leverage any optimizations that an ES engine may apply to strings and string functions.

Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.

Also, values of X's string type cannot be directly passed in foreign calls to ES functions. Etc.

Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.

# Allen Wirfs-Brock (12 years ago)

On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:

Allen Wirfs-Brock wrote:

...

You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data) First, as a point of order: yes, JS strings as full Unicode does not want stray surrogate pair-halves. Does anyone disagree?

Well, I'm disagreeing. Do you know of any other language that has imposed these sorts of semantic restrictions on runtime string data?

Second, binary data / typed arrays stand ready for any such not-full-Unicode use-cases.

But they lack the same level of utility function support, not the least of which is RegExp.

could not use ES strings as the target representation of its string data type. It also could not use the built-in ES string functions in the implementation of language X's built-in functions.

Not if this hypothetical source language being compiled to JS wants other than full Unicode, no.

Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.

My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.

It could not leverage any optimizations that an ES engine may apply to strings and string functions.

Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.

There are a lot of reasons why ES strings are not a good backing representation for C/C++ strings (to the extent that there even is a C string data type). But there are also lots of high-level languages that do not have those sorts of mapping issues.

If typed arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.

Also, values of X's string type cannot be directly passed in foreign calls to ES functions. Etc.

Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.

But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unpaired surrogates are present.

Probably not such a big deal because it isn't using JS strings as its representation, but as hypothesized above that wouldn't necessarily be the case for other languages.

# Wes Garland (12 years ago)

On 20 February 2012 16:00, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.

To pick one out of a hat, it might be nice to be able to use non-Unicode encodings, like GB 18030 or BIG5, and be able to use regexp methods on them when the BRS is on. (I'm struggling to find a really real real-world use-case, though)

Observation -- disallowing otherwise "legal" Unicode strings because they contain code points d800-dfff has very concrete implementation benefits: it's possible to use UTF-16 to represent the String's backing store. Without this concession, I fear it may not be possible to implement BRS-on without using a UTF-8 or full code point backing store (or some non-standard invention).

Maybe the answer is to consider (shudder) adding String-like utility functions to the TypedArrays? FWIW, CommonJS tried to go down this path and it turned out to be a lot of work for very little benefit (if any).

But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unpaired surrogates are present.

Only if the C strings are wide-character strings. 8-bit char strings are fine, they map right onto Latin-1 in native Unicode as well as the UTF-16 and UCS-2 encodings.

# Brendan Eich (12 years ago)

Allen Wirfs-Brock wrote:

On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:

Allen Wirfs-Brock wrote:

... You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not confirm to your rules (for example, by allowing occurrences of surrogate code points within string data) First, as a point of order: yes, JS strings as full Unicode does not want stray surrogate pair-halves. Does anyone disagree?

Well, I'm disagreeing. Do you know of any other language that has imposed these sorts of semantic restrictions on runtime string data?

Sure, Java:

  String

public String(int[] codePoints, int offset, int count)

Allocates a new String that contains characters from a subarray of
the Unicode code point array argument. The offset argument is the
index of the first code point of the subarray and the count argument
specifies the length of the subarray. The contents of the subarray
are converted to chars; subsequent modification of the int array
does not affect the newly created string.

*Parameters:*
    codePoints - array that is the source of Unicode code points.
    offset - the initial offset.
    count - the length.
*Throws:*
    IllegalArgumentException
    <http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IllegalArgumentException.html> -
    if any invalid Unicode code point is found in codePoints
    IndexOutOfBoundsException
    <http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html> -
    if the offset and count arguments index characters outside the
    bounds of the codePoints array.
*Since:*
    1.5

Second, binary data / typed arrays stand ready for any such not-full-Unicode use-cases.

But lacks the same level of utility function support, not the least of which is RegExp

RegExp is miserable for Unicode, it's true. That doesn't strike me as compelling for making full-Unicode strings more bug-prone.

There is a strong case to be made for evolving RegExp to be usable with certain typed arrays (byte, uint16 at least). But that's another thread.

We should beef up RegExp Unicode escapes; another 'nother thread.

could not use ES strings as the target representation of its string data type. It also could not use the built-in ES string functions in the implementation of language X's built-in functions. Not if this hypothetical source language being compiled to JS wants other than full Unicode, no.

Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.

My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.

If that's true then we should have enough evidence that I'll happily concede the point and the spec will allow "\uD800" etc. in BRS-enabled literals. I do not see such evidence.

It could not leverage any optimizations that an ES engine may apply to strings and string functions. Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.

There are a lot of reasons why ES strings are not a good backing representation for C/C++ strings (to the extent that there even is a C string data type). But there are also lots of high-level languages that do not have those sorts of mapping issues.

Let's name some:

Java: see above. There may be some legacy need to support invalid Unicode but I'm not seeing it right now. Anyone?

Python: docs.python.org/release/3.0.1/howto/unicode.html#the-string-type -- lots here, similar to Ruby 1.9 (see below) but not obviously in need of invalid Unicode stored in uint16 vectors handled as JS strings.

Ruby: Ruby supports strings with multiple encodings; the encoding is part of the string's metadata. I am not the expert here, but I found

www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n

helpful, and the more recent

yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails

too. See also these very interesting posts from Sam Ruby in 2007:

intertwingly.net/blog/2007/12/28/3-1-2, intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated

Ruby raises exceptions when you mix two strings with different encodings incorrectly, e.g. by concatenation.

I'm not sure about surrogate validation, but from all this I gather that compiling Ruby to JS in full would need Uint8Array, along with lots of runtime helpers that do not come for free from JS's String.prototype methods, in order to handle all the encodings.

If typed arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.

Yes, that's probably true. We'll keep feedback coming from Emscripten users and experts.

Also, values of X's string type cannot be directly passed in foreign calls to ES functions. Etc. Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.

But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unpaired surrogates are present.

Your "it" there would be "Emscripten [would have to censor]"? I don't think so, or: I do not agree that "censor" is an apt description -- it seems loaded by implying something censorious is needed where without the error I'm proposing for [D800-DFFF], no censoring action would be needed.

ISO C says sizeof(char) == 1, so byte strings / string constants are either ISO 8859-1 and cannot form surrogates when zero-extended to 16 or 21 bits, or they're in some character set that needs more involved transcoding but again cannot by itself create surrogates.

C wide strings vary by platform. On some platforms wchar_t is 32 bits.

In any event, Emscripten currently does not use JS strings at all in its code generation (only internally in its JS-hosted libc).

Probably not such a big deal because it isn't using JS strings as its representation, but as hypothesized above that wouldn't necessarily be the case for other languages.

I don't see it. I may have missed it in my survey of Java, Python and Ruby. Please let me know if so.

# Norbert Lindenberg (12 years ago)

As Brendan's link indicates, JSON is specified by RFC 4627, not by the ECMAScript Language Specification. JSON is widely used for data exchange with and between systems that have nothing to do with ECMAScript and the proposed BRS - see the middle section of www.json.org

So the only thing that can (and must) be done if and when updating the ECMAScript Language Specification for the BRS is to update the JSON section (15.12 in ES5) to describe how to map from the existing JSON syntax to the new BRS-on String representation. Note that JSON.stringify doesn't create Unicode escapes for anything other than control characters (presumably those identified in RFC 4627).
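
Presumably that mapping would join well-formed escaped surrogate pairs at parse time, along these lines (my reading of how 15.12 might be updated, not settled behavior):

var s = JSON.parse('"\\uD834\\uDD1E"'); // U+1D11E spelled as a pair, per RFC 4627
// BRS-off today: s.length == 2
// BRS-on, with the pair joined during parsing: s.length == 1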

Norbert

# Allen Wirfs-Brock (12 years ago)

On Feb 20, 2012, at 3:14 PM, Brendan Eich wrote:

Allen Wirfs-Brock wrote:

On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:

Allen Wirfs-Brock wrote:

... You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data) First, as a point of order: yes, JS strings as full Unicode does not want stray surrogate pair-halves. Does anyone disagree?

Well, I'm disagreeing. Do you know of any other language that has imposed these sorts of semantic restrictions on runtime string data? Sure, Java:

String

public String(int[] codePoints, int offset, int count)

Allocates a new String that contains characters from a subarray of the Unicode code point array argument. The offset argument is the index of the first code point of the subarray and the count argument specifies the length of the subarray. The contents of the subarray are converted to chars; subsequent modification of the int array does not affect the newly created string.

Parameters: codePoints - array that is the source of Unicode code points. offset - the initial offset. count - the length. Throws: IllegalArgumentException (docs.oracle.com/javase/1.5.0/docs/api/java/lang/IllegalArgumentException.html) - if any invalid Unicode code point is found in codePoints. IndexOutOfBoundsException (docs.oracle.com/javase/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html) - if the offset and count arguments index characters outside the bounds of the codePoints array. Since: 1.5

Note that the above says "invalid Unicode code point". 0xd800 is a valid Unicode code point. It isn't a valid Unicode character.

See docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isValidCodePoint(int)

Determines whether the specified code point is a valid Unicode code point value in the range of 0x0000 to 0x10FFFF inclusive. This method is equivalent to the expression: codePoint >= 0x0000 && codePoint <= 0x10FFFF

Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.

My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.

If that's true then we should have enough evidence that I'll happily concede the point and the spec will allow "\uD800" etc. in BRS-enabled literals. I do not see such evidence.

Note my concern isn't so much about literals as it is about string elements created via String.fromCharCode.

The only String.prototype method algorithms that seem to have any Unicode dependencies are toLowerCase/toUpperCase and the locale variants of those methods, and perhaps localeCompare, trim (which knows the Unicode white space character classification), and the regular expression based methods if the regexp is constructed with literal chars or uses character classes.

The concat, slice, substring, indexOf/lastIndexOf, and non-regexp based replace and split calls are all defined in terms of string element value comparisons and don't really care about what character set is used.
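
A small illustration of the point, under today's 16-bit semantics; the element values here are arbitrary units, not meaningful Unicode characters:

var s = String.fromCharCode(0x8136, 0xD830, 0x8136); // arbitrary 16-bit element values
s.indexOf(String.fromCharCode(0xD830)) == 1;         // pure element-value comparison
s.split(String.fromCharCode(0xD830)).length == 2;    // likewise, no Unicode interpretation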

Wes Garland mentioned the possibility of using non-Unicode character sets such as Big5.

It could not leverage any optimizations that an ES engine may apply to strings and string functions. Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.

There are a lot of reasons why ES strings are not a good backing representation for C/C++ strings (to the extent that there even is a C string data type). But there are also lots of high-level languages that do not have those sorts of mapping issues.

Let's name some:

Java: see above. There may be some legacy need to support invalid Unicode but I'm not seeing it right now. Anyone?

See above: it allows all Unicode code points; it does not restrict strings to well-formed UTF-16 encodings.

Python: docs.python.org/release/3.0.1/howto/unicode.html#the-string-type -- lots here, similar to Ruby 1.9 (see below) but not obviously in need of invalid Unicode stored in uint16 vectors handled as JS strings.

I don't see any restrictions in that doc on strings containing \ud800 and friends. Unless there are, BRS-enabled ES strings couldn't be used as the representation type for Python strings.

The actual representation type used by the conventional Python implementation isn't yet clear to me, but clearly it supports many character encodings besides Unicode: docs.python.org/library/codecs.html#standard-encodings

Ruby: Ruby supports strings with multiple encodings; the encoding is part of the string's metadata. I am not the expert here, but I found

www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n

helpful, and the more recent

yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails

too. See also these very interesting posts from Sam Ruby in 2007:

intertwingly.net/blog/2007/12/28/3-1-2, intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated

Ruby raises exceptions when you mix two strings with different encodings incorrectly, e.g. by concatenation.

I'm not sure about surrogate validation, but from all this I gather that compiling Ruby to JS in full would need Uint8Array, along with lots of runtime helpers that do not come for free from JS's String.prototype methods, in order to handle all the encodings.

If typed arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.

But, using current 16-bit JS string semantics, a JS string could still be used as the character store for many of these encodings, with the metadata stored separately (probably in a RubyString wrapper object), and the charset-insensitive JS string methods could be used to implement the Ruby semantics.

BRS excluding surrogate codes would at the very least require additional special case handling when dealing with Ruby strings containing those code points.

Yes, that's probably true. We'll keep feedback coming from Emscripten users and experts.

Also, values of X's string type cannot be directly passed in foreign calls to ES functions. Etc. Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.

But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unpaired surrogates are present.

Your "it" there would be "Emscripten [would have to censor]"? I don't think so, or: I do not agree that "censor" is an apt description -- it seems loaded by implying something censorious is needed where without the error I'm proposing for [D800-DFFF], no censoring action would be needed.

Yes, I meant the Emscripten runtime "foreign" call support for calling JS functions. I did mean censor in that sense. Assume that you want to automatically convert WCHAR* strings to a JS string to pass as an argument to such calls. Today, without the BRS, you can just form a JS string containing all the WCHARs without analyzing the UTF-16 well-formedness of the C string. With the BRS flipped you would at the very least have to make sure it is well-formed UTF-16 and either throw or remove any unpaired surrogates.
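
For concreteness, a minimal sketch of that check; the function name and the throw policy are mine, and it assumes the BRS-on String.fromCharCode that accepts whole code points:

function toBRSString(units) { // units: 16-bit values read out of a WCHAR* string
  var out = "";
  for (var i = 0; i < units.length; i++) {
    var u = units[i];
    if (u >= 0xD800 && u <= 0xDBFF && i + 1 < units.length &&
        units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
      // well-formed pair: combine into one supplementary code point
      out += String.fromCharCode(((u - 0xD800) << 10) + (units[i + 1] - 0xDC00) + 0x10000);
      i++;
    } else if (u >= 0xD800 && u <= 0xDFFF) {
      throw new RangeError("unpaired surrogate at index " + i); // or append "\uFFFD" instead
    } else {
      out += String.fromCharCode(u);
    }
  }
  return out;
}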

ISO C says sizeof(char) == 1, so byte strings / string constants are either ISO 8859-1 and cannot form surrogates when zero-extended to 16 or 21 bits, or they're in some character set that needs more involved transcoding but again cannot by itself create surrogates.

C wide strings vary by platform. On some platforms wchar_t is 32 bits.

In any event, Emscripten currently does not use JS strings at all in its code generation (only internally in its JS-hosted libc).

Probably not such a big deal because it isn't using JS strings as its representation, but as hypothesized above that wouldn't necessarily be the case for other languages.

I don't see it. I may have missed it in my survey of Java, Python and Ruby. Please let me know if so.

:-) How could I pass up the opportunity? See above.

# Allen Wirfs-Brock (12 years ago)

On Feb 20, 2012, at 1:42 PM, Wes Garland wrote:

On 20 February 2012 16:00, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

... Observation -- disallowing otherwise "legal" Unicode strings because they contain code points d800-dfff has very concrete implementation benefits: it's possible to use UTF-16 to represent the String's backing store. Without this concession, I fear it may not be possible to implement BRS-on without using a UTF-8 or full code point backing store (or some non-standard invention).

(or using multiple representations)

Yes, I understand. If it is a requirement (or even a goal) to enable implementations to use UTF-16 as the backing store, we should be clearer about it being so.

Maybe the answer is to consider (shudder) adding String-like utility functions to the TypedArrays? FWIW, CommonJS tried to go down this path and it turned out to be a lot of work for very little benefit (if any).

But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unpaired surrogates are present.

Only if the C strings are wide-character strings. 8-bit char strings are fine, they map right onto Latin-1 in native Unicode as well as the UTF-16 and UCS-2 encodings.

Yes, I was assuming WCHAR strings.

# Brendan Eich (12 years ago)

Allen Wirfs-Brock wrote:

On Feb 20, 2012, at 3:14 PM, Brendan Eich wrote: Note that the above says "invalid Unicode code point". 0xd800 is a valid Unicode code point. It isn't a valid Unicode character.

See docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isValidCodePoint(int)

Determines whether the specified code point is a valid Unicode
code point value in the range of 0x0000 to 0x10FFFF inclusive.
This method is equivalent to the expression:

      codePoint >= 0x0000 && codePoint <= 0x10FFFF

I should have remembered this, from the old days of Java and JS talking (LiveConnect). Strike one for me.

If that's true then we should have enough evidence that I'll happily concede the point and the spec will allow "\uD800" etc. in BRS-enabled literals. I do not see such evidence.

Note my concern isn't so much about literals as it is about string elements created via String.fromCharCode.

The only String.prototype method algorithms that seem to have any Unicode dependencies are toLowerCase/toUpperCase and the locale variants of those methods, and perhaps localeCompare, trim (which knows the Unicode white space character classification), and the regular expression based methods if the regexp is constructed with literal chars or uses character classes.

The concat, slice, substring, indexOf/lastIndexOf, and non-regexp based replace and split calls are all defined in terms of string element value comparisons and don't really care about what character set is used.

Wes Garland mentioned the possibility of using non-Unicode character sets such as Big5.

These are byte-based encodings, no? What is the problem inflating them by zero extension to 16 bits now (or 21 bits in the future)? You can't make an invalid Unicode character from a byte value.
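
In code, the inflation is trivial, and a sketch shows why it is safe (byte values 0x00-0xFF can never land in [0xD800, 0xDFFF]):

function inflateBytes(bytes) {
  var s = "";
  for (var i = 0; i < bytes.length; i++)
    s += String.fromCharCode(bytes[i]); // at most 0x00FF per element, far below 0xD800
  return s;
}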

Anyway, Big5 punned into JS strings (via a C or C++ API?) is not a strong use-case for ignoring invalid characters.

Ball one. :-P

Python: docs.python.org/release/3.0.1/howto/unicode.html#the-string-type -- lots here, similar to Ruby 1.9 (see below) but not obviously in need of invalid Unicode stored in uint16 vectors handled as JS strings.

I don't see any restrictions in that doc on strings containing \ud800 and friends. Unless there are, BRS-enabled ES strings couldn't be used as the representation type for Python strings.

You're right, you can make a literal in Python 3 such as '\ud800' without error. Strike two.

Ruby: Ruby supports strings with multiple encodings; the encoding is part of the string's metadata. I am not the expert here, but I found

www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n

helpful, and the more recent

yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails

too. See also these very interesting posts from Sam Ruby in 2007:

intertwingly.net/blog/2007/12/28/3-1-2, intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated

Ruby raises exceptions when you mix two strings with different encodings incorrectly, e.g. by concatenation.

I'm not sure about surrogate validation, but from all this I gather that compiling Ruby to JS in full would need Uint8Array, along with lots of runtime helpers that do not come for free from JS's String.prototype methods, in order to handle all the encodings.

If Type arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.

But, using current 16-bit JS string semantics, a JS string could still be used as the character store for many of these encodings, with the metadata stored separately (probably in a RubyString wrapper object), and the charset-insensitive JS string methods could be used to implement the Ruby semantics.

Did I get a hit off your pitch, then? Because Ruby does at least raise exceptions on mixed encoding concatenations.

But I'm about to strike out on the next pitch (language). You're almost certainly right that most languages with "full Unicode" support allow the programmer to create invalid strings via literals and constructors. It also seems common for charset-encoding APIs to validate and throw on invalid characters, which makes sense.

I could live with this, especially for String.fromCharCode.

For "\uD800..." in a BRS-enabled string literal, it still seems to me something is going to go wrong right away. Or rather, something should (like, early error). But based on precedent, and for the odd usage that doesn't go wrong ever (reads back code units, or has native code reading them and reassembling uint16 elements), I'll go along here too.

This means Gavin's option

  1. Allow invalid Unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.

as you noted in reply to him.

BRS excluding surrogate codes would at the very least require additional special case handling when dealing with Ruby strings containing those code points.

I suspect Ruby-on-JS is best done via Emscripten (demo'ed at JSConf.eu 2011), which makes this moot. With Emscripten you get the canonical Ruby implementation, not a hand-coded JS work(mostly)alike.

Yes I meant the Emscripten runtime "foreign" call support for calling JS functions. I did mean censor in that sense. Assume that you want to automatically convert WCHAR*

Again, wchar_t is not uint16 on all platforms.

At this point I'm not going to try my luck at bat again. Gavin's option 2 at least preserves .length distributivity over concatenation. So let's go on to other issues. What's next?

# Andrew Oakley (12 years ago)

On 02/20/12 16:47, Brendan Eich wrote:

Andrew Oakley wrote:

Issues only arise in code that tries to treat a string as an array of 16-bit integers, and I don't think we should be particularly bothered by performance of code which misuses strings in this fashion (but clearly this should still work without opt-in to new string handling).

This is all strings in JS and the DOM, today.

That is, we do not have any measure of code that treats strings as uint16s, forges strings using "\uXXXX", etc. but the ES and DOM specs have allowed this for > 14 years. Based on bitter experience, it's likely that if we change by fiat to 21-bit code points from 16-bit code units, some code on the Web will break.

Sorry, I don't think I was particularly clear. The point I was trying to make is that we can pretend that code points are 16-bit but actually use a 21-bit representation internally. If content requests proper Unicode support we simply switch to allowing 21-bit code-points and stop encoding characters outside the BMP using surrogate pairs (because the characters now fit in a single code point).

And as noted in the o.p. and in the thread based on Allen's proposal last year, browser implementations definitely count on representation via array of 16-bit integers, with length property or method counting same.

Breaking the Web is off the table. Breaking implementations, less so. I'm not sure why you bring up UTF-8. It's good for encoding and decoding but for JS, unlike C, we want string to be a high level "full Unicode" abstraction. Not bytes with bits optionally set indicating more bytes follow to spell code points.

Yes, I probably shouldn't have brought up UTF-8 (we do store strings using UTF-8, I was thinking about our own implementation). The intention was not to "break the web", my comments about issues when strings were misused were purely performance concerns, behaviour would otherwise remain unchanged (unless full Unicode support had been enabled).

# Wes Garland (12 years ago)

On 21 February 2012 00:03, Brendan Eich <brendan at mozilla.com> wrote:

These are byte-based encodings, no? What is the problem inflating them by zero extension to 16 bits now (or 21 bits in the future)? You can't make an invalid Unicode character from a byte value.

One of my examples, GB 18030, is a four-byte encoding and a Chinese government standard. It is a mapping onto Unicode, but this mapping is table-driven rather than algorithm driven like the UTF-* transport formats. To provide a single example, Unicode 0x2259 maps onto GB 18030 0x8136D830.

You're right about Big5 being byte-oriented, maybe this was a bad example, although it is a double-byte charset. It works by putting ASCII down low and making bytes above 0x7f escapes into code pages dereferenced by the next byte. Each code point is encoded with one or two bytes, never more. If I were developing with Big5 in JS, I would store the byte stream 4a 4b d8 00 c1 c2 4c as 004a 004b d800 c1c2 004c. This would allow me to use JS regular expressions and so on.
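
A sketch of that punning, with a deliberately simplified lead/trail-byte test (real Big5 byte ranges are narrower, and this only works with today's BRS-off semantics, since it manufactures elements like 0xD800):

function punBig5(bytes) {
  var s = "";
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b > 0x7F && i + 1 < bytes.length) {
      s += String.fromCharCode((b << 8) | bytes[i + 1]); // double-byte char in one element
      i++;
    } else {
      s += String.fromCharCode(b); // ASCII stays down low
    }
  }
  return s;
}
// punBig5([0x4a, 0x4b, 0xd8, 0x00, 0xc1, 0xc2, 0x4c]) gives elements 004a 004b d800 c1c2 004c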

Anyway, Big5 punned into JS strings (via a C or C++ API?) is not a strong

use-case for ignoring invalid characters.

Agreed - I'm stretching to see if I can stretch far enough to find a real problem with BRS -- because I really want it.

But the data does not need to arrive from a C API -- it could easily be delivered by an XHR request where, say, the remote end dumps database rows into a transport format based around evaluating JS string literals (like JSON).

Ball one. :-P

If I hit the batter, does he get to first base?

We still haven't talked about equality and normalization, I suppose that can wait.

# Brendan Eich (12 years ago)

Andrew Oakley wrote:

On 02/20/12 16:47, Brendan Eich wrote:

Andrew Oakley wrote:

Issues only arise in code that tries to treat a string as an array of 16-bit integers, and I don't think we should be particularly bothered by performance of code which misuses strings in this fashion (but clearly this should still work without opt-in to new string handling).

This is all strings in JS and the DOM, today.

That is, we do not have any measure of code that treats strings as uint16s, forges strings using "\uXXXX", etc. but the ES and DOM specs have allowed this for > 14 years. Based on bitter experience, it's likely that if we change by fiat to 21-bit code points from 16-bit code units, some code on the Web will break.

Sorry, I don't think I was particularly clear. The point I was trying to make is that we can pretend that code points are 16-bit but actually use a 21-bit representation internally.

So far, that's like Allen's proposal from last year (strawman:support_full_unicode_in_strings). But you didn't say how iteration (indexing and .length) works.

If content requests proper Unicode support we simply switch to allowing 21-bit code-points and stop encoding characters outside the BMP using surrogate pairs (because the characters now fit in a single code point).

How does content request proper Unicode support? Whatever that gesture is, it's big and red ;-). But we don't have such a switch or button to press like that, yet.

If a .js or .html file as fetched from a server has a UTF-8 encoding, indeed non-BMP characters in string literals will be transcoded in open-source browsers and JS engines that use uint16 vectors internally, but each part of the surrogate pair will take up one element in the uint16 vector. Let's take this now as a "content request" to use full Unicode. But the .js file was developed 8 years ago and assumes two code units, not one. It hardcodes for that assumption, somehow (indexing, .length exact value, indexOf('\ud800'), etc.). It is now broken.
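
A contrived sketch of that hardcoding, with comments showing how its assumptions flip:

var c = "😁"; // non-BMP character in a UTF-8 source file
// As transcoded today: c.length == 2 and c.charCodeAt(0) is the high surrogate.
// Under a full-Unicode "content request": c.length == 1 and c.charCodeAt(0)
// is the whole code point, so this branch goes dead:
if (c.length == 2 && c.charCodeAt(0) >= 0xD800 && c.charCodeAt(0) <= 0xDBFF) {
  // manual surrogate-pair reassembly the author baked in eight years ago
}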

And non-literal non-BMP characters won't be helped by transcoding differently when the .js or .html file is fetched. They'll just change "size" at runtime.

# Brendan Eich (12 years ago)

Wes Garland wrote:

On 21 February 2012 00:03, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote:

Ball one. :-P

If I hit the batter, does he get to first base?

Walk, yes (en.wikipedia.org/wiki/Hit_by_pitch).

We still haven't talked about equality and normalization, I suppose that can wait.

Allen's point in this last bit of the thread is that we don't need to interfere with bits stuffed into code units today, so we shouldn't tomorrow when units become as wide as (or wider than) code points. GIGO, and equality is memcmp (if you mean == and ===).

Normalization happens to source upstream of the JS engine. Here I'll call on a designated Unicode hitter. ;-)

# Brendan Eich (12 years ago)

Brendan Eich wrote:

in open-source browsers and JS engines that use uint16 vectors internally

Sorry, that reads badly. All I meant is that I can't tell what closed-source engines do, not that they do not comply with ECMA-262 combined with other web standards to have the same observable effect, e.g. Allen's example:

var c = "😁" // where the single character between the quotes is the Unicode character U+1f638

c.length == 2;
c === "\ud83d\ude38"; // the two-element UTF-16 encoding of 0x1f638
c.charCodeAt(0) == 0xd83d;
c.charCodeAt(1) == 0xde38;

Still no BRS to set, we need one if we want a full-Unicode outcome (c.length == 1, etc.).

# Phillips, Addison (12 years ago)

Normalization happens to source upstream of the JS engine. Here I'll call on a designated Unicode hitter. ;-)

I agree that Unicode Normalization shouldn't happen automagically in the JS engine. I rather doubt that "normalization happens to source upstream of the JS engine", unless by "upstream" you mean "best see to the normalization yourself".

By contrast, providing a method for normalizing strings would be useful.

Addison

# Brendan Eich (12 years ago)

Phillips, Addison wrote:

Normalization happens to source upstream of the JS engine. Here I'll call on a designated Unicode hitter. ;-)

I agree that Unicode Normalization shouldn't happen automagically in the JS engine. I rather doubt that "normalization happens to source upstream of the JS engine", unless by "upstream" you mean "best see to the normalization yourself".

Yes ;-).

I meant ECMA-262 punts source normalization upstream in the spec pipeline that runs parallel to the browser's loading-the-URL | processing-what-was-loaded pipeline. ECMA-262 is concerned only with its little slice of processing heaven.

By contrast, providing a method for normalizing strings would be useful.

/summon Norbert.

# Phillips, Addison (12 years ago)

I meant ECMA-262 punts source normalization upstream in the spec pipeline that runs parallel to the browser's loading-the-URL | processing-what-was-loaded pipeline. ECMA-262 is concerned only with its little slice of processing heaven.

Yep. One of the problems is that the source script may not be using a Unicode encoding or may be using a Unicode encoding and be serialized in a non-normalized form. Your slice of processing heaven treats Unicode-normalization-equivalent-yet-different-codepoint-sequence tokens as unequal. Not that this is a bad thing.

By contrast, providing a method for normalizing strings would be useful.

/summon Norbert.

(hides the breakables, listens for thunder)

Addison

# Phillips, Addison (12 years ago)

Because it has always been possible, it’s difficult to say how many scripts have transported byte-oriented data by “punning” the data into strings. Actually, I think this is more likely to be truly binary data rather than text in some non-Unicode character encoding, but anything is possible, I suppose. This could include using non-character values like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running implementation would break a script that relied on String being a sequence of 16-bit unsigned integer values with no error checking.

One of my examples, GB 18030, is a four-byte encoding and a Chinese government standard. It is a mapping onto Unicode, but this mapping is table-driven rather than algorithm driven like the UTF-* transport formats. To provide a single example, Unicode 0x2259 maps onto GB 18030 0x8136D830.

GB 18030 is more complex than that. Not all characters are four-byte, for example. As a multibyte encoding, you might choose to “pun” GB 18030 into a String as 81 36 d8 30. There isn’t much attraction to punning it into 0x8136 0xd830, but, as noted above, someone might be foolish enough to try it ;-). Scripts that rely on this probably break under BRS.

You're right about Big5 being byte-oriented, maybe this was a bad example, although it is a double-byte charset. It works by putting ASCII down low and making bytes above 0x7f escapes into code pages dereferenced by the next byte. Each code point is encoded with one or two bytes, never more. If I were developing with Big5 in JS, I would store the byte stream 4a 4b d8 00 c1 c2 4c as 004a 004b d800 c1c2 004c. This would allow me to use JS regular expressions and so on.

Not exactly. The trailing bytes in Big5 start at 0x40, for example. But it is certainly the case that some multibyte characters in Big5 happen to have the same byte-pair as a surrogate code point (when considered as a pair of bytes) or other non-character in the Unicode BMP, and one might (he says, squinting really hard) want to do as you suggest and record the multibyte sequence as a single code point.

But the data does not need to arrive from a C API -- it could easily be delivered by an XHR request where, say, the remote end dumps database rows into a transport format based around evaluating JS string literals (like JSON).

Allowing isolated invalid sequences isn’t actually the problem, if you think about it. Yes, the data is bad and yes you can’t view it cleanly. But you can do whatever you need to on it. The problem is when you intend to store two values that end up as a single character. If I have a string with code points “f235 5e7a e040 d800”, the d800 does no particular harm. The problem is: if I construct a BRS string using that sequence and then concatenate the sequence “dc00 a053 3254” onto it, the resulting string is only six characters long, rather than the expected seven, since presumably the d800 dc00 pair turns into U+10000.

Addison

# Brendan Eich (12 years ago)

Phillips, Addison wrote:

Because it has always been possible, it’s difficult to say how many scripts have transported byte-oriented data by “punning” the data into strings. Actually, I think this is more likely to be truly binary data rather than text in some non-Unicode character encoding, but anything is possible, I suppose. This could include using non-character values like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running implementation would break a script that relied on String being a sequence of 16-bit unsigned integer values with no error checking.

Allen's view of the BRS-enabled semantics would have 16-bit "GIGO" without exceptions -- you'd be storing 16-bit values, whatever their source (including "\uXXXX" literals spelling invalid characters and unmatched surrogates) in at-least-21-bit elements of strings, and reading them back.

My concern and reason for advocating early or late errors on shenanigans was that people today writing surrogate pairs literally and then taking extra pains in JS or C++ (whatever the host language might be) to process them as single code points and characters would be broken by the BRS-enabled behavior of separating the parts into distinct code points.

But that's pessimistic. It could happen, but OTOH anyone coding surrogate pairs might want them to read back piece-wise when indexing. In that case what Allen proposes, storing each formerly 16-bit code unit, however expressed, in the wider 21-or-more-bits unit, and reading back likewise, would "just work".

Sorry if this is all obvious. Mainly I want to throw in my lot with Allen's exception-free literal/constructor approach. The encoding APIs should throw on invalid Unicode but literals and strings as immutable 16-bit storage buffers should work as today.
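
Spelled out, the exception-free contract I'm endorsing looks like this under BRS-on (expected values in comments; this is Gavin's option 1 from upthread):

var s = "\uD800";                   // no error: one element holding 0xD800
s.charCodeAt(0) == 0xD800;          // reads back exactly what was stored
("\uD800" + "\uDC00").length == 2;  // the halves stay distinct elements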

# Allen Wirfs-Brock (12 years ago)

On Feb 21, 2012, at 7:37 AM, Brendan Eich wrote:

Brendan Eich wrote:

in open-source browsers and JS engines that use uint16 vectors internally

Sorry, that reads badly. All I meant is that I can't tell what closed-source engines do, not that they do not comply with ECMA-262 combined with other web standards to have the same observable effect, e.g. Allen's example:

A quick scan of code.google.com/p/v8/issues/detail?id=761 suggests that there may be more variability among current browsers than we thought. I haven't tried my original test case in Chrome or IE9 but the discussion in this bug report suggests that their behavior may currently be different from FF.

# Brendan Eich (12 years ago)

Hi Mark, thanks for this post.

Mark Davis ☕ wrote:

UTF-8 represents a code point as 1-4 8-bit code units

"1-6".

UTF-16 represents a code point as 2 or 4 16-bit code units

"1 or 2".

Lock up your encoders, I am so not a Unicode guru but this is what my reptile coder brain remembers.

# Tab Atkins Jr. (12 years ago)

On Tue, Feb 21, 2012 at 3:11 PM, Brendan Eich <brendan at mozilla.com> wrote:

Hi Mark, thanks for this post. Mark Davis ☕ wrote:

UTF-8 represents a code point as 1-4 8-bit code units

"1-6". ... Lock up your encoders, I am so not a Unicode guru but this is what my reptile coder brain remembers.

Only theoretically. UTF-8 has been locked down to the same range that UTF-16 has (RFC 3629), so the largest real character you'll see is 4 bytes, as that gives you exactly 21 bits of data.

# Phillips, Addison (12 years ago)

Hi Mark, thanks for this post.

Mark Davis ☕ wrote:

UTF-8 represents a code point as 1-4 8-bit code units

"1-6".

No. 1 to 4. Five and six byte "UTF-8" sequences are illegal and invalid.

UTF-16 represents a code point as 2 or 4 16-bit code units

"1 or 2".

Yes, 1 or 2 16-bit code units (that's 2 or 4 bytes, of course).

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature. It is an architecture.

# Brendan Eich (12 years ago)

Thanks, all! That's a relief to know, six bytes always seemed too long but my reptile coder brain was also reptile-coder-lazy and I never dug into it.

# Norbert Lindenberg (12 years ago)

I'll reply to Brendan's proposal in two parts: first about the goals for supplementary character support, second about the BRS.

Full 21-bit Unicode support means all of:

  • indexing by characters, not uint16 storage units;
  • counting length as one greater than the last index; and
  • supporting escapes with (up to) six hexadecimal digits.

For me, full 21-bit Unicode support has a different priority list.

First come the essentials: Regular expressions; functions that interpret strings; the overall sense that all Unicode characters are supported.

  1. Regular expressions must recognize supplementary characters as atomic entities, and interpret them according to Unicode semantics.

Look at the contortions one has to go through currently to describe a simple character class that includes supplementary characters: roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js

Read up on why it has to be done this way, and see to what extremes some people are going to make supplementary characters work despite ECMAScript: inimino.org/~inimino/blog/javascript_cset

Now, try to figure out how you'd convert a user-entered string to a regular expression such that you can search for the string without case distinction, where the string may contain supplementary characters such as "𐐶𐐲𐑌" (Deseret for "one").
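
To make the contortion concrete, here is roughly what matching a single supplementary character without case distinction costs today, using Deseret LONG I (U+10400 capital, U+10428 small) as a stand-in:

var longI = /(?:\uD801\uDC00|\uD801\uDC28)/; // both cases, spelled as surrogate pairs
// The /i flag cannot help: case folding is defined on code points, and the
// engine only sees the 16-bit halves.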

Regular expressions matter a lot here because, if done properly, they eliminate much of the need for iterating over strings manually.

  2. Built-in functions that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics. The list of functions in ES5 that violate this principle is actually rather short: Besides the String functions relying on regular expressions (match, replace, search, split), they're the String case conversion functions (toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the relational comparison for strings (11.8.5). But the principle is also important for new functionality being considered for ES6 and above.

  3. It must be clear that the full Unicode character set is allowed and supported. This means at least getting rid of the reference to UCS-2 (clause 2) and the bizarre equivalence between characters and UTF-16 code units (clause 6). ECMAScript has already defined several ways to create UTF-16 strings containing supplementary characters (parsing UTF-8 source; using Unicode escapes for surrogate pairs), and lets applications freely pass around such strings. Browsers have surrounded ECMAScript implementations with text input, text rendering, DOM APIs, and XMLHttpRequest with full Unicode support, and generally use full UTF-16 to exchange text with their ECMAScript subsystem. Developers have used this to build applications that support supplementary characters, hacking around the remaining gaps in ECMAScript as seen above. But, as in the bug report that Brendan pointed to this morning (code.google.com/p/v8/issues/detail?id=761), the mention of UCS-2 is still used by some to excuse bugs.

Only after these essentials come the niceties of String representation and Unicode escapes:

  1. 1 String element to 1 Unicode code point is indeed a very nice and desirable relationship. Unlike Java, where binary compatibility between virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be compatible only at the source code level - or maybe, with a BRS, not even that.

  2. If we don't go for UTF-32, then there should be a few functions to simplify access to strings in terms of code points, such as String.fromCodePoint, String.prototype.codePointAt (see the sketch after this list).

  3. I strongly prefer the use of plain characters over Unicode escapes in source code, because plain text is much easier to read than sequences of hex values. However, the need for Unicode escapes is greater in the space of supplementary characters because here we often have to reference characters for which our operating systems don't have glyphs yet. And \u{1D11E} certainly makes it easier to cross-reference a character than \uD834\uDD1E. The new escape syntax therefore should be on the list, at low priority.
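
A rough self-hosted sketch of the codePointAt accessor suggested in point 2 above; illustrative only, not proposed spec text:

function codePointAt(s, i) {
  var first = s.charCodeAt(i);
  if (first >= 0xD800 && first <= 0xDBFF && i + 1 < s.length) {
    var second = s.charCodeAt(i + 1);
    if (second >= 0xDC00 && second <= 0xDFFF)
      return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
  }
  return first; // BMP code point, or an unpaired surrogate half
}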

I think it would help if other people involved in this discussion also clarified what exactly their requirements are for "full Unicode support".

Norbert

# Brendan Eich (12 years ago)

On Feb 21, 2012, at 6:05 PM, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:

I'll reply to Brendan's proposal in two parts: first about the goals for supplementary character support, second about the BRS.

Full 21-bit Unicode support means all of:

  • indexing by characters, not uint16 storage units;
  • counting length as one greater than the last index; and
  • supporting escapes with (up to) six hexadecimal digits.

For me, full 21-bit Unicode support has a different priority list.

First come the essentials: Regular expressions; functions that interpret strings; the overall sense that all Unicode characters are supported.

  1. Regular expressions must recognize supplementary characters as atomic entities, and interpret them according to Unicode semantics.

Sorry to have been unclear. In my proposal this follows from the first two bullets.

  2. Built-in functions that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics.

Ditto.

  3. It must be clear that the full Unicode character set is allowed and supported.

Absolutely.

Only after these essentials come the niceties of String representation and Unicode escapes:

  1. 1 String element to 1 Unicode code point is indeed a very nice and desirable relationship. Unlike Java, where binary compatibility between virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be compatible only at the source code level - or maybe, with a BRS, not even that.

Right!

  2. If we don't go for UTF-32, then there should be a few functions to simplify access to strings in terms of code points, such as String.fromCodePoint, String.prototype.codePointAt.

Those would help smooth out different BRS settings, indeed.

  3. I strongly prefer the use of plain characters over Unicode escapes in source code, because plain text is much easier to read than sequences of hex values. However, the need for Unicode escapes is greater in the space of supplementary characters because here we often have to reference characters for which our operating systems don't have glyphs yet. And \u{1D11E} certainly makes it easier to cross-reference a character than \uD834\uDD1E. The new escape syntax therefore should be on the list, at low priority.

Allen and I were just discussing this as a desirable mini-strawman of its own, which Allen will write up for consideration at the next meeting.

We will also discuss the BRS. Did you have some thoughts on it?

I think it would help if other people involved in this discussion also clarified what exactly their requirements are for "full Unicode support".

Again, apologies for not being explicit. I model the string methods as self-hosted using indexing and .length in straightforward ways. HTH,

# Norbert Lindenberg (12 years ago)

Second part: the BRS.

I'm wondering how development and deployment of existing full-Unicode software will play out in the presence of a Big Red Switch. Maybe I'm blind and there are ways to simplify the process, but this is how I imagine it.

Let's start with a bit of code that currently supports full Unicode by hacking around ECMAScript's limitations: roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js

To support applications running in a BRS-on environment, Roozbeh would have to create a parallel version of the module that (a) takes advantage of regular expressions that finally support supplementary characters and (b) uses the new Unicode escape syntax instead of the old one. The parallel version has to be completely separate because a BRS-on environment would reject the old Unicode escapes and an ES5/BRS-off environment would reject the new Unicode escapes.
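
For concreteness, a sketch of the incompatibility, assuming (as this scenario does) that BRS-on rejects the old escape syntax outright:

var clefOld = "\uD834\uDD1E"; // U+1D11E; parses in ES5/BRS-off only
var clefNew = "\u{1D11E}"; // parses in BRS-on only; a syntax error in ES5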

To get the code tested, he also has to create a parallel version of the test cases. The parallel version would be functionally identical but set up a BRS-on environment and use the new Unicode escape syntax instead of the old one. The parallel version has to be completely separate because a BRS-on environment would reject the old Unicode escapes and an ES5/BRS-off environment would reject the new Unicode escapes. Fortunately the test cases are simple.

Then he has to figure out how the two separate versions of the module will get loaded by clients. It's a YUI module, and the YUI loader already has the ability to look at several parameters to figure out what to load (minimized vs. debug version, localized resource bundles, etc.), so maybe the BRS should be another parameter? But the YUI team has a long to-do list, so in the meantime the module gets two separate names, and the client has to figure out which one to request.

The first client picking up the new version is another, bigger library. As a library it doesn't control the BRS, so it has to be able to run with both BRS-on and BRS-off. So it has to check the BRS and load the appropriate version of the intl-bidi module at runtime. This means it also has to be tested in both environments. Its test cases are not simple. So now it needs modifications to the test framework to run the test suite twice, once with BRS-on and once with BRS-off.

An application using the library and thus the intl-bidi module decides to take the plunge and switch to BRS-on. It doesn't do text processing itself (that's what libraries are for), and it doesn't use Unicode escapes, so no code changes. But when it throws the switch, exceptions get thrown. It turns out that 3 of the 50 JavaScript files loaded during startup use old Unicode escapes. One of them seems to do something that might affect supplementary characters; for the other two apparently the developers just felt safer escaping all non-ASCII characters. The developers of the application don't actually know anything about the scripts - they got loaded indirectly by apps, ads, and analytics software used by the application. The developers try to find out whom they'll have to educate about the BRS to get this resolved.

OK - migrations are hard. But so far most participants have only seen additional work, no benefits. How long will this take? When will it end? When will browsers make BRS-on the default, let alone eliminate the switch? When can Roozbeh abandon his original version? Where's the blue button?

The thing to keep in mind is that most code doesn't need to know anything about supplementary characters. The beneficiaries of the switch are only the implementors of functions that do need to know, and even they won't really benefit until the switch is permanently on (at least for all their clients). It seems the switch puts a new burden on many that so far have been rightfully oblivious to supplementary characters.

Norbert

On Feb 19, 2012, at 0:33 , Brendan Eich wrote:

[snip]

# Wes Garland (12 years ago)

Interesting scenarios, Norbert -- well-thought-through.

The final goal (for me, at least) is to be able to tell my developers to "Just write code" and forget about the details of how the characters in strings are encoded. Your point about the bidi library is an important one, but I think if we could somehow survey the web, we would find that the vast majority of applications do The Wrong Thing now and that flipping the BRS would magically fix a lot of them. I think any group that is "with it" w.r.t. Unicode in JS today will find a way to embrace BRS-on as long as there is a reasonable path to follow.

Some day, I hope developers will simply start all documents with something like <!DOCTYPE HTML UNICODE> and never worry about character encoding details again. That is when we will start to see benefits, and these benefits will snowball as organizations start to do this.

Of course, to get there, we have to somehow manage the transition. I think your point about the static rejection of four-byte Unicode escapes is really important. During the transitional period, we need a way to write JS libraries that can run with BRS on or off.

If four-byte escapes are statically rejected in BRS-on, we have a problem -- we should be able to use old code that runs in either mode unchanged when said code only uses characters in the BMP.

Accepting both 4 and 6 byte escapes is a problem, though -- what is "\u123456".length? 1 or 3?

If we accept "\u1234" in BRS-on as a string with length 5 -- as we do today in ES5 with "\u123".length===4 -- we give developers a way to feature-test and conditionally execute code, allowing libraries to run with BRS-on and BRS-off.
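
Under those assumed semantics, the feature test might look like this (a sketch; brsOn is a hypothetical name, and the behavior shown is Wes's speculation, not current ES5):

var brsOn = "\u1234".length === 5; // escape not recognized: the literal characters u1234
if (brsOn) {
  // take code-point-aware paths
}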

It's awkward, though: there is no way to recover static strings programmatically since the \ has been eaten by the JS compiler. And users will want to programmatically convert arrays of strings (think gettext)

So, it seems that for a good migration path we somehow need to mark string literals so that the parser knows how to deal with them. And we need to do it in a way that "just works" in ES5 while preserving natural syntax with BRS-on.

Idea: can we add a per-script attribute which allows a transitional parsing scheme for string literals when BRS-on? This transitional scheme would parse string literals like BRS-off, unless the string literal had a leading U.

Having a per-script attribute lets module system developers deal with the problem easily when using DOM SCRIPT tag injection to load modules. It also allows users switching BRS-on to load old content from foreign sites, which I believe is necessary for widespread BRS-on adoption.

Sample program demonstrating how this might work:

<!DOCTYPE HTML UNICODE>
<html>
<script>
var i;
var a = [0];
a.push("\u1234");
</script>
<script parser="unicodeTransitional">
a.push("\u1234");
a.push(U"\u1234");
a.push(U"\u123456");
</script>
<script>
a.push("\u123456");
for (i = 0; i < a.length; i++) {
  console.log(i + " -> " + a[i].length);
}
</script>
</html>

Output:

0 -> 5
1 -> 1
2 -> 5
3 -> 1
4 -> 1

I think this is a sustainable solution that gives developers just enough tools to retrofit without going off in lala-land by adding a bunch of extra types and helper methods.

# Wes Garland (12 years ago)

Erratum:

var a = [0];

should read

var a = [];

# Brendan Eich (12 years ago)

Norbert Lindenberg wrote:

OK - migrations are hard. But so far most participants have only seen additional work, no benefits. How long will this take? When will it end? When will browsers make BRS-on the default, let alone eliminate the switch? When can Roozbeh abandon his original version? Where's the blue button?

It may be that the BRS is worse than an incompatible change to "full Unicode" as Allen proposed last year. But in either case, something gets harder for Roozbeh. Which is worse?

# Brendan Eich (12 years ago)

Wes Garland wrote:

If four-byte escapes are statically rejected in BRS-on, we have a problem -- we should be able to use old code that runs in either mode unchanged when said code only uses characters in the BMP.

We've been over this and I conceded to Allen that "four-byte escapes" (I'll use \uXXXX to be clear from now on) must work as today with BRS-on. Otherwise we make it hard to impossible to migrate code that knows what it is doing with 16-bit code units that round-trip properly.

Accepting both 4 and 6 byte escapes is a problem, though -- what is "\u123456".length? 1 or 3?

This is not a problem. We want .length to distribute across concatenation, so 3 is the only answer and in particular ("\u1234" + "\u5678").length === 2 irrespective of BRS.
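
Spelled out as a sketch of the proposed BRS-on behavior:

"\u123456".length === 3; // the escape \u1234, then the characters "5" and "6"
("\u1234" + "\u5678").length === 2; // length distributes over concatenation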

If we accept "\u1234" in BRS-on as a string with length 5 -- as we do today in ES5 with "\u123".length===4 -- we give developers a way to feature-test and conditionally execute code, allowing libraries to run with BRS-on and BRS-off.

Feature-testing should be done using a more explicit test. API TBD, but I don't think breaking "\uXXXX" with BRS on is a good idea.

I agree with you that Roozbeh's module is hardly used, so it can take the hit of having to feature-test the BRS. The much more common case today is JS code that blithely ignores non-BMP characters that make it into strings as pairs, treating them blindly as two "characters" (ugh; must purge that "c-word" abusage from the spec).

# Allen Wirfs-Brock (12 years ago)

I posted a new strawman that describes what I think is the most minimal support that we must provide for "full unicode" in ES.next: strawman:full_unicode_source_code

I'm not suggesting that we must stop at this level of support, but I think not doing at least what is described in this proposal would be a mistake.

Thoughts?

# Erik Corry (12 years ago)

I'm not in favour of big red switches, and I don't think the compartment based solution is going to be workable.

I'd like to plead for a solution rather like the one Java has, where strings are sequences of UTF-16 codes and there are specialized ways to iterate over them. Looking at this entry from the Unicode FAQ: unicode.org/faq/char_combmark.html#7 there are different ways to describe the length (and iteration) of a string. The BRS proposal favours #2, but I think for most applications utf-16-based-#1 is just fine, and for the applications that want to "do it right" #3 is almost always the correct solution. Solution #3 needs library support in any case and has no problems with UTF-16.

The central point here is that there are combining characters (accents) that you can't just normalize away. Getting them right has a lot of the same issues as surrogate pairs (you shouldn't normally chop them up, they count as one 'character', you can't tell how many of them there are in a string without looking, etc.). If you can handle combining characters then the surrogate pair support falls out pretty much for free.
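
For concreteness, the three ways of counting applied to one short string (a sketch assuming today's UTF-16 semantics):

var s = "x\u0304\uD834\uDD1E"; // x + combining macron + U+1D11E
s.length === 4; // #1: UTF-16 code units
// #2: 3 code points (U+0078, U+0304, U+1D11E)
// #3: 2 grapheme clusters ("x̄" and the clef)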

Advantages of my proposal:

  • High level of backwards compatibility
  • No issues of where to place the BRS
  • Compact and simple in the implementation
  • Can be polyfilled on most VMs
  • Interaction with the DOM is unproblematic
  • No issues of what happens on concatenation if a surrogate pair is created.

Details:

  • The built in string charCodeAt, [], length operations work in terms of UTF-16
  • String.fromCharCode(x) can return a string with a length of 2
  • New object StringIterator

new StringIterator(backing) returns a string iterator. The iterator has the following methods:

hasNext();           // Returns this.index() != this.backing().length
nextGrapheme();      // Returns the next grapheme as a Unicode code point, or -1 if the next grapheme is a sequence of code points
nextGraphemeArray(); // Returns an array of numeric code points (possibly just one) representing the next grapheme
nextCodePoint();     // Returns the next code point, possibly consuming a surrogate pair (two code units)
index();             // Gets the current index in the string, from 0 to length
setIndex();          // Sets the current index in the string, from 0 to length
backing();           // Gets the backing string

// Optionally
hasPrevious();
previous*();         // Analogous to nextGrapheme etc.
codePointLength();   // Takes O(length), cache the answer if you care
graphemeLength();    // Ditto

If any of the next* functions encounter an unmatched half of a surrogate pair, they just return its number.
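
A usage sketch of the proposed iterator, assuming the API as described above:

var it = new StringIterator("x\u0304\uD834\uDD1E");
while (it.hasNext()) {
  console.log(it.nextCodePoint().toString(16)); // logs 78, 304, 1d11e in turn
}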

Regexp support. Regexps act 'as if' the following steps were performed.

Outside character classes an extended character turns into (?:xy) where x and y are the two halves of its surrogate pair. Inside positive character classes the extended characters are extracted, so [abz] becomes (?:[ab]|xy) where z is an extended character and x and y are its surrogate halves. Negative character classes can be handled by transforming into negative lookaheads. A decent set of Unicode character classes will likely subsume most uses of these transformations.
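
For example, with z = U+1F638, whose surrogate halves are \uD83D and \uDE38 (a sketch of the 'as if' rewriting):

// outside a class: /z/ acts as /(?:\uD83D\uDE38)/
// in a positive class: /[abz]/ acts as /(?:[ab]|\uD83D\uDE38)/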

Perhaps the BRS 21 bit solution feels marginally cleaner, but having two different kinds of strings in the same VM feels like a horrible solution that is user visible and will haunt implementations forever, and the cleanliness difference is very marginal given that grapheme based iteration is the correct solution for almost all the cases where iterating over utf-16 codes is not good enough.

# Erik Corry (12 years ago)

2012/2/22 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:

I'll reply to Brendan's proposal in two parts: first about the goals for supplementary character support, second about the BRS.

Full 21-bit Unicode support means all of:

  • indexing by characters, not uint16 storage units;
  • counting length as one greater than the last index; and
  • supporting escapes with (up to) six hexadecimal digits.

For me, full 21-bit Unicode support has a different priority list.

First come the essentials: Regular expressions; functions that interpret strings; the overall sense that all Unicode characters are supported.

  1. Regular expressions must recognize supplementary characters as atomic entities, and interpret them according to Unicode semantics.

Look at the contortions one has to go through currently to describe a simple character class that includes supplementary characters: roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js

Read up on why it has to be done this way, and see to what extremes some people are going to make supplementary characters work despite ECMAScript: inimino.org/~inimino/blog/javascript_cset

Now, try to figure out how you'd convert a user-entered string to a regular expression such that you can search for the string without case distinction, where the string may contain supplementary characters such as "𐐶𐐲𐑌" (Deseret for "one").

Regular expressions matter a lot here because, if done properly, they eliminate much of the need for iterating over strings manually.

  2. Built-in functions that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics. The list of functions in ES5 that violate this principle is actually rather short: Besides the String functions relying on regular expressions (match, replace, search, split), they're the String case conversion functions (toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the relational comparison for strings (11.8.5). But the principle is also important for new functionality being considered for ES6 and above.

  3. It must be clear that the full Unicode character set is allowed and supported. This means at least getting rid of the reference to UCS-2 (clause 2) and the bizarre equivalence between characters and UTF-16 code units (clause 6). ECMAScript has already defined several ways to create UTF-16 strings containing supplementary characters (parsing UTF-8 source; using Unicode escapes for surrogate pairs), and lets applications freely pass around such strings. Browsers have surrounded ECMAScript implementations with text input, text rendering, DOM APIs, and XMLHTTPRequest with full Unicode support, and generally use full UTF-16 to exchange text with their ECMAScript subsystem. Developers have used this to build applications that support supplementary characters, hacking around the remaining gaps in ECMAScript as seen above. But, as in the bug report that Brendan pointed to this morning (code.google.com/p/v8/issues/detail?id=761), the mention of UCS-2 is still used by some to excuse bugs.

I agree that these are the priorities and should be done, including reopening and fixing the V8 bug.

Only after these essentials come the niceties of String representation and Unicode escapes:

  4. 1 String element to 1 Unicode code point is indeed a very nice and desirable relationship. Unlike Java, where binary compatibility between virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be compatible only at the source code level - or maybe, with a BRS, not even that.

I don't think this is important enough to justify incompatibility/implementation pain.

Agree with your points 5 and 6. One extra point of my own:

  • I think we should prefer transparency in cases where there is doubt. This means passing data through with no errors or changes. It means allowing half surrogate pairs, combining characters that have nothing to combine with, and characters that are not currently assigned in Unicode. In an ideal world these would never arise, but when they do, the most helpful thing is almost always to allow/ignore them.

Here are two hypothetical examples:

We get data from a source that chops up a UTF-16 text into chunks and sends them separately for transmission. This will result in unmatched pairs of surrogates, but as long as our application transmits the data unchanged, no harm results after they are recombined later.
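
As a sketch (today's semantics, with the split point deliberately inside the pair):

var s = "\uD834\uDD1E"; // one supplementary character, two code units
var chunk1 = s.slice(0, 1); // lone high surrogate
var chunk2 = s.slice(1); // lone low surrogate
chunk1 + chunk2 === s; // true: transparent pass-through recombines losslessly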

Take an XML format where all the tags are ASCII, but there is body text that contains floating point numbers encoded as 16 bit values, including malformed surrogate pairs. This is pretty sick, but who are we to judge? We want to treat this as a string because we can use string operations on the XML tags, but it would be extremely unhelpful to replace the malformed data with invalid unicode code points.

Others have mentioned examples of abusing the string type for other encodings.

All these examples arguably involve some poor architectural decision somewhere along the line that should ideally be fixed upstream, but I don't think this justifies making JS into the angel of Unicode-correct vengeance, striking down bad encodings in the name of peace and justice.

I'm old enough (just) to have worked on an OS/360 system where you had to declare text files and you had to specify a max line length. This quickly got very annoying, because if you made them text files you sooner or later ended up with truncated lines, and if you didn't make them text files there was a lot of system functionality that did not work. The relief on getting back to Unix, where the OS just treats a file as a bag of bytes, was great. I hope the analogy is clear.

# Norbert Lindenberg (12 years ago)

Comments:

  1. In terms of the prioritization I suggested a few days ago esdiscuss/2012-February/020721 it seems you're considering item 6 essential, item 1 a side effect (whose consequences are not mentioned - see below), items 2-5 nice to have. Do I understand that correctly? What is this prioritization based on?

  2. The description of the current situation seems incorrect. The strawman says: "As currently specified by ES5.1, supplementary characters cannot be used in the source code of ECMAScript programs." I don't see anything in the spec saying this. To the contrary, the following statement in clause 6 of the spec opens the door to supplementary characters: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16." Actual source text outside of an ECMAScript runtime is rarely stored in streams of 16-bit code units; it's normally stored and transmitted in UTF-8 (including its subset ASCII) or some other single-byte or multi-byte character encoding. Interpreting source text therefore almost always requires conversion to UTF-16 as a first step. UTF-8 and several other encodings (GB18030, Big5-HKSCS, EUC-TW) can represent supplementary characters, and correct conversion to UTF-16 will convert them to surrogate pairs.

When I mentioned this before, you said that the intent of the ES5 wording was to keep ECMAScript limited to the BMP (the "UCS-2 world"). esdiscuss/2011-May/014337, esdiscuss/2011-May/014342 However, I don't see that intent reflected in the actual text of clause 6.

I have since also tested with supplementary characters in UTF-8 source text on a variety of current browsers (Safari / (Mac, iOS), (Firefox, Chrome, Opera) / (Mac, Windows), Explorer / Windows), and they all handle the conversion from UTF-8 to UTF-16 correctly. Do you know of one that doesn't? The only ECMAScript implementation I encountered that fails here is Node.js.

In addition to plain text encoding in UTF-8, supplementary characters can also be represented in source code as a sequence of two Unicode escapes. It's not as convenient, but it works in all implementations I've tested, including Node.js.
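
For example, both spellings below produce the same two-code-unit string in the implementations described above:

var clef1 = "\uD834\uDD1E"; // a sequence of two Unicode escapes
var clef2 = "𝄞"; // the raw character in UTF-8 source, converted to UTF-16
clef1 === clef2; // true
clef1.length === 2; // a surrogate pair either way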

  3. Changing the source code to be just a stream of Unicode characters seems a good idea overall. However, just changing the definition of SourceCharacter is going to break things. SourceCharacter isn't only used for source syntax and JSON syntax, where the change seems benign; it's also used to define the content of String values and the interpretation of regular expression patterns:
  • Subclause 7.8.4 contains the statements "The SV of DoubleStringCharacters :: DoubleStringCharacter is a sequence of one character, the CV of DoubleStringCharacter." and "The CV of DoubleStringCharacter :: SourceCharacter but not one of " or \ or LineTerminator is the SourceCharacter character itself." If SourceCharacter becomes a Unicode character, then this means coercing a 21-bit code point into a single 16-bit code unit, and that's not going to end well.
  • Subclauses 15.10.1 and 15.10.2 use SourceCharacter to define PatternCharacter, IdentityEscape, RegularExpressionNonTerminator, ClassAtomNoDash. While this could potentially be part of a set of changes to make regular expression correctly support full Unicode, by itself it means that 21-bit code points will be coerced into or compared against 16-bit code units. Changing regular expressions to be code-point based has some compatibility risk which we need to carefully evaluate.
  4. The statement about UnicodeEscapeSequence: "This production is limited to only expressing 16-bit code point values." is incorrect. Unicode escape sequences express 16-bit code units, not code points (remember that any use of the word "character" without the prefix "Unicode" in the spec after clause 6 means "16-bit code unit"). A supplementary character can be represented in source code as a sequence of two Unicode escapes. The proposed new Unicode escape syntax is more convenient and more legible, but doesn't provide new functionality.

  5. I don't understand the sentence "For that reason, it is impossible to know for sure whether pairs of existing 16-bit Unicode escapes are intended to represent a single logical character or an explicit two character UTF-16 encoding of a Unicode characters." - what do you mean by "an explicit two character UTF-16 encoding of a Unicode characters"? In any case, it seems pretty clear to me that a Unicode escape for a high surrogate value followed by a Unicode escape for a low surrogate value, with the spec based on 16-bit values, means a surrogate pair representing a supplementary character. Even if the system were then changed to be 32-bit based, it's hard to imagine that the intent was to create a sequence of two invalid code points.

Norbert

# Erik Corry (12 years ago)

2012/3/1 Glenn Adams <glenn at skynav.com>:

2012/3/1 Erik Corry <erik.corry at gmail.com>

I'm not in favour of big red switches, and I don't think the compartment based solution is going to be workable.

I'd like to plead for a solution rather like the one Java has, where strings are sequences of UTF-16 codes and there are specialized ways to iterate over them.  Looking at this entry from the Unicode FAQ: unicode.org/faq/char_combmark.html#7 there are different ways to describe the length (and iteration) of a string.  The BRS proposal favours #2, but I think for most applications utf-16-based-#1 is just fine, and for the applications that want to "do it right" #3 is almost always the correct solution.  Solution #3 needs library support in any case and has no problems with UTF-16.

The central point here is that there are combining characters (accents) that you can't just normalize away.  Getting them right has a lot of the same issues as surrogate pairs (you shouldn't normally chop them up, they count as one 'character', you can't tell how many of them there are in a string without looking, etc.).  If you can handle combining characters then the surrogate pair support falls out pretty much for free.

The problem here is that you are mixing apples and oranges. Although it may appear that surrogate pairs and grapheme clusters have features in common, they operate at different semantic levels entirely. A solution that attempts to conflate these two levels is going to cause problems at both levels. A distinction should be maintained between the following levels:

(1) encoding units (e.g., UTF-16 coding units)
(2) unicode scalar values (code points)
(3) grapheme clusters

This distinction is not lost on me. I propose that random access indexing and .length in JS should work on level 1, and there should be library support for levels 2 and 3. In order of descending usefulness I think the order is 1, 3, 2. Therefore I don't want to cause a lot of backwards compatibility headaches by prioritizing the efficient handling of level 2.

# Glenn Adams (12 years ago)

On Fri, Mar 2, 2012 at 12:58 AM, Erik Corry <erik.corry at gmail.com> wrote:

2012/3/1 Glenn Adams <glenn at skynav.com>:

I'd like to plead for a solution rather like the one Java has, where strings are sequences of UTF-16 codes and there are specialized ways to iterate over them. Looking at this entry from the Unicode FAQ: unicode.org/faq/char_combmark.html#7 there are different ways to describe the length (and iteration) of a string. The BRS proposal favours #2, but I think for most applications utf-16-based-#1 is just fine, and for the applications that want to "do it right" #3 is almost always the correct solution. Solution #3 needs library support in any case and has no problems with UTF-16.

The central point here is that there are combining characters (accents) that you can't just normalize away. Getting them right has a lot of the same issues as surrogate pairs (you shouldn't normally chop them up, they count as one 'character', you can't tell how many of them there are in a string without looking, etc.). If you can handle combining characters then the surrogate pair support falls out pretty much for free.

The problem here is that you are mixing apples and oranges. Although it may appear that surrogate pairs and grapheme clusters have features in common, they operate at different semantic levels entirely. A solution that attempts to conflate these two levels is going to cause problems at both levels. A distinction should be maintained between the following levels:

(1) encoding units (e.g., UTF-16 coding units)
(2) unicode scalar values (code points)
(3) grapheme clusters

This distinction is not lost on me. I propose that random access indexing and .length in JS should work on level 1,

that's where we are today: indexing and length based on 16-bit code units (of a UTF-16 encoding, likewise with Java)

and there should be library support for levels 2 and 3. In order of descending usefulness I think the order is 1, 3, 2. Therefore I don't want to cause a lot of backwards compatibility headaches by prioritizing the efficient handling of level 2.

from a perspective of indexing "Unicode characters", level 2 is the correct place;

level 3 is useful for higher level, language/locale sensitive text processing, but not particularly interesting at the basic ES string processing level; we aren't talking about (or IMO should not be talking about) a level 3 text processing library in this thread;

# Erik Corry (12 years ago)

2012/3/2 Glenn Adams <glenn at skynav.com>:

On Fri, Mar 2, 2012 at 12:58 AM, Erik Corry <erik.corry at gmail.com> wrote:

2012/3/1 Glenn Adams <glenn at skynav.com>:

I'd like to plead for a solution rather like the one Java has, where strings are sequences of UTF-16 codes and there are specialized ways to iterate over them.  Looking at this entry from the Unicode FAQ: unicode.org/faq/char_combmark.html#7 there are different ways to describe the length (and iteration) of a string.  The BRS proposal favours #2, but I think for most applications utf-16-based-#1 is just fine, and for the applications that want to "do it right" #3 is almost always the correct solution.  Solution #3 needs library support in any case and has no problems with UTF-16.

The central point here is that there are combining characters (accents) that you can't just normalize away.  Getting them right has a lot of the same issues as surrogate pairs (you shouldn't normally chop them up, they count as one 'character', you can't tell how many of them there are in a string without looking, etc.).  If you can handle combining characters then the surrogate pair support falls out pretty much for free.

The problem here is that you are mixing apples and oranges. Although it may appear that surrogate pairs and grapheme clusters have features in common, they operate at different semantic levels entirely. A solution that attempts to conflate these two levels is going to cause problems at both levels. A distinction should be maintained between the following levels:

(1) encoding units (e.g., UTF-16 coding units)
(2) unicode scalar values (code points)
(3) grapheme clusters

This distinction is not lost on me.  I propose that random access indexing and .length in JS should work on level 1,

that's where we are today: indexing and length based on 16-bit code units (of a UTF-16 encoding, likewise with Java)

Not really for JS. Missing parts in the current UTF-16 support have been listed in this thread, e.g. in Norbert Lindenberg's 6-point prioritization list, which I replied to yesterday.

and there should be library support for levels 2 and 3.  In order of descending usefulness I think the order is 1, 3, 2.  Therefore I don't want to cause a lot of backwards compatibility headaches by prioritizing the efficient handling of level 2.

from a perspective of indexing "Unicode characters", level 2 is the correct place;

Yes, by definition.

level 3 is useful for higher level, language/locale sensitive text

No, the Unicode grapheme clustering algorithm is not locale or language sensitive unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

processing, but not particularly interesting at the basic ES string processing level; we aren't talking about (or IMO should not be talking about) a level 3 text processing library in this thread;

I will continue to feel free to talk about it as I believe that in the cases where just indexing by UTF-16 words is not sufficient it is normally level 3 that is the correct level. Also, I think there should be support for this level in JS as it is not locale-dependent.

# Allen Wirfs-Brock (12 years ago)

On Mar 1, 2012, at 11:09 PM, Norbert Lindenberg wrote:

Comments:

  1. In terms of the prioritization I suggested a few days ago esdiscuss/2012-February/020721 it seems you're considering item 6 essential, item 1 a side effect (whose consequences are not mentioned - see below), items 2-5 nice to have. Do I understand that correctly? What is this prioritization based on?

The main intent of this proposal was to push forward with including \u{ } in ES6, regardless of any other ongoing full Unicode related discussions we are having. Hopefully we can achieve more than that, but if we don't, the inclusion of \u{ } now should make it easier the next time we attack that problem, by reducing the use of \uxxxx\uxxxx pairs, which are ambiguous in intent. My expectation is that we would tell the world that \u{ } is the new \uxxxx\uxxxx and that they should avoid using the latter form to inject supplementary characters into strings (and RegExp).
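
In other words, a sketch of the intended equivalence (string semantics unchanged by this minimal proposal):

"\u{1F638}" === "\uD83D\uDE38"; // true: \u{ } is just the clearer spelling
"\u{1F638}".length === 2; // still a surrogate pair under the hood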

However, that usage depends upon the fact that today's implementations do generally allow supplementary characters to exist in ECMAScript source code and that they do something rational with them. ES5 botched this by saying that supplementary characters can't exist in ECMAScript source code, so we also need to fix that.

  2. The description of the current situation seems incorrect. The strawman says: "As currently specified by ES5.1, supplementary characters cannot be used in the source code of ECMAScript programs." I don't see anything in the spec saying this. To the contrary, the following statement in clause 6 of the spec opens the door to supplementary characters: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16." Actual source text outside of an ECMAScript runtime is rarely stored in streams of 16-bit code units; it's normally stored and transmitted in UTF-8 (including its subset ASCII) or some other single-byte or multi-byte character encoding. Interpreting source text therefore almost always requires conversion to UTF-16 as a first step. UTF-8 and several other encodings (GB18030, Big5-HKSCS, EUC-TW) can represent supplementary characters, and correct conversion to UTF-16 will convert them to surrogate pairs.

When I mentioned this before, you said that the intent of the ES5 wording was to keep ECMAScript limited to the BMP (the "UCS-2 world"). esdiscuss/2011-May/014337, esdiscuss/2011-May/014342 However, I don't see that intent reflected in the actual text of clause 6.

I have since also tested with supplementary characters in UTF-8 source text on a variety of current browsers (Safari / (Mac, iOS), (Firefox, Chrome, Opera) / (Mac, Windows), Explorer / Windows), and they all handle the conversion from UTF-8 to UTF-16 correctly. Do you know of one that doesn't? The only ECMAScript implementation I encountered that fails here is Node.js.

code.google.com/p/v8/issues/detail?id=761 suggests that V8 truncates supplementary characters rather than converting them to surrogate pairs. However, it is unclear whether that is referring to literal strings in the source code or only computationally generated strings.

In addition to plain text encoding in UTF-8, supplementary characters can also be represented in source code as a sequence of two Unicode escapes. It's not as convenient, but it works in all implementations I've tested, including Node.js.

the main problem is

  SourceCharacter :: any Unicode code unit

and "...the phrase 'code unit' and the word 'character' will be used to refer to a 16-bit unsigned value..."

All of the lexical rules in clause 7 are defined in terms of "characters" (i.e. code units). So, for example, a supplementary character in category Lo occurring in an Identifier context would, at best, be seen as a pair of code units, neither of which is in a category that is valid for IdentifierPart, so the identifier would be invalid. Similarly, a pair of \uXXXX escapes representing such a character would also be lex'ed as two distinct characters and result in an invalid identifier.
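
For example (a sketch; U+20BB7 is in category Lo):

// var 𠮷 = 1; // U+20BB7 arrives as the pair \uD842\uDFB7; neither code unit
//             // is a valid IdentifierPart, so ES5 lexing rejects the identifier
// var \uD842\uDFB7 = 1; // likewise invalid: two escapes lex as two "characters"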

Regarding the intent of the current wording, I was speaking of my intent when I was actually editing that text for the ES5 spec. My understanding at the time was that the lexical alphabet of ECMAScript was 16-bit code units and I was trying to clarify that, but I think I botched it. In reality, I think that understanding is actually still correct, in that there is nothing in the lexical grammar, as I noted in the previous paragraph, that deals with anything other than 16-bit code units. Any conversion from a non-16-bit character encoding is something that logically happens prior to processing as "ECMAScript source code".

  3. Changing the source code to be just a stream of Unicode characters seems a good idea overall. However, just changing the definition of SourceCharacter is going to break things. SourceCharacter isn't only used for source syntax and JSON syntax, where the change seems benign; it's also used to define the content of String values and the interpretation of regular expression patterns:
  • Subclause 7.8.4 contains the statements "The SV of DoubleStringCharacters :: DoubleStringCharacter is a sequence of one character, the CV of DoubleStringCharacter." and "The CV of DoubleStringCharacter :: SourceCharacter but not one of " or \ or LineTerminator is the SourceCharacter character itself." If SourceCharacter becomes a Unicode character, then this means coercing a 21-bit code point into a single 16-bit code unit, and that's not going to end well.
  • Subclauses 15.10.1 and 15.10.2 use SourceCharacter to define PatternCharacter, IdentityEscape, RegularExpressionNonTerminator, ClassAtomNoDash. While this could potentially be part of a set of changes to make regular expression correctly support full Unicode, by itself it means that 21-bit code points will be coerced into or compared against 16-bit code units. Changing regular expressions to be code-point based has some compatibility risk which we need to carefully evaluate.

Yes, but it isn't clear that it will change anything. We've just discussed that, in practice, JS implementations accept supplementary characters in string and RegExp literals. This proposal is saying that however implementations treat such characters, they must treat \u{ } characters in the same way.

The interesting thing about JSON and eval is that they take their input from actual JS strings rather than some abstract input source. The SourceCharacters they currently process correspond to single 16-bit string elements. Changing the grammar would change that correspondence unless we also change the semantics of string element values. This proposal leaves that issue for independent consideration.

  4. The statement about UnicodeEscapeSequence: "This production is limited to only expressing 16-bit code point values." is incorrect. Unicode escape sequences express 16-bit code units, not code points (remember that any use of the word "character" without the prefix "Unicode" in the spec after clause 6 means "16-bit code unit"). A supplementary character can be represented in source code as a sequence of two Unicode escapes. The proposed new Unicode escape syntax is more convenient and more legible, but doesn't provide new functionality.

As I said above, any such surrogate pairs aren't recognized by the grammar as Unicode characters. What I meant by the quoted phrase is something like "This production is limited to only expressing values in the 16-bit subset of code point values".

  5. I don't understand the sentence "For that reason, it is impossible to know for sure whether pairs of existing 16-bit Unicode escapes are intended to represent a single logical character or an explicit two character UTF-16 encoding of a Unicode characters." - what do you mean by "an explicit two character UTF-16 encoding of a Unicode characters"? In any case, it seems pretty clear to me that a Unicode escape for a high surrogate value followed by a Unicode escape for a low surrogate value, with the spec based on 16-bit values, means a surrogate pair representing a supplementary character. Even if the system were then changed to be 32-bit based, it's hard to imagine that the intent was to create a sequence of two invalid code points.

We don't know if the intent is to explicitly construct a UTF-16 encoded string that is to be passed to a consumer that demands UTF-16 encoding, or if the intent is simply to logically express a specific supplementary character in a context where the internal encoding isn't known or relevant. ES5 doesn't have a way to distinguish those two use cases.

# Glenn Adams (12 years ago)

On Fri, Mar 2, 2012 at 2:13 AM, Erik Corry <erik.corry at gmail.com> wrote:

level 3 is useful for higher level, language/locale sensitive text

No, the Unicode grapheme clustering algorithm is not locale or language sensitive unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

one final comment: the Unicode algorithm is intended to define default behavior only:

"This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments."

it specifically states that implementations should provide language/locale sensitive behavior;

in actual text processing usage, one needs the language/locale sensitive behavior in most cases, not a default behavior

G.

# Brendan Eich (12 years ago)

Glenn Adams wrote:

On Fri, Mar 2, 2012 at 2:13 AM, Erik Corry <erik.corry at gmail.com <mailto:erik.corry at gmail.com>> wrote:

> level 3 is useful for higher level, language/locale sensitive text

No, the Unicode grapheme clustering algorithm is not locale or
language sensitive
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

one final comment: the Unicode algorithm is intended to define default behavior only:

"This specification defines /default/ mechanisms; more sophisticated implementations can /and should/ tailor them for particular locales or environments."

it specifically states that implementations should provide language/locale sensitive behavior;

in actual text processing usage, one needs the language/locale sensitive behavior in most cases, not a default behavior

Right, and Gecko, WebKit, and other web rendering engines obviously need to care about this. But they're invariably implemented mostly or wholly not in JS.

It's a bit ambitious for JS to have such facilities.

I agree with Erik that the day may come, but ES6 is being prototyped and spec'ed and we need to be done in 2012 to have the spec ready for 2013. We should not overreach.