New full Unicode for ES6 idea
I'm not sure what to think about this, being a big fan of the UTF-8 simplicity. :) But anyhow, I like the idea of opt-in, actually so much that I started thinking, why not make JS be encoding-agnostic?
What I mean here is that maybe we could have multi-charset Strings in JS? This would be useful especially in server-side JS. So, what I'm suggesting is an extension to the String class, maybe defined as follows (just the first thing off the top of my head).
Let's say we have a loadFile function that takes a filename and reads its contents into a string, which it returns.
loadFile('my-utf8-file').charset === 'UTF-8'
String(loadFile('my-file'), 'UTF-16').charset === 'UTF-16'
loadFile('my-file').toString('UTF-9').charset === 'UTF-9'
32..toString(10, 'UTF-8').charset === 'UTF-8'
// And hence, we could add easy sugar to the function as well:
loadFile('my-utf8-file', 'UTF-16').charset === 'UTF-16';
What do you think?
Obviously this creates a lot of problems, but backwards compatibility could (maybe) be preserved without an opt-in, if the default charset stayed the same.
On Feb 19, 2012, at 9:33 , Brendan Eich wrote:
Instead of any such big new observables, I propose a so-called "Big Red [opt-in] Switch" (BRS) on the side of a unit of VM isolation: specifically the global object.
es-discuss-only idea: could that BRS be made to carry more weight? Could it be a switch for all breaking ES.next changes?
Do we know how many scripts actually rely on "\uXXXX15" to produce a string with length 3? Might it make more sense to put the new unicode escape under a different escape? Something like \e for "extended unicode" for example. Or is this "acceptable migration tax"...
On a side note, if we're going to do this, can we also have aliases in regex to parse certain Unicode categories? For instance, the ES spec defines "Uppercase Letter" (Lu), "Lowercase Letter" (Ll), "Titlecase Letter" (Lt), "Modifier Letter" (Lm), "Other Letter" (Lo), "Letter Number" (Nl), "Non-spacing Mark" (Mn), "Combining Spacing Mark" (Mc), "Decimal Number" (Nd) and "Connector Punctuation" (Pc) as possible identifier parts. But right now I have to go well out of my way (qfox.nl/notes/90) to generate these classes, and I end up with a 56k script that's almost pure regex.
This works and performance is amazingly fair, but it'd make more sense to be able to do \pLt or something, to parse any character in the "Titlecase letter" category. As far as I understand, these categories have to be known and supported anyways so these switches shouldn't cause too much trouble in that regard, at least.
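A sketch of the kind of category escape being asked for; the \p{...} form shown here is what JavaScript eventually standardized for regexes with the /u flag, and nothing like it existed when this was written:

var titlecase = /\p{Lt}/u;
titlecase.test('\u01C5');  // true  -- U+01C5 "Dž" is a Titlecase Letter
titlecase.test('A');       // false -- "A" is an Uppercase Letter, not Titlecase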
On a side note, if we're going to do this, can we also have aliases in regex to parse certain Unicode categories? For instance, the ES spec defines "Uppercase Letter" (Lu), "Lowercase Letter" (Ll), "Titlecase Letter" (Lt), "Modifier Letter" (Lm), "Other Letter" (Lo), "Letter Number" (Nl), "Non-spacing Mark" (Mn), "Combining Spacing Mark" (Mc), "Decimal Number" (Nd) and "Connector Punctuation" (Pc) as possible identifier parts. But right now I have to go well out of my way (qfox.nl/notes/90) to generate these classes, and I end up with a 56k script that's almost pure regex.
FWIW, it can be done in “just” 11,335 characters:
/^(?!(?:do|if|in|for|let|new|try|var|case|else|enum|false|null|this|true|void|with|break|catch|class|const|super|throw|while|yield|delete|export|import|public|return|static|switch|typeof|default|extends|finally|package|private|continue|debugger|function|interface|protected|implements|instanceof)$)[$A-Z_a-z\xaa\xb5\xba\xc0-\xd6\xd8-\xf6\xf8-\u02c1\u02c6-\u02d1\u02e0-\u02e4\u02ec\u02ee\u0370-\u0374\u0376-\u0377\u037a-\u037d\u0386\u0388-\u038a\u038c\u038e-\u03a1\u03a3-\u03f5\u03f7-\u0481\u048a-\u0527\u0531-\u0556\u0559\u0561-\u0587\u05d0-\u05ea\u05f0-\u05f2\u0620-\u064a\u066e-\u066f\u0671-\u06d3\u06d5\u06e5-\u06e6\u06ee-\u06ef\u06fa-\u06fc\u06ff\u0710\u0712-\u072f\u074d-\u07a5\u07b1\u07ca-\u07ea\u07f4-\u07f5\u07fa\u0800-\u0815\u081a\u0824\u0828\u0840-\u0858\u08a0\u08a2-\u08ac\u0904-\u0939\u093d\u0950\u0958-\u0961\u0971-\u0977\u0979-\u097f\u0985-\u098c\u098f-\u0990\u0993-\u09a8\u09aa-\u09b0\u09b2\u09b6-\u09b9\u09bd\u09ce\u09dc-\u09dd\u09df-\u09e1\u09f0-\u09f1\u0a05-\u0a0a\u0a0f-\u0a10\u0a13-\u0a28\u0a2a-\u0a30\u0a32-\u0a33\u0a35-\u0a36\u0a38-\u0a39\u0a59-\u0a5c\u0a5e\u0a72-\u0a74\u0a85-\u0a8d\u0a8f-\u0a91\u0a93-\u0aa8\u0aaa-\u0ab0\u0ab2-\u0ab3\u0ab5-\u0ab9\u0abd\u0ad0\u0ae0-\u0ae1\u0b05-\u0b0c\u0b0f-\u0b10\u0b13-\u0b28\u0b2a-\u0b30\u0b32-\u0b33\u0b35-\u0b39\u0b3d\u0b5c-\u0b5d\u0b5f-\u0b61\u0b71\u0b83\u0b85-\u0b8a\u0b8e-\u0b90\u0b92-\u0b95\u0b99-\u0b9a\u0b9c\u0b9e-\u0b9f\u0ba3-\u0ba4\u0ba8-\u0baa\u0bae-\u0bb9\u0bd0\u0c05-\u0c0c\u0c0e-\u0c10\u0c12-\u0c28\u0c2a-\u0c33\u0c35-\u0c39\u0c3d\u0c58-\u0c59\u0c60-\u0c61\u0c85-\u0c8c\u0c8e-\u0c90\u0c92-\u0ca8\u0caa-\u0cb3\u0cb5-\u0cb9\u0cbd\u0cde\u0ce0-\u0ce1\u0cf1-\u0cf2\u0d05-\u0d0c\u0d0e-\u0d10\u0d12-\u0d3a\u0d3d\u0d4e\u0d60-\u0d61\u0d7a-\u0d7f\u0d85-\u0d96\u0d9a-\u0db1\u0db3-\u0dbb\u0dbd\u0dc0-\u0dc6\u0e01-\u0e30\u0e32-\u0e33\u0e40-\u0e46\u0e81-\u0e82\u0e84\u0e87-\u0e88\u0e8a\u0e8d\u0e94-\u0e97\u0e99-\u0e9f\u0ea1-\u0ea3\u0ea5\u0ea7\u0eaa-\u0eab\u0ead-\u0eb0\u0eb2-\u0eb3\u0ebd\u0ec0-\u0ec4\u0ec6\u0edc-\u0edf\u0f00\u0f40-\u0f47\u0f49-\u0f6c\u0f88-\u0f8c\u1000-\u102a\u103f\u1050-\u1055\u105a-\u105d\u1061\u1065-\u1066\u106e-\u1070\u1075-\u1081\u108e\u10a0-\u10c5\u10c7\u10cd\u10d0-\u10fa\u10fc-\u1248\u124a-\u124d\u1250-\u1256\u1258\u125a-\u125d\u1260-\u1288\u128a-\u128d\u1290-\u12b0\u12b2-\u12b5\u12b8-\u12be\u12c0\u12c2-\u12c5\u12c8-\u12d6\u12d8-\u1310\u1312-\u1315\u1318-\u135a\u1380-\u138f\u13a0-\u13f4\u1401-\u166c\u166f-\u167f\u1681-\u169a\u16a0-\u16ea\u16ee-\u16f0\u1700-\u170c\u170e-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176c\u176e-\u1770\u1780-\u17b3\u17d7\u17dc\u1820-\u1877\u1880-\u18a8\u18aa\u18b0-\u18f5\u1900-\u191c\u1950-\u196d\u1970-\u1974\u1980-\u19ab\u19c1-\u19c7\u1a00-\u1a16\u1a20-\u1a54\u1aa7\u1b05-\u1b33\u1b45-\u1b4b\u1b83-\u1ba0\u1bae-\u1baf\u1bba-\u1be5\u1c00-\u1c23\u1c4d-\u1c4f\u1c5a-\u1c7d\u1ce9-\u1cec\u1cee-\u1cf1\u1cf5-\u1cf6\u1d00-\u1dbf\u1e00-\u1f15\u1f18-\u1f1d\u1f20-\u1f45\u1f48-\u1f4d\u1f50-\u1f57\u1f59\u1f5b\u1f5d\u1f5f-\u1f7d\u1f80-\u1fb4\u1fb6-\u1fbc\u1fbe\u1fc2-\u1fc4\u1fc6-\u1fcc\u1fd0-\u1fd3\u1fd6-\u1fdb\u1fe0-\u1fec\u1ff2-\u1ff4\u1ff6-\u1ffc\u2071\u207f\u2090-\u209c\u2102\u2107\u210a-\u2113\u2115\u2119-\u211d\u2124\u2126\u2128\u212a-\u212d\u212f-\u2139\u213c-\u213f\u2145-\u2149\u214e\u2160-\u2188\u2c00-\u2c2e\u2c30-\u2c5e\u2c60-\u2ce4\u2ceb-\u2cee\u2cf2-\u2cf3\u2d00-\u2d25\u2d27\u2d2d\u2d30-\u2d67\u2d6f\u2d80-\u2d96\u2da0-\u2da6\u2da8-\u2dae\u2db0-\u2db6\u2db8-\u2dbe\u2dc0-\u2dc6\u2dc8-\u2dce\u2dd0-\u2dd6\u2dd8-\u2dde\u2e2f\u3005-\u3007\u3021-\u3029\u3031-\u3035\u3038-\u303c\u3041-\u3096\u309d-\u309f\u30a1-\u30fa\u30fc-\u30
ff\u3105-\u312d\u3131-\u318e\u31a0-\u31ba\u31f0-\u31ff\u3400-\u4db5\u4e00-\u9fcc\ua000-\ua48c\ua4d0-\ua4fd\ua500-\ua60c\ua610-\ua61f\ua62a-\ua62b\ua640-\ua66e\ua67f-\ua697\ua6a0-\ua6ef\ua717-\ua71f\ua722-\ua788\ua78b-\ua78e\ua790-\ua793\ua7a0-\ua7aa\ua7f8-\ua801\ua803-\ua805\ua807-\ua80a\ua80c-\ua822\ua840-\ua873\ua882-\ua8b3\ua8f2-\ua8f7\ua8fb\ua90a-\ua925\ua930-\ua946\ua960-\ua97c\ua984-\ua9b2\ua9cf\uaa00-\uaa28\uaa40-\uaa42\uaa44-\uaa4b\uaa60-\uaa76\uaa7a\uaa80-\uaaaf\uaab1\uaab5-\uaab6\uaab9-\uaabd\uaac0\uaac2\uaadb-\uaadd\uaae0-\uaaea\uaaf2-\uaaf4\uab01-\uab06\uab09-\uab0e\uab11-\uab16\uab20-\uab26\uab28-\uab2e\uabc0-\uabe2\uac00-\ud7a3\ud7b0-\ud7c6\ud7cb-\ud7fb\uf900-\ufa6d\ufa70-\ufad9\ufb00-\ufb06\ufb13-\ufb17\ufb1d\ufb1f-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb3e\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\uff21-\uff3a\uff41-\uff5a\uff66-\uffbe\uffc2-\uffc7\uffca-\uffcf\uffd2-\uffd7\uffda-\uffdc][$A-Z_a-z\xaa\xb5\xba\xc0-\xd6\xd8-\xf6\xf8-\u02c1\u02c6-\u02d1\u02e0-\u02e4\u02ec\u02ee\u0370-\u0374\u0376-\u0377\u037a-\u037d\u0386\u0388-\u038a\u038c\u038e-\u03a1\u03a3-\u03f5\u03f7-\u0481\u048a-\u0527\u0531-\u0556\u0559\u0561-\u0587\u05d0-\u05ea\u05f0-\u05f2\u0620-\u064a\u066e-\u066f\u0671-\u06d3\u06d5\u06e5-\u06e6\u06ee-\u06ef\u06fa-\u06fc\u06ff\u0710\u0712-\u072f\u074d-\u07a5\u07b1\u07ca-\u07ea\u07f4-\u07f5\u07fa\u0800-\u0815\u081a\u0824\u0828\u0840-\u0858\u08a0\u08a2-\u08ac\u0904-\u0939\u093d\u0950\u0958-\u0961\u0971-\u0977\u0979-\u097f\u0985-\u098c\u098f-\u0990\u0993-\u09a8\u09aa-\u09b0\u09b2\u09b6-\u09b9\u09bd\u09ce\u09dc-\u09dd\u09df-\u09e1\u09f0-\u09f1\u0a05-\u0a0a\u0a0f-\u0a10\u0a13-\u0a28\u0a2a-\u0a30\u0a32-\u0a33\u0a35-\u0a36\u0a38-\u0a39\u0a59-\u0a5c\u0a5e\u0a72-\u0a74\u0a85-\u0a8d\u0a8f-\u0a91\u0a93-\u0aa8\u0aaa-\u0ab0\u0ab2-\u0ab3\u0ab5-\u0ab9\u0abd\u0ad0\u0ae0-\u0ae1\u0b05-\u0b0c\u0b0f-\u0b10\u0b13-\u0b28\u0b2a-\u0b30\u0b32-\u0b33\u0b35-\u0b39\u0b3d\u0b5c-\u0b5d\u0b5f-\u0b61\u0b71\u0b83\u0b85-\u0b8a\u0b8e-\u0b90\u0b92-\u0b95\u0b99-\u0b9a\u0b9c\u0b9e-\u0b9f\u0ba3-\u0ba4\u0ba8-\u0baa\u0bae-\u0bb9\u0bd0\u0c05-\u0c0c\u0c0e-\u0c10\u0c12-\u0c28\u0c2a-\u0c33\u0c35-\u0c39\u0c3d\u0c58-\u0c59\u0c60-\u0c61\u0c85-\u0c8c\u0c8e-\u0c90\u0c92-\u0ca8\u0caa-\u0cb3\u0cb5-\u0cb9\u0cbd\u0cde\u0ce0-\u0ce1\u0cf1-\u0cf2\u0d05-\u0d0c\u0d0e-\u0d10\u0d12-\u0d3a\u0d3d\u0d4e\u0d60-\u0d61\u0d7a-\u0d7f\u0d85-\u0d96\u0d9a-\u0db1\u0db3-\u0dbb\u0dbd\u0dc0-\u0dc6\u0e01-\u0e30\u0e32-\u0e33\u0e40-\u0e46\u0e81-\u0e82\u0e84\u0e87-\u0e88\u0e8a\u0e8d\u0e94-\u0e97\u0e99-\u0e9f\u0ea1-\u0ea3\u0ea5\u0ea7\u0eaa-\u0eab\u0ead-\u0eb0\u0eb2-\u0eb3\u0ebd\u0ec0-\u0ec4\u0ec6\u0edc-\u0edf\u0f00\u0f40-\u0f47\u0f49-\u0f6c\u0f88-\u0f8c\u1000-\u102a\u103f\u1050-\u1055\u105a-\u105d\u1061\u1065-\u1066\u106e-\u1070\u1075-\u1081\u108e\u10a0-\u10c5\u10c7\u10cd\u10d0-\u10fa\u10fc-\u1248\u124a-\u124d\u1250-\u1256\u1258\u125a-\u125d\u1260-\u1288\u128a-\u128d\u1290-\u12b0\u12b2-\u12b5\u12b8-\u12be\u12c0\u12c2-\u12c5\u12c8-\u12d6\u12d8-\u1310\u1312-\u1315\u1318-\u135a\u1380-\u138f\u13a0-\u13f4\u1401-\u166c\u166f-\u167f\u1681-\u169a\u16a0-\u16ea\u16ee-\u16f0\u1700-\u170c\u170e-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176c\u176e-\u1770\u1780-\u17b3\u17d7\u17dc\u1820-\u1877\u1880-\u18a8\u18aa\u18b0-\u18f5\u1900-\u191c\u1950-\u196d\u1970-\u1974\u1980-\u19ab\u19c1-\u19c7\u1a00-\u1a16\u1a20-\u1a54\u1aa7\u1b05-\u1b33\u1b45-\u1b4b\u1b83-\u1ba0\u1bae-\u1baf\u1bba-\u1be5\u1c00-\u1c23\u1c4d-\u1c4f\u1c5a-\u1c7d\u1ce9-\u1cec\u1cee-\u1cf1\u1cf5-\u1cf6\u1d00-\u1dbf\u1e00-\u1f15\u1f18-\u
1f1d\u1f20-\u1f45\u1f48-\u1f4d\u1f50-\u1f57\u1f59\u1f5b\u1f5d\u1f5f-\u1f7d\u1f80-\u1fb4\u1fb6-\u1fbc\u1fbe\u1fc2-\u1fc4\u1fc6-\u1fcc\u1fd0-\u1fd3\u1fd6-\u1fdb\u1fe0-\u1fec\u1ff2-\u1ff4\u1ff6-\u1ffc\u2071\u207f\u2090-\u209c\u2102\u2107\u210a-\u2113\u2115\u2119-\u211d\u2124\u2126\u2128\u212a-\u212d\u212f-\u2139\u213c-\u213f\u2145-\u2149\u214e\u2160-\u2188\u2c00-\u2c2e\u2c30-\u2c5e\u2c60-\u2ce4\u2ceb-\u2cee\u2cf2-\u2cf3\u2d00-\u2d25\u2d27\u2d2d\u2d30-\u2d67\u2d6f\u2d80-\u2d96\u2da0-\u2da6\u2da8-\u2dae\u2db0-\u2db6\u2db8-\u2dbe\u2dc0-\u2dc6\u2dc8-\u2dce\u2dd0-\u2dd6\u2dd8-\u2dde\u2e2f\u3005-\u3007\u3021-\u3029\u3031-\u3035\u3038-\u303c\u3041-\u3096\u309d-\u309f\u30a1-\u30fa\u30fc-\u30ff\u3105-\u312d\u3131-\u318e\u31a0-\u31ba\u31f0-\u31ff\u3400-\u4db5\u4e00-\u9fcc\ua000-\ua48c\ua4d0-\ua4fd\ua500-\ua60c\ua610-\ua61f\ua62a-\ua62b\ua640-\ua66e\ua67f-\ua697\ua6a0-\ua6ef\ua717-\ua71f\ua722-\ua788\ua78b-\ua78e\ua790-\ua793\ua7a0-\ua7aa\ua7f8-\ua801\ua803-\ua805\ua807-\ua80a\ua80c-\ua822\ua840-\ua873\ua882-\ua8b3\ua8f2-\ua8f7\ua8fb\ua90a-\ua925\ua930-\ua946\ua960-\ua97c\ua984-\ua9b2\ua9cf\uaa00-\uaa28\uaa40-\uaa42\uaa44-\uaa4b\uaa60-\uaa76\uaa7a\uaa80-\uaaaf\uaab1\uaab5-\uaab6\uaab9-\uaabd\uaac0\uaac2\uaadb-\uaadd\uaae0-\uaaea\uaaf2-\uaaf4\uab01-\uab06\uab09-\uab0e\uab11-\uab16\uab20-\uab26\uab28-\uab2e\uabc0-\uabe2\uac00-\ud7a3\ud7b0-\ud7c6\ud7cb-\ud7fb\uf900-\ufa6d\ufa70-\ufad9\ufb00-\ufb06\ufb13-\ufb17\ufb1d\ufb1f-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb3e\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\uff21-\uff3a\uff41-\uff5a\uff66-\uffbe\uffc2-\uffc7\uffca-\uffcf\uffd2-\uffd7\uffda-\uffdc0-9\u0300-\u036f\u0483-\u0487\u0591-\u05bd\u05bf\u05c1-\u05c2\u05c4-\u05c5\u05c7\u0610-\u061a\u064b-\u0669\u0670\u06d6-\u06dc\u06df-\u06e4\u06e7-\u06e8\u06ea-\u06ed\u06f0-\u06f9\u0711\u0730-\u074a\u07a6-\u07b0\u07c0-\u07c9\u07eb-\u07f3\u0816-\u0819\u081b-\u0823\u0825-\u0827\u0829-\u082d\u0859-\u085b\u08e4-\u08fe\u0900-\u0903\u093a-\u093c\u093e-\u094f\u0951-\u0957\u0962-\u0963\u0966-\u096f\u0981-\u0983\u09bc\u09be-\u09c4\u09c7-\u09c8\u09cb-\u09cd\u09d7\u09e2-\u09e3\u09e6-\u09ef\u0a01-\u0a03\u0a3c\u0a3e-\u0a42\u0a47-\u0a48\u0a4b-\u0a4d\u0a51\u0a66-\u0a71\u0a75\u0a81-\u0a83\u0abc\u0abe-\u0ac5\u0ac7-\u0ac9\u0acb-\u0acd\u0ae2-\u0ae3\u0ae6-\u0aef\u0b01-\u0b03\u0b3c\u0b3e-\u0b44\u0b47-\u0b48\u0b4b-\u0b4d\u0b56-\u0b57\u0b62-\u0b63\u0b66-\u0b6f\u0b82\u0bbe-\u0bc2\u0bc6-\u0bc8\u0bca-\u0bcd\u0bd7\u0be6-\u0bef\u0c01-\u0c03\u0c3e-\u0c44\u0c46-\u0c48\u0c4a-\u0c4d\u0c55-\u0c56\u0c62-\u0c63\u0c66-\u0c6f\u0c82-\u0c83\u0cbc\u0cbe-\u0cc4\u0cc6-\u0cc8\u0cca-\u0ccd\u0cd5-\u0cd6\u0ce2-\u0ce3\u0ce6-\u0cef\u0d02-\u0d03\u0d3e-\u0d44\u0d46-\u0d48\u0d4a-\u0d4d\u0d57\u0d62-\u0d63\u0d66-\u0d6f\u0d82-\u0d83\u0dca\u0dcf-\u0dd4\u0dd6\u0dd8-\u0ddf\u0df2-\u0df3\u0e31\u0e34-\u0e3a\u0e47-\u0e4e\u0e50-\u0e59\u0eb1\u0eb4-\u0eb9\u0ebb-\u0ebc\u0ec8-\u0ecd\u0ed0-\u0ed9\u0f18-\u0f19\u0f20-\u0f29\u0f35\u0f37\u0f39\u0f3e-\u0f3f\u0f71-\u0f84\u0f86-\u0f87\u0f8d-\u0f97\u0f99-\u0fbc\u0fc6\u102b-\u103e\u1040-\u1049\u1056-\u1059\u105e-\u1060\u1062-\u1064\u1067-\u106d\u1071-\u1074\u1082-\u108d\u108f-\u109d\u135d-\u135f\u1712-\u1714\u1732-\u1734\u1752-\u1753\u1772-\u1773\u17b4-\u17d3\u17dd\u17e0-\u17e9\u180b-\u180d\u1810-\u1819\u18a9\u1920-\u192b\u1930-\u193b\u1946-\u194f\u19b0-\u19c0\u19c8-\u19c9\u19d0-\u19d9\u1a17-\u1a1b\u1a55-\u1a5e\u1a60-\u1a7c\u1a7f-\u1a89\u1a90-\u1a99\u1b00-\u1b04\u1b34-\u1b44\u1b50-\u1b59\u1b6b-\u1b73\u1b80-\u1b82\u1ba1-\u1bad\u1bb0-\u1bb9\u1be6-\u1bf3\u1c24-\u1c37\u1c40-\u1c
49\u1c50-\u1c59\u1cd0-\u1cd2\u1cd4-\u1ce8\u1ced\u1cf2-\u1cf4\u1dc0-\u1de6\u1dfc-\u1dff\u200c-\u200d\u203f-\u2040\u2054\u20d0-\u20dc\u20e1\u20e5-\u20f0\u2cef-\u2cf1\u2d7f\u2de0-\u2dff\u302a-\u302f\u3099-\u309a\ua620-\ua629\ua66f\ua674-\ua67d\ua69f\ua6f0-\ua6f1\ua802\ua806\ua80b\ua823-\ua827\ua880-\ua881\ua8b4-\ua8c4\ua8d0-\ua8d9\ua8e0-\ua8f1\ua900-\ua909\ua926-\ua92d\ua947-\ua953\ua980-\ua983\ua9b3-\ua9c0\ua9d0-\ua9d9\uaa29-\uaa36\uaa43\uaa4c-\uaa4d\uaa50-\uaa59\uaa7b\uaab0\uaab2-\uaab4\uaab7-\uaab8\uaabe-\uaabf\uaac1\uaaeb-\uaaef\uaaf5-\uaaf6\uabe3-\uabea\uabec-\uabed\uabf0-\uabf9\ufb1e\ufe00-\ufe0f\ufe20-\ufe26\ufe33-\ufe34\ufe4d-\ufe4f\uff10-\uff19\uff3f]*$/
Note that this regex includes a check for ES5.1 reserved words, which aren’t allowed as identifiers. (I wrote this while working on mathiasbynens.be/notes/javascript-identifiers and mothereff.in/js-variables.)
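A minimal usage sketch, assuming the regex above has been assigned to a variable (the name is made up for illustration):

// isValidIdentifierName is assumed to hold the regex quoted above.
isValidIdentifierName.test('foo');   // true
isValidIdentifierName.test('var');   // false -- reserved word
isValidIdentifierName.test('π');     // true  -- U+03C0 is a Lowercase Letter
isValidIdentifierName.test('1abc');  // false -- cannot start with a digit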
But yeah, supporting Unicode categories in JS regex would be a very useful addition.
On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich <brendan at mozilla.com> wrote: [...]
Why the global object? Because for many VMs, each global has its own heap or sub-heap ("compartment"), and all references outside that heap are to local proxies that copy from, or in the case of immutable data, reference the remote heap.
[...]
Is this true for same origin iframes? I have always assumed that mixing heaps between same origin iframes results in unmediated direct object-to-object access. If these are already mediated, what was the issue that drove us to that?
On Sun, Feb 19, 2012 at 12:12 PM, Mark S. Miller <erights at google.com> wrote:
On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich <brendan at mozilla.com> wrote: [...]
Why the global object? Because for many VMs, each global has its own heap or sub-heap ("compartment"), and all references outside that heap are to local proxies that copy from, or in the case of immutable data, reference the remote heap.
[...]
Is this true for same origin iframes? I have always assumed that mixing heaps between same origin iframes results in unmediated direct object-to-object access. If these are already mediated, what was the issue that drove us to that?
In V8, same origin contexts (or really, any contexts that might communicate in any way) live in the same heap. Originally, that meant anything running in the same process was in the same heap, but with isolates, there can be more heaps in the same process. You can still determine the origin of an object, to do any necessary security checks, but references to "foreign" objects are always plain pointers into the same heap.
If I have understood the description correctly, I believe Opera merges heaps from different frames if they start communicating, effectively putting them in the same heap. my.opera.com/core/blog/2009/12/22/carakan-revisited
On 19 February 2012 03:33, Brendan Eich <brendan at mozilla.com> wrote:
ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say ;-).
Say, is that an onion on your belt?
- indexing by characters, not uint16 storage units;
- counting length as one greater than the last index; and
These are the two items that IME trip up developers who are either not careful or not aware of UTF-16 encoding details and don't test with non-BMP input. Frankly, JS developers should not have to be aware of character encodings. Strings should "just work".
I think that explicitly making strings Unicode and applying the fix above would solve a lot of problems. If I had this option, I would go so far as to throw the BRS in my build processes, hg grep all our source code for strings like D800 and eliminate all the extra UTF-16 machinations.
Another option might be to make ES.next have full Unicode strings; fix .length and .charCodeAt etc when we are in ES.next context, leaving them "broken" otherwise. I'm not fond of this option, though: since there would be no BRS, developers might often find themselves unsure of just what the heck it is they are working with.
So, I like per-global BRS.
- supporting escapes with (up to) six hexadecimal digits.
This is necessary too; developers should be thinking about code points, not encoding details.
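For concreteness, a sketch of today's behavior versus what the proposal describes; the post-BRS results are my reading of the proposal, not output from any implementation:

var s = '\uD834\uDD1E';   // U+1D11E MUSICAL SYMBOL G CLEF, written as a surrogate pair

// Today (BRS off, '90s semantics):
s.length;                 // 2
s.charCodeAt(0);          // 0xD834
s.charCodeAt(1);          // 0xDD1E

// With the BRS thrown ("full Unicode"), per the proposal: s.length would be 1,
// s.charCodeAt(0) would be 0x1D11E, and the literal could be written '\u{1D11E}'.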
P2. The change is not backward compatible. In JS today, one could read a string s from somewhere and hard-code, e.g., s.indexOf("\uD800") to find part of a surrogate pair, then advance to the next-indexed uint16 unit and read the other half, then combine to compute some result. Such usage would break.
While that is true in the general case, there are many specific cases where that would not break. I'm thinking I have an implementation of UnicodeStrlen around here somewhere which works by subtracting the number of 0xD800 characters from .length. In this case, that code would continue to generate correct length counts because it would never find a 0xD800 in a valid Unicode string (it's a reserved code point).
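A minimal sketch of such a helper, generalized (my assumption) to count the whole high-surrogate range rather than only 0xD800:

// Each valid surrogate pair contributes one high surrogate and two uint16
// units, so subtracting the high-surrogate count yields the code point count.
function unicodeStrlen(s) {
  var highSurrogates = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) highSurrogates++;
  }
  return s.length - highSurrogates;
}

unicodeStrlen('abc');            // 3
unicodeStrlen('\uD834\uDD1E');   // 1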
We also wish to avoid exposing a "full Unicode" representation type and duplicated suite of the String static and prototype methods, as Java did. (We may well want UTF-N transcoding helpers; we certainly want ByteArray <-> UTF-8 transcoding APIs.)
These are both good goals, in particular, avoiding a "full Unicode" type means reducing bug counts in the long term.
Is there a proposal for interaction with JSON?
Also because inter-compartment traffic is (we conjecture) infrequent enough to tolerate the proxy/copy overhead.
Not to mention that the only thing you'd have to do is to tweak [[get]], charCodeAt and .length when crossing boundaries; you can keep the same backing store.
You might not even need to do this if the engine keeps the same backing store for both kinds of strings.
This means a script intent on comparing strings from two globals with different BRS settings could indeed tell that one discloses non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This is the small new observable I claim we can live with, because someone opted into it at least in one of the related global objects.
Funny question, if I have two strings, both "hello", from two globals with different BRS settings, are they ==? How about ===?
R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls from JS to (typically) C++ would have to proxy or copy any strings containing non-BMP characters. Strings with only BMP characters would work as today.
Is that true if the "full unicode" backing store is 16-bit code units using UTF-16 encoding? (Anyway, it's an implementation detail.)
In particular, Node.js can get modern at startup, and perhaps engines such as V8 as used in Node could even support compile-time (#ifdef) configury by which to support only full Unicode.
Sure, this is analogous to how SpiderMonkey deals with UTF-8 C Strings. Flip a BRS before creating the runtime. :)
Jussi Kalliokoski wrote:
I'm not sure what to think about this, being a big fan of the UTF-8 simplicity. :)
UTF-8 is great, but it's a transfer format, perfect for C and other such systems languages (especially ones that use byte-wide char from old days). It is not appropriate for JS, which gives users a "One True String" (sorry for caps) primitive type that has higher-level "just Unicode" semantics. Alas, JS's "just Unicode" was from '96.
There are lots of transfer formats and character set encodings. Implementations could use many, depending on what chars a given string uses. E.g. ASCII + UTF-16, UTF-8 only as you suggest, other combinations. But this would all be under the hood, and at some cost to the engine as well as some potential (space, mostly) savings.
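A toy sketch of that under-the-hood choice, written in JS only for illustration; a real engine would do this in its native string implementation, and the 1/2/4-byte buckets are just one plausible carving:

// Pick the narrowest per-element storage a string's contents allow.
function storageWidthFor(codePoints) {
  var max = 0;
  for (var i = 0; i < codePoints.length; i++) {
    if (codePoints[i] > max) max = codePoints[i];
  }
  if (max <= 0xFF) return 1;    // Latin-1 range: one byte per element
  if (max <= 0xFFFF) return 2;  // BMP only: two bytes per element
  return 4;                     // non-BMP present: four bytes (or UTF-16 pairs)
}

storageWidthFor([0x68, 0x69]);  // 1  ("hi")
storageWidthFor([0x3C0]);       // 2  ("π")
storageWidthFor([0x1D11E]);     // 4  (G clef)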
But anyhow, I like the idea of opt-in, actually so much that I started thinking, why not make JS be encoding-agnostic?
That is precisely the idea. Setting the BRS to "full Unicode" gives the appearance of 21 bits per character via indexing and length accounting. You'd have to spell non-BMP literal escapes via "\u{...}", no big deal.
What I mean here is that maybe we could have multi-charset Strings in JS?
Now you're saying something else. Having one agnostic higher-level "just Unicode" string type is one thing. That's JS's design goal, always has been. It does not imply adding multiple observable CSEs or UTFs that break the "just Unicode" abstraction.
If you can put a JS string in memory for low-level systems languages such as C to view, of course there are abstraction breaks. Engine APIs may or may not allow such views for optimizations. This is an issue, for sure, when embedding (e.g. V8 in Node). It's not a language design issue, though, and I'm focused on observables in the language because that is where JS currently fails by livin' in the '90s.
Axel Rauschmayer wrote:
es-discuss-only idea: could that BRS be made to carry more weight? Could it be a switch for all breaking ES.next changes?
What do you have in mind? It had better be important. We just had the breakthrough championed by dherman for "One JavaScript". Why make trouble by adding runtime semantic changes unduly?
es-discuss-only idea: could that BRS be made to carry more weight? Could it be a switch for all breaking ES.next changes?
What do you have in mind? It had better be important. We just had the breakthrough championed by dherman for "One JavaScript". Why make trouble by adding runtime semantic changes unduly?
Two points:
- IIRC, attributes such as onclick are not yet solved, the ideas proposed sounded like a BRS, so maybe the two solutions can be combined.
- If we keep in the back of our minds the possibility of using the BRS as ES6 opt-in, while going forward with "One JavaScript" (1JS), both approaches can be tested in the wild. I’m mostly sold on 1JS, but a few doubts remain; trying out the BRS "clean cut" solution for Unicode will either allay or confirm those doubts.
Axel Rauschmayer wrote:
es-discuss-only idea: could that BRS be made to carry more weight? Could it be a switch for all breaking ES.next changes?
What do you have in mind? It had better be important. We just had the breakthrough championed by dherman for "One JavaScript". Why make trouble by adding runtime semantic changes unduly?
Two points:
- IIRC, attributes such as onclick are not yet solved, the ideas proposed sounded like a BRS, so maybe the two solutions can be combined.
"One JavaScript" means there's nothing to solve for event handlers. (Previously, with version opt-in, the solution was Content-Script- Type [1], via a HTTP header or <meta http-equiv>).
Could you state the problem with an example?
Perhaps you mean the issue of 'let' at top level in prior scripts (or 'const'). I think we're all agreed that such bindings (as well as 'module' bindings at top level) must be visible in event handlers.
- If we keep in the back of our minds the possibility of using the BRS as ES6 opt-in, while going forward with "One JavaScript" (1JS), both approaches can be tested in the wild. I’m mostly sold on 1JS, but a few doubts remain,
Namely?
We have to test whether extensions such as const and function-in-block can be broken, but Apple and Mozilla (at least) are covering that.
We shouldn't over-hedge or try doing two things less well, instead of one thing well.
trying out the BRS "clean cut" solution for Unicode will either allay or confirm those doubts.
Adding a BRS and then starting to hang other hats on it is a design mistake. When in doubt, leave it out. This proposal is only for Unicode. Of course we can consider other uses if the need truly arises, but we should not go looking for them right now.
/be
Mark S. Miller wrote:
On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote: [...]
Why the global object? Because for many VMs, each global has its own heap or sub-heap ("compartment"), and all references outside that heap are to local proxies that copy from, or in the case of immutable data, reference the remote heap.
[...]
Is this true for same origin iframes? I have always assumed that mixing heaps between same origin iframes results in unmediated direct object-to-object access. If these are already mediated, what was the issue that drove us to that?
Not all engines mediate cross-same-origin-window accesses. I hear IE9+ may, indeed rumor is it remotes to another process sometimes (breaking run-to-completion a bit; something we should explore breaking in the future for window=vat). SpiderMonkey just recently (not sure if this is in a Firefox channel yet) went to compartment per global, for good savings once things were refactored to maximize sharing of internal immutables.
My R2 resolution is not specific to any engine, but I have hopes it can be accepted. It is concrete enough to help overcome large-yet-vague doubts about implementation impact (at least IMHO). Recall that document.domain setting may have to split a merged same-origin window/frame graph, at any time. Again implementation solutions vary, but this suggests cross-window mediation can be interposed lazily.
Another point: HTML5 WindowProxy (vs. Window, the global object on the scope chain) exists to solve navigation-away-from-same-origin security problems. Any JS that passes strings from one window to another must be using a WindowProxy to reference the other. There's a mediation point too.
Brendan Eich wrote:
My R2 resolution is not specific to any engine, but I have hopes it can be accepted. It is concrete enough to help overcome large-yet-vague doubts about implementation impact (at least IMHO). Recall that document.domain setting may have to split a merged same-origin window/frame graph, at any time. Again implementation solutions vary, but this suggests cross-window mediation can be interposed lazily.
Another point: HTML5 WindowProxy (vs. Window, the global object on the scope chain) exists to solve navigation-away-from-same-origin security problems. Any JS that passes strings from one window to another must be using a WindowProxy to reference the other. There's a mediation point too.
IOW, whether there's a heap/compartment per global is not critical (but if there is, then strings are already copied and could be transcoded as to their meta-data distinguishing non-BMP+UTF16-aka-full-Unicode from bad-ol'-90s-UCS2). The cross-window mediation via WindowProxy is.
The flag bits I mentioned could be combined: 1 flag bit for both non-BMP-chars-in-this-string-and-full-Unicode-BRS-setting-in-effect-for-its-global.
On 19/02/2012 09:33, Brendan Eich wrote:
(...)
How is the BRS configured? Again, not via a pragma, and not by imperative state update inside the language (mutating hidden BRS state at a given program point could leave strings created before mutation observably different from those created after, unless the implementation in effect scanned the local heap and wrapped or copied any non-BMP-char-bearing ones created before).
The obvious way to express the BRS in HTML is a <meta> tag in document <head>, but I don't want to get hung up on this point. I do welcome expert guidance. Here is another W3C/WHATWG interaction point. For this reason I'm cc'ing public-script-coord.
I'm not sure a <meta> is that obvious of a choice.
A bit of related experience about <meta>s:
At the end of 2011, I started a thread [1] about the meta referrer [2]. One use case for it would be to set the value to "never", declaring that, document-wide, no HTTP "referer" header should be sent when downloading a script/stylesheet/image or clicking a link/posting a form. An intervention by Simon Pieters [3] mentioned speculative parsing and the fact that resources may be fetched before the meta is read, hence leaking the referer while the intention of the author might have been that there should be no leak. Since there seems to be no satisfying HTML-based solution for this, I suggested [4] that the server, when delivering the document, should express how the document should behave regarding sending referer headers. The discussion ended with Adam Barth agreeing [5] and planning to propose this for CSP 1.1 (that's how I learned about CSP [6]).
Unless I'm missing something, I think the same discussion can be had about the BRS being declared as a <meta>. Consider:
<script>
  // some code that can observe the difference between BRS mode and non-BRS
</script>
<meta BRS>
Should the browser read all <meta>s before executing any script? Worse: what if an inline script does "document.write('<meta BRS>')"?
I think a CSP-like solution should be explored.
As a side note, after some time studying the Content Security Policy (CSP), I came to realize that it doesn't have to be related to security (though that's what motivated it in the first place) and could be considered a "Content Delivery Policy", offering space to break semantics and repair things that would be worth the cost of the opt-in (like the script execution rules or when the referer header is sent).
Worth exploring for the BRS or certainly a lot of other things.
David
[1] lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034275.html [2] wiki.whatwg.org/index.php?title=Meta_referrer [3] lists.whatwg.org/pipermail/whatwg-whatwg.org/2012-January/034283.html [4] lists.whatwg.org/pipermail/whatwg-whatwg.org/2012-January/034522.html [5] lists.whatwg.org/pipermail/whatwg-whatwg.org/2012-January/034523.html [6] dvcs.w3.org/hg/content-security-policy/raw-file/tip/csp-specification.dev.html
On Sun, Feb 19, 2012 at 11:49 AM, Brendan Eich <brendan at mozilla.com> wrote: [...]
Not all engines mediate cross-same-origin-window accesses. I hear IE9+ may, indeed rumor is it remotes to another process sometimes (breaking run-to-completion a bit; something we should explore breaking in the future for window=vat). SpiderMonkey just recently (not sure if this is in a Firefox channel yet) went to compartment per global, for good savings once things were refactored to maximize sharing of internal immutables.
Other than the origin truncation issue that I am still confused about, what other benefits are there to mediating interframe access within the same origin?
My R2 resolution is not specific to any engine, but I have hopes it can be accepted. It is concrete enough to help overcome large-yet-vague doubts about implementation impact (at least IMHO). Recall that document.domain setting may have to split a merged same-origin window/frame graph, at any time. Again implementation solutions vary, but this suggests cross-window mediation can be interposed lazily.
How? By doing a full walk of the object graph and doing surgery on it? This sounds more painful than imposing mediation up front. But I'm still hoping that objects in same-origin iframes can communicate directly, without mediation.
Wes Garland wrote:
Is there a proposal for interaction with JSON?
From www.ietf.org/rfc/rfc4627, 2.5:
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
Also because inter-compartment traffic is (we conjecture) infrequent enough to tolerate the proxy/copy overhead.
Not to mention that the only thing you'd have to do is to tweak [[get]], charCodeAt and .length when crossing boundaries; you can keep the same backing store.
String methods are not generally self-hosted, so internal C++ vector access would need to change depending on the string's flag bit, in this implementation approach.
You might not even need to do this if the engine keeps the same backing store for both kinds of strings.
Yes, sharing the uint16 vector is good. But string methods would have to index and .length differently (if I can verb .length ;-).
This means a script intent on comparing strings from two globals with different BRS settings could indeed tell that one discloses non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This is the *small* new observable I claim we can live with, because someone opted into it at least in one of the related global objects.
Funny question, if I have two strings, both "hello", from two globals with different BRS settings, are they ==? How about ===?
Of course, strings with the same characters are == and ===. Strings appear to be values. If you think of them as immutable reference types there's still an obligation to compare characters for strings because computed strings are not intern'ed.
R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls from JS to (typically) C++ would have to proxy or copy any strings containing non-BMP characters. Strings with only BMP characters would work as today.
Is that true if the "full unicode" backing store is 16-bit code units using UTF-16 encoding? (Anyway, it's an implementation detail.)
Yes, because DOMString has intrinsic length and indexing notions and these must (pending any coordination with w3c) remain ignorant of the BRS and livin' in the '90s (DOM too emerged in the UCS-2 era).
Mark S. Miller wrote:
On Sun, Feb 19, 2012 at 11:49 AM, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote: [...]
Not all engines mediate cross-same-origin-window accesses. I hear IE9+ may, indeed rumor is it remotes to another process sometimes (breaking run-to-completion a bit; something we should explore breaking in the future for window=vat). SpiderMonkey just recently (not sure if this is in a Firefox channel yet) went to compartment per global, for good savings once things were refactored to maximize sharing of internal immutables.
Other than the origin truncation issue that I am still confused about,
Do you mean document.domain setting? That allows code in an origin to join its origin's super-domain (but not a dotless top level). See
www.w3.org/TR/2009/WD-html5-20090212/browsers.html#dom-document-domain
and
www.w3.org/TR/2009/WD-html5-20090212/browsers.html#effective-script-origin
what other benefits are there to mediating interframe access within the same origin?
The WindowProxy in HTML5 reflects a de-facto standard developed by browser implementors to avoid closure-survives-navigation-to-other-origin attacks. See
www.w3.org/TR/html5/browsers.html#the-windowproxy-object
Demons from the First Age included attacks that loaded a document containing a script defining a closure from evil.org into a subframe, then stuck a ref to the closure in the super-frame, then navigated the sub-frame to victim.com. Guess whose scope the closure saw, with only Window objects and no WindowProxy wrappers for the named (not implicit in identifier resolution) window/frame objects?
My R2 resolution is not specific to any engine, but I have hopes it can be accepted. It is concrete enough to help overcome large-yet-vague doubts about implementation impact (at least IMHO). Recall that document.domain setting may have to split a merged same-origin window/frame graph, at any time. Again implementation solutions vary, but this suggests cross-window mediation can be interposed lazily.
How? By doing a full walk of the object graph and doing surgery on it? This sounds more painful than imposing mediation up front.
No, by indirection, of course ;-). The details vary among browsers.
But I'm still hoping that objects in same-origin iframes can communicate directly, without mediation.
Why? Anyway, it's unsafe, wherefore WindowProxy. No big deal. There's no mediation for identifier resolution (i.e., scope chain lookup) and indeed JITting VMs optimize the heck out of local global accesses already.
David Bruant wrote:
On 19/02/2012 09:33, Brendan Eich wrote:
(...)
How is the BRS configured? Again, not via a pragma, and not by imperative state update inside the language (mutating hidden BRS state at a given program point could leave strings created before mutation observably different from those created after, unless the implementation in effect scanned the local heap and wrapped or copied any non-BMP-char-bearing ones created before).
The obvious way to express the BRS in HTML is a <meta> tag in document <head>, but I don't want to get hung up on this point. I do welcome expert guidance. Here is another W3C/WHATWG interaction point. For this reason I'm cc'ing public-script-coord.
I'm not sure a <meta> is that obvious of a choice.
Sure, guidance welcome as noted. I probably should have started with an HTTP header, but then authors may prefer to set it with <meta http-equiv...> which is verbose:
<meta http-equiv="ECMAScript-Full-Unicode" content="1" />
We can't have
<meta http-equiv="BRS" content="1" />
as BRS is too short and obscure. It's a good joke (should s/switch/button/ -- the big red button was the button Elmer Fudd warned Daffy Duck never to press in "Design for Leaving": www.youtube.com/watch?v=gms_NKzNLUs). Anyway, whatever the header name it will be a pain to type the full <meta> tag.
Unless I'm missing something, I think the same discussion can be had about the BRS being declared as a <meta>. Consider:
<script>
  // some code that can observe the difference between BRS mode and non-BRS
</script>
<meta BRS>
Should the browser read all <meta>s before executing any script? Worse: what if an inline script does "document.write('<meta BRS>')"?
Since I was thinking of <meta http-equiv> (possibly with a short-hand), your example simply puts the <meta> out of order. It can't work, so it should not work (console warning traffic appropriate).
In mentioning <meta> I did not mean to exclude better ideas. Obviously a multi-window/frame app might want a Really Big Red Switch expressed in one place only. Ignoring Web Apps with manifest files, where would that place be? Hmm, CSP...
I think a CSP-like solution should be explored.
Good suggestion. I hope others on the lists are up-to-date on CSP.
Brendan Eich wrote:
the big red button was the button Elmer Fudd warned Daffy Duck never to press in "Design for Leaving": www.youtube.com/watch?v=gms_NKzNLUs
Got Elmer and Daffy reversed there --getting old!
Why would converting the existing UCS-2 support to be UTF-16 not be a good idea? There is nothing intrinsically wrong that I can see with that approach and it would be the most compatible with existing scripts, with no special "modes", "flags", or interactions.
Yes, the complexity of supplementary characters (i.e. non-BMP characters) represented as surrogate pairs must still be dealt with. It would also expose the possibility of invalid strings (with unpaired surrogates). But this would not be unlike other programming languages--or even ES as it exists today. The purity of a "Unicode string" would be watered down, but perhaps not fatally. The Java language went through this (yeah, I know, I know...) and seems to have emerged unscathed. Norbert has a lovely doc here about the choices that led to this, which seems useful to consider: [1]. W3C I18N Core WG has a wiki page shared with TC39 awhile ago here: [2].
To me, switching to UTF-16 seems like a relatively small, containable, non-destructive change to allow supplementary character support. It's not as pure as a true code-point based "Unicode string" solution. But purity isn't everything.
What am I missing?
Addison
Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG) <--- hat is OFF in this message
Internationalization is not a feature. It is an architecture.
[1] java.sun.com/developer/technicalArticles/Intl/Supplementary [2] www.w3.org/International/wiki/JavaScriptInternationalization
Anne van Kesteren wrote:
On Sun, 19 Feb 2012 21:29:48 +0100, David Bruant <bruant.d at gmail.com> wrote:
I think a CSP-like solution should be explored.
FWIW, the feedback on CORS (CSP-like) thus far has been that it's quite hard to set up custom headers.
I've heard this for years, can believe it in old-school big-company settings, but have a not-to-be-shattered hope that with Node.js etc. it is easier for content authors to configure headers. Go on, break my heart!
So for something as commonly used as JavaScript I'm not sure we'd want to require that. And although more difficult, if we want <meta> it can be made to work, it's just more complicated than simply defining a name and a value. But maybe it should be something simpler, e.g.
<html unicode>
in the top-level browsing context's document.
That's pretty but is it misleading? This is the big-red-switch-for-JS, not for the whole doc. In particular what is the Content-Type, with what charset parameter, and how does this attribute interact? Perhaps it's just misnamed.
What are libraries supposed to do by the way, check the length of "😁" and adjust code accordingly?
Most JS libraries (I'd love to see counterexamples) do not process surrogate pairs at all. They too live in the '90s.
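A small illustration of the kind of breakage: any helper that walks a string by index, such as a naive reverse, splits surrogate pairs.

function reverse(s) {
  return s.split('').reverse().join('');  // operates on uint16 units
}

reverse('abc');            // 'cba'
reverse('\uD83D\uDE01');   // '\uDE01\uD83D' -- two lone surrogates, no longer U+1F601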
As far as the DOM and Web IDL are concerned, I think we would need two definitions for "code unit". One that means 16-bit code unit and one that means "Unicode code unit"
I'm not a Unicode expert but I believe the latter is called "character".
or some such. Looking at dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#characterdata the rest should follow quite naturally.
What happens with surrogate code points in these new strings? I think we do not want to change that each unit is an integer of some kind and can be set to any value. And if that is the case, will it hold values greater than U+10FFFF?
JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.
The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.
On 19/02/2012 22:57, Anne van Kesteren wrote:
On Sun, 19 Feb 2012 21:29:48 +0100, David Bruant <bruant.d at gmail.com> wrote:
I think a CSP-like solution should be explored.
FWIW, the feedback on CORS (CSP-like) thus far has been that it's quite hard to set up custom headers.
Do you have a reference for this feedback? Under which circumstances is it hard? One major annoyance I see with HTTP headers is that I have never heard of a hosting service that lets you choose the HTTP headers your service is served with, and that's problematic. <meta http-equiv> is of some help to provide the feature without having control over the HTTP response, but in some cases we want the HTTP header to mean something document-wide, and a <meta> can be too late.
So for something as commonly used as JavaScript I'm not sure we'd want to require that. And although more difficult, if we want <meta> it can be made to work, it's just more complicated than simply defining a name and a value. But maybe it should be something simpler, e.g.
<html unicode>
in the top-level browsing context's document.
I'm not sure it solves anything since a script could be the first thing an HTML renderer comes across, even before a doctype, even before an <html> starting tag.
My guess would be that the HTML spec defines that this script should be executed even if the "<script>" opening tag comes as the very first bytes of the document.
On 2/19/12 3:31 PM, Mark S. Miller wrote:
Other than the origin truncation issue that I am still confused about, what other benefits are there to mediating interframe access within the same origin?
In Gecko's case, at least, there are certain benefits to garbage collection, memory locality, memory accounting, faster determination of an object's effective origin, etc.
The important part being the separate per-frame heaps; the mediation is just a consequence.
Phillips, Addison wrote:
Why would converting the existing UCS-2 support to be UTF-16 not be a good idea? There is nothing intrinsically wrong that I can see with that approach and it would be the most compatible with existing scripts, with no special "modes", "flags", or interactions.
Allen proposed this, essentially (some confusion surrounded the discussion by mixing observable-in-language with encoding/format/serialization issues, leading to talk of 32-bit characters), last year. As I wrote in the o.p., this led to two objections: big implementation hit; incompatible change.
I tackled the second with the BRS and (in detail) mediation across DOM window boundaries. This I believe takes the sting out of the first (lesser implementation change in light of existing mediation at those boundaries).
Yes, the complexity of supplementary characters (i.e. non-BMP characters) represented as surrogate pairs must still be dealt with.
I'm not sure what you mean. JS today allows (ignoring invalid pairs) such surrogates but they count as two indexes and add two to length, not one. That is the first problem to fix (ignoring literal escape-notation expressiveness).
It would also expose the possibility of invalid strings (with unpaired surrogates).
That problem exists today.
But this would not be unlike other programming languages--or even ES as it exists today.
Right! We should do better. As I noted, Node.js heavy hitters (mranney of Voxer) testify that they want full Unicode, not what's specified today with indexing and length-accounting by uint16 storage units.
The purity of a "Unicode string" would be watered down, but perhaps not fatally. The Java language went through this (yeah, I know, I know...) and seems to have emerged unscathed.
Java's dead on the client. It is used by botnets (bugzilla.mozilla.org recently suffered a DDOS from one, the bad guys didn't even bother changing the user-agent from the default one for the Java runtime). See Brian Krebs' blog.
Norbert has a lovely doc here about the choices that lead to this, which seems useful to consider: [1]. W3C I18N Core WG has a wiki page shared with TC39 awhile ago here: [2].
To me, switching to UTF-16 seems like a relatively small, containable, non-destructive change to allow supplementary character support.
I still don't know what you mean. How would what you call "switching to UTF-16" differ from today, where one can inject surrogates into literals by transcoding from an HTML document or .js file CSE?
In particular, what do string indexing and .length count, uint16 units or characters?
On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:
Anne van Kesteren wrote:
...
As far as the DOM and Web IDL are concerned, I think we would need two definitions for "code unit". One that means 16-bit code unit and one that means "Unicode code unit"
I'm not a Unicode expert but I believe the latter is called "character".
Me neither, but I believe the correct term is "code point", which refers to the full 21-bit code, while "Unicode character" is the logical entity corresponding to that code point. That usage of "character" is different from the current usage within ECMAScript, where "character" is what we call the elements of the vector of 16-bit numbers that are used to represent a String value. You can access them as string values of length 1 via [ ] or as numeric values via the charCodeAt method.
or some such. Looking at dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#characterdata the rest should follow quite naturally.
What happens with surrogate code points in these new strings? I think we do not want to change that each unit is an integer of some kind and can be set to any value. And if that is the case, will it hold values greater than U+10FFFF?
JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.
The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.
I think your names for the BRS modes are misleading. What you call "UTF-16" actually manifests itself to the ES programmer as UTF-32, as each index position within a string corresponds to an unencoded Unicode code point. There are no visible UTF-16 surrogate pairs, even if the implementation is internally using a UTF-16 encoding.
Similarly, "UCS-2" as currently implemented actually manifests itself to the ES programmer as UTF-16 because implementations turn non-BMP string literal characters into UTF-16 surrogate pairs that visibly occupy two index positions.
Allen Wirfs-Brock wrote:
On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:
I'm not a Unicode expert but I believe the latter is called "character".
Me neither, but I believe the correct term is "code point", which refers to the full 21-bit code, while "Unicode character" is the logical entity corresponding to that code point. That usage of "character" is different from the current usage within ECMAScript, where "character" is what we call the elements of the vector of 16-bit numbers that are used to represent a String value. You can access them as string values of length 1 via [ ] or as numeric values via the charCodeAt method.
Thanks. We have a confusing transposition of terms between Unicode and ECMA-262, it seems. Should we fix?
JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.
The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.
I think your names for the BRS modes are misleading.
You got me, in fact I used "full Unicode" for the BRS-thrown setting elsewhere.
My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.
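For flavor, a sketch (in JS, purely for illustration) of the index translation such an engine would be doing for flagged strings; a real implementation would be native and would likely cache a last-index hint rather than scan from the start each time:

// Map a character (code point) index to its uint16 offset in a UTF-16
// backing store.
function unitIndexOf(units, charIndex) {
  var unitIndex = 0;
  while (charIndex > 0) {
    var c = units.charCodeAt(unitIndex);
    unitIndex += (c >= 0xD800 && c <= 0xDBFF) ? 2 : 1;  // skip both halves of a pair
    charIndex--;
  }
  return unitIndex;
}

var backing = 'a\uD834\uDD1Eb';  // "a", G clef, "b" stored as four uint16 units
unitIndexOf(backing, 2);         // 3 -- the character "b" lives at uint16 index 3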
Brendan Eich wrote:
Mark S. Miller wrote:
On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote: [...]
Why the global object? Because for many VMs, each global has its own heap or sub-heap ("compartment"), and all references outside that heap are to local proxies that copy from, or in the case of immutable data, reference the remote heap.
[...]
Is this true for same origin iframes? I have always assumed that mixing heaps between same origin iframes results in unmediated direct object-to-object access. If these are already mediated, what was the issue that drove us to that?
Not all engines mediate cross-same-origin-window accesses.
Sorry, I misused "mediate" incorrectly here to mean heap/compartment isolation. All engines in browsers that conform to HTML5 must mediate cross-frame Window (global object) accesses via WindowProxy, as discussed in other followups.
I hear IE9+ may, indeed rumor is it remotes to another process sometimes (breaking run-to-completion a bit; something we should explore breaking in the future for window=vat).
(Hope that parenthetical aside has you charged up -- we need a fresh thread on that topic, though... ;-)
On Feb 19, 2012, at 2:44 PM, Brendan Eich wrote:
Allen Wirfs-Brock wrote:
On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:
I'm not a Unicode expert but I believe the latter is called "character".
Me neither, but I believe the correct term is "code point", which refers to the full 21-bit code, while "Unicode character" is the logical entity corresponding to that code point. That usage of "character" is different from the current usage within ECMAScript, where "character" is what we call the elements of the vector of 16-bit numbers that are used to represent a String value. You can access them as string values of length 1 via [ ] or as numeric values via the charCodeAt method.
Thanks. We have a confusing transposition of terms between Unicode and ECMA-262, it seems. Should we fix?
The ES5.1 spec is OK because it always uses (as defined in section 6) the term "Unicode character" when it means exactly that, and uses "character" when talking about the elements of String values. It says that both "code unit" and "character" refer to a 16-bit unsigned value.
Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an ES "character" corresponds to a "code unit" or a "code point".
JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.
The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.
I think your names for the BRS modes are misleading.
You got me, in fact I used "full Unicode" for the BRS-thrown setting elsewhere.
My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.
A fine implementation, but not observable. Another implementation approach that would preserve O(1) indexing would be to simply have two or three different internal string representations with 1, 2, or 4 byte internal characters. (You can automatically pick the needed character size when the string is created because string are immutable and created with their value). A not-quite O(1) approach would segment strings into substring spans using such an representation. Representation choice probably depends a lot
Allen Wirfs-Brock wrote:
On Feb 19, 2012, at 2:44 PM, Brendan Eich wrote:
Thanks. We have a confusing transposition of terms between Unicode and ECMA-262, it seems. Should we fix?
The ES5.1 spec is OK because it always uses (as defined in section 6) the term "Unicode character" when it means exactly that, and uses "character" when talking about the elements of String values. It says that both "code unit" and "character" refer to a 16-bit unsigned value.
That is still pretty confusing. I hope we can stop abusing "character" by overloading it in ECMA-262 in this way.
Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an ES "character" corresponds to a "code unit" or a "code point".
Yes, and we might rather have a different word on that basis too.
How about "character element"? "Element" to capture indexing as the means of accessing the thing in question.
Trimming to es-discuss.
Brendan Eich wrote:
How about "character element"? "Element" to capture indexing as the means of accessing the thing in question.
Or avoid the c-word altogether via "string element" or "string indexed property"? Latter's too long but you see what I mean.
On Feb 19, 2012, at 3:18 PM, Brendan Eich wrote:
Allen Wirfs-Brock wrote: ...
Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an ES "character" corresponds to a "code unit" or a "code point".
Yes, and we might rather have a different word on that basis too.
How about "character element"? "Element" to capture indexing as the means of accessing the thing in question.
I generally try to use "element" in informal contexts for exactly that reason. However, shouldn't it be "string element" and we could let "character" and "Unicode character" mean the same thing?
On Sun, Feb 19, 2012 at 1:52 PM, Brendan Eich <brendan at mozilla.com> wrote: [...]
How? By doing a full walk of the object graph and doing surgery on it?
This sounds more painful than imposing mediation up front.
No, by indirection, of course ;-). The details vary among browsers.
I think we're just having a terminology problem. To me, such indirection is mediation.
Mark S. Miller wrote:
On Sun, Feb 19, 2012 at 1:52 PM, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote: [...]
How? By doing a full walk of the object graph and doing surgery on it? This sounds more painful than imposing mediation up front.
No, by indirection, of course ;-). The details vary among browsers.
I think we're just having a terminology problem. To me, such indirection is mediation.
Definitely I was unclear. The (different) mediation by WindowProxy is good because local global (oxymoronic, ugh -- let's say "this global") accesses are unmediated and can be super-optimized.
The mediation by trust-label indirection I was referring to above is for-all-accesses. That is painful.
A compartment per global makes the process of finding the trust-label (called a "Principal" in Gecko) significantly faster.
Allen Wirfs-Brock wrote:
On Feb 19, 2012, at 3:18 PM, Brendan Eich wrote:
Allen Wirfs-Brock wrote: ...
Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an ES "character" corresponds to a "code unit" or a "code point".
Yes, and we might rather have a different word on that basis too.
How about "character element"? "Element" to capture indexing as the means of accessing the thing in question.
I generally try to use "element" in informal contexts for exactly that reason. However, shouldn't it be "string element" and we could let "character" and "Unicode character" mean the same thing?
Yes, see my es-discuss+you followup -- should have measured twice and cut once.
I like this much better than anything overloading "character".
To hope to make this sideshow beneficial to all the cc: list, what do DOM specs use to talk about uint16 units vs. code points?
Brendan Eich:
To hope to make this sideshow beneficial to all the cc: list, what do DOM specs use to talk about uint16 units vs. code points?
I say "code unit" as a shorter way of saying "16 bit unsigned integer code unit"
dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit
(which DOM4 also links to) and then just "code point" to refer to 21 bit numbers that might correspond to a Unicode character, which you can see used in
On Feb 19, 2012, at 1:34 PM, Brendan Eich wrote:
Wes Garland wrote:
Is there a proposal for interaction with JSON?
From www.ietf.org/rfc/rfc4627, 2.5:
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
I think it is actually more complex than just the above. 2.5 also says:
"All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)." (emphasis added)
and 3. says:
"JSON text SHALL be encoded in Unicode. The default encoding is UTF-8." and then goes on to talk about how to detect UTF-8, 16, and 32 LE and BE encodings. So all those are legal.
It is presumably up to a JSON parser to decide how non-BMP characters in strings are encoded for whatever internal representation it is targeting. Currently JS JSON.parse takes its input from a JavaScript string that is composed of 16-bit UCS-2 elements so there are no unencoded non-BMP characters in the string. However, according to the ES5.1 spec, JSON.parse (and JSON.stringify) will just pass through any UTF-16 surrogate pairs that are encountered.
With the BRS, JSON.parse and JSON.stringify could encounter non-BMP characters in the JS string they are processing, and those also would presumably pass through transparently. The one requirement of rfc 4627 that would be impacted by the BRS would be the 12-character escape sequences mentioned above. Currently JSON.parse implementations encode those as UTF-16 surrogate pairs in the generated strings. If the BRS is flipped, the rfc seems to require that they generate a single string element. Because the JSON.stringify spec does not escape anything other than control characters, any non-BMP characters it encounters would pass through unencoded. This implies that JSON.parse input of the form "\uD834\uDD1E" would probably round trip back out via JSON.stringify as a JSON string containing the single unencoded G clef character. Logically equivalent, but not the identical JSON text.
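A hedged illustration of the round trip described above; the BRS-on behavior in the comments is what the reading of the rfc here would imply, not anything currently specified:

  var s = JSON.parse('"\\uD834\\uDD1E"');
  // BRS off (today): s.length === 2, two UTF-16 code units.
  // BRS on (proposed): s.length === 1, the single code point U+1D11E.
  var out = JSON.stringify(s);
  // BRS on: out contains the single unescaped G clef character, so the
  // serialized JSON is logically equivalent to, but not identical with, the input.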
First, it would be great to get full Unicode support in JS. I know that's been a problem for us at Google.
Secondly, while I agree with Addison that the approach that Java took is workable, it does cause problems. Ideally someone would be able to loop (a very common construct) with:
for (codepoint cp : someString) { doSomethingWith(cp); }
In Java, you have to do:
  int cp;
  for (int i = 0; i < someString.length(); i += Character.charCount(cp)) {
    cp = someString.codePointAt(i);
    doSomethingWith(cp);
  }
There are good reasons for why Java did what it did, basically for compatibility. But if there is some way that JS can work around those, that'd be great.
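For comparison, here is what that Java-style workaround looks like when transliterated to today's ES5 strings; this is purely an illustrative sketch, and forEachCodePoint is a made-up helper, not an API anyone has proposed:

  function forEachCodePoint(str, doSomethingWith) {
    for (var i = 0; i < str.length; i++) {
      var cp = str.charCodeAt(i);
      if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < str.length) {
        var lo = str.charCodeAt(i + 1);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {            // combine a surrogate pair
          cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
          i++;
        }
      }
      doSomethingWith(cp);                             // unpaired halves pass through as-is
    }
  }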
- There's some confusion about the Unicode terminology. Here's a quick clarification:
code point: number from 0 to 0x10FFFF
character: a code point that is assigned. Eg, 0x61 represents 'a' and is a character. 0x378 is a code point, but not (yet) a character.
code unit: an encoding 'chunk'. UTF-8 represents a code point as 1-4 8-bit code units; UTF-16 represents a code point as 1 or 2 16-bit code units; UTF-32 represents a code point as 1 32-bit code unit.
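A concrete example of the distinction, using today's ES5 semantics (illustrative only):

  // U+1D11E MUSICAL SYMBOL G CLEF is one code point and one character,
  // but two 16-bit code units (and four 8-bit code units in UTF-8):
  var clef = "\uD834\uDD1E";
  clef.length === 2;   // ES5 counts code units, not code points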
Mark plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene ("The best is the enemy of the good")
Mark wrote:
First, it would be great to get full Unicode support in JS. I know that's been a problem for us at Google.
AP> +1: I think we’ve waited for supplementary character support long enough!
Secondly, while I agree with Addison that the approach that Java took is workable, it does cause problems.
AP> The tension is between “compatibility” and “ease of use” here, I think. The question is whether very many scripts depend on the ‘uint16’ nature of a character in ES, use surrogates to effect supplementary character support, or are otherwise tied to the existing encoding model and are broken as a result of changes. In its ideal form, an ES string would logically be a sequence of Unicode characters (code points) and only the internal representation would worry about whatever character encoding scheme made the most sense (in many cases, this might actually be UTF-16).
AP> … but what I think is hard to deal with are different modes of processing scripts depending on “fullness of the Unicode inside”. Admittedly, the approach I favor is rather conservative and presents a number of challenges, most notably in adapting regex or for users who want to work strictly in terms of character values.
There are good reasons for why Java did what it did, basically for compatibility. But if there is some way that JS can work around those, that'd be great.
AP> Yes, it would.
~Addison
On Feb 19, 2012, at 3:13 PM, Allen Wirfs-Brock wrote:
My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.
A fine implementation, but not observable. Another implementation approach that would preserve O(1) indexing would be to simply have two or three different internal string representations with 1, 2, or 4 byte internal characters. (You can automatically pick the needed character size when the string is created because strings are immutable and created with their value.) A not-quite O(1) approach would segment strings into substring spans using such a representation. Representation choice probably depends a lot on what you think are the most common use cases. If it is string processing in JS then a fast representation is probably what you want to choose. If it is just passing text that is already UTF-8 or UTF-16 encoded from inputs to output then a representation that minimizes transcoding would probably be a higher priority.
One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element. How do we want to handle unpaired UTF-16 surrogates in a full-unicode string? I can see three options:
- Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
- Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
- Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner. Option 3 would seem to violate that, exposing the underlying UTF-16 implementation. It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings: currently, for all strings s1 & s2, s1.length + s2.length == (s1 + s2).length. However, if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
I guess I wonder if it's worth considering either option 1) or 2) – either prohibiting invalid unicode characters in strings, or considering something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).
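The difference between options 2 and 3 in concrete terms (a sketch of the proposed BRS-on semantics, not of anything specified):

  var s1 = "\uD800", s2 = "\uDC00";
  // Option 2: (s1 + s2).length === 2 -- the unpaired halves remain separate elements,
  //           and s1.length + s2.length === (s1 + s2).length still holds.
  // Option 3: (s1 + s2).length === 1 -- concatenation fuses the pair into U+10000,
  //           so the distributive .length property is lost.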
Cameron McCormack wrote:
Brendan Eich:
To hope to make this sideshow beneficial to all the cc: list, what do DOM specs use to talk about uint16 units vs. code points?
I say "code unit" as a shorter way of saying "16 bit unsigned integer code unit"
dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit
(which DOM4 also links to) and then just "code point" to refer to 21 bit numbers that might correspond to a Unicode character, which you can see used in
Well then, you are one up on ECMA-262, and from Mark Davis's message using canonical Unicode terms. We shall strive to align terms.
Here's another q for the DOM folks and others using WebIDL: is extending the DOM and other specs to respect the BRS and support full Unicode conceivable from where you sit? Desirable? Thanks.
Gavin Barraclough wrote:
One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element. How do we want to handle unpaired UTF-16 surrogates in a full-unicode string? I can see three options:
- Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
- Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
- Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner. Option 3 would seem to violate that, exposing the underlying UTF-16 implementation. It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings: currently, for all strings s1 & s2, s1.length + s2.length == (s1 + s2).length. However, if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
I guess I wonder if it's worth considering either options 1) or 2) – either prohibiting invalid unicode characters in strings, or consider something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).
Great post. I agree 3 is not good. I was thinking based on today's exchanges that the BRS being set to "full Unicode" could mean that "\uXXXX" is illegal and you must use "\u{...}" to write Unicode code points (not code units).
Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?
On Feb 19, 2012, at 6:54 PM, Gavin Barraclough wrote:
On Feb 19, 2012, at 3:13 PM, Allen Wirfs-Brock wrote:
My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.
A fine implementation, but not observable. Another implementation approach that would preserve O(1) indexing would be to simply have two or three different internal string representations with 1, 2, or 4 byte internal characters. (You can automatically pick the needed character size when the string is created because strings are immutable and created with their value.) A not-quite O(1) approach would segment strings into substring spans using such a representation. Representation choice probably depends a lot on what you think are the most common use cases. If it is string processing in JS then a fast representation is probably what you want to choose. If it is just passing text that is already UTF-8 or UTF-16 encoded from inputs to output then a representation that minimizes transcoding would probably be a higher priority.
One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element. How do we want to handle unpaired UTF-16 surrogates in a full-unicode string? I can see three options:
- Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
- Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
- Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner. Option 3 would seem to violate that, exposing the underlying UTF-16 implementation. It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings: currently, for all strings s1 & s2, s1.length + s2.length == (s1 + s2).length. However, if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
I guess I wonder if it's worth considering either options 1) or 2) – either prohibiting invalid unicode characters in strings, or consider something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).
I think 2) is the only reasonable alternative.
I don't think 1) would be a very good choice, if for no other reason than that the set of valid unicode characters is a moving target that you wouldn't want to hardwire into either the ES specification or implementations.
More importantly, some applications require "string processing" of strings containing invalid unicode characters. In particular, any sort of transcoder between character sets requires this. If you want to take a full unicode string, convert it to UTF-16 and then output it, you may generate intermediate strings with elements that contain individual high and low surrogate codes. If you were transcoding to a non-Unicode character set, any value might be possible.
I really don't think any Unicode semantics should be built into the basic string representation. We need to decide on a max element size, and Unicode motivates 21 bits, but it could be 32 bits. Personally, I've lived through enough address space exhaustion episodes in my career to be skeptical of "small" values like 2^21 being good enough for the long term.
On 2/19/12 at 21:45, allen at wirfs-brock.com (Allen Wirfs-Brock) wrote:
I really don't think any Unicode semantics should be built into the basic string representation. We need to decide on a max element size, and Unicode motivates 21 bits, but it could be 32 bits. Personally, I've lived through enough address space exhaustion episodes in my career to be skeptical of "small" values like 2^21 being good enough for the long term.
Can we future-proof any limit an implementation may choose by saying that all characters whose code point is too large for a particular implementation must be replaced by an "invalid character" code point (which fits into the implementation's representation size) on input? An implementation which chooses 21 bits as the size will become obsolete when Unicode characters that need 22 bits are defined. However it will still work with characters that fit in 21 bits, and will do something rational with ones that do not. Users who need characters in the over-21-bit set will be encouraged to upgrade.
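A minimal sketch of that rule, assuming a 21-bit implementation; MAX_REPRESENTABLE and the choice of U+FFFD as the "invalid character" are illustrative assumptions:

  var MAX_REPRESENTABLE = 0x10FFFF;          // what a 21-bit implementation can store
  function acceptCodePoint(cp) {
    // Anything the implementation cannot represent is replaced on input,
    // rather than rejected, so older engines keep working with newer data.
    return cp > MAX_REPRESENTABLE ? 0xFFFD : cp;
  }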
Cheers - Bill
On Feb 19, 2012, at 7:52 PM, Brendan Eich wrote:
Gavin Barraclough wrote:
One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element. How do we want to handle unpaired UTF-16 surrogates in a full-unicode string? I can see three options:
- Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
- Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
- Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner. Option 3 would seem to violate that, exposing the underlying UTF-16 implementation. It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings: currently, for all strings s1 & s2, s1.length + s2.length == (s1 + s2).length. However, if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
I guess I wonder if it's worth considering either options 1) or 2) – either prohibiting invalid unicode characters in strings, or consider something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).
Great post. I agree 3 is not good. I was thinking based on today's exchanges that the BRS being set to "full Unicode" could mean that "\uXXXX" is illegal and you must use "\u{...}" to write Unicode code points (not code units).
Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?
I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements. Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents you may end up with string elements containing upper or lower half surrogates. Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.
What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code. If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode. That might be a good thing.
On Feb 19, 2012, at 10:05 PM, Allen Wirfs-Brock wrote:
Great post. I agree 3 is not good. I was thinking based on today's exchanges that the BRS being set to "full Unicode" could mean that "\uXXXX" is illegal and you must use "\u{...}" to write Unicode code points (not code units).
Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?
I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements. Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents you may end up with string elements containing upper or lower half surrogates. Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.
What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code. If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode. That might be a good thing.
Ah, this is a good point. I was going to ask whether it would be inconsistent to deprecate \uXXXX but not \xXX, since both could just be considered shorthand for \u{...}, but this is a good practical reason why it matters more for \uXXXX (and I can imagine there may be complaints if we take \xXX away!).
So, just to clarify,

  var s1 = "\u{0d800}\u{0dc00}";
  var s2 = String.fromCharCode(0xd800) + String.fromCharCode(0xdc00);
  s1.length === 2; // true
  s2.length === 2; // true
  s1 === s2; // true

Does this sound like the expected behavior?
Also, what would happen to String.fromCharCode?
- Leave this unchanged, it would continue to truncate the input with ToUint16?
- Change its behavior to allow any code point (maybe switch to ToUint32, or ToInteger, and throw a RangeError for input > 0x10FFFF?).
- Make it sensitive to the state of the corresponding global object's BRS.
If we were to leave it unchanged, using ToUInt16, then I guess we would need a new String.fromCodePoint function, to be able to create strings for non-BMP characters? Presumably we would then want a new String.codePointAt function, for symmetry? This would also raise a question of what String.charCodeAt should return for code points outside of the Uint16 range – should it return the actual value, or ToUint16 of the code point to mirror the truncation performed by fromCharCode?
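For what it's worth, a String.fromCodePoint of the kind mentioned above can be sketched on top of today's fromCharCode; the name and the surrogate-pair encoding for non-BMP input are assumptions of this illustration, not settled API:

  function fromCodePoint(cp) {
    if (cp < 0 || cp > 0x10FFFF) throw new RangeError("invalid code point");
    if (cp <= 0xFFFF) return String.fromCharCode(cp);
    cp -= 0x10000;                             // encode as a UTF-16 surrogate pair
    return String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
  }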
I guess my preference here would be to go with option 3 – tie the potentially breaking change to the BRS, but no need for new interface.
On 20 February 2012 00:45, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
- Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
I think 2) is the only reasonable alternative.
I think so, too -- especially as any sequence of Unicode code points -- including invalid and reserved code points -- constitutes a valid Unicode string, according to my recollection of the Unicode specification.
In addition to the reasons you listed, it should also be noted that option 2:
- is cheaper to implement
- keeps more old code working; ignoring the examples where developers use String as uint16[], there are also the cases where developers scan strings for 0xD800. 0xD800 is a reserved code point.
I don't think 1) would be a very good choice, if for no other reason than that the set of valid unicode characters is a moving target that you wouldn't want to hardwire into either the ES specification or implementations.
To play the devil's advocate, I could point out that the spec language could say something about reserved code points. Those code points are reserved because, IIRC, they are not representable in UTF-16; they include the ranges for the surrogate pairs.
On 19 February 2012 16:34, Brendan Eich <brendan at mozilla.com> wrote:
Wes Garland wrote:
Is there a proposal for interaction with JSON?
From www.ietf.org/rfc/rfc4627, 2.5
snip - so the proposal is to keep encoding JSON in UTF-16. What happens if the BRS is set to Unicode and we want to encode the string "\uD834\uDD1E" -- the Unicode string which contains two reserved code points? We do not want to deserialize this as U+1D11E.
I think we should consider that BRS-on should mean six-character escapes in JSON for non-BMP characters. It might even be possible to add matching support for JSON.parse() when BRS-off. The one caveat is that might make JSON interchange fragile between BRS-on systems and ES5 engines.
Yes, sharing the uint16 vector is good. But string methods would have to
index and .length differently (if I can verb .length ;-).
.lengthing is easy; cost is about the same as strlen() and can be cached. Indexed access is something I have thought about from the implementor's POV for a while [but not heavily]. I haven't come up with a ground-breaking technique, I keep coming up with something that looks like a lookup table for surrogate pairs, degrading to an extra uint32[] when there are many of them. Anyhow, implementation detail.
Of course, strings with the same characters are == and ===. Strings appear to be values. If you think of them as immutable reference types there's still an obligation to compare characters for strings because computed strings are not intern'ed.
What about strings with the same sequence of code units but different code points? They would have identical backing stores if the backing store were either UTF-8 or uint32. This can happen if we have BRS-on Strings which contain non-BMP code points. (Actually, does BRS-on mean that we have to abandon UTF-16 to store Unicode strings containing invalid code points? Mark Davis, are you reading?)
How about strings which are considered equal by Unicode but which do not share the same representation? Will Unicode normalization be performed when Strings are created/parsed? On comparison? If on compare, would we skip normalization for ===?
I assume normalizing to NFC form, similar to what W3C does, is the target?
www.macchiato.com/unicode/nfc-faq (Mark Davis) unicode.org/faq/normalization.html
Most content actually only tries to access characters of a string like this:
for (var i = 0; i < str.length; i++) { str[i]; }
While a naive implementation using UTF-8-encoded strings would be O(n^2), if the previous lookup result was cached it would be possible to achieve a reasonably fast O(n) behaviour on such a loop. It feels like some kind of iterator would be more efficient, but I don't think iterators would "feel right" in ECMAScript.
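A rough sketch of that caching idea; the utf8 byte array, the makeCachedIndexer name, and the assumption of well-formed UTF-8 are all illustrative, not part of any proposal:

  function makeCachedIndexer(utf8) {          // utf8: array of bytes backing one string
    var lastChar = 0, lastByte = 0;           // remember the previous lookup
    function seqLength(b) {                   // byte length of the sequence led by byte b
      return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    }
    return function byteOffsetOf(charIndex) {
      if (charIndex < lastChar) { lastChar = 0; lastByte = 0; }  // only forward scans are cheap
      while (lastChar < charIndex) {
        lastByte += seqLength(utf8[lastByte]);
        lastChar++;
      }
      return lastByte;                        // sequential access stays O(n) overall, not O(n^2)
    };
  }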
You can encode unmatched surrogates in UTF-8 (although they may have to be removed before the string is passed to the browser DOM code) so it may be possible to simply always encode strings in UTF-8 allowing for much simpler sharing of strings between code that wants UTF-8 support and code that is using the old model at the expense of more complex behaviour where UTF-16 surrogates are referenced.
Issues only arise in code that tries to treat a string as an array of 16-bit integers, and I don't think we should be particularly bothered by performance of code which misuses strings in this fashion (but clearly this should still work without opt-in to new string handling).
I think this is a nicer and more flexible model than string representations being dependent on which heap they came from - all issues related to encoding can be contained in the String object implementation.
While this is being discussed, for any new string handling I think we should make any invalid strings (according to the rules in Unicode) cause some kind of exception on creation.
On 20 February 2012 09:56, Andrew Oakley <andrew at ado.is-a-geek.net> wrote:
While this is being discussed, for any new string handling I think we should make any invalid strings (according to the rules in Unicode) cause some kind of exception on creation.
Can you clarify which definition in the Unicode standard you are proposing for "invalid string"?
Most content actually only tries to access characters of a string like
this:
for (var i = 0; i < str.length; i++) { str[i]; }
Does anybody have any data on this? I'm genuinely curious about how much code "on the web" does any kind of character access on strings; the only common use-case that comes to mind (other than wanting uint16[]) is users who are doing UTF-16 on top of UCS-2.
Allen Wirfs-Brock wrote:
Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?
I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements.
I agree, which is why I'm saying that with the BRS set, we should forbid "\uXXXX", since that is not a code point but rather a code unit.
Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents you may end up with string elements containing upper or lower half surrogates.
I don't agree in the case of "\u{00d800}". That's simply an illegal code point, not a code unit (upper or lower half). We can reject it statically.
Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.
True, but not my point!
What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code.
And arising from concatenations, avoiding the loss of Gavin's distributive .length property.
If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode. That might be a good thing.
My point! ;-)
Gavin Barraclough wrote:
What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code. If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode. That might be a good thing.
Ah, this is a good point. I was going to ask whether it would be inconsistent to deprecate \uXXXX but not \xXX, since both could just be considered shorthand for \u{...}, but this is a good practical reason why it matters more for \uXXXX (and I can imagine there may be complaints if we take \xXX away!).
Yes. "\xXX" is innocuous, since ISO 8859-1 is a proper subset of Unicode and can't be used to forge surrogate pair halves.
So, just to clarify,

  var s1 = "\u{0d800}\u{0dc00}";
  var s2 = String.fromCharCode(0xd800) + String.fromCharCode(0xdc00);
  s1.length === 2; // true
  s2.length === 2; // true
  s1 === s2; // true

Does this sound like the expected behavior?
Rather, I'd statically reject the invalid code points.
Also, what would happen to String.fromCharCode?
BRS makes 21-bit chars, so just as String.prototype.charCodeAt returns a code point, String.fromCharCode takes actual code point arguments.
Again I'd reject (dynamically in the case of String.fromCharCode) any in [0xd800, 0xdfff]. Other code points that are not characters I'd let through to future-proof, but not these reserved ones. Also any > 0x10ffff.
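A sketch of the check being described; surfacing rejection as a RangeError is this sketch's assumption:

  function checkCodePoint(cp) {
    if (cp < 0 || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
      throw new RangeError("not an acceptable code point: 0x" + cp.toString(16));
    }
    return cp;   // unassigned but in-range code points pass, to stay future-proof
  }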
- Leave this unchanged, it would continue to truncate the input with ToUint16?
No, that violates the BRS intent.
- Change its behavior to allow any code point (maybe switch to ToUint32, or ToInteger, and throw a RangeError for input > 0x10FFFF?).
The last.
- Make it sensitive to the state of the corresponding global object's BRS.
In any event, yes: this. The BRS is a switch, you can think of it as swapping in the other String implementation, or as a flag tested within one shared String implementation whose methods use if statements (which could be messy but would work).
We should specify carefully the identity or lack of identity of myGlobal.String and otherGlobalWithPossiblyDifferentBRSSetting.String, etc. Consider this one-line .html file:
<iframe src="javascript:alert(parent.String === String)"/>
I get false from Chrome, Firefox and Safari, as expected. So the BRS could swap in another String, or simply mutate hidden state associated with the global in question (as mentioned in my previous post, globals keep track of the original values of their built-ins' prototypes, so implementations could put the BRS in String or String.prototype too, and use random logic instead of separate objects).
If we were to leave it unchanged, using ToUInt16, then I guess we would need a new String.fromCodePoint function, to be able to create strings for non-BMP characters?
This goes against the BRS design and falls down the Java slippery slope. We want one set of standard methods, extended from 16- to 21-bit chars, er, code points.
I guess my preference here would be to go with option 3 – tie the potentially breaking change to the BRS, but no need for new interface.
Definitely! That's perhaps unclear in my o.p. but I made a to-do out of rejecting Java and keeping the duplicate methods or hidden if statements under the "implementation hood" ("bonnet" for you ;-).
Andrew Oakley wrote:
Issues only arise in code that tries to treat a string as an array of 16-bit integers, and I don't think we should be particularly bothered by performance of code which misuses strings in this fashion (but clearly this should still work without opt-in to new string handling).
This is all strings in JS and the DOM, today.
That is, we do not have any measure of code that treats strings as uint16s, forges strings using "\uXXXX", etc. but the ES and DOM specs have allowed this for > 14 years. Based on bitter experience, it's
likely that if we change by fiat to 21-bit code points from 16-bit code units, some code on the Web will break.
And as noted in the o.p. and in the thread based on Allen's proposal last year, browser implementations definitely count on representation via array of 16-bit integers, with length property or method counting same.
Breaking the Web is off the table. Breaking implementations, less so. I'm not sure why you bring up UTF-8. It's good for encoding and decoding but for JS, unlike C, we want string to be a high level "full Unicode" abstraction. Not bytes with bits optionally set indicating more bytes follow to spell code points.
I think this is a nicer and more flexible model than string representations being dependent on which heap they came from - all issues related to encoding can be contained in the String object implementation.
You're ignoring the compatibility break here. Browser vendors can't afford to do that.
While this is being discussed, for any new string handling I think we should make any invalid strings (according to the rules in Unicode) cause some kind of exception on creation.
This is future-hostile if done for all code points. If done only for the code points in [D800,DFFF] both for literals using "\u{...}" and for constructive methods such as String.fromCharCode, then I agree.
On Feb 20, 2012, at 8:20 AM, Brendan Eich wrote:
Allen Wirfs-Brock wrote:
Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?
I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements.
I agree, which is why I'm saying that with the BRS set, we should forbid "\uXXXX", since that is not a code point but rather a code unit.
Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents you may end up with string elements containing upper or lower half surrogates.
I don't agree in the case of "\u{00d800}". That's simply an illegal code point, not a code unit (upper or lower half). We can reject it statically.
quoting:
On Feb 20, 2012, at 4:19 AM, Wes Garland wrote:
I think so, too -- especially as any sequence of Unicode code points -- including invalid and reserved code points -- constitutes a valid Unicode string, according to my recollection of the Unicode specification.
For the moment, I'll simply take Wes' word for the above, as it logically makes sense. For some uses, you want to process all possible code points (for example, when validating data from an external source). At this lowest level you don't want to impose higher level Unicode semantic constraints:
if (stringFromElseWhere.indexOf("\u{d800}") !== -1) ....
Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.
True, but not my point!
but elsewhere you said you would reject String.fromCharCode(0xd800),
so it sounds to me like you are trying to actually ban the occurrence of 0xd800 as the value of a string element.
What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code.
And arising from concatenations, avoiding the loss of Gavin's distributive .length property.
These aren't the same thing.
"\ud8000\udc00" is a specific syntactic construct where there must have been a specific user intent in writing it. Our legacy problem is that the intent becomes ambiguous when that same sequence might be interpreted under different BRS settings.
str1 + str2 is much less specific and all we know at runtime (assuming either str1 or str2 are strings) is that the user wants to concatenate them. The values might be:

  str1 = String.fromCharCode(0xd800);
  str2 = String.fromCharCode(0xdc00);
and the user might be intentionally constructing a string containing an explicit UTF-16 encoding that is going to be passed off to an external agent that specifically requires UTF-16.
Another way to express what I see as the problem with what you are proposing about imposing such string semantics:
Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggesting for ES strings? My sense is that if we went down the path you are suggesting, such an implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.
Allen Wirfs-Brock wrote:
For the moment, I'll simply take Wes' word for the above, as it logically makes sense. For some uses, you want to process all possible code points (for example, when validating data from an external source). At this lowest level you don't want to impose higher level Unicode semantic constraints:
if (stringFromElseWhere.indexOf("\u{d800}") !== -1) ....
Sorry, I disagree. We have a chance to keep Strings consistent with "full Unicode", or broken into uint16 pieces. There is no self-consistent third way that has 21-bit code points but allows one to jam what up until now have been both code points and code units into code points, where they will be misinterpreted.
If someone wants to do data hacking, Binary Data (Typed Arrays) are there (even in IE10pp).
Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.
True, but not my point!
but elsewhere you said you would reject String.fromCharCode(0xd800)
I'm being consistent (I hope!). I'd reject "\uXXXX" altogether with the BRS set. It's ambiguous at best, or (I argue, and you argue some of the time) it means code units, not code points. We're doing points now, no units, with the BRS set, so it has to go.
Same goes for constructive APIs taking (with the BRS set) code points. I see nothing but mischief arising from allowing [D800-DFFF]. Unicode gurus should school us if there's a use-case that can be sanely composed with "full Unicode" and "code points, not units" iteration.
so it sounds to me like you are trying to actually ban the occurrence of 0xd800 as the value of a string element.
Under the BRS set to "full Unicode", as a code point, yes.
What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code. And arising from concatenations, avoiding the loss of Gavin's distributive .length property.
These aren't the same thing.
"\ud8000\udc00" is a specific syntactic construct where there must have been a specific user intent in writing it.
(One too many 0s there.)
We do not want to guess. All I know is that "\ud800\udc00" means what it means today in ECMA-262 and conforming implementations. With the BRS set to "full Unicode", it could be taken to mean two code points, but that results in invalid Unicode and is not backward compatible. It could be read as one code point but that is what "\u{...}" is for and we want anyone migrating such "hardcoded" code into the BRS to check and choose.
Our legacy problem is that the intent becomes ambiguous when that same sequence might be interpreted under different BRS settings.
I propose to solve that by forbidding "\uXXXX" when the BRS is set.
str1 + str2 is much less specific and all we know at runtime (assuming either str1 or str2 are strings) is that the user wants to concatenate them. The values might be:

  str1 = String.fromCharCode(0xd800);
  str2 = String.fromCharCode(0xdc00);
and the user might be intentionally constructing a string containing an explicit UTF-16 encoding that is going to be passed off to an external agent that specifically requires UTF-16.
Nope, cuz I'm proposing String.fromCharCode calls such as those throw.
We should not be making more type-confusion hazards just to play a guessing game that might (but probably won't) preserve some edge-case "hardcoded" surrogate hacking that exists in code on the Web or behind a firewall today. Such code can do what it has always done, unless and until its maintainer throws the BRS. At that point early and runtime errors will provoke rewrite to "\u{...}", and with fromCharCode etc., 21-bit code points that are not reserved for surrogates.
Another way to express what I see as the problem with what you are proposing about imposing such string semantics:
Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggesting for ES strings? My sense is that if we went down the path you are suggesting, such an implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.
If you mean a metacircular evaluator, I don't think so. Can you show a counterexample?
If you mean a UTF-transcoder, then yes: binary data / typed arrays are required. That's the right answer.
Allen Wirfs-Brock wrote:
I really don't think any Unicode semantics should be built into the basic string representation. We need to decide on a max element size, and Unicode motivates 21 bits, but it could be 32 bits. Personally, I've lived through enough address space exhaustion episodes in my career to be skeptical of "small" values like 2^21 being good enough for the long term.
This does not seem justified to me as a future-proofing step. Instead, it invites my corollary to Postel's Law:
"If you are liberal in what you accept, others will utterly fail to be conservative in what they send."
to bite us, hard.
We do not want implementations today to accept non-Unicode code points under the BRS (also [D800-DFFF], IMHO). If tomorrow or on April 5, 2063 when Vulcans arrive to make first contact, we need 32 bits, we can be liberal then. Old implementations will choke on Vulcan, Klingon, etc., but so they should! They cannot do better, and simply need to be upgraded.
OTOH if we are too liberal now, people will stuff non-Unicode code points into strings and it will be up to a receiving peer on the Internet to make it right (or wrong). Receiver-makes-it-wrong failed in the 80s RPC wars.
Postel's law is not about allowing unknown new bits to flow into containers. It is about unexpected combinations at higher message and header/field levels. Note that the IP protocol had to pick 4-byte addresses, and IPv6 could not be foreseen or usefully future-proofed by using wider fields without specific rules governing the use of the extra bytes.
On Feb 20, 2012, at 10:52 AM, Brendan Eich wrote:
Allen Wirfs-Brock wrote: ...
Another way to express what I see as the problem with what you are proposing about imposing such string semantics:
Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggested for ES strings. My sense is that if we went down the path you are suggesting, such a implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.
If you mean a metacircular evaluator, I don't think so. Can you show a counterexample?
If you mean a UTF-transcoder, then yes: binary data / typed arrays are required. That's the right answer.
Not necessarily metacircular... it could be support for any language that imposes different semantic rules on string elements.
You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data) could not use ES strings as the target representation of its string data type. It also could not use the built-in ES string functions in the implementation of language X's built-in functions. It could not leverage any optimizations that an ES engine may apply to strings and string functions. Also, values of X's string type could not be directly passed in foreign calls to ES functions. Etc.
On Feb 20, 2012, at 8:37 AM, Brendan Eich wrote:
BRS makes 21-bit chars, so just as String.prototype.charCodeAt returns a code point, String.fromCharCode takes actual code point arguments.
Again I'd reject (dynamically in the case of String.fromCharCode) any in [0xd800, 0xdfff]. Other code points that are not characters I'd let through to future-proof, but not these reserved ones. Also any > 0x10ffff.
Okay, gotcha – so to clarify, once the BRS is thrown, it should be impossible to create a string in which any individual element is an unassigned code point (e.g. an unpaired UTF-16 surrogate) – all string elements should be valid unicode characters, right? (Or maybe a slightly weaker form of this: all string elements must be code points in the ranges 0...0xD7FF or 0xE000...0x10FFFF?)
Implementations that use uint16 vectors as the character data representation type for both "UCS-2" and "UTF-16" string variants would probably want another flag bit per string header indicating whether, for the UTF-16 case, the string indeed contained any non-BMP characters. If not, no proxy/copy needed.
If I understand your original proposal, you propose that UCS-2 strings coming from other sources be proxied to be iterated by unicode characters (e.g. if the DOM returns a string containing the code units "\uD800\uDC00" then JS code executing in a context with the BRS set will see this as having length 1, right?) If so, do you propose any special handling for access to unassigned unicode code points in UCS-2 strings returned from the DOM (or accessed from another global object, where the BRS is not set).
e.g.

  var ucs2d800 = foo(); // get a string containing "\uD800" from the DOM, or another global object in BRS=off mode
  var ucs2dc00 = bar(); // get a string containing "\uDC00" from the DOM, or another global object in BRS=off mode
  var a = ucs2d800[0];
  var b = ucs2d800.charCodeAt(0);
  var c = ucs2d800 + ucs2dc00;
  var c0 = c.charCodeAt(0);
  var c1 = c.charCodeAt(1);
If the proxy is to behave as if the UCS-2 string has been converted to a valid unicode string, then I'm guessing that conversion should have converted the unmatched surrogates in the UCS-2 into unicode replacement characters? If so, the length of c in the above example would be 2, and the values c0 & c1 would be 0xFFFD?
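As a purely illustrative sketch, this is what that conversion would look like if unmatched surrogates were indeed mapped to U+FFFD (nothing in the proposal mandates this particular choice):

  function replaceUnpairedSurrogates(ucs2) {
    var out = "";
    for (var i = 0; i < ucs2.length; i++) {
      var cu = ucs2.charCodeAt(i);
      if (cu >= 0xD800 && cu <= 0xDBFF && i + 1 < ucs2.length &&
          ucs2.charCodeAt(i + 1) >= 0xDC00 && ucs2.charCodeAt(i + 1) <= 0xDFFF) {
        out += ucs2.charAt(i) + ucs2.charAt(i + 1);   // well-formed pair passes through
        i++;
      } else if (cu >= 0xD800 && cu <= 0xDFFF) {
        out += "\uFFFD";                              // unpaired half becomes U+FFFD
      } else {
        out += ucs2.charAt(i);
      }
    }
    return out;
  }
  // Applied at the boundary in the example above, each incoming half becomes U+FFFD,
  // so c.length === 2 and both c0 and c1 are 0xFFFD.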
Allen Wirfs-Brock wrote:
On Feb 20, 2012, at 10:52 AM, Brendan Eich wrote:
Allen Wirfs-Brock wrote: ...
Another way to express what I see as the problem with what you are proposing about imposing such string semantics:
Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggested for ES strings. My sense is that if we went down the path you are suggesting, such a implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.
If you mean a metacircular evaluator, I don't think so. Can you show a counterexample?
If you mean a UTF-transcoder, then yes: binary data / typed arrays are required. That's the right answer.
Not necessarily, metacircular...it could be support for any language that imposes different semantic rules on string elements.
In that case, binary data / typed arrays, definitely.
You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data)
First, as a point of order: yes, JS strings as full Unicode does not want stray surrogate pair-halves. Does anyone disagree?
Second, binary data / typed arrays stand ready for any such not-full-Unicode use-cases.
could not use ES strings as the target representation of its string data type. It also could not use the built-in ES string functions in the implementation of language X's built-in functions.
Not if this hypothetical source language being compiled to JS wants other than full Unicode, no.
Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.
It could not leverage any optimizations that a ES engine may apply to strings and string functions.
Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.
Also, values of X's string type can not be directly passed in foreign calls to ES functions. Etc.
Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.
On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:
Allen Wirfs-Brock wrote:
...
You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not confirm to your rules (for example, by allowing occurrences of surrogate code points within string data) First, as a point of order: yes, JS strings as full Unicode does not want stray surrogate pair-halves. Does anyone disagree?
Well, I'm disagreeing. Do you know of any other language that has imposed these sorts of semantic restrictions on runtime string data?
Second, binary data / typed arrays stand ready for any such not-full-Unicode use-cases.
But lacks the same level of utility function support, not the least of which is RegExp
could not use ES strings as the target representation of its string data type. It also could not use the built-in ES string functions in the implementation of language X's built-in functions.
Not if this hypothetical source language being compiled to JS wants other than full Unicode, no.
Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.
My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.
It could not leverage any optimizations that a ES engine may apply to strings and string functions.
Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string function today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.
There are a lot of reasons why ES strings are not a good backing representation for C/C++ strings (to the extent that there even is a C string data type). But there are also lots of high-level languages that do not have those sorts of mapping issues.
If Type arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.
Also, values of X's string type can not be directly passed in foreign calls to ES functions. Etc.
Emscripten does have a runtime that maps browser functionailty exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.
But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unmatched surrogates are present.
Probably not such a big deal because it isn't using JS strings as its representation, but as hypothesized above that wouldn't necessarily be the case for other languages.
On 20 February 2012 16:00, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.
To pick one out of a hat, it might be nice to be able to use non-Unicode encodings, like GB 18030 or BIG5, and be able to use regexp methods on them when the BRS is on. (I'm struggling to find a really real real-world use-case, though)
Observation -- disallowing otherwise "legal" Unicode strings because they contain code points d800-dfff has very concrete implementation benefits: it's possible to use UTF-16 to represent the String's backing store. Without this concession, I fear it may not be possible to implement BRS-on without using a UTF-8 or full code point backing store (or some non-standard invention).
Maybe the answer is to consider (shudder) adding String-like utility functions to the TypedArrays? FWIW, CommonJS tried to go down this path and it turned out to be a lot of work for very little benefit (if any).
But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unmatched surrogates are present.
Only if the C strings are wide-character strings. 8-bit char strings are fine, they map right onto Latin-1 in native Unicode as well as the UTF-16 and UCS-2 encodings.
Allen Wirfs-Brock wrote:
On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:
Allen Wirfs-Brock wrote:
... You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data) First, as a point of order: yes, JS strings as full Unicode do not want stray surrogate pair-halves. Does anyone disagree?
Well, I'm disagreeing. Do you know of any other language that has imposed these sorts of semantic restrictions on runtime string data?
Sure, Java:
String
public String(int[] codePoints, int offset, int count)

    Allocates a new String that contains characters from a subarray of
    the Unicode code point array argument. The offset argument is the
    index of the first code point of the subarray and the count argument
    specifies the length of the subarray. The contents of the subarray
    are converted to chars; subsequent modification of the int array
    does not affect the newly created string.

    Parameters:
        codePoints - array that is the source of Unicode code points.
        offset - the initial offset.
        count - the length.
    Throws:
        IllegalArgumentException
        <http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IllegalArgumentException.html>
        - if any invalid Unicode code point is found in codePoints
        IndexOutOfBoundsException
        <http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html>
        - if the offset and count arguments index characters outside the
        bounds of the codePoints array.
    Since:
        1.5
Second, binary data / typed arrays stand ready for any such not-full-Unicode use-cases.
But they lack the same level of utility-function support, not the least of which is RegExp.
RegExp is miserable for Unicode, it's true. That doesn't strike me as compelling for making full-Unicode strings more bug-prone.
There is a strong case to be made for evolving RegExp to be usable with certain typed arrays (byte, uint16 at least). But that's another thread.
We should beef up RegExp Unicode escapes; another 'nother thread.
could not use ES strings as the target representation of its string data type. It also could not use the built-in ES string functions in the implementation of language X's built-in functions. Not if this hypothetical source language being compiled to JS wants other than full Unicode, no.
Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.
My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.
If that's true then we should have enough evidence that I'll happily concede the point and the spec will allow "\uD800" etc. in BRS-enabled literals. I do not see such evidence.
It could not leverage any optimizations that an ES engine may apply to strings and string functions. Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.
There are a lot of reasons why ES strings are not a good backing representation for C/C++ strings (to the extent that there even is a C string data type). But there are also lots of high-level languages that do not have those sorts of mapping issues.
Let's name some:
Java: see above. There may be some legacy need to support invalid Unicode but I'm not seeing it right now. Anyone?
Python: docs.python.org/release/3.0.1/howto/unicode.html#the-string-type -- lots here, similar to Ruby 1.9 (see below) but not obviously in need of invalid Unicode stored in uint16 vectors handled as JS strings.
Ruby: Ruby supports strings with multiple encodings; the encoding is part of the string's metadata. I am not the expert here, but I found
www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n
helpful, and the more recent
yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails
too. See also these very interesting posts from Sam Ruby in 2007:
intertwingly.net/blog/2007/12/28/3-1-2, intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated
Ruby raises exceptions when you mix two strings with different encodings incorrectly, e.g. by concatenation.
I'm not sure about surrogate validation, but from all this I gather that compiling Ruby to JS in full would need Uint8Array, along with lots of runtime helpers that do not come for free from JS's String.prototype methods, in order to handle all the encodings.
If Typed arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.
Yes, that's probably true. We'll keep feedback coming from Emscripten users and experts.
Also, values of X's string type can not be directly passed in foreign calls to ES functions. Etc. Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.
But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unmatched surrogates are present.
Your "it" there would be "Emscripten [would have to censor]"? I don't think so, or: I do not agree that "censor" is an apt description -- it seems loaded by implying something censorious is needed where without the error I'm proposing for [D800-DFFF], no censoring action would be needed.
ISO C says sizeof(char) == 1, so byte strings / string constants are either ISO 8859-1 and cannot form surrogates when zero-extended to 16 or 21 bits, or they're in some character set that needs more involved transcoding but again cannot by itself create surrogates.
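To make that concrete, here is a rough sketch (the helper name is invented for illustration) of the zero-extension in question; every resulting element is <= 0xFF, so none can fall in the D800-DFFF range:

// Illustrative only: widening 8-bit char data into JS string elements.
function bytesToJSString(bytes) {      // bytes: array of values 0..255
  var units = [];
  for (var i = 0; i < bytes.length; i++) {
    units.push(bytes[i] & 0xFF);       // zero-extend; always well below 0xD800
  }
  return String.fromCharCode.apply(null, units);
}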
C wide strings vary by platform. On some platforms wchar_t is 32 bits.
In any event, Emscripten currently does not use JS strings at all in its code generation (only internally in its JS-hosted libc).
Probably not such a big deal because it isn't using JS strings as its representation, but as hypothesized above that wouldn't necessarily be the case for other languages.
I don't see it. I may have missed it in my survey of Java, Python and Ruby. Please let me know if so.
As Brendan's link indicates, JSON is specified by RFC 4627, not by the ECMAScript Language Specification. JSON is widely used for data exchange with and between systems that have nothing to do with ECMAScript and the proposed BRS - see the middle section of www.json.org
So the only thing that can (and must) be done if and when updating the ECMAScript Language Specification for the BRS is to update the JSON section (15.12 in ES5) to describe how to map from the existing JSON syntax to the new BRS-on String representation. Note that JSON.stringify doesn't create Unicode escapes for anything other than control characters (presumably those identified in RFC 4627).
Norbert
On Feb 20, 2012, at 3:14 PM, Brendan Eich wrote:
Allen Wirfs-Brock wrote:
On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:
Allen Wirfs-Brock wrote:
... You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data) First, as a point of order: yes, JS strings as full Unicode do not want stray surrogate pair-halves. Does anyone disagree?
Well, I'm disagreeing. Do you know of any other language that has imposed these sorts of semantic restrictions on runtime string data? Sure, Java:
String
public String(int[] codePoints, int offset, int count)

Allocates a new String that contains characters from a subarray of the Unicode code point array argument. The offset argument is the index of the first code point of the subarray and the count argument specifies the length of the subarray. The contents of the subarray are converted to chars; subsequent modification of the int array does not affect the newly created string.

Parameters: codePoints - array that is the source of Unicode code points. offset - the initial offset. count - the length.
Throws: IllegalArgumentException (docs.oracle.com/javase/1.5.0/docs/api/java/lang/IllegalArgumentException.html) - if any invalid Unicode code point is found in codePoints. IndexOutOfBoundsException (docs.oracle.com/javase/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html) - if the offset and count arguments index characters outside the bounds of the codePoints array.
Since: 1.5
Note that the above says "invalid Unicode code point". 0xd800 is a valid Unicode code point. It isn't a valid Unicode character.
See docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isValidCodePoint(int)
Determines whether the specified code point is a valid Unicode code point value in the range of 0x0000 to 0x10FFFF inclusive. This method is equivalent to the expression: codePoint >= 0x0000 && codePoint <= 0x10FFFF
Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.
My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.
If that's true then we should have enough evidence that I'll happily concede the point and the spec will allow "\uD800" etc. in BRS-enabled literals. I do not see such evidence.
Note my concern isn't so much about literals as it is about string elements created via String.fromCharCode
The only String.prototype method algorithms that seem to have any Unicode dependencies are toLowerCase/toUpperCase (and the locale variants of those methods), perhaps localeCompare, trim (which knows the Unicode white space character classification), and the regular-expression-based methods if the regexp is constructed with literal chars or uses character classes.
The concat, slice, substring, indexOf/lastIndexOf, and non-regexp-based replace and split calls are all defined in terms of string element value comparisons and don't really care about what character set is used.
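For example (today's 16-bit semantics, shown only to illustrate the point):

var s = String.fromCharCode(0xD800) + "abc";  // unpaired surrogate, no error today
s.length;          // 4
s.indexOf("a");    // 1
s.slice(1, 3);     // "ab"
s.charCodeAt(0);   // 0xD800, read back exactly as stored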
Wes Garland mentioned the possibility of using non-Unicode character sets such as Big5
It could not leverage any optimizations that an ES engine may apply to strings and string functions. Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.
There are a lot of reasons why ES strings are not a good backing representation for C/C++ strings (to the extent that there even is a C string data type). But there are also lots of high-level languages that do not have those sorts of mapping issues.
Let's name some:
Java: see above. There may be some legacy need to support invalid Unicode but I'm not seeing it right now. Anyone?
See above: it allows all Unicode code points and does not restrict strings to well-formed UTF-16 encodings.
Python: docs.python.org/release/3.0.1/howto/unicode.html#the-string-type -- lots here, similar to Ruby 1.9 (see below) but not obviously in need of invalid Unicode stored in uint16 vectors handled as JS strings.
I don't see any restrictions in that doc on strings containing \ud800 and friends. Unless there are, BRS-enabled ES strings couldn't be used as the representation type for Python strings.
The actual representation type used by the conventional Python implementation isn't yet clear to me, but clearly it supports many character encodings besides Unicode: docs.python.org/library/codecs.html#standard-encodings
Ruby: Ruby supports strings with multiple encodings; the encoding is part of the string's metadata. I am not the expert here, but I found
www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n
helpful, and the more recent
yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails
too. See also these very interesting posts from Sam Ruby in 2007:
intertwingly.net/blog/2007/12/28/3-1-2, intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated
Ruby raises exceptions when you mix two strings with different encodings incorrectly, e.g. by concatenation.
I'm not sure about surrogate validation, but from all this I gather that compiling Ruby to JS in full would need Uint8Array, along with lots of runtime helpers that do not come for free from JS's String.prototype methods, in order to handle all the encodings.
If Typed arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.
But, using current 16-bit JS string semantics, a JS string could still be used as the character store for many of these encodings with the metadata stored separately (probably a RubyString wrapper object), and the character-set-insensitive JS string methods could be used to implement the Ruby semantics.
BRS excluding surrogate codes would at the very least require additional special case handling when dealing with Ruby strings containing those code points.
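A rough sketch of the kind of wrapper I have in mind (all names invented just for illustration; this is not a proposal for a built-in):

// Hypothetical RubyString-style wrapper: the JS string is just a code-unit
// buffer, and the encoding lives in separate metadata.
function RubyString(units, encoding) {
  this.units = units;        // ordinary 16-bit JS string used as storage
  this.encoding = encoding;  // e.g. "UTF-8", "Shift_JIS"
}
RubyString.prototype.concat = function (other) {
  if (this.encoding !== other.encoding) {
    throw new Error("incompatible character encodings");  // Ruby-like check
  }
  return new RubyString(this.units + other.units, this.encoding);
};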
Yes, that's probably true. We'll keep feedback coming from Emscripten users and experts.
Also, values of X's string type can not be directly passed in foreign calls to ES functions. Etc. Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.
But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unmatched surrogates are present.
Your "it" there would be "Emscripten [would have to censor]"? I don't think so, or: I do not agree that "censor" is an apt description -- it seems loaded by implying something censorious is needed where without the error I'm proposing for [D800-DFFF], no censoring action would be needed.
Yes, I meant the Emscripten runtime "foreign" call support for calling JS functions. I did mean censor in that sense. Assume that you want to automatically convert WCHAR* strings to a JS string to pass as an argument to such calls. Today, without the BRS, you can just form a JS string containing all the WCHARs without analyzing the UTF-16 well-formedness of the C string. With the BRS flipped you would at the very least have to make sure it is well-formed UTF-16 and either throw or remove any unpaired surrogates.
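Roughly, the extra step looks like this (a sketch with invented helper names, not Emscripten's actual runtime):

// Today: just copy the 16-bit units into a JS string, no validation needed.
function wcharsToJSString(units) {               // units: array of uint16 values
  return String.fromCharCode.apply(null, units);
}
// With the BRS flipped, the conversion would first have to check well-formedness:
function wcharsToJSStringBRS(units) {
  for (var i = 0; i < units.length; i++) {
    var u = units[i];
    if (u >= 0xD800 && u <= 0xDBFF) {            // lead surrogate needs a trail
      var next = units[i + 1];
      if (!(next >= 0xDC00 && next <= 0xDFFF)) throw new Error("unpaired surrogate");
      i++;                                       // skip the trail we just checked
    } else if (u >= 0xDC00 && u <= 0xDFFF) {
      throw new Error("unpaired surrogate");     // lone trail surrogate
    }
  }
  return String.fromCharCode.apply(null, units); // or transcode pairs to code points
}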
ISO C says sizeof(char) == 1, so byte strings / string constants are either ISO 8859-1 and cannot form surrogates when zero-extended to 16 or 21 bits, or they're in some character set that needs more involved transcoding but again cannot by itself create surrogates.
C wide strings vary by platform. On some platforms wchar_t is 32 bits.
In any event, Emscripten currently does not use JS strings at all in its code generation (only internally in its JS-hosted libc).
Probably not such a big deal because it isn't using JS strings as its representation, but as hypothesized above that wouldn't necessarily be the case for other languages.
I don't see it. I may have missed it in my survey of Java, Python and Ruby. Please let me know if so.
:-) How could I pass on the opportunity? See above.
On Feb 20, 2012, at 1:42 PM, Wes Garland wrote:
On 20 February 2012 16:00, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
... Observation -- disallowing otherwise "legal" Unicode strings because they contain code points d800-dfff has very concrete implementation benefits: it's possible to use UTF-16 to represent the String's backing store. Without this concession, I fear it may not be possible to implement BRS-on without using a UTF-8 or full code point backing store (or some non-standard invention).
(or using multiple representations)
Yes, I understand. If it is a requirement (or even a goal) to enable implementation to use UTF-16 as the backing store, we should be clearer about it being so.
Maybe the answer is to consider (shudder) adding String-like utility functions to the TypedArrays? FWIW, CommonJS tried to go down this path and it turned out to be a lot of work for very little benefit (if any).
But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unmatched surrogates are present.
Only if the C strings are wide-character strings. 8-bit char strings are fine, they map right onto Latin-1 in native Unicode as well as the UTF-16 and UCS-2 encodings.
Yes, I was assuming WCHAR strings
Allen Wirfs-Brock wrote:
On Feb 20, 2012, at 3:14 PM, Brendan Eich wrote: Note that the above says "invalid Unicode code point". 0xd800 is a valid Unicode code point. It isn't a valid Unicode character.
See docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isValidCodePoint(int)
Determines whether the specified code point is a valid Unicode code point value in the range of 0x0000 to 0x10FFFF inclusive. This method is equivalent to the expression: codePoint >= 0x0000 && codePoint <= 0x10FFFF
I should have remembered this, from the old days of Java and JS talking (LiveConnect). Strike one for me.
If that's true then we should have enough evidence that I'll happily concede the point and the spec will allow "\uD800" etc. in BRS-enabled literals. I do not see such evidence.
Note my concern isn't so much about literals as it is about string elements created via String.fromCharCode
The only String.prototype method algorithms that seem to have any Unicode dependencies are toLowerCase/toUpperCase (and the locale variants of those methods), perhaps localeCompare, trim (which knows the Unicode white space character classification), and the regular-expression-based methods if the regexp is constructed with literal chars or uses character classes.
The concat, slice, substring, indexOf/lastIndexOf, and non-regexp-based replace and split calls are all defined in terms of string element value comparisons and don't really care about what character set is used.
Wes Garland mentioned the possibility of using non-Unicode character sets such as Big5
These are byte-based encodings, no? What is the problem inflating them by zero extension to 16 bits now (or 21 bits in the future)? You can't make an invalid Unicode character from a byte value.
Anyway, Big5 punned into JS strings (via a C or C++ API?) is not a strong use-case for ignoring invalid characters.
Ball one. :-P
Python: docs.python.org/release/3.0.1/howto/unicode.html#the-string-type -- lots here, similar to Ruby 1.9 (see below) but not obviously in need of invalid Unicode stored in uint16 vectors handled as JS strings.
I don't see any restrictions in that doc on strings containing \ud800 and friends. Unless there are, BRS-enabled ES strings couldn't be used as the representation type for Python strings.
You're right, you can make a literal in Python 3 such as '\ud800' without error. Strike two.
Ruby: Ruby supports strings with multiple encodings; the encoding is part of the string's metadata. I am not the expert here, but I found
www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n
helpful, and the more recent
yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails
too. See also these very interesting posts from Sam Ruby in 2007:
intertwingly.net/blog/2007/12/28/3-1-2, intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated
Ruby raises exceptions when you mix two strings with different encodings incorrectly, e.g. by concatenation.
I'm not sure about surrogate validation, but from all this I gather that compiling Ruby to JS in full would need Uint8Array, along with lots of runtime helpers that do not come for free from JS's String.prototype methods, in order to handle all the encodings.
If Typed arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.
But, using current 16-bit JS string semantics, a JS string could still be used as the character store for many of these encodings with the metadata stored separately (probably a RubyString wrapper object), and the character-set-insensitive JS string methods could be used to implement the Ruby semantics.
Did I get a hit off your pitch, then? Because Ruby does at least raise exceptions on mixed encoding concatenations.
But I'm about to strike out on the next pitch (language). You're almost certainly right that most languages with "full Unicode" support allow the programmer to create invalid strings via literals and constructors. It also seems common for charset-encoding APIs to validate and throw on invalid characters, which makes sense.
I could live with this, especially for String.fromCharCode.
For "\uD800..." in a BRS-enabled string literal, it still seems to me something is going to go wrong right away. Or rather, something should (like, early error). But based on precedent, and for the odd usage that doesn't go wrong ever (reads back code units, or has native code reading them and reassembling uint16 elements), I'll go along here too.
This means Gavin's option
- Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
as you noted in reply to him.
BRS excluding surrogate codes would at the very least require additional special case handling when dealing with Ruby strings containing those code points.
I suspect Ruby-on-JS is best done via Emscripten (demo'ed at JSConf.eu 2011), which makes this moot. With Emscripten you get the canonical Ruby implementation, not a hand-coded JS work-(mostly)-alike.
Yes I meant the Emscripten runtime "foreign" call support for calling JS functions. I did mean censor in that sense. Assume that you want to automatically convert WCHAR*
Again, wchar_t is not uint16 on all platforms.
At this point I'm not going to try my luck at bat again. Gavin's option 2 at least preserves .length distributivity over concatenation. So let's go on to other issues. What's next?
On 02/20/12 16:47, Brendan Eich wrote:
Andrew Oakley wrote:
Issues only arise in code that tries to treat a string as an array of 16-bit integers, and I don't think we should be particularly bothered by performance of code which misuses strings in this fashion (but clearly this should still work without opt-in to new string handling).
This is all strings in JS and the DOM, today.
That is, we do not have any measure of code that treats strings as uint16s, forges strings using "\uXXXX", etc. but the ES and DOM specs have allowed this for > 14 years. Based on bitter experience, it's likely that if we change by fiat to 21-bit code points from 16-bit code units, some code on the Web will break.
Sorry, I don't think I was particularly clear. The point I was trying to make is that we can pretend that code points are 16-bit but actually use a 21-bit representation internally. If content requests proper Unicode support we simply switch to allowing 21-bit code-points and stop encoding characters outside the BMP using surrogate pairs (because the characters now fit in a single code point).
And as noted in the o.p. and in the thread based on Allen's proposal last year, browser implementations definitely count on representation via array of 16-bit integers, with length property or method counting same.
Breaking the Web is off the table. Breaking implementations, less so. I'm not sure why you bring up UTF-8. It's good for encoding and decoding but for JS, unlike C, we want string to be a high level "full Unicode" abstraction. Not bytes with bits optionally set indicating more bytes follow to spell code points.
Yes, I probably shouldn't have brought up UTF-8 (we do store strings using UTF-8, I was thinking about our own implementation). The intention was not to "break the web", my comments about issues when strings were misused were purely performance concerns, behaviour would otherwise remain unchanged (unless full Unicode support had been enabled).
On 21 February 2012 00:03, Brendan Eich <brendan at mozilla.com> wrote:
These are byte-based encodings, no? What is the problem inflating them by zero extension to 16 bits now (or 21 bits in the future)? You can't make an invalid Unicode character from a byte value.
One of my examples, GB 18030, is a four-byte encoding and a Chinese government standard. It is a mapping onto Unicode, but this mapping is table-driven rather than algorithm driven like the UTF-* transport formats. To provide a single example, Unicode 0x2259 maps onto GB 18030 0x8136D830.
You're right about Big5 being byte-oriented, maybe this was a bad example, although it is a double-byte charset. It works by putting ASCII down low making bytes above 0x7f escapes into code pages dereferenced by the next byte. Each code point is encoded with one or two bytes, never more. If I were developing with Big5 in JS, I would store the byte stream 4a 4b d8 00 c1 c2 4c as 004a 004b d800 c1c2 004c. This would allow me to use JS regular expressions and so on.
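In code, roughly (a sketch, not something I'm recommending):

// Pack a Big5 byte stream into 16-bit string elements: lead bytes >= 0x80
// take their trail byte with them, ASCII bytes become one element each.
function punBig5IntoString(bytes) {
  var units = [];
  for (var i = 0; i < bytes.length; i++) {
    if (bytes[i] >= 0x80) {
      units.push((bytes[i] << 8) | bytes[i + 1]);  // d8 00 -> 0xd800, c1 c2 -> 0xc1c2
      i++;
    } else {
      units.push(bytes[i]);                        // 4a -> 0x004a
    }
  }
  return String.fromCharCode.apply(null, units);
}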
Anyway, Big5 punned into JS strings (via a C or C++ API?) is not a strong
use-case for ignoring invalid characters.
Agreed - I'm stretching to see if I can stretch far enough to find a real problem with BRS -- because I really want it.
But the data does not need to arrive from C API -- it could easily be delivered by an XHR request where, say, the remote end dumps database rows into a transport format based around evaluating JS string literals (like JSON).
Ball one. :-P
If I hit the batter, does he get to first base?
We still haven't talked about equality and normalization, I suppose that can wait.
Andrew Oakley wrote:
On 02/20/12 16:47, Brendan Eich wrote:
Andrew Oakley wrote:
Issues only arise in code that tries to treat a string as an array of 16-bit integers, and I don't think we should be particularly bothered by performance of code which misuses strings in this fashion (but clearly this should still work without opt-in to new string handling).
This is all strings in JS and the DOM, today.
That is, we do not have any measure of code that treats strings as uint16s, forges strings using "\uXXXX", etc. but the ES and DOM specs have allowed this for > 14 years. Based on bitter experience, it's likely that if we change by fiat to 21-bit code points from 16-bit code units, some code on the Web will break.
Sorry, I don't think I was particularly clear. The point I was trying to make is that we can pretend that code points are 16-bit but actually use a 21-bit representation internally.
So far, that's like Allen's proposal from last year (strawman:support_full_unicode_in_strings). But you didn't say how iteration (indexing and .length) work.
If content requests proper Unicode support we simply switch to allowing 21-bit code-points and stop encoding characters outside the BMP using surrogate pairs (because the characters now fit in a single code point).
How does content request proper Unicode support? Whatever that gesture is, it's big and red ;-). But we don't have such a switch or button to press like that, yet.
If a .js or .html file as fetched from a server has a UTF-8 encoding, indeed non-BMP characters in string literals will be transcoded in open-source browsers and JS engines that use uint16 vectors internally, but each part of the surrogate pair will take up one element in the uint16 vector. Let's take this now as a "content request" to use full Unicode. But the .js file was developed 8 years ago and assumes two code units, not one. It hardcodes for that assumption, somehow (indexing, .length exact value, indexOf('\ud800'), etc.). It is now broken.
And non-literal non-BMP characters won't be helped by transcoding differently when the .js or .html file is fetched. They'll just change "size" at runtime.
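A contrived example of the kind of legacy code I mean (purely hypothetical; the helper name is invented):

// Written years ago against 16-bit code units:
var gclef = "\uD834\uDD1E";                   // U+1D11E MUSICAL SYMBOL G CLEF
if (gclef.length !== 2) fixUpLegacyTables();  // hypothetical; breaks if length silently becomes 1
var lead = gclef.charCodeAt(0);               // expects 0xD834, not 0x1D11E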
Wes Garland wrote:
On 21 February 2012 00:03, Brendan Eich <brendan at mozilla.com <mailto:brendan at mozilla.com>> wrote:
Ball one. :-P
If I hit the batter, does he get to first base?
Walk, yes (en.wikipedia.org/wiki/Hit_by_pitch).
We still haven't talked about equality and normalization, I suppose that can wait.
Allen's point in this last bit of the thread is that we don't need to interfere with bits stuffed into code units today, so we shouldn't tomorrow when units become as wide as (or wider than) code points. GIGO, and equality is memcmp (if you mean == and ===).
Normalization happens to source upstream of the JS engine. Here I'll call on a designated Unicode hitter. ;-)
Brendan Eich wrote:
in open-source browsers and JS engines that use uint16 vectors internally
Sorry, that reads badly. All I meant is that I can't tell what closed-source engines do, not that they do not comply with ECMA-262 combined with other web standards to have the same observable effect, e.g. Allen's example:
var c = "😁" // where the single character between the quotes is the Unicode character U+1f638
c.length == 2;
c === "\ud83d\ude38";       // the two-code-unit UTF-16 encoding of 0x1f638
c.charCodeAt(0) == 0xd83d;
c.charCodeAt(1) == 0xde38;
Still no BRS to set, we need one if we want a full-Unicode outcome (c.length == 1, etc.).
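For contrast, a sketch (not spec text) of the full-Unicode outcome I mean:

// Hypothetical BRS-on behavior for the same character:
var c = "\u{1F638}";         // proposed up-to-six-digit escape syntax
c.length == 1;               // one string element per code point
c.charCodeAt(0) == 0x1F638;  // elements hold full code points, not UTF-16 halves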
Normalization happens to source upstream of the JS engine. Here I'll call on a designated Unicode hitter. ;-)
I agree that Unicode Normalization shouldn't happen automagically in the JS engine. I rather doubt that "normalization happens to source upstream of the JS engine", unless by "upstream" you mean "best see to the normalization yourself".
By contrast, providing a method for normalizing strings would be useful.
Addison
Phillips, Addison wrote:
Normalization happens to source upstream of the JS engine. Here I'll call on a designated Unicode hitter. ;-)
I agree that Unicode Normalization shouldn't happen automagically in the JS engine. I rather doubt that "normalization happens to source upstream of the JS engine", unless by "upstream" you mean "best see to the normalization yourself".
Yes ;-).
I meant ECMA-262 punts source normalization upstream in the spec pipeline that runs parallel to the browser's loading-the-URL | processing-what-was-loaded pipeline. ECMA-262 is concerned only with its little slice of processing heaven.
By contrast, providing a method for normalizing strings would be useful. /summon Norbert.
I meant ECMA-262 punts source normalization upstream in the spec pipeline that runs parallel to the browser's loading-the-URL | processing-what-was- loaded pipeline. ECMA-262 is concerned only with its little slice of processing heaven.
Yep. One of the problems is that the source script may not be using a Unicode encoding or may be using a Unicode encoding and be serialized in a non-normalized form. Your slice of processing heaven treats Unicode-normalization-equivalent-yet-different-codepoint-sequence tokens as unequal. Not that this is a bad thing.
By contrast, providing a method for normalizing strings would be useful. /summon Norbert.
(hides the breakables, listens for thunder)
Addison
Because it has always been possible, it’s difficult to say how many scripts have transported byte-oriented data by “punning” the data into strings. Actually, I think this is more likely to be truly binary data rather than text in some non-Unicode character encoding, but anything is possible, I suppose. This could include using non-character values like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running implementation would break a script that relied on String being a sequence of 16-bit unsigned integer values with no error checking.
One of my examples, GB 18030, is a four-byte encoding and a Chinese government standard. It is a mapping onto Unicode, but this mapping is table-driven rather than algorithm driven like the UTF-* transport formats. To provide a single example, Unicode 0x2259 maps onto GB 18030 0x8136D830.

GB 18030 is more complex than that. Not all characters are four-byte, for example. As a multibyte encoding, you might choose to “pun” GB 18030 into a String as 81 36 d8 30. There isn’t much attraction to punning it into 0x8136 0xd830, but, as noted above, someone might be foolish enough to try it ;-). Scripts that rely on this probably break under BRS.
You're right about Big5 being byte-oriented, maybe this was a bad example, although it is a double-byte charset. It works by putting ASCII down low making bytes above 0x7f escapes into code pages dereferenced by the next byte. Each code point is encoded with one or two bytes, never more. If I were developing with Big5 in JS, I would store the byte stream 4a 4b d8 00 c1 c2 4c as 004a 004b d800 c1c2 004c. This would allow me to use JS regular expressions and so on.

Not exactly. The trailing bytes in Big5 start at 0x40, for example. But it is certainly the case that some multibyte characters in Big5 happen to have the same byte-pair as a surrogate code point (when considered as a pair of bytes) or other non-character in the Unicode BMP, and one might (he says, squinting really hard) want to do as you suggest and record the multibyte sequence as a single code point.

But the data does not need to arrive from C API -- it could easily be delivered by an XHR request where, say, the remote end dumps database rows into a transport format based around evaluating JS string literals (like JSON).

Allowing isolated invalid sequences isn’t actually the problem, if you think about it. Yes, the data is bad and yes you can’t view it cleanly. But you can do whatever you need to on it. The problem is when you intend to store two values that end up as a single character. If I have a string with code points “f235 5e7a e040 d800”, the d800 does no particular harm. The problem is: if I construct a BRS string using that sequence and then concatenate the sequence “dc00 a053 3254” onto it, the resulting string is only six characters long, rather than the expected seven, since presumably the d800 dc00 pair turns into U+10000.

Addison
Phillips, Addison wrote:
Because it has always been possible, it’s difficult to say how many scripts have transported byte-oriented data by “punning” the data into strings. Actually, I think this is more likely to be truly binary data rather than text in some non-Unicode character encoding, but anything is possible, I suppose. This could include using non-character values like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running implementation would break a script that relied on String being a sequence of 16-bit unsigned integer values with no error checking.
Allen's view of the BRS-enabled semantics would have 16-bit "GIGO" without exceptions -- you'd be storing 16-bit values, whatever their source (including "\uXXXX" literals spelling invalid characters and unmatched surrogates) in at-least-21-bit elements of strings, and reading them back.
My concern and reason for advocating early or late errors on shenanigans was that people today writing surrogate pairs literally and then taking extra pains in JS or C++ (whatever the host language might be) to process them as single code points and characters would be broken by the BRS-enabled behavior of separating the parts into distinct code points.
But that's pessimistic. It could happen, but OTOH anyone coding surrogate pairs might want them to read back piece-wise when indexing. In that case what Allen proposes, storing each formerly 16-bit code unit, however expressed, in the wider 21-or-more-bits unit, and reading back likewise, would "just work".
Sorry if this is all obvious. Mainly I want to throw in my lot with Allen's exception-free literal/constructor approach. The encoding APIs should throw on invalid Unicode but literals and strings as immutable 16-bit storage buffers should work as today.
On Feb 21, 2012, at 7:37 AM, Brendan Eich wrote:
Brendan Eich wrote:
in open-source browsers and JS engines that use uint16 vectors internally
Sorry, that reads badly. All I meant is that I can't tell what closed-source engines do, not that they do not comply with ECMA-262 combined with other web standards to have the same observable effect, e.g. Allen's example:
A quick scan of code.google.com/p/v8/issues/detail?id=761 suggests that there may be more variability among current browsers than we thought. I haven't tried my original test case in Chrome or IE9, but the discussion in this bug report suggests that their behavior may currently be different from FF.
Hi Mark, thanks for this post.
Mark Davis ☕ wrote:
UTF-8 represents a code point as 1-4 8-bit code units
"1-6".
UTF-16 represents a code point as 2 or 4 16-bit code units
"1 or 2".
Lock up your encoders, I am so not a Unicode guru but this is what my reptile coder brain remembers.
On Tue, Feb 21, 2012 at 3:11 PM, Brendan Eich <brendan at mozilla.com> wrote:
Hi Mark, thanks for this post. Mark Davis ☕ wrote:
UTF-8 represents a code point as 1-4 8-bit code units
"1-6". ... Lock up your encoders, I am so not a Unicode guru but this is what my reptile coder brain remembers.
Only theoretically. UTF-8 has been locked down to the same range that UTF-16 has (RFC 3629), so the largest real character you'll see is 4 bytes, as that gives you exactly 21 bits of data.
Hi Mark, thanks for this post.
Mark Davis ☕ wrote:
UTF-8 represents a code point as 1-4 8-bit code units
"1-6".
No. 1 to 4. Five and six byte "UTF-8" sequences are illegal and invalid.
UTF-16 represents a code point as 2 or 4 16-bit code units
"1 or 2".
Yes, 1 or 2 16-bit code units (that's 2 or 4 bytes, of course).
Addison
Addison Phillips Globalization Architect (Lab126) Chair (W3C I18N WG)
Internationalization is not a feature. It is an architecture.
Thanks, all! That's a relief to know; six bytes always seemed too long, but my reptile coder brain was also reptile-coder-lazy and I never dug into it.
I'll reply to Brendan's proposal in two parts: first about the goals for supplementary character support, second about the BRS.
Full 21-bit Unicode support means all of:
- indexing by characters, not uint16 storage units;
- counting length as one greater than the last index; and
- supporting escapes with (up to) six hexadecimal digits.
For me, full 21-bit Unicode support has a different priority list.
First come the essentials: Regular expressions; functions that interpret strings; the overall sense that all Unicode characters are supported.
- Regular expressions must recognize supplementary characters as atomic entities, and interpret them according to Unicode semantics.
Look at the contortions one has to go through currently to describe a simple character class that includes supplementary characters: roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js
Read up on why it has to be done this way, and see to what extremes some people are going to make supplementary characters work despite ECMAScript: inimino.org/~inimino/blog/javascript_cset
Now, try to figure out how you'd convert a user-entered string to a regular expression such that you can search for the string without case distinction, where the string may contain supplementary characters such as "𐐶𐐲𐑌" (Deseret for "one").
Regular expressions matter a lot here because, if done properly, they eliminate much of the need for iterating over strings manually.
- Built-in functions that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics. The list of functions in ES5 that violate this principle is actually rather short: Besides the String functions relying on regular expressions (match, replace, search, split), they're the String case conversion functions (toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the relational comparison for strings (11.8.5). But the principle is also important for new functionality being considered for ES6 and above.
- It must be clear that the full Unicode character set is allowed and supported. This means at least getting rid of the reference to UCS-2 (clause 2) and the bizarre equivalence between characters and UTF-16 code units (clause 6). ECMAScript has already defined several ways to create UTF-16 strings containing supplementary characters (parsing UTF-8 source; using Unicode escapes for surrogate pairs), and lets applications freely pass around such strings. Browsers have surrounded ECMAScript implementations with text input, text rendering, DOM APIs, and XMLHTTPRequest with full Unicode support, and generally use full UTF-16 to exchange text with their ECMAScript subsystem. Developers have used this to build applications that support supplementary characters, hacking around the remaining gaps in ECMAScript as seen above. But, as in the bug report that Brendan pointed to this morning (code.google.com/p/v8/issues/detail?id=761), the mention of UCS-2 is still used by some to excuse bugs.
Only after these essentials come the niceties of String representation and Unicode escapes:
- 1 String element to 1 Unicode code point is indeed a very nice and desirable relationship. Unlike Java, where binary compatibility between virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be compatible only at the source code level - or maybe, with a BRS, not even that.
- If we don't go for UTF-32, then there should be a few functions to simplify access to strings in terms of code points, such as String.fromCodePoint, String.prototype.codePointAt.
- I strongly prefer the use of plain characters over Unicode escapes in source code, because plain text is much easier to read than sequences of hex values. However, the need for Unicode escapes is greater in the space of supplementary characters because here we often have to reference characters for which our operating systems don't have glyphs yet. And \u{1D11E} certainly makes it easier to cross-reference a character than \uD834\uDD1E. The new escape syntax therefore should be on the list, at low priority.
I think it would help if other people involved in this discussion also clarified what exactly their requirements are for "full Unicode support".
Norbert
On Feb 21, 2012, at 6:05 PM, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
I'll reply to Brendan's proposal in two parts: first about the goals for supplementary character support, second about the BRS.
Full 21-bit Unicode support means all of:
- indexing by characters, not uint16 storage units;
- counting length as one greater than the last index; and
- supporting escapes with (up to) six hexadecimal digits.
For me, full 21-bit Unicode support has a different priority list.
First come the essentials: Regular expressions; functions that interpret strings; the overall sense that all Unicode characters are supported.
- Regular expressions must recognize supplementary characters as atomic entities, and interpret them according to Unicode semantics.
Sorry to have been unclear. In my proposal this follows from the first two bullets.
- Built-in functions that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics.
Ditto.
- It must be clear that the full Unicode character set is allowed and supported.
Absolutely.
Only after these essentials come the niceties of String representation and Unicode escapes:
- 1 String element to 1 Unicode code point is indeed a very nice and desirable relationship. Unlike Java, where binary compatibility between virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be compatible only at the source code level - or maybe, with a BRS, not even that.
Right!
- If we don't go for UTF-32, then there should be a few functions to simplify access to strings in terms of code points, such as String.fromCodePoint, String.prototype.codePointAt.
Those would help smooth out different BRS settings, indeed.
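A rough sketch of what those could look like over today's 16-bit strings (the exact API shape is TBD; free functions are used here just for illustration):

// Sketch only -- method names and placement are not settled.
function codePointAt(s, i) {
  var first = s.charCodeAt(i);
  if (first >= 0xD800 && first <= 0xDBFF && i + 1 < s.length) {
    var second = s.charCodeAt(i + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
    }
  }
  return first;
}
function fromCodePoint(cp) {
  if (cp <= 0xFFFF) return String.fromCharCode(cp);
  cp -= 0x10000;
  return String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
}
codePointAt("\uD834\uDD1E", 0) === 0x1D11E;  // true
fromCodePoint(0x1D11E) === "\uD834\uDD1E";   // true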
- I strongly prefer the use of plain characters over Unicode escapes in source code, because plain text is much easier to read than sequences of hex values. However, the need for Unicode escapes is greater in the space of supplementary characters because here we often have to reference characters for which our operating systems don't have glyphs yet. And \u{1D11E} certainly makes it easier to cross-reference a character than \uD834\uDD1E. The new escape syntax therefore should be on the list, at low priority.
Allen and I were just discussing this as a desirable mini-strawman of its own, which Allen will write up for consideration at the next meeting.
We will also discuss the BRS. Did you have some thoughts on it?
I think it would help if other people involved in this discussion also clarified what exactly their requirements are for "full Unicode support".
Again, apologies for not being explicit. I model the string methods as self-hosted using indexing and .length in straightforward ways. HTH,
Second part: the BRS.
I'm wondering how development and deployment of existing full-Unicode software will play out in the presence of a Big Red Switch. Maybe I'm blind and there are ways to simplify the process, but this is how I imagine it.
Let's start with a bit of code that currently supports full Unicode by hacking around ECMAScript's limitations: roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js
To support applications running in a BRS-on environment, Roozbeh would have to create a parallel version of the module that (a) takes advantage of regular expressions that finally support supplementary characters and (b) uses the new Unicode escape syntax instead of the old one. The parallel version has to be completely separate because a BRS-on environment would reject the old Unicode escapes and an ES5/BRS-off environment would reject the new Unicode escapes.
To get the code tested, he also has to create a parallel version of the test cases. The parallel version would be functionally identical but set up a BRS-on environment and use the new Unicode escape syntax instead of the old one. The parallel version has to be completely separate because a BRS-on environment would reject the old Unicode escapes and an ES5/BRS-off environment would reject the new Unicode escapes. Fortunately the test cases are simple.
Then he has to figure out how the two separate versions of the module will get loaded by clients. It's a YUI module, and the YUI loader already has the ability to look at several parameters to figure out what to load (minimized vs. debug version, localized resource bundles, etc.), so maybe the BRS should be another parameter? But the YUI team has a long to-do list, so in the meantime the module gets two separate names, and the client has to figure out which one to request.
The first client picking up the new version is another, bigger library. As a library it doesn't control the BRS, so it has to be able to run with both BRS-on and BRS-off. So it has to check the BRS and load the appropriate version of the intl-bidi module at runtime. This means, it also has to be tested in both environments. Its test cases are not simple. So now it needs modifications to the test framework to run the test suite twice, once with BRS-on and once with BRS-off.
An application using the library and thus the intl-bidi module decides to take the plunge and switch to BRS-on. It doesn't do text processing itself (that's what libraries are for), and it doesn't use Unicode escapes, so no code changes. But when it throws the switch, exceptions get thrown. It turns out that 3 of the 50 JavaScript files loaded during startup use old Unicode escapes. One of them seems to do something that might affect supplementary characters; for the other two apparently the developers just felt safer escaping all non-ASCII characters. The developers of the application don't actually know anything about the scripts - they got loaded indirectly by apps, ads, and analytics software used by the application. The developers try to find out whom they'll have to educate about the BRS to get this resolved.
OK - migrations are hard. But so far most participants have only seen additional work, no benefits. How long will this take? When will it end? When will browsers make BRS-on the default, let alone eliminate the switch? When can Roozbeh abandon his original version? Where's the blue button?
The thing to keep in mind is that most code doesn't need to know anything about supplementary characters. The beneficiaries of the switch are only the implementors of functions that do need to know, and even they won't really benefit until the switch is permanently on (at least for all their clients). It seems the switch puts a new burden on many that so far have been rightfully oblivious to supplementary characters.
Norbert
On Feb 19, 2012, at 0:33 , Brendan Eich wrote:
[snip]
Interesting scenarios, Norbert -- well-thought-through.
The final goal (for me, at least) is to be able to tell my developers to "Just write code" and forget about the details of how the characters in strings are encoded. Your point about the bidi library is an important one, but I think if we could somehow survey the web, we would find that the vast majority of applications do The Wrong Thing now and that flipping the BRS would magically fix a lot of them. I think any group that is "with it" w.r.t. Unicode in JS today will find a way to embrace BRS-on as long as there is a reasonable path to follow.
Some day, I hope developers will simply start all documents with something like <!DOCTYPE HTML UNICODE> and never worry about character encoding
details again. That is when we will start to see benefits, and these benefits will snowball as organizations start to do this.
Of course, to get there, we have to somehow manage the transition. I think your point about the static rejection of four-byte Unicode escapes is really important. During the transitional period, we need a way to write JS libraries that can run with BRS on or off.
If four-byte escapes are statically rejected in BRS-on, we have a problem -- we should be able to use old code that runs in either mode unchanged when said code only uses characters in the BMP.
Accepting both 4 and 6 byte escapes is a problem, though -- what is "\u123456".length? 1 or 3?
If we accept "\u1234" in BRS-on as a string with length 5 -- as we do today in ES5 with "\u123".length===4 -- we give developers a way to feature-test and conditionally execute code, allowing libraries to run with BRS-on and BRS-off.
It's awkward, though: there is no way to recover static strings programmatically since the \ has been eaten by the JS compiler. And users will want to programmatically convert arrays of strings (think gettext)
So, it seems that for a good migration path we somehow need to mark string literals so that the parser knows how to deal with them. And we need to do it in a way that "just works" in ES5 while preserving natural syntax with BRS-on.
Idea: can we add a per-script attribute which allows a transitional parsing scheme for string literals when BRS-on? This transitional scheme would parse string literals like BRS-off, unless the string literal had a leading U.
Having a per-script attribute lets module system developers deal with the problem easily when using DOM SCRIPT tag injection to load modules. It also allows users switching BRS-on to load old content from foreign sites, which I believe is necessary for widespread BRS-on adoption.
Sample program demonstrating how this might work:
<!DOCTYPE HTML UNICODE>
<html>
<script>
var i;
var a = [0];
a.push("\u1234");
</script>
<script parser="unicodeTransitional">
a.push("\u1234");
a.push(U"\u1234");
a.push(U"\u123456");
</script>
<script>
a.push("\u123456");
for (i=0; i < a.length; i++) {
  console.log(i + " -> " + a[i].length);
}
</script>
</html>
Output:
0 -> 5
1 -> 1
2 -> 5
3 -> 1
4 -> 1
I think this is a sustainable solution that gives developers just enough tools to retrofit without going off in lala-land by adding a bunch of extra types and helper methods.
Erratum:
var a = [0];
should read
var a = [];
Norbert Lindenberg wrote:
OK - migrations are hard. But so far most participants have only seen additional work, no benefits. How long will this take? When will it end? When will browsers make BRS-on the default, let alone eliminate the switch? When can Roozbeh abandon his original version? Where's the blue button?
It may be that the BRS is worse than an incompatible change to "full Unicode" as Allen proposed last year. But in either case, something gets harder for Roozbeh. Which is worse?
Wes Garland wrote:
If four-byte escapes are statically rejected in BRS-on, we have a problem -- we should be able to use old code that runs in either mode unchanged when said code only uses characters in the BMP.
We've been over this and I conceded to Allen that "four-byte escapes" (I'll use \uXXXX to be clear from now on) must work as today with BRS-on. Otherwise we make it hard to impossible to migrate code that knows what it is doing with 16-bit code units that round-trip properly.
Accepting both 4 and 6 byte escapes is a problem, though -- what is "\u123456".length? 1 or 3?
This is not a problem. We want .length to distribute across concatenation, so 3 is the only answer and in particular ("\u1234" + "\u5678").length === 2 irrespective of BRS.
If we accept "\u1234" in BRS-on as a string with length 5 -- as we do today in ES5 with "\u123".length===4 -- we give developers a way to feature-test and conditionally execute code, allowing libraries to run with BRS-on and BRS-off.
Feature-testing should be done using a more explicit test. API TBD, but I don't think breaking "\uXXXX" with BRS on is a good idea.
I agree with you that Roozbeh's module is hardly used, so it can take the hit of having to feature-test the BRS. The much more common case today is JS code that blithely ignores non-BMP characters that make it into strings as pairs, treating them blindly as two "characters" (ugh; must purge that "c-word" abusage from the spec).
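One possible shape for such a test, purely as illustration (the real API is TBD, so every detail here is hypothetical):

// Hypothetical feature test: does the engine parse the proposed \u{...} escape,
// and does a non-BMP literal occupy a single string element?
var fullUnicode = false;
try {
  fullUnicode = eval('"\\u{10000}".length === 1');
} catch (e) {
  // older parser: the new escape is a syntax error, so stay in 16-bit mode
}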
I posted a new strawman that describes what I think is the most minimal support that we must provide for "full unicode" in ES.next: strawman:full_unicode_source_code
I'm not suggesting that we must stop at this level of support, but I think not doing at least what is described in this proposal would be a mistake.
Thoughts?
I'm not in favour of big red switches, and I don't think the compartment based solution is going to be workable.
I'd like to plead for a solution rather like the one Java has, where strings are sequences of UTF-16 codes and there are specialized ways to iterate over them. Looking at this entry from the Unicode FAQ: unicode.org/faq/char_combmark.html#7 there are different ways to describe the length (and iteration) of a string. The BRS proposal favours #2, but I think for most applications utf-16-based-#1 is just fine, and for the applications that want to "do it right" #3 is almost always the correct solution. Solution #3 needs library support in any case and has no problems with UTF-16.
The central point here is that there are combining characters (accents) that you can't just normalize away. Getting them right has a lot of the same issues as surrogate pairs (you shouldn't normally chop them up, they count as one 'character', you can't tell how many of them there are in a string without looking, etc.). If you can handle combining characters then the surrogate pair support falls out pretty much for free.
Advantages of my proposal:
- High level of backwards compatibility
- No issues of where to place the BRS
- Compact and simple in the implementation
- Can be polyfilled on most VMs
- Interaction with the DOM is unproblematic
- No issues of what happens on concatenation if a surrogate pair is created.
Details:
- The built-in string charCodeAt, [], length operations work in terms of UTF-16
- String.fromCharCode(x) can return a string with a length of 2
- New object StringIterator
new StringIterator(backing) returns a string iterator. The iterator has the following methods:
hasNext();            // Returns this.index() != this.backing().length
nextGrapheme();       // Returns the next grapheme as a Unicode code point, or -1 if the next grapheme is a sequence of code points
nextGraphemeArray();  // Returns an array of numeric code points (possibly just one) representing the next grapheme
nextCodePoint();      // Returns the next code point, possibly consuming a surrogate pair (two code units)
index();              // Gets the current index in the string, from 0 to length
setIndex();           // Sets the current index in the string, from 0 to length
backing();            // Get the backing string
// Optionally hasPrevious(); previous*(); // Analogous to nextGrapheme etc. codePointLength(); // Takes O(length), cache the answer if you care graphemeLength(); // Ditto
If any of the next.. functions encounter an unmatched half of a surrogate pair they just return its number.
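A rough polyfill-style sketch of one of the proposed methods, nextCodePoint, on top of today's UTF-16 strings (the method names follow the outline above; the body is my illustration, not part of the proposal):

function StringIterator(backing) {
  this._backing = backing;
  this._index = 0;
}
StringIterator.prototype.backing = function () { return this._backing; };
StringIterator.prototype.index = function () { return this._index; };
StringIterator.prototype.setIndex = function (i) { this._index = i; };
StringIterator.prototype.hasNext = function () {
  return this._index != this._backing.length;
};
StringIterator.prototype.nextCodePoint = function () {
  var s = this._backing;
  var hi = s.charCodeAt(this._index++);
  if (hi >= 0xD800 && hi <= 0xDBFF && this._index < s.length) {
    var lo = s.charCodeAt(this._index);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      this._index++;
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi; // unmatched surrogate half: just return its number, as described above
};

nextGrapheme and nextGraphemeArray would additionally need the Unicode grapheme cluster boundary data, which is the part that genuinely requires library support.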
Regexp support. Regexps act 'as if' the following steps were performed.
Outside character classes an extended character turns into (?:xy) where x and y are the surrogate pairs. Inside positive character classes the extended characters are extracted so [abz] becomes (?:[ab]|xy) where z is an extended character and x and y are the surrogate pairs. Negative character classes can be handled by transforming into negative lookaheads. A decent set of unicode character classes will likely subsume most uses of these transformations.
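Worked out by hand for one supplementary character (illustrative; I chose U+10400, whose surrogate pair is \uD801 \uDC00):

// z = U+10400; its UTF-16 encoding is x = \uD801, y = \uDC00.
// Outside a character class, /z/ behaves as if rewritten to:
var re1 = /(?:\uD801\uDC00)/;
// Inside a positive class, /[abz]/ behaves as if rewritten to:
var re2 = /(?:[ab]|\uD801\uDC00)/;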
Perhaps the BRS 21 bit solution feels marginally cleaner, but having two different kinds of strings in the same VM feels like a horrible solution that is user visible and will haunt implementations forever, and the cleanliness difference is very marginal given that grapheme based iteration is the correct solution for almost all the cases where iterating over utf-16 codes is not good enough.
2012/2/22 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:
I'll reply to Brendan's proposal in two parts: first about the goals for supplementary character support, second about the BRS.
Full 21-bit Unicode support means all of:
- indexing by characters, not uint16 storage units;
- counting length as one greater than the last index; and
- supporting escapes with (up to) six hexadecimal digits.
For me, full 21-bit Unicode support has a different priority list.
First come the essentials: Regular expressions; functions that interpret strings; the overall sense that all Unicode characters are supported.
- Regular expressions must recognize supplementary characters as atomic entities, and interpret them according to Unicode semantics.
Look at the contortions one has to go through currently to describe a simple character class that includes supplementary characters: roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js
Read up on why it has to be done this way, and see to what extremes some people are going to make supplementary characters work despite ECMAScript: inimino.org/~inimino/blog/javascript_cset
Now, try to figure out how you'd convert a user-entered string to a regular expression such that you can search for the string without case distinction, where the string may contain supplementary characters such as "𐐶𐐲𐑌" (Deseret for "one").
Regular expressions matter a lot here because, if done properly, they eliminate much of the need for iterating over strings manually.
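A small illustration of the contortion (my example, not taken from the linked code): a character class covering the Deseret block U+10400–U+1044F has to be spelled out in terms of its surrogate-pair structure today, whereas code-point-based regular expressions would let it be written directly.

// Today: match one Deseret character by spelling out the surrogate pair structure.
var deseretToday = /\uD801[\uDC00-\uDC4F]/;
// Hoped for: a plain code-point range (syntax illustrative only, not yet valid):
// var deseretFull = /[\u{10400}-\u{1044F}]/;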
Built-in functions that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics. The list of functions in ES5 that violate this principle is actually rather short: Besides the String functions relying on regular expressions (match, replace, search, split), they're the String case conversion functions (toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the relational comparison for strings (11.8.5). But the principle is also important for new functionality being considered for ES6 and above.
It must be clear that the full Unicode character set is allowed and supported. This means at least getting rid of the reference to UCS-2 (clause 2) and the bizarre equivalence between characters and UTF-16 code units (clause 6). ECMAScript has already defined several ways to create UTF-16 strings containing supplementary characters (parsing UTF-8 source; using Unicode escapes for surrogate pairs), and lets applications freely pass around such strings. Browsers have surrounded ECMAScript implementations with text input, text rendering, DOM APIs, and XMLHTTPRequest with full Unicode support, and generally use full UTF-16 to exchange text with their ECMAScript subsystem. Developers have used this to build applications that support supplementary characters, hacking around the remaining gaps in ECMAScript as seen above. But, as in the bug report that Brendan pointed to this morning (code.google.com/p/v8/issues/detail?id=761), the mention of UCS-2 is still used by some to excuse bugs.
I agree that these are the priorities and should be done, including reopening and fixing the V8 bug.
Only after these essentials come the niceties of String representation and Unicode escapes:
- 1 String element to 1 Unicode code point is indeed a very nice and desirable relationship. Unlike Java, where binary compatibility between virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be compatible only at the source code level - or maybe, with a BRS, not even that.
I don't think this is important enough to justify incompatibility/implementation pain.
Agree with your points 5 and 6. One extra point of my own:
- I think we should prefer transparency in cases where there is doubt. This means passing data through with no errors or changes. It means allowing half surrogate pairs, combining characters that have nothing to combine with and characters that are not currently assigned in Unicode. In an ideal world, it's hard to see why these happen, but in the cases where they happen the most helpful thing to do is almost always to allow/ignore them.
Here are two hypothetical examples:
We get data from a source that chops up a UTF-16 text into chunks and sends them separately for transmission. This will result in unmatched halves of surrogate pairs, but as long as our application transmits the data unchanged, no harm results after they are recombined later.
Take an XML format where all the tags are ASCII, but there is body text that contains floating point numbers encoded as 16 bit values, including malformed surrogate pairs. This is pretty sick, but who are we to judge? We want to treat this as a string because we can use string operations on the XML tags, but it would be extremely unhelpful to replace the malformed data with invalid unicode code points.
Others have mentioned examples of abusing the string type for other encodings.
All these examples arguably involve some poor architectural decision somewhere along the line that should ideally be fixed upstream, but I don't think this justifies making JS into the angel of Unicode-correct vengeance, striking down bad encodings in the name of peace and justice.
I'm old enough (just) to have worked on an OS/360 system where you had to declare text files and you had to specify a max line length. This quickly got very annoying, because if you made them text files you sooner or later ended up with truncated lines, and if you didn't make them text files there was a lot of system functionality that did not work. The relief on getting back on Unix, where the OS just treated a file as a bag of bytes, was great. I hope the analogy is clear.
Comments:
- In terms of the prioritization I suggested a few days ago esdiscuss/2012-February/020721 it seems you're considering item 6 essential, item 1 a side effect (whose consequences are not mentioned - see below), items 2-5 nice to have. Do I understand that correctly? What is this prioritization based on?
- The description of the current situation seems incorrect. The strawman says: "As currently specified by ES5.1, supplementary characters cannot be used in the source code of ECMAScript programs." I don't see anything in the spec saying this. To the contrary, the following statement in clause 6 of the spec opens the door to supplementary characters: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16." Actual source text outside of an ECMAScript runtime is rarely stored in streams of 16-bit code units; it's normally stored and transmitted in UTF-8 (including its subset ASCII) or some other single-byte or multi-byte character encoding. Interpreting source text therefore almost always requires conversion to UTF-16 as a first step. UTF-8 and several other encodings (GB18030, Big5-HKSCS, EUC-TW) can represent supplementary characters, and correct conversion to UTF-16 will convert them to surrogate pairs.
When I mentioned this before, you said that the intent of the ES5 wording was to keep ECMAScript limited to the BMP (the "UCS-2 world"). esdiscuss/2011-May/014337, esdiscuss/2011-May/014342 However, I don't see that intent reflected in the actual text of clause 6.
I have since also tested with supplementary characters in UTF-8 source text on a variety of current browsers (Safari / (Mac, iOS), (Firefox, Chrome, Opera) / (Mac, Windows), Explorer / Windows), and they all handle the conversion from UTF-8 to UTF-16 correctly. Do you know of one that doesn't? The only ECMAScript implementation I encountered that fails here is Node.js.
In addition to plain text encoding in UTF-8, supplementary characters can also be represented in source code as a sequence of two Unicode escapes. It's not as convenient, but it works in all implementations I've tested, including Node.js.
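For example (illustrative; 𐐶 is U+10436, the first character of the Deseret word above), both spellings below yield the same two-code-unit string in the implementations described:

var s1 = "𐐶";              // supplementary character written directly (e.g. in UTF-8 source)
var s2 = "\uD801\uDC36";    // the same character as a sequence of two Unicode escapes
s1 === s2;                  // true
s1.length;                  // 2: a surrogate pair in the resulting UTF-16 string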
- Changing the source code to be just a stream of Unicode characters seems a good idea overall. However, just changing the definition of SourceCharacter is going to break things. SourceCharacter isn't only used for source syntax and JSON syntax, where the change seems benign; it's also used to define the content of String values and the interpretation of regular expression patterns:
- Subclause 7.8.4 contains the statements "The SV of DoubleStringCharacters :: DoubleStringCharacter is a sequence of one character, the CV of DoubleStringCharacter." and "The CV of DoubleStringCharacter :: SourceCharacter but not one of " or \ or LineTerminator is the SourceCharacter character itself." If SourceCharacter becomes a Unicode character, then this means coercing a 21-bit code point into a single 16-bit code unit, and that's not going to end well.
- Subclauses 15.10.1 and 15.10.2 use SourceCharacter to define PatternCharacter, IdentityEscape, RegularExpressionNonTerminator, ClassAtomNoDash. While this could potentially be part of a set of changes to make regular expression correctly support full Unicode, by itself it means that 21-bit code points will be coerced into or compared against 16-bit code units. Changing regular expressions to be code-point based has some compatibility risk which we need to carefully evaluate.
- The statement about UnicodeEscapeSequence: "This production is limited to only expressing 16-bit code point values." is incorrect. Unicode escape sequences express 16-bit code units, not code points (remember that any use of the word "character" without the prefix "Unicode" in the spec after clause 6 means "16-bit code unit"). A supplementary character can be represented in source code as a sequence of two Unicode escapes. The proposed new Unicode escape syntax is more convenient and more legible, but doesn't provide new functionality.
- I don't understand the sentence "For that reason, it is impossible to know for sure whether pairs of existing 16-bit Unicode escapes are intended to represent a single logical character or an explicit two character UTF-16 encoding of a Unicode characters." - what do you mean by "an explicit two character UTF-16 encoding of a Unicode characters"? In any case, it seems pretty clear to me that a Unicode escape for a high surrogate value followed by a Unicode escape for a low surrogate value, with the spec based on 16-bit values, means a surrogate pair representing a supplementary character. Even if the system were then changed to be 32-bit based, it's hard to imagine that the intent was to create a sequence of two invalid code points.
Norbert
2012/3/1 Glenn Adams <glenn at skynav.com>:
2012/3/1 Erik Corry <erik.corry at gmail.com>
I'm not in favour of big red switches, and I don't think the compartment based solution is going to be workable.
I'd like to plead for a solution rather like the one Java has, where strings are sequences of UTF-16 codes and there are specialized ways to iterate over them. Looking at this entry from the Unicode FAQ: unicode.org/faq/char_combmark.html#7 there are different ways to describe the length (and iteration) of a string. The BRS proposal favours #2, but I think for most applications utf-16-based-#1 is just fine, and for the applications that want to "do it right" #3 is almost always the correct solution. Solution #3 needs library support in any case and has no problems with UTF-16.
The central point here is that there are combining characters (accents) that you can't just normalize away. Getting them right has a lot of the same issues as surrogate pairs (you shouldn't normally chop them up, they count as one 'character', you can't tell how many of them there are in a string without looking, etc.). If you can handle combining characters then the surrogate pair support falls out pretty much for free.
The problem here is that you are mixing apples and oranges. Although it may appear that surrogate pairs and grapheme clusters have features in common, they operate at different semantic levels entirely. A solution that attempts to conflate these two levels is going to cause problems at both levels. A distinction should be maintained between the following levels:
(1) encoding units (e.g., UTF-16 coding units) (2) unicode scalar values (code points) (3) grapheme clusters
This distinction is not lost on me. I propose that random access indexing and .length in JS should work on level 1, and there should be library support for levels 2 and 3. In order of descending usefulness I think the order is 1, 3, 2. Therefore I don't want to cause a lot of backwards compatibility headaches by prioritizing the efficient handling of level 2.
On Fri, Mar 2, 2012 at 12:58 AM, Erik Corry <erik.corry at gmail.com> wrote:
2012/3/1 Glenn Adams <glenn at skynav.com>:
I'd like to plead for a solution rather like the one Java has, where strings are sequences of UTF-16 codes and there are specialized ways to iterate over them. Looking at this entry from the Unicode FAQ: unicode.org/faq/char_combmark.html#7 there are different ways to describe the length (and iteration) of a string. The BRS proposal favours #2, but I think for most applications utf-16-based-#1 is just fine, and for the applications that want to "do it right" #3 is almost always the correct solution. Solution #3 needs library support in any case and has no problems with UTF-16.
The central point here is that there are combining characters (accents) that you can't just normalize away. Getting them right has a lot of the same issues as surrogate pairs (you shouldn't normally chop them up, they count as one 'character', you can't tell how many of them there are in a string without looking, etc.). If you can handle combining characters then the surrogate pair support falls out pretty much for free.
The problem here is that you are mixing apples and oranges. Although it may appear that surrogate pairs and grapheme clusters have features in common, they operate at different semantic levels entirely. A solution that attempts to conflate these two levels is going to cause problems at both levels. A distinction should be maintained between the following levels:
(1) encoding units (e.g., UTF-16 coding units) (2) unicode scalar values (code points) (3) grapheme clusters
This distinction is not lost on me. I propose that random access indexing and .length in JS should work on level 1,
that's where we are today: indexing and length based on 16-bit code units (of a UTF-16 encoding, likewise with Java)
and there should be library support for levels 2 and 3. In order of descending usefulness I think the order is 1, 3, 2. Therefore I don't want to cause a lot of backwards compatibility headaches by prioritizing the efficient handling of level 2.
from a perspective of indexing "Unicode characters", level 2 is the correct place;
level 3 is useful for higher level, language/locale sensitive text processing, but not particularly interesting at the basic ES string processing level; we aren't talking about (or IMO should not be talking about) a level 3 text processing library in this thread;
2012/3/2 Glenn Adams <glenn at skynav.com>:
On Fri, Mar 2, 2012 at 12:58 AM, Erik Corry <erik.corry at gmail.com> wrote:
2012/3/1 Glenn Adams <glenn at skynav.com>:
I'd like to plead for a solution rather like the one Java has, where strings are sequences of UTF-16 codes and there are specialized ways to iterate over them. Looking at this entry from the Unicode FAQ: unicode.org/faq/char_combmark.html#7 there are different ways to describe the length (and iteration) of a string. The BRS proposal favours #2, but I think for most applications utf-16-based-#1 is just fine, and for the applications that want to "do it right" #3 is almost always the correct solution. Solution #3 needs library support in any case and has no problems with UTF-16.
The central point here is that there are combining characters (accents) that you can't just normalize away. Getting them right has a lot of the same issues as surrogate pairs (you shouldn't normally chop them up, they count as one 'character', you can't tell how many of them there are in a string without looking, etc.). If you can handle combining characters then the surrogate pair support falls out pretty much for free.
The problem here is that you are mixing apples and oranges. Although it may appear that surrogate pairs and grapheme clusters have features in common, they operate at different semantic levels entirely. A solution that attempts to conflate these two levels is going to cause problems at both levels. A distinction should be maintained between the following levels:
(1) encoding units (e.g., UTF-16 coding units) (2) unicode scalar values (code points) (3) grapheme clusters
This distinction is not lost on me. I propose that random access indexing and .length in JS should work on level 1,
that's where we are today: indexing and length based on 16-bit code units (of a UTF-16 encoding, likewise with Java)
Not really for JS. Missing parts in the current UTF-16 support have been listed in this thread, eg in Norbert Lindenberg's 6 point prioritization list, which I replied to yesterday.
and there should be library support for levels 2 and 3. In order of descending usefulness I think the order is 1, 3, 2. Therefore I don't want to cause a lot of backwards compatibility headaches by prioritizing the efficient handling of level 2.
from a perspective of indexing "Unicode characters", level 2 is the correct place;
Yes, by definition.
level 3 is useful for higher level, language/locale sensitive text
No, the Unicode grapheme clustering algorithm is not locale or language sensitive unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
processing, but not particularly interesting at the basic ES string processing level; we aren't talking about (or IMO should not be talking about) a level 3 text processing library in this thread;
I will continue to feel free to talk about it as I believe that in the cases where just indexing by UTF-16 words is not sufficient it is normally level 3 that is the correct level. Also, I think there should be support for this level in JS as it is not locale-dependent.
On Mar 1, 2012, at 11:09 PM, Norbert Lindenberg wrote:
Comments:
- In terms of the prioritization I suggested a few days ago esdiscuss/2012-February/020721 it seems you're considering item 6 essential, item 1 a side effect (whose consequences are not mentioned - see below), items 2-5 nice to have. Do I understand that correctly? What is this prioritization based on?
The main intent of this proposal was to push forward with including \u{ } in ES6, regardless of any other ongoing full-Unicode-related discussions we are having. Hopefully we can achieve more than that, but if we don't, the inclusion of \u{ } now should make it easier the next time we attack that problem by reducing the use of \uxxxx\uxxxx pairs, which are ambiguous in intent. My expectation is that we would tell the world that \u{ } is the new \uxxxx\uxxxx and that they should avoid using the latter form to inject supplementary characters into strings (and RegExp).
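For illustration (using U+1F638 as an example supplementary character), the messaging would be:

var clear = "\u{1F638}";       // proposed form: unambiguously one supplementary character (not yet valid syntax)
var unclear = "\uD83D\uDE38";  // discouraged form: one character, or deliberately two code units? Intent is ambiguous.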
However, that usage depends upon the fact that today's implementations do generally allow supplementary characters to exist in the ECMAScript source code and that they do something rational with them. ES5 botched this by saying that supplementary characters can't exist in ECMAScript source code, so we also need to fix that.
- The description of the current situation seems incorrect. The strawman says: "As currently specified by ES5.1, supplementary characters cannot be used in the source code of ECMAScript programs." I don't see anything in the spec saying this. To the contrary, the following statement in clause 6 of the spec opens the door to supplementary characters: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16." Actual source text outside of an ECMAScript runtime is rarely stored in streams of 16-bit code units; it's normally stored and transmitted in UTF-8 (including its subset ASCII) or some other single-byte or multi-byte character encoding. Interpreting source text therefore almost always requires conversion to UTF-16 as a first step. UTF-8 and several other encodings (GB18030, Big5-HKSCS, EUC-TW) can represent supplementary characters, and correct conversion to UTF-16 will convert them to surrogate pairs.
When I mentioned this before, you said that the intent of the ES5 wording was to keep ECMAScript limited to the BMP (the "UCS-2 world"). esdiscuss/2011-May/014337, esdiscuss/2011-May/014342 However, I don't see that intent reflected in the actual text of clause 6.
I have since also tested with supplementary characters in UTF-8 source text on a variety of current browsers (Safari / (Mac, iOS), (Firefox, Chrome, Opera) / (Mac, Windows), Explorer / Windows), and they all handle the conversion from UTF-8 to UTF-16 correctly. Do you know of one that doesn't? The only ECMAScript implementation I encountered that fails here is Node.js.
code.google.com/p/v8/issues/detail?id=761 suggests that V8 truncates supplementary characters rather than converting them to surrogate pairs. However, it is unclear whether that is referring to literal strings in the source code or only computationally generated strings.
In addition to plain text encoding in UTF-8, supplementary characters can also be represented in source code as a sequence of two Unicode escapes. It's not as convenient, but it works in all implementations I've tested, including Node.js.
the main problem is SourceCharacter :: any Unicode code unit
and "...the phrase 'code unit' and the word 'character' will be used to refer to a 16-bit unsigned value..."
All of the lexical rules in clause 7 are defined in terms of "characters" (i.e., code units). So, for example, a supplementary character in category Lo occurring in an Identifier context would, at best, be seen as a pair of code units, neither of which is in a category that is valid for IdentifierPart, so the identifier would be invalid. Similarly, a pair of \uXXXX escapes representing such a character would also be lex'ed as two distinct characters and result in an invalid identifier.
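As an illustration (the character is my choice, not from the thread): U+20000 is a CJK ideograph in category Lo, so on a code-point reading it should be usable in an identifier, but under the code-unit-based lexing described above it is seen as the two code units \uD840 and \uDC00, neither of which is a valid identifier character.

var 𠀀 = 1;             // U+20000 written directly in the source
var \uD840\uDC00 = 1;   // the same character as a pair of \uXXXX escapes
// Under ES5's code-unit lexing, both declarations are invalid identifiers.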
Regarding the intent of the current wording, I was speaking of my intent when I was actually editing that text for the ES5 spec. My understanding at the time was that the lexical alphabet of ECMAScript was 16-bit code units, and I was trying to clarify that, but I think I botched it. In reality, I think that understanding is actually still correct, in that there is nothing in the lexical grammar, as I noted in the previous paragraph, that deals with anything other than 16-bit code units. Any conversion from non-16-bit character encodings is something that logically happens prior to processing as "ECMAScript source code".
- Changing the source code to be just a stream of Unicode characters seems a good idea overall. However, just changing the definition of SourceCharacter is going to break things. SourceCharacter isn't only used for source syntax and JSON syntax, where the change seems benign; it's also used to define the content of String values and the interpretation of regular expression patterns:
- Subclause 7.8.4 contains the statements "The SV of DoubleStringCharacters :: DoubleStringCharacter is a sequence of one character, the CV of DoubleStringCharacter." and "The CV of DoubleStringCharacter :: SourceCharacter but not one of " or \ or LineTerminator is the SourceCharacter character itself." If SourceCharacter becomes a Unicode character, then this means coercing a 21-bit code point into a single 16-bit code unit, and that's not going to end well.
- Subclauses 15.10.1 and 15.10.2 use SourceCharacter to define PatternCharacter, IdentityEscape, RegularExpressionNonTerminator, ClassAtomNoDash. While this could potentially be part of a set of changes to make regular expression correctly support full Unicode, by itself it means that 21-bit code points will be coerced into or compared against 16-bit code units. Changing regular expressions to be code-point based has some compatibility risk which we need to carefully evaluate.
Yes, but it isn't clear that it will change anything. We've just discussed that, in practice, JS implementations accept supplementary characters in string and RegExp literals. This proposal is saying that however implementations treat such characters, they must treat \u{} characters in the same way.
The interesting thing about JSON and eval is that they take their input from actual JS strings rather than some abstract input source. The SourceCharacters they currently process correspond to single 16-bit string elements. Changing the grammar would change that correspondence unless we also change the semantics of string element values. This proposal leaves that issue for independent consideration.
- The statement about UnicodeEscapeSequence: "This production is limited to only expressing 16-bit code point values." is incorrect. Unicode escape sequences express 16-bit code units, not code points (remember that any use of the word "character" without the prefix "Unicode" in the spec after clause 6 means "16-bit code unit"). A supplementary character can be represented in source code as a sequence of two Unicode escapes. The proposed new Unicode escape syntax is more convenient and more legible, but doesn't provide new functionality.
As I said above, any such surrogate pairs aren't recognized by the grammar as Unicode characters. What I meant by the quoted phrase is something like "This production is limited to only expressing values in the 16-bit subset of code point values".
- I don't understand the sentence "For that reason, it is impossible to know for sure whether pairs of existing 16-bit Unicode escapes are intended to represent a single logical character or an explicit two character UTF-16 encoding of a Unicode characters." - what do you mean by "an explicit two character UTF-16 encoding of a Unicode characters"? In any case, it seems pretty clear to me that a Unicode escape for a high surrogate value followed by a Unicode escape for a low surrogate value, with the spec based on 16-bit values, means a surrogate pair representing a supplementary character. Even if the system were then changed to be 32-bit based, it's hard to imagine that the intent was to create a sequence of two invalid code points.
We don't know if the intent is to explicitly construct a UTF-16 encoded string that is to be passed to a consumer that demands UTF-16 encoding, or if the intent is simply to logically express a specific supplementary character in a context where the internal encoding isn't known or relevant. ES5 doesn't have a way to distinguish those two use cases.
On Fri, Mar 2, 2012 at 2:13 AM, Erik Corry <erik.corry at gmail.com> wrote:
level 3 is useful for higher level, language/locale sensitive text
No, the Unicode grapheme clustering algorithm is not locale or language sensitive unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
one final comment: the Unicode algorithm is intended to define default behavior only:
"This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments."
it specifically states that implementations should provide language/locale sensitive behavior;
in actual text processing usage, one needs the language/locale sensitive behavior in most cases, not a default behavior
G.
Glenn Adams wrote:
On Fri, Mar 2, 2012 at 2:13 AM, Erik Corry <erik.corry at gmail.com <mailto:erik.corry at gmail.com>> wrote:
> level 3 is useful for higher level, language/locale sensitive text
No, the Unicode grapheme clustering algorithm is not locale or language sensitive http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
one final comment: the Unicode algorithm is intended to define default behavior only:
"This specification defines /default/ mechanisms; more sophisticated implementations can /and should/ tailor them for particular locales or environments."
it specifically states that implementations should provide language/locale sensitive behavior;
in actual text processing usage, one needs the language/locale sensitive behavior in most cases, not a default behavior
Right, and Gecko, WebKit, and other web rendering engines obviously need to care about this. But they're invariably implemented mostly or wholly not in JS.
It's a bit ambitious for JS to have such facilities.
I agree with Erik that the day may come, but ES6 is being prototyped and spec'ed and we need to be done in 2012 to have the spec ready for 2013. We should not overreach.
Once more unto the breach, dear friends!
ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say ;-).
Clearly that was a while ago. These days, we would like full 21-bit Unicode character support in JS. Some (mranney at Voxer) contend that it is a requirement.
Full 21-bit Unicode support means all of:
- indexing by characters, not uint16 storage units;
- counting length as one greater than the last index; and
- supporting escapes with (up to) six hexadecimal digits.
ES4 saw bold proposals including Lars Hansen's, to allow implementations to change string indexing and length incompatibly, and let Darwin sort it out. I recall that was when we agreed to support "\u{XXXXXX}" as an extension for spelling non-BMP characters.
Allen's strawman from last year, strawman:support_full_unicode_in_strings, proposed a brute-force change to support full Unicode (albeit with too many hex digits allowed in "\u{...}"), observing that "There are very few places where the ECMAScript specification has actual dependencies upon the size of individual characters so the compatibility impact of supporting full Unicode is quite small." But two problems remained:
P1. As Allen wrote, "There is a larger impact on actual implementations", and no implementors that I can recall were satisfied that the cost was acceptable. It might be, we just didn't know, and there are enough signs of high cost to create this concern.
P2. The change is not backward compatible. In JS today, one can read a string s from somewhere and hard-code, e.g., s.indexOf("\ud800") to find part of a surrogate pair, then advance to the next-indexed uint16 unit and read the other half, then combine to compute some result. Such usage would break.
Example from Allen:
var c = "😸" // where the single character between the quotes is the Unicode character U+1F638
c.length == 2;
c === "\ud83d\ude38"; // the two-character UTF-16 encoding of 0x1f638
c.charCodeAt(0) == 0xd83d;
c.charCodeAt(1) == 0xde38;
(Allen points out how browsers, node.js, and other environments blindly handle UTF-8 or whatever incoming format recoding to UTF-16 upstream of the JS engine, so the above actually works without any spec-language in ECMA-262 saying it should.)
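A sketch of the kind of hard-coded surrogate handling described in P2 (my illustration of the pattern; such code assumes today's 16-bit indexing and would break under an incompatible change to indexing and length):

// Combines a high/low surrogate pair at index i into a code point.
function codePointAt16(s, i) {
  var hi = s.charCodeAt(i);
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
    var lo = s.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi;
}
codePointAt16("\uD83D\uDE38", 0) === 0x1F638; // true today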
So based on a recent twitter/github exchange, gist recorded at gist.github.com/1850768, I would like to propose a variation on Allen's proposal that resolves both of these problems. Here are resolutions in reverse order:
R2. No incompatible change without opt-in. If you hardcode as in Allen's example, don't opt in without changing your index, length, and char/code-at assumptions.
Such opt-in cannot be a pragma since those have lexical scope and affect code, not the heap where strings and String.prototype methods live.
We also wish to avoid exposing a "full Unicode" representation type and duplicated suite of the String static and prototype methods, as Java did. (We may well want UTF-N transcoding helpers; we certainly want ByteArray <-> UTF-8 transcoding APIs.)
True, R2 implies there are two string primitive representations at most, or more likely "1.x" for some fraction .x. Say, a flag bit in the string header to distinguish JS's uint16-based indexing ("UCS-2") from non-O(1)-indexing UTF-16. Lots of non-observable implementation options here.
Instead of any such big new observables, I propose a so-called "Big Red [opt-in] Switch" (BRS) on the side of a unit of VM isolation: specifically the global object.
Why the global object? Because for many VMs, each global has its own heap or sub-heap ("compartment"), and all references outside that heap are to local proxies that copy from, or in the case of immutable data, reference the remote heap. Also because inter-compartment traffic is (we conjecture) infrequent enough to tolerate the proxy/copy overhead.
For strings and String objects, such proxies would consult the remote heap's BRS setting and transcode indexed access, and .length gets, accordingly. It doesn't matter if the BRS is in the global or its String constructor or String.prototype, as the latter are unforgeably linked to the global.
This means a script intent on comparing strings from two globals with different BRS settings could indeed tell that one discloses non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This is the small new observable I claim we can live with, because someone opted into it at least in one of the related global objects.
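A hypothetical before/after of the new observable, for a string holding the single supplementary character U+1F638 written directly in the source:

var s = "😸";   // one supplementary character, U+1F638
// BRS off (today's behavior):
//   s.length === 2
//   s.charCodeAt(0) === 0xD83D
// BRS on in the owning global (the proposal, sketched):
//   s.length === 1
//   s.charCodeAt(0) === 0x1F638   -- a value >= 0x10000 becomes observable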
Note that implementations such as Node.js can pre-set the BRS to "full Unicode" at startup. Embeddings that fully isolate each global and its reachable objects and strings pay no string-proxy or -copy overhead.
R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls from JS to (typically) C++ would have to proxy or copy any strings containing non-BMP characters. Strings with only BMP characters would work as today.
Note that we are dealing only in spec observables here. It doesn't matter whether the JS engine uses UTF-8 and the DOM UCS-2 (in which case there is already a transcoding penalty; IIRC WebKit libxml and libxslt use UTF-8 and so must transcode to interface with WebKit's DOM). The only issue at this boundary, I believe, is how indexing and .length work.
Ok, there you have it: resolutions for both problems that killed the last assault on Castle '90s-JS.
Implementations that use uint16 vectors as the character data representation type for both "UCS-2" and "UTF-16" string variants would probably want another flag bit per string header indicating whether, for the UTF-16 case, the string indeed contained any non-BMP characters. If not, no proxy/copy needed.
Such implementations probably would benefit from string (primitive value) proxies not just copies, since the underlying uint16 vector could be shared by two different string headers with whatever metadata flag bits, etc., are needed to disclose different length values, access different methods from distinct globals' String.prototype objects, etc.
We could certainly also work with the W3C to revise the DOM to check the BRS setting, if that is possible, to avoid this non-BMP-string proxy/copy overhead.
How is the BRS configured? Again, not via a pragma, and not by imperative state update inside the language (mutating hidden BRS state at a given program point could leave strings created before mutation observably different from those created after, unless the implementation in effect scanned the local heap and wrapped or copied any non-BMP-char-bearing ones created before).
The obvious way to express the BRS in HTML is a <meta> tag in document <head>, but I don't want to get hung up on this point. I do welcome expert guidance. Here is another W3C/WHATWG interaction point. For this reason I'm cc'ing public-script-coord.
The upshot of this proposal is to get JS out of the '90s without a mandatory breaking change. With simple-enough opt-in expressed at coarse-enough boundaries so as not to impose high cost or unintended string type confusion bugs, the complexity is mostly borne by implementors, and at less than a 2x cost comparing string implementations (I think -- demonstration required of course).
In particular, Node.js can get modern at startup, and perhaps engines such as V8 as used in Node could even support compile-time (#ifdef) configury by which to support only full Unicode.
Comments welcome.