UTF-16 Strings not-strawman

# Shawn Steele (14 years ago)

I don’t have time to make a real strawman, but what would people need if we went the UTF-16 route (instead of full-Unicode)? (This thread is to collect requirements, which are somewhat getting lost in the merits of UTF-16 vs 32 bit thread). Basically, just replace UCS-2 with UTF-16, allowing irregular UTF-16 for compatibility.

Things that come to mind immediately are:

· Some sort of convenience notation for string literals and regular expressions.

· Extend String.fromCharCode() to allow generating UTF-16 pairs for values 0x10000-0x10FFFF.

· Something to allow values 0x10000-0x10FFFF from String.prototype.charCodeAt. I assume it’d have to be a new function.

· Make encodeURIComponent and decodeURIComponent use UTF-8 instead of CESU-8. (The current behavior actually breaks the specifications because CESU-8 as generated != UTF-8 as defined, but I’m not sure the bug can be fixed.) So either fix the bug (probably too breaking?) or make at least a new “correctlyEncodeURIcomponent”. (I don’t think decoding is breaking).
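
For the second and third items, the surrogate-pair arithmetic involved could look roughly like the sketch below. The names `toSurrogatePair` and `codePointFrom` are hypothetical and exist only for this illustration; nothing in the thread defines them, and this is not a proposed API.

```javascript
// Hypothetical helpers, assumed for illustration only.

// Build the two-code-unit string for a supplementary code point.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in 0x10000-0x10FFFF");
  }
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return String.fromCharCode(high, low);
}

// Read a full code point starting at index, combining a valid pair if present.
function codePointFrom(str, index) {
  var first = str.charCodeAt(index);
  if (first >= 0xD800 && first <= 0xDBFF && index + 1 < str.length) {
    var second = str.charCodeAt(index + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
    }
  }
  return first; // BMP code unit, or an unpaired surrogate returned as-is
}
```

Returning an unpaired surrogate unchanged is one way to tolerate the irregular UTF-16 mentioned above; a stricter variant could throw instead.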

Things I’m less certain about:

· There is apparently some desire to walk a string by +=1 or +=2 depending on if it’s a surrogate pair or not. I’m not sure it’s worth formalizing, as, to me, it’s more interesting to walk it by graphemes or other more appropriate text elements. And most applications don’t seem to care much about whether they break strings.

· A strict mode that disallows the irregular UTF-16?

Shawn

  blogs.msdn.com/shawnste


# Mike Samuel (14 years ago)

2011/5/19 Shawn Steele <Shawn.Steele at microsoft.com>:
> I don’t have time to make a real strawman, but what would people need if we
> went the UTF-16 route (instead of full-Unicode)?  (This thread is to collect
> requirements, which are somewhat getting lost in the merits of UTF-16 vs 32
> bit thread).  Basically, just replace UCS-2 with UTF-16, allowing irregular
> UTF-16 for compatibility.
>
>
>
> Things that come to mind immediately are:
>
> ·         Some sort of convenience notation for string literals and regular
> expressions.

> ·         Extend string.fromCharCode() to allow generating UTF-16 pairs for
> values 10000-10ffff.

+1

> ·         Something to allow values 10000-10ffff from string.charCodeAt.  I
> assume it’d have to be new function.

+1 for new function


> ·         Make encodeURIcomponent and decodeURIcomponent use UTF-8 instead
> of CESU-8.  (The current behavior actually breaks the specifications because
> CESU-8 as generated != UTF-8 as defined, but I’m not sure the bug can be
> fixed.)  So either fix the bug (probably too breaking?) or make at least a
> new “correctlyEncodeURIcomponent”.  (I don’t think decoding is breaking).

+1


> Things I’m less certain about:
>
> ·         There is apparently some desire to walk a string by +=1 or +=2
> depending on if it’s a surrogate pair or not.  I’m not sure it’s worth
> formalizing, as, to me, it’s more interesting to walk it by graphemes or
> other more appropriate text elements.  And most applications don’t seem to
> care much about whether they break strings.

Such a thing would make it marginally easier to write escaping
functions for a few languages:

    "\ud800\udc00" -> "&#x10000;"

but instead of putting it in A UTF-16 strawman, we could just keep it
in mind as a criterion for judging any string related stuff in the
loop/iterators/enumeration strawmen.
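
A minimal sketch of such an escaping function, written against today's code-unit API, shows the +=1 / +=2 walk being discussed. The name `escapeNonAscii` is hypothetical, and the choice to leave ASCII untouched is an assumption made to keep the example short.

```javascript
// Hypothetical example: emit numeric character references for non-ASCII text.
function escapeNonAscii(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var codePoint = unit;
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Valid surrogate pair: combine it and step over the second unit.
        codePoint = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        i++;
      }
    }
    if (codePoint < 0x80) {
      out += String.fromCharCode(codePoint); // ASCII passes through unchanged
    } else {
      out += "&#x" + codePoint.toString(16) + ";";
    }
  }
  return out;
}
```

Without the pair check, the input above would come out as two references, one per surrogate, rather than the single `&#x10000;`.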

> ·         A strict mode that disallows the irregular UTF-16?

I think this can be best left to JSLint.



# Brendan Eich (14 years ago)

On May 19, 2011, at 1:42 PM, Mike Samuel wrote:

> 2011/5/19 Shawn Steele <Shawn.Steele at microsoft.com>:
>> I don’t have time to make a real strawman, but what would people need if we
>> went the UTF-16 route (instead of full-Unicode)?  (This thread is to collect
>> requirements, which are somewhat getting lost in the merits of UTF-16 vs 32
>> bit thread).  Basically, just replace UCS-2 with UTF-16, allowing irregular
>> UTF-16 for compatibility.
>> 
>> Things that come to mind immediately are:
>> 
>> ·         Some sort of convenience notation for string literals and regular
>> expressions.

We could surely use better string and regexp literal support. I'm going to get the multiline_regexps strawman (http://wiki.ecmascript.org/doku.php?id=strawman:multiline_regexps) done by next week or bust. Perhaps we should have a companion multiline string proposal (or I'll combine them :-P), where we stipulate that both new literal forms support UTF-16 if that is accepted.

> 
>> ·         Extend string.fromCharCode() to allow generating UTF-16 pairs for
>> values 10000-10ffff.
> 
> +1

This is not a compatible change:

js> String.fromCharCode(0x10000)
"\0"

The heap is shared in the same origin between old and new scripts, so this is borrowing some trouble. Not sure how much, but why borrow if we don't need to?
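
The incompatibility comes from String.fromCharCode converting each argument with ToUint16, so only the low 16 bits survive. A short sketch of what existing code can observe (the variable names are just for this example):

```javascript
// Each argument is reduced modulo 0x10000 before becoming a code unit.
var truncated = String.fromCharCode(0x10000); // low 16 bits are 0x0000
var wrapped = String.fromCharCode(0x10041);   // low 16 bits are 0x0041

console.log(truncated.length);       // 1
console.log(truncated === "\0");     // true
console.log(wrapped === "A");        // true
```

Any script relying on this wraparound would see different strings if the same call started returning surrogate pairs.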

>> ·         A strict mode that disallows the irregular UTF-16?
> 
> I think this can be best left to JSLint.

If static, yes. Runtime checking is going to cry wolf on all the data hacked into uint16 pieces in strings. That seems the common case, not escaped irregular UTF-16 in string literals. I don't see a need here that we can realistically meet.
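
To illustrate the "cry wolf" concern, here is a rough sketch of what a runtime well-formedness check might look like and how it reacts to binary data packed into a string. The function name `hasUnpairedSurrogate` and the sample values are assumptions for this example, not anything specified in the thread.

```javascript
// Hypothetical check: true if the string contains a surrogate that is not
// part of a valid high+low pair.
function hasUnpairedSurrogate(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // valid pair, skip the low surrogate
        continue;
      }
      return true; // high surrogate with no low surrogate after it
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// Arbitrary 16-bit values stored in a string, as code sometimes does for data.
var packed = String.fromCharCode(0x1234, 0xD800, 0x0041);
console.log(hasUnpairedSurrogate(packed));         // true
console.log(hasUnpairedSurrogate("\ud800\udc00")); // false
```

A runtime check along these lines would flag `packed` even though the string was never meant to be text.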

/be

# Shawn Steele (14 years ago)

js> String.fromCharCode(0x10000)
"\0"

Bummer
