Norbert Lindenberg (2013-09-05T19:07:01.000Z)
Previous discussion of allowing surrogate code points:
- https://mail.mozilla.org/pipermail/es-discuss/2012-December/thread.html#27057
- https://mail.mozilla.org/pipermail/es-discuss/2013-January/thread.html#28086
- http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/thread.html#29

Essentially, ECMAScript strings are Unicode strings as defined in The Unicode Standard, section 2.7, and thus may contain unpaired surrogate code units in their 16-bit form, or surrogate code points when interpreted as 32-bit sequences. String.fromCodePoint and String.prototype.codePointAt just convert between the 16-bit and 32-bit forms; they're not meant to interpret the code points beyond that, and some processing (such as test cases) may depend on surrogates being preserved (see the first sketch below). This is different from encoding for communication over networks, where valid UTF-8 or UTF-16 (neither of which can contain surrogate code points) is generally required.

The indexing issue was first discussed in the form "why can't we just use UTF-32?" See http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32 for pointers to that discussion. It would have been great to use UTF-8, but it's unfortunately not compatible with the language's past or with the DOM. Adding code point indexing to 16-bit code unit strings would add significant performance overhead.

In reality, whether an index refers to 16-bit or 32-bit units matters only for some relatively low-level software that needs to process strings code point by code point. A lot of software deals with complete strings without ever looking inside them, or is fine processing code unit by code unit (e.g., String.prototype.indexOf); the second sketch after this illustrates the difference.
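
To make the preservation claim concrete, here is a minimal sketch of the lossless round trip, using only String.prototype.codePointAt and String.fromCodePoint as proposed (the example string is made up for illustration):

    // A string containing a lone (unpaired) lead surrogate between two BMP characters.
    var s = "a\uD800b";

    // codePointAt exposes the lone surrogate as the code point U+D800,
    // rather than throwing or substituting U+FFFD.
    var codePoints = [];
    for (var i = 0; i < s.length; ) {
        var cp = s.codePointAt(i);
        codePoints.push(cp);
        i += cp > 0xFFFF ? 2 : 1;  // supplementary code points occupy two code units
    }
    console.log(codePoints.map(function (cp) { return cp.toString(16); }));
    // ["61", "d800", "62"]

    // fromCodePoint accepts the surrogate code point and reproduces the
    // original 16-bit sequence exactly, so the round trip is lossless.
    console.log(String.fromCodePoint.apply(null, codePoints) === s);  // true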
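And here is a sketch of why most code never notices which unit an index counts: code unit indices returned by one string method (here indexOf) can be fed straight back into another (here slice), while code that genuinely needs to look at individual characters can layer code point iteration on top via codePointAt. The example string is again hypothetical:

    var s = "I \uD83D\uDC96 JS";  // "I " + U+1F496 + " JS"; U+1F496 takes two code units

    // Code-unit view: length and indexOf count 16-bit units, and the
    // resulting indices compose with slice, charCodeAt, etc. Software that
    // only searches, splits, and concatenates needs nothing more.
    console.log(s.length);                  // 7, not 6
    console.log(s.indexOf("JS"));           // 5, a code unit index
    console.log(s.slice(s.indexOf("JS")));  // "JS"

    // Code-point view: only character-by-character processing needs it,
    // and it pays the scanning cost only where actually required.
    var count = 0;
    for (var i = 0; i < s.length; ) {
        count++;
        i += s.codePointAt(i) > 0xFFFF ? 2 : 1;
    }
    console.log(count);  // 6 code points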