easy handling of UTF16 surrogates & well-formed strings

# Roger Andrews (12 years ago)

This is rather long but the idea is to make handling UTF16 surrogates easier for the casual user without harming the ability of UTF16 experts to delve into details if surrogates are not well-paired (and hence the string is not well-formed).

Under the current definitions (ed. 6_10-26-12) surprising things happen. E.g. a string converted to an array of codepoints with 'codePointAt' then back to a string with 'fromCodePoint' is not equal to the original string if it contains well-formed surrogate pairs.
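A minimal sketch of the surprise described above, assuming codePointAt and fromCodePoint behave as the draft defines them (the variable names and the sample string are illustrative only):

```javascript
// Round trip: index every code unit position with codePointAt,
// then rebuild the string with fromCodePoint.
var original = "a\uD83D\uDE00b";   // "a", U+1F600 (a surrogate pair), "b"

var codePoints = [];
for (var i = 0; i < original.length; i++) {
    codePoints.push(original.codePointAt(i));
}
// codePoints is [0x61, 0x1F600, 0xDE00, 0x62]: the trail surrogate at
// position 2 is reported again although position 1 already covered it.

var rebuilt = String.fromCodePoint.apply(null, codePoints);

console.log(rebuilt === original);   // false
console.log(rebuilt.length);         // 5 (the original has length 4)
```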

Here are some thoughts from a JavaScript enthusiast playing with Unicode outside the BMP.

String.prototype.codePointAt

The current definition of codePointAt has results:

    out-of-bounds                   -> Undefined
    normal BMP char                 -> the codepoint
    lead surrogate of a good pair   -> the codepoint
    trail surrogate of a good pair  -> codeunit in [0xDC00:0xDFFF]  !!ambiguous
    bad trail surrogate             -> codeunit in [0xDC00:0xDFFF]
    bad lead surrogate              -> codeunit in [0xD800:0xDBFF]

Note that a well-paired trail surrogate still results in a value even though the previous codeunit "subsumed" it. So, if a caller is indexing down the string then it should take the well-paired trail surrogate value out of the sequence.
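To illustrate the ambiguity, here is a small sketch (the helper isLeadSurrogate and the sample strings are made up for this example): the same value comes back for a well-paired trail surrogate and for an unpaired one, so the caller has to inspect the previous code unit to tell them apart.

```javascript
var wellFormed = "\uD83D\uDE00";   // lead + trail: one code point, U+1F600
var illFormed  = "x\uDE00";        // trail surrogate with no lead before it

console.log(wellFormed.codePointAt(1).toString(16));   // "de00"
console.log(illFormed.codePointAt(1).toString(16));    // "de00"

// Telling the two cases apart needs a look at the preceding code unit.
function isLeadSurrogate(cu) { return 0xD800 <= cu && cu <= 0xDBFF; }

console.log(isLeadSurrogate(wellFormed.charCodeAt(0)));   // true:  part of a good pair
console.log(isLeadSurrogate(illFormed.charCodeAt(0)));    // false: unpaired
```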

UTF16 experts can write code to check these possibilities; but for general usability let's have: Undefined for the trail surrogate of a good pair, and NaN for a bad surrogate.

Then codePointAt would do the work for the casual user and experts can probe the string with charCodeAt (or codeUnitAt if it exists) if they really want to know the situation of bad surrogates.

[Unchanged, users are called upon to write code patterns like the messy....

// if the indexed position is part of a well-formed surrogate pair
// then result is either the entire code-point (for lead surrogates)
//                or undefined (for trail surrogates)
// result is NaN for bad surrogates
// (result is always undefined for out-of-bounds position)

cp = str.codePointAt( pos );
if (0xDC00 <= cp  &&  cp <= 0xDFFF) {
    cu = str.charCodeAt( pos-1 );
    if (0xD800 <= cu  &&  cu <= 0xDBFF) {
        cp =  undefined;      // trail surrogate of good pair
    }
}
if (0xD800 <= cp  &&  cp <= 0xDFFF) {
    cp = NaN;                 // bad surrogate
}

]

String.prototype.charCodeAt / String.prototype.codeUnitAt

The existing charCodeAt returns NaN (not Undefined) if the indexed position is out-of-bounds, unlike codePointAt.

For consistency, there could be a method 'codeUnitAt' which behaves like (and is named like) codePointAt; i.e. returns Undefined for out-of-bounds.
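A possible shape for this, written as a standalone function rather than a String.prototype method so that nothing is patched (codeUnitAt is the hypothetical name from the suggestion above, not an existing API):

```javascript
// Sketch only: like charCodeAt, but Undefined instead of NaN when out of bounds.
function codeUnitAt(str, pos) {
    var cu = String(str).charCodeAt(pos);
    return isNaN(cu) ? undefined : cu;   // charCodeAt yields NaN only when out of bounds
}

console.log(codeUnitAt("abc", 1));       // 98
console.log(codeUnitAt("abc", 3));       // undefined
console.log("abc".charCodeAt(3));        // NaN
console.log("abc".codePointAt(3));       // undefined
```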

String.prototype.charAt / String.prototype.unicodeCharAt

The existing charAt does not handle UTF16 surrogate pairs.

For consistency with the above, there could be a method 'unicodeCharAt' which returns the 1- or 2-char string corresponding to the 'codePointAt' value and empty-string for out-of-bounds or a well-paired trail surrogate. Note that an array of such strings could be joined to form the original string.

What to return for a bad surrogate? Null? Undefined?
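As a rough sketch of how unicodeCharAt might look (a standalone function with hypothetical helper names; returning null for a bad surrogate is only a placeholder for the open question above, not a recommendation):

```javascript
function isLead(cu)  { return 0xD800 <= cu && cu <= 0xDBFF; }
function isTrail(cu) { return 0xDC00 <= cu && cu <= 0xDFFF; }

// Sketch only; pos is assumed to be an integer.
function unicodeCharAt(str, pos) {
    var cu = str.charCodeAt(pos);
    if (isNaN(cu)) return "";                        // out of bounds
    if (isLead(cu)) {
        // a lead followed by a trail is a good pair: return both code units
        return isTrail(str.charCodeAt(pos + 1)) ? str.slice(pos, pos + 2) : null;
    }
    if (isTrail(cu)) {
        // a trail preceded by a lead is "hidden" by that lead
        return isLead(str.charCodeAt(pos - 1)) ? "" : null;
    }
    return str.charAt(pos);                          // ordinary BMP char
}

var s = "a\uD83D\uDE00b";
var parts = [];
for (var i = 0; i < s.length; i++) parts.push(unicodeCharAt(s, i));
// parts is ["a", "\uD83D\uDE00", "", "b"]
console.log(parts.join("") === s);                   // true
```

One consequence of the placeholder choice: Array.prototype.join treats null and undefined elements as empty strings, so joining the results for an ill-formed string would silently drop the bad surrogates.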

String.fromCodePoint

The current definition of fromCodePoint does not convert a sequence produced by codePointAt back to the original string.

This is really due to codePointAt returning a trail surrogate value after a well-formed pair (which was just converted to a single codepoint).

If codePointAt is changed to return Undefined for a good trail surrogate then fromCodePoint should simply ignore Undefined arguments. Currently I think it throws RangeError (or maybe converts Undefined values to NUL chars?).
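The two suggested changes can be sketched together to show the round trip working. Both function names below are hypothetical stand-ins for the proposed behaviour, built on top of the existing methods:

```javascript
// Sketch of codePointAt with the suggested results:
// Undefined for out-of-bounds and for the trail of a good pair, NaN for a bad surrogate.
function proposedCodePointAt(str, pos) {
    var cp = str.codePointAt(pos);
    if (cp === undefined) return undefined;                        // out of bounds
    if (0xDC00 <= cp && cp <= 0xDFFF) {
        var prev = str.charCodeAt(pos - 1);
        if (0xD800 <= prev && prev <= 0xDBFF) return undefined;    // trail of good pair
    }
    if (0xD800 <= cp && cp <= 0xDFFF) return NaN;                  // bad surrogate
    return cp;
}

// Sketch of fromCodePoint that ignores Undefined arguments.
function fromCodePointSkippingUndefined() {
    var defined = [];
    for (var i = 0; i < arguments.length; i++) {
        if (arguments[i] !== undefined) defined.push(arguments[i]);
    }
    return String.fromCodePoint.apply(null, defined);
}

var s = "a\uD83D\uDE00b";
var cps = [];
for (var i = 0; i < s.length; i++) cps.push(proposedCodePointAt(s, i));
// cps is [0x61, 0x1F600, undefined, 0x62]
console.log(fromCodePointSkippingUndefined.apply(null, cps) === s);   // true
```

In this sketch a NaN produced for a bad surrogate is still passed through to String.fromCodePoint, which rejects it with a RangeError.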

String.fromCharCode / String.fromCodeUnit

The existing fromCharCode converts undefined, null, NaN, Infinity values into NUL chars (U+0000), and maps other naughty values into valid chars.

For consistency, there could be a function 'fromCodeUnit' which behaves like (and is named like) fromCodePoint; i.e. throws RangeError for naughty values. This function should also have arity = 0 like fromCodePoint.

If fromCodePoint is changed to ignore Undefined arguments then so should fromCodeUnit.
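For comparison, a sketch of the existing fromCharCode behaviour next to a possible fromCodeUnit. The function is a hypothetical standalone version of the suggestion, and skipping Undefined is included only on the assumption that the change above is adopted:

```javascript
// Existing behaviour: everything is coerced with ToUint16.
console.log(String.fromCharCode(undefined) === "\u0000");   // true
console.log(String.fromCharCode(0x10041) === "A");          // true (0x10041 wraps to 0x41)

// Sketch: reject anything that is not an integer in [0, 0xFFFF].
function fromCodeUnit() {
    var units = [];
    for (var i = 0; i < arguments.length; i++) {
        if (arguments[i] === undefined) continue;   // assumption: Undefined is ignored
        var n = Number(arguments[i]);
        if (!isFinite(n) || Math.floor(n) !== n || n < 0 || n > 0xFFFF) {
            throw new RangeError("Invalid code unit " + arguments[i]);
        }
        units.push(n);
    }
    return String.fromCharCode.apply(null, units);
}

console.log(fromCodeUnit(0x61, undefined, 0x62));   // "ab"
console.log(fromCodeUnit.length);                   // 0
```

Declaring no formal parameters is what gives the function an arity of 0 here.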

String.isWellFormed

To enable a user easily to detect a well-/ill-formed string how about a friendly predicate: String.isWellFormed( str )

Without this, the following regexp should test a string for well-formedness (no warranty implied): /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/
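Wrapped up as a predicate, the regexp above could be used like this (isWellFormed is written as a plain function here; the name and the sample strings are for illustration only):

```javascript
// Every code unit must be either part of a lead+trail pair or a non-surrogate.
var re_wellformed = /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

function isWellFormed(str) {
    return re_wellformed.test(String(str));
}

console.log(isWellFormed("a\uD83D\uDE00b"));   // true:  good pair
console.log(isWellFormed("a\uD83Db"));         // false: lead without trail
console.log(isWellFormed("\uDE00\uD83D"));     // false: pair in the wrong order
```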

String.prototype.repair

Following on from isWellFormed, what is the user to do with an ill-formed string? Here is one suggestion: a 'repair' method which replaces improper surrogates with something (like the Unicode replacement character U+FFFD). (Alternatively, the user may want to give up and throw an Error, see next.)

[Here is a possible implementation which UTF16 experts could shim in....

// lone lead | lone trail after a non-lead | lone trail at start of string
var re_badsurrogate =
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|([^\uD800-\uDBFF])[\uDC00-\uDFFF]|^[\uDC00-\uDFFF]/g;

String.prototype.repair = function (replacer)
{
    if (arguments.length == 0)  replacer = "\uFFFD";

    return this.replace( re_badsurrogate, "$1"+replacer );
};

]
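One thing to watch in a regexp of this shape: the alternative that captures the character before an unpaired trail surrogate consumes that character, so the second of two consecutive unpaired trail surrogates may be left alone. A variant sketch that avoids this by matching good pairs first and passing them through unchanged (repairString and re_surrogates are illustrative names, and this is only one possible approach):

```javascript
// Match a well-formed pair first; anything else that is a surrogate is unpaired.
var re_surrogates = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;

function repairString(str, replacer) {
    if (arguments.length < 2) replacer = "\uFFFD";
    return String(str).replace(re_surrogates, function (match) {
        return match.length === 2 ? match : replacer;   // keep pairs, replace lone surrogates
    });
}

console.log(repairString("a\uDC00\uDC00b") === "a\uFFFD\uFFFDb");        // true
console.log(repairString("a\uD83D\uDE00b") === "a\uD83D\uDE00b");        // true
```

Using a function as the replacement also means a replacer string containing "$" is inserted literally.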

StringError (& URI functions)

The existing encodeURI & encodeURIComponent throw URIError if given an ill-formed string. (The URI decode functions behave similarly, throwing URIError both for ill-formed strings and for improper use of percent-encoding.)

A new Error, called StringError, could be thrown by URI functions and user functions which reject an ill-formed string because it is ill-formed (rather than trying to repair it).

To avoid changing the existing URI functions, versions using StringError could be moved from global namespace to a "URI" namespace (ala "JSON"): URI.encodeComponent, ... This seems quite neat, and declutters the global namespace too.
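A rough sketch of what this might look like if shimmed in. StringError and URI.encodeComponent are the hypothetical names from the text above; the error message and the internal regexp are assumptions made for the example:

```javascript
// Sketch of a StringError type that inherits from Error (ES5 style).
function StringError(message) {
    this.name = "StringError";
    this.message = message || "ill-formed UTF-16 string";
    this.stack = (new Error(this.message)).stack;
}
StringError.prototype = Object.create(Error.prototype);
StringError.prototype.constructor = StringError;

var re_wellformed = /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// Sketch of a "URI" namespace whose functions throw StringError for ill-formed input.
var URI = {
    encodeComponent: function (str) {
        str = String(str);
        if (!re_wellformed.test(str)) {
            throw new StringError("cannot encode ill-formed string");
        }
        return encodeURIComponent(str);
    }
};

console.log(URI.encodeComponent("a b"));     // "a%20b"
try {
    URI.encodeComponent("\uD83D");           // unpaired lead surrogate
} catch (e) {
    console.log(e.name);                     // "StringError"
}
```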

# Phillips, Addison (12 years ago)

You might want to check out Norbert's proposal [1]

Addison

Addison Phillips Globalization Architect (Lab126) Chair (W3C I18N WG)

Internationalization is not a feature. It is an architecture.

[1] norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

# Roger Andrews (12 years ago)

Thanks for the ref to Norbert's proposal. (I have been interested in i18n since writing an international telephony switch control system in 1987.)

Norbert's proposal has much interesting info about formats, locales, case-mapping & much else, but says little about the String.* functions or how the user can handle an ill-formed string (thinking from the perspective of a lowly software engineer working to achieve some task, rather than a top-down architect).

Head: 4.3.20 Surrogate pair

The proposal does confirm that an unpaired surrogate makes a UTF16 sequence ill-formed.

Head: 5.3 Text Interpretation

The proposal confirms that a valid surrogate pair is interpreted as a single codepoint, not a codepoint followed by an unpaired surrogate (as String.prototype.codePointAt does).

Towards the end of the page, in section Code Point Based String Accessors, the proposal defines String.fromCodePoint and String.prototype.codePointAt in effectively the same manner as ES6 (ed. 6_10-26-12) - although the length property (arity) of fromCodePoint differs from ES6's.

This definition of codePointAt has the same usability issues as ES6's (ed. 6_10-26-12); i.e. it returns a value in [0xDC00:0xDFFF] for both the 2nd member of a surrogate pair and an unpaired surrogate. It returns a value in [0xD800:0xDFFF] for an unpaired surrogate - maybe it would be friendlier to the casual user to return NaN (UTF16 experts can probe the location with charCodeAt / codeUnitAt if they care to).

My original post tried to point to anomalies in:

    String.prototype.codePointAt (of ES6)
    String.prototype.charCodeAt (suggest String.prototype.codeUnitAt instead)
    String.prototype.charAt (suggest String.prototype.unicodeCharAt too)
    String.fromCodePoint (of ES6)
    String.fromCharCode (suggest String.fromCodeUnit instead)

and floated:

    String.isWellFormed
    String.prototype.repair
    StringError (& suggest URI functions mods)

Thanks again for the ref.


From: "Phillips, Addison" <addison at lab126.com>

Sent: Wednesday, November 14, 2012 5:05 PM

To: "Roger Andrews" <roger.andrews at mail104.co.uk>; <es-discuss at mozilla.org>

Subject: RE: easy handling of UTF16 surrogates & well-formed strings

# Norbert Lindenberg (12 years ago)

Thank you for the feedback! It's always good to hear from developers actually using or planning to use the API we're putting together.

I saw Addison's and your follow-up, but will reply to this message because it has the meat of your feedback. Note that my proposal has been approved by TC 39 for ES6, but Allen sometimes has to tweak details to match the rest of the spec, and not all parts of the proposal are incorporated into the spec yet.

More inline below.

Norbert

On Nov 14, 2012, at 6:06, Roger Andrews wrote:

This is rather long but the idea is to make handling UTF16 surrogates easier for the casual user without harming the ability of UTF16 experts to delve into details if surrogates are not well-paired (and hence the string is not well-formed).

Under the current definitions (ed. 6_10-26-12) surprising things happen. E.g. a string converted to an array of codepoints with 'codePointAt' then back to a string with 'fromCodePoint' is not equal to the original string if it contains well-formed surrogate pairs.

Can you give an example usage scenario where this round-trip conversion is necessary?

Note that my proposal contains an iterator [1] as a more convenient interface for developers who need to access the code points of a string in sequence (this hasn't made it into the spec yet). The current version returns substrings; an alternative version could return integers - feedback on that would be helpful.

[1] norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#String

Here are some thoughts from a JavaScript enthusiast playing with Unicode outside the BMP.

String.prototype.codePointAt

The current definition of codePointAt has results:

    out-of-bounds                   -> Undefined
    normal BMP char                 -> the codepoint
    lead surrogate of a good pair   -> the codepoint
    trail surrogate of a good pair  -> codeunit in [0xDC00:0xDFFF]  !!ambiguous
    bad trail surrogate             -> codeunit in [0xDC00:0xDFFF]
    bad lead surrogate              -> codeunit in [0xD800:0xDBFF]

Note that a well-paired trail surrogate still results in a value even though the previous codeunit "subsumed" it. So, if a caller is indexing down the string then it should take the well-paired trail surrogate value out of the sequence.

UTF16 experts can write code to check these possibilities; but for general usability let's have: Undefined for the trail surrogate of a good pair, and NaN for a bad surrogate.

The current spec requires developers to check whether a returned code point is above 0xFFFF and skip one string element if that's the case (see the implementation of the proposed iterator). Your proposed spec requires them to check whether a code point is undefined and skip it if that's the case. Is that really better?
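The pattern described here can be sketched as follows (codePointsOf is an illustrative helper, not the iterator from the proposal):

```javascript
// Walk the string by code point: when codePointAt returns a supplementary
// code point (above 0xFFFF), skip the trail surrogate that belongs to it.
function codePointsOf(str) {
    var result = [];
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        result.push(cp);
        if (cp > 0xFFFF) i++;   // skip one string element
    }
    return result;
}

var s = "a\uD83D\uDE00b";
var cps = codePointsOf(s);      // [0x61, 0x1F600, 0x62]
console.log(String.fromCodePoint.apply(null, cps) === s);   // true
```

With this loop an unpaired surrogate is passed through as its own value, so the conversion back with fromCodePoint reproduces ill-formed strings as well.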

NaN seems a bad choice because it's not a code point and so forces callers to check for it separately. For cases where the actual bad surrogate is not wanted, U+FFFD is the standard choice.

Then codePointAt would do the work for the casual user and experts can probe the string with charCodeAt (or codeUnitAt if it exists) if they really want to know the situation of bad surrogates.

[Unchanged, users are called upon to write code patterns like the messy....

// if the indexed position is part of a well-formed surrogate pair
// then result is either the entire code-point (for lead surrogates)
//                or undefined (for trail surrogates)
// result is NaN for bad surrogates
// (result is always undefined for out-of-bounds position)

cp = str.codePointAt( pos );
if (0xDC00 <= cp  &&  cp <= 0xDFFF) {
    cu = str.charCodeAt( pos-1 );
    if (0xD800 <= cu  &&  cu <= 0xDBFF) {
        cp =  undefined;      // trail surrogate of good pair
    }
}
if (0xD800 <= cp  &&  cp <= 0xDFFF) {
    cp = NaN;                 // bad surrogate
}

]

That might be necessary if they really want to deal with undefined and NaN. I think it would be more useful to look at actual usage scenarios and see how they can be addressed using the iterator or codePointAt directly. Can you provide some of the algorithms (simplified if necessary) where you're trying to support supplementary characters?

String.prototype.charCodeAt / String.prototype.codeUnitAt

The existing charCodeAt returns NaN (not Undefined) if the indexed position is out-of-bounds, unlike codePointAt.

For consistency, there could be a method 'codeUnitAt' which behaves like (and is named like) codePointAt; i.e. returns Undefined for out-of-bounds.

There's been an argument that codePointAt should return NaN like charCodeAt [2, 3; thread 4 for context]. What's your view on this?

[2] esdiscuss/2012-August/024587

[3] esdiscuss/2012-August/024606

[4] esdiscuss/2012-August/thread.html#24576

String.prototype.charAt / String.prototype.unicodeCharAt

The existing charAt does not handle UTF16 surrogate pairs.

For consistency with the above, there could be a method 'unicodeCharAt' which returns the 1- or 2-char string corresponding to the 'codePointAt' value and empty-string for out-of-bounds or a well-paired trail surrogate. Note that an array of such strings could be joined to form the original string.

Again, would be good to look at usage scenarios. codePointAt fits scenarios where you want to look up information about a code point in tables. How would you use the code point strings? I assume joining comes only at the end of an algorithm that modifies the string in some way.

String.fromCodePoint


The current definition of fromCodePoint does not convert a sequence produced by codePointAt back to the original string.

This is really due to codePointAt returning a trail surrogate value after a well-formed pair (which was just converted to a single codepoint).

If codePointAt is changed to return Undefined for a good trail surrogate then fromCodePoint should simply ignore Undefined arguments. Currently I think it throws RangeError (or maybe converts Undefined values to NUL chars?).

It throws RangeError because ToNumber(undefined) is NaN.
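A quick check of this behaviour (the variable name is arbitrary):

```javascript
var thrown = null;
try {
    String.fromCodePoint(undefined);   // ToNumber(undefined) is NaN, which is not a valid code point
} catch (e) {
    thrown = e;
}
console.log(thrown instanceof RangeError);   // true
```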

String.fromCharCode / String.fromCodeUnit

The existing fromCharCode converts undefined, null, NaN, Infinity values into NUL chars (U+0000), and maps other naughty values into valid chars.

For consistency, there could be a function 'fromCodeUnit' which behaves like (and is named like) fromCodePoint; i.e. throws RangeError for naughty values. This function should also have arity = 0 like fromCodePoint.

If fromCodePoint is changed to ignore Undefined arguments then so should fromCodeUnit.

String.isWellFormed

To enable a user easily to detect a well-/ill-formed string how about a friendly predicate: String.isWellFormed( str )

More likely, String.prototype.isWellFormedUTF16().

Would you actually use such a method in your code? It seems reasonable, but there's the risk that algorithms actually have stricter requirements, therefore implement their own checks, and don't use this method.

Without this, the following regexp should test a string for well-formedness (no warranty implied): /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/

String.prototype.repair

Following on from isWellFormed, what is the user to do with an ill-formed string? Here is one suggestion: a 'repair' method which replaces improper surrogates with something (like the Unicode replacement character U+FFFD). (Alternatively, the user may want to give up and throw an Error, see next.)

[Here is a possible implementation which UTF16 experts could shim in....

// lone lead | lone trail after a non-lead | lone trail at start of string
var re_badsurrogate =
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|([^\uD800-\uDBFF])[\uDC00-\uDFFF]|^[\uDC00-\uDFFF]/g;

String.prototype.repair = function (replacer)
{
    if (arguments.length == 0)  replacer = "\uFFFD";

    return this.replace( re_badsurrogate, "$1"+replacer );
};

]

Again, seems reasonable, but would you actually use it in your code?

StringError (& URI functions)

The existing encodeURI & encodeURIComponent throw URIError if given an ill-formed string. (The URI decode functions behave similarly, throwing URIError both for ill-formed strings and for improper use of percent-encoding.)

A new Error, called StringError, could be thrown by URI functions and user functions which reject an ill-formed string because it is ill-formed, (rather than trying to repair it).

To avoid changing the existing URI functions, versions using StringError could be moved from global namespace to a "URI" namespace (ala "JSON"): URI.encodeComponent, ... This seems quite neat, and declutters the global namespace too.

I don't think TC 39 would introduce new variants of the URI functions just to change the errors they throw.

ECMAScript 6 introduces modules [6], which would handle the namespace problem - see in particular the proposal to modularize the standard built-in functionality [7].

[6] harmony:modules

[7] harmony:modules_standard

# Roger Andrews (12 years ago)

Sorry about the slow response - (other pressures to contend with).

I try to address some of your points below with a few quick comments. More in a few days hopefully.

Thank you for the feedback! It's always good to hear from developers actually using or planning to use the API we're putting together.

We shim in an API by monkey-patching String and Error so that novices can write code that Just Works without understanding UTF16, supplementary characters, or unpaired surrogates.

Can you give an example usage scenario where this round-trip conversion is necessary?

Encryption and (lossless) compression.

[we have an iterator] The current spec requires developers to check whether a returned code point is above 0xFFFF and skip one string element if that's the case

Which is fine if developers want to process strings iteratively in increasing order (and know the >0xFFFF rule).

Methods like codePointAt and xxxxxAt allow random access (just like Arrays).

If a developer accesses position k of a string then they shouldn't also have to access k-1 or k+1 to grok what's going on. If a random access to position k returns Undefined or Empty-String (as appropriate) and other functions behave well given these values then the developer doesn't have to know, care, or spend time discovering how UTF16 works. It all works the same as Arrays, leveraging the developer's existing knowledge.

NaN seems a bad choice [for an unpaired surrogate] because it's not a code point and so forces callers to check for it separately. For cases where the actual bad surrogate is not wanted, U+FFFD is the standard choice.

NaN means "something bad here", Undefined means "nothing here" (because the location is out-of-bounds, just like an Array, or because the location is "hidden" by a preceding paired surrogate).

Developers get the U+FFFD substitution if they explicitly ask for it with the 'repair' function, (or they can throw 'StringError' if they don't want to deal with unpaired surrogates).

That might be necessary if they really want to deal with undefined. ... Can you provide some of the algorithms ... where you're trying to support supplementary characters?

They don't have to deal with Undefined; other processing functions do, and it all just works (except in the case of extreme misuse).

They don't have to "support supplementary characters" (or know what that means). Just do anything they like with a bunch of codepoints without worrying or learning UTF16.

=== Sorry, must leave now, will try to return to this with better structuring.

Roger


From: "Norbert Lindenberg" <ecmascript at norbertlindenberg.com>

Sent: Friday, November 16, 2012 7:36 PM

To: "Roger Andrews" <roger.andrews at mail104.co.uk>

Cc: "Norbert Lindenberg" <ecmascript at norbertlindenberg.com>; <es-discuss at mozilla.org>

Subject: Re: easy handling of UTF16 surrogates & well-formed strings