Internationalization: Normalization and canonical equivalence in string comparison
On Tue, Jun 19, 2012 at 12:36 AM, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
The ECMAScript Internationalization API Specification currently has normalization as an optional feature in collation. However, it requires that the compare function "return 0 when comparing Strings that are considered canonically equivalent by the Unicode standard". Canonical equivalence, I thought, is usually implemented through normalization. Does it make sense to keep normalization as a separate and optional feature then? Is anybody planning to implement canonical equivalence through other mechanisms, such that the lack of normalization would be visible in the comparison of non-equivalent strings?
BTW, the requirement that canonically equivalent strings compare as equal has been part of the specification of String.prototype.localeCompare since ES3. When testing with a handful of string pairs pulled from chapter 3 of the Unicode Standard and from UTS 10, however, I found that only Opera on the Mac detects their equivalence correctly. Firefox on the Mac and the V8 systems (Chrome, Node) fail to detect any equivalence; Safari, Explorer and the Windows versions of Opera and Firefox detect some and miss others. Obviously people haven't been paying much attention to localeCompare...
I don't know enough about the first part of your message to be of any use; I am, however, interested in the second part. Will you be publishing your tests and findings?
The test is at norbertlindenberg.com/ecmascript/ESTest.html (and .js).
The strings I used are:
["o\u0308", "ö"],
["ä\u0323", "a\u0323\u0308"], // requires reordering
["a\u0308\u0323", "a\u0323\u0308"], // requires reordering
["ạ\u0308", "a\u0323\u0308"],
["ä\u0306", "a\u0308\u0306"],
["ă\u0308", "a\u0306\u0308"],
["\u1111\u1171\u11b6", "퓛"], // jamo/hangul
["Å", "Å"]
Results:
Safari on Mac, iOS: Fail for comparisons that require reordering nonspacing marks within strings; pass for others.
Firefox, Opera, Explorer on Windows: Fail for jamo/hangul comparison; pass for others.
Firefox, Node on Mac; Chrome on Mac, Windows: Fail for all.
Opera on Mac: Passes for all.
Norbert
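For readers who want to reproduce this kind of check, the following is a minimal sketch of a test loop over a few of the pairs above. It is not the code from ESTest.js. The helper name runEquivalenceTests, the labels, and the choice of three pairs are assumptions made for illustration, and the escapes assume that the literal "ö" and "퓛" in the list are the precomposed U+00F6 and U+D4DB.

```javascript
// Illustrative subset of the pairs listed above, written with escapes so the
// exact code points are visible. Each pair is canonically equivalent.
const samplePairs = [
  { label: "precomposed vs. combining", a: "o\u0308", b: "\u00f6" },
  { label: "requires reordering", a: "a\u0308\u0323", b: "a\u0323\u0308" },
  { label: "jamo/hangul", a: "\u1111\u1171\u11b6", b: "\ud4db" },
];

// Hypothetical helper (not part of ESTest.js): a pair passes when the
// supplied compare function returns 0 for it.
function runEquivalenceTests(pairs, compare) {
  return pairs.map(({ label, a, b }) => ({
    label,
    pass: compare(a, b) === 0,
  }));
}

// Check the engine's own localeCompare; the outcome depends on the
// implementation, which is the point of the results reported above.
const results = runEquivalenceTests(samplePairs, (a, b) => a.localeCompare(b));
for (const { label, pass } of results) {
  console.log(`${label}: ${pass ? "pass" : "fail"}`);
}
```

Passing the compare function as a parameter makes it easy to run the same pairs against Intl.Collator.prototype.compare or any other candidate.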
Norbert--
For what little it may be worth, I think it would make sense to just make normalization mandatory in localeCompare(). Of course, I don't know if that causes trouble for anybody (I'm pretty sure it doesn't for me).
--Rich Gillam Lab126
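To make the suggestion concrete, here is a rough sketch of what "normalize, then compare" could look like in script code. It relies on String.prototype.normalize, which was standardized later (ES2015) and was not available when this thread was written. The function names canonicallyEquivalent and compareWithNormalization are hypothetical, and a real implementation would more likely normalize inside the collator.

```javascript
// Two strings are canonically equivalent exactly when their NFD forms
// (or, equally, their NFC forms) are identical.
function canonicallyEquivalent(a, b) {
  return a.normalize("NFD") === b.normalize("NFD");
}

// Hypothetical wrapper: normalize both operands before handing them to a
// collator, so canonically equivalent inputs reach it as identical strings.
function compareWithNormalization(a, b, collator = new Intl.Collator()) {
  return collator.compare(a.normalize("NFD"), b.normalize("NFD"));
}

// Combining marks in different order: a + diaeresis + dot below
// versus a + dot below + diaeresis.
console.log(canonicallyEquivalent("a\u0308\u0323", "a\u0323\u0308")); // true

// Conjoining jamo versus the precomposed Hangul syllable.
console.log(compareWithNormalization("\u1111\u1171\u11b6", "\ud4db"));
```

With this approach the collator never sees two different spellings of an equivalent string, so returning 0 for them does not depend on how the collator itself handles equivalence.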
I'm afraid it's not quite so simple. The Internationalization API spec defines localeCompare() as a wrapper around Intl.Collator.prototype.compare, so to make normalization mandatory for localeCompare, we'd have to make it mandatory for Collator as well. I'd like to get some input from implementors on whether that makes sense, or whether they're planning to implement canonical equivalence in some other way.
Thanks, Norbert
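The wrapper relationship mentioned above can be pictured roughly as follows. This is a simplified sketch, not the spec algorithm: localeCompareViaCollator is a hypothetical name, and the sketch assumes an environment that implements Intl.Collator.

```javascript
// Simplified picture of localeCompare as a thin wrapper: build a collator
// from the locales and options, then delegate to its compare function.
function localeCompareViaCollator(a, b, locales, options) {
  const collator = new Intl.Collator(locales, options);
  return collator.compare(a, b);
}

// Under this model both calls are expected to agree in sign.
const direct = "a".localeCompare("b", "en");
const wrapped = localeCompareViaCollator("a", "b", "en");
console.log(Math.sign(direct) === Math.sign(wrapped));
```

Because all of the comparison behavior lives in the collator, any normalization requirement placed on localeCompare is effectively a requirement on Collator.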