ECMAScript collation question
ICU is always able to compare them as being equal, just by setting the parameter.
Even if the parameter isn't set, ICU uses an FCD sort (see unicode.org/notes/tn5) plus canonical closure, which handles most cases of canonical equivalence. The default is turned on for languages whose normal + auxiliary exemplar sets contain characters that would show a difference even with an FCD + closure sort, and it can be turned on unconditionally if desired (at some cost in performance; 30% sounds high, though).
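As a concrete illustration of the canonical equivalence being discussed (the strings below are my own example, not from the thread; the required result of 0 is what the ECMAScript specification mandates, and ICU-backed implementations deliver it for this case):

```javascript
// Two canonically equivalent encodings of "é".
const precomposed = "\u00E9"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const decomposed = "e\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

// They are different code unit sequences...
console.log(precomposed === decomposed); // false
// ...that normalize to the same NFC form...
console.log(precomposed === decomposed.normalize("NFC")); // true
// ...and that a conformant localeCompare reports as equal.
console.log(precomposed.localeCompare(decomposed)); // 0
```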
Mark (plus.google.com/114199149796022210033) — Il meglio è l'inimico del bene ("The best is the enemy of the good")
This is what Markus had to say (he implemented most of the collation for ICU):
"www.unicode.org/reports/tr10/#Avoiding_Normalization
Step 1 of the algorithm: www.unicode.org/reports/tr10/#Step_1 which has a note:
- Conformant implementations may skip this step in certain circumstances; see Section 6.5, Avoiding Normalization (www.unicode.org/reports/tr10/#Avoiding_Normalization) for more information.
See also www.unicode.org/reports/tr10/#Parametic_Tailoring, attribute "normalization"; see the description there (this whole Table 14 will soon move to the LDML spec, leaving only a link in this place)."
So the question is:
- Do we change the i18n API default for normalization to always be true, with some performance penalty?
- Do we update the ES 262 spec with the info Markus passed along (if possible)?
2012/8/30 Mark Davis ☕ <mark at macchiato.com>
OK, so the Unicode conformance question hinges on "must be able to do" versus "must do".
The question for ECMAScript then is whether we should stick with "must do" (the current state of the specifications) or change to "must be able to do".
The changes for "must be able to do" would be:
- In the Language specification, remove the description of String.prototype.localeCompare and require implementations to follow the Internationalization API specification at least for this method or, better, to provide the complete Internationalization API. That way, localeCompare acquires support for the normalization property in options and the -kk- key in the Unicode locale extensions.
- In the Internationalization API specification, make support for the normalization property and the -kk- key mandatory (it's currently optional), but drop the separate requirement that canonically equivalent strings compare as 0.
This would give applications control over the trade-off between performance and full canonical equivalence, and let implementations select the default per locale.
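The two knobs under discussion would surface roughly like this (a sketch: the normalization option and the -kk- key were optional in ECMA-402's first edition, so a given implementation may silently ignore both; the locale choices are illustrative):

```javascript
// Requesting full normalization via the options object
// (ignored by implementations that don't support the option):
const byOption = new Intl.Collator("vi", { normalization: true });

// Requesting it via the -kk- key in a Unicode locale extension
// (likewise ignored where unsupported, but still a valid language tag):
const byExtension = new Intl.Collator("vi-u-kk-true");

// Either way, comparison goes through the same method:
console.log(byOption.compare("a", "b") < 0); // true
console.log(typeof byExtension.compare);     // "function"
```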
But trading off correctness for performance in this way doesn't seem quite right. Especially for search usage, it could mean that you're staring at a Vietnamese or Arabic word in a list and the search function says it's not there because you typed an indistinguishable but different string into the search box.
Thanks, Norbert
I think we could go either way. It depends on the usage mode.
1. The case where performance is crucial is where you are comparing gazillions of strings, such as records in a database.
2. If the number of strings to be compared is relatively small, and/or there is enough overhead anyway, the performance win from turning off full normalization would be lost in the noise.
So if #2 is the expected use case, we could require full normalization.
Mark (plus.google.com/114199149796022210033)
I think #2 is far more common for ECMAScript - typical use would be to re-sort a list of a few dozen or at most a few hundred entries and then redisplay that list. #1 might become more common though as JavaScript use on the server progresses.
So here's an alternative spec approach:
1. Leave the specification of String.prototype.localeCompare as is. That is, if it's not based on Collator, canonically equivalent strings are required to compare as 0.
2. For Collator.prototype.compare, require that canonically equivalent strings compare as 0 unless the client explicitly turns off normalization (i.e., normalization is on by default, independent of locale). Support for the normalization property in options and the kk key would become mandatory.
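Under this alternative, the observable behavior would be roughly the following (a sketch: the default case is what the proposal requires, while the explicit opt-out assumes the optional normalization property is supported):

```javascript
// Default: normalization on, so canonically equivalent strings
// compare as 0 in every locale.
const strict = new Intl.Collator("en");
console.log(strict.compare("\u00E9", "e\u0301")); // 0

// Explicit opt-out for performance-critical bulk comparison
// (hypothetical under this proposal; the result may then be nonzero
// for strings that an FCD-only sort cannot equate):
const fast = new Intl.Collator("en", { normalization: false });
const words = ["r\u00E9sum\u00E9", "resume", "re\u0301sume\u0301"];
words.sort(fast.compare); // compare is bound, so it can be passed directly
```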
Norbert
Support for the normalization property in options and the kk key would become mandatory.
The options that ICU offers are to observe full canonical equivalence:
1. For all locales: kk=true
2. For key locales (where it is necessary), otherwise partial (FCD): kk=<not present>
3. For no locales, always partial (FCD): kk=false
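Expressed as locale tags, these three ICU modes would look something like this (a sketch: the kk key was optional in ECMA-402's first edition, so an implementation may ignore it; the Arabic locale is just an example of one where full equivalence matters):

```javascript
// Full canonical equivalence for all locales:
const full = new Intl.Collator("ar-u-kk-true");
// Locale default: full where necessary, otherwise partial (FCD) --
// this mode is simply the absence of the kk key:
const perLocale = new Intl.Collator("ar");
// Always partial (FCD):
const partial = new Intl.Collator("ar-u-kk-false");

// All three tags are well-formed even where kk is unsupported:
console.log(typeof full.compare);      // "function"
console.log(typeof perLocale.compare); // "function"
console.log(typeof partial.compare);   // "function"
```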
Your proposal looks reasonable, except I'm not sure how someone would use the kk value to get #2.
Mark (plus.google.com/114199149796022210033)
On Sat, Sep 1, 2012 at 4:19 PM, Mark Davis ☕ <mark at macchiato.com> wrote:
Your proposal looks reasonable, except I'm not sure how someone would use the kk value to get #2.
Could we say kk=default?

markus
We could propose to the CLDR group adding <attribute>=default to mean (for
CLDR) the same as missing (at least for kk, if not others).
That would formally work, but it would mean that in an ECMAScript context missing != default, while in other CLDR contexts missing == default.
May work, but any other thoughts?
Mark (plus.google.com/114199149796022210033)
On Sun, Sep 2, 2012 at 12:51 PM, Mark Davis ☕ <mark at macchiato.com> wrote:
We could propose to the CLDR group adding <attribute>=default to mean (for CLDR) the same as missing (at least for kk, if not others).
I don't think that CLDR needs that just because ECMAScript might have it.
markus
The BCP 47 Unicode Locale Extension would need it, and currently that's tangled with CLDR...
Norbert
Seeing that the final draft of the spec is due today, here's a breakdown of possible changes around normalization in Collator:
1. Change the description of Intl.Collator.prototype.compare to say: "The method is required to return 0 when comparing Strings that are considered canonically equivalent by the Unicode standard, unless collator has a [[normalization]] internal property whose value is false."
This is the smallest possible change to the spec that's needed to make its canonical equivalence and normalization requirements consistent, and I've made it.
2. Require support for the normalization property and the kk key.
The way I phrased the spec in #1, this isn't necessary anymore, and we can make this change in the second edition if needed.
3. Add "locale" to the set of acceptable input values for the normalization property of options. Implementations that support the normalization property would use the selected locale's default for the "kk" key. The normalization property of the object returned by resolvedOptions remains a boolean.
This change could be made today or in the second edition. If we make it in the second edition, implementations of the first edition would interpret "locale" as true, because "locale" is truthy. The conformance clause does not allow implementations to add support for this value on their own.
4. Add "locale" to the set of acceptable values of the kk key of BCP 47. The Internationalization API would use this, if the normalization property of options is undefined, to map to the appropriate boolean value.
This can't happen today, and I'm not sure it's really required. Turning off normalization is primarily an optimization and so should be under application control.
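The proposed normalization: "locale" value could resolve along these lines (a hypothetical helper: resolveNormalization is not part of any specification, and the mapping shown is only a sketch of the behavior described above):

```javascript
// Hypothetical resolution of the normalization option to the boolean
// that resolvedOptions() would report.
function resolveNormalization(requested, localeDefault) {
  if (requested === undefined) return localeDefault; // no preference: locale decides
  if (requested === "locale") return localeDefault;  // proposed new value
  return Boolean(requested);                         // explicit true/false
}

console.log(resolveNormalization(undefined, true)); // true
console.log(resolveNormalization("locale", false)); // false
console.log(resolveNormalization(false, true));     // false

// The pitfall noted above: a first-edition implementation that doesn't
// know "locale" would coerce it and get true, because "locale" is truthy.
console.log(Boolean("locale")); // true
```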
Comments?
Norbert
In view of the schedule, I suggest that we make your first, minimal change right now, and plan to correct it along one of the other lines in the next edition.
#1 is much weaker than we want, so we should correct it, but we can do that in edition 2.
Mark (plus.google.com/114199149796022210033)
It was too weak indeed; I added the requirement that normalization is turned on by default.
Norbert
That works (for now).
Mark (plus.google.com/114199149796022210033)
I changed the subject because this question also affects the ECMAScript Language Specification.
Section 15.5.4.9, String.prototype.localeCompare (that), has said since ES3: "the function is required ... and to return 0 when comparing two strings that are considered canonically equivalent by the Unicode standard." ecma-international.org/ecma-262/5.1/#sec-15.5.4.9
I assume this requirement goes back to Unicode Technical Standard #10, Unicode Collation Algorithm, whose conformance clause C1 says (and has said since 1999): "Given a well-formed Unicode Collation Element Table, a conformant implementation shall replicate the same comparisons of strings as those produced by Section 4, Main Algorithm. In particular, a conformant implementation must be able to compare any two canonical-equivalent strings as being equal, for all Unicode characters supported by that implementation." unicode.org/reports/tr10/#Conformance
How can the default behavior of ICU be reconciled with this conformance clause?
I brought up the issue of collation and normalization before, but didn't get much feedback: esdiscuss/2012-June/thread.html#23568
Thanks, Norbert