Internationalization API: Collator sensitivity

# Norbert Lindenberg (13 years ago)

In recent discussions between Markus Scherer, Nebojša Ćirić, Mark Davis (Google), Eric Albright (Microsoft) and myself, a few issues around the sensitivity option of the Collator constructor in the ECMAScript Internationalization API [1, section 11.3.2] have come up. It would be good to get input from a wider audience.

  1. The "variant" sensitivity: This name isn't very descriptive. When "variant" is selected, a collator has to take all differences between input strings into account that it considers at the "case" and "accent" levels; it may consider additional differences. New names have been proposed:
  • "accent+case": mnemonic, but doesn't indicate that additional differences may be considered.
  • "common"
  • "normal"
  • "default": doesn't work because it's not actually the default in all cases.
  • "full": doesn't work because implementations aren't required to take all differences into consideration.
  • "distinct", "dissimilar", "varied", "inherent", "intrinsic", "essential": not really descriptive.

An alternative would be to use the terminology of the Unicode Collation Algorithm [2] even though implementations do not have to follow that spec, so there would be sensitivity values "primary", "primary+caseLevel", "secondary", "tertiary", "quaternary", "identical". The problem here is that implementations may not actually have all these levels. The current "variant" can fall anywhere between "tertiary" and "identical".

I'm leaning towards renaming "variant" to "accent+case", with a note: "Other differences, such as those between hiragana and katakana, may compare as unequal as well."
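
To make this concrete, here's a rough sketch (purely illustrative, not normative) of how the sensitivity option plays out through the Collator constructor. The value names are the ones in the current draft, with "variant" as discussed above; the exact comparison results are implementation- and locale-dependent.

```js
// Illustrative sketch: how the draft's sensitivity values affect comparisons.
// Results may differ between implementations and locales.
var sensitivities = ["base", "accent", "case", "variant"];
var pairs = [["a", "á"], ["a", "A"], ["a", "b"]];

for (var i = 0; i < sensitivities.length; i++) {
  var collator = new Intl.Collator("en", { sensitivity: sensitivities[i] });
  for (var j = 0; j < pairs.length; j++) {
    var result = collator.compare(pairs[j][0], pairs[j][1]) === 0 ? "equal" : "unequal";
    console.log(sensitivities[i] + ": " + pairs[j][0] + " vs " + pairs[j][1] + " -> " + result);
  }
}
```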

  2. The description of the sensitivity values seems to use the term "width" for the difference between the hiragana characters あ and ぁ. In the usage of the Unicode Standard, these two characters are the normal and small forms, while "width" refers to the difference between normal and full-width Latin characters such as A and Ａ, or between normal and half-width katakana characters such as ア and ｱ (katakana characters also have small variants such as ァ).

Implementations don't agree on their interpretation of these differences:

  • あ vs ぁ is interpreted as either a difference in case or a difference in accent.
  • あ vs ア vs ｱ is locale-dependent in ICU.
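
For anyone who wants to see the disagreement first-hand, here's a small diagnostic sketch; the point of this issue is precisely that the output varies between implementations and locales, so none of it should be read as normative.

```js
// Probe how one implementation treats the kana differences discussed above.
function equalAt(locale, sensitivity, x, y) {
  return new Intl.Collator(locale, { sensitivity: sensitivity }).compare(x, y) === 0;
}

// If あ and ぁ compare equal at "accent" sensitivity (which ignores case),
// the implementation treats the normal/small distinction like a case
// difference; if they compare equal at "case" sensitivity (which ignores
// accents), it treats it like an accent difference.
console.log("あ = ぁ at accent:", equalAt("ja", "accent", "あ", "ぁ"));
console.log("あ = ぁ at case:", equalAt("ja", "case", "あ", "ぁ"));

// Hiragana vs. katakana vs. half-width katakana: locale-dependent in ICU.
console.log("あ = ア at base:", equalAt("ja", "base", "あ", "ア"));
console.log("ア = ｱ at base:", equalAt("ja", "base", "ア", "ｱ"));
```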

My proposed resolution: Remove references to width and the comparison of あ and ぁ from the spec.

  1. The term "accent" is too narrow - differences in other diacritics should be considered along with accents. When mentioning diacritics, however, it becomes necessary to clarify that some languages treat some characters with diacritics as base letters.

My proposed resolution: Keep "accent" as the value of the sensitivity option, but add "or other diacritics" to "accent" in the descriptions. Add a note: "In some languages, some characters with diacritics sort as separate base letters. For example, Swedish treats 'å', 'ä' and 'ö' as base letters separate from 'a' and 'o'."
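
To illustrate the note, a quick sketch (results depend on each locale's collation data in the implementation): under a Swedish collation, 'å' should stay distinct from 'a' even at "base" sensitivity, while an English collation may treat it as 'a' with a diacritic.

```js
// Sketch of the Swedish example; results depend on the implementation's
// collation data for each locale.
function baseEqual(locale, x, y) {
  return new Intl.Collator(locale, { sensitivity: "base" }).compare(x, y) === 0;
}

console.log("sv, a vs å:", baseEqual("sv", "a", "å")); // expected false: å is its own base letter
console.log("en, a vs å:", baseEqual("en", "a", "å")); // may be true: å treated as accented a
```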

Comments?

Norbert

[1] globalization:specification_drafts
[2] unicode.org/reports/tr10

# Gillam, Richard (13 years ago)

Norbert--

  1. The "variant" sensitivity: This name isn't very descriptive. When "variant" is selected, a collator has to take all differences between input strings into account that it considers at the "case" and "accent" levels; it may consider additional differences. … An alternative would be to use the terminology of the Unicode Collation Algorithm [2] even though implementations do not have to follow that spec, so there would be sensitivity values "primary", "primary+caseLevel", "secondary", "tertiary", "quaternary", "identical".

The problem we're having coming up with appropriate words here suggests we're in an area where existing language falls short and we have to define our own terminology. The UCA did that by using the "primary"/"secondary"/etc. names and then explaining in running text what they mean and how they vary between locales. Maybe I'm wrong, but I think we deviated from that here mainly in an effort to be more descriptive, and in doing so we ran into the inherent difficulty of capturing these concepts in everyday-language terms. Maybe we should go back to the "primary"/"secondary" stuff.

The problem here is that implementations may not actually have all these levels.

Maybe use all the terms but declare some of them as optional?
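
For instance (just a sketch, and the extra level name below is hypothetical, not in the draft): callers could probe the constructor for the level they'd prefer and fall back to one of the required ones, relying on the draft's behavior of rejecting unknown sensitivity values with an exception.

```js
// Hypothetical sketch of the "optional levels" idea. "quaternary" is purely
// illustrative; it is not a valid sensitivity value in the current draft.
function pickCollator(locale, sensitivities) {
  for (var i = 0; i < sensitivities.length; i++) {
    try {
      return new Intl.Collator(locale, { sensitivity: sensitivities[i] });
    } catch (e) {
      // Unsupported sensitivity value; try the next candidate.
    }
  }
  throw new RangeError("none of the requested sensitivities is supported");
}

var collator = pickCollator("ja", ["quaternary", "variant"]);
console.log(collator.resolvedOptions().sensitivity); // "variant" with the current draft's values
```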

The current "variant" can fall anywhere between "tertiary" and "identical".

Forgive my not remembering, but why is this?

I'm leaning towards renaming "variant" to "accent+case", with a note: "Other differences, such as those between hiragana and katakana, may compare as unequal as well."

If we don't do what I'm suggesting above, this would be my second choice.

  2. The description of the sensitivity values seems to use the term "width" for the difference between the hiragana characters あ and ぁ. In the usage of the Unicode Standard, these two characters are the normal and small forms, while "width" refers to the difference between normal and full-width Latin characters such as A and Ａ, or between normal and half-width katakana characters such as ア and ｱ (katakana characters also have small variants such as ァ).

Implementations don't agree on their interpretation of these differences:

  • あ vs ぁ is interpreted as either a difference in case or a difference in accent.
  • あ vs ア vs ｱ is locale-dependent in ICU.

My proposed resolution: Remove references to width and the comparison of あ and ぁ from the spec.

I agree.

  1. The term "accent" is too narrow - differences in other diacritics should be considered along with accents. When mentioning diacritics, however, it becomes necessary to clarify that some languages treat some characters with diacritics as base letters.

You could use "diacritic" instead of "accent", I suppose.

My proposed resolution: Keep "accent" as the value of the sensitivity option, but add "or other diacritics" to "accent" in the descriptions. Add a note: "In some languages, some characters with diacritics sort as separate base letters. For example, Swedish treats 'å', 'ä' and 'ö' as base letters separate from 'a' and 'o'."

This would be fine with me too.

--Rich Gillam
Lab126