String.prototype.normalize, case folding and sort keys

# Nebojša Ćirić (12 years ago)

The String.prototype.normalize(form) spec is here: http://people.mozilla.org/~jorendorff/es6-draft.html#sec-string.prototype.normalize. It offers all four forms of normalization.
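For reference, a minimal sketch of what the four forms (NFC, NFD, NFKC, NFKD) do and do not cover. The variable names are illustrative only; the point is that none of the four forms folds case.

```javascript
// "é" written two ways: precomposed U+00E9, and "e" + combining acute U+0301.
const precomposed = "\u00E9";
const decomposed = "e\u0301";

// Canonical forms (NFC, NFD) make the two spellings compare equal.
const nfcEqual = precomposed.normalize("NFC") === decomposed.normalize("NFC");
const nfdEqual = precomposed.normalize("NFD") === decomposed.normalize("NFD");

// Compatibility forms (NFKC, NFKD) additionally map compatibility characters,
// e.g. the "fi" ligature U+FB01 becomes the two letters "fi".
const ligature = "\uFB01";
const ligatureNfkc = ligature.normalize("NFKC"); // two letters: "fi"
const ligatureNfc = ligature.normalize("NFC");   // still the single ligature

// None of the four forms changes case: "A" and "a" stay distinct.
const caseKept = "A".normalize("NFKC") !== "a".normalize("NFKC");
```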

We did mention additional CF and CFNKFC forms for case folding, but they were not added to the spec. They case-fold a string in a locale-independent way (see http://www.unicode.org/faq/casemap_charprop.html#2).

Should we:

1. Add those two new forms to the spec of the String.prototype.normalize(form) method?
2. Add a new String.prototype.toFoldCase(form) method?
3. Add an Intl.Collator.prototype.sortKey(string) -> string method?

We could do 1 and 3, or 2 and 3, or just 3.

The use case: a user inputs M words, and we want to check whether any of them match N predefined words (say, to trigger an action). With the current Intl.Collator.prototype.compare() we need M×N comparisons. With toFoldCase/sortKey we would need only O(M) lookups in a hash table holding the N keys.

Mihai and I lean towards option 3, because it gives the user more control over what to check. For example, it doesn't make sense to ignore case for locales that have no case distinction, or the user may want to preserve accents in the comparison.
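A rough sketch of the two approaches for the M-words-against-N-words use case. Neither toFoldCase nor sortKey exists, so `foldKey` below is a hypothetical stand-in built from normalize("NFKC") plus toLowerCase(); it only approximates locale-independent case folding, and the word lists are made up for illustration.

```javascript
// Hypothetical stand-in for the proposed toFoldCase/sortKey; not a real API.
// NFKC + toLowerCase only approximates Unicode case folding.
function foldKey(s) {
  return s.normalize("NFKC").toLowerCase();
}

const predefined = ["Start", "Stop", "Pause"];    // N predefined words
const input = ["please", "STOP", "now", "pause"]; // M user-entered words

// Approach A: up to M x N calls to Intl.Collator.prototype.compare.
// sensitivity "accent" ignores case differences but keeps accent differences.
const collator = new Intl.Collator("en", { sensitivity: "accent" });
const matchesByCompare = input.filter((word) =>
  predefined.some((p) => collator.compare(word, p) === 0)
);

// Approach B: compute N keys once, then do M hash lookups.
const index = new Map(predefined.map((p) => [foldKey(p), p]));
const matchesByKey = input.filter((word) => index.has(foldKey(word)));
```

In approach A the collator options decide what counts as a match, which is the kind of control a collator-derived sort key would carry over to approach B.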


# Nebojša Ćirić (12 years ago)

[+mscherer]


# Allen Wirfs-Brock (12 years ago)

Also see String.prototype.toLowerCase: http://people.mozilla.org/~jorendorff/es6-draft.html#sec-string.prototype.tolowercase

In my working draft, the paragraph that immediately follows the algorithm has been modified to read:

> The result must be derived according to the *locale-insensitive* case mappings in the Unicode Character Database (this explicitly includes not only the UnicodeData.txt file, but also *all locale-insensitive mappings in* the SpecialCasings.txt file that accompanies it).

This change is in response to https://bugs.ecmascript.org/show_bug.cgi?id=206

Does this sufficiently cover the locale-independent case folding use case?
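A small sketch of what locale-insensitive mapping means in practice, using only toLowerCase, toUpperCase and toLocaleLowerCase. One caveat, offered as an observation rather than as part of the draft text: lowercasing is a case mapping, not Unicode case folding, so a pair such as "ß" and "SS" is not unified by toLowerCase alone.

```javascript
// Locale-insensitive mappings: the result does not depend on the host locale.
const lowerI = "I".toLowerCase(); // plain "i", even on a Turkish system

// Special-casing entries can change the length of the string:
const dottedI = "\u0130".toLowerCase();  // "i" followed by U+0307 (two code units)
const sharpSUpper = "\u00DF".toUpperCase(); // "SS"

// The locale-sensitive counterpart takes an explicit locale. In engines with
// Intl locale support this is expected to yield dotless "ı" (U+0131) for "tr".
const turkishLowerI = "I".toLocaleLowerCase("tr");

// Lowercasing is not full case folding: "ß" and "SS" remain different keys.
const sameAfterLowercasing = "\u00DF".toLowerCase() === "SS".toLowerCase();
```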


# Nebojša Ćirić (12 years ago)

Having sort keys in the collator would give users more flexibility in comparing strings, but your* approach is good enough for now.

* toUpperCase spec as it stands



2013/10/24 Mihai Niță <mnita at google.com>

> "Does this sufficiently cover the locale independent case folding use
> case?"
> I think it does.
> Mihai