String.prototype.normalize, case folding and sort keys
# Nebojša Ćirić (12 years ago)
[+mscherer]
2013/10/24 Nebojša Ćirić <cira at google.com>

> String.prototype.normalize(form) spec is here - http://people.mozilla.org/~jorendorff/es6-draft.html#sec-string.prototype.normalize. It offers all 4 forms of normalization.
>
> We did mention additional CF and CFNKFC forms for case folding, but they were not added to the spec. They case fold a string in a locale-independent way (see http://www.unicode.org/faq/casemap_charprop.html#2).
>
> Should we:
> 1. Add those two new forms to the spec of String.prototype.normalize(form) method?
> 2. Add a new String.prototype.toFoldCase(form) method?
> 3. Add Intl.Collator.prototype.sortKey(string)->string method?
>
> We could do 1 and 3, or 2 and 3, or just 3.
>
> Use case would be: user inputs M words, and we would like to see if some of them match N predefined words (say to trigger an action). With current Intl.Collator.prototype.compare() we need MxN comparisons. With toFoldCase/sortKey we would need only O(M) queries to the hash with N keys.
>
> Mihai and I lean towards 3. because it gives the user more control over what to check. For example, it doesn't make sense to ignoreCase for locales that don't have case distinction. Or the user may want to preserve accents in the comparison...
>
> --
> Nebojša Ćirić
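The use case in the quoted message can be sketched roughly as follows. Neither toFoldCase nor sortKey exists, so the helper `foldKey` below is hypothetical: it approximates a locale-independent key with `normalize("NFKC")` followed by `toLowerCase()`, which is an assumption made for illustration and not full Unicode case folding. The word lists are likewise invented.

```javascript
// Hypothetical helper: approximates a locale-independent fold key.
// Assumption: NFKC normalization plus toLowerCase() stands in for the
// proposed toFoldCase/sortKey; it is not full Unicode case folding.
function foldKey(s) {
  return s.normalize("NFKC").toLowerCase();
}

// N predefined words, hashed once by their fold key.
const predefined = ["Start", "Stop", "Caf\u00e9"];
const keys = new Set(predefined.map(foldKey));

// M user-supplied words: one hash lookup each, O(M) lookups in total.
function matchByKey(words) {
  return words.filter((w) => keys.has(foldKey(w)));
}

// Baseline with the existing API: each input word may be compared against
// every predefined word, up to M x N calls to compare().
const collator = new Intl.Collator("en", { sensitivity: "accent" });
function matchByCompare(words) {
  return words.filter((w) =>
    predefined.some((p) => collator.compare(w, p) === 0)
  );
}

// "cafe\u0301" is the decomposed spelling of "café".
const input = ["STOP", "cafe\u0301", "pause"];
console.log(matchByKey(input));
console.log(matchByCompare(input));
```

The key-based version trades the collator's tailoring for a single pass over the input, which is the trade-off the message weighs when it prefers a collator-derived sort key over a fixed fold.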
# Allen Wirfs-Brock (12 years ago)
Also see String.prototype.toLowerCase (http://people.mozilla.org/~jorendorff/es6-draft.html#sec-string.prototype.tolowercase).
In my working draft, the paragraph that immediately follows the algorithm has been modified to read:
The result must be derived according to the locale-insensitive case mappings in the Unicode Character Database (this explicitly includes not only the UnicodeData.txt file, but also all locale-insensitive mappings in the SpecialCasings.txt file that accompanies it).
This change is in response to ecmascript#206 (https://bugs.ecmascript.org/show_bug.cgi?id=206).
Does this sufficiently cover the locale independent case folding use case?
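A few spot checks may help show what the quoted paragraph implies in practice. The sample characters are chosen for illustration and do not come from the thread; the sketch assumes an engine that applies the unconditional SpecialCasing mappings as the draft text requires.

```javascript
// Unconditional (locale-insensitive) SpecialCasing mappings: the result
// can be longer than the input.
const upperSharpS = "stra\u00dfe".toUpperCase(); // U+00DF LATIN SMALL LETTER SHARP S
const lowerDottedI = "\u0130".toLowerCase();      // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE

console.log(upperSharpS);         // "STRASSE"
console.log(lowerDottedI.length); // 2: "i" followed by U+0307 COMBINING DOT ABOVE

// ASCII "I" lowercases to "i" regardless of the user's locale;
// locale-sensitive rules (e.g. Turkish) are left to toLocaleLowerCase().
const lowerI = "I".toLowerCase();
console.log(lowerI === "i");      // true
```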
# Nebojša Ćirić (12 years ago)
Having sort keys in the collator would allow users to be more flexible in comparing strings, but your* approach is good enough for now.
* toUpperCase spec as it stands
2013/10/24 Mihai Niță <mnita at google.com>
> "Does this sufficiently cover the locale independent case folding use case?"
> I think it does.
> Mihai
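The flexibility referred to here is already available for pairwise comparison through the existing `sensitivity` option of Intl.Collator; a sort key would presumably expose the same choices as a hashable string. A minimal sketch, assuming an "en" locale and sample words that are not from the thread:

```javascript
// "accent": ignore case but keep accents. "base": ignore both.
const keepAccents = new Intl.Collator("en", { sensitivity: "accent" });
const ignoreAccents = new Intl.Collator("en", { sensitivity: "base" });

// Case differences alone do not distinguish the strings.
const sameIgnoringCase = keepAccents.compare("resume", "RESUME") === 0;

// Accents are preserved as a difference under "accent"...
const accentsStillDiffer =
  keepAccents.compare("resume", "r\u00e9sum\u00e9") !== 0;

// ...and ignored, together with case, under "base".
const baseLettersMatch =
  ignoreAccents.compare("resume", "R\u00c9SUM\u00c9") === 0;

console.log(sameIgnoringCase, accentsStillDiffer, baseLettersMatch);
```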