Locale sensitivity in toLowerCase/toUpperCase

# Norbert Lindenberg (14 years ago)

The specification of String.prototype.toLowerCase in ES 5.1 (which is also referenced in String.prototype.toUpperCase) refers to the Unicode character database for case mappings, explicitly including "not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it in Unicode 2.1.8 and later".

The SpecialCasing.txt file includes not only a large number of locale-insensitive mappings, but also a few locale-sensitive mappings. In particular, the Turkish mappings for the Latin letters "I" and "i" (which map to "ı" (U+0131) and "İ" (U+0130), respectively, in Turkish) have been in the file since Unicode 2.1.8; additional locale-sensitive mappings were added later.

The specification of String.prototype.toLocaleLowerCase in ES 5.1, however, seems to imply that String.prototype.toLowerCase should not use the locale-sensitive mappings: "This function works exactly the same as toLowerCase except that its result is intended to yield the correct result for the host environment's current locale, rather than a locale-independent result. There will only be a difference in the few cases (such as Turkish) where the rules for that language conflict with the regular Unicode case mappings."

Shouldn't the specification for String.prototype.toLowerCase explicitly exclude the locale-sensitive mappings in SpecialCasing.txt?
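
To make the distinction concrete, here is a minimal sketch in JavaScript. It assumes an engine that accepts a locale argument to toLocaleLowerCase and toLocaleUpperCase (something defined later by ECMA-402, not by ES 5.1, where only the host's current locale applies). The helper name compareCasing is hypothetical and used only for illustration.

```javascript
// Hypothetical helper: returns the locale-independent and the
// locale-sensitive case conversions of a string side by side.
// Assumption: the engine supports the optional locale argument.
function compareCasing(text, locale) {
  return {
    lower: text.toLowerCase(),
    localeLower: text.toLocaleLowerCase(locale),
    upper: text.toUpperCase(),
    localeUpper: text.toLocaleUpperCase(locale),
  };
}

// "tr" selects the Turkish mappings discussed above.
const result = compareCasing("Ii", "tr");
console.log(result);
```

In an engine with Turkish case mapping support, localeLower should contain "ı" (U+0131) and localeUpper should contain "İ" (U+0130), while lower and upper stay within ASCII. That difference is what the toLocaleLowerCase wording seems to expect.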

SpecialCasing.txt in Unicode 2.1.8: www.unicode.org/Public/2.1-Update3/SpecialCasing-1.txt

SpecialCasing.txt in Unicode 2.1.9, which corrected the Turkish mapping for "I": www.unicode.org/Public/2.1-Update4/SpecialCasing-2.txt

SpecialCasing.txt in Unicode 6.0: www.unicode.org/Public/6.0.0/ucd/SpecialCasing.txt

Thanks, Norbert

# Mike Samuel (14 years ago)

2011/8/22 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:

> The specification of String.prototype.toLowerCase in ES 5.1 (which is also referenced in String.prototype.toUpperCase) refers to the Unicode character database for case mappings, explicitly including "not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it in Unicode 2.1.8 and later".
>
> The SpecialCasings.txt file includes not only a large number of locale-insensitive mappings, but also a few locale-sensitive mappings. In particular, the Turkish mappings for the Latin letters "I" and "i" (which map to "ı" (U+0131) and "İ" (U+0130) in Turkish) have been in the file since Unicode 2.1.8, while additional ones were added later.

jQuery and a lot of other JS code would break in pretty obvious ways if any major browser did this. code.jquery.com/jquery-1.6.2.js contains

    rscript = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi,

and the "i" flag handling delegates to toUpperCase per 15.10.2.8, so with the Turkish mappings applied it would never actually match "<SCRIPT"; it would instead match only the version with a dotted upper-case "İ" (U+0130).

I seem to remember that some version of Rhino did this, though, delegating both toUpperCase and toLocaleUpperCase to java.lang.String.toUpperCase(), which uses the default locale.
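
The breakage can be sketched roughly as follows. This is a simplified model of the Canonicalize step in 15.10.2.8, not a real regex engine: the helper names canonicalize and matchesIgnoreCase are hypothetical, the comparison is a plain character-by-character literal match, and the Turkish behavior is simulated with toLocaleUpperCase("tr"), which assumes an engine that accepts a locale argument.

```javascript
// Simplified model of Canonicalize from ES 5.1 15.10.2.8, with the
// upper-casing function passed in so the two behaviors can be compared.
function canonicalize(ch, upper) {
  const u = upper(ch);
  // A result longer than one character leaves ch unchanged.
  if (u.length !== 1) return ch;
  // A non-ASCII character never canonicalizes to an ASCII one.
  if (ch.charCodeAt(0) >= 128 && u.charCodeAt(0) < 128) return ch;
  return u;
}

// Hypothetical literal, case-insensitive comparison built on canonicalize.
function matchesIgnoreCase(pattern, input, upper) {
  if (pattern.length !== input.length) return false;
  for (let i = 0; i < pattern.length; i++) {
    if (canonicalize(pattern[i], upper) !== canonicalize(input[i], upper)) {
      return false;
    }
  }
  return true;
}

const invariantUpper = (s) => s.toUpperCase();
// Assumption: simulates a toUpperCase that applied the Turkish mappings.
const turkishUpper = (s) => s.toLocaleUpperCase("tr");

const invariantMatch = matchesIgnoreCase("<script", "<SCRIPT", invariantUpper);
const turkishMatch = matchesIgnoreCase("<script", "<SCRIPT", turkishUpper);
const turkishDottedMatch = matchesIgnoreCase("<script", "<SCR\u0130PT", turkishUpper);
console.log({ invariantMatch, turkishMatch, turkishDottedMatch });
```

Under these assumptions, the locale-independent mapping lets "<script" match "<SCRIPT", while the Turkish mapping canonicalizes "i" to "İ" and "I" to "I", so only the dotted variant matches.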
