Case transformations in strings

# James Graham (17 years ago)

The March 2nd draft has this to say about String.prototype.toLowerCase:

"The following steps are taken:

1. Call CheckObjectCoercible passing the this value as its argument.
2. Let S be the result of calling ToString, giving it the this value as its argument.
3. Let L be a string of the same length as S where each character of L is either the Unicode lowercase equivalent of the corresponding character of S or the actual corresponding character of S if no Unicode lowercase equivalent exists.
4. Return L.

NOTE The result should be derived according to the case mappings in the Unicode character database (this explicitly includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it in Unicode 2.1.8 and later)."

The other algorithms, such as String.prototype.toUpperCase, then refer to this one. However, as far as I can tell, the statement that L is the same length as S is incorrect for many of the mappings listed in SpecialCasing.txt. An obvious example is the German lowercase sharp s under toUpperCase:

"\u00DF".toUpperCase() == "SS"

If the intention is that characters whose mapping would change the string's length are to be mapped to themselves, then the note should say so explicitly. However, since several implementations already seem to return a string of a different length, it would be disappointing if this were the intent.
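To make the length question concrete, here is a minimal sketch. It assumes an engine that applies the full SpecialCasing.txt mappings (the behaviour reported for Safari and Chrome in this thread); an engine that keeps the one-to-one mapping would return the input unchanged. The variable names are illustrative only.

```javascript
// Assumes the engine applies the full (multi-character) Unicode case mappings.
const sharpS = "\u00DF";            // LATIN SMALL LETTER SHARP S
const upper = sharpS.toUpperCase(); // "SS" under the full mapping

// The two lengths differ whenever the multi-character mapping is used.
console.log(sharpS.length, upper.length);

// Another SpecialCasing.txt entry: U+FB01 LATIN SMALL LIGATURE FI.
const ligature = "\uFB01";
const ligatureUpper = ligature.toUpperCase();
console.log(ligatureUpper, ligatureUpper.length);
```

Either outcome is easy to detect at run time by comparing `length` before and after the conversion.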

A further question concerns characters with context-sensitive case mappings. Are implementations expected to apply the context-sensitive case transformation, or to act as if each character appeared in isolation? For example, with Greek capital letter sigma, SpecialCasing.txt suggests:

"\u03A3\u03A3".toLowerCase() == σς, not σσ

V8 is the only implementation I tested that agreed with SpecialCasing.txt here. It would be useful if the spec were explicit about what should happen in these cases.
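The sigma case can be checked with a short sketch like the following. It assumes an engine that applies the Final_Sigma condition from SpecialCasing.txt, as V8 is reported to do above; an engine that maps each character in isolation would produce two medial sigmas instead. Variable names are illustrative.

```javascript
// Assumes the engine applies the Final_Sigma condition from SpecialCasing.txt.
const word = "\u03A3\u03A3";       // two GREEK CAPITAL LETTER SIGMA
const lower = word.toLowerCase();

// Print each code unit in hex so the two sigma forms can be told apart:
// 3c3 is the medial form, 3c2 is the final form.
const units = Array.from(lower, (c) => c.charCodeAt(0).toString(16));
console.log(units);

// A sigma on its own has no preceding cased letter, so the final form does not apply.
const isolated = "\u03A3".toLowerCase();
console.log(isolated.charCodeAt(0).toString(16));
```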

# Allen Wirfs-Brock (17 years ago)

Any input from our other Unicode experts would be appreciated...

Here's what I found (running on Windows Vista):

IE, FF, Opera: "\u00DF".toUpperCase() returns "\u00DF"

Safari, Chrome: "\u00DF".toUpperCase() returns "SS"

It would be interesting if somebody could try the above for FF and Opera on a non-Windows machine to check whether this is a byproduct of using the Windows-provided conversion routines.

Question 1: Is the specified length invariant essential, or just noise in the ES3 spec? If it's not essential, we could eliminate that invariant and say that each S character is replaced in the result by the corresponding character(s) from the Unicode case mappings.

Question 2: If the observed variance is indeed a result of using the Windows mapping, do we really want to require every implementation to provide its own internal mapping data and algorithms (as Safari and Chrome may be doing) if the underlying host is not fully Unicode compliant?

Question 3: Do we need to explicitly provide for some implementation variance here? That appears to be the current reality of the web. Do we want to try to stamp out the variance, or to acknowledge and allow it?

Question 4: Is Chrome correct, and everybody else wrong, with:

"\u03A3\u03A3".toLowerCase() == σς, not σσ

That sounds like a reasonable interpretation of the explicit mention of SpecialCasing.txt in the note (although the note is not normative). If so, step 3 should explicitly say that the translation must be appropriately context sensitive.

Finally, is any of the above going to actually influence anything? If not, maybe carrying the exact ES3 specification forward is OK.

Allen

# Mike Shaver (17 years ago)

On Wed, Mar 4, 2009 at 2:35 PM, Allen Wirfs-Brock <Allen.Wirfs-Brock at microsoft.com> wrote:

> Any input from our other Unicode experts would be appreciated...
>
> Here's what I found (running on Windows Vista):
> IE, FF, Opera
> "\u00DF".toUpperCase() returns "\u00DF"

Same on FF3.1b3 on OS X.

Mike

# Allen Wirfs-Brock (17 years ago)

>-----Original Message-----
>From: es-discuss-bounces at mozilla.org [mailto:es-discuss-bounces at mozilla.org] On Behalf Of James Graham
>...
>A further question concerns characters with context-sensitive case mappings. Are implementations expected to apply the context-sensitive case transformation or act as if each character appeared in isolation? For example with Greek capital letter sigma, SpecialCasings.txt suggests:

The NOTE following toUpperCase (15.5.4.18) says:

NOTE Because both toUpperCase and toLowerCase have context-sensitive behaviour, the functions are not symmetrical. In other words, s.toUpperCase().toLowerCase() is not necessarily equal to s.toLowerCase().

This text is a carry over from ES3 and would seem to imply that context sensitive processing is expected.
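The asymmetry that the NOTE describes can be seen in a small sketch. It assumes an engine that applies both the Final_Sigma condition and the multi-character mappings from SpecialCasing.txt; engines without them may report the two results as equal. Variable names are illustrative.

```javascript
// Assumes context-sensitive (Final_Sigma) and multi-character mappings are applied.

// Context-sensitive case: a medial sigma placed at the end of a word.
const s1 = "a\u03C3";
const roundTrip1 = s1.toUpperCase().toLowerCase(); // the sigma comes back in its final form
const direct1 = s1.toLowerCase();                  // already lowercase, so unchanged
console.log(roundTrip1 === direct1);

// Multi-character case: sharp s expands on the way up and cannot be restored.
const s2 = "\u00DF";
const roundTrip2 = s2.toUpperCase().toLowerCase();
const direct2 = s2.toLowerCase();
console.log(roundTrip2, direct2);
```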

On a related issue, I'm starting to worry that the current specification of both toUpperCase and toLowerCase is problematic given the Unicode-related changes in the ES3.1 spec, which essentially say that strings contain 16-bit Unicode code units (not "Unicode characters" or code points) and that any UTF-16 interpretation of such strings/code units is left to application code. The algorithm step:

3. Let L be a string of the same length as S where each character of L is either the Unicode lowercase equivalent of the corresponding character of S or the actual corresponding character of S if no Unicode lowercase equivalent exists.

seems inadequate in that context. Don't we need to say either that, for the purposes of this translation, the string elements are treated as 16-bit truncated code point values, or alternatively that, for the purposes of these operations, the string is interpreted assuming UTF-16 encoding? (For the first alternative, I'm guessing that there aren't any toUpper/toLower Unicode transformations that require the 16-bit to/from >16-bit code point translations.)

Thoughts?
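One way to probe the difference between the two alternatives is a cased letter outside the 16-bit range. The Deseret block is one such case: U+10400 has the lowercase mapping U+10428, so a conversion that looks only at individual code units cannot find it. The sketch below assumes an engine that decodes surrogate pairs before case mapping, which is the UTF-16 interpretation described above; variable names are illustrative.

```javascript
// Assumes the engine interprets the string as UTF-16 when case mapping.
// U+10400 DESERET CAPITAL LETTER LONG I, written as a surrogate pair.
const capital = "\uD801\uDC00";
const lower = capital.toLowerCase();

console.log(capital.length);                    // two code units for one code point
console.log(lower.codePointAt(0).toString(16)); // the mapped code point, in hex

// A lone surrogate is not a cased character, so it passes through unchanged.
const lone = "\uD801".toLowerCase();
console.log(lone === "\uD801");
```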

Allen

# Brendan Eich (17 years ago)

On Mar 5, 2009, at 10:20 PM, Allen Wirfs-Brock wrote:

> The NOTE following toUpperCase (15.5.4.18) says:
>
> NOTE
> Because both toUpperCase and toLowerCase have context-sensitive behaviour, the functions are not symmetrical. In other words, s.toUpperCase().toLowerCase() is not necessarily equal to s.toLowerCase().
>
> This text is a carry over from ES3 and would seem to imply that context sensitive processing is expected.

IIRC this is merely about characters such as Turkish dotless-I:

js> s = "\u0131"
ı
js> s.toUpperCase()
I
js> s.toUpperCase().charCodeAt(0)
73
js> s.toUpperCase().toLowerCase()
i

/be

# Maciej Stachowiak (17 years ago)

On Mar 4, 2009, at 11:35 AM, Allen Wirfs-Brock wrote:

> Any input from our other Unicode experts would be appreciated...
>
> Here's what I found (running on Windows Vista):
> IE, FF, Opera
> "\u00DF".toUpperCase() returns "\u00DF"
> Safari, Chrome
> "\u00DF".toUpperCase() returns "SS"
>
> It would be interesting if somebody could try the above for FF and Opera on a non-Windows machine to check whether this is a byproduct of using the Windows provided conversion routines.
>
> Question 1: Is the specified length invariant essential or just noise in the ES3 spec. If it's not we could eliminate that invariant and say that each S character is replaced in the result by the corresponding character(s) from the Unicode case mappings.

I don't think the invariant is essential. Or at least, I don't know of
other parts of the spec depending on it, or of Web compatibility
requiring it. Having this requirement prevents doing the right thing
Unicode-wise. I think the spec needs to at least allow doing the right
thing; therefore I think the string length requirement should be
removed.

It may be more problematic at this time to mandate doing the right
thing.

  - Maciej

# Waldemar Horwat (17 years ago)

Allen Wirfs-Brock wrote:

> Any input from our other Unicode experts would be appreciated...
>
> Here's what I found (running on Windows Vista):
> IE, FF, Opera
> "\u00DF".toUpperCase() returns "\u00DF"
> Safari, Chrome
> "\u00DF".toUpperCase() returns "SS"
>
> It would be interesting if somebody could try the above for FF and Opera on a non-Windows machine to check whether this is a byproduct of using the Windows provided conversion routines.
>
> Question 1: Is the specified length invariant essential or just noise in the ES3 spec. If it's not we could eliminate that invariant and say that each S character is replaced in the result by the corresponding character(s) from the Unicode case mappings.
>
> Question 2: If the observed variance is indeed a result of using the Windows mapping do we really want to require every implementation to provide its own internal mappings data and algorithms (as Safari and Chrome may be doing) if the underlying host is not fully Unicode compliant?
>
> Question 3: Do we need to explicitly provide for some implementation variance here. That appears to be the current reality of the web. Do we want to try to stamp out the variance or to acknowledge and allow it.
>
> Question 4: Is Chrome correct with:
> "\u03A3\u03A3".toLowerCase() == σς, not σσ
> And everybody else is wrong? This sounds like a reasonable interpretation of the explicit mention of SpecialCasing.txt in the note (but that the note is not normative). If so, should be explicit mention in step 3 that the translation must be appropriately context sensitive.
>
> Finally, is any of the above going to actually influence anything. If not, maybe carrying the exact ES3 specification forward is ok.

The ES3 specification was written the way it was because converting one character to many during case conversion would be incompatible with regular expressions. The regular expression algorithm refers to String.prototype.toUpperCase.

Waldemar

# David-Sarah Hopwood (17 years ago)

Waldemar Horwat wrote:
> Allen Wirfs-Brock wrote:
>> Any input from our other Unicode experts would be appreciated...
>>
>> Here's what I found (running on Windows Vista):
>> IE, FF, Opera
>> "\u00DF".toUpperCase() returns "\u00DF"
>> Safari, Chrome
>> "\u00DF".toUpperCase() returns "SS"
> [...]
> The reason the ES3 specification was the way it was is because converting one character to many during case conversions would be incompatible with regular expressions. The regular expression algorithm refers to String.prototype.toUpperCase.

If converting one character to many would cause a problem with the reference to toUpperCase in the regular expression algorithm, then presumably Safari and Chrome would hit that problem. Do they, or do they use different uppercase conversions for regexps vs toUpperCase?

If the latter, then we should allow that, and probably require it.

-- 
David-Sarah Hopwood ⚥

# Lasse R.H. Nielsen (17 years ago)

On Tue, 24 Mar 2009 00:50:24 +0100, David-Sarah Hopwood
<david.hopwood at industrial-designers.co.uk> wrote:

> If converting one character to many would cause a problem with the reference to toUpperCase in the regular expression algorithm, then presumably Safari and Chrome would hit that problem. Do they, or do they use different uppercase conversions for regexps vs toUpperCase?

The Regular Expression specification in ES3 doesn't use toUpperCase
directly, but rather the Canonicalize helper function (15.10.2.8). It states:

  2. Let u be ch converted to upper case as if by calling String.prototype.toUpperCase on the one-character string ch.
  3. If u does not consist of a single character, return ch.

I.e., it uses a different algorithm for regexps than for strings. (It also prevents non-ASCII characters from canonicalizing to ASCII
characters.)

> If the latter, then we should allow that, and probably require it.

It's allowed, and required, already, so that's an easy fix :)
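The Canonicalize steps can be sketched roughly as follows. This is an illustration rather than the spec text: the function name `canonicalize` and its parameters are made up for the example, `ch` is assumed to be a single code unit, and the remaining steps (returning `ch` when case is not ignored, the ASCII guard of step 5, and returning the uppercase character otherwise) are filled in from ES3 15.10.2.8 as summarized in this thread.

```javascript
// Illustrative sketch of ES3 Canonicalize (15.10.2.8); names are hypothetical.
// `ch` is assumed to be a one-code-unit string.
function canonicalize(ch, ignoreCase) {
  // Step 1: without the ignore-case flag, nothing is changed.
  if (!ignoreCase) return ch;
  // Step 2: uppercase as if by String.prototype.toUpperCase.
  const u = ch.toUpperCase();
  // Step 3: reject results that are not a single character (for example "SS").
  if (u.length !== 1) return ch;
  // Step 5: a non-ASCII character must not canonicalize to an ASCII one.
  if (ch.charCodeAt(0) >= 128 && u.charCodeAt(0) < 128) return ch;
  // Step 6: otherwise use the uppercase character.
  return u;
}

console.log(canonicalize("a", true));      // ordinary ASCII letter
console.log(canonicalize("\u00DF", true)); // sharp s: multi-character result is rejected
console.log(canonicalize("\u0131", true)); // dotless i: ASCII result is rejected
```

Under this reading, two characters match case-insensitively only when their canonicalized forms are identical, which is why a multi-character toUpperCase result never reaches the matcher.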

/Lasse

# David-Sarah Hopwood (17 years ago)

Christian Plesner Hansen wrote:
> David-Sarah Hopwood wrote:
>> If converting one character to many would cause a problem with the reference to toUpperCase in the regular expression algorithm, then presumably Safari and Chrome would hit that problem. Do they, or do they use different uppercase conversions for regexps vs toUpperCase?
>
> Chrome uses context (but not locale) sensitive special casing for ordinary toUpperCase. For regexps it uses the same mapping but doesn't convert chars that map to more than one char and non-ascii chars that would have converted to ascii chars. We would have liked to use the full multi-character mapping without the exception for ascii but couldn't for compatibility reasons.

Can you expand on what the compatibility problem was for non-ASCII -> ASCII mappings in regexps?

-- 
David-Sarah Hopwood ⚥

# David-Sarah Hopwood (17 years ago)

David-Sarah Hopwood wrote:
> Christian Plesner Hansen wrote:
>> David-Sarah Hopwood wrote:
>>> If converting one character to many would cause a problem with the reference to toUpperCase in the regular expression algorithm, then presumably Safari and Chrome would hit that problem. Do they, or do they use different uppercase conversions for regexps vs toUpperCase?
>> Chrome uses context (but not locale) sensitive special casing for ordinary toUpperCase. For regexps it uses the same mapping but doesn't convert chars that map to more than one char and non-ascii chars that would have converted to ascii chars. We would have liked to use the full multi-character mapping without the exception for ascii but couldn't for compatibility reasons.
>
> Can you expand on what the compatibility problem was for non-ASCII -> ASCII mappings in regexps?

Oh, never mind -- this is required by step 5 of Canonicalize in section 15.10.2.8.

So, there would be no regexp-related problems with requiring toUpperCase to perform multi-code-unit and/or context-sensitive mappings in ES3.1.
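The independence of the two behaviours can be observed directly. The sketch below assumes an engine where toUpperCase applies the full mappings while case-insensitive regexps (without any Unicode flag) follow Canonicalize, which is the combination described for Chrome above; variable names are illustrative.

```javascript
// Assumes full case mappings in toUpperCase and Canonicalize-style matching
// for regexps that use only the i flag.
const upper = "\u00DF".toUpperCase();         // multi-character result from toUpperCase
const matchesSS = /\u00DF/i.test("SS");       // the regexp does not use that result
const matchesI = /\u0131/i.test("I");         // dotless i is kept away from ASCII I
const matchesE = /\u00E9/i.test("\u00C9");    // an ordinary one-to-one pair still matches

console.log(upper, matchesSS, matchesI, matchesE);
```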

-- 
David-Sarah Hopwood ⚥