Unicode normalization

# Norbert Lindenberg (13 years ago)

The ECMAScript Language Specification 5.1 makes assumptions about source text being in Unicode normalization form C (NFC), but doesn't say anything that would actually make it so. Implementations, as far as I can tell, have also chosen to just "assume". This is partially based on the Character Model for the World Wide Web: Normalization, which recommends early normalization to NFC, but never became a standard.

I'm proposing to correct this by
- removing the invalid assumptions from the specification,
- adding a normalization function so that applications can normalize text where needed.

http://wiki.ecmascript.org/doku.php?id=strawman:unicode_normalization

Comments?

Regards,
Norbert
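
To illustrate the gap the proposal addresses: nothing in the language forces strings into NFC, so canonically equivalent text can compare unequal. The sketch below assumes an engine that provides `String.prototype.normalize`, the method later standardized in ECMAScript 2015; at the time of this thread it existed only as a strawman. The variable names are illustrative.

```javascript
// Two canonically equivalent spellings of "é", written with escapes so the
// example does not depend on how this file itself was saved.
const composed = "\u00E9";    // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const decomposed = "e\u0301"; // U+0065 followed by U+0301 COMBINING ACUTE ACCENT

// Strings are compared code unit by code unit, so these are different values.
console.log(composed === decomposed);            // false
console.log(composed.length, decomposed.length); // 1 2

// Normalizing both sides to NFC makes them compare equal.
// (Assumes String.prototype.normalize is available, i.e. ES2015 or later.)
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true
```

Whether a given string literal arrives composed or decomposed depends on the editor and toolchain that produced the source file, which is why the specification cannot simply assume NFC.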

# Mark Davis ☕ (13 years ago)

This is for v2, right?

Mark

# Norbert Lindenberg (13 years ago)

This is for the Language Specification, not the Internationalization API Specification.

The assumptions are in the Language Specification, so they have to be addressed there.

A normalization API can live in the Language Specification or in the Internationalization API. If we keep it simple (as this one function), then I think it can easily be added to String.prototype. More fine-grained functionality (like in ICU) would have to go into the Internationalization API (v2). The two are not mutually exclusive.

Norbert
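
As a rough picture of how small the "one function" API surface is, the sketch below uses `String.prototype.normalize` as it was later standardized in ECMAScript 2015, which may differ in detail from the strawman discussed here. The `codePoints` helper is hypothetical and exists only to make the results visible.

```javascript
// One method; the normalization form is selected by name.
const aRing = "\u00C5";    // Å as a single precomposed code point
const ligature = "\uFB01"; // ﬁ, a compatibility ligature

// Hypothetical helper: list a string's code points as 4-digit hex.
function codePoints(s) {
  return Array.from(s, (ch) =>
    ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

console.log(codePoints(aRing.normalize("NFD")));     // 0041 030A
console.log(codePoints(aRing.normalize("NFC")));     // 00C5
console.log(codePoints(ligature.normalize("NFC")));  // FB01
console.log(codePoints(ligature.normalize("NFKC"))); // 0066 0069

// An unknown form name is rejected with a RangeError.
let rejected = false;
try {
  "a".normalize("NFX");
} catch (e) {
  rejected = e instanceof RangeError;
}
console.log(rejected); // true
```

Anything finer-grained, such as quick checks or incremental normalization of the kind ICU offers, is outside what this single method provides.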

# Erik Corry (13 years ago)

2012/5/30 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:
> This is for the Language Specification, not the Internationalization API Specification.
>
> The assumptions are in the Language Specification, so they have to be addressed there.
>
> A normalization API can live in the Language Specification or in the Internationalization API. If we keep it simple (as this one function), then I think it can easily be added to String.prototype. More fine-grained functionality (like in ICU) would have to go into the Internationalization API (v2). The two are not mutually exclusive.

Having read through the normalization spec I don't agree that it is
simple.  I would suggest that this is more appropriately placed in the
internationalization API than in the core language.

Since concatenating two long canonicalized strings to make a new
canonicalized string is much faster than first concatenating, then
renormalizing, perhaps a method should be provided for that combined
concat-giving-a-normalized-result operation.

-- 
Erik Corry
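
A small sketch of why such an operation would be needed at all: joining two strings that are each in NFC does not necessarily produce an NFC string. The `concatNormalized` function below is a hypothetical name and is the naive form (concatenate, then renormalize everything); the faster variant suggested above would limit the work to the region around the join, which is not attempted here. `String.prototype.normalize` is assumed to be available (ES2015 or later).

```javascript
// Naive normalizing concatenation: join first, then renormalize the whole result.
// Hypothetical helper, not an existing built-in.
function concatNormalized(a, b, form = "NFC") {
  return (a + b).normalize(form);
}

// True when a string is already in NFC.
const isNfc = (s) => s === s.normalize("NFC");

const left = "cafe";    // already NFC
const right = "\u0301"; // a lone COMBINING ACUTE ACCENT, also already NFC

console.log(isNfc(left), isNfc(right)); // true true
console.log(isNfc(left + right));       // false: "e" + U+0301 composes to U+00E9
console.log(concatNormalized(left, right) === "caf\u00E9"); // true
```

The cost of the naive form grows with the total length of both inputs, even though only the characters near the boundary can change.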

# Norbert Lindenberg (13 years ago)

On May 29, 2012, at 23:45, Erik Corry wrote:

> 2012/5/30 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:
>> This is for the Language Specification, not the Internationalization API Specification.
>> 
>> The assumptions are in the Language Specification, so they have to be addressed there.
>> 
>> A normalization API can live in the Language Specification or in the Internationalization API. If we keep it simple (as this one function), then I think it can easily be added to String.prototype. More fine-grained functionality (like in ICU) would have to go into the Internationalization API (v2). The two are not mutually exclusive.
> 
> Having read through the normalization spec I don't agree that it is
> simple.  I would suggest that this is more appropriately placed in the
> internationalization API than in the core language.

I meant "simple" in terms of API surface. Implementing normalization is not simple, but you'd most likely rely on existing implementations such as ICU.

> Since concatenating two long canonicalized strings to make a new
> canonicalized string is much faster than first concatenating, then
> renormalizing, perhaps a method should be provided for that combined
> concat-giving-a-normalized-result operation.

Such a method would be useful in an environment where strings are kept normalized all the time. However, keeping strings normalized requires attention to many more operations; the draft Character Model for the World Wide Web 1.0: Normalization [1] has a discussion. My impression is that not much software is designed to keep strings normalized throughout. Instead, you have to normalize them at the point where normalization actually matters: as part of, or in preparation for, text comparison operations.

Norbert
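
Normalizing at the point of comparison, rather than keeping every string normalized, could look roughly like the sketch below. The function names and sample data are hypothetical, and `String.prototype.normalize` is assumed to be available (ES2015 or later).

```javascript
// Compare two strings for canonical equivalence by normalizing both operands
// only at the moment of comparison.
function canonicallyEqual(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Search a list whose entries may be stored in any normalization form.
function findCanonical(list, needle) {
  const target = needle.normalize("NFC");
  return list.findIndex((item) => item.normalize("NFC") === target);
}

// Stored decomposed, exactly as received; nothing was normalized on input.
const names = ["Zoe\u0308", "Ame\u0301lie"];

console.log(canonicallyEqual("Am\u00E9lie", names[1])); // true
console.log(findCanonical(names, "Zo\u00EB"));          // 0
console.log(names.indexOf("Zo\u00EB"));                 // -1 (plain comparison misses it)
```

The stored strings are left untouched, so operations such as concatenation or slicing elsewhere in the program never have to preserve normalization.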

# Gillam, Richard (13 years ago)

Early normalization never became a standard? Boy, I've been away from this stuff for too long… Does this mean that all the other Web standards that were going to adhere to early normalization do something similar to what you're proposing? Or do they also "just assume," even though it never became official?

Personally, I'm not a fan of having to worry about this, but maybe we have no choice.

--Rich Gillam
  Lab126
