Unicode normalization
This is for v2, right?
Mark <plus.google.com/114199149796022210033>
"Il meglio è l’inimico del bene"
This is for the Language Specification, not the Internationalization API Specification.
The assumptions are in the Language Specification, so they have to be addressed there.
A normalization API can live in the Language Specification or in the Internationalization API. If we keep it simple (as this one function), then I think it can easily be added to String.prototype. More fine-grained functionality (like in ICU) would have to go into the Internationalization API (v2). The two are not mutually exclusive.
Norbert
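For illustration only: the single-function API surface discussed above can be tried in runtimes that provide `String.prototype.normalize`, which was standardized later, in ECMAScript 2015, with an optional form argument. The sketch below assumes such a runtime; `sample` and `codePoints` are hypothetical names introduced just for this example.

```javascript
// "fi" ligature (U+FB01), then "e" followed by a combining acute accent (U+0301).
const sample = "\uFB01e\u0301";

// Hypothetical display helper: list the code points of a string in hex.
const codePoints = (str) =>
  Array.from(str, (ch) => ch.codePointAt(0).toString(16));

// The canonical forms (NFC, NFD) keep the ligature; the compatibility
// forms (NFKC, NFKD) replace it with plain "f" + "i".
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  console.log(form, codePoints(sample.normalize(form)));
}

// Calling normalize() with no argument uses NFC.
console.log(sample.normalize() === sample.normalize("NFC"));
```

One method with a form parameter covers all four Unicode normalization forms, which is the sense in which the API surface stays small even though the implementation behind it is not.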
2012/5/30 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:
> This is for the Language Specification, not the Internationalization API Specification.
> The assumptions are in the Language Specification, so they have to be addressed there.
> A normalization API can live in the Language Specification or in the Internationalization API. If we keep it simple (as this one function), then I think it can easily be added to String.prototype. More fine-grained functionality (like in ICU) would have to go into the Internationalization API (v2). The two are not mutually exclusive.
Having read through the normalization spec I don't agree that it is simple. I would suggest that this is more appropriately placed in the internationalization API than in the core language.
Since concatenating two long canonicalized strings to make a new canonicalized string is much faster than first concatenating, then renormalizing, perhaps a method should be provided for that combined concat-giving-a-normalized-result operation.
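To make the concatenation point concrete: joining two strings that are each in NFC does not necessarily give an NFC string, because a combining mark at the start of the second string can compose with the end of the first. The sketch below assumes a runtime with `String.prototype.normalize` (standardized later, in ECMAScript 2015). `concatNFC` is a hypothetical name, and this naive version renormalizes the whole result; the faster method suggested above would only need to reprocess the characters around the seam.

```javascript
// Each piece is already in NFC on its own.
const left = "e";        // plain letter
const right = "\u0301";  // lone combining acute accent

// Hypothetical helper: concatenate, then renormalize everything.
// An optimized version would only re-examine text near the join point.
function concatNFC(a, b) {
  return (a + b).normalize("NFC");
}

const plain = left + right;              // "e" + U+0301, two code units
const composed = concatNFC(left, right); // U+00E9, one code unit

console.log(plain === plain.normalize("NFC")); // is plain concatenation still NFC?
console.log(plain.length, composed.length);
```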
On May 29, 2012, at 23:45, Erik Corry wrote:
> 2012/5/30 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:
>> This is for the Language Specification, not the Internationalization API Specification.
>> The assumptions are in the Language Specification, so they have to be addressed there.
>> A normalization API can live in the Language Specification or in the Internationalization API. If we keep it simple (as this one function), then I think it can easily be added to String.prototype. More fine-grained functionality (like in ICU) would have to go into the Internationalization API (v2). The two are not mutually exclusive.
> Having read through the normalization spec I don't agree that it is simple. I would suggest that this is more appropriately placed in the internationalization API than in the core language.
I meant "simple" in terms of API surface. Implementing normalization is not simple, but you'd most likely rely on existing implementations such as ICU.
> Since concatenating two long canonicalized strings to make a new canonicalized string is much faster than first concatenating, then renormalizing, perhaps a method should be provided for that combined concat-giving-a-normalized-result operation.
Such a method would be useful in an environment where strings are kept normalized all the time. However, keeping strings normalized requires attention to many more operations; the draft Character Model for the World Wide Web 1.0: Normalization [1] has a discussion. My impression is that not much software is designed to keep strings normalized throughout. Instead, you have to normalize them at the point where normalization actually matters: as part of, or in preparation for, text comparison operations.
Norbert
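A minimal sketch of normalizing only where it matters, at comparison time, rather than keeping every string normalized throughout. It assumes a runtime with `String.prototype.normalize` (standardized later, in ECMAScript 2015); `canonicallyEqual`, `names`, and `unique` are hypothetical names used only here.

```javascript
// Hypothetical helper: compare two strings for canonical equivalence by
// normalizing both sides at the point of comparison.
function canonicallyEqual(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// The same name, once with precomposed "é" and once with "e" + combining acute.
const names = ["Am\u00E9lie", "Ame\u0301lie"];

console.log(names[0] === names[1]);               // raw code-unit comparison
console.log(canonicallyEqual(names[0], names[1])); // comparison after normalization

// Preparation for comparison: normalize keys before de-duplicating.
const unique = new Set(names.map((name) => name.normalize("NFC")));
console.log(unique.size);
```

The stored strings stay exactly as they arrived; only the comparison and the lookup keys see normalized text.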
Early normalization never became a standard? Boy, I've been away from this stuff for too long… Does this mean that all the other Web standards that were going to adhere to early normalization do something similar to what you're proposing? Or do they also "just assume," even though it never became official?
Personally, I'm not a fan of having to worry about this, but maybe we have no choice.
--Rich Gillam
Lab126
The ECMAScript Language Specification 5.1 makes assumptions about source text being in Unicode normalization form C (NFC), but doesn't say anything that would actually make it so. Implementations, as far as I can tell, have also chosen to just "assume". This is partially based on the Character Model for the World Wide Web: Normalization, which recommends early normalization to NFC, but never became a standard.
I'm proposing to correct this by
strawman:unicode_normalization
Comments?
Norbert