Stupid i18n use cases question
There are really 5 cases at issue:
1. Code point breaks
2. Grapheme-cluster breaks (with three possible variants: 'legacy', extended, and aksara; see www.unicode.org/glossary/#aksara)
3. Word breaks
4. Line breaks
5. Sentence breaks
Notes:
- #1 is pretty trivial to do right in ES.
- The others can be done in ES, but the code is more complicated -- the biggest issue is that they require a download of a possibly substantial amount of data. For certain languages, #3 requires considerable code and data.
- Word breaks are different from line breaks; the latter are the points where you can wrap a line, and the text between two of them may include more than a word or end in the middle of a word.
- For examples, see unicode.org/cldr/utility/breaks.jsp.
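To illustrate why #1 is the easy case, here is a minimal sketch (not from the thread) that lists the code point boundaries of a string by stepping over surrogate pairs. The function name `codePointBreaks` is made up for illustration, and `String.prototype.codePointAt` is an ES2015 addition that postdates this discussion.

```javascript
// Hypothetical helper: returns the UTF-16 offsets at which a code point
// boundary falls, including offset 0 and the end of the string.
function codePointBreaks(text) {
  const breaks = [0];
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i);
    // Code points above U+FFFF occupy two UTF-16 code units (a surrogate pair).
    i += cp > 0xffff ? 2 : 1;
    breaks.push(i);
  }
  return breaks;
}

// "a", then U+1F600 (two code units), then "b"
console.log(codePointBreaks("a\u{1F600}b")); // [ 0, 1, 3, 4 ]
```

No data tables are needed here, which is what separates this case from the other four.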
I don't know about the specific use cases that Jungshik had in mind, but if you are doing client-side word-processing in ES (which various software does, including ours), then you want all of these, except perhaps #5. For example, a double-click uses #3.
There are other use cases for #4 besides word processing; for example, to break up long SMS messages, we break at line boundaries. I'm not saying that someone has to do this in ES; I'm just giving an example outside of the word-processing domain.
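A rough sketch of the SMS example, under loud assumptions: it relies on `Intl.Segmenter`, an API that did not exist when this thread was written, and because that API offers no line-break granularity, word boundaries are used here as a stand-in for true line-break opportunities. The name `splitMessage` and the `maxUnits` parameter (counted in UTF-16 code units, which is not how real SMS limits are defined) are hypothetical.

```javascript
// Hypothetical helper: splits text into parts of at most maxUnits UTF-16
// code units, cutting only at word-segment boundaries.
// Assumption: word boundaries approximate line-break opportunities.
function splitMessage(text, maxUnits) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "word" });
  const parts = [];
  let current = "";
  for (const { segment } of segmenter.segment(text)) {
    // Start a new part when adding this segment would exceed the limit.
    if (current.length > 0 && current.length + segment.length > maxUnits) {
      parts.push(current);
      current = "";
    }
    // A single segment longer than maxUnits is kept whole (oversized part).
    current += segment;
  }
  if (current.length > 0) parts.push(current);
  return parts;
}

console.log(splitMessage("hello world foo", 8)); // [ 'hello ', 'world ', 'foo' ]
```

Joining the parts gives back the original text, so nothing is lost at the cut points.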
Mark
— Il meglio è l’inimico del bene —
For grapheme clusters, the use cases are pretty straightforward. Here are a few right off the top of my head:
- You have a field that is limited in length and you wish to truncate the text automatically (perhaps appending an ellipsis). You want to break the text on a grapheme boundary so that, for example, text in Indic scripts is not cut in the middle of a syllable.
- You wish to scroll some text a grapheme at a time.
- You wish to select a bit of text, move the cursor, or perform any number of other visual text manipulations. For most of these you want grapheme-boundary control (not character-level and definitely not code-point-level).
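The truncation case can be sketched as follows. This is an illustration only, assuming an engine that provides `Intl.Segmenter` (an API added long after this thread); the function name `truncateGraphemes` and its parameters are hypothetical. The segmenter uses the engine's extended grapheme clusters, which may not match aksara boundaries for every Indic script.

```javascript
// Hypothetical helper: keeps at most maxGraphemes grapheme clusters and
// appends an ellipsis only if something was cut off.
function truncateGraphemes(text, maxGraphemes, ellipsis = "\u2026") {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const { index } of segmenter.segment(text)) {
    // index is the UTF-16 offset where this grapheme cluster starts,
    // so slicing there never splits a cluster.
    if (count === maxGraphemes) {
      return text.slice(0, index) + ellipsis;
    }
    count += 1;
  }
  return text; // The text already fits; nothing to truncate.
}

// "e" + combining acute accent counts as one grapheme cluster.
console.log(truncateGraphemes("e\u0301clair", 1) === "e\u0301\u2026"); // true
```

Cutting at a code unit or code point offset instead could separate the accent from its base letter.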
I’m sure Norbert is dying to share more use cases: we had scads of them when I was at Yahoo.
For line breaking, the main thing would be to preformat some text. We (the Lab126 we) use line and word breaking all the time.
Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)
Internationalization is not a feature. It is an architecture.
From: Mark Davis ☕ [mailto:mark at macchiato.com]
Sent: Saturday, January 29, 2011 11:36 AM
To: Shawn Steele
Cc: es-discuss at mozilla.org
Subject: Re: Stupid i18n use cases question
On Sat, Jan 29, 2011 at 10:25, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
On the phone yesterday we mentioned word/line breaking and grapheme clusters. It didn't occur to me to ask about the use cases.
Why does someone need word/line breaking in js? It seems like that would better be done by my rendering engine, like the HTML layout engine or my edit control or something?
-Shawn
blogs.msdn.com/shawnste
I realize what line breaking's for, but I didn't think that would often be done in JavaScript. You "preformat some text" in JavaScript?
-Shawn
blogs.msdn.com/shawnste
On Sat, Jan 29, 2011 at 9:24 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
I realize what line breaking's for, but I didn't think that would often be done in JavaScript. You "preformat some text" in JavaScript?
Yeah, for use in SVG or rendering atop canvas, for example.
Mike
I agree with you, the optimized C++ layout engine in every browser is where the line-breaking action is -- today.
But consider projects such as Bespin, or 280 North's Cappuccino, not to mention systems such as GWT and OpenLaszlo. Layout happens in JS too.
In general with high performance VMs we are seeing more tasks formerly hosted by the C++ browser codebase being hosted in JS. It's a big, long-length wave of the future.
On Jan 29, 2011 8:37 PM, "Mark Davis ☕" <mark at macchiato.com> wrote:
The others can be done in ES, but the code is more complicated -- the biggest issue is that they require a download of a possibly substantial amount of data. For certain languages, #3 requires considerable code and data.
The argument that large amounts of data must be downloaded for one language can't be used to argue that users should be forced to download that data for all languages in the world. The alternative, that the browser make use of data from the OS, is a fragmentation and testability nightmare.
Fonts have similar issues. In that case we are moving to downloadable fonts. That seems like the right way to go for I18n data too. Issues of the cacheability of large font and i18n data are important but not in the scope of ES.
Moving to a downloadable I18n data architecture also solves the collation order issues mentioned by Shawn recently where the front end and back end disagree on collation due to all the issues he mentioned. All those issues apply to testing and the homogeneity and testability of the web platform.
Downloading the data is insufficient for collation; you'd also have to ensure that the code processing the data is v1.0 or 1.1 or X.X, and that there weren't any errors or discrepancies between implementations. I think you'd quickly discover that this isn't possible to guarantee. Even if everyone agreed to use ICU and the UCA, there'd be lots of differences. Also: who's going to collect (and provide) the data to be downloaded? What's the fallback when the data isn't available?
I'm still trying to grok "word processing in JavaScript" (beyond the simple case). For sorting, however, I think it's way better to provide an architecture that works with the understanding that collation can't be consistent between machines, at least for the foreseeable future.
-Shawn
blogs.msdn.com/shawnste
If downloading the data is insufficient, then download the code too (in JS).
Responsibility for correctness lies with the webpage authors.
The alternative is a new way to make Windows-only webpages, or even pages that only work in one version of Windows. These are exactly the same issues as with fonts, which also contain code and data. If fonts can be made downloadable, then so can locale data.
Windows and the others don't even agree on the names of the locales. And it goes downhill from there.
On Jan 30, 2011 7:32 PM, "Shawn Steele" <Shawn.Steele at microsoft.com> wrote:
From: es-discuss-bounces at mozilla.org [es-discuss-bounces at mozilla.org] on behalf of Erik Corry [erik.corry at gmail.com]
Sent: Sunday, January 30, 2011 12:32 AM
To: Mark Davis ☕
Cc: Mads Ager; Shawn Steele; es-discuss at mozilla.org
Subject: Re: Stupid i18n use cases question
"Names of the locales"? (We seem to be digressing somewhat).
Downloading the code too is how jQuery's i18n works.
-Shawn
blogs.msdn.com/shawnste
On Sun, Jan 30, 2011 at 10:32 AM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
I'm still trying to grok "word processing in JavaScript" (beyond the simple case)
What's to grok? Microsoft is putting word processors on the web, even. They don't want to go back to the server for all processing (word breaking, etc.) on entered text, I'm sure, and I doubt anyone else will want to either.
Maybe you could be more explicit about what it is about the implementation language being JavaScript (which must currently include server-side applications like Node, of course) that confuses you, or why you think some specific application is inappropriate to implement in JS.
I'm doing simple stemming and word-breaking by hand in JS on both server and client side, to facilitate "fuzzy" search over a moderate corpus. Making it work in non-English languages is a daunting prospect both implementation-wise and in terms of finding expertise in all relevant languages, and I don't think it should be that way.
Mike
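As an illustration of the word-breaking step in such a search pipeline, here is a minimal sketch. It is not the code described above; it assumes an engine with `Intl.Segmenter` (which did not exist at the time of this thread), and the name `tokenizeForSearch` is made up. Stemming is left out entirely.

```javascript
// Hypothetical helper: splits text into lowercase word tokens using the
// engine's locale-aware word segmentation, dropping spaces and punctuation.
function tokenizeForSearch(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const tokens = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (isWordLike) {
      tokens.push(segment.toLocaleLowerCase(locale));
    }
  }
  return tokens;
}

console.log(tokenizeForSearch("Hello, World!", "en")); // [ 'hello', 'world' ]
```

The point of delegating to the engine is that the per-language dictionaries and rules come with it, rather than being written by hand for each language.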