Stupid i18n use cases question

# Shawn Steele (15 years ago)

On the phone yesterday we mentioned word/line breaking and grapheme clusters. It didn't occur to me to ask about the use cases.

Why does someone need word/line breaking in js? It seems like that would better be done by my rendering engine, like the HTML layout engine or my edit control or something?

-Shawn

  blogs.msdn.com/shawnste

# Mark Davis ☕ (15 years ago)

There are really 5 cases at issue:

1. Code point breaks
2. Grapheme-Cluster breaks (with three possible variants: 'legacy', extended, and aksara <http://www.unicode.org/glossary/#aksara>)
3. Word breaks
4. Line breaks
5. Sentence breaks

Notes:

- #1 is pretty trivial to do right in ES.
- The others can be done in ES, but the code is more complicated -- the biggest issue is that they require a download of a possibly substantial amount of data. For certain languages, #3 requires considerable code and data.
- Word breaks are different from line breaks; the latter are the points where you can wrap a line, which may include more than a word or come in the middle of a word.
- For examples, see http://unicode.org/cldr/utility/breaks.jsp.

I don't know about the specific use cases that Jungshik had in mind, but if you are doing client-side word-processing in ES (which various software does, including ours), then you want all of these, except perhaps #5. For example, a double-click uses #3.

There are other use cases for #4 besides word processing; for example, to break up long SMS messages, we break at line boundaries. I'm not saying that someone has to do this in ES; just giving an example outside of the word-processing domain.

Mark

— Il meglio è l’inimico del bene —
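
As a rough illustration of cases #1, #2, #3 and #5 in the list above, here is a minimal sketch in present-day JavaScript. It assumes an engine that ships `Intl.Segmenter`, an API that did not exist when this thread was written; the helper names `codePoints`, `segments` and `words` are made up for this example. `Intl.Segmenter` offers grapheme, word and sentence granularity but no line granularity, so case #4 is not covered here.

```javascript
// Assumption: the engine provides Intl.Segmenter (it postdates this thread).

// Case #1: string iteration walks code points, so a surrogate pair stays together.
function codePoints(text) {
  return Array.from(text);
}

// Cases #2, #3 and #5: granularity is "grapheme", "word" or "sentence".
function segments(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Case #3 as a double-click would want it: only the word-like segments,
// skipping whitespace and punctuation.
function words(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike)
    .map((s) => s.segment);
}

// "e" + combining acute accent, a space, and an emoji outside the BMP.
const sample = "e\u0301 \u{1F600}";
console.log(sample.length);                       // 5 UTF-16 code units
console.log(codePoints(sample).length);           // 4 code points
console.log(segments(sample, "grapheme").length); // 3 grapheme clusters
console.log(words("Hello world. Bye."));
console.log(segments("Hello world. Bye.", "sentence"));
```

The three counts for `sample` differ, which is the gap between "character", code point and grapheme cluster that the list is pointing at. The segmentation rules and data come from the engine here, which is one way around the download cost mentioned in the notes.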

# Phillips, Addison (15 years ago)

For grapheme clusters, the use cases are pretty straightforward. Here are a few right off the top of my head:

- You have a field that is limited in length and you wish to truncate the text automatically (perhaps appending an ellipsis). You want to break the text on a grapheme boundary so that, in scripts such as the Indic ones, you don't split a syllable.
- You wish to scroll some text a grapheme at a time.
- You wish to select a bit of text, move the cursor, or perform any number of other visual text manipulations. For most of these you want grapheme-boundary control (not character-level and definitely not code-point-level).

I’m sure Norbert is dying to share more use cases: we had scads of them when I was at Yahoo.

For line breaking, the main thing would be to preformat some text. We (the Lab126 we) use line and word breaking all the time.

Addison Phillips, Globalization Architect (Lab126), Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature. It is an architecture.
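
The truncation use case above can be sketched as follows. This is only an illustration under the assumption that the engine ships `Intl.Segmenter` (which postdates this thread); the function name `truncateGraphemes` and its parameters are hypothetical, and the limit counts grapheme clusters of the text, not the appended ellipsis.

```javascript
// Assumption: Intl.Segmenter is available. Its default granularity is also
// "grapheme"; it is spelled out here for clarity.
function truncateGraphemes(text, maxGraphemes, ellipsis = "\u2026", locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  const clusters = Array.from(segmenter.segment(text), (s) => s.segment);
  if (clusters.length <= maxGraphemes) {
    return text; // already short enough, nothing to cut
  }
  // Cut between clusters, never inside one.
  return clusters.slice(0, maxGraphemes).join("") + ellipsis;
}

const word = "cafe\u0301s"; // "cafés" written with a combining acute accent

// Code-unit truncation drops the accent that belongs to the "e".
console.log(word.slice(0, 4) + "\u2026");
// Grapheme truncation keeps "e" and its accent together.
console.log(truncateGraphemes(word, 4));
```

How Indic conjuncts are grouped depends on the Unicode version the engine implements, so the exact cut points for such text may vary between engines.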

# Shawn Steele (15 years ago)

I realize what line breaking's for, but I didn't think that would often be done in JavaScript. You "preformat some text" in JavaScript?

-Shawn

  blogs.msdn.com/shawnste

# Mike Shaver (15 years ago)

On Sat, Jan 29, 2011 at 9:24 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:

> I realize what line breaking's for, but I didn't think that would often be done in JavaScript. You "preformat some text" in JavaScript?

Yeah, for use in SVG or rendering atop canvas, for example.

Mike
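
To make the canvas case concrete: a script that draws text itself has to decide where each line ends. Below is a minimal greedy-wrap sketch. It approximates line-break opportunities by allowing a break only before a word-like segment from `Intl.Segmenter` (assumed available; it postdates this thread), which is not the same as real line breaking, as noted earlier in the thread. The names `breakableChunks`, `wrapText` and the `measure` parameter are hypothetical; in a browser `measure` could be `(s) => ctx.measureText(s).width`, while the default here just counts code units so the sketch runs anywhere.

```javascript
// Group segments so that a break is only allowed before a word-like segment;
// trailing spaces and punctuation stay attached to the preceding word.
// This is an approximation: opening punctuation such as "(" ends up on the
// wrong side of the break.
function breakableChunks(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const chunks = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike || chunks.length === 0) {
      chunks.push(segment);
    } else {
      chunks[chunks.length - 1] += segment;
    }
  }
  return chunks;
}

// Greedy wrapping: keep adding chunks to the line until the next one would
// exceed maxWidth, then start a new line. A single chunk wider than maxWidth
// is left on its own line rather than split.
function wrapText(text, maxWidth, measure = (s) => s.length, locale = "en") {
  const lines = [];
  let line = "";
  for (const chunk of breakableChunks(text, locale)) {
    const candidate = line + chunk;
    if (line !== "" && measure(candidate.trimEnd()) > maxWidth) {
      lines.push(line.trimEnd());
      line = chunk;
    } else {
      line = candidate;
    }
  }
  if (line.trimEnd() !== "") {
    lines.push(line.trimEnd());
  }
  return lines;
}

console.log(wrapText("The quick brown fox.", 10)); // [ 'The quick', 'brown fox.' ]
```

Each returned line could then be drawn with one `fillText` call per line at an increasing y offset.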

# Brendan Eich (15 years ago)

I agree with you, the optimized C++ layout engine in every browser is where the line-breaking action is -- today.

But consider projects such as Bespin, or 280 North's Cappuccino, not to mention systems such as GWT and OpenLaszlo. Layout happens in JS too.

In general with high performance VMs we are seeing more tasks formerly hosted by the C++ browser codebase being hosted in JS. It's a big, long-length wave of the future.

/be

# Erik Corry (15 years ago)

On Jan 29, 2011 8:37 PM, "Mark Davis ☕" <mark at macchiato.com> wrote:

> There are really 5 cases at issue:
>
> 1. Code point breaks
> 2. Grapheme-Cluster breaks (with three possible variants: 'legacy', extended, and aksara)
> 3. Word breaks
> 4. Line breaks
> 5. Sentence breaks
>
> Notes:
>
> - #1 is pretty trivial to do right in ES.
> - The others can be done in ES, but the code is more complicated -- the biggest issue is that they require a download of a possibly substantial amount of data. For certain languages, #3 requires considerable code and data.

The argument that large amounts of data must be downloaded for one language can't be used to argue that users should be forced to download that data for all languages in the world. The alternative, that the browser make use of data from the OS, is a fragmentation and testability nightmare.

Fonts have similar issues. In that case we are moving to downloadable fonts. That seems like the right way to go for I18n data too. Issues of the cacheability of large font and i18n data are important but not in the scope of ES.

Moving to a downloadable I18n data architecture also solves the collation order issues mentioned by Shawn recently where the front end and back end disagree on collation due to all the issues he mentioned. All those issues apply to testing and the homogeneity and testability of the web platform.

# Shawn Steele (15 years ago)

Downloading the data is insufficient for collation; you'd also have to ensure that the code processing the data is v1.0 or 1.1 or X.X, and that there weren't any errors or discrepancies between implementations. I think you'd quickly discover that that isn't possible to guarantee. Even if everyone agreed to use ICU and the UCA there'd be lots of differences. Also: who's going to collect (& provide) the data to be downloaded? What's the fallback when the data isn't available?

I'm still trying to grok "word processing in JavaScript" (beyond the simple case). For sorting, however, I think it's way better to provide an architecture that works with an understanding that collation can't be consistent between machines, at least for the foreseeable future.

-Shawn

  blogs.msdn.com/shawnste
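
One possible reading of "an architecture that works with an understanding that collation can't be consistent" is to separate the order shown to people from any order that two machines must agree on. The sketch below is an interpretation, not something proposed in the thread: the function names are hypothetical, and `Intl.Collator` is a later standard API that is assumed to be available.

```javascript
// Deterministic on every conforming engine: relational operators on strings
// compare UTF-16 code units, independent of any locale data.
function compareCodeUnits(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Locale-aware order for display only. The result depends on the collation
// data and code version of the host, so two machines may disagree.
function sortForDisplay(items, locale = "en") {
  const collator = new Intl.Collator(locale);
  return [...items].sort(collator.compare);
}

// Order that has to match on both ends (keys, paging cursors, binary search).
function sortForAgreement(items) {
  return [...items].sort(compareCodeUnits);
}

const names = ["b", "a", "Z", "\u00e4"];
console.log(sortForAgreement(names)); // [ 'Z', 'a', 'b', 'ä' ] on every engine
console.log(sortForDisplay(names));   // host-dependent, meant for people to read
```

Under this split, a front end never assumes that its own display order matches an order computed by the back end.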

# Erik Corry (15 years ago)

If downloading the data is insufficient then download the code too (in js).

Responsibility for correctness lies with the webpage authors.

The alternative is a new way to make Windows-only webpages, or even pages that only work in one version of Windows. These are exactly the same issues as with fonts, which also contain code and data. If they can be made downloadable then so can locale data.

Windows and the others don't even agree on the names of the locales. And it goes downhill from there. On Jan 30, 2011 7:32 PM, "Shawn Steele" <Shawn.Steele at microsoft.com> wrote:

Downloading the data is insufficient for collation; you'd also have to

ensure that the code processing the data is v1.0 or 1.1 or X.X. And that there weren't any errors or discrepencies between implementations. I think you'd quickly discover that isn't possible to guarantee. Even if everyone agreed to use ICU and the UCA there'd be lots of differences. Also: who's going to collect (& provide) the data to be downloaded? What's the fallback when the data isn't available?

I'm still trying to grok "word processing in JavaScript" (beyond the

simple case), however for sorting I think it's way better to provide an architecture that works with an understanding that collation can't be consistent between machines, at least for the foreseeable future.

-Shawn

  blogs.msdn.com/shawnste

From: es-discuss-bounces at mozilla.org [es-discuss-bounces at mozilla.org] on

behalf of Erik Corry [erik.corry at gmail.com]

Sent: Sunday, January 30, 2011 12:32 AM To: Mark Davis ☕ Cc: Mads Ager; Shawn Steele; es-discuss at mozilla.org Subject: Re: Stupid i18n use cases question

On Jan 29, 2011 8:37 PM, "Mark Davis ☕" <mark at macchiato.com<mailto:

mark at macchiato.com>> wrote:

There are really 5 cases at issue: Code point breaks Grapheme-Cluster breaks (with three possible variants: 'legacy',

extended, and aksha)

Word breaks Line breaks Sentence breaks Notes: #1 is pretty trivial to do right in ES. The others can be done in ES, but the code is more complicated -- the

biggest issue is that they require a download of a possibly substantial amount of data. For certain languages, #3 requires considerable code and data.

The argument that large amounts of data must be downloaded for one

language can't be used to argue that users should be forced to download that data for all languages in the world. The alternative, that the browser make use of data from the OS, is a fragmentation and testability nightmare.

Fonts have similar issues. In that case we are moving to downloadable

fonts. That seems like the right way to go for I18n data too. Issues of the cacheability of large font and i18n data are important but not in the scope of ES.

Moving to a downloadable I18n data architecture also solves the collation

order issues mentioned by Shawn recently where the front end and back end disagree on collation due to all the issues he mentioned. All those issues apply to testing and the homogeneity and testability of the web platform.

Word-breaks are different than linebreaks; the latter are the points

where you can wrap a line, which may include more than a word or come in the middle of a word.

For examples, see unicode.org/cldr/utility/breaks.jsp.

I don't know about the specific use cases that Jungshik had in mind, but

if you are doing client-side word-processing in ES (which various software does, including ours), then you want all of these, except perhaps #5. For example, a double-click uses #3.

There are other use cases for #4 besides word processing; for example,

break up long SMS's, we break at line-boundaries. I'm not saying that someone has to do this in ES; just giving an example outside of the word-processing domain.

Mark

— Il meglio è l’inimico del bene —

On Sat, Jan 29, 2011 at 10:25, Shawn Steele <Shawn.Steele at microsoft.com <mailto:Shawn.Steele at microsoft.com>> wrote:

On the phone yesterday we mentioned word/line breaking and grapheme

clusters. It didn't occur to me to ask about the use cases.

Why does someone need word/line breaking in js? It seems like that would

better be done by my rendering engine, like the HTML layout engine or my edit control or something?

If downloading the data is insufficient then download the code too (in js).

Responsibility for correctness lies with the webpage authors

The alternative is a new way to make windows-only webpages or even pages
that only work in one version of windows. It's exactly the same issues as
fonts which also contain code and data. If they can be made downloadable
then so can locale data.

Windows and the others don't even agree on the names of the locales. And it
goes downhill from there.
On Jan 30, 2011 7:32 PM, "Shawn Steele" <Shawn.Steele at microsoft.com> wrote:
> Downloading the data is insufficient for collation; you'd also have to
ensure that the code processing the data is v1.0 or 1.1 or X.X. And that
there weren't any errors or discrepencies between implementations. I think
you'd quickly discover that isn't possible to guarantee. Even if everyone
agreed to use ICU and the UCA there'd be lots of differences. Also: who's

# Shawn Steele (15 years ago)

"Names of the locales"? (We seem to be digressing somewhat).

Downloading the code too is how jQuery's i18n works.

-Shawn

  blogs.msdn.com/shawnste

# Mike Shaver (15 years ago)

On Sun, Jan 30, 2011 at 10:32 AM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:

> I'm still trying to grok "word processing in JavaScript" (beyond the simple case)

What's to grok? Microsoft is putting word processors on the web, even. They don't want to go back to the server for all processing (word breaking, etc.) on entered text, I'm sure, and I doubt anyone else will want to either.

Maybe you could be more explicit about what about the implementation language being JavaScript (which must currently include server-side applications like Node, of course) makes you confused, or why you think some specific application is inappropriate to perform in JS.

I'm doing simple stemming and word-breaking by hand in JS on both server and client side, to facilitate "fuzzy" search over a moderate corpus. Making it work in non-English languages is a daunting prospect both implementation-wise and in terms of finding expertise in all relevant languages, and I don't think it should be that way.
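
To make the by-hand approach concrete, here is a minimal sketch of what such code tends to look like. It is an illustration only, not the code actually in use: the function names (`naiveWords`, `naiveStem`, `searchTerms`) and the suffix list are hypothetical, and it assumes English-like text where whitespace and punctuation separate words.

```javascript
// Naive word breaker: treats runs of ASCII letters, digits and apostrophes
// as words. Assumption: whitespace and punctuation separate words.
function naiveWords(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Crude suffix stripper. The suffix list is a hypothetical example,
// not a real stemming algorithm.
function naiveStem(word) {
  const suffixes = ["ing", "ed", "es", "s"];
  for (const suffix of suffixes) {
    // Keep at least three characters of stem so short words survive.
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Terms to feed into a "fuzzy" search index.
function searchTerms(text) {
  return naiveWords(text).map(naiveStem);
}

console.log(searchTerms("Breaking words, simply."));
// → [ 'break', 'word', 'simply' ]
```

The same code returns no words at all for text written without ASCII letters (Japanese, for example) and splits a word like "naïve" in the middle, which is the gap that shared break data and library support would have to fill.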

Mike

