domenic at domenicdenicola.com (2013-10-28T14:50:16.921Z)
The internationalization working group is planning to support grapheme clusters through its text segmentation API. The strawman: http://wiki.ecmascript.org/doku.php?id=globalization:text_segmentation

Note that Unicode Standard Annex #29 allows for tailored (language-sensitive) grapheme clusters, which makes ECMA-402 a better fit than ECMA-262.
Norbert

On Oct 24, 2013, at 7:02, Claude Pache <claude.pache at gmail.com> wrote:

> Hello,
>
> You might know that the following ES expressions are broken:
>
>     text.charAt(0) // get the first character of the text
>     text.length > 100 ? text.substring(0,100) + '...' : text // cut the text after 100 characters
>
> The reason is *not* that ES works with UTF-16 code units instead of Unicode code points (it's just a red herring!), but
> that _graphemes_ (that is, what a human perceives as a "character") may span multiple code units and/or code points.
> For example, the letter "n̈" ("n" with diaeresis) is coded using two code points, namely
> U+006E (LATIN SMALL LETTER N) and U+0308 (COMBINING DIAERESIS), and two UTF-16 code units,
> and one should avoid cutting a string between these two codes in order to keep the resulting text meaningful for humans.
> (Nota bene: I have carefully chosen a grapheme that exists in a current written language and does not exist
> as a precomposed Unicode character.)
>
> The correct technical notion to use here is that of a "grapheme cluster", that is,
> a sequence of Unicode code points that represents a grapheme.
> See [UAX29] (Unicode Standard Annex #29: Unicode text segmentation), section 3, for more info.
>
> Therefore, I propose the following basic operations to operate on grapheme clusters:
>
> (1) String.prototype.graphemeAt(pos)
>
> This method is similar to `String.prototype.charAt()`, but it returns a grapheme cluster instead of a string
> composed of a single UTF-16 code unit. More precisely, it returns the substring of `this`
> beginning at position `pos` (inclusive) and ending at position `pos2` (exclusive), where `pos2` is the
> smallest position in `this` which is greater than `pos` and which is an (extended) grapheme cluster boundary,
> according to the specification in [UAX29], section 3.1.
> If `pos` is out of bounds, an empty string is returned.
>
> (2) String.prototype.graphemes(start = 0)
>
> This method returns an iterator, enumerating the graphemes of `this`, starting at position `start`.
> Given the `String.prototype.graphemeAt` method as above,
> it could be approximately expressed in ES6 as follows (ignoring edge cases):
>
>     String.prototype.graphemes = function*(pos = 0) {
>         pos = Math.floor(pos)
>         if (pos < 0 || Number.isNaN(pos))
>             pos = 0
>         while (pos < this.length) {
>             let grapheme = this.graphemeAt(pos)
>             pos += grapheme.length
>             yield grapheme
>         }
>     }
>
> So, the two examples at the beginning of my message could be correctly implemented as follows:
>
>     text.graphemeAt(0) // get the first grapheme of the text
>
>     // shorten a text to its first hundred graphemes
>     let shortenText = ''
>     let numGraphemes = 0
>     for (let grapheme of text.graphemes()) {
>         numGraphemes += 1
>         if (numGraphemes > 100) {
>             shortenText += '…'
>             break
>         }
>         shortenText += grapheme
>     }
>
> As a side note, I ask whether the `String.prototype.symbolAt`/`String.prototype.at` as proposed in a recent thread,
> and the `String.prototype[@@iterator]` as currently specified, are really what people need,
> or if they would mistakenly use them with the intended meaning of `String.prototype.graphemeAt`
> and `String.prototype.graphemes` as discussed in the present message?
>
> Thoughts?
>
> Claude
>
> [UAX29]: http://www.unicode.org/reports/tr29/ "Unicode Standard Annex #29: Unicode text segmentation."
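
For later readers, here is a rough sketch of the two operations proposed above, written as standalone functions rather than `String.prototype` methods. It assumes `Intl.Segmenter` is available; that is the text segmentation API ECMA-402 later standardized, and it did not exist when this thread was written. The function names `graphemeAt`, `graphemes`, and `shorten` are illustrative only and are not part of any standard.

```javascript
// Sketch only. Assumes Intl.Segmenter (ECMA-402) is available in the runtime.
// graphemeAt, graphemes and shorten are hypothetical helper names.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Substring from code unit `pos` up to the next grapheme cluster boundary.
// Returns "" when `pos` is out of bounds, as in the proposal.
function graphemeAt(text, pos) {
  const hit = segmenter.segment(text).containing(pos);
  if (hit === undefined) return "";
  return text.slice(pos, hit.index + hit.segment.length);
}

// Iterate over the grapheme clusters of `text`, starting at code unit `start`.
// Re-segments on every step, which is simple but not efficient.
function* graphemes(text, start = 0) {
  let pos = Math.floor(start);
  if (!(pos > 0)) pos = 0; // also covers NaN and negative values
  while (pos < text.length) {
    const grapheme = graphemeAt(text, pos);
    pos += grapheme.length;
    yield grapheme;
  }
}

// Keep at most `max` graphemes, appending an ellipsis if anything was cut.
function shorten(text, max) {
  let out = "";
  let count = 0;
  for (const grapheme of graphemes(text)) {
    count += 1;
    if (count > max) return out + "…";
    out += grapheme;
  }
  return out;
}
```

With this sketch, `graphemeAt("n\u0308o", 0)` keeps the "n" and its combining diaeresis together, whereas `"n\u0308o".charAt(0)` returns only the bare "n". Passing a locale instead of `undefined` to the segmenter is where language-sensitive tailoring would come in.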