Working with grapheme clusters
On Thu, Oct 24, 2013 at 3:02 PM, Claude Pache <claude.pache at gmail.com> wrote:
As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message? Thoughts?
I think you're correct. If we want to make it easier for developers to work with text, we should offer them functionality at the grapheme cluster level and not distract everyone with code units and code points. Thanks for making a proposal!
On 24 Oct 2013, at 16:02, Claude Pache <claude.pache at gmail.com> wrote:
Therefore, I propose the following basic operations on grapheme clusters:
Out of curiosity, is there any programming language that operates on grapheme clusters (rather than code points) by default? FWIW, code point iteration is what I’d expect in any language.
text.graphemeAt(0) // get the first grapheme of the text
// shorten a text to its first hundred graphemes
var shortenText = ''
let numGraphemes = 0
for (let grapheme of text) {
  numGraphemes += 1
  if (numGraphemes > 100) {
    shortenText += '…'
    break
  }
  shortenText += grapheme
}
So, you would want to change the string iterator’s behavior too?
As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message?
I don’t think this would be an issue. The new String methods and the iterator are well-defined and documented in terms of code points.
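For instance, the ES6 string iterator yields one whole code point at a time, never a surrogate half (a quick illustration):

for (const symbol of 'a\u{1F4A9}b') {
  console.log(symbol) // logs 'a', '💩', 'b': three code points, four code units
}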
IMHO combining marks are easy enough to match and special-case in your code if that’s what you need. You could use a regular expression to iterate over all grapheme clusters in the string:
// Based on the example on http://mathiasbynens.be/notes/javascript-unicode#accounting-for-other-combining-marks
var regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;
var zalgo = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞';
zalgo.match(regexGraphemeCluster);
[
"Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍",
"A̴̵̜̰͔ͫ͗͢",
"L̠ͨͧͩ͘",
"G̴̻͈͍͔̹̑͗̎̅͛́",
"Ǫ̵̹̻̝̳͂̌̌͘",
"!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"
]
On 24 Oct 2013, at 16:22, Anne van Kesteren <annevk at annevk.nl> wrote:
On Thu, Oct 24, 2013 at 3:02 PM, Claude Pache <claude.pache at gmail.com> wrote:
As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message? Thoughts?
If we want to make it easier for developers to work with text, we should offer them functionality at the grapheme cluster level and not distract everyone with code units and code points. Thanks for making a proposal!
I’d welcome grapheme helper methods (even though the ES6 string methods already make it easier to deal with grapheme clusters than ever before), but I strongly disagree that the string iterator should be changed. I think the use case of iterating over code points is much more common.
Imagine you’re writing a JavaScript library that escapes a given string as an HTML character reference, or as a CSS identifier, or anything else. In those cases, you don’t care about grapheme clusters, you care about code points, cause those are the units you end up escaping individually.
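For example, a minimal sketch of such an escaper (the function name is made up for illustration; ES6):

// Illustrative sketch: escape every code point as a numeric character reference.
function escapeAsCharacterReferences(string) {
  let result = ''
  for (const symbol of string) { // the ES6 string iterator yields code points
    result += '&#x' + symbol.codePointAt(0).toString(16).toUpperCase() + ';'
  }
  return result
}
escapeAsCharacterReferences('💩') // '&#x1F4A9;': one reference, not two surrogate halves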
On Thu, Oct 24, 2013 at 3:31 PM, Mathias Bynens <mathias at qiwi.be> wrote:
Imagine you’re writing a JavaScript library that escapes a given string as an HTML character reference, or as a CSS identifier, or anything else. In those cases, you don’t care about grapheme clusters, you care about code points, cause those are the units you end up escaping individually.
Is that really a common operation? I would expect formatting, searching, etc. to dominate. E.g. whenever you do substr/substring you would want that to be grapheme-cluster aware.
On 24 Oct 2013, at 16:24, Mathias Bynens <mathias at qiwi.be> wrote:
text.graphemeAt(0) // get the first grapheme of the text
// shorten a text to its first hundred graphemes
var shortenText = ''
let numGraphemes = 0
for (let grapheme of text) {
  numGraphemes += 1
  if (numGraphemes > 100) {
    shortenText += '…'
    break
  }
  shortenText += grapheme
}
So, you would want to change the string iterator’s behavior too?
At least, I'd like to have the opportunity to iterate over what I need. I have no opinion whether iterating through code points, or through grapheme clusters, should be the default iterator, or if there should be none, forcing the developer to consciously pick the one they really mean.
As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message?

I don’t think this would be an issue. The new String methods and the iterator are well-defined and documented in terms of code points.

IMHO combining marks are easy enough to match and special-case in your code if that’s what you need. You could use a regular expression to iterate over all grapheme clusters in the string:
// Based on the example on http://mathiasbynens.be/notes/javascript-unicode#accounting-for-other-combining-marks
var regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;
Note that the specification in UAX #29, section 3.1, for determining grapheme cluster boundaries does not just use the notion of "combining marks". I fear that, for some exotic scripts (apparently, at least Hangul), it is more complicated than just finding a span of combining marks.
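For instance (an illustration using the regex quoted above):

var hangul = '\u1100\u1161' // '가' written as conjoining jamo: one grapheme cluster per UAX #29
hangul.match(regexGraphemeCluster) // ['ᄀ', 'ᅡ']: the regex incorrectly splits it in two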
Mathias Bynens wrote:
Out of curiosity, is there any programming language that operates on grapheme clusters (rather than code points) by default? FWIW, code point iteration is what I’d expect in any language.
It is the specified default for Perl 6, and it can be modified through lexically scoped pragmas. I do not know the state of implementation.
On Thu, Oct 24, 2013 at 7:38 AM, Anne van Kesteren <annevk at annevk.nl> wrote:
On Thu, Oct 24, 2013 at 3:31 PM, Mathias Bynens <mathias at qiwi.be> wrote:
Imagine you’re writing a JavaScript library that escapes a given string as an HTML character reference, or as a CSS identifier, or anything else. In those cases, you don’t care about grapheme clusters, you care about code points, cause those are the units you end up escaping individually.
Is that really a common operation? I would expect formatting, searching, etc. to dominate. E.g. whenever you do substr/substring you would want that to be grapheme-cluster aware.
I think I disagree. Trying to take this apart:
If you're searching, you don't want to use the iterator anyway, because finding character boundaries or grapheme boundaries is a waste of time. UTF-16 is designed so that you can search based on code units alone, without computing boundaries. RegExp searches fall in this category.
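A sketch of why this works: lead surrogates (0xD800-0xDBFF) and trail surrogates (0xDC00-0xDFFF) occupy disjoint code unit ranges, so a well-formed needle can never start matching in the middle of a surrogate pair:

const haystack = 'a\u{1F4A9}b' // 'a💩b': four code units
haystack.indexOf('\u{1F4A9}')  // 1: found by plain code unit comparison
haystack.indexOf('b')          // 3: 'b' cannot collide with either surrogate half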
IIUC, "formatting" mostly involves finding patterns to replace—it's a special case of searching, right?
When you do substr/slice/substring, you should be using offsets that are on grapheme boundaries, but obtaining offsets by using String iteration and adding up the lengths will be very rare, I think.
So String iteration is kind of left looking around for a use case. I can't think of any that compel me to prefer graphemes over characters out of sheer practicality. Reversing strings, for example, I can't care about that. Anyone?
The internationalization working group is planning to support grapheme clusters through its text segmentation API - the strawman: globalization:text_segmentation
Note that Unicode Standard Annex #29 allows for tailored (language sensitive) grapheme clusters, which makes ECMA-402 a better fit than ECMA-262.
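For a rough idea of the shape such an API could take (the names below are illustrative only, not the strawman's actual surface):

// Illustrative only; not the strawman's actual API.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' })
for (const { segment } of segmenter.segment('n\u0308o')) {
  console.log(segment) // logs 'n̈' then 'o': grapheme clusters, not code points
}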
On Oct 24, 2013, at 7:38 , Anne van Kesteren <annevk at annevk.nl> wrote:
Is that really a common operation? I would expect formatting, searching, etc. to dominate. E.g. whenever you do substr/substring you would want that to be grapheme-cluster aware.
There are cases where you don't care about grapheme clusters, e.g. if you want to replace any occurrence of "{" + varname + "}" in a string with the value of the variable named varname.
In cases where you do care about grapheme clusters, it's usually more efficient to search based on code points or code units first, and then check whether the substring found begins and ends on grapheme cluster boundaries (e.g., if a search for "n" finds the first character of Claude's example "n̈", then you'll want to ignore that match).
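A minimal sketch of that two-step approach, using the combining mark heuristic from earlier in the thread (a real implementation would use the UAX #29 boundary rules instead):

// Heuristic only: a position directly before a combining mark is not a boundary.
const COMBINING = /[\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]/
function endsOnClusterBoundary(string, index) {
  return index >= string.length || !COMBINING.test(string.charAt(index))
}
const text = 'n\u0308o'                // 'n̈o'
const match = text.indexOf('n')        // 0: fast code unit search first
endsOnClusterBoundary(text, match + 1) // false: U+0308 follows, so ignore this match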
On Oct 25, 2013, at 18:35 , Jason Orendorff <jason.orendorff at gmail.com> wrote:
UTF-16 is designed so that you can search based on code units alone, without computing boundaries. RegExp searches fall in this category.
Not if the RegExp is case insensitive, or uses a character class, or ".", or a quantifier - these all require looking at code points rather than UTF-16 code units in order to support the full Unicode character set.
Norbert Lindenberg wrote:
Not if the RegExp is case insensitive, or uses a character class, or ".", or a quantifier - these all require looking at code points rather than UTF-16 code units in order to support the full Unicode character set.
If you have a regular expression over an alphabet like "Unicode scalar values" it is easy to turn it into an equivalent regular expression over an alphabet like "UTF-16 code units". I have written a Perl module that does it for UTF-8, search.cpan.org/dist/Unicode-SetAutomaton/; Russ Cox's swtch.com/~rsc/regexp/regexp3.html#step3 is a popular implementation. In effect it is still as though the implementation used Unicode scalar values, but that would be true of any implementation. It is much harder to implement something like this for other encodings like UTF-7 and Punycode.
It is useful to keep in mind that features like character classes are just syntactic sugar and can be decomposed into regular expression primitives, like a choice listing each member of the character class as a literal. The . is just a large character class, and flags like //i just transform parts of an expression, where /a/i becomes something more like /a|A/.
On Fri, Oct 25, 2013 at 11:42 PM, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:
Not if the RegExp is case insensitive, or uses a character class, or ".", or a quantifier - these all require looking at code points rather than UTF-16 code units in order to support the full Unicode character set.
Can you explain this more? ISTM case insensitive searches and character classes don't require finding boundaries in the string being searched. Matching /./ does, sometimes. The common use is /.*/ and in that case you don't have to find all boundaries in the text being matched, only at the end or (again, only in certain cases) if you have to backtrack.
Of course all those things have code-point-oriented semantics, which is great. But the implementation can be faster than that.
I'd like to know what you have in mind regarding quantifiers though.
Claude Pache wrote:
You might know that the following ES expressions are broken:
text.charAt(0) // get the first character of the text
text.length > 100 ? text.substring(0, 100) + '...' : text // cut the text after 100 characters
The reason is not because ES works with UTF-16 code units instead of Unicode code points (it's just a red herring!), but because graphemes (that is, what a human perceives as a "character") may span multiple code units and/or code points.
The example is deceptively simple. Truncating a string is a hard problem, and a high-quality implementation would probably be language-specific to avoid problematic truncations, such as when a suffix changes the meaning of a prefix. It would also take special characters into account: say, you do not want the last character before the "..." to be an opening quote mark; and if the string is 101 characters long and already ends in "...", turning it into a 103-character string ending in "....." would also be silly.
Another issue that is often ignored is that you might want to use the truncated text in combination with other text, say in an HTML document with a "more" or "permalink" or some such link after it. Something like
<p>ABC ‮ DEF ‬ GHI <a href='...'>more</a></p>
<p>ABC ‮ DEF ... <a href='...'>more</a></p>
The second paragraph will render "ABC erom ... FED" because the control character that restores the bidirectional text state got lost when the string was truncated. These are all issues that counting graphemes instead of 16-bit units does not address, and it is not clear to me that it would actually be an improvement.
"User-perceived character" is not an intuitive notion especially once you leave the realm of letters from a familiar script. In a string that contains 1 user-perceived character, what is the maximum number of zero- width spaces in that string? The maximum number of scalar values? What is the maximum width and maximum height of such a string when rendered, the maximum number of UTF-8 bytes needed to encode such a string? Should one perceive a horizontal ellipsis as three characters, or is it just one? How many are two thin spaces?
My smartphone comes with a "News" application that displays the latest headlines from various news sources and links to corresponding articles. If you use it for a day or two you will notice that it's not of German design, but one for a language that uses fewer or narrower "grapheme clusters per unit of information" if you will. Many of the headlines do not convey what the article might be about. A current example is 'Code- name "Lustre" - Frankreich liefert' which is roughly 'code name "lustre"
- France supplies' ... what? What does France supply? Or "Dortmund droht historische Pleite im" roughly "Dortmund faces historic ... in" where "Pleite" could be "bankruptcy", "defeat", "failure", ... could be sport, could be finance, can't tell.
That makes the application rather frustrating to use with german news. I imagine it works better with english headlines which tend to use fewer grapheme clusters. So truncating news headlines after a certain number of grapheme clusters untailored to the specific script and language is not the "right" design choice. Actually, it might be truncated by pixel measures because there is a visual space to fit, but english and german are very similar in their pixels per grapheme cluster metrics...
So it seems rather unlikely for someone to say "so we need the first 100 extended grapheme clusters as defined in UAX #29 of the string" and then someone responding "yes, that is clearly the right solution".
On Oct 26, 2013, at 5:39 , Bjoern Hoehrmann <derhoermi at gmx.net> wrote:
If you have a regular expression over an alphabet like "Unicode scalar values" it is easy to turn it into an equivalent regular expression over an alphabet like "UTF-16 code units". I have written a Perl module that does it for UTF-8, search.cpan.org/dist/Unicode-SetAutomaton/; Russ Cox's swtch.com/~rsc/regexp/regexp3.html#step3 is a popular implementation. In effect it is still as though the implementation used Unicode scalar values, but that would be true of any implementation. It is much harder to implement something like this for other encodings like UTF-7 and Punycode.
It is useful to keep in mind that features like character classes are just syntactic sugar and can be decomposed into regular expression primitives, like a choice listing each member of the character class as a literal. The . is just a large character class, and flags like //i just transform parts of an expression, where /a/i becomes something more like /a|A/.
OK, if Jason's comment was meant to say that RegExp searches specified based on code points can be implemented through an equivalent search based on code units, then that's correct. I was assuming that we're discussing API design, and requiring developers to provide those equivalent UTF-16 based regular expressions (as the current RegExp design does) is a recipe for breakage whenever supplementary characters are involved.
On Oct 26, 2013, at 6:58 , Jason Orendorff <jason.orendorff at gmail.com> wrote:
I'd like to know what you have in mind regarding quantifiers though.
When I write /💩{2}/, I mean /💩💩/, but the current code-unit-based RegExp will interpret it as /💩\uDCA9/ (the {2} applies only to the trailing surrogate \uDCA9), which can't match any well-formed UTF-16 string.
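A quick check of both behaviors (the second line assumes the ES6 u flag as currently being specified):

/💩{2}/.test('💩💩')  // false: {2} applies only to the trail surrogate \uDCA9
/💩{2}/u.test('💩💩') // true: with the u flag, {2} applies to the whole code point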
On 26 Oct 2013, at 14:39, Bjoern Hoehrmann <derhoermi at gmx.net> wrote:
If you have a regular expression over an alphabet like "Unicode scalar values" it is easy to turn it into an equivalent regular expression over an alphabet like "UTF-16 code units".
FWIW, Regenerate is a JavaScript library that can be used for this. A few examples from mathiasbynens.be/notes/javascript-unicode#regex:
Here, a regular expression is created that matches any Unicode scalar value:
>> regenerate()
     .addRange(0x0, 0x10FFFF)     // all Unicode code points
     .removeRange(0xD800, 0xDBFF) // minus high surrogates
     .removeRange(0xDC00, 0xDFFF) // minus low surrogates
     .toRegExp()
/[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/
Similarly, to polyfill . in a Unicode-enabled ES6 regex:
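A sketch along those lines, assuming Regenerate's remove() method and the ES6 LineTerminator set (this may differ from the post's exact example):

>> regenerate()
     .addRange(0x0, 0x10FFFF) // all Unicode code points
     .remove(                 // minus LineTerminators:
       0x000A,                // Line Feed <LF>
       0x000D,                // Carriage Return <CR>
       0x2028,                // Line Separator <LS>
       0x2029                 // Paragraph Separator <PS>
     )
     .toRegExp()              // yields a code-unit-based equivalent of ES6 /./u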
You might know that the following ES expressions are broken:

text.charAt(0) // get the first character of the text
text.length > 100 ? text.substring(0, 100) + '...' : text // cut the text after 100 characters
The reason is not because ES works with UTF-16 code units instead of Unicode code points (that's a red herring!), but because graphemes (that is, what a human perceives as a "character") may span multiple code units and/or code points. For example, the letter "n̈" ("n" with diaeresis) is coded using two code points, namely U+006E (LATIN SMALL LETTER N) and U+0308 (COMBINING DIAERESIS), and two UTF-16 code units, and one should avoid cutting a string between these two code points in order to keep the resulting text meaningful for humans. (Nota bene: I have carefully chosen a grapheme that exists in a current written language and does not exist as a precomposed Unicode character.)
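To make this concrete (ES6 notation; a quick illustration, not part of the proposal):

var n = 'n\u0308'          // 'n̈': U+006E followed by U+0308
console.log(n.length)      // 2: two UTF-16 code units
console.log([...n].length) // 2: also two code points, yet one grapheme
console.log(n.charAt(0))   // 'n': cutting here strands the diaeresis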
The correct technical notion to use here is that of a "grapheme cluster": a sequence of Unicode code points that represents a grapheme. See UAX #29 (Unicode Standard Annex #29: Unicode Text Segmentation), section 3, for more info.
Therefore, I propose the following basic operations on grapheme clusters:
(1) String.prototype.graphemeAt(pos)
This method is similar to String.prototype.charAt(), but it returns a grapheme cluster instead of a string composed of a single UTF-16 code unit. More precisely, it returns the shortest substring of this beginning at position pos (inclusively) and ending at position pos2 (exclusively), where pos2 is the smallest position in this which is greater than pos and which is an (extended) grapheme cluster boundary, according to the specification in UAX #29, section 3.1. If pos is out of bounds, an empty string is returned.

(2) String.prototype.graphemes(start = 0)
This method returns an iterator, enumerating the graphemes of this, starting at position start. Given the String.prototype.graphemeAt method as above, it could be approximately expressed in ES6 (ignoring edge cases); see the sketch below. So, the two examples from the beginning of my message could be correctly implemented as follows:

text.graphemeAt(0) // get the first grapheme of the text

// shorten a text to its first hundred graphemes
var shortenText = ''
let numGraphemes = 0
for (let grapheme of text) {
  numGraphemes += 1
  if (numGraphemes > 100) {
    shortenText += '…'
    break
  }
  shortenText += grapheme
}
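A minimal sketch of how that graphemes iterator could be written on top of the proposed graphemeAt (both methods are hypothetical at this point; edge cases ignored):

// Hypothetical sketch in terms of the proposed String.prototype.graphemeAt.
String.prototype.graphemes = function* (start = 0) {
  let pos = start
  while (pos < this.length) {
    const grapheme = this.graphemeAt(pos)
    yield grapheme
    pos += grapheme.length // advance by the cluster's length in code units
  }
}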
As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message? Thoughts?