Working with grapheme clusters

# Claude Pache (12 years ago)

You might know that the following ES expressions are broken:

text.charAt(0) // get the first character of the text
text.length > 100 ? text.substring(0,100) + '...' : text // cut the text after 100 characters

The reason is not because ES works with UTF-16 code units instead of Unicode code points (it's just a red herring!), but because graphemes (that is, what a human perceives as a "character") may span multiple code units and/or code points. For example, the letter "n̈" ("n" with diaeresis) is coded using two code points, namely U+006E (LATIN SMALL LETTER N) and U+0308 (COMBINING DIAERESIS), and two UTF-16 code units, and it should be avoided to cut a string between these two codes in order to keep the resulting text meaningful for humans. (Nota bene: I have carefully chosen a grapheme that exists in a current written language and does not exist as precomposed Unicode character.)

The correct technical notion to use here is that of a "grapheme cluster", that is, a sequence of Unicode code points that represents a grapheme. See UAX29 (Unicode Standard Annex #29: Unicode text segmentation), section 3, for more info.

Therefore, I propose the following basic operations to operate on grapheme clusters:

(1) String.prototype.graphemeAt(pos)

This method is similar to String.prototype.charAt(), but it returns a grapheme cluster instead of a string composed of a single UTF-16 code unit. More precisely, it returns the shortest substring of this beginning at position pos (inclusively) and ending at position pos2 (exclusively), where pos2 is the smallest position in this which is greater than pos and which is an (extended) grapheme cluster boundary, according to the specification in UAX29, section 3.1. If pos is out of bounds, an empty string is returned.

(2) String.prototype.graphemes(start = 0)

This method returns an iterator, enumerating the graphemes of this, starting at position start. Given the String.prototype.graphemeAt method as above, it could be approximately expressed in ES6 as follows (ignoring edge cases):

String.prototype.graphemes = function*(pos = 0) {
	pos = Math.floor(pos)
	if (pos < 0 || Number.isNaN(pos))
		pos = 0
	while (pos < this.length) {
		let grapheme = this.graphemeAt(pos)
		pos += grapheme.length
		yield grapheme
	}
}
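To experiment with the semantics of (1) and (2), here is a minimal sketch built on `Intl.Segmenter`. That API postdates this message and is not part of the proposal. The standalone functions `graphemeAt` and `graphemes` are hypothetical stand-ins for the proposed methods, and the boundaries are whatever the engine's default grapheme segmentation produces.

```javascript
// Assumption: Intl.Segmenter is available (modern browsers, recent Node.js).
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical stand-in for the proposed String.prototype.graphemeAt(pos).
function graphemeAt(str, pos) {
  // containing() returns undefined when pos is out of bounds.
  const data = segmenter.segment(str).containing(pos);
  if (data === undefined) return '';
  // Start at pos itself (even inside a cluster) and stop at the next boundary.
  return str.slice(pos, data.index + data.segment.length);
}

// Hypothetical stand-in for the proposed String.prototype.graphemes(start).
function* graphemes(str, start = 0) {
  let pos = Math.floor(start);
  if (pos < 0 || Number.isNaN(pos)) pos = 0;
  while (pos < str.length) {
    const grapheme = graphemeAt(str, pos);
    pos += grapheme.length;
    yield grapheme;
  }
}

console.log(graphemeAt('n\u0308o', 0).length); // 2: "n" plus U+0308
console.log([...graphemes('n\u0308o')].length); // 2 clusters
```

Plain functions are used instead of extending `String.prototype`, so the sketch cannot collide with anything a future standard might define.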

So, the two examples of the beginning of my message could be correctly implemented as follows:

text.graphemeAt(0) // get the first grapheme of the text

// shorten a text to its first hundred graphemes
var shortenText = ''
let numGraphemes = 0
for (let grapheme of text) {
	numGraphemes += 1
	if (numGraphemes > 100) {
		shortenText += '…'
		break
	}
	shortenText += grapheme
}
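The same shortening logic can be tried with `Intl.Segmenter`, an API that did not exist when this was written. This is a sketch under that assumption; the function name `shortenToGraphemes` and its `maxGraphemes` parameter are made up for illustration.

```javascript
// Assumption: Intl.Segmenter is available in the engine.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical helper: keep at most maxGraphemes clusters, then append an ellipsis.
function shortenToGraphemes(text, maxGraphemes) {
  let result = '';
  let numGraphemes = 0;
  for (const { segment } of graphemeSegmenter.segment(text)) {
    numGraphemes += 1;
    if (numGraphemes > maxGraphemes) {
      return result + '…';
    }
    result += segment;
  }
  return result; // the text was short enough, so no ellipsis is added
}

// Three "n̈" clusters (six code units) cut after two clusters.
console.log(shortenToGraphemes('n\u0308'.repeat(3), 2) === 'n\u0308n\u0308…'); // true
```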

As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message?

Thoughts?

[UAX29]: http://www.unicode.org/reports/tr29/ "Unicode Standard Annex #29: Unicode text segmentation."

# Anne van Kesteren (12 years ago)

On Thu, Oct 24, 2013 at 3:02 PM, Claude Pache <claude.pache at gmail.com> wrote:

> As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message?
>
> Thoughts?

I think you're correct. If we want to make it easier for developers to work with text, we should offer them functionality at the grapheme cluster level and not distract everyone with code units and code points. Thanks for making a proposal!


# Mathias Bynens (12 years ago)

On 24 Oct 2013, at 16:02, Claude Pache <claude.pache at gmail.com> wrote:

> Therefore, I propose the following basic operations to operate on grapheme clusters:

Out of curiosity, is there any programming language that operates on grapheme clusters (rather than code points) by default? FWIW, code point iteration is what I’d expect in any language.

> 	text.graphemeAt(0) // get the first grapheme of the text
>
> 	// shorten a text to its first hundred graphemes
> 	var shortenText = ''
> 	let numGraphemes = 0
> 	for (let grapheme of text) {
> 		numGraphemes += 1
> 		if (numGraphemes > 100) {
> 			shortenText += '…'
> 			break
> 		}
> 		shortenText += grapheme
> 	}

So, you would want to change the string iterator’s behavior too?

> As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message?

I don’t think this would be an issue. The new String methods and the iterator are well-defined and documented in terms of code points.

IMHO combining marks are easy enough to match and special-case in your code if that’s what you need. You could use a regular expression to iterate over all grapheme clusters in the string:

// Based on the example on http://mathiasbynens.be/notes/javascript-unicode#accounting-for-other-combining-marks
var regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;

var zalgo = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞';

zalgo.match(regexGraphemeCluster);
[
  "Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍",
  "A̴̵̜̰͔ͫ͗͢",
  "L̠ͨͧͩ͘",
  "G̴̻͈͍͔̹̑͗̎̅͛́",
  "Ǫ̵̹̻̝̳͂̌̌͘",
  "!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"
]


# Mathias Bynens (12 years ago)

On 24 Oct 2013, at 16:22, Anne van Kesteren <annevk at annevk.nl> wrote:

> On Thu, Oct 24, 2013 at 3:02 PM, Claude Pache <claude.pache at gmail.com> wrote:
>> As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message?
>>
>> Thoughts?
>
> If we want to make it easier for developers to work with text, we should offer them functionality at the grapheme cluster level and not distract everyone with code units and code points. Thanks for making a proposal!

I’d welcome grapheme helper methods (even though the ES6 string methods already make it easier to deal with grapheme clusters than ever before), but I strongly disagree the string iterator should be changed. I think the use case of iterating over code points is much more common.

Imagine you’re writing a JavaScript library that escapes a given string as an HTML character reference, or as a CSS identifier, or anything else. In those cases, you don’t care about grapheme clusters, you care about code points, cause those are the units you end up escaping individually.
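As an illustration of that use case, here is a rough sketch of escaping each code point as a hexadecimal character reference. The function name `escapeAsCharacterReferences` is hypothetical, and the sketch makes no attempt to handle lone surrogates or to leave safe characters unescaped.

```javascript
// Hypothetical helper: escape every code point as a hexadecimal character reference.
function escapeAsCharacterReferences(str) {
  let out = '';
  // The ES6 string iterator steps by code point, so an astral symbol
  // arrives as one two-unit string instead of two separate surrogates.
  for (const symbol of str) {
    out += '&#x' + symbol.codePointAt(0).toString(16).toUpperCase() + ';';
  }
  return out;
}

console.log(escapeAsCharacterReferences('a\u{1F4A9}')); // &#x61;&#x1F4A9;
```

Iterating by grapheme cluster here would hand the escaper multi-code-point strings that it would then have to split again.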


# Anne van Kesteren (12 years ago)

On Thu, Oct 24, 2013 at 3:31 PM, Mathias Bynens <mathias at qiwi.be> wrote:

> Imagine you’re writing a JavaScript library that escapes a given string as an HTML character reference, or as a CSS identifier, or anything else. In those cases, you don’t care about grapheme clusters, you care about code points, cause those are the units you end up escaping individually.

Is that really a common operation? I would expect formatting, searching, etc. to dominate. E.g. whenever you do substr/substring you would want that to be grapheme-cluster aware.


# Claude Pache (12 years ago)

On 24 Oct 2013, at 16:24, Mathias Bynens <mathias at qiwi.be> wrote:

>> 	text.graphemeAt(0) // get the first grapheme of the text
>>
>> 	// shorten a text to its first hundred graphemes
>> 	var shortenText = ''
>> 	let numGraphemes = 0
>> 	for (let grapheme of text) {
>> 		numGraphemes += 1
>> 		if (numGraphemes > 100) {
>> 			shortenText += '…'
>> 			break
>> 		}
>> 		shortenText += grapheme
>> 	}
>
> So, you would want to change the string iterator’s behavior too?

At least, I'd like to have the opportunity to iterate over what I need. I have no opinion whether iterating through code points, or through grapheme clusters, should be the default iterator, or if there should be none, forcing the developer to consciously pick the one they really mean.

>> As a side note, I ask whether the String.prototype.symbolAt/String.prototype.at as proposed in a recent thread, and the String.prototype[@@iterator] as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of String.prototype.graphemeAt and String.prototype.graphemes as discussed in the present message?
>
> I don’t think this would be an issue. The new String methods and the iterator are well-defined and documented in terms of code points.
>
> IMHO combining marks are easy enough to match and special-case in your code if that’s what you need. You could use a regular expression to iterate over all grapheme clusters in the string:
>
>    // Based on the example on http://mathiasbynens.be/notes/javascript-unicode#accounting-for-other-combining-marks
>    var regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;

Note that the specification in UAX29, section 3.1, for determining grapheme cluster boundaries does not just use the notion of "combining marks". I fear that, for some exotic scripts (apparently, at least Hangul), it is more complicated than just finding a span of combining marks.
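The Hangul case can be made concrete with a small comparison. This is a sketch that assumes `Intl.Segmenter` (an API that postdates this thread) follows the UAX29 extended grapheme cluster rules; the regular expression is the one quoted above.

```javascript
// The combining-mark regex quoted above: one base code point plus trailing marks.
const regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;

// Assumption: Intl.Segmenter is available and applies the UAX29 rules.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// One Hangul syllable written with conjoining jamo:
// U+1112 (CHOSEONG HIEUH) + U+1161 (JUNGSEONG A) + U+11AB (JONGSEONG NIEUN).
const syllable = '\u1112\u1161\u11AB';

const byRegex = syllable.match(regexGraphemeCluster);
const bySegmenter = Array.from(segmenter.segment(syllable), (s) => s.segment);

console.log(byRegex.length);     // 3: each jamo is treated as its own base character
console.log(bySegmenter.length); // 1: the jamo sequence forms a single cluster
```

None of the jamo is a combining mark in the regex's sense, so the regex splits what a reader sees as a single syllable.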


# Bjoern Hoehrmann (12 years ago)

Mathias Bynens wrote:

> Out of curiosity, is there any programming language that operates on grapheme clusters (rather than code points) by default? FWIW, code point iteration is what I’d expect in any language.

It is the specified default for Perl 6 that can be modified through lexically scoped pragmas. I do not know the state of implementation.


# Jason Orendorff (12 years ago)

On Thu, Oct 24, 2013 at 7:38 AM, Anne van Kesteren <annevk at annevk.nl> wrote:

> On Thu, Oct 24, 2013 at 3:31 PM, Mathias Bynens <mathias at qiwi.be> wrote:
>> Imagine you’re writing a JavaScript library that escapes a given string as an HTML character reference, or as a CSS identifier, or anything else. In those cases, you don’t care about grapheme clusters, you care about code points, cause those are the units you end up escaping individually.
>
> Is that really a common operation? I would expect formatting, searching, etc. to dominate. E.g. whenever you do substr/substring you would want that to be grapheme-cluster aware.

I think I disagree. Trying to take this apart:

If you're searching, you don't want to use the iterator anyway, because finding character boundaries or grapheme boundaries is a waste of time. UTF-16 is designed so that you can search based on code units alone, without computing boundaries. RegExp searches fall in this category.

IIUC, "formatting" mostly involves finding patterns to replace—it's a special case of searching, right?

When you do substr/slice/substring, you should be using offsets that are on grapheme boundaries, but obtaining offsets by using String iteration and adding up the lengths will be very rare, I think.

So String iteration is kind of left looking around for a use case. I can't think of any that compel me to prefer graphemes over characters out of sheer practicality. Reversing strings, for example, I can't care about that. Anyone?


# Norbert Lindenberg (12 years ago)

The internationalization working group is planning to support grapheme clusters through its text segmentation API - the strawman: http://wiki.ecmascript.org/doku.php?id=globalization:text_segmentation

Note that Unicode Standard Annex #29 allows for tailored (language sensitive) grapheme clusters, which makes ECMA-402 a better fit than ECMA-262.


# Norbert Lindenberg (12 years ago)

On Oct 24, 2013, at 7:38 , Anne van Kesteren <annevk at annevk.nl> wrote:

> Is that really a common operation? I would expect formatting, searching, etc. to dominate. E.g. whenever you do substr/substring you would want that to be grapheme-cluster aware.

There are cases where you don't care about grapheme clusters, e.g. if you want to replace any occurrence of "{" + varname + "}" in a string with the value of the variable named varname.

In cases where you do care about grapheme clusters, it's usually more efficient to search based on code points or code units first, and then check whether the substring found begins and ends on grapheme cluster boundaries (e.g., if a search for "n" finds the first character of Claude's example "n̈", then you'll want to ignore that match).
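The search-then-verify strategy described above might look roughly like this. It is a sketch that assumes `Intl.Segmenter` as the source of grapheme cluster boundaries; the function name `indexOfGraphemeAligned` is hypothetical.

```javascript
// Assumption: Intl.Segmenter is available and provides grapheme cluster boundaries.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical helper: first code-unit match that starts and ends on a cluster boundary.
function indexOfGraphemeAligned(haystack, needle) {
  // Collect every boundary offset; the end of the string is a boundary too.
  const boundaries = new Set([haystack.length]);
  for (const { index } of segmenter.segment(haystack)) {
    boundaries.add(index);
  }
  let from = 0;
  while (true) {
    // Cheap search on code units first...
    const found = haystack.indexOf(needle, from);
    if (found === -1) return -1;
    // ...then accept the match only if both ends sit on boundaries.
    if (boundaries.has(found) && boundaries.has(found + needle.length)) {
      return found;
    }
    from = found + 1;
  }
}

const text = 'n\u0308o n';
console.log(text.indexOf('n'));                 // 0: matches the base letter of "n̈"
console.log(indexOfGraphemeAligned(text, 'n')); // 4: the standalone "n"
```

For simplicity the sketch computes all boundaries up front; a real implementation would probably check only the two offsets of each candidate match.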


# Norbert Lindenberg (12 years ago)

On Oct 25, 2013, at 18:35 , Jason Orendorff <jason.orendorff at gmail.com> wrote:

> UTF-16 is designed so that you can search based on code units alone, without computing boundaries. RegExp searches fall in this category.

Not if the RegExp is case insensitive, or uses a character class, or ".", or a quantifier - these all require looking at code points rather than UTF-16 code units in order to support the full Unicode character set.
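The difference shows up as soon as a pattern meets a character outside the Basic Multilingual Plane. The sketch below uses the `u` flag as it was eventually standardized in ES2015, which is an assumption relative to the date of this message; the variable names are illustrative only.

```javascript
const astral = '\u{1F4A9}'; // one code point, two UTF-16 code units

// ".": without the u flag it matches a single code unit.
const dotByCodeUnit = /^.$/.test(astral);   // false
const dotByCodePoint = /^.$/u.test(astral); // true

// Character class: without u, the class holds two separate surrogates.
const classByCodeUnit = new RegExp('^[' + astral + ']$').test(astral);       // false
const classByCodePoint = new RegExp('^[' + astral + ']$', 'u').test(astral); // true

// Quantifier: without u, {2} applies to the trailing surrogate only.
const quantByCodeUnit = new RegExp('^' + astral + '{2}$').test(astral + astral);       // false
const quantByCodePoint = new RegExp('^' + astral + '{2}$', 'u').test(astral + astral); // true

console.log(dotByCodeUnit, classByCodeUnit, quantByCodeUnit);    // false false false
console.log(dotByCodePoint, classByCodePoint, quantByCodePoint); // true true true
```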


# Bjoern Hoehrmann (12 years ago)

Norbert Lindenberg wrote:

> Not if the RegExp is case insensitive, or uses a character class, or ".", or a quantifier - these all require looking at code points rather than UTF-16 code units in order to support the full Unicode character set.

If you have a regular expression over an alphabet like "Unicode scalar values" it is easy to turn it into an equivalent regular expression over an alphabet like "UTF-16 code units". I have written a Perl module that does it for UTF-8, search.cpan.org/dist/Unicode-SetAutomaton/; Russ Cox's swtch.com/~rsc/regexp/regexp3.html#step3 is a popular implementation. In effect it is still as though the implementation used Unicode scalar values, but that would be true of any implementation. It is much harder to implement something like this for other encodings like UTF-7 and Punycode.

It is useful to keep in mind features like character classes are just syntactic sugar and can be decomposed into regular expression primitives like a choice listing each member of the character class as literal. The . is just a large character class, and flags like //i just transform parts of an expression where /a/i becomes something more like /a|A/.
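The decomposition described above can be checked by hand on small cases. This is only an illustration with hand-written patterns, not the output of any of the tools mentioned; the `u` flag used for comparison assumes an ES2015-capable engine.

```javascript
// A character class is shorthand for a choice between its members.
const classForm = /^[abc]$/;
const choiceForm = /^(?:a|b|c)$/;

// The i flag can be expanded by hand in the same spirit.
const flagForm = /^a$/i;
const expandedForm = /^(?:a|A)$/;

// U+1F4A8..U+1F4AA share the high surrogate D83D and have the low
// surrogates DCA8..DCAA, so the code point class can be rewritten as a
// sequence over UTF-16 code units.
const codePointForm = /^[\u{1F4A8}-\u{1F4AA}]$/u; // assumes ES2015 u flag support
const codeUnitForm = /^\uD83D[\uDCA8-\uDCAA]$/;

const samples = ["a", "A", "b", "d", "\u{1F4A8}", "\u{1F4A9}", "\u{1F4AB}"];
for (const s of samples) {
  console.log(
    JSON.stringify(s),
    classForm.test(s) === choiceForm.test(s),
    flagForm.test(s) === expandedForm.test(s),
    codePointForm.test(s) === codeUnitForm.test(s)
  );
}
```

Each pair of patterns agrees on every sample, which is the sense in which the code-unit form is "equivalent" to the code-point form.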

# Jason Orendorff (12 years ago)

On Fri, Oct 25, 2013 at 11:42 PM, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:

Not if the RegExp is case insensitive, or uses a character class, or ".", or a quantifier - these all require looking at code points rather than UTF-16 code units in order to support the full Unicode character set.

Can you explain this more? ISTM case insensitive searches and character classes don't require finding boundaries in the string being searched. Matching /./ does, sometimes. The common use is /.*/ and in that case you don't have to find all boundaries in the text being matched, only at the end or (again, only in certain cases) if you have to backtrack.

Of course all those things have code-point-oriented semantics, which is great. But the implementation can be faster than that.

I'd like to know what you have in mind regarding quantifiers though.

-j

# Bjoern Hoehrmann (12 years ago)

Claude Pache wrote:

You might know that the following ES expressions are broken:
text.charAt(0) // get the first character of the text
text.length > 100 ? text.substring(0,100) + '...' : text // cut the text after 100 characters
The reason is not because ES works with UTF-16 code units instead of Unicode code points (it's just a red herring!), but because graphemes (that is, what a human perceives as a "character") may span multiple code units and/or code points.

The example is deceptively simple. Truncating a string is a hard problem and a high quality implementation would probably be language-specific to avoid problematic truncations like when a suffix changes the meaning of a prefix; it would also take special characters into account, say you do not want the last character before the "..." to be an open quote mark, and if the string is 101 characters ending in "..." turning that into a string of 103 characters ending in "....." would also be silly.

Another issue that is often ignored is that you might want to use the truncated text in combination with other text, say in a HTML document with a "more" or "permalink" or some such link after it. Something like

<p>ABC &#x202E; DEF &#x202C; GHI <a href='...'>more</a></p>
<p>ABC &#x202E; DEF ...          <a href='...'>more</a></p>

The second paragraph will render "ABC erom ... FED" because the control character that restores the bidirectional text state got lost when the string was truncated. These are all issues that counting graphemes instead of 16-bit units does not address and it is not clear to me that it would actually be an improvement.
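To make the lost-control problem concrete, here is a rough sketch of a truncation that re-closes any directional embedding or override left open by the cut. It is an assumption-laden illustration: the helper `truncateClosingBidi` is hypothetical, it still cuts on code units as in the original example, and it only knows about LRE, RLE, LRO, RLO and PDF (not the isolate controls, and not markup).

```javascript
// Assumption: only the explicit embedding/override controls are handled
// (LRE U+202A, RLE U+202B, LRO U+202D, RLO U+202E, closed by PDF U+202C).
const OPENERS = new Set(["\u202A", "\u202B", "\u202D", "\u202E"]);
const PDF = "\u202C";

// Hypothetical helper: cut after maxUnits code units, then close any
// embedding or override still open before appending the ellipsis.
function truncateClosingBidi(text, maxUnits, ellipsis = "...") {
  if (text.length <= maxUnits) return text;
  const prefix = text.substring(0, maxUnits);
  let open = 0;
  for (const ch of prefix) {
    if (OPENERS.has(ch)) open += 1;
    else if (ch === PDF && open > 0) open -= 1;
  }
  return prefix + PDF.repeat(open) + ellipsis;
}

const headline = "ABC \u202E DEF \u202C GHI";
// The cut falls before the PDF, so one is re-inserted ahead of the "...".
console.log(JSON.stringify(truncateClosingBidi(headline, 10)));
```

This addresses only one of the issues listed; the language-specific and punctuation-related problems remain.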

"User-perceived character" is not an intuitive notion especially once you leave the realm of letters from a familiar script. In a string that contains 1 user-perceived character, what is the maximum number of zero-width spaces in that string? The maximum number of scalar values? What is the maximum width and maximum height of such a string when rendered, the maximum number of UTF-8 bytes needed to encode such a string? Should one perceive a horizontal ellipsis as three characters, or is it just one? How many are two thin spaces?

My smartphone comes with a "News" application that displays the latest headlines from various news sources and links to corresponding articles. If you use it for a day or two you will notice that it's not of German design, but one for a language that uses fewer or narrower "grapheme clusters per unit of information" if you will. Many of the headlines do not convey what the article might be about. A current example is 'Codename "Lustre" - Frankreich liefert' which is roughly 'code name "lustre" - France supplies' ... what? What does France supply? Or "Dortmund droht historische Pleite im" roughly "Dortmund faces historic ... in" where "Pleite" could be "bankruptcy", "defeat", "failure", ... could be sport, could be finance, can't tell.

That makes the application rather frustrating to use with German news. I imagine it works better with English headlines which tend to use fewer grapheme clusters. So truncating news headlines after a certain number of grapheme clusters untailored to the specific script and language is not the "right" design choice. Actually, it might be truncated by pixel measures because there is a visual space to fit, but English and German are very similar in their pixels per grapheme cluster metrics...

So it seems rather unlikely for someone to say "so we need the first 100 extended grapheme clusters as defined in UAX #29 of the string" and then someone responding "yes, that is clearly the right solution".

# Norbert Lindenberg (12 years ago)

On Oct 26, 2013, at 5:39 , Bjoern Hoehrmann <derhoermi at gmx.net> wrote:

If you have a regular expression over an alphabet like "Unicode scalar values" it is easy to turn it into an equivalent regular expression over an alphabet like "UTF-16 code units". I have written a Perl module that does it for UTF-8, search.cpan.org/dist/Unicode-SetAutomaton/; Russ Cox's swtch.com/~rsc/regexp/regexp3.html#step3 is a popular implementation. In effect it is still as though the implementation used Unicode scalar values, but that would be true of any implementation. It is much harder to implement something like this for other encodings like UTF-7 and Punycode.

It is useful to keep in mind features like character classes are just syntactic sugar and can be decomposed into regular expression primitives like a choice listing each member of the character class as literal. The . is just a large character class, and flags like //i just transform parts of an expression where /a/i becomes something more like /a|A/.

OK, if Jason's comment was meant to say that RegExp searches specified based on code points can be implemented through an equivalent search based on code units, then that's correct. I was assuming that we're discussing API design, and requiring developers to provide those equivalent UTF-16 based regular expressions (as the current RegExp design does) is a recipe for breakage whenever supplementary characters are involved.

Norbert

# Norbert Lindenberg (12 years ago)

On Oct 26, 2013, at 6:58 , Jason Orendorff <jason.orendorff at gmail.com> wrote:

I'd like to know what you have in mind regarding quantifiers though.

When I write /💩{2}/, I mean /💩💩/, but the current code unit based RegExp will interpret it as /💩\uDCA9/, which can't match any well-formed UTF-16 string.
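The difference can be observed directly in an engine that supports the ES2015 `u` flag (an assumption; no such flag was available when this was written). The patterns below use escapes in place of the literal 💩 (U+1F4A9, surrogate pair D83D DCA9).

```javascript
const twoPoo = "\u{1F4A9}\u{1F4A9}";          // "💩💩"
const illFormed = "\uD83D\uDCA9\uDCA9";       // "💩" followed by a lone low surrogate

// Without the u flag the quantifier binds to the last code unit only,
// so the pattern means: D83D followed by two DCA9.
const codeUnitBased = /^\uD83D\uDCA9{2}$/;

// With the u flag the quantifier binds to the whole code point.
const codePointBased = /^\u{1F4A9}{2}$/u;

console.log(codeUnitBased.test(twoPoo));     // false
console.log(codeUnitBased.test(illFormed));  // true
console.log(codePointBased.test(twoPoo));    // true
console.log(codePointBased.test(illFormed)); // false
```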

Norbert

# Mathias Bynens (12 years ago)

On 26 Oct 2013, at 14:39, Bjoern Hoehrmann <derhoermi at gmx.net> wrote:

If you have a regular expression over an alphabet like "Unicode scalar values" it is easy to turn it into an equivalent regular expression over an alphabet like "UTF-16 code units".

FWIW, Regenerate (mths.be/regenerate) is a JavaScript library that can be used for this. A few examples from mathiasbynens.be/notes/javascript-unicode#regex:

Here, a regular expression is created that matches any Unicode scalar value:

>> regenerate()
     .addRange(0x0, 0x10FFFF)     // all Unicode code points
     .removeRange(0xD800, 0xDBFF) // minus high surrogates
     .removeRange(0xDC00, 0xDFFF) // minus low surrogates
     .toRegExp()
/[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/

Similarly, to polyfill . in a Unicode-enabled ES6 regex:

When the `u` flag is set, `.` is equivalent to the following backwards-compatible regular expression pattern:

>> regenerate()
     .addRange(0x0, 0x10FFFF) // all Unicode code points
     .remove(  // minus `LineTerminator`s (http://ecma-international.org/ecma-262/5.1/#sec-7.3):
       0x000A, // Line Feed <LF>
       0x000D, // Carriage Return <CR>
       0x2028, // Line Separator <LS>
       0x2029  // Paragraph Separator <PS>
     )
     .toString();
'[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'

>> /foo(?:[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])bar/.test('foo💩bar')
true
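As a quick check of the two generated patterns quoted above, the sketch below uses them without the `u` flag and compares the second one against the native `.` under the `u` flag. It assumes an engine with ES2015 `u` flag support for the comparison; the variable names are illustrative only.

```javascript
// Pattern quoted above for "any Unicode scalar value", used on code units.
const scalarValue = /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
const text = "a\u{1F4A9}b"; // "a💩b"
console.log(text.length);                    // 4 UTF-16 code units
console.log(text.match(scalarValue).length); // 3 scalar values

// Pattern quoted above as the backwards-compatible equivalent of `.`.
const dotSource = String.raw`[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]`;
const polyfilledDot = new RegExp("^(?:" + dotSource + ")$"); // no u flag
const nativeDot = /^.$/u; // assumption: the engine supports the u flag

const inputs = ["a", "\u{1F4A9}", "\n", "\u2028", "\uD83D", "\uDCA9", "ab"];
for (const s of inputs) {
  console.log(JSON.stringify(s), polyfilledDot.test(s) === nativeDot.test(s));
}
```

On these inputs the polyfilled pattern and the native `u`-mode dot agree, including for lone surrogates and line terminators.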