Bjoern Hoehrmann (2013-10-26T15:09:00.000Z)
domenic at domenicdenicola.com (2013-10-28T14:52:36.834Z)
Claude Pache wrote:

> You might know that the following ES expressions are broken:
>
> ```js
> text.charAt(0) // get the first character of the text
> text.length > 100 ? text.substring(0,100) + '...' : text // cut the text after 100 characters
> ```
>
> The reason is *not* because ES works with UTF-16 code units instead of
> Unicode code points (it's just a red herring!), but because _graphemes_
> (that is, what a human perceives as a "character") may span multiple
> code units and/or code points.

The example is deceptively simple. Truncating a string is a hard problem, and a high-quality implementation would probably be language-specific, to avoid problematic truncations such as when a suffix changes the meaning of a prefix. It would also take special characters into account: you do not want the last character before the "..." to be an opening quotation mark, and if the string is 101 characters ending in "...", turning that into a string of 103 characters ending in "....." would also be silly.

Another issue that is often ignored is that you might want to use the truncated text in combination with other text, say in an HTML document with a "more" or "permalink" or some such link after it. Something like

```html
<p>ABC &#x202E; DEF &#x202C; GHI <a href='...'>more</a></p>
<p>ABC &#x202E; DEF ... <a href='...'>more</a></p>
```

The second paragraph will render "ABC erom ... FED" because the control character that restores the bidirectional text state (U+202C, written `&#x202C;` above) got lost when the string was truncated.

These are all issues that counting graphemes instead of 16-bit units does not address, and it is not clear to me that it would actually be an improvement.

"User-perceived character" is not an intuitive notion, especially once you leave the realm of letters from a familiar script. In a string that contains 1 user-perceived character, what is the maximum number of zero-width spaces in that string? The maximum number of scalar values?
What is the maximum width and maximum height of such a string when rendered, or the maximum number of UTF-8 bytes needed to encode it? Should one perceive a horizontal ellipsis as three characters, or is it just one? How many are two thin spaces?

My smartphone comes with a "News" application that displays the latest headlines from various news sources and links to the corresponding articles. If you use it for a day or two you will notice that it is not of German design, but designed for a language that uses fewer or narrower "grapheme clusters per unit of information", if you will. Many of the headlines do not convey what the article might be about. A current example is 'Codename "Lustre" - Frankreich liefert', which is roughly 'code name "lustre" - France supplies' ... what? What does France supply? Or "Dortmund droht historische Pleite im", roughly "Dortmund faces historic ... in", where "Pleite" could be "bankruptcy", "defeat", "failure", ... could be sport, could be finance, can't tell.

That makes the application rather frustrating to use with German news. I imagine it works better with English headlines, which tend to use fewer grapheme clusters. So truncating news headlines after a certain number of grapheme clusters, untailored to the specific script and language, is not the "right" design choice. Actually, the headlines might be truncated by pixel measures because there is a visual space to fit, but English and German are very similar in their pixels-per-grapheme-cluster metrics...

So it seems rather unlikely for someone to say "so we need the first 100 extended grapheme clusters as defined in UAX #29 of the string" and then for someone to respond "yes, that is clearly the right solution".
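
To make the bidirectional example concrete, here is a minimal sketch, assuming a runtime that provides `Intl.Segmenter` (an API that postdates this thread). The helpers `truncateGraphemes` and `unclosedBidiControls` are hypothetical names invented for illustration, and the check only covers the embedding and override controls U+202A, U+202B, U+202D and U+202E, not the isolate controls. It cuts the sample string after a number of grapheme clusters and shows that the closing U+202C is still lost.

```js
// Hypothetical helper (not from the thread): keep at most `limit` extended
// grapheme clusters and append "..." when something was cut off.
function truncateGraphemes(text, limit) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const clusters = Array.from(segmenter.segment(text), (s) => s.segment);
  return clusters.length > limit
    ? clusters.slice(0, limit).join("") + "..."
    : text;
}

// Hypothetical helper: count embeddings/overrides (U+202A, U+202B, U+202D,
// U+202E) that are opened but never closed by U+202C
// (POP DIRECTIONAL FORMATTING).
function unclosedBidiControls(text) {
  let depth = 0;
  for (const ch of text) {
    if (/[\u202A\u202B\u202D\u202E]/.test(ch)) depth++;
    else if (ch === "\u202C" && depth > 0) depth--;
  }
  return depth;
}

const original = "ABC \u202E DEF \u202C GHI";
const cut = truncateGraphemes(original, 10); // "ABC \u202E DEF ..."

console.log(unclosedBidiControls(original)); // 0
console.log(unclosedBidiControls(cut));      // 1: the override is never closed
```

Counting grapheme clusters rather than code units changes where the cut falls, but any text appended after `cut` is still displayed under the dangling override.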