Norbert Lindenberg (2013-10-27T03:05:00.000Z)
On Oct 26, 2013, at 5:39 , Bjoern Hoehrmann <derhoermi at gmx.net> wrote:

> * Norbert Lindenberg wrote:
>> On Oct 25, 2013, at 18:35 , Jason Orendorff <jason.orendorff at gmail.com> wrote:
>> 
>>> UTF-16 is designed so that you can search based on code units
>>> alone, without computing boundaries. RegExp searches fall in this
>>> category.
>> 
>> Not if the RegExp is case insensitive, or uses a character class, or ".", or a
>> quantifier - these all require looking at code points rather than UTF-16 code
>> units in order to support the full Unicode character set.
> 
> If you have a regular expression over an alphabet like "Unicode scalar
> values" it is easy to turn it into an equivalent regular expression over
> an alphabet like "UTF-16 code units". I have written a Perl module that
> does it for UTF-8, <http://search.cpan.org/dist/Unicode-SetAutomaton/>;
> Russ Cox's http://swtch.com/~rsc/regexp/regexp3.html#step3 is a popular
> implementation. In effect it is still as though the implementation used
> Unicode scalar values, but that would be true of any implementation. It
> is much harder to implement something like this for other encodings like
> UTF-7 and Punycode.
> 
> It is useful to keep in mind features like character classes are just
> syntactic sugar and can be decomposed into regular expression primitives
> like a choice listing each member of the character class as literal. The
> `.` is just a large character class, and flags like //i just transform
> parts of an expression where /a/i becomes something more like /a|A/.

OK, if Jason's comment was meant to say that RegExp searches specified based on code points can be implemented through an equivalent search based on code units, then that's correct. I was assuming that we're discussing API design, and requiring developers to provide those equivalent UTF-16 based regular expressions (as the current RegExp design does) is a recipe for breakage whenever supplementary characters are involved.
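To make the breakage concrete: under the current code-unit-based design, `.` consumes one UTF-16 code unit, so a supplementary character (two code units) is not matched as a single "character" unless the developer hand-writes the surrogate-pair alternative. A minimal sketch:

```javascript
var poo = "\uD83D\uDCA9"; // U+1F4A9 PILE OF POO: one code point, two code units

// Code-unit-based ".": fails to match the whole character,
// because "." consumes only one code unit.
console.log(/^.$/.test(poo)); // false

// The hand-written code-unit equivalent of "any code point":
// a BMP code unit outside the surrogate range, or a surrogate pair.
var anyCodePoint = /^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])$/;
console.log(anyCodePoint.test(poo)); // true
```

(ES2015 later built exactly this translation into the engine via the `u` flag, under which `/^.$/u.test(poo)` is true — the API-level fix argued for here.)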

Norbert