ES Discuss - Message History

Bjoern Hoehrmann (2013-10-26T12:39:02.000Z)

* Norbert Lindenberg wrote:
>On Oct 25, 2013, at 18:35 , Jason Orendorff <jason.orendorff at gmail.com> wrote:
>
>> UTF-16 is designed so that you can search based on code units
>> alone, without computing boundaries. RegExp searches fall in this
>> category.
>
>Not if the RegExp is case insensitive, or uses a character class, or ".", or a
>quantifier - these all require looking at code points rather than UTF-16 code
>units in order to support the full Unicode character set.

If you have a regular expression over an alphabet like "Unicode scalar
values", it is easy to turn it into an equivalent regular expression over
an alphabet like "UTF-16 code units". I have written a Perl module that
does this for UTF-8, <http://search.cpan.org/dist/Unicode-SetAutomaton/>;
the one Russ Cox describes at
<http://swtch.com/~rsc/regexp/regexp3.html#step3> is a popular
implementation. In effect it is still as though the implementation used
Unicode scalar values, but that would be true of any implementation. It
is much harder to implement something like this for other encodings like
UTF-7 and Punycode.
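
To make the rewriting step concrete, here is a minimal sketch in Python
of how a range of scalar values could be turned into a pattern over
UTF-16 code units. It is not the algorithm used by the Perl module or by
Russ Cox's code; the function names (utf16_pattern, code_units) are
hypothetical, and the subject string is modelled as a Python str holding
one code unit per character so that the standard re module can stand in
for a code-unit-based matcher.

```python
import re

def esc(unit):
    # Write one 16-bit code unit as a \uXXXX escape understood by re.
    return "\\u%04X" % unit

def cls(lo, hi):
    # A single code unit, or a class covering a range of code units.
    return esc(lo) if lo == hi else "[%s-%s]" % (esc(lo), esc(hi))

def surrogates(cp):
    # Split a supplementary scalar value into (high, low) surrogates.
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def utf16_pattern(lo, hi):
    # Hypothetical helper: pattern over code units that matches exactly
    # the scalar values in [lo, hi] (surrogate code points are skipped).
    alts = []
    for a, b in ((0x0000, 0xD7FF), (0xE000, 0xFFFF)):
        s, e = max(lo, a), min(hi, b)
        if s <= e:
            alts.append(cls(s, e))          # BMP: one code unit each
    s, e = max(lo, 0x10000), min(hi, 0x10FFFF)
    if s <= e:
        h1, l1 = surrogates(s)
        h2, l2 = surrogates(e)
        if h1 == h2:
            alts.append(esc(h1) + cls(l1, l2))
        else:
            alts.append(esc(h1) + cls(l1, 0xDFFF))
            if h1 + 1 <= h2 - 1:
                alts.append(cls(h1 + 1, h2 - 1) + cls(0xDC00, 0xDFFF))
            alts.append(esc(h2) + cls(0xDC00, l2))
    return "(?:" + "|".join(alts) + ")"

def code_units(text):
    # Model a UTF-16 string: one Python character per 16-bit code unit.
    data = text.encode("utf-16-be")
    return "".join(chr(int.from_bytes(data[i:i + 2], "big"))
                   for i in range(0, len(data), 2))

# U+1F600..U+1F64F shares one high surrogate, so one alternative suffices.
emoticons = utf16_pattern(0x1F600, 0x1F64F)
print(emoticons)
print(bool(re.fullmatch(emoticons, code_units("\U0001F600"))))
```

The matcher never computes code point boundaries; it only compares code
units, yet it accepts the same strings as the scalar-value range would.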

It is useful to keep in mind that features like character classes are
just syntactic sugar and can be decomposed into regular expression
primitives, such as a choice listing each member of the character class
as a literal. The `.` is just a large character class, and flags like //i
just transform parts of an expression, so that /a/i becomes something
more like /a|A/.
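
As a rough illustration of that desugaring, the Python sketch below
rewrites a character class as a choice of literals and expands a
case-insensitive literal into per-character choices. The helper names
are hypothetical, and the case handling is a deliberate simplification:
it only considers single-character results of lower() and upper(), which
is not the full case-folding rule a real engine would apply.

```python
import re

def class_to_choice(members):
    # [abc] becomes (?:a|b|c): a choice listing each member as a literal.
    return "(?:" + "|".join(re.escape(m) for m in members) + ")"

def ignore_case_to_choice(literal):
    # /a/i becomes something like /a|A/, applied character by character.
    # Assumption: only one-character lower/upper variants are considered.
    parts = []
    for ch in literal:
        variants = [ch]
        for v in (ch.lower(), ch.upper()):
            if len(v) == 1 and v not in variants:
                variants.append(v)
        if len(variants) > 1:
            parts.append(class_to_choice(variants))
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

print(class_to_choice("abc"))
print(ignore_case_to_choice("a"))
```

Once every class and flag has been rewritten this way, what remains is a
pattern built only from literals, choice, concatenation and repetition,
which is the form that an alphabet change operates on.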
-- 
Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
