Bjoern Hoehrmann (2013-10-26T12:39:02.000Z)
domenic at domenicdenicola.com (2013-10-28T14:51:13.959Z)
Norbert Lindenberg wrote:

> Not if the RegExp is case insensitive, or uses a character class, or ".", or a
> quantifier - these all require looking at code points rather than UTF-16 code
> units in order to support the full Unicode character set.

If you have a regular expression over an alphabet like "Unicode scalar values", it is easy to turn it into an equivalent regular expression over an alphabet like "UTF-16 code units". I have written a Perl module that does it for UTF-8, http://search.cpan.org/dist/Unicode-SetAutomaton/; Russ Cox's http://swtch.com/~rsc/regexp/regexp3.html#step3 is a popular implementation. In effect it is still as though the implementation used Unicode scalar values, but that would be true of any implementation. It is much harder to implement something like this for other encodings like UTF-7 and Punycode.

It is useful to keep in mind that features like character classes are just syntactic sugar and can be decomposed into regular expression primitives, such as a choice listing each member of the character class as a literal. The `.` is just a large character class, and flags like `//i` just transform parts of an expression, so that `/a/i` becomes something more like `/a|A/`.
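
To make the translation concrete, here is a rough sketch of rewriting a range of scalar values as a pattern over UTF-16 code units. The helper names (`unit`, `unitRange`, `surrogates`, `rangeToUtf16`) are made up for illustration, and this is not how Unicode::SetAutomaton or the implementation Russ Cox describes actually work; it only handles a single contiguous range and assumes the caller passes scalar values with `lo <= hi`.

```javascript
// Hypothetical helpers for illustration; none of these names come from a library.

// Escape one UTF-16 code unit as \uXXXX for use in a RegExp source string.
function unit(u) {
  return "\\u" + u.toString(16).toUpperCase().padStart(4, "0");
}

// A contiguous range of code units, as a single escape or a character class.
function unitRange(lo, hi) {
  return lo === hi ? unit(lo) : "[" + unit(lo) + "-" + unit(hi) + "]";
}

// Split a supplementary code point (>= 0x10000) into [lead, trail] surrogates.
function surrogates(cp) {
  const v = cp - 0x10000;
  return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)];
}

// Rewrite the scalar value range [lo, hi] (lo <= hi assumed) as a pattern
// that matches the same characters one UTF-16 code unit at a time.
function rangeToUtf16(lo, hi) {
  const parts = [];
  // BMP scalar values are a single code unit; D800-DFFF are skipped because
  // surrogates are not scalar values.
  if (lo <= 0xD7FF) parts.push(unitRange(lo, Math.min(hi, 0xD7FF)));
  if (lo <= 0xFFFF && hi >= 0xE000) {
    parts.push(unitRange(Math.max(lo, 0xE000), Math.min(hi, 0xFFFF)));
  }
  if (hi >= 0x10000) {
    let [h1, l1] = surrogates(Math.max(lo, 0x10000));
    let [h2, l2] = surrogates(hi);
    if (h1 === h2) {
      // Same lead surrogate: only the trail surrogate varies.
      parts.push(unit(h1) + unitRange(l1, l2));
    } else {
      let tail = null;
      // Partial block under the first lead surrogate.
      if (l1 > 0xDC00) { parts.push(unit(h1) + unitRange(l1, 0xDFFF)); h1++; }
      // Partial block under the last lead surrogate.
      if (l2 < 0xDFFF) { tail = unit(h2) + unitRange(0xDC00, l2); h2--; }
      // Lead surrogates in between accept every trail surrogate.
      if (h1 <= h2) parts.push(unitRange(h1, h2) + unitRange(0xDC00, 0xDFFF));
      if (tail !== null) parts.push(tail);
    }
  }
  return "(?:" + parts.join("|") + ")";
}

const emoticons = rangeToUtf16(0x1F600, 0x1F64F);
console.log(emoticons); // (?:\uD83D[\uDE00-\uDE4F])
console.log(new RegExp("^" + emoticons + "$").test("\uD83D\uDE00")); // true
```

Because the result is wrapped in a non-capturing group, a quantifier applied to the original class can be applied to the group unchanged, and a `.` could in principle be built the same way from the ranges it covers.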