Norbert Lindenberg (2013-10-27T03:05:00.000Z)
domenic at domenicdenicola.com (2013-10-28T14:53:08.115Z)
On Oct 26, 2013, at 5:39 , Bjoern Hoehrmann <derhoermi at gmx.net> wrote:

> If you have a regular expression over an alphabet like "Unicode scalar
> values" it is easy to turn it into an equivalent regular expression over
> an alphabet like "UTF-16 code units". I have written a Perl module that
> does it for UTF-8, http://search.cpan.org/dist/Unicode-SetAutomaton/;
> Russ Cox's http://swtch.com/~rsc/regexp/regexp3.html#step3 is a popular
> implementation. In effect it is still as though the implementation used
> Unicode scalar values, but that would be true of any implementation. It
> is much harder to implement something like this for other encodings like
> UTF-7 and Punycode.
>
> It is useful to keep in mind that features like character classes are just
> syntactic sugar and can be decomposed into regular expression primitives,
> such as a choice listing each member of the character class as a literal.
> The `.` is just a large character class, and flags like //i just transform
> parts of an expression, where /a/i becomes something more like /a|A/.

OK, if Jason's comment was meant to say that RegExp searches specified in terms of code points can be implemented through an equivalent search based on code units, then that's correct. I was assuming that we're discussing API design, and requiring developers to provide those equivalent UTF-16-based regular expressions (as the current RegExp design does) is a recipe for breakage whenever supplementary characters are involved.
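A minimal sketch of the breakage in question, runnable in any ES5 engine. The emoji below (U+1F600, a supplementary character) is one code point but two UTF-16 code units, so `.` in a plain RegExp fails to match it; the hand-written surrogate-pair alternation is the kind of "equivalent UTF-16 based" pattern developers would be forced to write. (The `u` flag, which later shipped in ES2015, makes the engine do this translation itself.)

```javascript
// U+1F600 GRINNING FACE, written as its UTF-16 surrogate pair.
var emoji = "\uD83D\uDE00";

// Two code units, one code point:
console.log(emoji.length);            // 2

// `.` matches a single code unit, so "exactly one character" fails:
console.log(/^.$/.test(emoji));       // false

// The equivalent code-unit-based pattern a developer must write by hand:
// either one BMP code unit outside the surrogate range, or a
// high-surrogate followed by a low-surrogate.
var onePoint = /^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])$/;
console.log(onePoint.test(emoji));    // true
```

This illustrates the API-design point: the code-unit formulation is always possible, but expecting every developer to derive `onePoint` by hand whenever supplementary characters may appear is what invites the breakage.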