Marja Hölttä (2015-01-28T10:36:05.000Z)
mathias at qiwi.be (2015-01-28T10:48:04.271Z)
Hello es-discuss,

TL;DR: `/foo.bar/u.test("foo\uD83Dbar") == ?`

The ES6 Unicode regexp spec is not very clear about what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the `.` operator, the relevant parts of the spec speak about characters:

https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

E.g.,

“Let A be the set of all *characters* except LineTerminator.”
“Let ch be the *character* Input[e].”

But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in `/u` regexps?

E.g., `/foo\uD83D/u` (note the lonely lead surrogate): should this be allowed? Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in `/u` regexps. If users actually want to match lonely surrogates (e.g., to check for them or remove them), they can use non-`/u` regexps.

The regexp syntax treats a lonely surrogate as a normal Unicode escape, and the rules say, e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." It's also unclear what this means if no valid character has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should it match `.`? Should it match `[^a]`? (Or is it undefined behavior?)

Test cases:

```
/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ?
// Should the backreference be allowed to match the lead surrogate of a surrogate pair?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ?
// Should we allow splitting the surrogate pair like this?
```

Suggestion: a lonely surrogate should not be a character, and it should not match `.` or `[^a]` etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for `.` gets complicated and difficult / slow to implement: `.` should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.

For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts `/a.b/u` into `/a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/`, and as far as I can see it’s not yet fully consistent with respect to lonely surrogates, so a consistent implementation is going to be more complex than this.

If we convert the string into UTF-32 before matching, then the "lonely surrogate is a character" behavior gets easier to implement, but we wouldn't want to be forced to do that. The intention behind the ES6 spec seems to be that strings can / should still be stored as UTF-16. Converting strings to UTF-32 before matching with `/u` regexps would require an additional pass over the string, which we'd want to avoid, and converting only when strictly needed for the "lonely surrogate is a character" implementation adds complexity. E.g., with some regexps we don't need to scan the whole input string to find a match, and most input strings, even for `/u` regexps, probably won't contain surrogates (and to find that out we'd also need either to scan the whole string or to add some logic to fall back to UTF-32 matching when we see a surrogate).

BR,
Marja
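
A minimal sketch of how the transpiled non-`/u` pattern quoted above treats lonely surrogates between `a` and `b`. The pattern is copied verbatim from the message. The sample strings and the names `transpiled`, `samples`, and `results` are assumptions made for illustration only; they are not part of the original test cases.

```javascript
// Non-/u pattern quoted in the message as the transpiler's output for /a.b/u.
const transpiled = /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/;

// Hypothetical sample inputs (chosen for this sketch, not taken from the post).
const samples = {
  bmp: "axb",              // ordinary BMP character between a and b
  pair: "a\uD83D\uDCA9b",  // well-formed surrogate pair
  loneLead: "a\uD83Db",    // lead surrogate without a trail
  loneTrail: "a\uDC00b",   // trail surrogate without a lead
};

// Run the pattern against every sample and collect the boolean results.
const results = {};
for (const [name, input] of Object.entries(samples)) {
  results[name] = transpiled.test(input);
}

console.log(results);
// { bmp: true, pair: true, loneLead: true, loneTrail: false }
```

With these inputs the lonely lead surrogate matches through the negative-lookahead alternative, while the lonely trail surrogate does not: the last alternative has to consume a preceding non-lead code unit (or sit at the start of the input), and here that code unit is the `a` the pattern has already matched. This appears to be one instance of the inconsistency mentioned above, and it suggests why a fully consistent translation would need lookbehind-style logic.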