ES Discuss - Message History

Marja Hölttä (2015-01-28T10:36:05.000Z)


Hello es-discuss,

TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?

The ES6 unicode regexp spec is not very clear regarding what should happen
if the regexp or the matched string contains lonely surrogates (a lead
surrogate without a trail, or a trail without a lead). For example, for the
. operator, the relevant parts of the spec speak about characters:

https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

E.g.,
“Let A be the set of all *characters* except LineTerminator.”
“Let ch be the *character* Input[e].”

But is a lonely surrogate a character? According to the Unicode standard,
it’s not. If it's not, what will ch be if the input string contains a
lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in /u regexps?

E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed?
Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in /u regexps.

If users actually want to match lonely surrogates (e.g., to check for them
or remove them) then they can use non-/u regexps.

The regexp syntax treats a lonely surrogate as a normal unicode escape, and
the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u
Hex4Digits evaluates as follows: Return the character whose code is the SV
of Hex4Digits." - it's also unclear what this means if no valid character
has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should
it match .? Should it match [^a] ? (Or is it undefined behavior?)

Test cases:
/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
// Should the backreference be allowed to match the lead surrogate of a
// surrogate pair?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ?
// Should we allow splitting the surrogate pair like this?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ??

Suggestion: a lonely surrogate should not be a character and it should not
match . or [^a] etc. However, a lonely surrogate in the input string
shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for .
gets complicated and difficult / slow to implement: . should not match
individual surrogates inside a surrogate pair, but if it has to match a
lonely surrogate, we'll end up needing lookahead and lookbehind logic to
implement that behavior.

For example, the current version of Mathias’s ES6 Unicode regular
expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into
/a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
and afaics it’s not yet fully consistent wrt lonely surrogates, so a
consistent implementation is going to be more complex than this.

If we convert the string into UTF-32 before matching, then the "lonely
surrogate is a character" behavior gets easier to implement, but we
wouldn't want to be forced to do that. The intention behind the ES6 spec
seems to be that strings can / should still be stored as UTF-16. Converting
strings to UTF-32 before matching with /u regexps would require an
additional pass over the string which we'd want to avoid, and converting
only when strictly needed for the "lonely surrogate is a character"
implementation adds complexity. E.g., with some regexps we don't need to
scan the whole input string to find a match, and also most input strings,
even for /u regexps, probably won't contain surrogates (to find that out
we'd also need to scan the whole string, or some logic to fall back to
UTF-32 matching when we see a surrogate).

BR,
Marja

mathias at qiwi.be (2015-01-28T10:48:04.271Z)

Hello es-discuss,

TL;DR: `/foo.bar/u.test("foo\uD83Dbar") == ?`

The ES6 unicode regexp spec is not very clear regarding what should happen
if the regexp or the matched string contains lonely surrogates (a lead
surrogate without a trail, or a trail without a lead). For example, for the
. operator, the relevant parts of the spec speak about characters:

https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

E.g.,
“Let A be the set of all *characters* except LineTerminator.”
“Let ch be the *character* Input[e].”

But is a lonely surrogate a character? According to the Unicode standard,
it’s not. If it's not, what will ch be if the input string contains a
lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in `/u` regexps?

E.g., `/foo\uD83D/u`; (note lonely lead surrogate), should this be allowed?
Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in `/u` regexps.

If users actually want to match lonely surrogates (e.g., to check for them
or remove them) then they can use non-`/u` regexps.
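A minimal sketch of that workaround, assuming the goal is to detect or strip lonely surrogates with a non-`/u` regexp. The helper names `stripLoneSurrogates` and `hasLoneSurrogate` are hypothetical, made up for illustration:

```javascript
// Hypothetical helpers, not part of any spec or library.
// Without the u flag the pattern works on UTF-16 code units, so a lonely
// surrogate can be addressed directly.
const PAIR_OR_LONE = /([\uD800-\uDBFF][\uDC00-\uDFFF])|[\uD800-\uDFFF]/g;

function stripLoneSurrogates(str) {
  // Well-formed pairs are captured and kept via $1. A lonely surrogate
  // leaves group 1 unmatched, so it is replaced with the empty string.
  return str.replace(PAIR_OR_LONE, "$1");
}

function hasLoneSurrogate(str) {
  // The string changes under stripping only if it contained a lonely surrogate.
  return stripLoneSurrogates(str) !== str;
}
```

The pair alternative comes first so that the two halves of a valid surrogate pair are never treated as lonely.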

The regexp syntax treats a lonely surrogate as a normal unicode escape, and
the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u
Hex4Digits evaluates as follows: Return the character whose code is the SV
of Hex4Digits." - it's also unclear what this means if no valid character
has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should
it match `.`? Should it match `[^a]` ? (Or is it undefined behavior?)

Test cases:

```
/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
// Should the backreference be allowed to match the lead surrogate of a
// surrogate pair?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ?
// Should we allow splitting the surrogate pair like this?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ??
```
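To see what a given engine currently does with these cases, they can be run as a small probe. This is only a sketch: it reports whatever the host implementation returns and deliberately asserts no expected answers, since those are exactly what is in question here. It assumes an engine that already parses the `u` flag.

```javascript
// Each entry pairs a /u regexp from the list above with its input string.
const cases = [
  [/foo.bar/u, "foo\uD83Dbar"],
  [/foo.bar/u, "foo\uDC00bar"],
  [/foo[^a]bar/u, "foo\uD83Dbar"],
  [/foo[^a]bar/u, "foo\uDC00bar"],
  [/foo/u, "bar\uD83Dbarfoo"],
  [/foo/u, "bar\uDC00barfoo"],
  [/foo(.*)bar\1/u, "foo\uD834bar\uD834\uDC00"],
  [/^(.+)\1$/u, "\uDC00foobar\uD83D\uDC00foobar\uD83D"],
];

// Record what this engine answers for each case.
const results = cases.map(([re, input]) => re.test(input));

cases.forEach(([re, input], i) => {
  // JSON.stringify keeps lonely surrogates readable as \uXXXX escapes
  // on engines with well-formed JSON.stringify.
  console.log(String(re), JSON.stringify(input), "->", results[i]);
});
```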

Suggestion: a lonely surrogate should not be a character and it should not
match `.` or `[^a]` etc. However, a lonely surrogate in the input string
shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for .
gets complicated and difficult / slow to implement: . should not match
individual surrogates inside a surrogate pair, but if it has to match a
lonely surrogate, we'll end up needing lookahead and lookbehind logic to
implement that behavior.
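A rough sketch of the per-position logic this implies for `.` under the "lonely surrogate is a character" reading. The function `dotMatchLength` and its helpers are hypothetical names for illustration, not anything from the spec or an engine:

```javascript
const isLead = (u) => u >= 0xD800 && u <= 0xDBFF;
const isTrail = (u) => u >= 0xDC00 && u <= 0xDFFF;
const isLineTerminator = (u) =>
  u === 0x000A || u === 0x000D || u === 0x2028 || u === 0x2029;

// Returns how many UTF-16 code units `.` would consume at index i,
// or 0 if `.` cannot match at that position.
function dotMatchLength(str, i) {
  if (i < 0 || i >= str.length) return 0;
  const unit = str.charCodeAt(i);
  if (isLead(unit)) {
    // Lookahead: a lead followed by a trail is one two-unit character,
    // otherwise the lead is lonely and counts as a character by itself.
    return i + 1 < str.length && isTrail(str.charCodeAt(i + 1)) ? 2 : 1;
  }
  if (isTrail(unit)) {
    // Lookbehind: a trail directly after a lead is the second half of a
    // pair, so `.` must not match in the middle of that pair.
    return i > 0 && isLead(str.charCodeAt(i - 1)) ? 0 : 1;
  }
  return isLineTerminator(unit) ? 0 : 1;
}
```

Both surrogate branches have to inspect a neighbouring code unit, which is the extra lookahead / lookbehind cost described above.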

For example, the current version of Mathias’s ES6 Unicode regular
expression transpiler ( https://mothereff.in/regexpu ) converts `/a.b/u` into
`/a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/`
and afaics it’s not yet fully consistent wrt lonely surrogates, so a
consistent implementation is going to be more complex than this.
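One case that appears to show the inconsistency, running the transpiled pattern quoted above as an ordinary non-`/u` regexp. The variable name `transpiled` is just for illustration:

```javascript
// The transpiler output quoted above, copied verbatim.
const transpiled = /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/;

// A full surrogate pair between a and b is matched by the second alternative.
const pairResult = transpiled.test("a\uD83D\uDC00b");

// A lonely lead is matched by the third alternative (negative lookahead).
const loneLeadResult = transpiled.test("a\uD83Db");

// A lonely trail directly after "a" is not matched: the last alternative
// wants to consume the preceding non-lead code unit itself, but that unit
// is the "a" the pattern has already consumed.
const loneTrailResult = transpiled.test("a\uDC00b");

console.log(pairResult, loneLeadResult, loneTrailResult); // true true false
```

So with this output a lonely lead and a lonely trail in the same position are treated differently, because the "not preceded by a lead" check is emulated by consuming a code unit rather than by a real lookbehind.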

If we convert the string into UTF-32 before matching, then the "lonely
surrogate is a character" behavior gets easier to implement, but we
wouldn't want to be forced to do that. The intention behind the ES6 spec
seems to be that strings can / should still be stored as UTF-16. Converting
strings to UTF-32 before matching with `/u` regexps would require an
additional pass over the string which we'd want to avoid, and converting
only when strictly needed for the "lonely surrogate is a character"
implementation adds complexity. E.g., with some regexps we don't need to
scan the whole input string to find a match, and also most input strings,
even for `/u` regexps, probably won't contain surrogates (to find that out
we'd also need to scan the whole string, or some logic to fall back to
UTF-32 matching when we see a surrogate).
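The two options can be sketched in user-level JavaScript, purely to illustrate the trade-off. A real engine would do this internally and differently, and the names `toCodePoints` and `firstSurrogateIndex` are hypothetical:

```javascript
// Eager approach: decode the whole string into an array of code points
// (effectively a 32-bit representation). This always costs one full pass
// over the input, even if no surrogate is present.
function toCodePoints(str) {
  // String iteration yields a whole surrogate pair as one element and a
  // lonely surrogate as an element of its own.
  return Array.from(str, (ch) => ch.codePointAt(0));
}

// Lazy approach: find the first surrogate code unit, if any, so a matcher
// could stay on the plain 16-bit path up to that index and only fall back
// to the slower logic from there.
function firstSurrogateIndex(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) return i;
  }
  return -1; // no surrogates at all
}
```

Note that the lazy scan is itself a pass over the string whenever the input contains no surrogates, which is the common case mentioned above.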

BR,
Marja
