Q: Lonely surrogates and unicode regexps

# Marja Hölttä (10 years ago)

Hello es-discuss,

TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

people.mozilla.org/~jorendorff/es6-draft.html#sec-atom, people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation, people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

E.g., “Let A be the set of all characters except LineTerminator.” “Let ch be the character Input[e].”

But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in /u regexps?

E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in /u regexps.

If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps.
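For illustration (the pattern and name here are mine, not from the spec), a non-/u regexp, i.e., plain code unit semantics, can already detect lonely surrogates without lookbehind, at the cost of consuming the preceding code unit in the second alternative:

```javascript
// Illustrative sketch: detect lonely surrogates with code unit semantics.
// First alternative: a lead surrogate not followed by a trail.
// Second alternative: a trail surrogate not preceded by a lead (the
// preceding code unit, or start-of-string, is consumed to check this).
var LONE_SURROGATE =
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/;

LONE_SURROGATE.test("foo\uD83Dbar");        // true: lead without trail
LONE_SURROGATE.test("foo\uDC00bar");        // true: trail without lead
LONE_SURROGATE.test("foo\uD83D\uDC00bar");  // false: valid surrogate pair
```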

The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." - it's also unclear what this means if no valid character has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?)

Test cases:

/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ? // Should we allow splitting the surrogate pair like this?
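(For reference: under the decoding Allen describes downthread, where each valid pair and each lonely surrogate becomes one code point, engines that shipped /u answer these as follows.)

```javascript
// How shipped ES2015+ engines answer the test cases above:
console.log(/foo.bar/u.test("foo\uD83Dbar"));     // true: . matches a lone surrogate
console.log(/foo.bar/u.test("foo\uDC00bar"));     // true
console.log(/foo[^a]bar/u.test("foo\uD83Dbar"));  // true
console.log(/foo[^a]bar/u.test("foo\uDC00bar"));  // true
console.log(/foo/u.test("bar\uD83Dbarfoo"));      // true: a lone surrogate
                                                  // elsewhere doesn't block matching
console.log(/foo/u.test("bar\uDC00barfoo"));      // true
// A backreference may not take only the lead surrogate of a valid pair:
console.log(/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00"));         // false
console.log(/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D")); // false
```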

Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.

For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this.

If we convert the string into UTF-32 before matching, then the "lonely surrogate is a character" behavior gets easier to implement, but we wouldn't want to be forced to do that. The intention behind the ES6 spec seems to be that strings can / should still be stored as UTF-16. Converting strings to UTF-32 before matching with /u regexps would require an additional pass over the string, which we'd want to avoid, and converting only when strictly needed for the "lonely surrogate is a character" implementation adds complexity. E.g., with some regexps we don't need to scan the whole input string to find a match, and most input strings, even for /u regexps, probably won't contain surrogates (to find that out we'd also need to scan the whole string, or add logic to fall back to UTF-32 matching when we see a surrogate).

BR, Marja

# Mathias Bynens (10 years ago)

On 28 Jan 2015, at 11:36, Marja Hölttä <marja at chromium.org> wrote:

TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

people.mozilla.org/~jorendorff/es6-draft.html#sec-atom, people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation, people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

E.g., “Let A be the set of all characters except LineTerminator.” “Let ch be the character Input[e].”

But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?

Q1: Are lonely surrogates allowed in /u regexps?

E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair?

Suggestion: we shouldn't allow lonely surrogates in /u regexps.

If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps.

You’re proposing to define “characters” in terms of Unicode scalar values in the case /u is used. I could get behind that — it reinforces the idea that /u is like a strict mode for regular expressions.

Playing devil’s advocate, the problem is that regular expressions and strings go hand in hand, and there is no guarantee that JavaScript strings only consist of valid code points. Making . not match lone surrogates breaks the developer expectation that (.) matches every “part” of the string. Having to avoid /u to prevent this seems like a potentially bad thing.

The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." - it's also unclear what this means if no valid character has this code.

Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?)

Test cases:

/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ? // Should we allow splitting the surrogate pair like this?

Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.

If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.

For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this.

This is indeed an incomplete solution. The lack of lookbehind support in ES makes this hard to transpile correctly. Ideas welcome!

# Mark Davis ☕️ (10 years ago)

On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä <marja at chromium.org> wrote:

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

​Just a bit of terminology.

The term "character" is overloaded, so Unicode provides the unambiguous term "code point". For example, U+0378​ is not (currently) an encoded character according to Unicode, but it would certainly be a terrible idea to disregard it, or not match it. It is a reserved code point that may be assigned as an encoded character in the future. So both U+D83D and U+0378 are not characters.

If an ES spec uses the term "character" instead of "code point", then at some point in the text it needs to disambiguate what is meant.

As to how this should be handled in regex expressions, I'd suggest looking at Java's approach.

Mark google.com/+MarkDavis

— Il meglio è l’inimico del bene —

# Erik Corry (10 years ago)

On Wed, Jan 28, 2015 at 11:45 AM, Mathias Bynens <mathias at qiwi.be> wrote:

On 28 Jan 2015, at 11:36, Marja Hölttä <marja at chromium.org> wrote:

For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this.

This is indeed an incomplete solution. The lack of lookbehind support in ES makes this hard to transpile correctly. Ideas welcome!

I don't think your transpiler can work without lookbehind. If you could guarantee that none of your transpiled regexp matches a substring that ends in the middle of a pair, then I think you could get it right without lookbehind, but consider:

/(...)-\1./.test("TxL-TxLT");

Where L stands for a lead surrogate, and T stands for a trailing surrogate. There's no way to stop the backreference from swallowing the last L, and without lookbehind there is no way to stop the . from matching the final T. A second issue is having a match that starts in the middle of a pair. You could test for this after the matching if JS gave you the index of the match in the string, but I don't think it does.

Ignoring the start-of-match-in-the-middle-of-a-pair issue, and the backreferences case, I think you can do without lookbehind. Assuming the lonely-surrogates-are-a-character scenario, the period (.) transpiles to (ignore spaces added for readability):

(?: \L(?!\T) | \L\T | \T | [^\L\T\N])

where \L means leading surrogates, \T means trailing surrogates, \N means all newlines. Whatever comes before the . is not allowed to match half of a surrogate pair, so the bare \T alternative can only match a lonely trailing surrogate.

As an optimization, .x can transpile to (?: \L\T | . )x where the x stands in for any literal characters.

For a JS engine implementor, like Marja, it is of course possible to add 1-character negative lookbehind (\b already has elements of this). Then your in-engine transpiler turns . into

(?: \L(?!\T) | \L\T | (?<!\L)\T | [^\L\T\N])

Which is going to be truly horrible in terms of code size and performance. It's not like the period operator is a rare thing in a regexp, and other common things like [^a-z] and [^\d] will expand into similar horrors.

On the other hand, in the lonely-surrogates-match-nothing scenario, the . transpiles to

(?: \L\T | [^\L\T\N] )

which is quite a lot nicer and faster. In this scenario, .x expands to (?: \L\T | [^\L\T\N] )x which still has no lookaheads and lookbehinds.
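Spelling that out in plain ES5 surrogate ranges (my expansion of the \L, \T and \N shorthands, under the lonely-surrogates-match-nothing scenario):

```javascript
// /./u transpiled for the match-nothing scenario: either a full surrogate
// pair, or any code unit that is neither a surrogate nor a line terminator.
// No lookahead or lookbehind is needed.
var DOT_NO_LONE =
    /[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDFFF\n\r\u2028\u2029]/;

DOT_NO_LONE.test("a");             // true
DOT_NO_LONE.test("\uD834\uDC00");  // true: a pair matches as one unit
DOT_NO_LONE.test("\uD834");        // false: a lonely lead matches nothing
```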

# André Bargull (10 years ago)

On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä <marja at chromium.org> wrote:

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

​Just a bit of terminology.

The term "character" is overloaded, so Unicode provides the unambiguous term "code point". For example, U+0378​ is not (currently) an encoded character according to Unicode, but it would certainly be a terrible idea to disregard it, or not match it. It is a reserved code point that may be assigned as an encoded character in the future. So both U+D83D and U+0378 are not characters.

If an ES spec uses the term "character" instead of "code point", then at some point in the text it needs to disambiguate what is meant.

"character" is defined in 21.2.2 Pattern Semantics [1]:

In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point.

[1] people.mozilla.org/~jorendorff/es6-draft.html#sec

# Wes Garland (10 years ago)

Some interesting questions here.

1 - What is a character? Is it a Unicode Code Point?
2 - Should we be able to match all possible JS Strings?
3 - Should we be able to match all possible Unicode Strings?
4 - What do we do if there is a character in a String we cannot match?
5 - Do unmatchable characters match . ?
6 - Are subsections of unmatchable strings matchable if they contain only matchable characters?

It is important to remember in these discussions that the Unicode specification allows strings which contain unmatched surrogates. Do we want regular expressions that can't match some Unicode strings? Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate? How about reserved code points? What happens when they become assigned?

# Mark Davis ☕️ (10 years ago)

Good, that sounds right.

Mark google.com/+MarkDavis

— Il meglio è l’inimico del bene —

# Marja Hölttä (10 years ago)

For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was definitely not. I don't know what the correct mental model is where both Ex1 and Ex2 would make sense.

# Mark Davis ☕️ (10 years ago)

I think the cleanest mental model is where UTF-16 or UTF-8 strings are interpreted as if they were transformed into UTF-32.

While that is generally feasible, it often represents a cost in performance which is not acceptable in practice. So you see various approaches that involve some deviation from that mental model.

Mark google.com/+MarkDavis

— Il meglio è l’inimico del bene —

# André Bargull (10 years ago)

For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was definitely not. I don't know what the correct mental model is where both Ex1 and Ex2 would make sense.

java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2.

[1] hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890 [2] hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671

# André Bargull (10 years ago)

On 1/28/2015 2:51 PM, André Bargull wrote:

For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so, generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before matching, but Ex1 works as if it was definitely not. I don't know what the correct mental model is where both Ex1 and Ex2 would make sense.

java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2.

[1] hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890 [2] hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671

Err, the part about how patterns are read is not important here. What I should have written is that the input string is (also) read as a sequence of code points [3]. So in ex2 \uD834\uDC00 is read as a single code point (and not split into \uD834 and \uDC00 during backtracking).

[3] hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773

# Marja Hölttä (10 years ago)

Based on Ex1, looks like the input string is not read as a sequence of code points when we try to find a match for \1. So it's mostly read as a sequence of code points except when it's not. :/

# André Bargull (10 years ago)

On 1/28/2015 3:36 PM, Marja Hölttä wrote:

Based on Ex1, looks like the input string is not read as a sequence of code points when we try to find a match for \1. So it's mostly read as a sequence of code points except when it's not. :/

Yep, back references are matched as a sequence of code units. The first link I've posted points to the relevant method in java.util.regex.Pattern. I've got no idea why it's implemented that way, for example when you enable case-insensitive matching, back references are no longer matched as a sequence of code units:


int[] flags = { 0, Pattern.CASE_INSENSITIVE, Pattern.UNICODE_CASE,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE };

// Prints true, false, true, false
Arrays.stream(flags)
      .mapToObj(f -> Pattern.compile("foo(.+)bar\\1", f))
      .map(p -> p.matcher("foo\uD834bar\uD834\uDC00").find())
      .forEach(System.out::println);

# Allen Wirfs-Brock (10 years ago)

On Jan 28, 2015, at 2:36 AM, Marja Hölttä <marja at chromium.org> wrote:

Hello es-discuss,

TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

TL;DR: in a unicode regexp a lonely surrogate is considered to be a single “character”.

As André has already covered, “character” has a very specific meaning within the context of the ES6 RegExp specification, given in the second paragraph of people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . The specification uses the same set of algorithms to describe both BMP (i.e., 16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching semantics. “Character” is used in those algorithms to refer to a single element of the Input List in whichever mode is currently in effect.

I think the ambiguity you find is in step 2.1 of people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :

  1. Return an internal closure that takes two arguments, a String str and an integer index, and performs the following:
  2. If Unicode is true, let Input be a List consisting of the sequence of code points of str interpreted as a UTF-16 encoded Unicode string. Otherwise, let Input be a List consisting of the sequence of code units that are the elements of str. Input will be used throughout the algorithms in 21.2.2. Each element of Input is considered to be a character.

Apparently I don’t have an adequate definition of “interpreted as a UTF-16 encoded Unicode string”. If you submit a bug to bugs.ecmascript.org I will provide one in the next spec revision. The intended semantics is, in ascending string index order: each valid UTF-16 surrogate pair is interpreted as the single code point that it encodes; each “lonely” surrogate is interpreted as a single code point whose value is that of the surrogate code unit; every other 16-bit code unit is interpreted as a single code point.
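A minimal sketch of that decoding (the function name is mine, not the spec's): turn a JS string's UTF-16 code units into the List of "characters" that a /u pattern matches against.

```javascript
// Decode a JS string into code points, treating each valid surrogate pair
// as one supplementary code point and each lonely surrogate as one code
// point equal to its code unit value.
function toCodePoints(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var lead = str.charCodeAt(i);
    if (lead >= 0xD800 && lead <= 0xDBFF && i + 1 < str.length) {
      var trail = str.charCodeAt(i + 1);
      if (trail >= 0xDC00 && trail <= 0xDFFF) {
        // valid surrogate pair: one supplementary code point
        out.push((lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000);
        i++;
        continue;
      }
    }
    out.push(lead); // BMP code unit or lonely surrogate: one code point each
  }
  return out;
}

toCodePoints("a\uD834\uDC00"); // [0x61, 0x1D000]  (pair decodes to U+1D000)
toCodePoints("a\uD834b");      // [0x61, 0xD834, 0x62]  (lonely lead kept as-is)
```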

# Allen Wirfs-Brock (10 years ago)

On Jan 28, 2015, at 5:26 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:

I think the cleanest mental model is where UTF-16 or UTF-8 strings are interpreted as if they were transformed into UTF-32.

This is exactly the approach used in the ES6 spec (except that it doesn’t deal with UTF-8)

While that is generally feasible, it often represents a cost in performance which is not acceptable in practice. So you see various approaches that involve some deviation from that mental model.

While ES6 uses this approach in its specification, implementations are free to use any implementation technique that produces the same result.

# Marja Hölttä (10 years ago)

Cool, thanks for clarifications!

To make sure, as per the "intended semantics", we never allow splitting a valid surrogate pair (= matching only one of the surrogates but not the other), and thus we'll differ from the Java implementation here:

/foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00"); we say "false", Java says "true".

(In addition, /^(.+)\1$/u.test("\uDC00foobar\uD834\uDC00foobar\uD834") == false.)

# Allen Wirfs-Brock (10 years ago)

On Jan 28, 2015, at 4:54 AM, Wes Garland <wes at page.ca> wrote:

Some interesting questions here.

These aren't discussion points. These are all things that must have answers that are directly derivable from the ES6 spec. If, after developing an adequate understanding of that part of the specification, you can’t find the answer to these questions, then there is probably something that needs to be clarified in the spec.

1 - What is a character? Is it a Unicode Code Point?

defined in: paragraph 2 people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics

2 - Should we be able to match all possible JS Strings?

yes, there is nothing in the algorithms that restrict JS String values

3 - Should we be able to match all possible Unicode Strings?

yes, subject to what you mean by “Unicode Strings”, as within JS Strings supplemental code points must be UTF-16 encoded.

4 - What do we do if there is a character in a String we cannot match?

RegExp.exec returns null if a string cannot be matched by a pattern

5 - Do unmatchable characters match . ?

there is no concept in the specification of an “unmatchable” character

6 - Are subsections of unmatchable strings matchable if they contain only matchable characters?

there is no concept in the specification of an “unmatchable” character

It is important to remember in these discussions that the Unicode specification allows strings which contain unmatched surrogates.

and ES6 //u patterns can match them

Do we want regular expressions that can't match some Unicode strings?

No, the ES6 specification can match all possible strings

Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate?

we already have it: \u{D83D}

How about reserved code points? What happens when they become assigned?

Other than the initial decoding of valid surrogate pairs into 32-bit code points, the ES6 //u RegExp spec. applies no semantics to any code points in the string that is being matched.

# André Bargull (10 years ago)

Cool, thanks for clarifications!

To make sure, as per the "intended semantics", we never allow splitting a valid surrogate pair (= matching only one of the surrogates but not the other), and thus we'll differ from the Java implementation here:

/foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00"); we say "false", Java says "true".

Correct, the captures List entry is [\uD834], so when performing 21.2.2.9 AtomEscape, \uD834 is matched against \uD834\uDC00 in step 8 which results in a failure state.

(In addition, /^(.+)\1$/u.test("\uDC00foobar\uD834\uDC00foobar\uD834") == false.)

Yes, this expression also returns false.

# Norbert Lindenberg (10 years ago)

On Jan 28, 2015, at 8:11 , Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

On Jan 28, 2015, at 2:36 AM, Marja Hölttä <marja at chromium.org> wrote:

Hello es-discuss,

TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?

The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:

TL;DR: in a unicode regexp a lonely surrogate is considered to be a single “character”.

As André has already covered, “character” has a very specific meaning within the context of the ES6 RegExp specification, given in the second paragraph of people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . The specification uses the same set of algorithms to describe both BMP (i.e., 16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching semantics. “Character” is used in those algorithms to refer to a single element of the Input List in whichever mode is currently in effect.

I think the ambiguity you find is in step 2.1 of people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :

  1. Return an internal closure that takes two arguments, a String str and an integer index, and performs the following:
  2. If Unicode is true, let Input be a List consisting of the sequence of code points of str interpreted as a UTF-16 encoded Unicode string. Otherwise, let Input be a List consisting of the sequence of code units that are the elements of str. Input will be used throughout the algorithms in 21.2.2. Each element of Input is considered to be a character.

Apparently I don’t have an adequate definition of “interpreted as a UTF-16 encoded Unicode string”. If you submit a bug to bugs.ecmascript.org I will provide one in the next spec revision. The intended semantics is, in ascending string index order: each valid UTF-16 surrogate pair is interpreted as the single code point that it encodes; each “lonely” surrogate is interpreted as a single code point whose value is that of the surrogate code unit; every other 16-bit code unit is interpreted as a single code point.

That definition is in section 6.1.4: people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type

A cross-reference would be useful.

Norbert

# Norbert Lindenberg (10 years ago)

On Jan 28, 2015, at 8:30 , Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

On Jan 28, 2015, at 4:54 AM, Wes Garland <wes at page.ca> wrote:

Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate?

we already have it: \u{D83D}

Or to match any unpaired surrogate: /[\u{D800}-\u{DFFF}]/u
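In engines that shipped /u, that class indeed only matches surrogates that failed to pair up, because valid pairs are decoded into supplementary code points before the class is consulted. For example:

```javascript
// A lone surrogate matches; a valid pair decodes to one supplementary
// code point, which lies outside the class's range:
console.log(/[\u{D800}-\u{DFFF}]/u.test("\uD834"));        // true
console.log(/[\u{D800}-\u{DFFF}]/u.test("\uD834\uDC00"));  // false
// Replace lone surrogates with U+FFFD while leaving valid pairs intact:
console.log("a\uD834b\uD834\uDC00".replace(/[\u{D800}-\u{DFFF}]/gu, "\uFFFD"));
// -> "a\uFFFDb\uD834\uDC00"
```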

How about reserved code points? What happens when they become assigned?

Other than the initial decoding of valid surrogate pairs into 32-bit code points, the ES6 //u RegExp spec. applies no semantics to any code points in the string that is being matched.

There are a few places where RegExp applies Unicode semantics:

– //ui uses Unicode case folding to compare case-insensitively. If the comparison involves code points that are unassigned in the Unicode version assumed by an ECMAScript implementation and in a later version get assigned to characters that are case-variants of each other, then the RegExp behavior can change. See section 21.2.2.8.2.

– RegExp knows a few character classes: \d, \D, \s, \S, \w, \W. \d, \D, \w, \W are defined by character lists that cannot change, but \s and therefore \S could change if Unicode assigns new characters with the category “Separator, space”. See section 21.2.2.12.

But in general //u is defined based on code points and doesn’t care whether code points are assigned or reserved.
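The //ui case folding difference is directly observable; a well-known example (shown here as a sketch of the behavior in shipped engines):

```javascript
// U+212A KELVIN SIGN case-folds to "k" under Unicode simple case folding,
// which is applied only when the u flag is set (see 21.2.2.8.2):
console.log(/k/iu.test("\u212A"));  // true
console.log(/k/i.test("\u212A"));   // false: non-/u Canonicalize uses
                                    // toUpperCase, which leaves U+212A alone
```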

Norbert

# Erik Corry (10 years ago)

I think it's problematic that this is being standardized without a single implementation.

# Allen Wirfs-Brock (10 years ago)

On Jan 31, 2015, at 12:01 AM, Norbert Lindenberg wrote:

That definition is in section 6.1.4: people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type

A cross-reference would be useful.

Norbert

done