Using Unicode code point escape sequences in regular expressions without the `u` flag

# Mathias Bynens (12 years ago)

If I’m reading the latest draft correctly, `RegExpUnicodeEscapeSequence`s aren’t allowed in regular expressions without the `u` flag. Why is that?

AFAICT, the only situations that require looking at code points rather than UCS-2/UTF-16 code units in order to support full Unicode are:

* the regex is case-insensitive;
* the regex contains a character class;
* the regex uses `.`;
* the regex uses a quantifier.

I’d suggest allowing `\u{xxxxxx}`-style escape sequences everywhere, and simply changing the behavior of the resulting regular expression depending on the `u` flag. There’s no good reason to disallow e.g. `/\u{20}/` or even `/\u{1F4A9}/`.
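
A minimal sketch of the four situations listed above, assuming an engine that implements the `u` flag as it was later standardized in ES2015. The variable names and the choice of the Deseret letters U+10400/U+10428 as an astral case pair are illustrative only. Each entry pairs the result with the `u` flag against the result without it.

```javascript
// Deseret capital long I is U+10400; its lowercase form is U+10428.
const lowerDeseret = "\u{10428}";
// U+1F4A9 is stored as the two UTF-16 code units \uD83D\uDCA9.
const poo = "\u{1F4A9}";

const checks = {
  // Case-insensitive matching of an astral symbol needs code points.
  caseInsensitive: [
    /^\u{10400}$/iu.test(lowerDeseret),
    /^\uD801\uDC00$/i.test(lowerDeseret),
  ],
  // A character class matches one code point (u) or one code unit (no u).
  characterClass: [/^[💩]$/u.test(poo), /^[💩]$/.test(poo)],
  // `.` consumes one code point (u) or one code unit (no u).
  dot: [/^.$/u.test(poo), /^.$/.test(poo)],
  // A quantifier applies to the whole symbol (u) or only the trail surrogate (no u).
  quantifier: [/^💩{2}$/u.test(poo + poo), /^💩{2}$/.test(poo + poo)],
};

console.log(checks);
```

In each pair the first value should be `true` and the second `false`.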

# Erik Arvidsson (12 years ago)

On Thu, Nov 21, 2013 at 2:41 PM, Mathias Bynens <mathias at qiwi.be> wrote:

> I’d suggest allowing `\u{xxxxxx}`-style escape sequences everywhere, and
> simply changing the behavior of the resulting regular expression depending
> on the `u` flag. There’s no good reason to disallow e.g. `/\u{20}/` or even
> `/\u{1F4A9}/`.
>

That would unfortunately not be backwards compatible, since `/\u{123}/` is a
valid RegExp in ES5.1.
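
A short sketch of the compatibility problem, assuming an engine that supports both the legacy behavior and the `u` flag. Without the flag, `\u` not followed by four hex digits is treated as a literal `u` and `{123}` as a quantifier; with the flag, `\u{123}` denotes the code point U+0123. The variable names are illustrative.

```javascript
const legacy = /^\u{123}$/;    // "u" repeated exactly 123 times
const unicode = /^\u{123}$/u;  // the single code point U+0123

const manyUs = "u".repeat(123);
const codePoint = String.fromCodePoint(0x123);

console.log(legacy.test(manyUs));     // true
console.log(legacy.test(codePoint));  // false
console.log(unicode.test(codePoint)); // true
console.log(unicode.test(manyUs));    // false
```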

# Mathias Bynens (12 years ago)

On 21 Nov 2013, at 15:07, Erik Arvidsson <erik.arvidsson at gmail.com> wrote:

> That would unfortunately not be backwards compatible since /\u{123}/ is a valid RegExp in ES5.1.

Ah, doh! I was thinking in terms of strings: modern engines throw errors for things like `'\u{123}'`. Thanks for the explanation.

# Mathias Bynens (12 years ago)

One more related question: are these three regular expression literals equivalent?

1. `/[💩-💫]/u`: raw astral symbols
2. `/[\u{1F4A9}-\u{1F4AB}]/u`: astral symbols represented using Unicode code point escape sequences
3. `/[\uD83D\uDCA9-\uD83D\uDCAB]/u`: astral symbols represented as a surrogate pair

# Allen Wirfs-Brock (12 years ago)

Did you check the ES6 draft grammar [1]? The answer to that should be fairly obvious there, and if it isn't, it would be good to know so we can try to make it clearer in the spec.

[1]: http://people.mozilla.org/~jorendorff/es6-draft.html#sec-patterns 

Allen

On Nov 22, 2013, at 9:25 AM, Mathias Bynens wrote:

> One more related question: are these three regular expression literals equivalent?
> 
> 1. `/[💩-💫]/u`: raw astral symbols
> 2. `/[\u{1F4A9}-\u{1F4AB}]/u`: astral symbols represented using Unicode code point escape sequences
> 3. `/[\uD83D\uDCA9-\uD83D\uDCAB]/u`: astral symbols represented as a surrogate pair

# Mathias Bynens (12 years ago)

On 22 Nov 2013, at 11:20, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

> Did you check the ES6 draft grammar [1]? The answer to that should be fairly obvious there and if it isn't it would be good to know so we can try to make it clearer in the spec.
> 
> [1]: http://people.mozilla.org/~jorendorff/es6-draft.html#sec-patterns 

It’s pretty clear that (1) is equivalent to (2). I guess (3) is equivalent to (1) and (2) because of the following:

    RegExpUnicodeEscapeSequence[U] ::
        [+U] LeadSurrogate \u TrailSurrogate

…but I was looking for confirmation.
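
One way to observe the effect of the `[+U]` parameter, sketched under the assumption of an engine that implements the `u` flag as later standardized: inside a character class, an escaped surrogate pair counts as one code point with the flag and as two separate code units without it. The variable names are illustrative.

```javascript
const pair = "\uD83D\uDCA9"; // U+1F4A9 as a surrogate pair
const loneLead = "\uD83D";   // only the lead surrogate

const withU = /^[\uD83D\uDCA9]$/u;   // class containing the single code point U+1F4A9
const withoutU = /^[\uD83D\uDCA9]$/; // class containing two code units

console.log(withU.test(pair));        // true
console.log(withU.test(loneLead));    // false
console.log(withoutU.test(pair));     // false
console.log(withoutU.test(loneLead)); // true
```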

# Allen Wirfs-Brock (12 years ago)

On Nov 22, 2013, at 11:02 PM, Mathias Bynens wrote:
> 
> It’s pretty clear that (1) is equivalent to (2). I guess (3) is equivalent to (1) and (2) because of the following:
> 
>    RegExpUnicodeEscapeSequence[U] ::
>        [+U] LeadSurrogate \u TrailSurrogate
> 
> …but I was looking for confirmation.

Yes, that's what the above production says. Also see the semantics for that production in http://people.mozilla.org/~jorendorff/es6-draft.html#sec-characterescape

Allen
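
A quick empirical check of that equivalence, assuming an engine that implements the `u` flag as later standardized. It runs the three literals from the question against code points just outside and inside the range U+1F4A9 to U+1F4AB; the sample list and variable names are illustrative.

```javascript
const rawSymbols = /[💩-💫]/u;
const codePointEscapes = /[\u{1F4A9}-\u{1F4AB}]/u;
const surrogatePairs = /[\uD83D\uDCA9-\uD83D\uDCAB]/u;

// U+1F4A8 and U+1F4AC fall just outside the range; the rest are inside it.
const samples = [0x1F4A8, 0x1F4A9, 0x1F4AA, 0x1F4AB, 0x1F4AC]
  .map((cp) => String.fromCodePoint(cp));

const results = [rawSymbols, codePointEscapes, surrogatePairs]
  .map((re) => samples.map((s) => re.test(s)));

console.log(results);
```

All three rows should be `[false, true, true, true, false]`.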

