RegExp free-spacing & comments

# Brian Terlson (9 years ago)

While looking into proposals for look-behinds and named capture groups for ECMAScript RegExps, I thought I'd also think about other oft-requested features we currently lack: free-spacing and comments. Comments allow a programmer to embed explanatory notes inside the regexp literal. Free-spacing tells the RegExp engine to ignore spaces, tabs, and line breaks. These two features go really nicely together to allow human-readable regexps: different parts of a pattern can be split onto separate lines, with a comment on each line describing what it matches. XRegExp supports both of these features, as do the RegExp engines in Perl, Java, C#, and others.

One challenge with supporting free-spacing in ECMAScript is that we don't allow line breaks inside regexp literals, and constructing regexps from strings is somewhat annoying. The best we could have right now (I think) is something like:

let re = new RegExp(String.raw`
    (\d{3}-)? # area code (optional)
    \d{3}-    # prefix
    \d{4}     # line number
`, "x");

I think this is still a win for long, confusing patterns, but maybe I'm alone! Are free-spacing and comments still reasonable if we have to use string templates? Or is there a nice way to extend the grammar of regular expression literals to allow for line breaks? (E.g., maybe we only allow free-spacing with a mode specifier like (?x) inside the pattern?) Any other thoughts?
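
For readers who want to try the idea today: current engines reject an "x" flag with a SyntaxError, so the snippet above only works under the proposal. A rough sketch of emulating it in userland follows. The helper name freeSpacing is hypothetical, and the rules it applies (drop unescaped whitespace and "#" comments outside character classes) are an assumption about what an "x" mode would do, not something specified in this thread.

```javascript
// Hypothetical helper (not a built-in): strips free-spacing whitespace and
// "#" comments from a pattern source, then compiles what is left.
function freeSpacing(source, flags = "") {
  let out = "";
  let inClass = false; // true while inside a [...] character class
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      // Keep escape sequences intact, including "\#" and "\ ".
      out += ch + (source[i + 1] ?? "");
      i++;
    } else if (inClass) {
      // Whitespace and "#" are literal inside a character class.
      if (ch === "]") inClass = false;
      out += ch;
    } else if (ch === "[") {
      inClass = true;
      out += ch;
    } else if (ch === "#") {
      // Comment: skip everything up to (not including) the line break.
      while (i + 1 < source.length && source[i + 1] !== "\n") i++;
    } else if (!/\s/.test(ch)) {
      out += ch;
    }
  }
  return new RegExp(out, flags);
}

const phone = freeSpacing(String.raw`
    (\d{3}-)? # area code (optional)
    \d{3}-    # prefix
    \d{4}     # line number
`);
```

With this sketch, a literal space or "#" has to be written as "\ " or "\#", or placed inside a character class.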

# Caitlin Potter (9 years ago)

(Repost from IRC discussion)

Could do something like this:


let re = /```<linefeed>
    (\d{3}-)?    # area code (opt)
    \d{3}-        # exchange
    \d{4}         # line number
```/;

It looks sort of heredoc-ish, which is nice, and it shouldn’t break existing parsing rules.

RegularExpressionLiteral: /``` LineTerminator RegularExpressionBody[SpacingAndComments] ```/ RegularExpressionFlags

This is a PrimaryExpression, and therefore should not be confused with TemplateLiterals or MultiplicativeExpressions. And because of the required linefeed, it should not break any existing RegularExpressionLiterals.

Just throwing it out there as a possibility.

The 3 backticks is a cute idea, but it probably doesn’t matter how it’s represented. It just looks very heredoc-ish, and distinct from TemplateLiterals.

# C. Scott Ananian (9 years ago)

If you're using string templates, why not do the full regex there, instead of just passing the result to new RegExp?

See esdiscuss.org/topic/regexp-escape#content-22 and esdiscuss.org/topic/regexp-escape#content-28 for some examples, and benjamingr/RegExp.escape#37 for a discussion of creating a new template string function called re.

# C. Scott Ananian (9 years ago)

We have a template string mechanism, and it allows linefeeds. Let's use that, instead of inventing new heredoc syntax.

# Brian Terlson (9 years ago)

RegExp.re or similar seems nice:

let re = RegExp.re("x")`
    (\d{3}-)? # area code (optional)
    \d{3}-    # prefix
    \d{4}     # line number
`;

But it seems like previous proposals of this want escaping which doesn't seem ideal for this purpose. Do we need both RegExp.re and RegExp.escapedRe?

# Caitlin Potter (9 years ago)

By that argument, you could say “well, we’ve already got StringLiterals, so we don’t need RegularExpressionLiterals”. And yet in reality, RegularExpressionLiterals are the most common way to actually use RegularExpressions. There’s a reason for that: a literal is more concise, it’s identifiably a regular expression, it doesn’t have to worry about the RegExp constructor being overwritten or deleted, and there’s no need to escape characters or, worse, use a horrible idea like tagged templates.

With the RegExp constructor, since you’re passing in a string, you can essentially add comments anyways by just terminating the string and commenting after it, so this isn’t really adding anything novel.

let re = RegExp([
  "(\\d{3}-)?", // Area Code
  "(\\d{3}-)",  // Exchange
  "(\\d{4})",   // Line
].join(""));

Template strings don’t really add anything special here — but nobody wants to write code like this anyways :)

# C. Scott Ananian (9 years ago)

On Fri, Nov 6, 2015 at 1:20 PM, Brian Terlson <Brian.Terlson at microsoft.com> wrote:

RegExp.re or similar seems nice:

let re = RegExp.re("x")`
    (\d{3}-)? # area code (optional)
    \d{3}-    # prefix
    \d{4}     # line number
`;

But it seems like previous proposals of this want escaping which doesn't seem ideal for this purpose. Do we need both RegExp.re and RegExp.escapedRe?

Escaping happens if you use interpolation into the string template:

let re = RegExp.re`(?x:
    (\d{3}-)?      # area code (optional)
    ${ /\d{3}/ }-  # prefix
    \d{4}           # line number
    ( ${ "*" } \d+ )?  # extension
)`;

If the interpolated expression is a regexp, then things seem relatively straightforward (although there are corner cases to consider). If the interpolated expression is a string, then it is suggested that you use some sort of automatic escaping.
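
A minimal sketch of that interpolation rule, for illustration only: the names re and escapeForRegExp are hypothetical, this is not the proposed RegExp.re, and it ignores flags and free-spacing entirely. It assumes a RegExp value contributes its source wrapped in a non-capturing group, and anything else is converted to a string and escaped.

```javascript
// Hypothetical helper: backslash-escape characters that are special in a pattern.
function escapeForRegExp(text) {
  return text.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Hypothetical template tag: literal parts are used raw, interpolated values
// are either spliced in as sub-patterns (RegExp) or escaped (everything else).
function re(strings, ...values) {
  let source = strings.raw[0];
  values.forEach((value, i) => {
    source += value instanceof RegExp
      ? `(?:${value.source})`           // sub-pattern, grouped to keep precedence
      : escapeForRegExp(String(value)); // plain text, matched literally
    source += strings.raw[i + 1];
  });
  return new RegExp(source);
}

// The "*" is interpolated as a string, so it matches a literal asterisk.
const ext = re`^${/\d{3}/}-\d{4}(${"*"}\d+)?$`;
```

Here ext matches "555-1234" and "555-1234*89" but not "555-1234x89". The corner cases mentioned above (for example, a sub-pattern whose flags differ from the outer pattern, or backreferences whose numbering shifts) are not handled.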

# Isiah Meadows (9 years ago)

The problem with using the RegExp constructor is that its result is never cached by the engine. Engines usually intern literals, which speeds up matching considerably.

# Gorkem Yakin (9 years ago)

I don’t know about other engines, but Chakra does cache the RegExp matcher when the RegExp object is created via the constructor.
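
Whatever a given engine caches internally, user code can sidestep the question by constructing the RegExp once and reusing the object. A small sketch, with hypothetical names (PHONE, isPhoneNumber):

```javascript
// Built once at module scope, so the constructor runs a single time
// no matter how often isPhoneNumber is called.
const PHONE = new RegExp(String.raw`^(\d{3}-)?\d{3}-\d{4}$`);

function isPhoneNumber(text) {
  // No "g" or "y" flag, so test() does not carry lastIndex state between calls.
  return PHONE.test(text);
}

const results = ["555-123-4567", "123-4567", "not a number"].map(isPhoneNumber);
```

This says nothing about how matcher caching works in any particular engine; it only avoids repeated construction.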

Gorkem

# Isiah Meadows (9 years ago)

Does it still read the string?

# kdex (9 years ago)

Actually, in many languages, regular expressions already have a dedicated syntax for commenting them. This might be related to ES7 bug 4423 (Support comment tokens in regular expressions):

ecmascript#4423