ES RegExp parser

# Dmitry Soshnikov (7 years ago)

I started working on a ECMAScript regular expressions parser[1] with an AST format similar to Mozilla's Parser API. This might later be extended to support more powerful constructs, like lookbehind assertions, multiline regexes, groups naming, comments, etc.

And while this is mostly an FYI post (probably someone will find it useful for regexes analysis in source transformation tools, or source code editors), I'd appreciate any feedback on the specification of AST nodes (currently totally made up by myself).

E.g. when we have a quantifier from ES spec for RegExp grammar, it doesn't tell anything (and shouldn't of course) which AST node this quantifier node produces.

This leaves open questions like "whether a quantifier should be a part of the parsed expression, or should it vice-versa be a main node itself, and have the expression as a sub-node?":

In other words, which format is more appropriate (taking into account AST traversal tools in order to implement NFA/DFA engine for it later):

/a+?/

Char is main, quantifier is a sub-node:

{
  type: 'Char',
  value: 'a',
  quantifier: {
    type: '+',
    greedy: false,
  }
}

The quantifier is main (creating Repetition AST node), char is the expression sub-node:

{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a'
  },
  quantifier: {
    type: '+',
    greedy: false,
  }
}

Currently I chose the second approach (with Repetition node) as more practical when building an AST traversal -- it may have onRepetition generic handler, and call onChar internally for its expression, instead of making onChar (or any other node) to check, and handle its quantifier, and do a repeat.

Anyways, if you have any thought or feedback on AST nodes format, please feel free to contact me.

Dmitry

[1] www.npmjs.com/package/regexp

# Jason Orendorff (7 years ago)

The second approach, hands down.

  1. With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check re.quantifier anywhere, your code will run, but you have a bug. Much easier to only have to check re.type.

  2. If you have a regexp re and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write {type: '+', expression: re} than to have to examine re.quantifier and (if it's already present) figure out how to modify it.

  3. With the first approach, you don't have to represent (?:) group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: /(?:\d{4,6})?/

To me this is not even a question.

# Jason Orendorff (7 years ago)

{type: '+', expression: re}

Sorry, I guess the real expression would be something more like:

{
  type: 'Repetition',
  expression: re,
  quantifier: {
    type: '+',
    greedy: true
  }
}

The point stands.

# Dmitry Soshnikov (7 years ago)

On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <jason.orendorff at gmail.com>

wrote:

The second approach, hands down.

  1. With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check re.quantifier anywhere, your code will run, but you have a bug. Much easier to only have to check re.type.

  2. If you have a regexp re and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write {type: '+', expression: re} than to have to examine re.quantifier and (if it's already present) figure out how to modify it.

  3. With the first approach, you don't have to represent (?:) group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: /(?:\d{4,6})?/

To me this is not even a question.

Jason, thanks; all good points! And I came to similar conclusions while was experimenting. Wanted to double-check, thanks for confirming.

Dmitry

# Dmitry Soshnikov (7 years ago)

On Mon, Mar 20, 2017 at 9:19 AM, Dmitry Soshnikov < dmitry.soshnikov at gmail.com> wrote:

On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <jason.orendorff at gmail.com> wrote:

The second approach, hands down.

  1. With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check re.quantifier anywhere, your code will run, but you have a bug. Much easier to only have to check re.type.

  2. If you have a regexp re and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write {type: '+', expression: re} than to have to examine re.quantifier and (if it's already present) figure out how to modify it.

  3. With the first approach, you don't have to represent (?:) group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: /(?:\d{4,6})?/

To me this is not even a question.

Jason, thanks; all good points! And I came to similar conclusions while was experimenting. Wanted to double-check, thanks for confirming.

OK, I added docs and specs for AST node types, which can also be a good learning material: www.npmjs.com/package/regexp-tree#ast-nodes-specification

Any feedback is welcome!

Parsing regexes is fun :) with this you realize that these are completely valid regexp patterns:

/^^^^$$$/.test(''); // true

/$^/.test(''); // true

/[---]/.test('-'); // true, a range from '-' to '-'!
/[-]/.test('-'); // true

Dmitry