ES RegExp parser

# Dmitry Soshnikov (8 years ago)

I started working on a ECMAScript regular expressions parser[1] with an AST format similar to Mozilla's Parser API. This might later be extended to support more powerful constructs, like lookbehind assertions, multiline regexes, groups naming, comments, etc.

And while this is mostly an FYI post (probably someone will find it useful for regexes analysis in source transformation tools, or source code editors), I'd appreciate any feedback on the specification of AST nodes (currently totally made up by myself).

E.g. when we have a quantifier from ES spec for RegExp grammar, it doesn't tell anything (and shouldn't of course) which AST node this quantifier node produces.

This leaves open questions like "whether a quantifier should be a part of the parsed expression, or should it vice-versa be a main node itself, and have the expression as a sub-node?":

In other words, which format is more appropriate (taking into account AST traversal tools in order to implement NFA/DFA engine for it later):

/a+?/

Char is main, quantifier is a sub-node:

{
  type: 'Char',
  value: 'a',
  quantifier: {
    type: '+',
    greedy: false,
  }
}

The quantifier is main (creating Repetition AST node), char is the expression sub-node:

{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a'
  },
  quantifier: {
    type: '+',
    greedy: false,
  }
}

Currently I chose the second approach (with Repetition node) as more practical when building an AST traversal -- it may have onRepetition generic handler, and call onChar internally for its expression, instead of making onChar (or any other node) to check, and handle its quantifier, and do a repeat.

Anyways, if you have any thought or feedback on AST nodes format, please feel free to contact me.

Dmitry

[1] www.npmjs.com/package/regexp

I started working on a ECMAScript regular expressions parser[1] with an AST
format similar to Mozilla's Parser API. This might later be extended to
support more powerful constructs, like lookbehind assertions, multiline
regexes, groups naming, comments, etc.

And while this is mostly an FYI post (probably someone will find it useful
for regexes analysis in source transformation tools, or source code
editors), I'd appreciate any feedback on the specification of AST nodes
(currently totally made up by myself).

E.g. when we have a quantifier from ES spec for RegExp grammar, it doesn't
tell anything (and shouldn't of course) which AST node this quantifier node
produces.

This leaves open questions like "whether a quantifier should be a part of
the parsed expression, or should it vice-versa be a main node itself, and
have the expression as a sub-node?":

In other words, which format is more appropriate (taking into account AST
traversal tools in order to implement NFA/DFA engine for it later):

```
/a+?/
```

Char is main, quantifier is a sub-node:

```
{
  type: 'Char',
  value: 'a',
  quantifier: {
    type: '+',
    greedy: false,
  }
}
```

The quantifier is main (creating `Repetition` AST node), char is the
`expression` sub-node:


```
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a'
  },
  quantifier: {
    type: '+',
    greedy: false,
  }
}
```

Currently I chose the second approach (with `Repetition` node) as more
practical when building an AST traversal -- it may have `onRepetition`
generic handler, and call `onChar` internally for its `expression`, instead
of making `onChar` (or any other node) to check, and handle its
`quantifier`, and do a repeat.

Anyways, if you have any thought or feedback on AST nodes format, please
feel free to contact me.

Dmitry

 [1] https://www.npmjs.com/package/regexp-tree
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20170319/7da357d9/attachment-0001.html>

# Jason Orendorff (8 years ago)

The second approach, hands down.

With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check re.quantifier anywhere, your code will run, but you have a bug. Much easier to only have to check re.type.
If you have a regexp re and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write {type: '+', expression: re} than to have to examine re.quantifier and (if it's already present) figure out how to modify it.
With the first approach, you don't have to represent (?:) group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: /(?:\d{4,6})?/

To me this is not even a question.

The second approach, hands down.

1. With the first approach, you're setting up a situation where it's very
easy to write buggy analysis code: if you forget to check `re.quantifier`
anywhere, your code will run, but you have a bug. Much easier to only have
to check `re.type`.

2. If you have a regexp `re` and you want to programmatically build a
regexp that matches one or more repetitions of it, it's much easier to
write `{type: '+', expression: re}` than to have to examine `re.quantifier`
and (if it's already present) figure out how to modify it.

3. With the first approach, you don't have to represent `(?:)` group in the
AST at all (rather like how Esprima drops redundant parentheses). With the
latter, I think you have to, because it's possible for a single regexp
"node" to have multiple quantifiers: `/(?:\d{4,6})?/`

To me this is not even a question.

-j



On Sun, Mar 19, 2017 at 4:52 PM, Dmitry Soshnikov <
dmitry.soshnikov at gmail.com> wrote:

> I started working on a ECMAScript regular expressions parser[1] with an
> AST format similar to Mozilla's Parser API. This might later be extended to
> support more powerful constructs, like lookbehind assertions, multiline
> regexes, groups naming, comments, etc.
>
> And while this is mostly an FYI post (probably someone will find it useful
> for regexes analysis in source transformation tools, or source code
> editors), I'd appreciate any feedback on the specification of AST nodes
> (currently totally made up by myself).
>
> E.g. when we have a quantifier from ES spec for RegExp grammar, it doesn't
> tell anything (and shouldn't of course) which AST node this quantifier node
> produces.
>
> This leaves open questions like "whether a quantifier should be a part of
> the parsed expression, or should it vice-versa be a main node itself, and
> have the expression as a sub-node?":
>
> In other words, which format is more appropriate (taking into account AST
> traversal tools in order to implement NFA/DFA engine for it later):
>
> ```
> /a+?/
> ```
>
> Char is main, quantifier is a sub-node:
>
> ```
> {
>   type: 'Char',
>   value: 'a',
>   quantifier: {
>     type: '+',
>     greedy: false,
>   }
> }
> ```
>
> The quantifier is main (creating `Repetition` AST node), char is the
> `expression` sub-node:
>
>
> ```
> {
>   type: 'Repetition',
>   expression: {
>     type: 'Char',
>     value: 'a'
>   },
>   quantifier: {
>     type: '+',
>     greedy: false,
>   }
> }
> ```
>
> Currently I chose the second approach (with `Repetition` node) as more
> practical when building an AST traversal -- it may have `onRepetition`
> generic handler, and call `onChar` internally for its `expression`, instead
> of making `onChar` (or any other node) to check, and handle its
> `quantifier`, and do a repeat.
>
> Anyways, if you have any thought or feedback on AST nodes format, please
> feel free to contact me.
>
> Dmitry
>
>  [1] https://www.npmjs.com/package/regexp-tree
>
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20170320/1fbc68ab/attachment.html>

# Jason Orendorff (8 years ago)

{type: '+', expression: re}

Sorry, I guess the real expression would be something more like:

{
  type: 'Repetition',
  expression: re,
  quantifier: {
    type: '+',
    greedy: true
  }
}

The point stands.

> `{type: '+', expression: re}`

Sorry, I guess the real expression would be something more like:

    {
      type: 'Repetition',
      expression: re,
      quantifier: {
        type: '+',
        greedy: true
      }
    }

The point stands.

-j


On Mon, Mar 20, 2017 at 10:36 AM, Jason Orendorff <jason.orendorff at gmail.com
> wrote:

> The second approach, hands down.
>
> 1. With the first approach, you're setting up a situation where it's very
> easy to write buggy analysis code: if you forget to check `re.quantifier`
> anywhere, your code will run, but you have a bug. Much easier to only have
> to check `re.type`.
>
> 2. If you have a regexp `re` and you want to programmatically build a
> regexp that matches one or more repetitions of it, it's much easier to
> write `{type: '+', expression: re}` than to have to examine `re.quantifier`
> and (if it's already present) figure out how to modify it.
>
> 3. With the first approach, you don't have to represent `(?:)` group in
> the AST at all (rather like how Esprima drops redundant parentheses). With
> the latter, I think you have to, because it's possible for a single regexp
> "node" to have multiple quantifiers: `/(?:\d{4,6})?/`
>
> To me this is not even a question.
>
> -j
>
>
>
> On Sun, Mar 19, 2017 at 4:52 PM, Dmitry Soshnikov <
> dmitry.soshnikov at gmail.com> wrote:
>
>> I started working on a ECMAScript regular expressions parser[1] with an
>> AST format similar to Mozilla's Parser API. This might later be extended to
>> support more powerful constructs, like lookbehind assertions, multiline
>> regexes, groups naming, comments, etc.
>>
>> And while this is mostly an FYI post (probably someone will find it
>> useful for regexes analysis in source transformation tools, or source code
>> editors), I'd appreciate any feedback on the specification of AST nodes
>> (currently totally made up by myself).
>>
>> E.g. when we have a quantifier from ES spec for RegExp grammar, it
>> doesn't tell anything (and shouldn't of course) which AST node this
>> quantifier node produces.
>>
>> This leaves open questions like "whether a quantifier should be a part of
>> the parsed expression, or should it vice-versa be a main node itself, and
>> have the expression as a sub-node?":
>>
>> In other words, which format is more appropriate (taking into account AST
>> traversal tools in order to implement NFA/DFA engine for it later):
>>
>> ```
>> /a+?/
>> ```
>>
>> Char is main, quantifier is a sub-node:
>>
>> ```
>> {
>>   type: 'Char',
>>   value: 'a',
>>   quantifier: {
>>     type: '+',
>>     greedy: false,
>>   }
>> }
>> ```
>>
>> The quantifier is main (creating `Repetition` AST node), char is the
>> `expression` sub-node:
>>
>>
>> ```
>> {
>>   type: 'Repetition',
>>   expression: {
>>     type: 'Char',
>>     value: 'a'
>>   },
>>   quantifier: {
>>     type: '+',
>>     greedy: false,
>>   }
>> }
>> ```
>>
>> Currently I chose the second approach (with `Repetition` node) as more
>> practical when building an AST traversal -- it may have `onRepetition`
>> generic handler, and call `onChar` internally for its `expression`, instead
>> of making `onChar` (or any other node) to check, and handle its
>> `quantifier`, and do a repeat.
>>
>> Anyways, if you have any thought or feedback on AST nodes format, please
>> feel free to contact me.
>>
>> Dmitry
>>
>>  [1] https://www.npmjs.com/package/regexp-tree
>>
>> _______________________________________________
>> es-discuss mailing list
>> es-discuss at mozilla.org
>> https://mail.mozilla.org/listinfo/es-discuss
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20170320/46215ccf/attachment.html>

# Dmitry Soshnikov (8 years ago)

On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <jason.orendorff at gmail.com>

wrote:

The second approach, hands down.

With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check re.quantifier anywhere, your code will run, but you have a bug. Much easier to only have to check re.type.

If you have a regexp re and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write {type: '+', expression: re} than to have to examine re.quantifier and (if it's already present) figure out how to modify it.

With the first approach, you don't have to represent (?:) group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: /(?:\d{4,6})?/

To me this is not even a question.

Jason, thanks; all good points! And I came to similar conclusions while was experimenting. Wanted to double-check, thanks for confirming.

Dmitry

On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <jason.orendorff at gmail.com>
wrote:

> The second approach, hands down.
>
> 1. With the first approach, you're setting up a situation where it's very
> easy to write buggy analysis code: if you forget to check `re.quantifier`
> anywhere, your code will run, but you have a bug. Much easier to only have
> to check `re.type`.
>
> 2. If you have a regexp `re` and you want to programmatically build a
> regexp that matches one or more repetitions of it, it's much easier to
> write `{type: '+', expression: re}` than to have to examine `re.quantifier`
> and (if it's already present) figure out how to modify it.
>
> 3. With the first approach, you don't have to represent `(?:)` group in
> the AST at all (rather like how Esprima drops redundant parentheses). With
> the latter, I think you have to, because it's possible for a single regexp
> "node" to have multiple quantifiers: `/(?:\d{4,6})?/`
>
> To me this is not even a question.
>

Jason, thanks; all good points! And I came to similar conclusions while was
experimenting. Wanted to double-check, thanks for confirming.

Dmitry
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20170320/9aa72726/attachment-0001.html>

# Dmitry Soshnikov (8 years ago)

On Mon, Mar 20, 2017 at 9:19 AM, Dmitry Soshnikov < dmitry.soshnikov at gmail.com> wrote:

On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <jason.orendorff at gmail.com> wrote:

The second approach, hands down.

With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check re.quantifier anywhere, your code will run, but you have a bug. Much easier to only have to check re.type.

If you have a regexp re and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write {type: '+', expression: re} than to have to examine re.quantifier and (if it's already present) figure out how to modify it.

With the first approach, you don't have to represent (?:) group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers: /(?:\d{4,6})?/

To me this is not even a question.

Jason, thanks; all good points! And I came to similar conclusions while was experimenting. Wanted to double-check, thanks for confirming.

OK, I added docs and specs for AST node types, which can also be a good learning material: www.npmjs.com/package/regexp-tree#ast-nodes-specification

Any feedback is welcome!

Parsing regexes is fun :) with this you realize that these are completely valid regexp patterns:

/^^^^$$$/.test(''); // true

/$^/.test(''); // true

/[---]/.test('-'); // true, a range from '-' to '-'!
/[-]/.test('-'); // true

Dmitry

On Mon, Mar 20, 2017 at 9:19 AM, Dmitry Soshnikov <
dmitry.soshnikov at gmail.com> wrote:

>
> On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <jason.orendorff at gmail.com>
> wrote:
>
>> The second approach, hands down.
>>
>> 1. With the first approach, you're setting up a situation where it's very
>> easy to write buggy analysis code: if you forget to check `re.quantifier`
>> anywhere, your code will run, but you have a bug. Much easier to only have
>> to check `re.type`.
>>
>> 2. If you have a regexp `re` and you want to programmatically build a
>> regexp that matches one or more repetitions of it, it's much easier to
>> write `{type: '+', expression: re}` than to have to examine `re.quantifier`
>> and (if it's already present) figure out how to modify it.
>>
>> 3. With the first approach, you don't have to represent `(?:)` group in
>> the AST at all (rather like how Esprima drops redundant parentheses). With
>> the latter, I think you have to, because it's possible for a single regexp
>> "node" to have multiple quantifiers: `/(?:\d{4,6})?/`
>>
>> To me this is not even a question.
>>
>
> Jason, thanks; all good points! And I came to similar conclusions while
> was experimenting. Wanted to double-check, thanks for confirming.
>
>
OK, I added docs and specs for AST node types, which can also be a good
learning material:
https://www.npmjs.com/package/regexp-tree#ast-nodes-specification

Any feedback is welcome!

Parsing regexes is fun :) with this you realize that these are completely
valid regexp patterns:

```
/^^^^$$$/.test(''); // true

/$^/.test(''); // true

/[---]/.test('-'); // true, a range from '-' to '-'!
/[-]/.test('-'); // true
```

Dmitry
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20170320/6fdd1d4f/attachment.html>