ES RegExp parser
The second approach, hands down.
-
With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check
re.quantifier
anywhere, your code will run, but you have a bug. Much easier to only have to checkre.type
. -
If you have a regexp
re
and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write{type: '+', expression: re}
than to have to examinere.quantifier
and (if it's already present) figure out how to modify it. -
With the first approach, you don't have to represent
(?:)
group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers:/(?:\d{4,6})?/
To me this is not even a question.
{type: '+', expression: re}
Sorry, I guess the real expression would be something more like:
{
type: 'Repetition',
expression: re,
quantifier: {
type: '+',
greedy: true
}
}
The point stands.
On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <jason.orendorff at gmail.com>
wrote:
The second approach, hands down.
With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check
re.quantifier
anywhere, your code will run, but you have a bug. Much easier to only have to checkre.type
.If you have a regexp
re
and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write{type: '+', expression: re}
than to have to examinere.quantifier
and (if it's already present) figure out how to modify it.With the first approach, you don't have to represent
(?:)
group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers:/(?:\d{4,6})?/
To me this is not even a question.
Jason, thanks; all good points! And I came to similar conclusions while was experimenting. Wanted to double-check, thanks for confirming.
Dmitry
On Mon, Mar 20, 2017 at 9:19 AM, Dmitry Soshnikov < dmitry.soshnikov at gmail.com> wrote:
On Mon, Mar 20, 2017 at 8:36 AM Jason Orendorff <jason.orendorff at gmail.com> wrote:
The second approach, hands down.
With the first approach, you're setting up a situation where it's very easy to write buggy analysis code: if you forget to check
re.quantifier
anywhere, your code will run, but you have a bug. Much easier to only have to checkre.type
.If you have a regexp
re
and you want to programmatically build a regexp that matches one or more repetitions of it, it's much easier to write{type: '+', expression: re}
than to have to examinere.quantifier
and (if it's already present) figure out how to modify it.With the first approach, you don't have to represent
(?:)
group in the AST at all (rather like how Esprima drops redundant parentheses). With the latter, I think you have to, because it's possible for a single regexp "node" to have multiple quantifiers:/(?:\d{4,6})?/
To me this is not even a question.
Jason, thanks; all good points! And I came to similar conclusions while was experimenting. Wanted to double-check, thanks for confirming.
OK, I added docs and specs for AST node types, which can also be a good learning material: www.npmjs.com/package/regexp-tree#ast-nodes-specification
Any feedback is welcome!
Parsing regexes is fun :) with this you realize that these are completely valid regexp patterns:
/^^^^$$$/.test(''); // true
/$^/.test(''); // true
/[---]/.test('-'); // true, a range from '-' to '-'!
/[-]/.test('-'); // true
Dmitry
I started working on a ECMAScript regular expressions parser[1] with an AST format similar to Mozilla's Parser API. This might later be extended to support more powerful constructs, like lookbehind assertions, multiline regexes, groups naming, comments, etc.
And while this is mostly an FYI post (probably someone will find it useful for regexes analysis in source transformation tools, or source code editors), I'd appreciate any feedback on the specification of AST nodes (currently totally made up by myself).
E.g. when we have a quantifier from ES spec for RegExp grammar, it doesn't tell anything (and shouldn't of course) which AST node this quantifier node produces.
This leaves open questions like "whether a quantifier should be a part of the parsed expression, or should it vice-versa be a main node itself, and have the expression as a sub-node?":
In other words, which format is more appropriate (taking into account AST traversal tools in order to implement NFA/DFA engine for it later):
Char is main, quantifier is a sub-node:
The quantifier is main (creating
Repetition
AST node), char is theexpression
sub-node:Currently I chose the second approach (with
Repetition
node) as more practical when building an AST traversal -- it may haveonRepetition
generic handler, and callonChar
internally for itsexpression
, instead of makingonChar
(or any other node) to check, and handle itsquantifier
, and do a repeat.Anyways, if you have any thought or feedback on AST nodes format, please feel free to contact me.
Dmitry
[1] www.npmjs.com/package/regexp