JavaScript parser API

# Axel Rauschmayer (14 years ago)

blog.mozilla.com/dherman/2011/06/28/the-js-parser-api-has-landed

I’ve just read D. Herman’s post on Firefox’s parser API. Is there any chance that this kind of API will make it into Harmony? It would be really useful for a variety of generative/meta-programming tasks.

Axel

# Mike Shaver (14 years ago)

On Tue, Jun 28, 2011 at 6:34 PM, Axel Rauschmayer <axel at rauschma.de> wrote:

blog.mozilla.com/dherman/2011/06/28/the-js-parser-api-has-landed

I’ve just read D. Herman’s post on Firefox’s parser API. Is there any chance that this kind of API will make it into Harmony? It would be really useful for a variety of generative/meta-programming tasks.

I'm interested in this too, for a number of applications, but I keep getting stuck on one thing.

Would you standardize the resulting parse tree, too? That would probably mean that every engine would have two parsers, since I'm sure we produce different parse trees right now, and wouldn't want to lock down our parser mechanics for all time.

If you don't standardize the parse tree, is it still useful? More useful than just using narcissus or whatever other parsing libraries exist?

Mike

# Mark S. Miller (14 years ago)

I'm glad you're asking the right question. To clarify for everyone, "EcmaScript-Harmony" (or just "Harmony") names our agreed trajectory for the language following ES5. We understand that the goals of this trajectory will take several new editions to realize. "ES-next" is our working name for the next edition along this trajectory, likely to be named "ES6".

So, not having been proposed or sanctioned prior to the May meeting cutoff date, I'd say the chances of this making it into ES-next are vanishingly small. However, it is consistent with the overall goals for Harmony, so yes. Just because it's not a candidate for ES-next is no reason to postpone discussing it on es-discuss, of course. I would very much like to see something along these lines become standard in some later edition.

# Mark S. Miller (14 years ago)

On Tue, Jun 28, 2011 at 5:02 PM, Mike Shaver <mike.shaver at gmail.com> wrote:

On Tue, Jun 28, 2011 at 6:34 PM, Axel Rauschmayer <axel at rauschma.de> wrote:

blog.mozilla.com/dherman/2011/06/28/the-js-parser-api-has-landed

I’ve just read D. Herman’s post on Firefox’s parser API. Is there any chance that this kind of API will make it into Harmony? It would be really useful for a variety of generative/meta-programming tasks.

I'm interested in this too, for a number of applications, but I keep getting stuck on one thing.

Would you standardize the resulting parse tree, too?

A resulting AST, yes. And the need to agree on an AST is part of why previous proposals along these lines failed to achieve consensus. I continue to like the ASTs we generate at es-lab.googlecode.com/svn/trunk/site/esparser/index.html. YMMV. But I would be in favor of any AST format we could agree on.

That would probably mean that every engine would have two parsers, since I'm sure we produce different parse trees right now, and wouldn't want to lock down our parser mechanics for all time.

If you don't standardize the parse tree, is it still useful? More useful than just using narcissus or whatever other parsing libraries exist?

The only option I can see for standardizing a parser but not an AST is to standardize an (SAX-like) callback emitting parser. But this doesn't really avoid any of the hard issues. The signature of each of these callbacks has exactly the same design issues as the format of an AST.
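
For concreteness, a sketch of what such a callback-emitting interface might look like (the names here are invented for illustration, not a proposal):

```js
// Hypothetical SAX-style parser interface, for illustration only.
// Each callback's parameter list mirrors the shape of the corresponding AST
// node, so agreeing on these signatures is no easier than agreeing on an AST.
parse(source, {
  ifStatement:      function (test, consequent, alternate, loc) { /* ... */ },
  binaryExpression: function (operator, left, right, loc)       { /* ... */ },
  identifier:       function (name, loc)                        { /* ... */ }
});
```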

# Axel Rauschmayer (14 years ago)

Maybe we could start with an “official” separate JavaScript-only implementation?

# David Herman (14 years ago)

Yeah, tough questions. I don't know. I tried to make the API flexible by allowing custom builders, and in fact if you look at the test suite you'll see I did a proof-of-concept showing how you could generate the format that Mark mentioned:

http://hg.mozilla.org/tracemonkey/file/2ce7546583ff/js/src/tests/js1_8_5/extensions/reflect-parse.js#l1051

But there are also tough questions about what the parser should do with engine-specific language extensions.

I agree about the issue of multiple parsers. The reason I was able to do the SpiderMonkey library fairly easily was that I simply reflect exactly the parser that exists. But to have a standards-compliant parser, we'd probably have to write a separate parser. That's definitely a tall order.

# Claus Reinke (14 years ago)

and other interested parties,

it would be helpful to summarize the issues on a page linked from the AST API strawman - given the positive discussions on this list, I thought the idea was implicitly accepted last year, modulo details, so I was surprised not to see a refined strawman promoted. Would it help if interested parties started discussing/refining the details, or are the different users so far apart that agreement is out of reach?

My own impressions, from adding much of a Spidermonkey-based AST to my parser:

- it does not support generic traversals, so it definitely needs a
    pre-implemented traversal, sorting out each type of Node
    (Array-based ASTs, like the es-lab version, make this slightly
    easier - Array elements are ordered, unlike Object properties);

    at that stage, simple applications (such as tag generation)
    may be better off working with hooks into the parser, rather
    than hooks into an AST traversal? also, there is the risk that
    one pre-implemented traversal might not cover all use cases,
    in which case the boilerplate tax would have to be paid again;

- it is slightly easier to manipulate than an Array-based AST, but
    lack of pattern matching fall-through (alternative patterns for
    destructuring) still hurts, and the selectors are lengthy, which
    hampers visualization and construction; (this assumes that
    fp-style AST processing is preferred over oo-style processing)

- it is biased towards evaluation, which is a hindrance for other
    uses (such as faithful unparsing, for program transformations);

    this can be seen clearly in Literals, which are evaluated (why
    not evaluate Object, Array, Function Literals as well? eval should
    be part of AST processing, not of AST construction), but it also
    shows in other constructs (comments are not stored at all, and
    if commas/semicolons are not stored, how does one know
    where they were located - programmers tend to be picky
    about their personal or project-wide style guides?);

- there are some minor oddities, from spelling differences to
    the spec (Label(l)ed), to structuring decisions (why separate
    UpdateExpression and LogicalExpression, when everything
    else is in UnaryExpression and BinaryExpression?);

    btw, why alternate/consequent instead of then/else, and
    shouldn't that really be consequent->then and alternate->else
    instead of the other way round (as the optional null for
    consequent suggests)?

My main issue is unparsing support for program transformations (though IDEs will similarly need more info, for comment extraction, syntax highlighting, and syntax-based operations).

What I did for now was to add a field to each Node, in which I store an unprocessed Array of the sub-ASTs, including tokens. Essentially, the extended AST Nodes provide both abstract info for analysis and evaluation and a structured view of the token stream belonging to each Node, for lower-level needs.

Whitespace/comments are stored separately, indexed by the start position of the following token (this is going to work better for comment-before-token than for comment-after-token, but it is a start, for unparsing or comment-extraction tools).

This allows for a generic traversal of the Array-based unprocessed AST fragments, for unparsing, but I still have to rearrange things so that I can actually store the information I need (can't add info to null as an AST value) and distinguish meta-info ("computed" and "prefix" properties) from sub-ASTs.
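
As a rough sketch of the shape I mean (the field names are my own, not part of any API):

```js
// Sketch only: nodes extended with a raw, ordered view of their tokens and
// sub-ASTs, plus a separate comment table keyed by the start position of the
// token that follows each comment.
var idNode = { type: 'Identifier', name: 'flag', raw: [{token: 'flag'}] };
var breakNode = { type: 'BreakStatement', label: null,
                  raw: [{token: 'break'}, {token: ';'}] };
var ifNode = { type: 'IfStatement',
               test: idNode, consequent: breakNode, alternate: null,
               raw: [{token: 'if'}, {token: '('}, idNode, {token: ')'}, breakNode] };

var comments = { '4:0': ['// check the flag'] };  // comment before the token at line 4, column 0

// Generic unparsing walks the raw arrays, re-emitting tokens and recursing
// into sub-ASTs (whitespace/comment reinsertion omitted for brevity).
function unparse(n) {
  return 'token' in n ? n.token : n.raw.map(unparse).join(' ');
}
unparse(ifNode);   // => "if ( flag ) break ;"
```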

Overall, the impression is that this AST was designed by someone resigned to the fact of having to write Node-type-specific traversal code for each purpose, with a limited number of purposes planned (such as evaluation). This could be a burden for other uses of such ASTs (boilerplate tax).

I hope these notes help - I'd really like to see a standard JS parser API implemented across engines. For language experimentation, we'd still need separate tweakable parsers, but access to the efficient engine parsers for current JS would give tool development a boost.

But there are also tough questions about what the parser should do with engine-specific language extensions.

Actually, that starts before the AST: I'd like to see feature-based language versioning, instead of the current monolithic version numbering - take generators as an example feature:

Perhaps JS1.7 ("javascript;version=1.7") happens to be the first JS version to support "yield", and is backwards compatible with JS1.5, which might happen to match ES3; and JS1.8.5, which happens to match ES5, might be backwards compatible with JS1.7. But it is unlikely that the JSx which happens to match ES6 will be backwards compatible with JS1.7 (while ES5-breaking changes will be limited, replacing experimental JS1.x features with standardized variants is another matter).

Whereas, if I was able to specify "use yield", and be similarly selective about other language features, then either of JS1.7, JS1.8.5 and ES6 engines might be able to do the job, depending on what other language features my code depends on. Also, other engines might want to implement some features -like "yield"- selectively, without aiming to support all of JS1.7, and long before being able to support all of ES6.

I agree about the issue of multiple parsers. The reason I was able to do the SpiderMonkey library fairly easily was that I simply reflect exactly the parser that exists. But to have a standards-compliant parser, we'd probably have to write a separate parser. That's definitely a tall order.

It should not be, provided one distinguishes between standards-compliant and production use. If the ES grammar is LR(1), it should really be specified in a parser tool format, both for verification and to generate standards-compliant tools to compare against. Depending on how efficient the JS Bison implementation is, this might even lead to usable parser performance.

There may be problems in finding a tool that generates all the information needed for a useful AST (source locations, comments, scope info, ..), but we do not need to solve every issue immediately to make progress, right? And if the ES committee were to ask ES parser generator implementors whether their tools could be extended to serve an AST spec, response might be favourable.

It would be nice if the spec parser was generated in JavaScript, but any tool-usable standard grammar would be useful - once the grammar can be processed by a freely available tool, it can be translated to similar formats, some of which have JavaScript implementations (e.g. Jison, ANTLR).
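
To illustrate the kind of tool-readable format I have in mind, here is a toy expression grammar in what I believe is roughly Jison's JSON form - purely a sketch of the format, nothing like the real ES grammar:

```js
// Toy grammar: sums and products of numbers, building a small array-based AST.
var Parser = require('jison').Parser;

var grammar = {
  lex: {
    rules: [
      ['\\s+',   '/* skip whitespace */'],
      ['[0-9]+', "return 'NUMBER';"],
      ['\\+',    "return 'PLUS';"],
      ['\\*',    "return 'TIMES';"],
      ['$',      "return 'EOF';"]
    ]
  },
  start: 'program',
  bnf: {
    program: [['e EOF',          'return $1;']],
    e:       [['e PLUS t',       '$$ = ["+", $1, $3];'],
              ['t',              '$$ = $1;']],
    t:       [['t TIMES NUMBER', '$$ = ["*", $1, Number($3)];'],
              ['NUMBER',         '$$ = Number($1);']]
  }
};

var parser = new Parser(grammar);
parser.parse('1 + 2 * 3');   // => ["+", 1, ["*", 2, 3]]
```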

Having played a little with the ANTLRWorks environment, it looks promising, is easy to install (just a .jar), has user-contributed ES grammars, and can spot some ambiguities easily (though I don't think its check is complete, and the ES grammar is too complex to make naïve parse-tree visualization helpful). If other tools have better ES grammar development support, I'd like to hear about them.

Without a standard spec-conformant tool-readable grammar, such tools remain of limited use. With a tool-readable grammar, adding AST generation might turn out to be an afternoon's work (followed by years of testing/debugging;-).

Claus

# David Herman (14 years ago)

the AST API strawman - given the positive discussions on this list, I thought the idea was implicitly accepted last year, modulo details, so I was surprised not to see a refined strawman promoted.

It hasn't really been championed so far. I was concentrating on other proposals for ES.next.

  • it does not support generic traversals, so it definitely needs a pre-implemented traversal, sorting out each type of Node (Array-based ASTs, like the es-lab version, make this slightly easier - Array elements are ordered, unlike Object properties);

I designed it to be easily JSON-{de}serializable, so no special prototype. However, you can use the builder API to construct your own format:

https://developer.mozilla.org/en/SpiderMonkey/Parser_API#Builder_objects

With a custom builder you can create objects with whatever methods you want, and builders for various formats can be shared in libraries.
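
For example, a builder that produces a compact array format might look roughly like this (the callback names follow the pattern described on that page, but treat the details as approximate):

```js
// Rough sketch of a custom builder producing array-shaped nodes; omitted
// callbacks fall back to the default object-format nodes.
var arrayBuilder = {
  program:             function (body)            { return ['Program', body]; },
  expressionStatement: function (expr)            { return ['ExprStmt', expr]; },
  binaryExpression:    function (op, left, right) { return ['Binary', op, left, right]; },
  identifier:          function (name)            { return ['Id', name]; },
  literal:             function (val)             { return ['Lit', val]; }
};

var ast = Reflect.parse('1 + x;', { builder: arrayBuilder });
// => ['Program', [['ExprStmt', ['Binary', '+', ['Lit', 1], ['Id', 'x']]]]]
```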

  at that stage, simple applications (such as tag generation)
  may be better off working with hooks into the parser, rather
  than hooks into an AST traversal? also, there is the risk that
  one pre-implemented traversal might not cover all use cases,
  in which case the boilerplate tax would have to be paid again;

I don't understand any of this.

  • it is slightly easier to manipulate than an Array-based AST, but

More than slightly, IMO.

  lack of pattern matching fall-through (alternative patterns for
  destructuring) still hurts, and the selectors are lengthy, which
  hampers visualization and construction; (this assumes that
  fp-style AST processing is preferred over oo-style processing)

If I'd defined a new object type with its own prototype, it still wouldn't define all operations anyone would ever want. So they'd either have to monkey-patch it or it would need a visitor. Which you could write anyway. So I don't see much benefit to pre-defining a node prototype.

But again, see the builder API, where you can create your own custom node type.

  • it is biased towards evaluation, which is a hindrance for other uses (such as faithful unparsing, for program transformations);

It's just a reflection of the built-in SpiderMonkey parser, which was designed for the sole purpose of evaluation. I didn't reimplement a new parser.

  this can be seen clearly in Literals, which are evaluated (why
  not evaluate Object, Array, Function Literals as well? eval should
  be part of AST processing, not of AST construction), but it also
  shows in other constructs (comments are not stored at all, and
  if commas/semicolons are not stored, how does one know
  where they were located - programmers tend to be picky
  about their personal or project-wide style guides?);

None of this data is available in a SpiderMonkey parse node.

  • there are some minor oddities, from spelling differences to the spec (Label(l)ed),

Heh, I shouldn't've capitulated to my (excellent and meticulous!) reviewer, who was unfamiliar with the spec:

https://bugzilla.mozilla.org/show_bug.cgi?id=533874#c28

I can probably change that.

to structuring decisions (why separate UpdateExpression and LogicalExpression, when everything else is in UnaryExpression and BinaryExpression?);

I separated update expressions and logical expressions because they have different control structure from the other unary and binary operators.
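
For example (node shapes per the Parser API documentation, abbreviated here): `&&`/`||` decide whether their right operand is evaluated at all, and `++`/`--` both read and write their operand, unlike the ordinary unary and binary operators.

```js
Reflect.parse('a && b').body[0].expression;
// => { type: 'LogicalExpression', operator: '&&',
//      left:  { type: 'Identifier', name: 'a', ... },
//      right: { type: 'Identifier', name: 'b', ... } }

Reflect.parse('x++').body[0].expression;
// => { type: 'UpdateExpression', operator: '++', prefix: false,
//      argument: { type: 'Identifier', name: 'x', ... } }
```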

  btw, why alternate/consequent instead of then/else, and

I was avoiding using keywords as property names, and consequent/alternate are standard terminology. I suppose .then/.else would be more convenient.

  shouldn't that really be consequent->then and alternate->else
  instead of the other way round (as the optional null for
  consequent suggests)?

Doc bug, thanks. Fixed.

My main issue is unparsing support for program transformations

https://bugzilla.mozilla.org/show_bug.cgi?id=590755

(though IDEs will similarly need more info, for comment extraction, syntax highlighting, and syntax-based operations).

This is all the stuff that will almost certainly require separate implementations from the engine's core parser. And maybe that's fine. In my case, I wanted to implement a reflection of our existing parser, because it's guaranteed to track the behavior of SpiderMonkey's parser.

What I did for now was to add a field to each Node, in which I store an unprocessed Array of the sub-ASTs, including tokens. Essentially, the extended AST Nodes provide both abstract info for analysis and evaluation and a structured view of the token stream belonging to each Node, for lower-level needs.

Whitespace/comments are stored separately, indexed by the start position of the following token (this is going to work better for comment-before-token than for comment-after-token, but it is a start, for unparsing or comment-extraction tools).

You've lost me again. Are you describing a parser you wrote?

This allows for a generic traversal of the Array-based unprocessed AST fragments, for unparsing, but I still have to rearrange things so that I can actually store the information I need (can't add info to null as an AST value) and distinguish meta-info ("computed" and "prefix" properties) from sub-ASTs.

I'm still lost.

Overall, the impression is that this AST was designed by someone resigned to the fact of having to write Node-type-specific traversal code for each purpose, with a limited number of purposes planned (such as evaluation). This could be a burden for other uses of such ASTs (boilerplate tax).

It was designed to be minimal and serializable. It was a lot of code, so I figured I would just focus on a) making sure all the data was there and b) making it possible to provide a custom data format via the builder API. This is what I came up with, but I can revisit the API design if it's useful.

I hope these notes help - I'd really like to see a standard JS parser API implemented across engines. For language experimentation, we'd still need separate tweakable parsers, but access to the efficient engine parsers for current JS would give tool development a boost.

I'm still not convinced this is such a big win. Reflect.parse gives you some performance, but it still requires two traversals (one to generate the internal C++ JSParseNode tree and then a second to convert that to a JS object tree). But part of the benefit is knowing you have exactly the SpiderMonkey parser. Once implementors have to write a separate parser, the possibility of divergence increases, and the maintenance cost of building a second parser in a low-level language is high. At that point, they might just want to write it in JS. But anybody could do that.

But there are also tough questions about what the parser should do with engine-specific language extensions.

Actually, that starts before the AST: I'd like to see feature-based language versioning, instead of the current monolithic version numbering - take generators as an example feature:

Perhaps JS1.7 ("javascript;version=1.7") happens to be the first JS version to support "yield", and is backwards compatible with JS1.5, which might happen to match ES3; and JS1.8.5, which happens to match ES5, might be backwards compatible with JS1.7. But it is unlikely that the JSx which happens to match ES6 will be backwards compatible with JS1.7 (while ES5-breaking changes will be limited, replacing experimental JS1.x features with standardized variants is another matter).

Whereas, if I was able to specify "use yield", and be similarly selective about other language features, then either of JS1.7, JS1.8.5 and ES6 engines might be able to do the job, depending on what other language features my code depends on. Also, other engines might want to implement some features -like "yield"- selectively, without aiming to support all of JS1.7, and long before being able to support all of ES6.

That's asking for quite a modularized/configurable parser.

I agree about the issue of multiple parsers. The reason I was able to do the SpiderMonkey library fairly easily was that I simply reflect exactly the parser that exists. But to have a standards-compliant parser, we'd probably have to write a separate parser. That's definitely a tall order.

It should not be, provided one distinguishes between standards-compliant and production use. If the ES grammar is LR(1), it should really be specified in a parser tool format,

Mainstream production JS engines have moved away from parser generators.

both for verification and to generate standards-compliant tools to compare against. Depending on how efficient the JS Bison implementation is, this might even lead to usable parser performance.

Again, this could be implemented by anyone as a pure JS library.

There may be problems in finding a tool that generates all the information needed for a useful AST (source locations, comments, scope info, ..), but we do not need to solve every issue immediately to make progress, right? And if the ES committee were to ask ES parser generator implementors whether their tools could be extended to serve an AST spec, response might be favourable.

It would be nice if the spec parser was generated in Javascript, but any tool-usable standard grammar would be useful - once the grammar can be processed by a freely available tool, it can be translated to similar formats, some of which have Javascript implementations (eg Jison, ANTLR).

Having played a little with the ANTLRWorks environment, it looks promising, is easy to install (just a .jar), has user-contributed ES grammars, and can spot some ambiguities easily (though I don't think its check is complete, and the ES grammar is too complex to make naïve parse-tree visualization helpful). If other tools have better ES grammar development support, I'd like to hear about them.

Without a standard spec-conformant tool-readable grammar, such tools remain of limited use. With a tool-readable grammar, adding AST generation might turn out to be an afternoon's work (followed by years of testing/debugging;-).

A standard, machine-processable grammar would be a nice-to-have. Agreed.

I hate to complain, but can you try to trim your messages? It takes an enormous amount of time to read and respond to these huge messages.

https://twitter.com/#!/statpumpkin/status/66187260407709696
# Zachary Carter (14 years ago)

On Wed, Jul 6, 2011 at 12:00 AM, David Herman <dherman at mozilla.com> wrote:

Again, this could be implemented by anyone as a pure JS library.

FWIW, I've implemented such a library here: zaach/reflect.js. The grammar is based on the old JavaScriptCore Bison grammar with some tweaks to make it LALR(1) (Jison doesn't do efficient LR(1) yet).

# Brendan Eich (14 years ago)

On Jul 5, 2011, at 9:00 PM, David Herman wrote:

  • there are some minor oddities, from spelling differences to the spec (Label(l)ed),

Heh, I shouldn't've capitulated to my (excellent and meticulous!) reviewer, who was unfamiliar with the spec:

bugzilla.mozilla.org/show_bug.cgi?id=533874#c28

I can probably change that.

Your API names do not follow the spec (which I'm not complaining about), so I don't see why they must use UK spelling (misspellings, in my humble opinion!) of "Labeled". :-P

 btw, why alternate/consequent instead of then/else, and

I was avoiding using keywords as property names, and consequent/alternate are standard terminology. I suppose .then/.else would be more convenient.

Is there any way an engine not supporting reserved identifiers as property names could use Reflect.parse-based (conditionally-based) code? I.e. is avoiding .else desirable? If not, then I'm +1 on .then and .else as shorter and plainer names.

(though IDEs will similarly need more info, for comment extraction, syntax highlighting, and syntax-based operations).

This is all the stuff that will almost certainly require separate implementations from the engine's core parser. And maybe that's fine. In my case, I wanted to implement a reflection of our existing parser, because it's guaranteed to track the behavior of SpiderMonkey's parser.

Agreed, this is a critical point. For any AST standard (which won't be in ES.next, Claus should note -- cutoff was May), implementors will not want to write two parsers and keep them in sync. We'll have to find common ground among implementations. This is uncertain and it will take at least time and a spirit of compromise among implementors.

I'm still not convinced this is such a big win. Reflect.parse gives you some performance, but it still requires two traversals (one to generate the internal C++ JSParseNode tree and then a second to convert that to a JS object tree). But part of the benefit is knowing you have exactly the SpiderMonkey parser. Once implementors have to write a separate parser, the possibility of divergence increases, and the maintenance cost of building a second parser in a low-level language is high. At that point, they might just want to write it in JS. But anybody could do that.

+1 again -- divergence bugs in an implementation, plus the footprint hit, are two strikes against two parsers instead of one. OTOH implementing in JS is "just library code" and there won't be a need for a de-jure standard ahead of de-facto competition.

It should not be, provided one distinguishes between standards-compliant and production use. If the ES grammar is LR(1), it should really be specified in a parser tool format,

Mainstream production JS engines have moved away from parser generators.

Right, indeed most (all but JavaScriptCore, IINM) never used a parser generator in the first place. A great many industrial language implementations do not use generated parsers.

# Brendan Eich (14 years ago)

On Jul 5, 2011, at 10:35 PM, Brendan Eich wrote:

On Jul 5, 2011, at 9:00 PM, David Herman wrote:

Mainstream production JS engines have moved away from parser generators.

Right, indeed most (all but JavaScriptCore, IINM) never used a parser generator in the first place. A great many industrial language implementations do not use generated parsers.

Some of the reasons for this include

  • Hand-crafted parsers can be significantly faster than Bison or the like. Yes, I'm generalizing.
  • Lack of parameterized productions (e.g. for "NoIn" vs. unsuffixed ECMA-262 nonterminals) requires significant duplication of sub-grammar productions and their semantic actions.
  • Error recovery is usually better in hand-crafted parsers.

Oliver may have more to say -- he got rid of JavaScriptCore's Bison grammar, which dates back to KJS.

# Claus Reinke (14 years ago)

the AST API strawman .. It hasn't really been championed so far. I was concentrating on other proposals for ES.next.

So it was just timing, not behind-the-scenes disagreements. Ok.

  • it does not support generic traversals, so it definitely needs a pre-implemented traversal, sorting out each type of Node (Array-based ASTs, like the es-lab version, make this slightly easier - Array elements are ordered, unlike Object properties);

I designed it to be easily JSON-{de}serializable, so no special prototype. However, you can use the builder API to construct your own format:

developer.mozilla.org/en/SpiderMonkey/Parser_API#Builder_objects

With a custom builder you can create objects with whatever methods you want, and builders for various formats can be shared in libraries.

  at that stage, simple applications (such as tag generation)
  may be better off working with hooks into the parser, rather
  than hooks into an AST traversal? also, there is the risk that
  one pre-implemented traversal might not cover all use cases,
  in which case the boilerplate tax would have to be paid again;

I don't understand any of this.

Your builder API defines hooks into the parser (code the parser calls on encountering certain pieces of syntax). The alternative is to let the parser build its standard Node AST, then write traversals for that AST, and have hooks in the traversal (code the traversal calls on encountering certain Node types).

Parser hooks are tied to the parsing algorithm, but that can be sufficient for tasks that can be handled in a single pass (generating tags files, generating lint advice, ..). AST traversal hooks can build on different traversal schemes (top-down is useful for some tasks, bottom-up for others, a few tasks require combinations of both).

In both cases, having to write code for each Node type or for each kind of syntax phrase is a pain. It also tends to be unnecessary, as most AST-based tasks can be factored into default code (the same scheme for most types of Node) and a few special cases.

Having to write out the default code (often a recursion scheme) repeatedly, for each type of Node, is known as boilerplate code, and people working with ASTs try to avoid that if they can. In statically typed languages, this isn't trivial, but in dynamically typed JS, it should be easy.

Take a couple of standard examples:

  • extracting free variables:

    • by default, recurse into sub-ASTs, collecting sub-results;
    • for variable occurrences, record them;
    • for variable binders, recurse and collect, then remove the bound variables from the results.
  • generating tags file:

    • by default, recurse into sub-ASTs
    • for variable/function definitions, write out source location

For less trivial tasks the number of special cases increases, but the idea is the same: only write out the non-standard cases.

Starting from a Node AST, we'd have to write handlers for every Node type. With an Array-based AST, we could write a generic handler for the recursive traversal, then override it for the few node types of interest. With the Builder API, we could omit handlers, but the omitted handlers default to producing Nodes (if we store information in non-local variables, this might suffice for the two simple tasks above).
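
To make the idea concrete, here is a rough sketch of the "default case plus a few overrides" style, deliberately simplified (it ignores var declarations, property names and other binding subtleties):

```js
// Generic traversal over the object-format Node AST: a handler per node type
// where needed, and a default that just recurses into node-valued children.
function traverse(node, handlers) {
  if (!node || typeof node !== 'object') return;
  if (handlers[node.type]) { handlers[node.type](node); return; }
  for (var key in node) {
    var child = node[key];
    if (Array.isArray(child)) child.forEach(function (c) { traverse(c, handlers); });
    else if (child && typeof child.type === 'string') traverse(child, handlers);
  }
}

// Free-variable extraction then only needs overrides for identifiers and binders.
function freeVars(ast) {
  var free = [];
  traverse(ast, {
    Identifier: function (n) { free.push(n.name); },
    FunctionExpression: function (n) {
      var bound = n.params.map(function (p) { return p.name; });
      freeVars(n.body).forEach(function (v) {
        if (bound.indexOf(v) < 0) free.push(v);
      });
    }
  });
  return free;
}
```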

Things get more complex for realistic tasks, which mix top-down with bottom-up traversals for program analysis and transformation.

One could write a Stratego[1]-style traversal library on top of the Node AST, and then define traversals in terms of this library, without the boilerplate code (which is what we did for the Haskell Refactorer HaRe, and also what the Wrangler team did for Erlang).

That would also provide some isolation from AST details, as traversal-based code would only mention the Node types it needs to handle, not all the Node types it needs to traverse.
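
Roughly, such a library boils down to a handful of combinators. A sketch only (ignoring in-place mutation, node sharing and the like; `negateTest` is a made-up helper standing in for a real rewrite rule):

```js
// Apply f to every immediate node-valued child, rebuilding the node.
function all(f) {
  return function (node) {
    var copy = {};
    for (var key in node) {
      var child = node[key];
      if (Array.isArray(child))                         copy[key] = child.map(f);
      else if (child && typeof child.type === 'string') copy[key] = f(child);
      else                                              copy[key] = child;
    }
    return copy;
  };
}
function topdown(f)  { return function (n) { return all(topdown(f))(f(n)); }; }
function bottomup(f) { return function (n) { return f(all(bottomup(f))(n)); }; }

// A transformation then names only the cases it cares about, e.g. with a
// hypothetical negateTest helper that flips an IfStatement's condition:
var flipIfs = topdown(function (n) {
  return n.type === 'IfStatement' ? negateTest(n) : n;
});
```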

I'll stop here, to split my reply into smaller chunks, as requested.

Claus

[1] strategoxt.org/Stratego/WebHome

For an overview of the strategic rewriting approach to AST traversals and transformations, see this manual section:

hydra.nixos.org/build/1145532/download/1/manual/chunk-chapter/stratego-rewriting-strategies.html#id1124294

# Oliver Hunt (14 years ago)

On Jul 5, 2011, at 10:45 PM, Brendan Eich wrote:

On Jul 5, 2011, at 10:35 PM, Brendan Eich wrote:

On Jul 5, 2011, at 9:00 PM, David Herman wrote:

Mainstream production JS engines have moved away from parser generators.

Right, indeed most (all but JavaScriptCore, IINM) never used a parser generator in the first place. A great many industrial language implementations do not use generated parsers.

Some of the reasons for this include

  • Hand-crafted parsers can be significantly faster than Bison or the like. Yes, I'm generalizing.

In our case 2-3x, when going from Bison to hand-coded.

  • Lack of parameterized productions (e.g. for "NoIn" vs. unsuffixed ECMA-262 nonterminals) requires significant duplication of sub-grammar productions and their semantic actions.

Yup -- absence of mode switches in generators is poor.

  • Error recovery is usually better in hand-crafted parsers.

Can't speak for error recovery (JSC's parser makes no attempt to recover after an error), but you can produce much more useful error messages than most generated parsers allow.

Oliver may have more to say -- he got rid of JavaScriptCore's Bison grammar, which dates back to KJS.

There are a few other advantages, mostly relating to controlling stack usage and minor tweaks like branch ordering.

One issue I had looking at parser generators is that they all tend to either want a large library brought in (ANTLR) or specific (ugly) member names in the types that you provide (entire YACC family). Also they have a tendency to assume trusted content -- ANTLR will happily overflow the stack, which is clearly not ideal.

# Allen Wirfs-Brock (14 years ago)

On Jul 6, 2011, at 10:02 AM, Claus Reinke wrote:

the AST API strawman .. It hasn't really been championed so far. I was concentrating on other proposals for ES.next.

So it was just timing, not behind-the-scenes disagreements. Ok.

I don't think we have any consensus within TC39 WRT whether such an API should be part of some future version of Ecma-262. Personally, this sounds like library functionality that in the future might manifest as a module. I think we need to draw a line at adding very many more such libraries to the core standard. For things like this it is too hard to specify them and not clear that there is a single preferred solution.

Everybody would like their favorite library to be "built-in" so there is no loading cost. But there are a huge number of potentially useful libraries and only a few are ever going to get built-in. Which ones do you choose? Is JavaScript AST generation really the highest priority on the list? What percentage of web applications need to parse JS code?

Rather than pushing for more built-in libraries I think we should all get behind just doing "it" in JS. To be successful we also need to continue to make JS a more powerful language, with higher performance implementations, and browser hosts with better caching schemes.

# Claus Reinke (14 years ago)

  • it is slightly easier to manipulate than an Array-based AST, but

More than slightly, IMO.

Ok. Though destructuring closes some of the gap.

  lack of pattern matching fall-through (alternative patterns for
  destructuring) still hurts, and the selectors are lengthy, which
  hampers visualization and construction; (this assumes that
  fp-style AST processing is preferred over oo-style processing)

If I'd defined a new object type with its own prototype, it still wouldn't define all operations anyone would ever want. So they'd either have to monkey-patch it or it would need a visitor. Which you could write anyway. So I don't see much benefit to pre-defining a node prototype.

You're right that no single traversal scheme will serve all purposes, and I agree that a prototype isn't needed for fp-style processing. I was just remarking that much of the conciseness of fp for AST processing comes from good deconstruction and construction support, which is not entirely there yet in JS.

Instead of writing cases of destructuring patterns and selecting the first that matches, we have to write conditions to find out what type of Node we have, and only then destructure. And the long selectors make even that slightly tedious (though they help with readability):

    // negate the test and swap the branches
    if (node.type === 'IfStatement') {
      let {type, test, consequent, alternate} = node;
      let notTest = {type: 'UnaryExpression',
                     operator: {type: 'UnaryOperator', token: '!'},
                     prefix: true,
                     argument: test};
      let newIfStatement = {type,
                            test: notTest,
                            consequent: alternate,
                            alternate: consequent};
    }

Object literal improvements help, object update { p: v, ... : obj } would help, and a way to recover from refutable destructuring would help. So the proposals and strawmen are heading in the right direction.

But again, see the builder API, where you can create your own custom node type.

Yes, that is useful. But to use this to create custom ASTs which could make specific tasks easier, I would need to write a handler for every builder callback, so I'm back to boilerplate code.

It seems there is no way to avoid writing a generic traversal library for Node ASTs. That has to be done only once, though, and does not need to be part of the standard.

  • it is biased towards evaluation, which is a hindrance for other uses (such as faithful unparsing, for program transformations);

It's just a reflection of the built-in SpiderMonkey parser, which was designed for the sole purpose of evaluation. I didn't reimplement a new parser.

Right. But is that what we'd want from a standard Parser API?

  this can be seen clearly in Literals, which are evaluated (why
  not evaluate Object, Array, Function Literals as well? eval should
  be part of AST processing, not of AST construction), but it also
  shows in other constructs (comments are not stored at all, and
  if commas/semicolons are not stored, how does one know
  where they were located - programmers tend to be picky
  about their personal or project-wide style guides?);

None of this data is available in a SpiderMonkey parse node.

Indeed. You wouldn't believe how frustrating it can be that textbook parsing ignores comments, when you're looking for a parser to use for program transformations, or a documentation generator, or other IDE-style tools. It means that the majority of parsers can't be re-used.

The question is: does one augment existing parsers, to enable tool building on top, or does one let every tool builder write and maintain their own parser? The former means a little work for existing parsers, the latter means a lot of work for wannabe tool builders. The latter also means that lots of useful small tools won't be built, because building them wouldn't be a small effort.

Having a standard JS-in-JS parser could be a good compromise, but even for that a standard AST has merit.

  • there are some minor oddities, from spelling differences to the spec (Label(l)ed),

Heh, I shouldn't've capitulated to my (excellent and meticulous!) reviewer, who was unfamiliar with the spec:

bugzilla.mozilla.org/show_bug.cgi?id=533874#c28

I can probably change that.

The AST is not directly related to the grammar, so it isn't a big deal. Just confusing to us English-as-a-foreign-language folks.

to structuring decisions (why separate UpdateExpression and LogicalExpression, when everything else is in UnaryExpression and BinaryExpression?);

I separated update expressions and logical expressions because they have different control structure from the other unary and binary operators.

Hm, ok, at least that explains it (perhaps add that note to the docs?). I'm not convinced the separation is useful, though - for many tasks, it is just another obstacle. Only experience will tell.

  btw, why alternate/consequent instead of then/else, and

I was avoiding using keywords as property names, and consequent/alternate are standard terminology. I suppose .then/.else would be more convenient.

Ok, perhaps a useful strategy while older engines are still in use.

Still, shorter would be better - we have to write out those selectors for deconstruction and construction, and for displaying ASTs. Not to mention nested ASTs (alternate.alternate.alternate.consequent;-).

My main issue is unparsing support for program transformations

bugzilla.mozilla.org/show_bug.cgi?id=590755

Thanks, that is a start. Actually, it will be sufficient for some applications.

Unfortunately, experience tells me it won't be sufficient for user-level program transformations. Several refactoring tool builders have commented how changing user code by pretty-printing negatively impacted tool uptake. When Haskell and Erlang refactorers took some pains to keep user code as written (apart from places where refactorings generated entirely new code), that was one of the most positively evaluated aspects of these tools.

(though IDEs will similarly need more info, for comment extraction, syntax highlighting, and syntax-based operations).

This is all the stuff that will almost certainly require separate implementations from the engine's core parser. And maybe that's fine. In my case, I wanted to implement a reflection of our existing parser, because it's guaranteed to track the behavior of SpiderMonkey's parser.

Understood. But shouldn't separate parsers also implement the standard parser API? And shouldn't it therefore cover the information needed for such common use cases?

Browser parsers might then only support a partial profile of the full standard API - whatever they can support without negative impact on their main usage.

Though it might not actually cost much to support the additional info in SpiderMonkey: most of it could be in the token stream, which is usually thrown away, but could be kept via a flag, and the AST's source locations can be used to extract segments of the token stream (such as any comments preceding a location).

Breaking again, Claus

# Oliver Hunt (14 years ago)

This is all the stuff that will almost certainly require separate implementations from the engine's core parser. And maybe that's fine. In my case, I wanted to implement a reflection of our existing parser, because it's guaranteed to track the behavior of SpiderMonkey's parser.

Understood. But shouldn't separate parsers also implement the standard parser API? And shouldn't it therefore cover the information needed for such common use cases?

The problem is our parsers don't produce the same AST, and the structure produced by parsing an arbitrary piece of JS is the API. To have a standard API requires a standardised AST, which seems unlikely to happen (for reasons laid out many times in the past).

Browser parsers might then only support a partial profile of the full standard API - whatever they can support without negative impact on their main usage.

Partial support of an API means your code would have to deal with what is missing (which will vary between browsers).

JSC's parser is constructed in such a way that we could generate a parse tree directly into JS form (it would be a matter of jumping through hoops to create the correct builder), however doing so probably won't provide a tree that you necessarily want as something to manipulate. We drop var declarations (relying on other tracking instead), and we don't track token locations beyond what is needed for some specific cases. Some "lists" in the grammar are represented as linked lists, others as arrays, etc.

I have a vague recollection that the SM parser strips out some characters prior to parsing (brendan?) and generates bytecode into the ast (though based on dherman's comments maybe that's no longer the case)

Though it might not actually cost much to support the additional info in SpiderMonkey: most of it could be in the token stream, which is usually thrown away, but could be kept via a flag, and the AST's source locations can be used to extract segments of the token stream (such as any comments preceding a location).

Speaking again for JSC -- we don't have an actual "token stream" the lexer provides. The lexer simply walks the input source a token at a time as requested, partially because how we lex is driven by the context and mode of the parser, and partially because creating a distinct token stream is nice in an academic context, but would be a huge performance hole in practice. And the tokens that we do have don't necessarily contain all the information you would want (because each additional write the lexer makes to the token structure is actually measurable in some of our perf tests).

# Brendan Eich (14 years ago)

On Jul 6, 2011, at 12:55 PM, Oliver Hunt wrote:

I have a vague recollection that the SM parser strips out some characters prior to parsing (brendan?)

Nothing is stripped other than space (including comments).

and generates bytecode into the ast (though based on dherman's comments maybe that's no longer the case)

Not for many, many years.

Your points are well-taken. A standard AST reflection library is a very dark horse at this point. Better to build parsers that suit tools' needs, including comment preservation, etc.

# Claus Reinke (14 years ago)

I don't think we have any consensus within TC39 WRT whether such an API should be part of some future version Ecma-262. Personally, this sounds like library functionality that in the future might manifest as a module. I think we need to draw a line at adding very many more such libraries to the core standard. For things like this it is too hard to specify them and not clear that there is a single preferred solution.

Ah, sorry. Seems I got the wrong impression from the list archives - perhaps only the interested parties participated in those older threads. Thanks for clarifying.

Anyway, even for a library version, there needs to be an API. And since there are multiple JS parsers, and JS parser users, it would be helpful if they could agree on a standard AST. Which brings us back to the topic, independent of whether or not browser vendors choose to expose their ASTs.

Everybody would like their favorite library to be "built-in" so there is no loading cost. But there are a huge number of potentially useful libraries and only a few are ever going to get built-in.

If this was something you still needed to build in, I wouldn't argue for it. The point is that the JS parser is already in there. I can even call it from JS, via 'eval', in any JS implementation. I just can't get my hands on the intermediate AST yet!-)

From recent messages, it seems that the general opinion is actually against a standard parser API, which is surprising.

In as much as this is due to technical reasons (efficiency concerns, too large a gap between internal representation and useful external API), could this be reflected on the strawman page, please?

strawman:ast

Just so no-one else gets unreasonably optimistic.

Which ones do you choose? Is JavaScript AST generation really the highest priority on the list? What percentage of web applications need to parse JS code?

If some startups get their wishes, some JS development might move into the browser (browser-based editors and IDEs, storage in the cloud). All developers using those tools would depend on browser-based JS parsing. Even in bog-standard development today, debuggers and profilers could profit from parsing and instrumenting code (building such tools in JS instead of hooking into lower-level browser APIs).

And JS has long ceased to be only about web applications or browsers. Once JS engines have commandline interfaces, or socket interfaces, they are accessible wherever any JS tool development is taking place. So why shouldn't they offer their internal functionality to Javascript coders, to support building tools for JS in JS?

Yes, JS should be fast enough to support JS-based parsers, and for playing with extended JS, or with non-standard parser features, there is no way around that. But why have two standard parsers when the built-in one will do the job?

But as Oliver has pointed out, the built-in one might not be able to do the job after all.

Claus

# David Herman (14 years ago)
  • it is biased towards evaluation, which is a hindrance for other uses (such as faithful unparsing, for program transformations);

It's just a reflection of the built-in SpiderMonkey parser, which was designed for the sole purpose of evaluation. I didn't reimplement a new parser.

Right. But is that what we'd want from a standard Parser API?

I mentioned SpiderMonkey's Reflect.parse on the wiki page, but I haven't actually proposed it as a standard; it's just currently there as a tool to reflect what SpiderMonkey does.

The question is: does one augment existing parsers, to enable tool building on top, or does one let every tool builder write and maintain their own parser?

False dichotomy. If the only way to share code were for TC39 to put it into the standard library, JS would have died a long, long time ago. :)

Thanks, that is a start. Actually, it will be sufficient for some applications.

Unfortunately, experience tells me it won't be sufficient for user-level program transformations.

Sorry to hear that. I really would suggest you experiment with a pure JS parser for your needs. You could even start with one of the existing ones (there are several) to help you get off the ground.

Though it might not actually cost much to support the additional info in SpiderMonkey: most of it could be in the token stream, which is usually thrown away, but could be kept via a flag, and the AST's source locations can be used to extract segments of the token stream (such as any comments preceding a location).

This is a fantasy, I'm afraid. The parser is big, complex, and heavily optimized.

# David Herman (14 years ago)

Ah, sorry. Seems I got the wrong impression from the list archives - perhaps only the interested parties participated in those older threads. Thanks for clarifying.

I think you're looking for a thumbs-up/thumbs-down answer from TC39, when we haven't really discussed it that much. We've explained in this thread what we perceive as the challenges, but so far it hasn't really been the focus of many TC39 discussions.

I think the best thing people could do to advance the state of the discussion would be to actually go out and build such a library, in pure JS. As Allen says, this stuff is all implementable in JS. There's no actual need for it to be standardized by TC39.

That's not to say that that won't ever happen, but I don't think it's high up on anyone's priority list right now. It certainly is off the table for ES.next.

In as much as this is due to technical reasons (efficiency concerns, too large a gap between internal representation and useful external API), could this be reflected on the strawman page, please?

strawman:ast

Just so no-one else gets unreasonably optimistic.

I've added a quick list.