ES parsing tools (Re: Short Functions)

# Claus Reinke (14 years ago)

Mark and Tom used Ometa for code.google.com/p/es-lab -- slo-o-o-o-wwwww.

OMeta does do memoization, but it's still slow ;-)

I'm curious: what level of slow are we talking about here?

A basic memoizing top-down parser for PEG-style grammars with a good internal representation, but without further optimizations, will be too slow for interactive use (no replacement for lints or runtime preprocessors, never mind incrementally parsing IDE frontends). With modern browser engines, it should be fine for experimental use, taking from a few seconds for a small file to a minute or so for a larger one (a few thousand lines).
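(By "memoizing" I mean the usual packrat setup. Roughly, and assuming parsers are functions from an input and a position to a result object or null, with a fresh memoized parser built per input:)

```js
// Minimal packrat memoization sketch (hypothetical representation:
// a parser maps (input, pos) to a result object, or null on failure).
function memo(parser) {
  var table = {};                        // pos -> cached result
  return function (input, pos) {
    if (pos in table) return table[pos];
    return (table[pos] = parser(input, pos));
  };
}
```

This guarantees each (rule, position) pair is parsed at most once, which is where the linear-time bound of packrat parsing comes from.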

Much of the inefficiency seems inherent in the grammar, when viewed from a naïve top-down, backtracking, commit-to-first-successful-branch parser. Since being close to the spec is part of the game here, rewriting the input grammar is out, making grammar optimizations in the parser generator important if one wants better performance than the above. OMeta seemed to have a good lead on that, so I expected it to be faster than similar grammar-based top-down parsers.

If OMeta should really be slower than similar parsers, and the grammar-optimizing side is covered by the authors, perhaps there is room for old-fashioned JS code optimization? For instance, OMeta/JS is not the main line of OMeta code, and code that is idiomatic in another language might be less than optimal when moved to JS.

Btw, it would be nice to have an es-tools-discuss list, where one could discuss ES implementation and tooling issues (preferably with the engine implementers listening in). Does such a list exist?

Claus

PS. Sometimes, I wonder whether es-discuss would profit from being split into es-{spec,design,tools}-discuss: more focus in the 'spec' part, more room in the 'design' part, and pragmatic realities in 'tools'.

The 'design' part might be more inviting to all those research groups that have built JS semantics or analyses, and the 'tools' part might help to keep the variety of tool efforts in view. Both would invite useful input/background for 'spec'.
# Brendan Eich (14 years ago)

On May 28, 2011, at 1:56 AM, Claus Reinke wrote:

If OMeta should really be slower than similar parsers, and the grammar-optimizing side is covered by the authors, perhaps there is room for old-fashioned JS code optimization? For instance, OMeta/JS is not the main line of OMeta code, and code that is idiomatic in another language might be less than optimal when moved to JS.

Does this line of inquiry do anything for (a) the spec grammar needing validation (no ambiguities); (b) implementor acceptability?

I don't see how (a) is served. Please correct me if I'm missing something.

The answer to (b) is "no". Implementors use tuned hand-crafted top-down parsers, not Ometa or anything like it.

Btw, it would be nice to have an es-tools-discuss list, where one could discuss ES implementation and tooling issues (preferably with the engine implementers listening in). Does such a list exist?

No, we're not going to split this list. Volume is up lately but nowhere near fatally high, and topics interconnect.

Researching JS parsing is probably better done in a research setting. Talking about it here in open-ended ways doesn't help make near-term progress in the standardized language. Having a more closed or "done" solution that addresses (a) and (b) above would help, though.

I'm not against research (Mozilla funds research), and we do need better methods over time. But developing them is specialized work best done in more specialized venues, without the more practical constraints that JS developers, interested experts, and TC39 members face in working on "ES" and discussing it.

And if we can avoid research by using existing formalisms and tools, then we should. That's strictly better for standardization.

# Claus Reinke (14 years ago)

tl;dr:
- JS-based PEG + ANTLR as a route for ES grammar experiments
- would like to know about route viability and alternative routes

If OMeta should really be slower than similar parsers, and the grammar-optimizing side is covered by the authors, perhaps there is room for old-fashioned JS code optimization? For instance, OMeta/JS is not the main line of OMeta code, and code that is idiomatic in another language might be less than optimal when moved to JS.

Does this line of inquiry do anything for (a) the spec grammar needing validation (no ambiguities); (b) implementor acceptability?

As indicated by the change of subject, this is a separate topic, triggered by remarks made by several posters on the previous topic. Having good tools that support testing proposals in practice is important to proposal quality, and several remarks suggested a tooling problem in the area of parsing experimental ES extensions.

Since committee members have invested in Ometa usage, knowing whether or not their work can be sped up without a complete reimplementation is relevant.

I don't see how (a) is served. Please correct me if I'm missing something.

To repeat my previous suggestion for (a): have a look at ANTLRWorks [0]. While I have not tried it myself yet, it claims to be a "sophisticated grammar development environment" (longer excerpt from the blurb included below).

The relation to the current topic is that ANTLR's LL(*) parsing [1] aims to generalize and optimize PEGs and related formalisms typical for top-down parsers (while GLR and bison are typical for bottom-up parsers, which are not relevant for ES implementations, according to your information).

PEGs without left-recursion can apparently be translated to ANTLR grammars, and PEGs with limited lookahead can be implemented by ANTLR with close to LL(k) efficiency. So starting with a JS-based PEG tool for in-browser experimentation, then working to limit backtracking (using finite or regular lookahead for committed choice), and keeping ANTLRWorks in mind as an analysis framework sounds viable. It is the route I have chosen so far.
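(To make "committed choice" concrete, a hypothetical PEG fragment, not the actual ES rules:)

```
// plain PEG: a failure anywhere inside A backtracks into B
Rule <- A / B

// left-factored: a shared prefix (in effect, bounded lookahead)
// decides the branch, so input is never re-read
Rule <- Common (ATail / BTail)
```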

(ANTLR even has backends in various languages, though I doubt that existing ES engine implementations will want to reengineer their frontends on top of ANTLR-generated parsers.)

If there are other or better options for ES grammar experiments, I hope this thread will bring them out. There are certainly other "language workbenches" (Spoofax and MPS come to mind), but do they offer a JS-based starting point, like the PEG route? Or grammar debugging support?

The related work section of a paper on "Language Modularization and Composition" [2] lists some alternatives, but recent ones tend to be projectional (source is a view, projected from ast as a model) rather than parser-based (ast is extracted from source).

The answer to (b) is "no". Implementors use tuned hand-crafted top-down parsers, not Ometa or anything like it.

The postings that triggered my inquiry were authored by coders who are not implementing the major ES engines, but who are ES implementers nevertheless. Some wanted to be able to play with proposed ES extensions, to get a feel for existing proposals or to prepare more realistic suggestions of their own; others have a working implementation of an ES security extension, but expressed concern about the efficiency of their solution. My own interests lie in providing tools for JS, and implementable suggestions for ES.

Your own reaction suggested that you consider Ometa too slow, and that you test your own extensions on grammar subsets in bison, adding only the most promising ones to full, handcrafted JS implementations. The overall impression was that there is no viable option for practical full ES grammar experiments, short of extending a full JS implementation, and that the latter do not offer a validation route.

I want to express a counterpoint to those remarks, but I need some input first: not having practical experience with Ometa, I cannot say whether it is slower than one would expect from the publications. If it is slower than the reference I provided, based on experiments with other JS-based parser tools, then it is likely that one can optimize the code or switch to a more efficient parsing tool, without having to reimplement the secure ES implementation or study parsing techniques.

If Ometa already is as fast as similar parsing tools, just too slow for the secure ES project, I can still attest that such tools are fast enough for smaller experiments with ES extensions.

Being based on explicit representations of the ES grammar, they are also easier to understand and change than hand-optimized parsers, and the grammars can often be translated to the input forms expected by other tools (for verification, performance, IDE support, or documentation/spec purposes).

Btw, it would be nice to have an es-tools-discuss list, where one could discuss ES implementation and tooling issues (preferably with the engine implementers listening in). Does such a list exist?

No, we're not going to split this list. Volume is up lately but nowhere near fatally high, and topics interconnect.

Researching JS parsing is probably better done in a research setting. Talking about it here in open-ended ways doesn't help make near-term progress in the standardized language. Having a more closed or "done" solution that addresses (a) and (b) above would help, though.

The problem is not with research, it is with connecting research to the ES design process. An open-ended discussion list for theory, tools, and problems relevant to this process might attract the researchers who have been working on ES semantics and analysis, but currently remain silent or absent on this list.

Keeping the open-ended list separate from the spec-focused list would provide a specialized venue for recording ES-specific problems (which interested researchers might want to work on) and solutions (the suitability of which researchers might want to demonstrate).

To be interesting to researchers, such a venue has to be close to the action (es-discuss), without being limited by the timeline and tl;dr constraints of es-discuss, but with a demonstrable chance of discussions having practical relevance and impact (cross-pollination with es-discuss and the language committee).

Claus

[0] www.antlr.org/works

from the blurb: It combines an excellent grammar-aware editor with an interpreter for rapid prototyping and a language-agnostic debugger for isolating grammar errors. ANTLRWorks helps eliminate grammar nondeterminisms, one of the most difficult problems for beginners and experts alike, by highlighting nondeterministic paths in the syntax diagram associated with a grammar. ANTLRWorks' goal is to make grammars more accessible to the average programmer, improve maintainability and readability of grammars by providing excellent grammar navigation and refactoring tools, and address the most common questions and problems encountered by grammar developers:

- Why is this grammar fragment nondeterministic?
- Does this rule match a sample input?
- Why is this grammar improperly matching this complete input?
- Why is there a syntax error given this input?
- Why is there no syntax error given this ungrammatical input?

[1] "LL(*): The Foundation of the ANTLR Parser Generator" Terence Parr and Kathleen Fisher Sat Feb 5, 2011 13:07 [This is a draft of paper accepted to PLDI 2011] www.antlr.org/papers/LL-star-PLDI11.pdf

[2] "Language Modularization and Composition with Projectional Language Workbenches illustrated with MPS" Markus Voelter, Konstantin Solomatov voelter.de/data/pub/VoelterSolomatov_SLE2010_LanguageModularizationAndCompositionLWBs.pdf (see section "4 Related Tools and Approaches")

# Brendan Eich (14 years ago)

On May 29, 2011, at 2:11 AM, Claus Reinke wrote:

tl;dr:

  • JS-based PEG + ANTLR as a route for ES grammar experiments

Mark and Tom already use Ometa at a code.google.com project, as noted. What more do you want?

This does not address the issue of a usable spec grammar that can be validated (my point (a)). In spite of over a thousand words and many paragraphs in reply :-/.

  • would like to know about route viability and alternative routes

Tom testified to slowness. Mark and Tom being on TC39 does not address the "implementor acceptability" (my point (b)) of Ometa, which is nearly nil.

The relation to the current topic is that ANTLR's LL(*) parsing [1] aims to generalize and optimize PEGs and related formalisms typical for top-down parsers (while GLR and bison are typical for bottom-up parsers, which are not relevant for ES implementations, according to your information).

The ECMA-262 spec uses LR(1), and ES3's grammar was validated by Waldemar using his custom Common Lisp program, source available in the old Mozilla CVS repo:

mxr.mozilla.org/mozilla/source/js2/semantics

Changing to something new is certainly possible, but we need at least "that much" validation (what Yacc calls shift/reduce and reduce/reduce conflict detection).

Top-down formalisms may not be suitable if they do not check for ambiguities and rule them out. To quote from Waldemar on this list in October 2008,

"This is why ambiguous grammars are bad and unsuitable for our spec. In an unambiguous grammar, if you find a series of production expansions that matches a source program, then you know that those are the expansions that will be taken. In an ambiguous grammar, you also need to prove the negative: no other expansion can match that source program and shadow your expansions. Proving the negative causes trouble because a language extension could turn the mismatch into a match and because sometimes[...], you expected some other part of the grammar to shadow your expansions but it didn't."

At this point, I'd like to plead for fewer words and aspirational statements, and more concrete details about validation against ambiguity in the system(s) you are proposing.

The postings that triggered my inquiry were authored by coders who are not implementing the major ES engines, but who are ES implementers nevertheless.

If you mean Mark and Tom, please say so. They did not implement a performant JS engine, so that's not helpful for point (b). Last time I'll say this, please don't rehash.

Many people are playing with JS parsers, testing extensions, building transpilers. That's all great, but it does not address the spec issue, specifically both points (a) and (b).

# Brendan Eich (14 years ago)

On May 29, 2011, at 12:21 PM, Brendan Eich wrote:

Mark and Tom already use Ometa at a code.google.com project, as noted. What more do you want?

This does not address the issue of a usable spec grammar that can be validated (my point (a)). In spite of over a thousand words and many paragraphs in reply :-/.

To say a bit more, in a more positive tone: I agree that deterministic choice is attractive considered in a vacuum, to avoid ambiguities and better match how hand-crafted recursive descent parsers are written. So PEG or stronger is plausible when experimenting, no argument there. Mark and Tom's choice to use Ometa for their language lab seems fine.

Recasting the ECMA-262 specification's unambiguous (after ASI, and assuming ES5 added no ambiguities to ES3) LR(1) grammar using a PEG could be tried and evaluated. I would be interested in seeing a no-other-changes translation of the spec grammar into a PEG that is validated somehow.

Here the issue is not validation against ambiguity, since the PEG is unambiguous because of ordered choice. The validation we'd need would be that both grammars produce sentences only and exactly in the same language.

# Kam Kasravi (14 years ago)

I've been experimenting with pegjs - which generates a parser based on the ECMA-262 5th edition grammar. I've been building a backend that walks the AST to regenerate the code (minus whitespace-related formatting). The nice thing about pegjs is that most of the AST nodes agree with the Parser API (developer.mozilla.org/en/SpiderMonkey/Parser_API). My approach is to add various strawmen and have them translated to plain JavaScript - a transpiler approach. I've been very happy with the AST it generates thus far. (pegjs.majda.cz)
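(For anyone who wants to reproduce the setup, the basic round trip looks roughly like this — entry-point names as of the pegjs releases of this era, with a toy grammar standing in for the real ES5 grammar file:)

```js
// Rough sketch of the pegjs workflow; check the pegjs docs for the
// exact API of your version.
var PEG = require("pegjs");

var parser = PEG.buildParser(
  'start = ds:[0-9]+ { return { type: "Literal", value: +ds.join("") }; }'
);

var ast = parser.parse("42");   // { type: "Literal", value: 42 }

// A code-regenerating backend is then a plain walk over the AST:
function emit(node) {
  switch (node.type) {
    case "Literal": return String(node.value);
    default: throw new Error("unhandled node type: " + node.type);
  }
}
console.log(emit(ast));         // "42"
```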

# Kam Kasravi (14 years ago)

Does Waldemar still maintain the tool? The source dates seemed fairly old...

# Waldemar Horwat (14 years ago)

On 05/29/11 19:35, Kam Kasravi wrote:

Does Waldemar still maintain the tool? The source dates seemed fairly old...

It still works. I didn't bother updating the ES3 parser for ES5 because I had already explored the same syntax as part of ES4 and it worked. I'm going to update it for ES.next.

The code does more than just validate the lexical and syntactic grammars. It can encode semantics and evaluate ECMAScript expressions directly from the semantics. It also validates, for example, that there is no context in the grammar where a / could be both a division symbol and the start of a regular expression.
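For illustration, the two readings of /:

```js
x = /foo/g;     // after "=", a "/" must start a regular expression literal
y = a / b / c;  // after an identifier, "/" must be the division operator
// The validator checks that no grammatical context admits both readings.
```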

 Waldemar
# Tom Van Cutsem (14 years ago)

I just wanted to provide a bit more context on our Ometa experiment: our goal was to build an executable parser whose definition was as close as possible to the BNF in the ES5 spec, which worked out fairly well. The main purpose of the tool was to be able to quickly experiment with syntax extensions, but it has never been our goal to have this tool be a "production-quality" JS parser.

For rapid prototyping, Ometa is a marvelous tool. I liked it a lot. Performance-wise, I have not pitched it against other JS parser generators (I actually don't know of many others).

In terms of grammar validation, Ometa is fairly weak. We did at one point get Alex Warth to implement an exclusive choice operator "||", in addition to the standard PEG prioritized choice ("|" operator in Ometa), as a tool to help us get more confidence in the grammar. However, the implementation of that operator is naive and slow (as testified here: <tinlizzie.org/ometa-js/#xor_perf>).


# Kam Kasravi (14 years ago)

On May 31, 2011, at 4:20 PM, Waldemar Horwat <waldemar at google.com> wrote:

On 05/29/11 19:35, Kam Kasravi wrote:

Does Waldemar still maintain the tool? The source dates seemed fairly old...

It still works. I didn't bother updating the ES3 parser for ES5 because I had already explored the same syntax as part of ES4 and it worked. I'm going to update it for ES.next.

The code does more than just validate the lexical and syntactic grammars. It can encode semantics and evaluate ECMAScript expressions directly from the semantics. It also validates, for example, that there is no context in the grammar where a / could be both a division symbol and the start of a regular expression.

Thanks, sounds like a great tool. Do you build an AST along the lines of the Parser API? developer.mozilla.org/en/SpiderMonkey/Parser_API

# Claus Reinke (14 years ago)

I've been experimenting with pegjs - which generates a parser based on the ECMA-262 5th edition grammar.

yes, pegjs [1] and jsparse [2] were the alternatives I had been looking at. pegjs is probably a bit ahead when you just want to use the parse result; I chose to build on jsparse because it gives me more flexibility for grammar-level experimentation (no fixed barrier between grammar and grammar processor).

pegjs comes with a fairly complete transliteration of the ES5 grammar, but you should give it a look-over for PEG vs LR issues (jsparse and ometa have similar issues): a PEG parser will commit to a choice alternative locally, even where the grammar was written with the implicit requirement of a 1 token lookahead for parsing.

Typical example, in Statement:

ExpressionStatement / LabelledStatement

For input "id \n : 1", a PEG parser will commit to the first branch after successfully parsing an Identifier and inserting a semicolon, without ever checking whether the next token is ':' (this assumes the usual local and explicit implementation of ASI, instead of the implicit error correction with 1 token lookahead required by the ASI spec; which is global in that it depends on the whole grammar).
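Schematically (hypothetical rule names, not pegjs's actual grammar):

```
Statement           <- ExpressionStatement / LabelledStatement

// For "id \n : 1" the parser commits to the first branch and the
// label reading is never tried. Local repairs: reorder the
// alternatives, or guard the expression branch with a one-token
// negative lookahead before applying ASI:
Statement           <- LabelledStatement / ExpressionStatement
ExpressionStatement <- Expression !(S* ":") EOS    // S = whitespace
```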

Apart from fixing some such issues in jsparse, I've moved its grammar closer to ES5 (not complete, but can parse code like fulljslint.js or the ASI-reliant read-json.js [3] from node's npm, mentioned in an earlier thread here), added parse error messages and grammar debugging help (grammar rule tracing and rule stack record).

I've been building a backend that walks the AST to regenerate the code (minus whitespace-related formatting). The nice thing about pegjs is that most of the AST nodes agree with the Parser API (developer.mozilla.org/en/SpiderMonkey/Parser_API). My approach is to add various strawmen and have them translated to plain JavaScript - a transpiler approach. I've been very happy with the AST it generates thus far.

Thanks for the reminder. There actually is a strawman for parser API standardization:

http://wiki.ecmascript.org/doku.php?id=strawman:ast

Sadly, that isn't in Harmony yet (oversight, or decision?) but if everyone here agrees on the general format, I'd add it as an option to my parser. It would be good to agree on how to preserve whitespace and comments (essential for source-to-source transformations).

Claus

[1] pegjs.majda.cz
[2] doublec/jsparse (my own extensions to Chris' work are not online yet, but I could make a snapshot available if there is interest)
[3] isaacs/npm/blob/master/lib/utils/read-json.js (a useful ASI-reliant parser test case)

# Claus Reinke (14 years ago)

I just wanted to provide a bit more context on our Ometa experiment: our goal was to build an executable parser whose definition was as close as possible to the BNF in the ES5 spec, which worked out fairly well. The main purpose of the tool was to be able to quickly experiment with syntax extensions, but it has never been our goal to have this tool be a "production-quality" JS parser.

Yes, that is understood. Still, one doesn't want to be hampered by unnecessary performance issues. I found your project only recently, so I made my own decision before I knew that there is an ES5 grammar for Ometa/JS. Ever since then, I've wondered how big an advantage Ometa's grammar optimizations give you over what I'm working on. Hence my surprise when Brendan described it as "slo-o-o-o-wwwww".

For rapid prototyping, Ometa is a marvelous tool. I liked it a lot. Performance-wise, I have not pitched it against other JS parser generators (I actually don't know of many others).

Thanks for confirming the general suitability. As for alternatives, so far, I've found pegjs and jsparse (see my response to Kam).

In terms of grammar validation, Ometa is fairly weak. We did at one point get Alex Warth to implement an exclusive choice operator "||", in addition to the standard PEG prioritized choice ("|" operator in Ometa), as a tool to help us get more confidence in the grammar. However, the implementation of that operator is naive and slow (as testified here: tinlizzie.org/ometa-js/#xor_perf).

Ah, so that is a check-there-is-at-most-one-match operator, which backtracks through every alternative even after a successful match. If that were enabled by default, it would explain the impression of Ometa being slo-o-o-o-wwwww, rather than slow but usable.
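(In combinator terms, the difference is roughly this — hypothetical code, not OMeta's implementation:)

```js
// Parsers map (input, pos) to a result object or null.
// Prioritized choice returns the first success:
function or(a, b) {
  return function (input, pos) {
    return a(input, pos) || b(input, pos);
  };
}
// Exclusive choice must run every alternative even after a success,
// to detect overlaps -- hence the cost:
function xor(a, b) {
  return function (input, pos) {
    var ra = a(input, pos), rb = b(input, pos);
    if (ra && rb) throw new Error("overlapping alternatives at " + pos);
    return ra || rb;
  };
}
```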

Still, from playing with esparser/index.html, I get the impression that the implementation is drastically slower than I expected (compared to pegjs, and to my extended version of jsparse). And that page does seem to have a call to "ES5Parser.disableXORs()" in it.

For reference, peg-0.6.1.js can self-parse in less than 30 seconds (using FF4 on a notebook), while esparser can't handle that code here.

If that first impression should be representative, it would be an obstacle to your work. From browsing esparser/bundle.js, there are some very odd implementation decisions in Ometa/JS that are not likely to be good for performance (check the implementation of "Keyword", which has a method call and String construction for every single character, each of which is extracted via .at(idx), after several levels of function calls; or the implementation of _or, which uses exception handling to handle choices, in an inner loop).
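(Schematically, the contrast is between something like these two — again, not OMeta's actual code:)

```js
// Choice driven by exceptions pays for a throw/catch on every
// failed alternative, in the parser's hottest loop:
function orViaExceptions(alts) {
  return function (state) {
    for (var i = 0; i < alts.length; i++) {
      try { return alts[i](state); }     // alternative throws on failure
      catch (e) { /* fall through to the next alternative */ }
    }
    throw new Error("no alternative matched");
  };
}

// Choice driven by ordinary return values stays on the fast path:
function orViaResults(alts) {
  return function (state) {
    for (var i = 0; i < alts.length; i++) {
      var r = alts[i](state);            // alternative returns null on failure
      if (r !== null) return r;
    }
    return null;
  };
}
```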

Assuming that most of Ometa/JS is itself generated, you might be able to tune the generator or its library code to get more usable performance, without having to change your own code.

Btw, doing my usual ES5/PEG check, I like that "||" alerts you to some grammar problems (only when they occur, a dynamic rather than static check) but you might still want to correct them (the "id \n : 1" example will not parse as expected).

Claus

# Kam Kasravi (14 years ago)

Yes I would be very interested in your extensions. Thanks for the pointers below.

# Waldemar Horwat (14 years ago)

On 06/01/11 01:32, Kam Kasravi wrote:

On May 31, 2011, at 4:20 PM, Waldemar Horwat <waldemar at google.com> wrote:

On 05/29/11 19:35, Kam Kasravi wrote:

Does Waldemar still maintain the tool? The source dates seemed fairly old...

It still works. I didn't bother updating the ES3 parser for ES5 because I had already explored the same syntax as part of ES4 and it worked. I'm going to update it for ES.next.

The code does more than just validate the lexical and syntactic grammars. It can encode semantics and evaluate ECMAScript expressions directly from the semantics. It also validates, for example, that there is no context in the grammar where a / could be both a division symbol and the start of a regular expression.

Thanks, sounds like a great tool. Do you build an ast along the lines of the parser API? developer.mozilla.org/en/SpiderMonkey/Parser_API

The grammar is the AST. You attach semantic rules directly to the grammar productions.

 Waldemar
# Claus Reinke (14 years ago)

Hi Claus. Yes I would be very interested in your extensions. Thanks for the pointers below.

My code is now available on github (JS parser combinators and a grammar edging closer to ES5, an AST with unparsing support; an experiment beyond ES5 is included as an example, by translating new syntax for paren-free function calls/definitions to old syntax):

https://github.com/clausreinke/jstr

Please consider this a semi-public preview, there is still much to do and many of the comments are notes to self rather than user docs.

I took the liberty of adding most of a SpiderMonkey-style AST (there are still some unwanted deviations, as well as a few items needing clarification, so you won't be able to port your code just yet). It turns out that this AST spec is biased towards evaluation, so I needed to augment it to support proper unparsing.
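(A typical instance of that bias — "raw" is just my name for the extra field:)

```js
var evalView    = { type: "Literal", value: 255 };               // enough to evaluate
var unparseView = { type: "Literal", value: 255, raw: "0xff" };  // keeps the spelling
```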

As I said, I've got a long and growing TODO list, but if there are any particular improvements/fixes that would help you or others here, knowing about them would influence my priorities.

Thanks for your interest, Claus

# Breton Slivka (14 years ago)

(and I was doing so well with picking the right "from" email address)

On Mon, May 30, 2011 at 5:55 AM, Brendan Eich <brendan at mozilla.com> wrote:

The validation we'd need would be that both grammars produce sentences only and exactly in the same language. /be

Relating to my own grammar experiments, what would be sufficient validation that a language with an experimental grammar is "compatible" with the existing language? Passing all the V8/SpiderMonkey tests? Or something more akin to a mathematical proof of correctness?

# Brendan Eich (14 years ago)

Proving two CFGs generate the same language is in general undecidable.

Refactoring the spec's LR(1) grammar (carefully, showing your work) to an LL(1) grammar with ordered choice, plus testing your work, would be ok by me.
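For the testing half, even a crude differential run over a corpus catches a lot. A sketch, where parseLR1 and parsePEG are placeholders for the two implementations being compared:

```js
var fs = require("fs");

// Does the given parser accept this source text?
function accepts(parse, src) {
  try { parse(src); return true; } catch (e) { return false; }
}

// Report every corpus file on which the two parsers disagree.
function diffCorpus(files) {
  files.forEach(function (file) {
    var src = fs.readFileSync(file, "utf8");
    if (accepts(parseLR1, src) !== accepts(parsePEG, src)) {
      console.log("parsers disagree on " + file);
    }
  });
}
```

That only checks acceptance over the corpus, of course; exact language equality is the undecidable part.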