LR(1) grammar/parser and lookahead-restrictions

# Michael Dyck (6 months ago)

In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:

  • modify the grammar (before the parser is constructed),
  • modify the construction of the parser, and/or
  • modify the operation of the parser. It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

# Dmitry Soshnikov (6 months ago)

One yet has to check the pure ES6+ grammar to be LR(1) compatible (the Cover grammar with nodes reinterpretations makes it harder), but for a very practical use-case, e.g. combined with Flow, it'll not be easy to have an automated parser (since Flow's parser uses backtracking for some edge cases which is already not LR-parser).

I have some subset of ES6 working as an example language for Syntax tool[1] (since I just took a bunch of productions from JS), but I haven't tried the full ES6 grammar (not having time to work on this project, since it'll be not a straightforward process of just matching the grammar 1:1 to the parser generator -- based on the above reasons).

Long time ago I also had a brief discussion with authors of another popular parser generators for JS. Both didn't try it as well, but mentioned that even for ES5 it was hard to parse e.g. regexp, etc.

Someone (with enough time for the project) should just sit, and spend some time actually porting the grammar from the spec on some powerful parsers generator.

[1] Syntax: DmitrySoshnikov/syntax [2] twitter.com/DmitrySoshnikov/status/666327209959751680

# Isiah Meadows (6 months ago)

Heuristically, I doubt it's even context-free at this point, considering the concept and widespread prevalence of early errors now. I suspect it's mildly context-sensitive (maybe tree-adjoining?), but I'm no formal language expert here.

# Michael Dyck (6 months ago)

On 17-01-11 09:09 PM, Isiah Meadows wrote:

Heuristically, I doubt it's even context-free at this point, considering the concept and widespread prevalence of early errors now. I suspect it's mildly context-sensitive (maybe tree-adjoining?), but I'm no formal language expert here.

I think the claim of/desire for LR(1)-ness only applies to the language defined by the BNF-ish rules, i.e. ignoring the effects of prose-y things like early errors and ASI. Anyhow, that's the language I'm asking about.

# Dmitry Soshnikov (6 months ago)

Well, it's still context-free (contex-sensitive may have terminals on LHS too, JS cannot). It requires extra cover grammar, and additional static semantics, but still potentially can be parsed by a LALR/CLR parser - yes, you'll have to do parsed nodes reinterpretation in semantics handlers (e.g. validate and transform parsed ObjectLiteral into an ObjectPattern when parsing destructuring), but should be doable in general. It won't be doable if lookaheafs won't be enough, and only a backtracking can recover (like the cases with Flow parser).

Dmitry

# Isiah Meadows (6 months ago)

Okay. I see now. How do early errors factor into this?

# Dmitry Soshnikov (6 months ago)

The "early errors" are just parser errors which are enforced not by grammar rules, but by additional validation algorithms which follow the parsed node. E.g. eval cannot be assigned in strict mode: it's not a grammar issue (grammar allows eval on LHS), and it's not a runtime issue, it's an "early error" enforced by static semantics, which still issues it as a ParseError. Example in handling it in a semantic action for a production (on a simplified AssignmentExpression):

%%

AssignmentExpression
  : IDENTIFIER '=' Expression {
      const id = $1;

      // Static semantics: early parser error:

      if (mode.isStrict && (id === 'eval' || id == 'arguments')) {
        throw new ParseError(
          '`eval` and `arguments` cannot be assigned in strict mode.'
        );
      }

      return {
        type: 'AssigmentExpression',
        left: $1,
        right: $3,
      };
    }

Similarly other "early errors".

BTW, here is the example I mentioned above of how Flow has to do backtracking in parsing in some edge cases, making the grammar completely not LR(1) anymore: babel/babel#1991

Dmitry

# Isiah Meadows (6 months ago)

Okay. I see now. Lots of backtracking in the case of arrow functions and ternary expressions vs type annotations.

# Waldemar Horwat (6 months ago)

On 01/11/2017 10:28, Michael Dyck wrote:

In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:

  • modify the grammar (before the parser is constructed),
  • modify the construction of the parser, and/or
  • modify the operation of the parser. It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

I'm the source of the LR(1) condition. I've been doing that ever since ECMAScript Edition 3, and in fact am using the LR parser to help design and debug the spec. I have an implementation of the parser with a few extensions to the LR grammar, including support for parametrized productions, lookahead restrictions, and lexer-parser feedback used to disambiguate things such as what token / will start. I validate various things such as making sure that there is no place where both an identifier and a regular expression can occur.

 Waldemar
# Dmitry Soshnikov (6 months ago)

On Mon, Jan 23, 2017 at 5:24 PM, Waldemar Horwat <waldemar at google.com>

wrote:

On 01/11/2017 10:28, Michael Dyck wrote:

In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:

  • modify the grammar (before the parser is constructed),
  • modify the construction of the parser, and/or
  • modify the operation of the parser. It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

I'm the source of the LR(1) condition. I've been doing that ever since ECMAScript Edition 3, and in fact am using the LR parser to help design and debug the spec. I have an implementation of the parser with a few extensions to the LR grammar, including support for parametrized productions, lookahead restrictions, and lexer-parser feedback used to disambiguate things such as what token / will start. I validate various things such as making sure that there is no place where both an identifier and a regular expression can occur.

Thanks for the confirmation.

lookahead restrictions

Yeah, all the lookahead restrictions, such as lookahead ∉ {{, function} add a lot of extra "No<Lookahead>" productions, which makes the grammar (and hence the parsing table) bigger unfortunately. Example is e.g. ExpressionNoIn, VariableDeclarationNoIn from the spec, and similar for ...NoLeftCurrly, and ...NoFunction to support those lookahead restrictions, and allow ExpressionStatement to be parsed without "reduce/reduce" conflicts.

That's why in the mentioned "sub-set" of the ECMAScript example language [1] I chose not to use the same syntax for function expression and function declaration: fn foo() {} for declaration, and just an arrow function for function-expressions :) -- no ambiguity, no lookahead restrictions, hence no larger grammar, hence a faster parser.

lexer-parser feedback

Yes, having a lexer/parser state solves it more elegantly. Also, if a parser tool allows semantic action not only at the end (i.e on reduce), but for each symbol on RHS (i.e. "S-attributed" definition, and not "L-attributed"), handling the lookahead ∉ {{, function} could be even easier: on matching the { from the statement position, one can track the parser state that the block should be parsed, and not an object literal. This probably can be solved with lexer state too -- to avoid again a bunch of extra "No<lookahead>" productions. And such cases are simpler in manual

recursive decent.

[1] Example lang based on subset of ECMAScript grammar: DmitrySoshnikov/syntax/blob/master/examples/lang.bnf

Dmitry

# Michael Dyck (6 months ago)

On 17-01-23 08:24 PM, Waldemar Horwat wrote:

On 01/11/2017 10:28, Michael Dyck wrote:

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:

  • modify the grammar (before the parser is constructed),
  • modify the construction of the parser, and/or
  • modify the operation of the parser. It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

I have an implementation of the parser with a few extensions to the LR grammar, including support for ... lookahead restrictions, ...

Great! So how do you handle lookahead restrictions?

# Dmitry Soshnikov (6 months ago)

On Tue, Jan 24, 2017 at 8:23 AM, Michael Dyck <jmdyck at ibiblio.org> wrote:

On 17-01-23 08:24 PM, Waldemar Horwat wrote:

On 01/11/2017 10:28, Michael Dyck wrote:

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:

  • modify the grammar (before the parser is constructed),
  • modify the construction of the parser, and/or
  • modify the operation of the parser. It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

I have an implementation of the parser with a few extensions to the LR grammar, including support for ... lookahead restrictions, ...

Great! So how do you handle lookahead restrictions?

As mentioned above, there are two ways to handle lookahead restrictions in LR-parser:

  1. Using "No<Lookahead>" productions. This unfortunately may double number

of productions in some sub-grammar. E.g. to handle Block vs. ObjectLiteral in ECMAScript, one needs to double all the expression productions to correctly parse ExpressionStatement. As an example in the spec itself, take a look e.g. at RelationalExpression and RelationalExpressionNoIn, and how it repeats (doubles) all the productions but with "NoIn" suffix es5.github.io/#x11.8

  1. Another approach is to use Cover grammar. It's when you parse a node with two possible ways, and enforce (reinterpret) needed sub-nodes in the cover-grammar post-processor (after having parsed a generic node). E.g. I chose this approach to parse generically ObjectLiteral, and Block to handle { lookahead restriction for ExpressionStatement. This approach allows not to double grammar (hence making parsing table smaller), however adds some post-processing overhead. DmitrySoshnikov/syntax/blob/master/examples/lang.bnf#L206-L222

P.S.: there is actually option #3 -- to use "S-attributed" parser actions, it's when you may execute handling action not only at the end of production (i.e. not only on "reduce"), but after each element on RHS:

statement
options
{
k = 1 ;
}
: { input.LA(1) == LBRACE }? block
| statementTail
;
statementTail
: variableStatement
| emptyStatement
| expressionStatement
        ...

This option is probably the best alternative.

Dmitry

# Michael Dyck (6 months ago)

On 17-01-30 12:15 AM, Dmitry Soshnikov wrote:

As mentioned above, there are two ways to handle lookahead restrictions in LR-parser:

  1. Using "No<Lookahead>" productions. This unfortunately may double number of productions in some sub-grammar. E.g. to handle Block vs. ObjectLiteral in ECMAScript, one needs to double all the expression productions to correctly parse ExpressionStatement. As an example in the spec itself, take a look e.g. at RelationalExpression and RelationalExpressionNoIn, and how it repeats (doubles) all the productions but with "NoIn" suffix es5.github.io/#x11.8

So, e.g., one could take the production IterationStatement : ... [lookahead != 'let'] LHSExpr ... and transform it into IterationStatement : ... LHSExpr-that-does-not-start-with-let ... and then generate productions for the new nonterminal, e.g. LHSExpr-that-does-not-start-with-let : New-Expression-that-does-not-start-with-let Call-Expression-that-does-not-start-with-let etc, (eventually bottoming out when a RHS of the original grammar will either always or never start with 'let').

The problem with this approach is that, in general, a lookahead-restriction is not a restriction on the nonterminal that immediately follows it in the production, it's a restriction on the next few input elements, which might be generated by more than just that nonterminal. I'm not saying this is insurmountable, but it does seem to get messy fairly quickly.

And it's unclear how this approach would handle a lookahead-restriction at the end of a production.

  1. Another approach is to use Cover grammar.

Is there a way to automatically (a) construct the cover grammar and (b) reconstruct the parse tree as if parsed by the original grammar, or do you code it manually?

P.S.: there is actually option #3 -- to use "S-attributed" parser actions, it's when you may execute handling action not only at the end of production (i.e. not only on "reduce"), but after each element on RHS:

I'm doubtful this would help. I wouldn't expect the LR-parser-generator to 'understand' the content of actions, so it can't use them to alter the generation of the parser, and so the LR automaton will have lots of conflicts (because it can't 'separate' the alternatives that the lookahead-restrictions are there to separate).

# Dmitry Soshnikov (6 months ago)

On Mon, Jan 30, 2017 at 9:26 AM, Michael Dyck <jmdyck at ibiblio.org> wrote:

On 17-01-30 12:15 AM, Dmitry Soshnikov wrote:

As mentioned above, there are two ways to handle lookahead restrictions in LR-parser:

  1. Using "No<Lookahead>" productions. This unfortunately may double number of productions in some sub-grammar. E.g. to handle Block vs. ObjectLiteral in ECMAScript, one needs to double all the expression productions to correctly parse ExpressionStatement. As an example in the spec itself, take a look e.g. at RelationalExpression and RelationalExpressionNoIn, and how it repeats (doubles) all the productions but with "NoIn" suffix es5.github.io/#x11.8

So, e.g., one could take the production IterationStatement : ... [lookahead != 'let'] LHSExpr ... and transform it into IterationStatement : ... LHSExpr-that-does-not-start-with-let ... and then generate productions for the new nonterminal, e.g. LHSExpr-that-does-not-start-with-let : New-Expression-that-does-not-start-with-let Call-Expression-that-does-not-start-with-let etc, (eventually bottoming out when a RHS of the original grammar will either always or never start with 'let').

Correct, and this is how ES spec itself handles all those "No<lookahead>"

duplicated productions.

The problem with this approach is that, in general, a lookahead-restriction is not a restriction on the nonterminal that immediately follows it in the production, it's a restriction on the next few input elements, which might be generated by more than just that nonterminal.

Well, in this case it's not a "lookahead" anymore. You could have k=2,3,...N for lookahead in parser generators, but it's not like parse the whole non-terminal, and then see a lookahead after it. This "parse whole non-terminal till the end" might be parse the whole program. In this case a cover grammar is simpler, and that's why it's used starting since ES6 spec. Take a look e.g. at CoverParenthesizedExpressionAndArrowParameterList from ES2015.

A simplified example: DmitrySoshnikov/syntax/blob/master/examples/lang.bnf#L514-L524

By just looking at:

(x, y)

you can't tell whether it's a grouping operator (SequenceExpression node), or a beginning of a lambda function (i.e. its parameters node). So in the generic ParenthisizedExpression production, you reinterpret it either to lambda parameters, or keep just a grouping operator, if there is no LambdaTail. Here it's a function of course, though for this you have to parse (x, y) not as LambdaParameters, and not as SequenceExpression, but as a generic ParenthisizedExpression, and using Cover grammar:

(x, y) -> { return x + y; }

I'm not saying this is insurmountable, but it does seem to get messy fairly quickly.

And it's unclear how this approach would handle a lookahead-restriction at the end of a production.

Yes, it adds a bunch of duplicated productions to the grammar, which on practice may increases number of parsing states simnifically. As a real practical example: ES5 spec in LALR(1) mode would increase from ~350 states to 470.

  1. Another approach is to use Cover grammar.

Is there a way to automatically (a) construct the cover grammar and (b) reconstruct the parse tree as if parsed by the original grammar, or do you code it manually?

Usually it's a manual process of designing a grammar, and defining post-processing static semantics. Take a look again in the example language I gave above, it have couple of case of cover grammar.

P.S.: there is actually option #3 -- to use "S-attributed" parser actions,

it's when you may execute handling action not only at the end of production (i.e. not only on "reduce"), but after each element on RHS:

I'm doubtful this would help. I wouldn't expect the LR-parser-generator to 'understand' the content of actions, so it can't use them to alter the generation of the parser, and so the LR automaton will have lots of conflicts (because it can't 'separate' the alternatives that the lookahead-restrictions are there to separate).

Oh, it'll work for sure, depending on how powerful a parser generator is. Yes, technically it'll be a "reduce-reduce" conflict in the table, but parser will never enter this state based on the semantic action executed before entering some non-terminal (not after). In this "before" action you can communicated to lexer, and check the lookahead.

Dmitry

# Michael Dyck (6 months ago)

On 17-01-30 03:00 PM, Dmitry Soshnikov wrote:

On Mon, Jan 30, 2017 at 9:26 AM, Michael Dyck <jmdyck at ibiblio.org <mailto:jmdyck at ibiblio.org>> wrote:

On 17-01-30 12:15 AM, Dmitry Soshnikov wrote:


    As mentioned above, there are two ways to handle lookahead
    restrictions in
    LR-parser:

    1. Using "No<Lookahead>" productions.

So, e.g., one could take the production
    IterationStatement : ...  [lookahead != 'let'] LHSExpr ...
and transform it into
    IterationStatement : ... LHSExpr-that-does-not-start-with-let ...
and then generate productions for the new nonterminal, e.g.
    LHSExpr-that-does-not-start-with-let :
        New-Expression-that-does-not-start-with-let
        Call-Expression-that-does-not-start-with-let
etc, (eventually bottoming out when a RHS of the original grammar will
either always or never start with 'let').

Correct, and this is how ES spec itself handles all those "No<lookahead>" duplicated productions.

If you're talking about the 'NoIn' productions in ES5, that seems only marginally relevant, since they don't address a lookahead problem (i.e., a problem that could have been solved via a lookahead-restriction).

The problem with this approach is that, in general, a
lookahead-restriction is not a restriction on the nonterminal that
immediately follows it in the production, it's a restriction on the next
few input elements, which might be generated by more than just that
nonterminal.

Well, in this case it's not a "lookahead" anymore.

It isn't a "lookahead" in the resulting grammar, but it is in the original, and the two grammars should define the same language. What I'm saying is that it's incorrect to assume that the effect of the lookahead (in the original) can be confined to the nonterminal that immediately follows, which is what the suggested transformation does.

You could have k=2,3,...N for lookahead in parser generators, but it's not like parse the whole non-terminal, and then see a lookahead after it. This "parse whole non-terminal till the end" might be parse the whole program.

I don't really follow what you're suggesting there.

In this case a cover grammar is simpler, and that's why it's used starting since ES6 spec.

But note that none of the ES5 lookahead-restrictions was replaced with a cover grammar in ES6. They're all still lookahead-restrictions.

I'm not saying this is insurmountable, but it does seem to get messy
fairly quickly.

And it's unclear how this approach would handle a lookahead-restriction
at the end of a production.

Yes, it adds a bunch of duplicated productions to the grammar, which on practice may increases number of parsing states simnifically. As a real practical example: ES5 spec in LALR(1) mode would increase from ~350 states to 470.

By "messy", I wasn't talking about the duplicated productions, but rather the process of generating them.

    2. Another approach is to use Cover grammar.


Is there a way to automatically (a) construct the cover grammar and (b)
reconstruct the parse tree as if parsed by the original grammar, or do
you code it manually?

Usually it's a manual process of designing a grammar, and defining post-processing static semantics.

Okay, I'm not really interested then. I'm looking for an algorithm (i.e., a modification of the LR algorithm) that starts with a grammar-with-lookahead-restrictions and ends with an LR parser.

    P.S.: there is actually option #3 -- to use "S-attributed" parser
    actions,
    it's when you may execute handling action not only at the end of
    production
    (i.e. not only on "reduce"), but after each element on RHS:


I'm doubtful this would help. I wouldn't expect the LR-parser-generator
to 'understand' the content of actions, so it can't use them to alter
the generation of the parser, and so the LR automaton will have lots of
conflicts (because it can't 'separate' the alternatives that the
lookahead-restrictions are there to separate).

Oh, it'll work for sure, depending on how powerful a parser generator is. Yes, technically it'll be a "reduce-reduce" conflict in the table,

So when your generated parser has conflicts, is it easy to tell which you don't have to worry about and which you do?

but parser will never enter this state based on the semantic action executed before entering some non-terminal (not after). In this "before" action you can communicated to lexer, and check the lookahead.

It seems to me that the parser will enter such a state, but always via only one of the conflicting 'paths' (the other(s) being prevented at parse-time by the semantic actions). So only one of the conflicting actions will be valid, but you've still got to figure out which one that is.

# Dmitry Soshnikov (6 months ago)

On Mon, Jan 30, 2017 at 2:22 PM, Michael Dyck <jmdyck at ibiblio.org> wrote:

On 17-01-30 03:00 PM, Dmitry Soshnikov wrote:

On Mon, Jan 30, 2017 at 9:26 AM, Michael Dyck <jmdyck at ibiblio.org <mailto:jmdyck at ibiblio.org>> wrote:

On 17-01-30 12:15 AM, Dmitry Soshnikov wrote:


    As mentioned above, there are two ways to handle lookahead
    restrictions in
    LR-parser:

    1. Using "No<Lookahead>" productions.

So, e.g., one could take the production
    IterationStatement : ...  [lookahead != 'let'] LHSExpr ...
and transform it into
    IterationStatement : ... LHSExpr-that-does-not-start-with-let ...
and then generate productions for the new nonterminal, e.g.
    LHSExpr-that-does-not-start-with-let :
        New-Expression-that-does-not-start-with-let
        Call-Expression-that-does-not-start-with-let
etc, (eventually bottoming out when a RHS of the original grammar will
either always or never start with 'let').

Correct, and this is how ES spec itself handles all those "No<lookahead>" duplicated productions.

If you're talking about the 'NoIn' productions in ES5, that seems only marginally relevant, since they don't address a lookahead problem (i.e., a problem that could have been solved via a lookahead-restriction).

The problem with this approach is that, in general, a
lookahead-restriction is not a restriction on the nonterminal that
immediately follows it in the production, it's a restriction on the

next few input elements, which might be generated by more than just that nonterminal.

Well, in this case it's not a "lookahead" anymore.

It isn't a "lookahead" in the resulting grammar, but it is in the original, and the two grammars should define the same language. What I'm saying is that it's incorrect to assume that the effect of the lookahead (in the original) can be confined to the nonterminal that immediately follows, which is what the suggested transformation does.

A "lookahead" is a terminal token. A non-terminal is not a "lookahead".

To reiterate, imaging you have:

• (x, y)

and

• (x, y) -> { return x + y; }

where is a cursor position. Your lookaheads are "(", "x", "," "y", etc. Usually 1 lookahead token is enough to route the parser, in this case "(".

What you're talking that you need not these lookaheads (at the beginning of the cursor), but after the parsed non-terminal (i.e. after fully parsed (x, y)), i.e. you're looking for "->", which goes at the end of the

non-terminal. This is not possible, since it's not a lookahead at that specific cursor position. It will be a lookahead after you have parsed the non-terminal corresponding to "(x, y)".

So you can't have two non-terminals for this, since it'll be a "reduce/reduce" conflict, and you need a cover grammar here.

You could have k=2,3,...N for lookahead in parser generators, but it's

not like parse the whole non-terminal, and then see a lookahead after it. This "parse whole non-terminal till the end" might be parse the whole program.

I don't really follow what you're suggesting there.

Like mentioned above:

• (x, y) -> { return x + y; }

In order to determine whether it's a grouping operator, or arrow function params, it's not enough analyzing a "lookahead" being that the cursor position. You need to actually fully parse this non-terminal, and advance to this position:

(x, y) • -> { return x + y; }

and from there you may already analyze a real lookahead. However, the problem is the "(x, y)" might the full program (imagine million of entries on several GBs file source code -- what kind of "lookahead" can you analyze here? :))

In this case a cover grammar is simpler, and that's why it's used

starting since ES6 spec.

But note that none of the ES5 lookahead-restrictions was replaced with a cover grammar in ES6. They're all still lookahead-restrictions.

Sure, because it's normally possible to do with the "No<Lookahead>" doubled

productions. So this actually answers your initial question (it's a real practice).

I'm not saying this is insurmountable, but it does seem to get messy
fairly quickly.

And it's unclear how this approach would handle a

lookahead-restriction at the end of a production.

Yes, it adds a bunch of duplicated productions to the grammar, which on practice may increases number of parsing states simnifically. As a real practical example: ES5 spec in LALR(1) mode would increase from ~350 states to 470.

By "messy", I wasn't talking about the duplicated productions, but rather the process of generating them.

    2. Another approach is to use Cover grammar.
Is there a way to automatically (a) construct the cover grammar and

(b) reconstruct the parse tree as if parsed by the original grammar, or do you code it manually?

Usually it's a manual process of designing a grammar, and defining post-processing static semantics.

Okay, I'm not really interested then. I'm looking for an algorithm (i.e., a modification of the LR algorithm) that starts with a grammar-with-lookahead-restrictions and ends with an LR parser.

OK, yeah, at this point I think we already clarified that it is possible to parse ES grammar, and handle it's lookahead restrictions with several techniques described. If you're interested in generic LR parsing, feel free to contact me (we can omit generic off-topic discussions on this list).

Dmitry

# Michael Dyck (6 months ago)

On 17-01-30 06:20 PM, Dmitry Soshnikov wrote:

On Mon, Jan 30, 2017 at 2:22 PM, Michael Dyck <jmdyck at ibiblio.org <mailto:jmdyck at ibiblio.org>> wrote:

            1. Using "No<Lookahead>" productions.

        The problem with this approach is that, in general, a
        lookahead-restriction is not a restriction on the nonterminal that
        immediately follows it in the production, it's a restriction on
        the next few input elements, which might be generated by more
        than just that nonterminal.

    Well, in this case it's not a "lookahead" anymore.

It isn't a "lookahead" in the resulting grammar, but it is in the
original, and the two grammars should define the same language. What I'm
saying is that it's incorrect to assume that the effect of the lookahead
(in the original) can be confined to the nonterminal that immediately
follows, which is what the suggested transformation does.

A "lookahead" is a terminal token. A non-terminal is not a "lookahead".

Okay, so then in my previous statement, read "lookahead" as "lookahead-restriction". (But I did say "lookahead-restriction" in the statement before that.)

To reiterate, imaging you have:

• (x, y)

and

• (x, y) -> { return x + y; }

where is a cursor position. Your lookaheads are "(", "x", "," "y", etc. Usually 1 lookahead token is enough to route the parser, in this case "(".

What you're talking that you need not these lookaheads (at the beginning of the cursor), but after the parsed non-terminal (i.e. after fully parsed (x, y)), i.e. you're looking for "->", which goes at the end of the non-terminal.

I'm not saying I don't need the lookahead tokens right after the cursor. But I am saying that the lookahead tokens that the hypothetical parser needs to examine (in order to enforce a lookahead-restriction), even if it's only 1 or 2, might extend past the 'next' nonterminal.

But that's an awkward way to express my objection.

To recap, I'm talking about the suggestion to replace: [lookahead != <some sequence of terminals>] Nonterminal

in a production in the original grammar with Nonterminal-that-does-not-start-with-that-sequence (and a bunch of new productions) in the transformed grammar, and whether that's a valid transformation.

I'm saying it isn't valid when the Nonterminal can derive phrases with fewer terminals than the prohibited lookahead-sequence. (E.g., if Nonterminal is nullable, or if it can derive a phrase of length 1 when the prohibited lookahead-sequence is length 2.)

This is not possible, since it's not a lookahead at that specific cursor position. It will be a lookahead after you have parsed the non-terminal corresponding to "(x, y)".

So you can't have two non-terminals for this, since it'll be a "reduce/reduce" conflict, and you need a cover grammar here.

Well, I'm saying you'd run into problems before you even tried to construct the LR automaton, but I think you're agreeing with me that this approach is not generally applicable?

    In this case a cover grammar is simpler, and that's why it's used
    starting since ES6 spec.

But note that none of the ES5 lookahead-restrictions was replaced with a
cover grammar in ES6. They're all still lookahead-restrictions.

Sure, because it's normally possible to do with the "No<Lookahead>" doubled productions.

But they weren't replaced with "No" productions (or grammatical parameters) either! All the lookahead-restrictions in ES5 are still lookahead-restrictions in the current draft.

So this actually answers your initial question (it's a real practice).

No, I don't think it does.


            2. Another approach is to use Cover grammar.

    Usually it's a manual process of designing a grammar, and defining
    post-processing static semantics.

Okay, I'm not really interested then. I'm looking for an algorithm
(i.e., a modification of the LR algorithm) that starts with a
grammar-with-lookahead-restrictions and ends with an LR parser.

OK, yeah, at this point I think we already clarified that it is possible to parse ES grammar, and handle it's lookahead restrictions with several techniques described.

Well, ES implementations exist, so clearly there are techniques for handling those aspects of the ES grammar. But that doesn't mean that any of them are parsing according to the ES grammar. E.g., it's unlikely that they're finding exactly the parse tree indicated by the ES grammar. (Nor is there any requirement that they do so.)


By way of contrast, consider "opt" subscripts and grammatical parameters. If you look up the LR parser-construction algorithm, it doesn't talk about how to handle such things, but the spec says how to transform them into plain context-free productions, which the LR algorithm can handle. So it's an easy shorthand to say that the ES grammar is LR(1) iff the 'expanded' grammar is LR(1).

But the spec doesn't give that kind of CFG-equivalence for lookahead-restrictions. Until there's agreement about what they mean in a CFG or LR sense, it isn't really meaningful to ask if the ES grammar is LR(1).

# Dmitry Soshnikov (6 months ago)

The spec is a formal language definition (including for syntactic grammar). It doesn't talk about specifics of implementation much, nor it's supposed to do so (so demanding this from spec isn't appropriate).

It doesn't tell one to implement an LR(1), neither it describes the techniques how one going to do so (even though as was confirmed in this thread it's LR(1)-parsable). It just uses some formal notation for this, e.g. lookahead ∉ {{, function}. Specific techniques how exactly you going to achieve this, is not the purpose of the specification.

(but it's possible to clarify such techniques on discussions which we have now -- please note though, the techniques described above are not an "assumption", or a "guess", they are actual techniques used on practice to achieve this in LR(1)).

The spec in definitions is not consistent though. In one place it uses explicit technique for lookahead restriction via the "No<lookahead>"

productions. In some cases it just uses formal description as lookahead ∉ {{, function}, leaving a specific technique to achieve this up to implementations.

In fact (for consistency), the rule from ES5:

for ( ExpressionNoIn ; Expression ; Expression ) Statement

could be described as:

for ( `lookahead ∉ {in}` Expression ; Expression ; Expression ) Statement

and would mean the same.

You may search for ES5 LR(1) construction (there are several over internets), they use the described above techniques for lookahead ∉ {{, function} (adding NoCurlyBrace, or NoFunction productions). That's also said, it could be rewritten using Cover grammar which adds though post-processing overhead -- that might be the reason why some of the No<lookahead> productions were removed from ES6 spec, but lookahead ∉ {{, function} is still there. More complex cases like Arrow functions params is still implemented as a Cover grammar.

Dmitry

# Michael Dyck (6 months ago)

On 17-01-31 02:17 PM, Dmitry Soshnikov wrote:

The spec is a formal language definition (including for syntactic grammar).

(Well, the extent to which the spec is formal is debatable. The grammar is the most formal part of it, but that's true of most programming language definitions.)

It doesn't talk about specifics of implementation much, nor it's supposed to do so (so demanding this from spec isn't appropriate).

Right (and I'm pretty sure I didn't make such a demand).

It doesn't tell one to implement an LR(1),

Nor should it.

neither it describes the techniques how one going to do so (even though as was confirmed in this thread it's LR(1)-parsable). It just uses some formal notation for this, e.g. lookahead ∉ {{, function}. Specific techniques how exactly you going to achieve this, is not the purpose of the specification.

Right.

(but it's possible to clarify such techniques on discussions which we have now -- please note though, the techniques described above are not an "assumption", or a "guess", they are actual techniques used on practice to achieve this in LR(1)).

I'm not disputing the actualness of the techniques. I'm saying:

  • technique #1 (duplicated productions): As stated, looks insufficient for the ES grammar. Which may just mean that there's a more complicated version of the approach that is sufficient for ES.

  • technique #2 (cover grammar): Can you algorithmically (a) construct the cover grammar and (b) recover the parse according to the original grammar?

  • technique #3 (parse-time actions): In the LR automaton, can you algorithmically distinguish action-avoided conflicts from 'true' conflicts?

The spec in definitions is not consistent though. In one place it uses explicit technique for lookahead restriction via the "No<lookahead>" productions. In some cases it just uses formal description as lookahead ∉ {{, function}, leaving a specific technique to achieve this up to implementations.

In fact (for consistency), the rule from ES5:

for ( ExpressionNoIn ; Expression ; Expression ) Statement

could be described as:

for ( `lookahead ∉ {in}` Expression ; Expression ; Expression ) Statement

and would mean the same.

No, they wouldn't. ExpressionNoIn (or Expression[~In]) is a version of Expression that doesn't ("openly") use the 'in' operator (at the RelationalExpression level), i.e. it doesn't have the 'in' operator somewhere in the middle, whereas [lookahead ∉ {in}] Expression prohibits an Expression that starts with the token 'in'. You can't express the 'NoIn' restriction with any finite amount of lookahead, because a RelationalExpression can be arbitrarily large.

You may search for ES5 LR(1) construction (there are several over internets), they use the described above techniques for lookahead ∉ {{, function} (adding NoCurlyBrace, or NoFunction productions).

But the chances are low that they've been algorithmically generated from the ES spec. And while it's conceivable I could figure out how to do it algorithmically by examining examples of it done manually, I'm doubtful of that too.

That's also said, it could be rewritten using Cover grammar which adds though post-processing overhead -- that might be the reason why some of the No<lookahead> productions were removed from ES6 spec,

What removed productions are you thinking of? ES5's "NoIn" productions became ES6+'s "[In]" parameter (which is basically the same thing with a more compact notation). And ES5's "NoDash" productions are still in the current draft.

# Dmitry Soshnikov (6 months ago)

On Wed, Feb 1, 2017 at 7:37 AM, Michael Dyck <jmdyck at ibiblio.org> wrote:

On 17-01-31 02:17 PM, Dmitry Soshnikov wrote:

The spec is a formal language definition (including for syntactic grammar).

(Well, the extent to which the spec is formal is debatable. The grammar is the most formal part of it, but that's true of most programming language definitions.)

It doesn't talk about specifics of implementation much, nor it's supposed

to do so (so demanding this from spec isn't appropriate).

Right (and I'm pretty sure I didn't make such a demand).

It doesn't tell one to implement an LR(1),

Nor should it.

neither it describes the

techniques how one going to do so (even though as was confirmed in this thread it's LR(1)-parsable). It just uses some formal notation for this, e.g. lookahead ∉ {{, function}. Specific techniques how exactly you going to achieve this, is not the purpose of the specification.

Right.

OK, I mostly meant that a spec can be fully abstract, even in abstract definitions themselves, and use different notations instead of providing fully "CFG-equivalence for lookahead restrictions".

(but it's possible to clarify such techniques on discussions which we have

now -- please note though, the techniques described above are not an "assumption", or a "guess", they are actual techniques used on practice to achieve this in LR(1)).

I'm not disputing the actualness of the techniques. I'm saying:

  • technique #1 (duplicated productions): As stated, looks insufficient for the ES grammar. Which may just mean that there's a more complicated version of the approach that is sufficient for ES.

Here's an example: raw.githubusercontent.com/zaach/jison/master/examples/jscore.jison (see how lookahead restriction for { in ExpressionStatement is implemented using NoBF productions).

  • technique #2 (cover grammar): Can you algorithmically (a) construct the cover grammar and (b) recover the parse according to the original grammar?

And this approach I chose (instead of duplicating productions) to handle the same { lookahead restriction, what this comment explains: DmitrySoshnikov/syntax/blob/master/examples/lang.bnf#L206-L222

As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).

  • technique #3 (parse-time actions): In the LR automaton, can you algorithmically distinguish action-avoided conflicts from 'true' conflicts?

The spec in definitions is not consistent though. In one place it uses

explicit technique for lookahead restriction via the "No<lookahead>" productions. In some cases it just uses formal description as lookahead ∉ {{, function}, leaving a specific technique to achieve this up to implementations.

In fact (for consistency), the rule from ES5:

for ( ExpressionNoIn ; Expression ; Expression ) Statement

could be described as:

for ( `lookahead ∉ {in}` Expression ; Expression ; Expression ) Statement

and would mean the same.

No, they wouldn't. ExpressionNoIn (or Expression[~In]) is a version of Expression that doesn't ("openly") use the 'in' operator (at the RelationalExpression level), i.e. it doesn't have the 'in' operator somewhere in the middle, whereas [lookahead ∉ {in}] Expression prohibits an Expression that starts with the token 'in'. You can't express the 'NoIn' restriction with any finite amount of lookahead, because a RelationalExpression can be arbitrarily large.

I agree, the example not the best, I was leaning to, that a spec could use even more abstract definition for this instead of explicit productions.

You may search for ES5 LR(1) construction (there are several over

internets), they use the described above techniques for lookahead ∉ {{, function} (adding NoCurlyBrace, or NoFunction productions).

But the chances are low that they've been algorithmically generated from the ES spec. And while it's conceivable I could figure out how to do it algorithmically by examining examples of it done manually, I'm doubtful of that too.

I'm not sure about "algorithmic" transformations, but yeah, how the grammar is described in the spec, and which extra transformation you have to do in the resulting LR(1), might be different things. Does it mean the spec doesn't define an LR(1) grammar? Perhaps, but it doesn't really matter, since again the spec uses formal notation, and for shortness may omit actual techniques.

That's also said, it could be rewritten using Cover grammar which adds

though post-processing overhead -- that might be the reason why some of the No<lookahead> productions were removed from ES6 spec,

Yeah, seems most of them were replaced with inverted notation (using [In] suffix) -- still leaving the interpretation of this abstract notation up to implementations.

P.S:

All this reduces to that practically today it might be easier to do just a manual recursive decent/accent. Yes, it's a lot of recursion, yes, it might be slower than table-driven parser. But gives you much control, and avoid all those described techniques (hence a manual recursive descent is chosen on practice by most of the implementations today). Or, the syntax of a language could be different itself, and avoid ambiguities.

Dmitry

# Waldemar Horwat (6 months ago)

On 02/01/2017 10:25, Dmitry Soshnikov wrote:

As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).

Cover grammars are an ugly hack that we had to add when there was no other practical choice. Avoid them as much as possible. They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.

When designing the grammar, the preferences are:

  • Use standard LR(1) productions
  • Use parametrized productions
  • Use lookahead restrictions if parametrized productions would introduce too many parameters into rules
  • Use a cover grammar if the grammar can't be reasonably expressed in any other way. They're a last resort with lots of problems.

Lookahead restrictions fit very well into an LR(1) engine as long as they're limited to only one token, and that's what I've implemented in the validator. You need to be very careful with them if looking more than one token ahead because lexing of the tokens can vary based on context. For example, if the next few characters in front of the cursor are )/abc/i+, then what is the second token? What is the third token? It's actually context-dependent.

The same problem is even worse for cover grammars.

 Waldemar
# Dmitry Soshnikov (6 months ago)

On Thu, Feb 2, 2017 at 3:23 PM, Waldemar Horwat <waldemar at google.com> wrote:

On 02/01/2017 10:25, Dmitry Soshnikov wrote:

As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).

Cover grammars are an ugly hack that we had to add when there was no other practical choice. Avoid them as much as possible. They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.

Yeah, absolutely agreed. The reason why I used cover grammar in that example to parse { as a Block vs. ObjectLiteral, and handle ExpressionStatement is to make the resulting grammar short, and don't introduce a bunch of NoCurly, etc extra productions for this. That's also said, this ugly hack also forces you to do post-processing overhead -- either validation of the nodes, or even re-interpretation (rewriting) to other nodes -- in my case I have to actually check that all nodes between { and } are properties, or vice-versa, statements, based on the expression/statement position.

Practically, a cover grammar still can be slower because of such post-processing overhead (imaging you have a giant file with just object -- all those now are needed to be validated to contain property nodes). OTOH, a trade off with lookahead restrictions is introducing ~1.5x more parsing states.

When designing the grammar, the preferences are:

  • Use standard LR(1) productions
  • Use parametrized productions
  • Use lookahead restrictions if parametrized productions would introduce too many parameters into rules
  • Use a cover grammar if the grammar can't be reasonably expressed in any other way. They're a last resort with lots of problems.

Just to double-check, by "parametrized productions" you mean the No<lookahead> extra productions?

Lookahead restrictions fit very well into an LR(1) engine as long as they're limited to only one token, and that's what I've implemented in the validator. You need to be very careful with them if looking more than one token ahead because lexing of the tokens can vary based on context. For example, if the next few characters in front of the cursor are )/abc/i+, then what is the second token? What is the third token? It's actually context-dependent.

The same problem is even worse for cover grammars.

Yeah, that's also true. Thanks again for the confirmation, that such a validator even exists, it's good -- at least we know it's parsable in principle. Those, that's said, on real practice now implementing a manual recursive descent for ECMAScript is more approachable now.

Dmitry

# Dmitry Soshnikov (6 months ago)

On Thu, Feb 2, 2017 at 3:56 PM, Dmitry Soshnikov <dmitry.soshnikov at gmail.com

wrote:

On Thu, Feb 2, 2017 at 3:23 PM, Waldemar Horwat <waldemar at google.com> wrote:

On 02/01/2017 10:25, Dmitry Soshnikov wrote:

As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).

Cover grammars are an ugly hack that we had to add when there was no other practical choice. Avoid them as much as possible. They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.

Yeah, absolutely agreed. The reason why I used cover grammar in that example to parse { as a Block vs. ObjectLiteral, and handle ExpressionStatement is to make the resulting grammar short, and don't introduce a bunch of NoCurly, etc extra productions for this. That's also said, this ugly hack also forces you to do post-processing overhead -- either validation of the nodes, or even re-interpretation (rewriting) to other nodes -- in my case I have to actually check that all nodes between { and } are properties, or vice-versa, statements, based on the expression/statement position.

Practically, a cover grammar still can be slower because of such post-processing overhead (imaging you have a giant file with just object -- all those now are needed to be validated to contain property nodes). OTOH, a trade off with lookahead restrictions is introducing ~1.5x more parsing states.

When designing the grammar, the preferences are:

  • Use standard LR(1) productions
  • Use parametrized productions
  • Use lookahead restrictions if parametrized productions would introduce too many parameters into rules
  • Use a cover grammar if the grammar can't be reasonably expressed in any other way. They're a last resort with lots of problems.

Just to double-check, by "parametrized productions" you mean the No<lookahead> extra productions?

Or you actually mean "lookahead-parametrized" productions (like in SLR(1) which doesn't use it, vs. LALR(1), or CLR(1) which do use it)?

Dmitry

# Michael Dyck (6 months ago)

On 17-02-02 06:23 PM, Waldemar Horwat wrote:

Lookahead restrictions fit very well into an LR(1) engine

Again: Great, but how? E.g., do you pre-process the grammar, modify the construction of the automaton, and/or modify the operation of the parser?

as long as they're limited to only one token, and that's what I've implemented in the validator.

So when the grammar has a prohibited lookahead-sequence with more than one token (in ExpressionStatement, IterationStatement, and ExportDeclaration), does the validator just use the first token?

You need to be very careful with them if looking more than one token ahead because lexing of the tokens can vary based on context. For example, if the next few characters in front of the cursor are )/abc/i+, then what is the second token? What is the third token? It's actually context-dependent.

But the context-dependentness of lexing is a parse-time problem, not a validation-time problem, right? The syntactic level can just assume a stream of (correctly lexed) input elements. Or do you validate deeper than the syntactic grammar?

(And it seems to me that multi-token lookahead-restrictions would be hard for a validator to handle even if lexing wasn't context-dependent, but maybe that depends on how you handle them.)

# Michael Dyck (6 months ago)

On 17-02-02 06:56 PM, Dmitry Soshnikov wrote:

Just to double-check, by "parametrized productions" you mean the No<lookahead> extra productions?

See the 'Grammar Notation' section of ES6+, at the paragraph beginning "A production may be parameterized by a subscripted annotation".

They were introduced in draft 20 of ES6, and replaced the use of the "NoIn" productions of the previous draft (and ES3-5).

(which, strictly speaking, weren't "No<lookahead> productions", because they

didn't express a restriction on lookahead)

# Waldemar Horwat (6 months ago)

On 02/02/2017 15:56, Dmitry Soshnikov wrote:

On Thu, Feb 2, 2017 at 3:23 PM, Waldemar Horwat <waldemar at google.com <mailto:waldemar at google.com>> wrote:

On 02/01/2017 10:25, Dmitry Soshnikov wrote:

    As mentioned, Cover grammar is usually the process of the grammar design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).


Cover grammars are an ugly hack that we had to add when there was no other practical choice.  Avoid them as much as possible.  They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.

Yeah, absolutely agreed. The reason why I used cover grammar in that example to parse { as a Block vs. ObjectLiteral, and handle ExpressionStatement is to make the resulting grammar short, and don't introduce a bunch of NoCurly, etc extra productions for this. That's also said, this ugly hack also forces you to do post-processing overhead -- either validation of the nodes, or even re-interpretation (rewriting) to other nodes -- in my case I have to actually check that all nodes between { and } are properties, or vice-versa, statements, based on the expression/statement position.

Don't use a cover grammar to unify blocks with object literals. That's a really bad idea and you'll likely get it wrong. If the { starts a block, then if the next token is a keyword such as if then it's parsed as a keyword. If the { starts an object literal, then if the next token is a keyword then it's parsed as an identifiername. As we extend the language, the expansions of these can later diverge to the point where you won't know whether a / starts a regular expression or is a division symbol.

 Waldemar
# Waldemar Horwat (6 months ago)

On 02/03/2017 08:17, Michael Dyck wrote:

On 17-02-02 06:23 PM, Waldemar Horwat wrote:

Lookahead restrictions fit very well into an LR(1) engine

Again: Great, but how? E.g., do you pre-process the grammar, modify the construction of the automaton, and/or modify the operation of the parser?

For each state × token combination, the automaton describes what happens when you're in state S and see a token T. The lookahead restrictions remove possible transitions; without them there would be ambiguities where a given state × token combination would want to do two incompatible things.

That's different from parametrized rules, which simply macro-expand into lots of individual rules.

as long as they're limited to only one token, and that's what I've implemented in the validator.

So when the grammar has a prohibited lookahead-sequence with more than one token (in ExpressionStatement, IterationStatement, and ExportDeclaration), does the validator just use the first token?

My LR(1) validator can't actually handle the case of multiple tokens of lookahead, and I didn't feel like doing an LR(2) grammar. I had to check these by hand.

You need to be very careful with them if looking more than one token ahead because lexing of the tokens can vary based on context. For example, if the next few characters in front of the cursor are )/abc/i+, then what is the second token? What is the third token? It's actually context-dependent.

But the context-dependentness of lexing is a parse-time problem, not a validation-time problem, right?

No.

The syntactic level can just assume a stream of (correctly lexed) input elements.

No! It's impossible to create a stream of correctly lexed input elements without doing syntactic level parsing. See the example I gave above.

Or do you validate deeper than the syntactic grammar?

Yes. The syntactic grammar controls the lexer. The validator looks for problems such as the syntactic grammar giving the lexer contradictory instructions. An example would be any syntactic grammar automaton state where one outgoing syntactic transition would swallow a regexp token and another outgoing syntactic transition from the same state would swallow a / token. If any such state exists, the grammar is broken.

# Dmitry Soshnikov (6 months ago)

On Fri, Feb 3, 2017 at 2:32 PM, Waldemar Horwat <waldemar at google.com> wrote:

On 02/02/2017 15:56, Dmitry Soshnikov wrote:

On Thu, Feb 2, 2017 at 3:23 PM, Waldemar Horwat <waldemar at google.com <mailto:waldemar at google.com>> wrote:

On 02/01/2017 10:25, Dmitry Soshnikov wrote:

    As mentioned, Cover grammar is usually the process of the grammar

design itself (as in ES6 spec itself). I'm not aware of automatic transformations for this (if you find any please let me know).

Cover grammars are an ugly hack that we had to add when there was no

other practical choice. Avoid them as much as possible. They are only used in situations where lookahead restrictions and parametrized grammar rules do not work in any practical way.

Yeah, absolutely agreed. The reason why I used cover grammar in that example to parse { as a Block vs. ObjectLiteral, and handle ExpressionStatement is to make the resulting grammar short, and don't introduce a bunch of NoCurly, etc extra productions for this. That's also said, this ugly hack also forces you to do post-processing overhead -- either validation of the nodes, or even re-interpretation (rewriting) to other nodes -- in my case I have to actually check that all nodes between { and } are properties, or vice-versa, statements, based on the expression/statement position.

Don't use a cover grammar to unify blocks with object literals. That's a really bad idea and you'll likely get it wrong. If the { starts a block, then if the next token is a keyword such as if then it's parsed as a keyword. If the { starts an object literal, then if the next token is a keyword then it's parsed as an identifiername. As we extend the language, the expansions of these can later diverge to the point where you won't know whether a / starts a regular expression or is a division symbol.

Ah, good catch. In fact, this can be fixed by allowing keywords as property names, as in JS (and in fact, I have just fixed it after your comment -- see example code changes in this commit: DmitrySoshnikov/syntax/commit/ccb3029cf7515bc078436c37831e027b50cb7fa4 -- there are no ambiguities, and it's still correctly parsed as an ObjectLiteral, or Block in needed position).

Just to note, this example language is a smaller sub-set of ECMAScript grammar, but not ES grammar itself, so, so far, Cover grammar works ;) (still being a "dirty hack" since the language syntax was designed so, allowing the same construct in different logically positions).

P.S.:

Another observation to note, using Cover grammar for this specific case with { requires parsing at least in LALR(1) mode, and SLR(1) will not work, eventually reaching to "reduce-reduce" conflict for when parsing e.g. an empty block inside the empty { { } }:

(output from the parsing states analysis tool)

- BlockStatement -> BlockOrObject • (kernel, reduce by production 13)
- ObjectLiteral -> BlockOrObject • (kernel, reduce by production 68)

LALR(1) uses parametrized lookahead set, and is free from this issue. It's the one which is used the most on practice anyways (taking more time for table calculation though)

Dmitry

# Michael Dyck (6 months ago)

On 17-02-03 05:32 PM, Waldemar Horwat wrote:

On 02/03/2017 08:17, Michael Dyck wrote:

On 17-02-02 06:23 PM, Waldemar Horwat wrote:

Lookahead restrictions fit very well into an LR(1) engine

Again: Great, but how? E.g., do you pre-process the grammar, modify the construction of the automaton, and/or modify the operation of the parser?

For each state × token combination, the automaton describes what happens when you're in state S and see a token T. The lookahead restrictions remove possible transitions; without them there would be ambiguities where a given state × token combination would want to do two incompatible things.

Okay, so do you generate the automaton (ignoring lookahead restrictions) and then remove transitions (using lookahead restrictions)? Or do you integrate the lookahead-restrictions into the generation of the automaton?

That's different from parametrized rules, which simply macro-expand into lots of individual rules.

Yup.

But the context-dependentness of lexing is a parse-time problem, not a validation-time problem, right?

No.

The syntactic level can just assume a stream of (correctly lexed) input elements.

(I should have said "Validation of the syntactic level [etc]")

No! It's impossible to create a stream of correctly lexed input elements without doing syntactic level parsing.

I quite agree. I didn't mean to suggest otherwise. What I mean is that, once you've generated the automaton for the syntactic grammar, you can just look at each state's set of expected terminals, and from that deduce the goal symbol that the lexer will need to use to get the next token when the parser is in that state. The point being that you can do that after generating the syntactic automaton. So the context-dependentness of lexing doesn't have to affect the process of generating the syntactic automaton.

(That's assuming an LR(1)-ish parser, and an approach where you don't try to combine the syntactic and lexical grammars to generate a single [scannerless] automaton. Which may not be your approach.)

The validator looks for problems such as the syntactic grammar giving the lexer contradictory instructions. An example would be any syntactic grammar automaton state where one outgoing syntactic transition would swallow a regexp token and another outgoing syntactic transition from the same state would swallow a / token. If any such state exists, the grammar is broken.

Yup, the spec asserts the non-existence of such states ("syntactic grammar contexts").

# Waldemar Horwat (6 months ago)

On 02/04/2017 07:20, Michael Dyck wrote:

On 17-02-03 05:32 PM, Waldemar Horwat wrote:

On 02/03/2017 08:17, Michael Dyck wrote:

On 17-02-02 06:23 PM, Waldemar Horwat wrote:

Lookahead restrictions fit very well into an LR(1) engine

Again: Great, but how? E.g., do you pre-process the grammar, modify the construction of the automaton, and/or modify the operation of the parser?

For each state × token combination, the automaton describes what happens when you're in state S and see a token T. The lookahead restrictions remove possible transitions; without them there would be ambiguities where a given state × token combination would want to do two incompatible things.

Okay, so do you generate the automaton (ignoring lookahead restrictions) and then remove transitions (using lookahead restrictions)? Or do you integrate the lookahead-restrictions into the generation of the automaton?

It's integrated. You can't generate a valid automaton without the lookahead restrictions.

That's different from parametrized rules, which simply macro-expand into lots of individual rules.

Yup.

But the context-dependentness of lexing is a parse-time problem, not a validation-time problem, right?

No.

The syntactic level can just assume a stream of (correctly lexed) input elements.

(I should have said "Validation of the syntactic level [etc]")

No! It's impossible to create a stream of correctly lexed input elements without doing syntactic level parsing.

I quite agree. I didn't mean to suggest otherwise. What I mean is that, once you've generated the automaton for the syntactic grammar, you can just look at each state's set of expected terminals, and from that deduce the goal symbol that the lexer will need to use to get the next token when the parser is in that state. The point being that you can do that after generating the syntactic automaton. So the context-dependentness of lexing doesn't have to affect the process of generating the syntactic automaton.

That's correct.

(That's assuming an LR(1)-ish parser, and an approach where you don't try to combine the syntactic and lexical grammars to generate a single [scannerless] automaton. Which may not be your approach.)

The parser and lexer stay separate, other than the lexer providing tokens to the parser and the parser selecting one of several top-level lexer goal symbols for lexing the next token. I do not use any kind of unified parser-lexer grammar; that could run into issues such as the syntactic grammar making lexing non-greedy: a+++++b lexing as a ++ + ++ b instead of the correct a ++ ++ + b (which would then generate a syntax error at the syntactic level).

The validator looks for problems such as the syntactic grammar giving the lexer contradictory instructions. An example would be any syntactic grammar automaton state where one outgoing syntactic transition would swallow a regexp token and another outgoing syntactic transition from the same state would swallow a / token. If any such state exists, the grammar is broken.

Yup, the spec asserts the non-existence of such states ("syntactic grammar contexts").

-Michael

 Waldemar
# Michael Dyck (6 months ago)

On 17-02-06 07:32 PM, Waldemar Horwat wrote:

On 02/04/2017 07:20, Michael Dyck wrote:

On 17-02-03 05:32 PM, Waldemar Horwat wrote:

On 02/03/2017 08:17, Michael Dyck wrote:

On 17-02-02 06:23 PM, Waldemar Horwat wrote:

Lookahead restrictions fit very well into an LR(1) engine

Again: Great, but how? E.g., do you pre-process the grammar, modify the construction of the automaton, and/or modify the operation of the parser?

For each state × token combination, the automaton describes what happens when you're in state S and see a token T. The lookahead restrictions remove possible transitions; without them there would be ambiguities where a given state × token combination would want to do two incompatible things.

Okay, so do you generate the automaton (ignoring lookahead restrictions) and then remove transitions (using lookahead restrictions)? Or do you integrate the lookahead-restrictions into the generation of the automaton?

It's integrated. You can't generate a valid automaton without the lookahead restrictions.

So, when you're making LR items for a production with a lookahead-restriction (l-r), do you:

-- treat the l-r as a (nullable) symbol, so that you get an item with the dot before the l-r, and one with the dot after, or

-- treat the l-r as occurring between two adjacent symbols, so that you get an item where the dot is "at" the l-r, or

-- something else?

# Isiah Meadows (6 months ago)

Just out of curiosity, what's the best way to handle async arrow function arguments vs async function calls? Is it term rewriting, or is there a better way of handling that? I'm much more familiar with recursive-descent and LL parsers than LR or shift-reduce ones, so I'm not as readily familiar with how to handle these cases.

Isiah Meadows me at isiahmeadows.com

# Waldemar Horwat (5 months ago)

On 02/07/2017 06:39, Michael Dyck wrote:

On 17-02-06 07:32 PM, Waldemar Horwat wrote:

On 02/04/2017 07:20, Michael Dyck wrote:

On 17-02-03 05:32 PM, Waldemar Horwat wrote:

On 02/03/2017 08:17, Michael Dyck wrote:

On 17-02-02 06:23 PM, Waldemar Horwat wrote:

Lookahead restrictions fit very well into an LR(1) engine

Again: Great, but how? E.g., do you pre-process the grammar, modify the construction of the automaton, and/or modify the operation of the parser?

For each state × token combination, the automaton describes what happens when you're in state S and see a token T. The lookahead restrictions remove possible transitions; without them there would be ambiguities where a given state × token combination would want to do two incompatible things.

Okay, so do you generate the automaton (ignoring lookahead restrictions) and then remove transitions (using lookahead restrictions)? Or do you integrate the lookahead-restrictions into the generation of the automaton?

It's integrated. You can't generate a valid automaton without the lookahead restrictions.

So, when you're making LR items for a production with a lookahead-restriction (l-r), do you:

-- treat the l-r as a (nullable) symbol, so that you get an item with the dot before the l-r, and one with the dot after, or

-- treat the l-r as occurring between two adjacent symbols, so that you get an item where the dot is "at" the l-r, or

-- something else?

It's something else — it's directly tied into the generation of the automaton states. Each automaton state contains a collection of possible places in the expansions of grammar rules that the state can represent. Following a terminal symbol T from a state A leads to a state B. A lookahead restriction prevents state B's collection from including expansions that would have been prohibited by the lookahead restriction on T. If that generates an inconsistency (for example, if there are two ways to get to an identical state, one with a lookahead restriction and one without), the grammar validation fails.

 Waldemar
# Michael Dyck (5 months ago)

On 17-02-07 06:02 PM, Waldemar Horwat wrote:

Lookahead restrictions fit very well into an LR(1) engine [...] it's directly tied into the generation of the automaton states. Each automaton state contains a collection of possible places in the expansions of grammar rules that the state can represent. Following a terminal symbol T from a state A leads to a state B. A lookahead restriction prevents state B's collection from including expansions that would have been prohibited by the lookahead restriction on T. If that generates an inconsistency (for example, if there are two ways to get to an identical state, one with a lookahead restriction and one without), the grammar validation fails.

So that sounds like you form the closure of state A in the standard way, and then use the lookahead-restriction to limit the transitions out of A (and the items that A contributes to the kernel of B). But it seems to me that forming the closure is also affected by the lookahead-restriction. Is that true in your validator?

# Dmitry Soshnikov (5 months ago)

On Mon, Jan 23, 2017 at 10:46 PM, Dmitry Soshnikov < dmitry.soshnikov at gmail.com> wrote:

On Mon, Jan 23, 2017 at 5:24 PM, Waldemar Horwat <waldemar at google.com> wrote:

On 01/11/2017 10:28, Michael Dyck wrote:

In the past, it has been said (usually by Brendan Eich) that TC39 intends that the ECMAScript grammar be LR(1). Is that still the case? (I'm not so much asking about the "1", but more about the "LR".)

If so, I'm wondering how lookahead-restrictions (e.g., [lookahead <! terminals]) fit into the LR approach. It seems like there are three possibilities:

  • modify the grammar (before the parser is constructed),
  • modify the construction of the parser, and/or
  • modify the operation of the parser. It's not clear to me that any of these (or any combination of these) will work.

Has anyone algorithmically generated an LR parser from the spec grammar? If so, how did you you handle lookahead-restrictions?

I'm the source of the LR(1) condition. I've been doing that ever since ECMAScript Edition 3, and in fact am using the LR parser to help design and debug the spec. I have an implementation of the parser with a few extensions to the LR grammar, including support for parametrized productions, lookahead restrictions, and lexer-parser feedback used to disambiguate things such as what token / will start. I validate various things such as making sure that there is no place where both an identifier and a regular expression can occur.

Thanks for the confirmation.

lookahead restrictions

Yeah, all the lookahead restrictions, such as lookahead ∉ {{, function} add a lot of extra "No<Lookahead>" productions, which makes the grammar (and hence the parsing table) bigger unfortunately. Example is e.g. ExpressionNoIn, VariableDeclarationNoIn from the spec, and similar for ...NoLeftCurrly, and ...NoFunction to support those lookahead restrictions, and allow ExpressionStatement to be parsed without "reduce/reduce" conflicts.

That's why in the mentioned "sub-set" of the ECMAScript example language [1] I chose not to use the same syntax for function expression and function declaration: fn foo() {} for declaration, and just an arrow function for function-expressions :) -- no ambiguity, no lookahead restrictions, hence no larger grammar, hence a faster parser.

lexer-parser feedback

Yes, having a lexer/parser state solves it more elegantly.

FYI: I just built this as a proof of concept for ECMAScript's { and }, and have parsed normally { } in statement position as a BlockStatement, and in the expression position as an ObjectLiteral. With special "activation productions" one can change the lexer state, and yield different token types for the same characters.

See the example here: DmitrySoshnikov/syntax/blob/master/examples/parser-lexer-communication.g

And some details in the docs: DmitrySoshnikov/syntax#access-tokenizer-from-parser-semantic-actions

Dmitry