ANTLR v4 - Is an ECMAScript 2015 grammar 'free of actions' possible to write?


burt_...@hotmail.com

Aug 18, 2016, 4:04:43 PM
to antlr-discussion
Can a modern JavaScript grammar be written in ANTLR4 that doesn't end up with target-language-specific grammar extensions?

Looking at the grammar of JavaScript, both ECMAScript 2015 as standardized and the ANTLR grammar at https://github.com/antlr/grammars-v4/tree/master/ECMAScript (based on ECMAScript 5, I think), I was a bit disappointed that the ANTLR4 example has multiple different grammar files depending on the target language. That doesn't seem to meet the expectation, stated at the top of the antlr/grammars-v4 page, that the grammars are free of actions. But in the case of ECMAScript, I'm not sure: is that a reasonable goal?

The ECMAScript 2015 spec uses productions parameterized by subscripted annotations, e.g. In, Yield, and Return. The spec describes these parameters as shorthand for multiple productions. After reading the ANTLR4 book, it seems like one would logically want to use actions and predicates in the parser to deal with these, but even these simple cases appear to require target-language-specific declarations, actions, and predicates.
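
To make that concrete, here is a minimal sketch of how one of the spec's parameterized productions might be encoded with ANTLR rule arguments. The rule names follow the spec's nonterminals, but the encoding is my own and is not the grammars-v4 grammar; the referenced sub-rules are omitted, and a Java target is assumed, which is exactly the problem: the `boolean` arguments and the predicate body are already Java-specific.

```antlr
// Hypothetical sketch only. The spec's [In, Yield] subscripts become rule
// arguments, and the [+Yield] guard becomes a semantic predicate.
assignmentExpression[boolean inAllowed, boolean yieldAllowed]
    : conditionalExpression[$inAllowed, $yieldAllowed]
    | {$yieldAllowed}? yieldExpression[$inAllowed]      // only valid under [+Yield]
    | arrowFunction[$inAllowed, $yieldAllowed]
    ;
```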

Perhaps a tougher problem is that the ES 2015 spec includes the concept of a *goal symbol* at the lexical level (e.g. InputElementDiv, InputElementTemplateTail, InputElementRegExp, and InputElementRegExpOrTemplateTail). While I think these could potentially map well onto the concept of ANTLR4 lexer modes, it seems that control over which of these lexical goals to use is specified at the semantic (parser) level rather than lexically.

At a high level, how would the ANTLR experts out there tackle an ECMAScript 2015 grammar?

Mike Lischke

Aug 21, 2016, 5:07:11 AM
to antlr-di...@googlegroups.com
Burt,

Can a modern JavaScript grammar be written in ANTLR4 that doesn't end up with target-language-specific grammar extensions?

Looking at the grammar of JavaScript, both ECMAScript 2015 as standardized and the ANTLR grammar at https://github.com/antlr/grammars-v4/tree/master/ECMAScript (based on ECMAScript 5, I think), I was a bit disappointed that the ANTLR4 example has multiple different grammar files depending on the target language. That doesn't seem to meet the expectation, stated at the top of the antlr/grammars-v4 page, that the grammars are free of actions. But in the case of ECMAScript, I'm not sure: is that a reasonable goal?

I was also surprised to see the various target variants of this grammar, and I believe the ECMA grammar is a perfect example of how bad it is to have so much target-specific code in a grammar at all. However, it's actually relatively easy to circumvent all that. The better approach is to derive the generated parser + lexer classes from intermediate classes that implement all the logic you see in the grammar actions.

Most languages support the same function-call syntax: an identifier with parentheses and parameters, returning a value. Something like "isThisValid(ctx)" can be used in C/C++, Java, JS, C#, etc. So, by replacing all the code with calls to functions in a base class, you can easily create a target-independent grammar.

Of course you would need target-specific base classes, but you usually need those anyway for other support code, which is normally implemented in a derived class. All of that can go in the base class(es) instead.
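
As a minimal sketch of that approach (hypothetical names, Java target assumed, regex body grossly simplified), the grammar keeps nothing but a plain function call, and the `superClass` option points the generated lexer at the hand-written base class:

```antlr
// Sketch only -- not the grammars-v4 ECMAScript lexer.
lexer grammar ECMAScriptLexer;

options { superClass = ECMAScriptLexerBase; }   // hand-written base class

// The predicate is just a call into the base class; no logic in the grammar.
// The body below ignores escapes and character classes on purpose.
RegularExpressionLiteral
    : {isRegexPossible()}? '/' ~[/\r\n*] ~[/\r\n]* '/' [gimuy]*
    ;

Divide : '/' ;
```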

burt_...@hotmail.com

Aug 22, 2016, 5:33:47 PM
to antlr-discussion
Mike,

On Sunday, August 21, 2016 at 2:07:11 AM UTC-7, Mike Lischke wrote:
...
 
I was also surprised to see the various target variants of this grammar, and I believe the ECMA grammar is a perfect example of how bad it is to have so much target-specific code in a grammar at all. However, it's actually relatively easy to circumvent all that. The better approach is to derive the generated parser + lexer classes from intermediate classes that implement all the logic you see in the grammar actions.


Yes, I experimented earlier with a base class for the ECMAScript lexer and ran into a few complications. For example, `isRegexPossible()` was tricky to define in a base class because its implementation references token constants generated by the tool. I'm not saying I need help to resolve that; it's just part of the bigger picture.

This sort of complication in ECMAScript is caused by the dual lexical meanings of certain characters, e.g. the forward slash. In some contexts it is the division operator, but in others it delimits the beginning of a regular expression. In ECMAScript 2015 there are more of these.
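
For what it's worth, here is a rough sketch (my own simplification, not the grammars-v4 implementation) of what such a base class can look like for the Java target. A real version would compare token *types* rather than token text, which is exactly where the generated token constants come into play:

```java
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.Token;

// Hypothetical base class; the generated lexer extends it via the superClass option.
public abstract class ECMAScriptLexerBase extends Lexer {

    // Last token emitted on the default channel; used to decide whether a
    // '/' can start a regular expression literal.
    private Token lastToken;

    public ECMAScriptLexerBase(CharStream input) {
        super(input);
    }

    @Override
    public Token nextToken() {
        Token next = super.nextToken();
        if (next.getChannel() == Token.DEFAULT_CHANNEL) {
            lastToken = next;
        }
        return next;
    }

    // Crude heuristic: a regex cannot follow something that can end an
    // expression (an identifier, a literal, ')' or ']').
    protected boolean isRegexPossible() {
        if (lastToken == null) {
            return true;                         // start of input
        }
        String text = lastToken.getText();
        if (")".equals(text) || "]".equals(text)) {
            return false;
        }
        // A real implementation checks lastToken.getType() against the
        // generated constants (Identifier, NumericLiteral, ...) instead.
        return !Character.isLetterOrDigit(text.charAt(0));
    }
}
```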

What's needed is conceptually close to ANTLR4 lexer modes, but further experiments showed that (as of ANTLR 4.5.4) modes don't seem to be a good fit; I found no way to share productions between modes without a lot of duplication. That's generally fine for island grammars, but it makes modes rather expensive for this sort of minor variation in lexical context. Lexical semantic predicates still seem needed, but I would rather not see target-language implementation code mixed into the grammar except for quick-and-dirty cases.
 
Most languages support the same function-call syntax: an identifier with parentheses and parameters, returning a value. Something like "isThisValid(ctx)" can be used in C/C++, Java, JS, C#, etc. So, by replacing all the code with calls to functions in a base class, you can easily create a target-independent grammar.

Yes, I'm with you, and I agree the syntaxes are close, but note that in some languages (JavaScript in particular) the method (or property) name **has** to be prefixed with `this.` to be invoked correctly, and in Python the required prefix seems to be `self.`. Close only counts in horseshoes and hand grenades.
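
Concretely, the same predicate has to be spelled differently per target today, which is why grammars-v4 ends up with per-target grammar files:

```antlr
{isRegexPossible()}?        // Java, C++, C#
{this.isRegexPossible()}?   // JavaScript
{self.isRegexPossible()}?   // Python
```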

I think ANTLR might benefit from a target-language-independent syntax to both declare and invoke abstract semantic predicates. The code-generation templates could then emit abstract classes with pure-virtual functions and allow the predicate semantics to be implemented in target-language classes inheriting from the generated code.
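
Purely as illustration of the idea (no such syntax exists in any ANTLR release), a grammar-level declaration might look something like:

```antlr
// Speculative syntax, shown only to illustrate the proposal above.
predicates {
    boolean isRegexPossible();   // emitted as an abstract / pure-virtual method
}

RegularExpressionLiteral
    : {isRegexPossible()}? '/' ~[/\r\n]+ '/' [gimuy]*
    ;
```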

   

Ivan Kochurkin

Aug 22, 2016, 6:30:14 PM
to antlr-discussion
On Tuesday, August 23, 2016 at 12:33:47 AM UTC+3, burt_...@hotmail.com wrote:

I think ANTLR might benefit from a target-language-independent syntax to both declare and invoke abstract semantic predicates. The code-generation templates could then emit abstract classes with pure-virtual functions and allow the predicate semantics to be implemented in target-language classes inheriting from the generated code.

 
I like this idea, and I suggested the same thing in an earlier issue: Unified Actions Language.
The benefit would be not only universal semantic predicates across the different runtimes, but also checking them at code-generation time rather than at compile time.
Grammar and semantic predicates would be merged into a single unit, with the ability to parse context-sensitive languages.
 

Mike Lischke

Aug 23, 2016, 3:19:55 AM
to antlr-di...@googlegroups.com


I think ANTLR might benefit from a target-language-independent syntax to both declare and invoke abstract semantic predicates. The code-generation templates could then emit abstract classes with pure-virtual functions and allow the predicate semantics to be implemented in target-language classes inheriting from the generated code.

 
I like this idea, and I suggested the same thing in an earlier issue: Unified Actions Language.
The benefit would be not only universal semantic predicates across the different runtimes, but also checking them at code-generation time rather than at compile time.
Grammar and semantic predicates would be merged into a single unit, with the ability to parse context-sensitive languages.

In a sense this is already happening in ANTLR's code generation. Take, for example, access to variables (the $x syntax). This could be enhanced to allow for more constructs. However, even then it would not solve the target-dependent code you have to keep in the @members section (entire functions, class variables and whatnot). So it's not really a good solution.
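
A small sketch of the distinction (hypothetical grammar, Java target): the $-references below are rewritten per target by the code generator, but everything inside @members is copied into the generated class verbatim.

```antlr
grammar MembersExample;   // hypothetical

@members {
    // Java-only code: passed through verbatim into the generated parser class.
    private int statementCount = 0;
}

prog
    : stat+ EOF
    ;

stat
    // $text is rewritten by the code generator for each target; the rest of
    // the action, and the @members block above, is not.
    : ID '=' INT ';' { statementCount++; System.out.println($text); }
    ;

ID  : [a-zA-Z]+ ;
INT : [0-9]+ ;
WS  : [ \t\r\n]+ -> skip ;
```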


Mike Lischke

Aug 23, 2016, 3:26:09 AM
to antlr-di...@googlegroups.com
Yes, I experimented earlier with a base class for the ECMAScript lexer and ran into a few complications. For example, `isRegexPossible()` was tricky to define in a base class because its implementation references token constants generated by the tool. I'm not saying I need help to resolve that; it's just part of the bigger picture.

Well, that seems to be a normal issue, and you can simply import your lexer definition to have the token values available. I was also wondering the other day whether it wouldn't be better if the token/channel enums were exported to a separate file, so you could reference them without having to pull in the full lexer/parser class. But that requires some changes in the code-generation part.
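
For reference, ANTLR already writes a <name>.tokens file next to the generated lexer, which is close to the separate export being described; a sketch of its format (names and values here are made up):

```
RegularExpressionLiteral=1
Divide=2
Identifier=3
'/'=2
```

Another grammar can already pull these values in with `options { tokenVocab = ECMAScriptLexer; }`; what it doesn't give you is something a hand-written base class in an arbitrary target can consume without referencing the generated lexer class.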

 
Most languages support the same function-call syntax: an identifier with parentheses and parameters, returning a value. Something like "isThisValid(ctx)" can be used in C/C++, Java, JS, C#, etc. So, by replacing all the code with calls to functions in a base class, you can easily create a target-independent grammar.

Yes, I'm with you, and I agree the syntaxes are close, but note that in some languages (JavaScript in particular) the method (or property) name **has** to be prefixed with `this.` to be invoked correctly, and in Python the required prefix seems to be `self.`. Close only counts in horseshoes and hand grenades.

Yes, true. I also thought about that. But this could be solved easily by supporting a single language-neutral $this token, which is translated to whatever the target actually needs (even if that is the empty string).
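
Sketching that idea (again, no $this token exists in ANTLR today):

```antlr
// Speculative: one spelling in the grammar...
{$this.isRegexPossible()}?
// ...translated per target to "this.", "self.", "this->", or dropped entirely.
```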

