Imports in Antlr4

Ashish Shanker

unread,
Oct 18, 2017, 10:51:27 AM10/18/17
to antlr-discussion

I'm new to Antlr, so this may appear to be very basic.

I am constructing a grammar where I need to support parsing for multiple progressive versions of the same product. As you might imagine, higher versions update and build upon the feature set of lower versions; for instance, a higher version introduces new reserved words that weren't there in the older version. So I took the path of writing a grammar for the first supported version (call it the base version) and then, through imports, reusing that grammar with "overrides" for the next version. The benefit of doing so appears to be reuse of existing code. I believe one could write one global grammar and snip out specific grammars for lower versions through semantic predicates (the ANTLR book takes this approach in the Java 1.7 enum example), but that appeared more burdensome from both a programming and a runtime performance perspective, not to mention the configuration management challenges: every time a new version comes along, I would have to touch code for older versions. I'd rather know that parsing for older versions works and remains untouched when parsing for a new version is enabled.

So in taking this approach, I realized that imports don't work like inheritance, i.e., you can't just import the older version and write new stuff in the grammar file for the new version, expecting the new version to override the older one. Instead, what appears to work is to have a shell grammar file that imports the new grammar first and the base version last. I reasoned that this works because of ANTLR's rule precedence: match the longest string, but apply the rule that comes first. By making the base version the last import, I ensure that newer (overriding) rules are encountered first. If matching falls through, then the base version rules apply, which are essentially the rules that were not modified in the new version.
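That tie-breaking can be sketched outside ANTLR. The following is a hypothetical simulation of the precedence rule as I understand it (not ANTLR's actual code; the token names mirror the examples below):

```python
import re

def winning_rule(rules, text):
    """Pick a token the way an ANTLR lexer breaks ties:
    the longest match wins; on equal length, the earlier rule wins."""
    best_name, best_len = None, -1
    for name, pattern in rules:          # iteration order = declaration order
        m = re.match(pattern, text)
        if m and m.end() > best_len:     # strictly longer only, so the first rule keeps ties
            best_name, best_len = name, m.end()
    return best_name

# FILE listed before IDENTIFIER, mirroring "overriding rules come first"
rules = [("FILE", r"file"), ("IDENTIFIER", r"[a-zA-Z_]+")]
print(winning_rule(rules, "file"))      # FILE: same length, FILE comes first
print(winning_rule(rules, "filename"))  # IDENTIFIER: longer match wins
```

Reversing the list order makes IDENTIFIER swallow "file", which is exactly the failure mode described below when the base grammar is imported first.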

The other realization was with respect to specific vs. generic rules. If I had a more generic rule in the base version grammar, I can't simply import the base grammar into the next version; I actually have to break those generic rules out of the base grammar so that they always come last in the next version grammar as well.

As an example, suppose I have the following for the base grammar:



//file BaseLexer.g4
lexer grammar BaseLexer;
import BaseReservedWords, Common;


//file BaseReservedWords.g4
lexer grammar BaseReservedWords;
ALTER    : 'alter';
DATABASE : 'database';


//file Common.g4
lexer grammar Common;
IDENTIFIER : [a-zA-Z_]+;

Now if I define the next version grammar simply as:

//file NextVersionLexer.g4
lexer grammar NextVersionLexer;
import BaseLexer, NextVersionReservedWords;


//file NextVersionReservedWords.g4
lexer grammar NextVersionReservedWords;
FILE : 'file';

then the next version grammar will lex "file" as an <IDENTIFIER> token instead of a <FILE> token.

So the right way appears to be:

//file NextVersionLexer.g4
lexer grammar NextVersionLexer;
import NextVersionReservedWords, BaseReservedWords, Common;


//file NextVersionReservedWords.g4
lexer grammar NextVersionReservedWords;
FILE : 'file';


Now the more generic IDENTIFIER rule is guaranteed to show up later than any reserved words for the new version, and "file" would indeed be lexed as <FILE>, not <IDENTIFIER>. Note that I had to break the generic lexer rules out into a separate grammar file (Common.g4) and could not have written the IDENTIFIER rule within the body of BaseLexer.g4; BaseLexer.g4 must remain a "shell" file whose only purpose is to marshal imports in the correct order.

I have a similar challenge on the parser grammar side. If you had an alterCommand rule that accounted for only "alter database" in the base version, but allowed "alter file" in the new version, you would have to redefine alterCommand in the new version; but in order for ANTLR to pick it up, you would have to import the base parser grammar last, not first.

Here are the corresponding files for the parser grammar:

//file BaseParser.g4
parser grammar BaseParser;
alterCommand
    : ALTER DATABASE dbSubRule;

dbSubRule
    : ...;


//file NextVersionParser.g4
parser grammar NextVersionParser;
import NextVersionAlterCommandParser, BaseParser;


//file NextVersionAlterCommandParser.g4
parser grammar NextVersionAlterCommandParser;

alterCommand
    : ALTER
      ( DATABASE dbSubRule
      | FILE fileSubRule
      )
    ;

fileSubRule
    : ....;


If I didn't import BaseParser last, I would get an error for "alter file ..." input, since ANTLR would have already picked the original alterCommand rule definition when translating the grammar to target language code. What would also not work is the following:

//file NextVersionParser.g4
parser grammar NextVersionParser;
import BaseParser;


alterCommand
    : ALTER
      ( DATABASE dbSubRule
      | FILE fileSubRule
      )
    ;


The reason is that imports essentially prepend the imported grammar to the importing grammar; the original alterCommand would always trump the new alterCommand stated in the body of NextVersionParser. So the only way to really reuse code is to create a "shell" parser for the version whose only responsibility is to enforce a precedence order on grammar rules such that newer grammars that override older ones always show up first.
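My mental model of that merge can be sketched as follows (a hypothetical simulation of the behavior I observed, not ANTLR's implementation): grammars listed earlier contribute their rules first, and the first definition of a rule name wins.

```python
def merge_rules(*grammars):
    """Model of the merge order described above: grammars are combined
    in listed order, and the first definition of a rule name wins."""
    merged = {}
    for grammar in grammars:              # earlier grammar = higher precedence
        for name, body in grammar:
            merged.setdefault(name, body) # keep the first definition seen
    return merged

base_parser = [("alterCommand", "ALTER DATABASE dbSubRule"),
               ("dbSubRule", "...")]
next_overrides = [("alterCommand", "ALTER (DATABASE dbSubRule | FILE fileSubRule)"),
                  ("fileSubRule", "...")]

# import BaseParser first: the old alterCommand trumps the override
bad = merge_rules(base_parser, next_overrides)
# import the overrides first: the new alterCommand wins, base fills in the rest
good = merge_rules(next_overrides, base_parser)
print(bad["alterCommand"])   # ALTER DATABASE dbSubRule
print(good["alterCommand"])  # ALTER (DATABASE dbSubRule | FILE fileSubRule)
```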


My questions are two-fold:

1. Am I correct in my reasoning about the precedence and overriding features of ANTLR? Is there a different way to look at this, assuming I am breaking up the grammar through imports?
2. Is breaking the grammar into smaller pieces and leveraging imports in the way described the "best practices" approach to writing grammars? I can't see an escape from it, because suppose we wrote all version-specific grammar in one big file: when I import this big file for the next version, all the rules defined in the original version would automatically enjoy precedence over newer definitions of the same rules. But perhaps there is something I'm missing here?



Mike Lischke

unread,
Oct 18, 2017, 11:49:40 AM10/18/17
to antlr-di...@googlegroups.com
> I am constructing a grammar where I need to support parsing for multiple progressive versions of the same product. [...] I'd rather know that parsing for older versions works and remains untouched when parsing for a new version is enabled.
>
> So in taking this approach, I realized that imports don't work like inheritance [...] By making the base version the last import, I am ensuring that newer (overriding) rules are encountered first. If it falls through, then the base version rules apply, which are essentially the rules that were not modified in the new version.

Well, read this: https://github.com/antlr/antlr4/blob/master/doc/grammars.md#grammar-imports to see how this is implemented.

Btw., imports won't help you with hiding rules that are no longer valid in newer versions of the language. I'm in a similar situation with my MySQL grammar, where I have to support server versions from 4.0 up to 8.0. What I do is use predicates to gate available tokens and rule alts depending on the server version (which I keep in my recognizer base class, from which my generated parser + lexer classes are derived). An example:

CONTINUE_SYMBOL:                        C O N T I N U E                                             {serverVersion >= 50000}?;   // SQL-2003-R

That approach also allows generating only a single parser for all versions, and you can switch at runtime which version you want to support.
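The gating can be sketched like this (a simplified model with made-up token rules, not the generated lexer code): each token rule carries a minimum server version, and a rule is only viable when its predicate passes, so the same rule table serves every version at runtime.

```python
import re

# (token name, pattern, minimum server version); mirrors the
# {serverVersion >= 50000}? predicate style with hypothetical rules.
RULES = [
    ("CONTINUE_SYMBOL", r"continue", 50000),
    ("IDENTIFIER", r"[a-zA-Z_]+", 0),
]

def tokenize_word(word, server_version):
    """Return the first fully matching rule whose version predicate passes."""
    for name, pattern, min_version in RULES:
        if server_version >= min_version and re.fullmatch(pattern, word):
            return name
    return None

print(tokenize_word("continue", 50700))  # CONTINUE_SYMBOL: predicate passes
print(tokenize_word("continue", 40100))  # IDENTIFIER: keyword gated off
```

When the predicate fails, the keyword rule simply becomes nonviable and the input falls through to the generic identifier rule, which is the behavior a failed lexer predicate produces.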

Here's an older version of the MySQL grammar (for ANTLR3): https://github.com/mysql/mysql-workbench/blob/master/library/parsers/grammars/MySQL.g, which uses that approach as well.


Ashish Shanker

unread,
Oct 18, 2017, 12:29:21 PM10/18/17
to antlr-discussion
Hi Mike - thank you for your quick response. I am aware of that article and have read through the ANTLR 4 book multiple times. The stratagem I described does work, i.e., I am able to hide/override older rules in a newer version without using semantic predicates. I was looking for some external validation, but it appears that is not the preferred way to deal with multiple versions. The challenge I have is that I would like to shield older grammars from change when a new version comes along, so I really don't prefer the semantic predicates approach, for the reasons outlined in the original post. There is an additional twist in that different dev teams may take up different versions, so I can't really assume future teams would not regress the solution if I let them touch the older grammar.

I believe semantic predicates degrade performance as well. In your experience, has that been noticeable? For parsers, I came to the understanding that the degradation in performance is on account of implementing semantic predicates through the exceptions infrastructure. Is it the same for lexers, or does the code generator actually discard the tokens completely?

Right now I am generating one parser per version, partly because that is how the original grammar (in ancient ANTLR) was written and the rest of the code expects to see a DB system + version specific parser.

Thank you for your guidance! Much appreciated :)

Regards

Ashish

Mike Lischke

unread,
Oct 19, 2017, 3:18:05 AM10/19/17
to antlr-di...@googlegroups.com
> I believe semantic predicates degrade performance as well. In your experience has that been noticeable?

I haven't done A/B testing with the predicates, as I've been using that approach for a long time already, but yes, they will require some extra time, no doubt. If used wrongly, they can really slow down the parsing process (search this mailing list; we discussed this before). But after all, it's additional functionality, and you have to pay a price if you want to have it.

> For Parsers I came to the understanding that the degradation in performance is on account of implementing semantic predicates through the exceptions infrastructure. Is that the same for lexers or does the code generator actually discard the tokens completely?

Best is to look at the generated code. The predicates are mostly used in simple if() statements to guide the process. The speed also depends heavily on how much of the optional (version-dependent) input actually comes in and how many optional parts you have in your grammar.


Mike
--
www.soft-gems.net

Ashish Shanker

unread,
Oct 20, 2017, 11:40:17 AM10/20/17
to antlr-discussion
> Best is to look at the generated code. The predicates are mostly used in simple if() statements to guide the process. 

You are right; the semantic predicate condition is checked through an if() statement. However, the handling of a failed semantic predicate goes through a FailedPredicateException, which certainly has overhead. If I have a grammar with alternate subrules activated/deactivated through predicates, I am guaranteed to visit the FailedPredicateException path at least once while processing that rule. So far I have been able to avoid semantic predicates for the most part; the cases where they do show up are the ones where they were absolutely unavoidable. Also--unrelated to the topic at hand--language portability across Java and C++ is a design goal of the project, so there is really very limited processing I can do in the body of semantic predicates anyway.

Thank you for sharing your perspective. This was really helpful.