I'm new to Antlr, so this may appear to be very basic.
I am constructing a grammar where you need to supporting parsing for different progressive versions of the same product. As you might imagine, higher versions update and build upon the feature set of lower versions. As a result, for instance, new reserved words get introduced by a higher version which weren't there in the older version. So I took the path of writing the first version supported (call it base version) and then through imports use the same grammar, but with "overrides" for the next version. The benefit of doing so appears to be reuse of existing code. I believe one could write one global grammar and snip out specific grammars for lower version through semantic predicates (antlr book takes this approach in the Java 1.7 enum example), but that appeared more burdensome both from programming as well as runtime performance perspectives. Not to mention configuration management challenges, since every time a new version comes along, I have to touch code for older versions. I'd rather know that older version parsing works and remains untouched when new version parsing is enabled.
So in taking this approach, I realized that imports don't work like inheritance i.e. you don't just import older version and write new stuff in the grammar file for the new version, thinking that new version would override older version. Instead, what appears to work is to have a shell grammar file that imports new grammar first and the base version last. I reasoned that works because of Antlr's rule precedence: match the longest string, but apply the rule that comes first. By making base version the last import, I am ensuring that newer (overriding) rules are encountered first. If it falls through, then base version rules apply, which are essentially the rules that were not modified in the new version.
The other realization was with respect to specific vs. generic rules. So if I had a more generic rule in base version grammar, I can't simply import base grammar into next version; I actually have to break out those generic rules from base grammar so generic rules always come in last in the next version grammar as well.
As an example, suppose I have the following for base grammar
//file BaseLexer.g4
lexer grammar BaseLexer;
import BaseReservedWords,
Common
;
//file BaseReservedWords.g4
ALTER: "alter";
DATABASE: "database";
//file Common.g4
IDENTIFIER:[a-zA-Z_]*?;
Now if I defined next version grammar simply as:
//file NextVersionLexer.g4
lexer grammar NextVersionLexer;
import BaseLexer,
NextVersionReservedWords
;
//file NextVersionReservedWords.g4
lexer grammar NextVersionReservedWords;
FILE: "file";
The next version grammar will lex "file" as an <IDENTIFIER> token instead of <FILE> token
So the right way appears to be:
//file NextVersionLexer.g4
lexer grammar NextVersionLexer;
import NextVersionReservedWords,
BaseReservedWords,
Common
;
//file NextVersionReservedWords.g4
lexer grammar NextVersionReservedWords;
FILE: "file";
Now more generic rule IDENTIFIER is guaranteed to show up later than any reserved words for the new version and "file" would indeed be lexed as <FILE>. not <IDENTIFIER>. Note that I had to break out generic lexer rules into a separate grammar file (Common.g4) and could not have written IDENTIFIER rule within BaseLexer.g4 body; BaseLexer.g4 must remain a "shell" file whose only purpose is to marshal imports in the correct order.
I have a similar challenge on parser grammar side. If you had an alterCommand rule that accounted for only "alter database" in base version, but allowed "alter file" in new version, you would have to re-define alterCommand in the new version, but in order for Antlr to pick it up, you would have to import base parser grammar last, not first.
Here is the corresponding files for parser grammar:
//file BaseParser.g4
parser grammar BaseParser;
alterCommand: ALTER DATABASE dbSubRule;
dbSubRule: ...;
//file NextVersionParser.g4
parser grammar NextVersionParser;
import NextVersionAlterCommandParser,
BaseParser
;
//file: NextVersionAlterCommandParser
parser grammar NextVersionAlterCommandParser;
alterCommand:
ALTER
( DATABASE dbSubRule
| FILE fileSubRule
)
;
fileSubRule: ....;
If I didn't import BaseParser last, I would get an error for "alter file ... " input since Antlr already picked the original alterCommand rule definition when transpiling the grammar to target language code. What would also not work, is the following:
//file NextVersionParser.g4
parser grammar NextVersionParser;
import BaseParser
;
alterCommand:
ALTER
( DATABASE dbSubRule
| FILE fileSubRule
)
;
The reason for that is that imports essentially prepend the imported grammar to the importing grammar; the original alterCommand would always trump the new alterCommand stated in the body of NextVersionParser. So the only way to really reuse code is to create this "shell" parser for the version whose only responsibility is to enforce a precedence order for grammar rules such that newer grammars that override older ones always show up first.
My questions are two-fold:
1. Am I correct in reasoning about the precedence and overriding feature of Antlr? Is there a different way to look at this, assuming I am breaking up grammar through imports.
2. Is breaking up grammar into smaller pieces and leveraging imports in the way described the "best practices" approach to writing grammars? I can't see an escape from it because let's suppose we wrote all version specific grammar in one big file, When I import this big file for the next version, all the rules defined in the original version would automatically enjoy precedence over newer definitions for the same rule. But perhaps there is something I'm missing here?