As I mentioned a few months back, I'm working on the grammar for the Rexx programming language. The language has a couple of interesting quirks from a parsing perspective. The one that's giving me headaches right now is that while whitespace is generally treated as an ignorable token separator (as in 90+% of all languages), there is one case where it is not. When two values (variables, constants, function calls,
etc.) are separated by whitespace, the whitespace acts as an infix binary concatenation operator, joining its two operands with a single ASCII space between them. Just to make things even more complicated, there is a third form of concatenation: as long as two values can be abutted without lexical confusion (e.g., "a string>"3, or a_function()"a string"), the abuttal of the values acts as a traditional concatenation, sans the operator (i.e., functionally identical to "a string>" || 3, and a_function() || "a string"). Thus the concatenation productions look somewhat like this:
concatenation : addition (concatenation_op addition)* ;
concatenation_op : CONCAT # explicit_concat
| <SOME MAGIC> # blank_concat
| <OTHER MAGIC> # implied_concat
;
CONCAT : '|' '|' ;
I've tried a bunch of different ways to handle this in ANTLR4, all to no avail. The most promising attempt involved pushing whitespace onto a WHITESPACE_CHANNEL, then using a parser semantic predicate to check whether the previous token was on that channel and gating the blank_concat alternative with it. It almost worked. But because the semantics differ between regular concatenation and blank concatenation, I want the parse tree to reflect the difference, and I couldn't make that happen.
The most obvious technique would be to create a token representing the blank_concat pseudo-operator, and insert it into the parse. But I've read that ANTLR4 doesn't support modifying the parse tree. Even more obvious, but tedious, would be to treat all whitespace as lexically significant, and to explicitly indicate its presence in all the parser rules, including the blank_concat production where it has real meaning.
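To make the token-insertion idea concrete, here is roughly what I have in mind, sketched in plain Python rather than against the ANTLR4 API. Everything in it is hypothetical: the toy lexer, the token names (BLANK_CONCAT, IMPLIED_CONCAT), and the rule for which tokens can start or end a value are assumptions for illustration, and real Rexx lexing is a good deal more involved. The pass runs between lexer and parser, drops the whitespace tokens, and inserts a pseudo-operator token wherever two values meet.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str  # hypothetical token type name
    text: str


# Toy lexer for a tiny Rexx-like subset (an assumption, not real Rexx lexing).
TOKEN_RE = re.compile(r"""
    (?P<WS>[ \t]+)
  | (?P<CONCAT>\|\|)
  | (?P<STRING>"[^"]*")
  | (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<OP>[+\-*/])
""", re.VERBOSE)

# Assumed classification: which token kinds can end or start a value.
ENDS_VALUE = {"STRING", "NUMBER", "NAME", "RPAREN"}
STARTS_VALUE = {"STRING", "NUMBER", "NAME", "LPAREN"}


def lex(source: str) -> List[Token]:
    """Split source into tokens, keeping whitespace as WS tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def insert_concat_tokens(tokens: List[Token]) -> List[Token]:
    """Drop WS tokens and insert pseudo-operators between adjacent values."""
    out: List[Token] = []
    saw_ws = False
    for tok in tokens:
        if tok.kind == "WS":
            saw_ws = True
            continue
        prev = out[-1] if out else None
        if prev and prev.kind in ENDS_VALUE and tok.kind in STARTS_VALUE:
            if saw_ws:
                # Two values separated only by whitespace.
                out.append(Token("BLANK_CONCAT", " "))
            elif not (prev.kind == "NAME" and tok.kind == "LPAREN"):
                # Abutted values; NAME directly followed by '(' is
                # treated as a function call, not a concatenation.
                out.append(Token("IMPLIED_CONCAT", ""))
        out.append(tok)
        saw_ws = False
    return out


def kinds(source: str) -> List[str]:
    """Convenience helper: token kinds after the insertion pass."""
    return [t.kind for t in insert_concat_tokens(lex(source))]
```

With something like this in front of the parser, blank_concat and implied_concat would become ordinary token references in concatenation_op, and the parse tree would distinguish the three forms for free. What I don't know is how to do the equivalent cleanly inside ANTLR4.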
Anybody got some advice?
Ross