Semi-significant whitespace in an ANTLR4 grammar

Ross Patterson

Jul 27, 2022, 5:26:22 PM

to antlr-discussion

As I mentioned a few months back, I'm working on the grammar for the Rexx programming language. The language has a couple of interesting quirks from a parsing perspective. The one that's giving me headaches right now is that while whitespace is generally treated as an ignorable token separator (as in 90+% of all languages), there is one case where it is not. When two values (variables, constants, function calls, etc.) are separated by whitespace, the whitespace acts as an infix binary concatenation operator, joining its two operands with a single ASCII space between them. Just to make things even more complicated, there is a third form of concatenation: as long as two values can be abutted without lexical confusion (e.g., "a string>"3, or a_function()"a string"), the abuttal of the values acts as a traditional concatenation, sans the operator (i.e., functionally identical to "a string>" || 3, and a_function() || "a string"). Thus the concatenation productions look somewhat like this:

concatenation : addition (concatenation_op addition)* ;

concatenation_op : CONCAT # explicit_concat
                 | <SOME MAGIC> # blank_concat
                 | <OTHER MAGIC> # implied_concat
                 ;

CONCAT : '|' '|' ;
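
To make the three forms concrete, here is a minimal Java sketch of the semantics described above. RexxConcat and its method names are made up for illustration; they are not part of Rexx, ANTLR4, or the grammar, and values are simplified to plain strings.

```java
// Hypothetical helper for illustration only; not part of any Rexx or ANTLR API.
final class RexxConcat {
    private RexxConcat() {}

    // a || b : explicit operator, operands joined directly.
    static String explicit(String left, String right) {
        return left + right;
    }

    // a b : blank concatenation, operands joined with a single ASCII space.
    static String blank(String left, String right) {
        return left + " " + right;
    }

    // a"b" : abuttal, functionally identical to the explicit operator.
    static String abuttal(String left, String right) {
        return explicit(left, right);
    }

    public static void main(String[] args) {
        System.out.println(blank("a string>", "3"));
        System.out.println(abuttal("a string>", "3"));
    }
}
```

The only observable difference is the inserted space, which is why the blank form needs its own node in the parse tree.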

I've tried a bunch of different ways to handle this in ANTLR4, all to no avail. The most-promising attempt involved pushing whitespace onto a WHITESPACE_CHANNEL, and then using a parser semantic predicate to check if the previous token was on that channel and gate the blank_concat alternative with it. It almost even worked. But, because the semantics of concatenation differ between regular concatenation and blank-concatenation, I want the parse tree to reflect the difference, and I couldn't make that happen.

The most obvious technique would be to create a token representing the blank_concat pseudo-operator, and insert it into the parse. But I've read that ANTLR4 doesn't support modifying the parse tree. Even more obvious, but tedious, would be to treat all whitespace as lexically significant, and to explicitly indicate its presence in all the parser rules, including the blank_concat production where it has real meaning.
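
For illustration, the token-insertion idea could be sketched as a pass over the token list before parsing. This is plain Java, not ANTLR4 API: BlankConcatInserter, Tok, and the four token types are hypothetical. It assumes each run of whitespace is lexed as a single WS token and that every operand is a single VALUE token, which is a big simplification of real Rexx.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-parse pass; a simplified model, not ANTLR4 API.
final class BlankConcatInserter {
    enum Type { VALUE, OPERATOR, WS, BLANK_CONCAT }

    static final class Tok {
        final Type type;
        final String text;

        Tok(Type type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Replaces whitespace that sits directly between two values with a
    // synthetic BLANK_CONCAT token; all other whitespace is dropped.
    static List<Tok> rewrite(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Tok t = in.get(i);
            if (t.type != Type.WS) {
                out.add(t);
                continue;
            }
            boolean betweenValues = i > 0 && i + 1 < in.size()
                    && in.get(i - 1).type == Type.VALUE
                    && in.get(i + 1).type == Type.VALUE;
            if (betweenValues) {
                out.add(new Tok(Type.BLANK_CONCAT, " "));
            }
        }
        return out;
    }
}
```

Under these assumptions, a b becomes VALUE BLANK_CONCAT VALUE, while a || b keeps only the explicit operator between its operands.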

Anybody got some advice?

Ross

Ross Patterson

Aug 5, 2022, 11:15:55 AM

to antlr-discussion

Writing this description of the problem helped get my head right and drive me to a solution. As an old-line hand-generated-parser author, I kept thinking about the token stream, when I should have been thinking about the parse tree. The answer is this:

RexxLexer.g4 diff:

-WHISPACES : Whitespaces_ -> channel(HIDDEN);
+WHITESPACES : Whitespaces_ -> channel(WHITESPACE_CHANNEL);

...

CONCAT : VBar_ VBar_ ;

...

fragment VBar_ : '|' ;

RexxParser.g4 diff:

-concatenation : addition ( CONCAT? addition )* ;
+concatenation : addition (concatenation_op addition)* ;
+concatenation_op : {
+        (getTokenStream().get(getCurrentToken().getTokenIndex()-1).getChannel() == RexxLexer.WHITESPACE_CHANNEL)
+    }? blank_concatenation_op // If previous token is whitespace, this is blank-concatenation.
+    | normal_concatenation_op
+    ;
+normal_concatenation_op : {
+        (getTokenStream().get(getCurrentToken().getTokenIndex()-1).getChannel() != RexxLexer.WHITESPACE_CHANNEL)
+    }? // If previous token is not whitespace, this is abuttal-concatenation.
+       // Note: no token or rule to match here, just the predicate.
+    | CONCAT
+    ;
+blank_concatenation_op : ; // Note: no token or rule to match here, just the predicate.

That leads to having a parse tree node that represents the concatenation operator, which can be used to differentiate the two different types of concatenation. It's either normal_concatenation_op (with no token child, or a token child "||"), or blank_concatenation_op with no token child.
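
The decision those rules are meant to produce can be modelled in plain Java, without the ANTLR runtime. This is only a sketch of my reading of the grammar: ConcatOpModel, Tok, and Kind are hypothetical names, the channel number 2 is an assumed value for WHITESPACE_CHANNEL, and tokens are compared by text rather than by token type.

```java
import java.util.List;

// Hypothetical model of the predicate logic; not generated or ANTLR4 code.
final class ConcatOpModel {
    static final int DEFAULT_CHANNEL = 0;
    static final int WHITESPACE_CHANNEL = 2; // assumed value, for illustration

    enum Kind { BLANK, NORMAL_EXPLICIT, NORMAL_ABUTTAL }

    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // tokens: the full stream, including off-channel whitespace.
    // current: index of the token the parser sees on entering concatenation_op.
    static Kind classify(List<Tok> tokens, int current) {
        // An explicit "||" always yields normal_concatenation_op with a token child.
        if ("||".equals(tokens.get(current).text)) {
            return Kind.NORMAL_EXPLICIT;
        }
        // Otherwise the channel of the previous token decides, as in the predicates.
        boolean afterWhitespace = current > 0
                && tokens.get(current - 1).channel == WHITESPACE_CHANNEL;
        return afterWhitespace ? Kind.BLANK : Kind.NORMAL_ABUTTAL;
    }
}
```

The current > 0 guard is an addition in this sketch. In the grammar, concatenation_op always follows an addition, so there should always be a previous token to inspect.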

As part of another change, I'll be moving the Java code out to a RexxParserBase class, so I can work on multi-target support, but that's orthogonal to the problem at hand.

Ross

Terence Parr

Aug 5, 2022, 4:08:24 PM

to antlr-di...@googlegroups.com

Hi Ross, glad we could “help” haha. Yeah, it's amazing how just writing something out can help organize our thoughts. Yes, in general I try not to get too tricky in the lexer.

Ter

