Nongreedy Lexer rules

Sashka Sanakoev

Mar 4, 2019, 7:57:40 AM

to antlr-di...@googlegroups.com

I want to achieve the following behavior: User:class should be parsed as Object - User; Type - class, and Us:er:class should result in Object - Us:er; Type - class. I can't make the second part work: as soon as I add : as a legal symbol for WORD, the lexer consumes the whole input as a single object, Object - Us:er:class. My grammar:

grammar Sketch;

/*
 * Parser Rules
 */
input               : (object)+ EOF ;
object              : objectName objectType? NEWLINE ;
objectType          : ':' TYPE ;
objectName          : WORD ;

/*
 * Lexer Rules
 */ 
fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment NUMBER     : [0-9] ;
fragment WHITESPACE : (' ') ;
fragment SYMBOLS    : [!-/:-@[-`] ;
fragment C          : [cC] ;
fragment L          : [lL] ;
fragment A          : [aA] ;
fragment S          : [sS] ;
fragment T          : [tT] ;
fragment U          : [uU] ;
fragment R          : [rR] ;

TYPE                : ((C L A S S) | (S T R U C T));

NEWLINE             : ('\r'? '\n' | '\r')+ ;

WORD                : (LOWERCASE | UPPERCASE | NUMBER | WHITESPACE | SYMBOLS)+ ;

I wrote a simple example just to explain the behavior I want; my real parser is much more complicated. As I understand it, when multiple lexer rules can match at the same position, ANTLR chooses the rule that produces the longest token, and the order of rule declaration matters only when the matches are the same length. What I want is to make declaration order take precedence over token length. I found something related in "The Definitive ANTLR 4 Reference" (15.6 Wildcard Operator and Nongreedy Subrules, page 283), but unfortunately I still can't make it work with my example. I assume that's because the book's examples apply the nongreedy operators only to subrules. Any suggestions are appreciated.
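
To make the longest-match rule concrete, here is a minimal Python sketch of it. It is not ANTLR's implementation: the regexes are hand-written approximations of the TYPE, NEWLINE and WORD rules above, and `RULES` and `tokenize` are hypothetical names used only for illustration.

```python
import re

# Approximations of the lexer rules above, in declaration order.
# These regexes are assumptions for illustration, not ANTLR internals.
RULES = [
    ("TYPE", re.compile(r"class|struct", re.IGNORECASE)),
    ("NEWLINE", re.compile(r"(?:\r?\n|\r)+")),
    # WORD still includes ':' here (via the ':'-'@' range), as in the question.
    ("WORD", re.compile(r"[A-Za-z0-9 !-/:-@\[-`]+")),
]


def tokenize(text):
    """Longest match wins; on equal length the earliest rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie on length.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("class\n"))        # TYPE and WORD tie on length, TYPE is first
print(tokenize("Us:er:class\n"))  # WORD is longer than anything else
```

In this sketch the second call yields a single WORD covering `Us:er:class`, which is the behavior described above: declaration order is consulted only to break ties, so putting TYPE first cannot stop a longer WORD match.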

Geoff Groos

Mar 7, 2019, 1:55:55 PM

to antlr-discussion

Perhaps unfortunately, the general strategy when you get into these kinds of problems is to simplify your lexer and push the logic into your parser. Thankfully (I think) ANTLR's parser look-ahead will handle this situation reasonably elegantly.

input: object+ EOF;
object: objectName (COLON objectType)? NEWLINE ;
objectType: TYPE ;
objectName: WORD (COLON WORD)* ;

// lexer rules as above, but with ':' removed from SYMBOLS
// (otherwise WORD would still swallow the colons), plus:
COLON: ':';

This way your token sequence becomes fairly clear (each colon is its own token), and the parser takes on the complexity of deciding whether a colon is a separator or part of the name.

You're already handling the tokenizer ambiguity where the text 'class' and 'struct' match both WORD and TYPE, which the ANTLR lexer resolves as TYPE because that rule is declared first. I presume there are circumstances where ANTLR would have to do some back-tracking or resolve some ambiguities (or perhaps just get embedded in the LL-ness of the table's look-ahead?), but I haven't finished my first cup of coffee, so I'll leave it to you (or another message) to tell me if that's a problem. Suffice it to say, ANTLR will probably employ simple strategies that give you what you want.
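
As a rough illustration of the suggestion above, the following Python sketch mimics the revised token stream and the two parser rules by hand. It is a stand-in, not the parser ANTLR would generate; `RULES`, `tokenize` and `parse_object` are hypothetical names, and the WORD regex assumes ':' has been removed from SYMBOLS.

```python
import re

# Lexer stand-ins: ':' is its own token and is no longer part of WORD.
RULES = [
    ("TYPE", re.compile(r"class|struct", re.IGNORECASE)),
    ("COLON", re.compile(r":")),
    ("NEWLINE", re.compile(r"(?:\r?\n|\r)+")),
    ("WORD", re.compile(r"[A-Za-z0-9 !-/;-@\[-`]+")),
]


def tokenize(text):
    """Longest match wins; on equal length the earliest rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


def kind_at(tokens, i):
    """Token kind at index i, or None past the end of the input."""
    return tokens[i][0] if i < len(tokens) else None


def parse_object(tokens):
    """object : objectName (COLON objectType)? NEWLINE ;"""
    if kind_at(tokens, 0) != "WORD":
        raise SyntaxError("expected WORD at start of object")
    name, i = tokens[0][1], 1
    # objectName : WORD (COLON WORD)* ;
    # Two tokens of look-ahead decide whether a colon belongs to the name.
    while kind_at(tokens, i) == "COLON" and kind_at(tokens, i + 1) == "WORD":
        name += ":" + tokens[i + 1][1]
        i += 2
    object_type = None
    if kind_at(tokens, i) == "COLON" and kind_at(tokens, i + 1) == "TYPE":
        object_type = tokens[i + 1][1]
        i += 2
    if kind_at(tokens, i) != "NEWLINE":
        raise SyntaxError(f"expected NEWLINE at token {i}")
    return name, object_type


print(parse_object(tokenize("User:class\n")))
print(parse_object(tokenize("Us:er:class\n")))
```

One limitation visible in the sketch, and presumably in the grammar it mirrors: a name segment that is itself spelled `class` or `struct` lexes as TYPE rather than WORD, so an input such as `a:class:struct` would be rejected unless objectName is extended to accept TYPE tokens as well.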