Understanding lexer precedence handling in antlr4


Jonathan Coveney

Apr 10, 2014, 7:31:08 PM
to antlr-di...@googlegroups.com
Let's say that I have the following:

lexer grammar Hmm;

fragment INTEGER : [0-9];

NUMBER : INTEGER+;
COMMA : ',';
MANY_NUMBERS : NUMBER (COMMA NUMBER)*;

If I give it this input:

1,2,3

then it will lex it as MANY_NUMBERS. I thought that lexing went in top down order of preference, and was greedy. That is, I thought it would be tokenized as:

NUMBER COMMA NUMBER COMMA NUMBER

How should I reason about this sort of thing? I realize that using parser rules I can disambiguate this, but I'd like to better understand what's going on at the lexer level.

Is it that the lexer goes for the longest possible continuous match, preferring length first and falling back to rule order only to break ties?

I.e., it matches NUMBER, then sees a COMMA, and instead of preferring two tokens, it prefers MANY_NUMBERS because that locally minimizes the number of tokens?

Thanks
Jon

Ter Cs

Apr 10, 2014, 8:19:11 PM
to antlr-di...@googlegroups.com
It goes for the longest sequence first. If two or more rules match the longest possible sequence, then it chooses the lexer rule specified first.
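
To make that rule concrete for the grammar in the question, here is a rough Python sketch of "longest match wins, rule order breaks ties". It is not ANTLR's actual implementation: the RULES table and the tokenize function are hypothetical stand-ins, with each lexer rule approximated by a regular expression (the INTEGER fragment is inlined as [0-9]).

```python
import re

# Hypothetical stand-ins for the lexer rules in the question,
# listed in the same order as in the grammar.
RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("COMMA", re.compile(r",")),
    ("MANY_NUMBERS", re.compile(r"[0-9]+(?:,[0-9]+)*")),
]


def tokenize(text):
    """Split text into (rule_name, lexeme) pairs.

    At each position every rule is tried; the longest match wins, and
    when several rules match the same length the earliest rule wins.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    print(tokenize("1,2,3"))
    print(tokenize("42"))
```

Under these assumptions, 1,2,3 comes out as a single MANY_NUMBERS token, because that rule covers all five characters while NUMBER covers only the first. A lone 42 comes out as NUMBER: both NUMBER and MANY_NUMBERS match two characters, and NUMBER is listed first.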
