Syntactic or semantic predicates?

45 views
Skip to first unread message

drsoran

unread,
Nov 25, 2013, 4:26:47 AM11/25/13
to antlr-di...@googlegroups.com
Hi,

I have the following recognition problem and do not know how to solve it correctly with ANTLR 3.3. Please look at the following sketch grammar.


grammar Test;

test
: accessA | accessB;

accessA
: '%' 'M' ('W' | 'D') INTEGER;

accessB
: 'Instr#' ID;

INTEGER
: ('0'..'9')+;

ID : ('A' .. 'Z' | 'a' .. 'z') ('A' .. 'Z' | 'a' .. 'z' | INTEGER | '_')*;

The goal is, if an '%' is recognized in an input like '%MD1', the following 'M' and ''D' should be recognized separate. Because 'M' should trigger an action and 'D' also. But in a sentence like 'Instr#MD1' the 'MD1' should be recognized as ID token.

Currently, the lexer tokenizes '%MD1' as '%' ID not '%' 'M' 'D' INTEGER.

I understand that the lexer acts greedy and tries to consume as much input as possible. Is there a way to help the Lexer/Parser with predicates? 


Many thanks!

Mike Lischke

unread,
Nov 25, 2013, 11:03:44 AM11/25/13
to antlr-di...@googlegroups.com
Try making it non-greedy:

ID : ('A' .. 'Z' | 'a' .. 'z') (options { greedy = false; }: 'A' .. 'Z' | 'a' .. 'z' | INTEGER | '_')*;

This should match all letters + numbers until any of the keywords are found ('M', 'W', 'D'). The same trick is used for multiline comments like "/* ... */" to not skip closing tokens and match something like "/* ... */ ... */".

I also recommend to define all your keywords as lexer tokens and define them before the ID rule, to have them excluded from the ID recognition (otherwise 'M' will still be parsed as ID).

Jim Idle

unread,
Nov 25, 2013, 7:36:25 PM11/25/13
to antlr-di...@googlegroups.com
I don;t think that it needs to be non-greedy - just move the definitions in to the lexer before the ID rule. '%' is not a valid member of the ID set, so that will end the ID and start the ACCESSA:

test: accessA | accessB EOF;
accessA: (ACCESSAW | ACCESSAD) INTEGER;
accessB: INSTR ID;

fragment ACCESSAW: ;
fragment ACCESSAD: ;

ACCESS: '%M' ( 'W' { $type = ACCESSW; } | 'D' {$type = ACCESSD; } ) ;
INSTR : 'Instr#';
INTEGER ...
ID ...

Jim



--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

drsoran

unread,
Nov 26, 2013, 2:09:35 AM11/26/13
to antlr-di...@googlegroups.com, mike.l...@googlemail.com
Hi Mike,

Thanks for your answer, but if I try to make it non-greedy, ANTLR gives me a warning that none of the alternatives of 'A' .. 'Z' | 'a' .. 'z' | INTEGER | '_' can be matched anymore. I think it will stop after the leading ('A' .. 'Z' | 'a' .. 'z'). Due to the * and non-greedy it simply must not continue reading characters. Or am I wrong?

drsoran

unread,
Nov 26, 2013, 2:16:27 AM11/26/13
to antlr-di...@googlegroups.com
Hi Jim,

I tried your proposal and it worked! Thank you! But I have a question regarding the

 
ACCESS: '%M' ( 'W' { $type = ACCESSW; } | 'D' {$type = ACCESSD; } ) ;

It is a complex lexer rule. Isn't that something that belongs to parser level? It feels like describing the grammar in the lexer. I always thought the Lexer forms up the tokens and the Parser describes how to assemble them to a correct grammar. What do you think?

Jim Idle

unread,
Nov 26, 2013, 2:23:09 AM11/26/13
to antlr-di...@googlegroups.com
Just changing the token type so that you know which option it was. You can just assemble one token and look in the text when it hits the parser, but I think it is better this way. Compared to things like SQL that is not a complicate token at all ;). I would not worry about it. 

jim


--
Reply all
Reply to author
Forward
0 new messages