Antlr recognises 'and' but not 'or' without a space

Peter Klavins

Apr 14, 2017, 5:30:59 PM
to antlr-discussion
I'm using the ANTLR 4 plugin in IntelliJ, and I have the most bizarre bug. I'll start with the relevant parser/lexer rules:

// Take care of whitespace.
WS
: [ \r\t\f\n]+ -> skip;

OTHER
: . -> skip;

STRING
: '"' [A-z ]+ '"'
;

evaluate
// starting rule.
: textbox? // could be an empty textbox.
;

textbox
: (row '\n')*
;

row
: ability
| ability_list
;

ability
: activated_ability
| triggered_ability
| static_ability
;

triggered_ability
: trigger_words ',' STRING
;

trigger_words
: ('when'|'whenever'|'as') whenever_triggers|'at'
;

whenever_triggers
: triggerer (('or'|'and') triggerer)* // this line has the issue.
;

triggerer
: self
;

self
: '~'
;

I pass it this text: whenever ~ or ~. It fails on the or, reporting "line 1:10 mismatched input ' or' expecting {'or', 'and'}". However, if I add a space to the or literal in the whenever_triggers rule (making it ' or'|'and'), it works fine.

The weirdest thing is that if I try whenever ~ and ~, it works fine even without a space in the rule's and literal. This doesn't change if I make 'and'|'or' a lexer rule either. It's just bizarre. I've confirmed this bug also happens when running the 'test rig' in ANTLRWorks 2, so it's not just an IntelliJ thing. I've checked, and the IntelliJ plugin is using the latest version, 4.7.

I've attached an image of the parse tree I get when this error occurs.

Mike Lischke

Apr 15, 2017, 9:15:01 AM
to antlr-di...@googlegroups.com
> I'm using the ANTLR 4 plugin in IntelliJ, and I have the most bizarre bug. I'll start with the relevant parser/lexer rules:

Peter, what tokens does your lexer produce? The first thing you should always check is whether the token list is correct. Also, I see you have a mix of explicit and implicit lexer rules (tokens). It has been said many times before, but let me repeat: always define your tokens in the lexer grammar (part), not as implicit tokens in the parser, because such implicit tokens can get different IDs and can lead to strange behavior. (Maybe implicit lexer tokens should be prohibited entirely; they already are for split (non-combined) grammars.) You can even have an explicit token and an implicit rule that match the same input and think that the two are equal. They are not: they refer to different token values, and only one of the two is ever matched. So make it your standard habit to define lexer rules explicitly.
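As a minimal sketch of that advice, the two keyword-heavy rules from the original post could be rewritten against named tokens. The token names (WHEN, WHENEVER, AS, AT, OR, AND) are made up here for illustration and do not come from the original grammar.

```antlr
// Parser rules refer to named tokens instead of quoted literals.
trigger_words
    : (WHEN | WHENEVER | AS) whenever_triggers
    | AT
    ;

whenever_triggers
    : triggerer ((OR | AND) triggerer)*
    ;

// Lexer rules: each keyword is declared exactly once, with an explicit name.
// 'whenever' still lexes as WHENEVER rather than WHEN because the lexer
// prefers the longer match.
WHEN     : 'when' ;
WHENEVER : 'whenever' ;
AS       : 'as' ;
AT       : 'at' ;
OR       : 'or' ;
AND      : 'and' ;
```

With every keyword declared in one place, a stray variant such as a literal with a leading space has nowhere to hide in the parser rules.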

Next, your OTHER rule matches everything, and since it is listed before STRING, your STRING token will never be produced (OTHER will kick in first). I'm not sure whether a catch-all rule also kicks in before any implicit lexer token; that is something you should check as well (by looking at your token stream dump), though it would not be worth a thought if you only had explicit lexer rules.
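A sketch of the ordering in question, with the specific rule first and the catch-all last. One caveat: as far as the ANTLR 4 documentation describes it, the lexer picks the rule that matches the longest stretch of input and uses rule order only to break ties, so a one-character OTHER can only win where nothing listed earlier matches at least one character. The [a-zA-Z ] set is a substitution made for this example; the original [A-z ] also covers the punctuation that sits between Z and a in ASCII.

```antlr
// Specific tokens first.
STRING : '"' [a-zA-Z ]+ '"' ;

// Whitespace is skipped.
WS     : [ \r\t\f\n]+ -> skip ;

// Catch-all last: it only wins where no earlier rule matches
// at least as much input.
OTHER  : . -> skip ;
```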

Peter Klavins

Apr 15, 2017, 1:29:02 PM
to antlr-discussion
Yep, turns out I had another rule that had ' or' defined in it. I've since learned that implicit tokens are bad, so I'm going to go through my grammar and give every token its own lexer rule. I'm also going to put all my catch-all rules (like STRING, which is only temporary, and OTHER) at the end.

The source of my confusion was that I never imagined that lexer rules that hadn't been 'used' by a test string would affect ones that were being used by the test string. But now I know better.
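The effect can be reproduced in a cut-down grammar. This is a sketch under assumptions: the grammar name Repro and the rules start and unused are hypothetical, and the stray literal is placed in a rule that the start rule never reaches.

```antlr
grammar Repro;

// Entry rule: expects '~', then 'or' or 'and', then '~'.
start  : '~' ('or' | 'and') '~' EOF ;

// Never reached from start, but the literal ' or' still becomes a token
// that the lexer will try on every input.
unused : ' or' ;

WS     : [ \r\t\f\n]+ -> skip ;
```

On the input ~ or ~, the lexer stands at the space after the first ~ and has two candidates: WS, which matches one character, and the literal ' or', which matches three. The longer match wins, so the parser receives a ' or' token where it expects 'or' or 'and'. On ~ and ~ there is no competing ' and' literal, so the space is skipped as WS and 'and' is matched as intended.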

Mike Lischke

Apr 15, 2017, 1:36:20 PM
to antlr-di...@googlegroups.com
> Yep, turns out I had another rule that had ' or' defined in it. I've since learned that implicit tokens are bad, so I'm going to go through my grammar and tokenize everything in its own lexer rules. I'm also going to put all my catch-all rules (like STRING, which is only temporary, and OTHER) at the end.

Yes, that's the best position for them. Always put more specialized rules at the top and general ones at the end. By the way, your STRING rule is not a catch-all rule if you define it correctly. Make it match non-greedily (by placing a question mark after the + operator). It should then match only until the next double quote.
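For illustration, a non-greedy form of the rule might look like the sketch below. STRING_ANY is a hypothetical name for a wider variant, and a real grammar would keep only one of the two rules. Note that the letter-and-space set cannot match a double quote, so that loop has to stop at the closing quote either way; the non-greedy operator matters most once the body is widened to the wildcard.

```antlr
// Letters and spaces only; "+?" makes the loop non-greedy.
STRING     : '"' [a-zA-Z ]+? '"' ;

// Hypothetical wider variant: any characters up to the next double quote.
STRING_ANY : '"' .+? '"' ;
```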
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Loring Craymer

Apr 16, 2017, 6:07:46 AM
to antlr-discussion
I think that the real problem is the use of  ' or' rather than 'or'.  Including the space causes conflicts with WS.  If your input has a series of spaces preceding 'or', then the recognizer has to identify both WS and ' or' not followed by text.  What you are seeing is still a bug in the recognizer algorithm (or implementation), but it is one that should be avoided.  Still, it is better to avoid using literals.

Mike Lischke

Apr 16, 2017, 6:32:06 AM
to antlr-di...@googlegroups.com
> I think that the real problem is the use of ' or' rather than 'or'. Including the space causes conflicts with WS.

That ' or' part is about the input, not a rule in the grammar. There is no rule which defines this char sequence. I think ' or' was matched by OTHER.


Loring Craymer

Apr 17, 2017, 12:15:57 AM
to antlr-discussion
Unless Ter significantly changed literal recognition in ANTLR 4, a literal in a parser rule generates a lexer rule that is inserted at the front of the lexer grammar.  The behavior you describe was correct for ANTLR 2, but not ANTLR 3 (and 4, I expect).

--Loring