ANTLR not parsing unambiguous string correctly (potentially due to prefering a shorter match), why?

31 views
Skip to first unread message

Natanji

unread,
Dec 10, 2017, 10:51:51 AM12/10/17
to antlr-discussion
While doing Advent of Code 2017's exercise from yesterday (http://adventofcode.com/2017/day/9), I figured that aside from writing an own simple parser, an elegant and quick way to solve this would be defining an ANTLR grammar and making it generate the parser instead.
But while doing so, some unexpected behaviour popped up that I would like to understand, perhaps somebody wants to help me out there?

First, my grammar:


grammar Day9;

expr    : '{}'          # EmptyGroup
        | '<>' # EmptyGarbage
        | '<' garb '>' # Garbage
        | '{' expr '}'  # Group
        | expr ',' expr # Multi
        ;

garb :  '!!'        # Ignore
        | '!' CHAR  # CanceledGarbage
        | CHAR     #CharGarbage
        | garb garb #MultiGarbage
        ;

CHAR  :   [0-9a-zA-Z]
      |   ','
      |   '_'
      |   '<'
      |   '>'
      |   '{'
      |   '}'
      |   '"'
      |   '\''
      ;


The grammar is supposed to parse everything between < and > as "garbage" which can later be discarded. Additionally, every character after an exclamation mark should be ignored, meaning that even a closing > should be ignored if an exclamation mark is right in front of it. As a tiny example, consider the string

{<!>>}

Here the curly braces should be parsed as a Group, and then the first and last pointy braces should be Garbage, and inside the garb rule it should become CanceledGarbage. I mean, by hand I can parse the string this way...
But ANTLR does it differently:

line 1:3 missing CHAR at '>'
line 1:4 extraneous input '>' expecting '}'

So for some reason the parser says it is missing a char at the token '>'. But that doesn't make any sense to me: CHAR can be '>', it's right there in the token rule! Indeed, removing the '<' from CHAR doesn't change the output in the slightest. I can also put all characters between the [], it doesn't change a thing.
And okay, this could be because for some reason, ANTLR is treating the first '>' as the closing pointy brace of the Garbage rule. But then it doesn't make sense that it afterwards complains about an extra '>' before the closing curly brace '}'. ANTLR is supposed to match as long as it can, isn't it?

Can anyone tell me what I am doing wrong and how I can fix my grammar to make '>' indeed be part of CHAR, and parsed that way? There is really only one unambigious way to parse the example string correctly, but ANTLR apparently isn't finding it and I don't understand why.
 

Mike Lischke

unread,
Dec 10, 2017, 11:08:09 AM12/10/17
to antlr-di...@googlegroups.com
grammar Day9;

expr    : '{}'          # EmptyGroup
        | '<>' # EmptyGarbage
        | '<' garb '>' # Garbage
        | '{' expr '}'  # Group
        | expr ',' expr # Multi
        ;

garb :  '!!'        # Ignore
        | '!' CHAR  # CanceledGarbage
        | CHAR     #CharGarbage
        | garb garb #MultiGarbage
        ;

CHAR  :   [0-9a-zA-Z]
      |   ','
      |   '_'
      |   '<'
      |   '>'
      |   '{'
      |   '}'
      |   '"'
      |   '\''
      ;


The grammar is supposed to parse everything between < and > as "garbage" which can later be discarded. Additionally, every character after an exclamation mark should be ignored, meaning that even a closing > should be ignored if an exclamation mark is right in front of it. As a tiny example, consider the string

{<!>>}

Here the curly braces should be parsed as a Group, and then the first and last pointy braces should be Garbage, and inside the garb rule it should become CanceledGarbage. I mean, by hand I can parse the string this way...
But ANTLR does it differently:

Very likely again that problem with implicitly defined tokens. You are not defining tokens separately (e.g. for '{}', but also for '<' and '>') and you expect '>' to match CHAR. However, ANTLR4 will implicitly create a token for '>' before your CHAR token and match that. Hence the error.

Therefore let me repeat as stated several times before: define your tokens explicitly! For example:

GREATER_THAN: '>';
LESS_THAN: '>';
CURLY_BRACES: '{}';
etc.

Finally: you cannot match '>' as both, a dedicated token in expr and as part of CHAR. Only one of them will match (the first one in the token list, since both occurrences have the same length).

Natanji

unread,
Dec 10, 2017, 12:05:58 PM12/10/17
to antlr-discussion

Thank you, your explanation about explicitly defining tokens does make sense, I was not aware of the implicit definitions causing problems. However, defining them explicitly and making sure each token is unambiguous/non-overlapping doesn't seem to work either. It outputs the (very peculiar) error:

line 1:0 mismatched input '{' expecting {'{', '<'}

Why can't it match the leading '{' (which should now be the same token everywhere, right?) when it expects '{' or '<'? That seems self-contradictory, I don't understand what's wrong.

Here is the modified grammar:

rammar Garbage;

expr    : CURLY_OPEN CURLY_CLOSE        # EmptyGroup
        | LESS_THAN GREATER_THAN           # EmptyGarbage
        | LESS_THAN garb GREATER_THAN   # Garbage
        | CURLY_OPEN expr CURLY_CLOSE   # Group
        | expr COMMA expr               # Multi
        ;

garb :  EXCLAMATION EXCLAMATION # Ignore
        | EXCLAMATION char      # CanceledGarbage
        | char                  # CharGarbage
        | garb garb             # MultiGarbage
        ;

char  :   CHAR
      |   COMMA
      |   LESS_THAN
      |   GREATER_THAN
      |   CURLY_OPEN
      |   CURLY_CLOSE
      ;

CHAR : [0-9a-zA-Z_"\'];

CURLY_OPEN: '{';
CURLY_CLOSE: '}';
LESS_THAN: '<';
GREATER_THAN: '>';
EXCLAMATION: '!';
COMMA: ',';
--
You received this message because you are subscribed to a topic in the Google Groups "antlr-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antlr-discussion/Ah4ajVkSX_A/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mike Lischke

unread,
Dec 11, 2017, 3:33:38 AM12/11/17
to antlr-di...@googlegroups.com

Am 10.12.2017 um 18:05 schrieb Natanji <nat...@gmail.com>:

line 1:0 mismatched input '{' expecting {'{', '<'}

For me this is a sign that you still have 2 different token values for `{` (which usually happens when you define implicit tokens.


Natanji

unread,
Dec 11, 2017, 5:07:12 AM12/11/17
to 'Mike Lischke' via antlr-discussion

But look at the modified grammar I provided: the only token in which '{' appears is CURLY_OPEN. I can literally Ctrl+F search the grammar file for all occurences of '{' and there is only a single one. There is also no implicit token defined anywhere in the grammar, from what I can tell. So I don't understand your explanation.

-Natanji

Mike Lischke

unread,
Dec 11, 2017, 6:19:41 AM12/11/17
to antlr-di...@googlegroups.com

But look at the modified grammar I provided: the only token in which '{' appears is CURLY_OPEN. I can literally Ctrl+F search the grammar file for all occurences of '{' and there is only a single one. There is also no implicit token defined anywhere in the grammar, from what I can tell. So I don't understand your explanation.

Then maybe you did not regenerate your lexer/parser? Dump the token stream to see what your lexer recognized (type and content of tokens). That should give you a hint where you have the unexpected token value (which is what this error is about, one token type was expected, but another came in, for the same token string).


Natanji

unread,
Dec 11, 2017, 8:57:45 AM12/11/17
to 'Mike Lischke' via antlr-discussion

Okay, you're right - I figured it out now. Thank you for your time. And phew, this turned out much more complicated than I would have thought. Writing correct grammars is hard!

-Natanji

Reply all
Reply to author
Forward
0 new messages