Antlr4 grammar/token design.

James Hart

Apr 24, 2013, 10:12:34 AM

to antlr-di...@googlegroups.com

I ran into a counterintuitive behavior in ANTLR 4 that I thought was a bug, but I am told it is by design:
Given the grammar:

grammar lexissue;
example : CONTAINER*;
dimensions : INT ('x'|'X') INT # setWindowDimensions;

CONTAINER: [a-z] [a-zA-Z]*;
INT: [0-9]+;
and listing:

acontainer
a
x

I get a token list of:

[@0,0:9='acontainer',<3>,1:0]
[@1,11:11='a',<3>,2:0]
[@2,13:13='x',<1>,3:0]
[@3,15:14='<EOF>',<-1>,4:0]

Notice that token @2 has type <1>. I would expect the token to follow the lexer rules and have type <3> (CONTAINER). It is unexpected that the parser rule dimensions would affect the tokenization, because parser rules should not affect lexing at all.

However, I was told this is by design because I added 'x' as a token identifier in the grammar. This undermines how I am trying to solve one of my grammar's use cases.

In my full-blown grammar I am using a predicate which temporarily allows something like [124x256] to be parsed without triggering the CONTAINER token. This should greatly simplify my parser, as I can then use a grammar rule like the one in my example and still match an x container later. This ANTLR 4 behavior does the opposite of what I need: it forces 'x' to be a token.

In the traditional sense a grammar doesn't affect the tokenization, because lexing happens before anything is known about the parsing. ANTLR 4 bends the rules a bit. So how do I solve my use case without overcomplicating my parser?

Kevin J. Cummings

Apr 24, 2013, 11:03:10 AM

to antlr-di...@googlegroups.com

On 04/24/2013 10:12 AM, James Hart wrote:
> I ran into a counterintuitive behavior in ANTLR 4 that I thought was a bug, but I am told it is by design:
> Given the grammar:
>
> grammar lexissue;
> example : CONTAINER*;
> dimensions : INT ('x'|'X') INT # setWindowDimensions;
>
> CONTAINER: [a-z] [a-zA-Z]*;
> INT: [0-9]+;
> and listing:
>
> acontainer
> a
> x
>
> I get a token list of:
>
> [@0,0:9='acontainer',<3>,1:0]
> [@1,11:11='a',<3>,2:0]
> [@2,13:13='x',<1>,3:0]
> [@3,15:14='<EOF>',<-1>,4:0]
>
> Notice that token @2 has type <1>. I would expect the token to follow the lexer rules and have type <3> (CONTAINER). It is unexpected that the parser rule dimensions would affect the tokenization, because parser rules should not affect lexing at all.

You used 'x' in a parser rule. This creates an unnamed TOKEN for 'x'.
I'm not sure from your snippet why 'a' is a token. Did you forget
something in your example?

> However, I was told this is by design because I added 'x' as a token identifier in the grammar. This undermines how I am trying to solve one of my grammar's use cases.

You can try the following changed rules:

dimension: INT X INT;

X: 'x' | 'X' ;
CONTAINER: 'a'..'z' ('a'..'z' | 'A'..'Z')*
| X;

--
Kevin J. Cummings
kjc...@verizon.net
cumm...@kjchome.homeip.net
cumm...@kjc386.framingham.ma.us
Registered Linux User #1232 (http://www.linuxcounter.net/)

Jim Idle

Apr 24, 2013, 8:30:25 PM

to antlr-di...@googlegroups.com

I am afraid that your question is too garbled to answer properly, as your example grammar seems incomplete. I feel that you need to take a few steps back and learn how the system works before trying to solve more complicated issues.

You may be trying to bite off more than you can chew just yet. You will be better off going through all the example grammars and making sure you understand them. The art is to understand how to solve problems using what ANTLR does, not to try to solve a problem the way you think it should be solved and then be surprised when ANTLR does not do what you expect ;)

However, if you make 'x' a token, then you will get 'x' back, not CONTAINER. Using the literal does not mean that 'x' is some kind of temporary token - it means that the parser auto-creates a token (which is why the advice not to use literals in parser grammars unless you really know where your towel is has been handed out here since before the days of Zarquon).

So, you are making a few mistakes here I think:

1) Be as loose as you can be with lexer and parser rules - don't try to define different token types to match exactly the same patterns (that said, see the use of lexer modes)

2) Don't use 'literals' in your parser grammar

3) Don't try to parse things with the lexer - that is the parser's job

4) Move errors as far down the chain as you can, as you have more context in which to give a better error - don't try to validate tokens in the lexer, gather anything that is roughly correct, then validate semantically. Don't try to force syntactical order or exclusions in the parser, do so in the walker, and so on.

Hope this helps.

Jim
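
To make the implicit-token behavior and points 1 and 2 concrete, here is a minimal sketch of the example grammar with no literals in the parser rules. The names LOWER_X, UPPER_X and container, and the WS rule, are assumptions added for illustration; none of them appear in the original grammar. In a combined grammar, ANTLR 4 defines the tokens it creates for parser literals ahead of the named lexer rules, which is why a lone x stopped being a CONTAINER. Writing those tokens out explicitly makes the ordering visible, and the hypothetical container rule lets the parser accept either token type.

```antlr
// Sketch only: a literal-free variant of the example grammar.
grammar LexIssueNoLiterals;

example    : container* EOF ;
dimensions : INT (LOWER_X | UPPER_X) INT ;

// Hypothetical parser rule: a lone 'x' is still a valid container name.
container  : CONTAINER | LOWER_X ;

// Explicit tokens replacing the 'x' and 'X' literals. They come before
// CONTAINER, so on a one-character tie the earlier rule wins.
LOWER_X   : 'x' ;
UPPER_X   : 'X' ;
CONTAINER : [a-z] [a-zA-Z]* ;
INT       : [0-9]+ ;

// Assumed whitespace handling; the original snippet has no such rule.
WS        : [ \t\r\n]+ -> skip ;
```

With this sketch a lone x still lexes as LOWER_X rather than CONTAINER, but the parser accepts it wherever a container is allowed, so the distinction is resolved in the parser instead of the lexer.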


James Hart

Apr 24, 2013, 8:35:24 PM

to antlr-di...@googlegroups.com

Kevin,

<snip>

> > I get a token list of:
> >
> > [@0,0:9='acontainer',<3>,1:0]
> > [@1,11:11='a',<3>,2:0]
> > [@2,13:13='x',<1>,3:0]
> > [@3,15:14='<EOF>',<-1>,4:0]
> >
> > Notice that token @2 has type <1>. I would expect the token to follow the lexer rules and have type <3> (CONTAINER). It is unexpected that the parser rule dimensions would affect the tokenization, because parser rules should not affect lexing at all.
>
> You used 'x' in a parser rule. This creates an unnamed TOKEN for 'x'.
> I'm not sure from your snippet why 'a' is a token. Did you forget
> something in your example?

The listing has an 'a'. It should be resolved as the same token as container, which it is.

> > However, I was told this is by design because I added 'x' as a token identifier in the grammar. This undermines how I am trying to solve one of my grammar's use cases.
>
> You can try the following changed rules:
>
> dimension: INT X INT;
>
> X: 'x' | 'X' ;
> CONTAINER: 'a'..'z' ('a'..'z' | 'A'..'Z')*
> | X;

That wouldn't allow just an 'x' to be tokenized as a CONTAINER. However, I think you are pointing me in the right direction. If I add the predicate condition to an explicit X token like the one you show and use it in the parser rule, it should resolve all the issues. It is a bit counterintuitive to have to do it this way for this edge case, but it should work!

Thanks for responding,

James
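
The plan described above, an explicit X token guarded by a predicate, might look roughly like the following. This is a sketch under several assumptions: the Java target, a hypothetical inDimension flag toggled by the brackets from the [124x256] example, and invented LBRACK, RBRACK and WS rules. The predicate in the full grammar is not shown in this thread and may work quite differently.

```antlr
// Sketch only: predicate-gated X token, assuming the Java target.
grammar DimPredicate;

@lexer::members {
    // Hypothetical flag: true only between '[' and ']'.
    private boolean inDimension = false;
}

example    : (CONTAINER | dimensions)* EOF ;
dimensions : LBRACK INT X INT RBRACK ;

LBRACK    : '[' { inDimension = true; } ;
RBRACK    : ']' { inDimension = false; } ;

// Listed before CONTAINER so it wins the one-character tie, but only
// while the predicate holds; otherwise 'x' falls through to CONTAINER.
X         : [xX] {inDimension}? ;
CONTAINER : [a-z] [a-zA-Z]* ;
INT       : [0-9]+ ;

// Assumed whitespace handling.
WS        : [ \t\r\n]+ -> skip ;
```

Outside brackets the predicate is false, so a lone x is tokenized as CONTAINER; inside brackets the same character becomes X.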

Jim Idle

Apr 24, 2013, 8:49:14 PM

to antlr-di...@googlegroups.com

Don't try to get specific about when x is a container and when it is not. CONTAINER is going to match it anyway. Use lexer modes or don't bother distinguishing.

Jim
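
For reference, the lexer-mode route could be sketched as below. It assumes the lexer and parser live in two separate grammar files, and that the dimension syntax is bracketed as in the [124x256] example. All rule and mode names (DimLexer, DimParser, DIM, DIM_INT, DIM_X) are hypothetical.

```antlr
// File DimLexer.g4 (sketch only).
lexer grammar DimLexer;

LBRACK    : '[' -> pushMode(DIM) ;   // enter the "island"
CONTAINER : [a-z] [a-zA-Z]* ;        // in default mode, a lone 'x' is a CONTAINER
WS        : [ \t\r\n]+ -> skip ;     // assumed whitespace handling

mode DIM;
DIM_INT   : [0-9]+ ;
DIM_X     : [xX] ;                   // 'x' is a separator only inside brackets
RBRACK    : ']' -> popMode ;         // return to default mode

// File DimParser.g4 (sketch only).
parser grammar DimParser;
options { tokenVocab = DimLexer; }

example    : (CONTAINER | dimensions)* EOF ;
dimensions : LBRACK DIM_INT DIM_X DIM_INT RBRACK ;
```

Because CONTAINER does not exist in the DIM mode, nothing competes with DIM_X for the x there, and no predicate is needed.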

James Hart

Apr 25, 2013, 1:13:48 AM

to antlr-di...@googlegroups.com

Jim thanks for your advice.

Discussing ANTLR with knowledgeable people provides much more insight than the example grammars (which I have looked through). With everyone's help, I now understand that an implicit token in a PARSER rule can affect the LEXICAL analysis phase. This seems to be the case only when the lexical grammar is imported and combined in the same file as the parser grammar. While learning ANTLR I defined the lexer separately from the parser grammar and generated the code for each separately. In those cases an implicit token would not bleed over into the lexical analysis phase. The end result is that I never saw 'x' become an implicit token, and everything was happy.

My ultimate use case is parsing an existing language which has several 'island' grammars that have a different structure.

I did want to use lexical modes. However, the infrastructure I'm working in requires an imported or embedded lexer grammar. Lexical modes do not work with imported grammars; this is an outstanding feature targeted for ANTLR 4.x on the git issue board. ANTLR doesn't seem to support lexer modes in a combined grammar either. That is why I cannot use lexical modes.

That leaves me with either generalizing several token types or using semantic predicates in the lexer grammar. Using the predicates in the imported lexer grammar will set up the parser grammar nicely for the day I can use lexical modes instead.

The option of using more general tokens will make the runtime more complex and harder to maintain. A special parse of the token's text will need to be done depending on what context the parser is in. It isn't as trivial as the 'keywords as identifiers' problem that my example implies.

The infrastructure is complex, and changing it to support separate lexer and parser grammars will not be as trivial as it sounds. I'll need good reasons to spend time on it when ANTLR support for importing multi-mode lexers is promised for the (near?) future. What would be some good points to bring up when vying for the time and resources to change the infrastructure?