Ordering lexer rules in a grammar

Mark Baumann

Sep 11, 2017, 9:40:40 PM

to antlr-discussion

I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.

I want the parser to be able to handle something like this:

Hello <<name>>, how are you?

At runtime, after parsing, I will replace "<<name>>" with the user's name, for instance, so that it should come out as "Hello Mark, how are you?"
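For illustration only, the substitution step on its own can be sketched in Python without ANTLR. This is a regex stand-in, not the parser-based approach asked about here. The names `TAG`, `render` and `values` are hypothetical, and the sketch assumes the replacement values live in a plain dictionary.

```python
import re

# Matches a tag such as <<name>>; the identifier pattern mirrors the ID rule
# (a letter followed by letters or digits).
TAG = re.compile(r"<<([a-zA-Z][a-zA-Z0-9]*)>>")

def render(template, values):
    """Replace each <<id>> tag with values[id]; unknown tags are left as-is."""
    return TAG.sub(lambda m: str(values.get(m.group(1), m.group(0))), template)

print(render("Hello <<name>>, how are you?", {"name": "Mark"}))
```

Leaving unknown tags untouched is a design choice for the sketch; raising an error instead would be equally reasonable.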

So mostly I am parsing text words (and punctuation, symbols, etc.), except for the occasional "<<something>>" tag, which I am calling a "func" in my grammar.

Here is my grammar:

doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;

WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;

Side note: I added "PUNCT?" at the end of the "item" rule because it is possible, such as in the example sentence I gave above, to have a comma appear right after a "func". But since you can also have a comma after a "WORD" then I decided to put the punctuation in "item" instead of in both of "func" and "WORD".

If I run this parser on the above sentence, I get a parse tree that looks like this:

So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that uses double angle brackets around a word (so no "<< WORD >>") and only a rule that says "<< ID >>", so I'm not clear on why that is happening.

If I swap the order of "ID" and "WORD" in my grammar (so now ID comes before WORD) and run the parser, I get a parse tree like this:

So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.

So if WORD comes first, then IDs are not handled how I want, and if ID comes first, then WORDs aren't handled how I want. How do I get past this conundrum?

I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).

Thanks for any help!

Mark

Norman Dunbar

Sep 12, 2017, 9:38:40 AM

to antlr-di...@googlegroups.com

Hi Mark.

What is the grammar you used for the second image? You don't get many items and there's a name rule too.

Cheers,
Norm.


Mark Baumann

Sep 12, 2017, 10:31:23 AM

to antlr-di...@googlegroups.com

Hi Norm, thanks for your reply.

The highlighted portions in the image are parser errors, so it is not recognizing "Hello", "how", "are", and "you?" as items, WORDs, or anything else.

Here is the grammar that was used to generate the second image. It is identical to the first grammar, except the order of the WORD and ID rules is swapped:

doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;

WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;

Thanks,

Mark

Norman Dunbar

Sep 12, 2017, 10:43:07 AM

to antlr-di...@googlegroups.com

Thanks. And apologies, I thought you had missed a rule called "name" from the grammar, that was why I asked. Silly me, it was part of your test text.

Cheers,
Norm.


Norman Dunbar

Sep 12, 2017, 11:33:39 AM

to antlr-di...@googlegroups.com

Hi Mark,

the following grammar works for me. It's pretty much yours but with
added extras:

I moved WORD from the lexer to the parser;

I added a FUNC_BEGIN and FUNC_END token;

PUNCT is optional after a FUNC_END and after a word, in the parser
rules. This (hopefully) matches your wish that punctuation can follow
a word or a func but cannot appear inside an ID.

The WORD lexer rule is gone, history! lost etc.

//------------------------------------
grammar Mark;

doc: item* EOF ;
item: (func | word) ;
func: FUNC_BEGIN ID FUNC_END PUNCT? ;
word : ID PUNCT? ;

WS : [ \t\n\r] -> skip ;

ID: LETTER ( LETTER | DIGIT)* ;

FUNC_BEGIN : '<<' ;
FUNC_END : '>>' ;
PUNCT : [.,?!] ;

fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;

fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;

fragment CHAR : (LETTER | DIGIT | SYMB ) ;

//------------------------------------

Running this against "Hello <<name>>, how are you?" or "Hello <<name>>
how are you" etc works fine for me, as it does against "Greetings
<<Earthling>>, Take me to your leader, please, at once, now! And tell me
your <<name>>."

Also, if there's punctuation in a func, after the FUNC_BEGIN and/or
before the FUNC_END, then it gets rejected - so no punctuation in an ID
as per your rules.
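As a rough cross-check of the grammar above without running ANTLR, here is a small Python sketch that mimics its lexer and parser rules by hand. The regexes are an assumed translation of the lexer rules, and `tokenize` and `accepts` are hypothetical helper names. One simplification is loudly assumed: the sketch rejects input containing a character no rule matches, whereas ANTLR reports a token recognition error and carries on.

```python
import re

# Regex stand-ins for the lexer rules (assumed translation of the grammar).
# The rules start with disjoint characters, so first-match alternation is enough.
TOKEN_SPEC = [
    ("WS", r"[ \t\n\r]"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("FUNC_BEGIN", r"<<"),
    ("FUNC_END", r">>"),
    ("PUNCT", r"[.,?!]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Return the list of token type names, skipping WS."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # Simplification: fail hard instead of reporting and continuing.
            raise SyntaxError(f"unrecognized character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append(m.lastgroup)
        pos = m.end()
    return tokens

def accepts(text):
    """True if text matches  doc: item* EOF ;  with item: func | word."""
    try:
        toks = tokenize(text)
    except SyntaxError:
        return False
    i = 0
    while i < len(toks):
        if toks[i] == "ID":                                       # word: ID PUNCT?
            i += 1
        elif toks[i:i + 3] == ["FUNC_BEGIN", "ID", "FUNC_END"]:   # func: ... PUNCT?
            i += 3
        else:
            return False
        if i < len(toks) and toks[i] == "PUNCT":                  # the optional PUNCT
            i += 1
    return True

print(accepts("Hello <<name>>, how are you?"))
print(accepts("Hello <<na,me>>"))
```

The sketch accepts the three sample sentences above and rejects tags with punctuation inside the angle brackets, such as "<<na,me>>" or "<<name,>>".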

Hopefully, this is acceptable.

DISCLAIMER: I'm not a compiler writer, nor do I play one on TV!

Cheers,
Norm.


Mark Baumann

Sep 12, 2017, 12:09:53 PM

to antlr-di...@googlegroups.com

Norm, thanks so much for the help!

If you have time, I was wondering: why did making "word" a parser rule instead of a lexer rule help solve the issue? And/or, how did you decide to make that change? Like I said before, I am new to this so I am trying to learn the thought process that goes into it. Thanks!


Norman Dunbar

Sep 13, 2017, 4:25:30 AM

to antlr-di...@googlegroups.com

Morning Mark,

Because ID and WORD were pretty similar, apart from the symbol characters that WORD also accepts, there were always going to be problems deciding which was which. The lexer takes as much of the input stream as it can for each token, and it does so without knowing which token the parser is expecting at that point. When two rules match the same amount of text, the one that comes first in the grammar wins out. So for plain letters and digits, whichever of WORD and ID is listed first takes everything, and the later rule only wins when it can match something longer. (I think!)
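The longest-match and first-rule-wins behaviour described above can be sketched in Python. This is a simulation under stated assumptions, not ANTLR itself. The regexes are an assumed translation of the lexer rules from the original grammar, `tokenize`, `word_first` and `id_first` are hypothetical names, and the implicit '<<' and '>>' literals are listed ahead of the named rules.

```python
import re

# Regex stand-ins for the original lexer rules (assumed translation).
SYMB = r"[^a-zA-Z0-9.,?! |{}<>]"
RULES = {
    "'<<'": re.compile(r"<<"),
    "'>>'": re.compile(r">>"),
    "WS": re.compile(r"[ \t\n\r]"),
    "WORD": re.compile("(?:[a-zA-Z0-9]|" + SYMB + ")+"),
    "ID": re.compile(r"[a-zA-Z][a-zA-Z0-9]*"),
    "PUNCT": re.compile(r"[.,?!]"),
}

def tokenize(text, order):
    """Longest match wins; on a tie, the rule listed first in `order` wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name in order:
            m = RULES[name].match(text, pos)
            # Strict '>' keeps the earlier rule when two matches are equally long.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

text = "Hello <<name>>, how are you?"
word_first = ["'<<'", "'>>'", "WS", "WORD", "ID", "PUNCT"]
id_first = ["'<<'", "'>>'", "WS", "ID", "WORD", "PUNCT"]

print([name for name, _ in tokenize(text, word_first)])
print([name for name, _ in tokenize(text, id_first)])
print(tokenize("don't", id_first))
```

With WORD listed first, every run of letters, including "name", comes out as WORD, so the func rule never sees an ID. With ID listed first, they all come out as ID, so the parser never sees a WORD. The last call shows the longest-match side: even with ID first, "don't" becomes a single WORD, because WORD can match past the apostrophe and ID cannot.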

I tried a few other things in the lexer before I changed up to the parser, but nothing worked properly. So, sometimes, you have to throw these things up the tree to a higher level. That could be getting the parser to do it, or in the case of parser rules, letting your listener or walker do it in code.

In this case, the parser rule did the job.

Don't forget though, I'm not all that good at this parser stuff, much as it has a huge fascination for me, so I muddle along and sometimes get things to work! (I have one grammar accepted into the ANTLR4 grammar hall of fame! The tnsnames grammar for Oracle stuff.) You can find these example grammars on GitHub, and they make interesting reading, especially when you are starting out.

HTH

Cheers,
Norm.

Mark Baumann

Sep 13, 2017, 10:37:29 AM

to antlr-discussion

Norm,

Thanks so much for your great explanation!

I'll take a look at the sample grammars on GitHub; that sounds very helpful.

Cheers,

Mark
