Trying to use keywords as identifiers in ANTLR4; not working


rtm...@googlemail.com

Feb 11, 2016, 11:10:18 AM
to antlr-discussion
Hi all,
this is a continuation of SO post <http://stackoverflow.com/questions/35304065/trying-to-use-keywords-as-identifiers-in-antlr4-not-working>. Thought I'd follow up here; it may be a better place. Code to reproduce is below.

REGULAR_IDENT used to be a complex regexp, but I've simplified it to almost nothing, and done the same with all keywords such as FILESTREAM. I've also removed all fragments, to minimise possible interfering factors.

Start rule for parse is start_rule (FYI java org.antlr.v4.gui.TestRig MSSQL start_rule -gui -trace -tokens)


If I use "xxx" (without quotes, likewise for all following inputs) it accepts as it matches (1).

If I use "action" it matches fine because of (2).
ACTION and a few other words were clearly copied from KEYWORD_AS_IDENT into regular_ident, which seems redundant. Why not rely on KEYWORD_AS_IDENT directly?

Well, if I use "last" (which appears at (3) in KEYWORD_AS_IDENT only) it fails. Here's my debug trace:

last    <<-- my input
^Z
[@0,0:3='last',<2>,1:0]        <<-- I've checked, is correct
[@1,6:5='<EOF>',<-1>,2:0]
enter   start_rule, LT(1)=last
enter   regular_ident, LT(1)=last
line 1:0 mismatched input 'last' expecting {'filestream',
'sparse', 'no', 'action', 'persisted', KEYWORD_AS_IDENT, 'xxx'}
exit    regular_ident, LT(1)=<EOF>
consume [@1,6:5='<EOF>',<-1>,2:0] rule start_rule
exit    start_rule, LT(1)=<EOF>

Why is the input 'last', which should match LAST in KEYWORD_AS_IDENT, not being recognised? The error message specifically lists KEYWORD_AS_IDENT as something it was expecting, and LAST is in that, so...?
Either this is a bug I should have run into before, or I've fundamentally misunderstood the lexer. Can anyone reproduce? Any thoughts?

thanks

jan

-- code below --

grammar MSSQL;

PRIOR :'prior';
LAST :'last';
ABSOLUTE :'absolute';
RELATIVE :'relative';
FILETABLE :'filetable';
FILESTREAM :'filestream';
SPARSE :'sparse';
NO :'no';
ACTION :'action';
PERSISTED :'persisted';

KEYWORD_AS_IDENT :
    PRIOR
  | LAST                // <- (3)
  | ABSOLUTE
  | RELATIVE
  | FILETABLE
// these below have been copied to regular_ident
  | FILESTREAM
  | SPARSE
  | NO
  | ACTION
  | PERSISTED
;

start_rule :
        regular_ident
        EOF
    ;

regular_ident :
   FILESTREAM
  | SPARSE
  | NO
  | ACTION                // <- (2)
  | PERSISTED

  | KEYWORD_AS_IDENT
  | REGULAR_IDENT        // <- (1)
    ;

REGULAR_IDENT : 'xxx' ; // for simplicity

SKIPWS : [ \t\r\n]+ -> skip ;






Eric Vergnaud

Feb 12, 2016, 6:59:07 AM
to antlr-discussion
LAST is declared before KEYWORD_AS_IDENT so when the lexer encounters 'last', it generates a LAST token, not a KEYWORD_AS_IDENT.
Your start rule does not accept a LAST token as valid input, hence the shouting.
Your grammar will actually NEVER produce a KEYWORD_AS_IDENT token, because another valid token will match before.
It seems you are trying to get the lexer do the job of the parser i.e. handle multiple semantic alternatives, but at the time the token reaches the parser it's too late...
Have you tried making KEYWORD_AS_IDENT a parser rule (lowercase) rather than a lexer rule?
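
For reference, a minimal sketch of the parser-rule approach suggested above, reusing the token names from the original grammar. The lowercase rule name keyword_as_ident is an assumption, not something taken from the thread's code; the keyword tokens stay as ordinary lexer rules and the parser decides which of them may stand in for an identifier.

```antlr
grammar MSSQL;

start_rule
    : regular_ident EOF
    ;

regular_ident
    : keyword_as_ident      // any keyword that may double as an identifier
    | REGULAR_IDENT
    ;

// Parser rule (lowercase, hypothetical name): the lexer still emits
// PRIOR, LAST, ... and the parser groups them here.
keyword_as_ident
    : PRIOR
    | LAST
    | ABSOLUTE
    | RELATIVE
    | FILETABLE
    | FILESTREAM
    | SPARSE
    | NO
    | ACTION
    | PERSISTED
    ;

PRIOR      : 'prior';
LAST       : 'last';
ABSOLUTE   : 'absolute';
RELATIVE   : 'relative';
FILETABLE  : 'filetable';
FILESTREAM : 'filestream';
SPARSE     : 'sparse';
NO         : 'no';
ACTION     : 'action';
PERSISTED  : 'persisted';

REGULAR_IDENT : 'xxx' ; // simplified, as in the original post

SKIPWS : [ \t\r\n]+ -> skip ;
```

With this layout the input 'last' lexes as LAST, and regular_ident accepts it through keyword_as_ident, so no single piece of text has to map to two token types.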

Nilo Roberto da Cruz Paim

Feb 12, 2016, 10:43:51 AM
to antlr-di...@googlegroups.com

Hi, all.

I’m using ANTLR4, C# version, under Visual Studio. It works nicely.

AFAIK (but I may be wrong), listeners and visitors can be used for similar tasks. I’ve already done projects using both.

So, my question is: how do I decide between them? Or can I use both in the same project? Note that all the samples I’ve found use a listener OR a visitor, but not both at the same time. Can it be done?

Is there some “best practice” about this?

TIA,

Nilo - Brazil



rtm...@googlemail.com

Feb 12, 2016, 10:47:54 AM
to antlr-discussion
Great stuff. I've misunderstood the lexer, so major user error.
If I switch things around so that the KEYWORD_AS_IDENT definition appears before the block of leaf lexer definitions (i.e. before the definitions of PRIOR/LAST etc.), it works.
As you say, I'm trying to abuse the lexer into doing the parser's work and I've not had this problem before because I've not done things this way before.

That said, I still don't get the lexer's semantics. If I had

KEYWORD_AS_IDENT: PRIOR | LAST ;
PRIOR :'prior';
LAST :'last';


then this can only mean anything if antlr ultimately inlines the rules leaf defs to get

KEYWORD_AS_IDENT: 'prior' | 'last' ;
PRIOR :'prior';
LAST :'last';


now I get it.

For posterity, could you post your answer on stackoverflow and I'll accept it, or would you prefer I copy it over?

thank you

jan

Mike Lischke

Feb 14, 2016, 6:52:22 AM
to antlr-di...@googlegroups.com
Jan,

> That said, I still don't get the lexer's semantics. If I had
>
> KEYWORD_AS_IDENT: PRIOR | LAST ;
> PRIOR :'prior';
> LAST :'last';
>
> then this can only mean anything if antlr ultimately inlines the rules leaf defs to get
>
> KEYWORD_AS_IDENT: 'prior' | 'last' ;
> PRIOR :'prior';
> LAST :'last';
>
> now I get it.

That cannot work. The lexer can only create one token from a given input (unless you do some tricky processing and type changing in an action depending on some condition). You cannot tell it to sometimes return KEYWORD_AS_IDENT and sometimes PRIOR for the same input 'prior'.


> For posterity, could you post your answer on stackoverflow and I'll accept it, or would you prefer I copy it over?

Eric mentioned the same suggestion as I did on SO: use a parser rule instead.



rtm...@googlemail.com

Feb 14, 2016, 5:51:34 PM
to antlr-discussion
Sorry Mike, I misunderstood part of your answer on SO. You said "...as this was the only way to get this reliably working", by which I took you to mean that you had the same problem as I did and weren't sure why. I agree a parser rule is the correct approach, but the question was about why the lexer stuff wasn't working as I expected. I was trying to understand why. I don't like working around things that I don't understand.

So, what are the semantics of the lexer? You said

"That cannot work. The lexer can only create one token from a given input (unless you do some tricky processing and type changing in an action depending on some condition). You cannot tell it to sometimes return KEYWORD_AS_IDENT and sometimes PRIOR for the same input 'prior'."

After Eric's explanation I considered that, and the possibility that a new token 'overrides' an earlier one and rejected both, so again, what are the lexer's semantics? What does "KEYWORD_AS_IDENT: PRIOR | LAST ;" mean? If the semantics were of substitution then what I wrote would explain the behaviour.
Assuming the substitution was done on this


KEYWORD_AS_IDENT: PRIOR | LAST ;
PRIOR :'prior';
LAST :'last';

to get to this

KEYWORD_AS_IDENT: 'prior' | 'last' ;
PRIOR :'prior';
LAST :'last';

then this makes sense as
KEYWORD_AS_IDENT matches text 'prior', and anything after it (viz. PRIOR) gets ignored so KEYWORD_AS_IDENT gets returned. That fits the behaviour I saw.

If that's not what happens, I don't know what is.

anyway, sorry for that mixup and thanks again

jan

Mike Lischke

Feb 15, 2016, 2:55:50 AM
to antlr-di...@googlegroups.com
> Sorry Mike, I misunderstood part of your answer on SO. You said "...as this was the only way to get this reliably working", by which I took you to mean that you had the same problem as I did and weren't sure why. I agree a parser rule is the correct approach, but the question was about why the lexer stuff wasn't working as I expected. I was trying to understand why. I don't like working around things that I don't understand.

Ah, sorry, I thought you just wanted to have a working solution.

>
> So, what are the semantics of the lexer? You said
>
> "That cannot work. The lexer can only create one token from a given input (unless you do some tricky processing and type changing in an action depending on some condition). You cannot tell it to sometimes return KEYWORD_AS_IDENT and sometimes PRIOR for the same input 'prior'."
>
> After Eric's explanation I considered that, and the possibility that a new token 'overrides' an earlier one and rejected both, so again, what are the lexer's semantics? What does "KEYWORD_AS_IDENT: PRIOR | LAST ;" mean? If the semantics were of substitution then what I wrote would explain the behaviour.

I have to admit, I only tested with ANTLR3. It may well be that ANTLR4's behavior is different. I found the idea of creating a lexer rule appealing, especially as I would then avoid having to add action code to change the resulting token type. So I tested that idea, only to get an error saying that either KEYWORD_AS_IDENT or the keywords mentioned in it would no longer be matched, depending on where the rules appear in the grammar. Until then I had hoped for some trick ANTLR would use internally to make this work, but I realized that my original expectation still held true and made sense.

> Assuming the substitution was done on this
>
> KEYWORD_AS_IDENT: PRIOR | LAST ;
> PRIOR :'prior';
> LAST :'last';
>
> to get to this
>
> KEYWORD_AS_IDENT: 'prior' | 'last' ;
> PRIOR :'prior';
> LAST :'last';
>

There is no substitution happening. Both variants translate the literals into a token value: the first one with token names you define, the latter with generated token names (well, maybe ANTLR automatically substitutes the generated token names with those you specify, if there are lexer rules for a previously found literal).

Essential for your matching is the order of your rules. When several rules match the same input, the earlier rule wins, unless another rule can match more input. Hence a rule like KEYWORD_AS_IDENT and the keyword rules it refers to can never both produce tokens for the same text: one of them always overrides the other, depending on their position in the grammar.

Mike
--
www.soft-gems.net
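
To illustrate the two rules of thumb discussed in this thread (the longest match wins, and declaration order breaks ties), here is a small sketch. The grammar name OrderDemo and the LASTING rule are hypothetical additions for illustration only; the other rules are the ones quoted above.

```antlr
lexer grammar OrderDemo;   // hypothetical grammar name

// Declared first, so it wins every tie with the two rules below.
KEYWORD_AS_IDENT : 'prior' | 'last' ;

// Same text and same length as an alternative above, but declared later:
// with this ordering these rules never produce a token.
PRIOR : 'prior' ;
LAST  : 'last' ;

// Hypothetical extra rule: it matches more input than 'last', so the
// lexer prefers it even though it is declared after KEYWORD_AS_IDENT.
LASTING : 'lasting' ;

SKIPWS : [ \t\r\n]+ -> skip ;
```

For the input 'last', both KEYWORD_AS_IDENT and LAST match four characters, so the rule declared first, KEYWORD_AS_IDENT, produces the token. For 'lasting', LASTING matches seven characters and wins despite being declared later. Moving PRIOR and LAST above KEYWORD_AS_IDENT reverses the tie, which is the situation in the original grammar.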
