Getting Expected Tokens from Error

476 views
Skip to first unread message

JW

unread,
May 27, 2015, 3:36:40 PM5/27/15
to antlr-di...@googlegroups.com
I'm very new to anltr and could use some help understanding. I want to use the parser to validate the syntax of a reasonably simple custom query language, which is straightforward enough. I would also, however, like to use the data from parser syntax errors to provide some level of useful help to the user constructing the query. Because my language will be somewhat simple, I think just getting the full list of possible tokens at the point an error was encountered would probably suffice for my needs.
 
I have played around with getExpectedTokens but it does not always seem to give the full list (or, more likely, I'm doing something wrong).
 
As a simple example, I have created the following example grammar:

grammar Simple;
 
stat:   (ID)* ( '(' expr ')' )? EOF
    ;
 
expr:   ID '=' STRING
    |   expr AND expr
    |   expr OR expr
    |   OPENP expr CLOSEP
    ;
 
FORCE_ERROR
    : '~FORCE_ERROR~' ;
AND :   'and' ;
OR  :   'or' ;
OPENP : '(' ;
CLOSEP: ')' ;
ID  :   [a-zA-Z]+ ;
STRING:  '"' ~["]* '"' ;
WS  :   [ \t\r\n]+ -> skip ;
 
The FORCE_ERROR token was just my way to introduce a token at any time that would force an error and not potentially match another valid token. An example input I was testing with is as follows:

someId (val = "one" and ~FORCE_ERROR~

In the IntelliJ plugin, the error is reported as "line 1:24 mismatched input '~FORCE_ERROR~' expecting {'(', ID}". The expected tokens identified by the IntelliJ plugin would be fine for what I want to do.

When I run the same input through the TestRig, however, the reported error is "line 1:24 no viable alternative at input '~FORCE_ERROR~'".

Ultimately, I need a JavaScript target that allows me to retrieve the same set of expected tokens as is reported by the IntelliJ plugin. Any guidance you could offer would be appreciated.
 
Thanks,
 
Jerry

Eric Vergnaud

unread,
May 27, 2015, 8:06:04 PM5/27/15
to antlr-di...@googlegroups.com
Hi Jerry,

the JavaScript target should produce the same output as the Java one.
please post a working sample so I can have a look.

Eric

JW

unread,
May 28, 2015, 1:23:22 PM5/28/15
to antlr-di...@googlegroups.com
Ok, I've continued to work with this. I found that with my generated JavaScript parser, using the grammar and input from my original post above, I am able to get the expected tokens in my error listener (in the syntaxError method).

I did find, however, that I do not get the results I expected if the grammar is modified as below and I pass the modified input given below:

grammar Simple;
 
stat:   (ID)* ( '(' query_clause ')' )? EOF
    ;
 
query_clause
    : expr ((AND | OR) query_clause)?
    ;
 
expr:   ID '=' STRING
//    |   expr AND expr
//    |   expr OR expr
    |   OPENP expr CLOSEP
    ;
 
FORCE_ERROR
    : '~FORCE_ERROR~' ;
AND :   'and' ;
OR  :   'or' ;
OPENP : '(' ;
CLOSEP: ')' ;
ID  :   [a-zA-Z]+ ;
STRING:  '"' ~["]* '"' ;
WS  :   [ \t\r\n]+ -> skip ;

New input string: someId (val = "one" ~FORCE_ERROR~

Notice the addition of the query_clause rule in the parser grammar and the modification of the expr rule. This grammar more closely reflects the original grammar for our custom query language that I was given to work with, and it seems functionally equivalent to me.

With this updated grammar and input string, the IntelliJ plugin reports "line 1:20 mismatched input '~FORCE_ERROR~' expecting {'and', 'or', ')'}". My generated JavaScript parser, which did end up working with the data from my original post, now reports the following with the updated grammar and input string "line 1:20 mismatched input '~FORCE_ERROR~' expecting ')'".

Could someone help me understand the difference between these grammar representations and why they effect the behavior of getExpectedTokens in the generated parser? I would also be curious to know why the IntelliJ plugin reports the correct expected tokens but my generated parser does not.

Thanks again,

Jerry

JW

unread,
May 30, 2015, 5:05:30 AM5/30/15
to antlr-di...@googlegroups.com
Eric,

Thanks for taking a look. I may have created some confusion there. I have not seen any difference between my JavaScript target and the Java target (as tested with the TestRig). What I have seen is a difference in the expected tokens reported by the IntelliJ plugin and my targets (either JavaScript or Java).

I have been digging around some more and see that the IntelliJ plugin uses interpreters (https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Parser+and+lexer+interpreters). My current assumption is that this is where the difference lies between the errors reported by the plugin and those reported by my generated parsers (JavaScript or Java).

I really want my generated JavaScript parser to provide me with the same kind of information that is reported by the IntelliJ plugin, but I have not been able to figure out how to do it yet. The simplest example I have currently that demonstrates a difference is with this grammar:

stat:   ( '(' expr? ')' )? EOF ;
expr: ID '=' STR ;

ERR : '~FORCE_ERROR~' ;
ID : [a-zA-Z]+ ;
STR : '"' ~["]* '"' ;

WS : [ \t\r\n]+ -> skip ;

If I provide an input string of ( ~FORCE_ERROR~ I get the following results:

IntelliJ plugin:

line 1:2 mismatched input '~FORCE_ERROR~' expecting {')', ID}

*This is exactly what I want - it knows that either a close paren or an ID would be valid tokens.

JavaScript target or Java target (executed with TestRig):

line 1:2 mismatched input '~FORCE_ERROR~' expecting ')'

*My generated parsers do not report that ID would be a valid token.

I would like to understand better why there is a difference between the plugin's output and my generated parsers, and I would especially like to know how I could make my JavaScript parser report expected tokens in the same manner as the plugin.

Thanks,

Jerry

Eric Vergnaud

unread,
May 30, 2015, 12:28:15 PM5/30/15
to antlr-di...@googlegroups.com
Jerry,

please provide a working sample i.e. generated parser + code used to produce the below.

Eric

JW

unread,
May 30, 2015, 7:55:08 PM5/30/15
to antlr-di...@googlegroups.com
Eric,

I have attached the following:

  • Example.g4 - My example grammar.
  • ExampleLexer.js - The generated JavaScript lexer.
  • ExampleParser.js - The generated JavaScript parser.
  • example.htm - My test HTML page. I serve the page from my local web server. The HTML page and the generated .js files are in the root folder and the ANTLR JavaScript runtime folders (antlr and lib) are there as well. When the page opens, just click the "Parse" button with the browser's console open to view the results of the parse.
Again, viewing the same grammar in the IntelliJ plugin with the same input string gives me the list of expected tokens I really want to see.

Thanks!

Jerry
ExampleParser.js
example.htm
ExampleLexer.js
Example.g4

Eric Vergnaud

unread,
May 31, 2015, 5:15:55 AM5/31/15
to antlr-di...@googlegroups.com
Jerry,

thanks for this.
Using your grammar and input, the Java and JavaScript runtimes generate the exact same error i.e. expecting ')'.
I'm not a TestRig expert but my guess is that TestRig uses a different parsing mode than the default one, hence the different output.

Eric

JW

unread,
May 31, 2015, 10:13:01 AM5/31/15
to antlr-di...@googlegroups.com
Just to be clear, I have always seen the same output from Java targets executed with TestRig and my JavaScript target. When I see different output from the IntelliJ plugin, it is when I'm using the "Test Rule" capability within the plugin. That is the kind of output I want to be able to display to the users of my JavaScript target so they know what tokens would be valid when entering a query. Here is what it looks like in IntelliJ:

I do agree that the plugin must do something different. From looking at the plugin source, it appears to use interpreters. I don't know if that would matter or not because I don't have a good understanding of the difference between that approach and a generated parser. They also set the prediction mode to LL_EXACT_AMBIG_DETECTION. I don't know if that matters either. Can it be changed in the JavaScript parser?

Any other ideas on how I might get all of the allowed tokens from an error condition would be greatly appreciated.

Thanks,

Jerry

Eric Vergnaud

unread,
May 31, 2015, 11:11:40 AM5/31/15
to antlr-di...@googlegroups.com
You can set the prediction mode in javascript
simply set it the _interp instantiated for you on the parser function
I would recommend that you read Terence's book.

Terence Parr

unread,
May 31, 2015, 12:18:51 PM5/31/15
to antlr-di...@googlegroups.com
somebody else mentioned a diff with plugin and tool. the interp should do the same thing, minus LL_EXACT_AMBIG_DETECTION mode. As Eric says, you can try setting that in your app. It forces ANTLR to figure out precisely what input / grammar elements are ambiguous but it also might change how error messages for bad input.

Ter
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

JW

unread,
May 31, 2015, 3:52:02 PM5/31/15
to antlr-di...@googlegroups.com
I tried setting the prediction mode in JavaScript as follows:

parser._interp.predictionMode = antlr4.atn.PredictionMode.LL_EXACT_AMBIG_DETECTION;

I added this line before I started the parse. It did not, however, change the result. I added breakpoints anywhere that the predictionMode property was used and none were hit, so it seems to have no effect for my example grammar and input.

I have read most of Terrence's ANTLR book, by the way. I prefer to figure these things out on my own, and the book has helped the learning process, but I am really stuck on this one. :)

Eric Vergnaud

unread,
Jun 1, 2015, 9:33:19 AM6/1/15
to antlr-di...@googlegroups.com
The next step would be for you to do a parallel run of the java plugin and the javascript parser in debug mode.
I'm sure that with such a minimal grammar, you'll rapidly locate where the difference lies.

JW

unread,
Jun 1, 2015, 12:30:58 PM6/1/15
to antlr-di...@googlegroups.com
Do I put the parser into debug mode in code? I haven't done that yet.

You did lead me to look at the options for the ANTLR tool again while looking to run the parser in debug mode. I noticed the option -Xforce-atn which is described in the ANTLR book as follows:

ANTLR normally builds traditional “switch on token type” decisions where possible (one token of lookahead is sufficient to distinguish between all alternatives in a decision). To force even these simple decisions into the adaptive LL(*) mechanism, use this option.

I wonder if you (Eric and Ter) could offer your thoughts on this? Again, I'm just learning here, so bear with me. I noticed when I looked at the generated parser that this rule resulted in a simple "if" statement in the code so the parser never entered the expr rule at all (the error was reported from the "stat" context instead of the "expr" context where it could identify the expected tokens I wanted):

stat: ( '(' expr? ')' )? EOF ; 

I regenerated the parser with the Xforce-atn option and now I get the list of tokens I expect. My assumption is that the IntelliJ plugin also uses "the adaptive LL(*) mechanism" for all decisions because it uses the ParserInterpreter, which mimics a generated parser by stepping through the ATN, but it's not ultimately generated code, so it won't be handling simple decisions independently of adaptivePredict like the generated parser would. Sorry if I butchered that description... do you think my understanding of that could be anywhere in the ballpark?

I'm going to play around some more with the parser generated with the Xforce-atn option to see if it will work for me.

Terence Parr

unread,
Jun 1, 2015, 1:13:44 PM6/1/15
to antlr-di...@googlegroups.com
Yep, plugin doesn’t use the LL(1) optimization so that is one diff, though not sure why it changes the error sets. I think maybe the code gen for LL(1) does a simple switch-on-type that assumes error is “don’t match the optinoal thing”.

Ter

JW

unread,
Jun 3, 2015, 12:23:37 PM6/3/15
to antlr-di...@googlegroups.com
I'm posting a followup here in case it's useful to anyone else or to give people the opportunity to point out the error of my ways. :)

I continued to try to get the expected tokens I wanted given the example grammar I posted in this thread. I had some initial success experimenting with generating my parser with the -Xforce-atn option, but quickly found that it didn't work for various minimal changes to the grammar. I abandoned that route.

My end goal is to provide better error reporting for a small domain-specific query language. My input strings will never be very large like they could be for a programming language, so I can get away with a lot with respect to performance. The IntelliJ plugin feature that allows you to test a specific rule always seems to return the set of expected tokens that you would... expect. It performs parsing using a ParserInterpreter from the runtime, which seems to be the reason for the slightly different behavior from my generated parser.

I hacked together a JavaScript version of the ParserInterpreter and it was also able to return the "correct" expected tokens. I think I understand now why the ParserInterpreter gives me the expected tokens I want and the generated parser does not. Here is my simple example grammar again:

grammar Example;

stat:   ( '(' expr? ')' )? EOF ;
expr:   ID '=' STR ;

ERR :   '~FORCE_ERROR~' ;
ID  :   [a-zA-Z]+ ;
STR :   '"' ~["]* '"' ;
WS  :   [ \t\r\n]+ -> skip ;

And here is just a small snippet from the "stat" function of the generated JavaScript parser:

        _la = this._input.LA(1);
       
if(_la===ExampleParser.T__0) {
           
this.state = 4;
           
this.match(ExampleParser.T__0);
           
this.state = 6;
            _la
= this._input.LA(1);
           
if(_la===ExampleParser.ID) {
               
this.state = 5;
               
this.expr();
           
}

           
this.state = 8;
           
this.match(ExampleParser.T__1);
       
}

T__0 and T__1 are the open and close parentheses in the stat rule. Because expr is optional in the stat rule, there are 2 possible transitions from state = 6. It can either transition to state 5 (to attempt to match an expr rule) or state 8 (to attempt to match the close parenthesis). If I give this generated parser the following input, it will only report the close parenthesis ")" as an expected token:

( ~FORCE_ERROR~

It is obvious in the code why this is the case. The ~FORCE_ERROR~ token is encountered by the lookahead, so the parser immediately skips checking the expr rule and proceeds to state 8, where the subsequent match method will report the error. It reports the error from state = 8, missing the fact that ID from the expr rule could have been a valid token as well.

The ParserInterpreter avoids this behavior because it moves through the states of the ATN to simulate parsing, but every time it visits a state that has more than one possible transition, it calls the sync() method on the error handler. The sync() method does a lookahead to see if the next token is valid for the given parser state. For the specific behavior I'm trying to get, where I force an error at the end of the input with the bogus ~FORCE_ERROR~ token, the sync cannot recover and reports the error while the parser is in the earliest state where the error could be recognized (state = 6 in the example above). This allows the ParserInterpreter to report the error prior to the state transition, also allowing it to compute the "correct" expected tokens - ")" and ID.

To get this same behavior from my generated JavaScript parser, instead of using the ParserInterpreter, I monkeyed with the parser a little before using it to parse input (without actually modifying the generated parser's code). In my HTML page that loads the parser, I'm doing something like this:

   var myParser = require('./ExampleParser');

   
Object.defineProperty(myParser.ExampleParser.prototype, "state", {
     
get: function ()
     
{
         
return this._stateNumber;
     
},
     
set: function (state)
     
{
         
this._stateNumber = state;
         
         
if (state != -1)
         
{
           
var atnState = this.atn.states[state];
           
if (atnState.transitions.length > 1)
           
{
               
this._errHandler.sync(this);
           
}
         
}
     
}
   
});

This replaces the default "setter" method for the parser's state. Now whenever the generated JavaScript parser changes the parser's state, it will perform the error handler's synchronization if the state has multiple transitions, just like the ParserInterpreter does. And, finally, I get back the set of expected tokens I want to report to the user for my specific scenario.

I don't know if this is a good or bad approach, so feedback is always welcome. I just know that it's working pretty well for me now.

Thanks,

Jerry

Terence Parr

unread,
Jun 3, 2015, 12:49:11 PM6/3/15
to antlr-di...@googlegroups.com
A nice analysis Jerry! Yep, LL(1) optimization tends to skip optional stuff. Could you use an ErrorHandler instead btw?
Ter

JW

unread,
Jun 3, 2015, 2:19:08 PM6/3/15
to antlr-di...@googlegroups.com
Thanks, Ter. You and Eric have helped quite a bit to steer me in the right direction.

I considered an ErrorHandler when I first started looking into this. I've learned quite a bit since then, so I could revisit it. I reread the Error Reporting and Recovery chapter from your book, and I'm still not positive I can get what I want from an ErrorHandler. Mostly because the error handling is being invoked after the optional case is skipped over by the generated parser and the parser state has already changed. Because getExpectedTokens() depends on the current parser (recognizer) state, it is too late by the time error handling kicks in for me to get all of the expected tokens. If I tried to use just an ErrorHandler, I think I'd need some way to work backwards to the parser state where the error could have first been recognized and get the expected tokens from that state (and context)... I think?

Incidentally, in my example, if I change expr? to expr*, everything works fine. As you said, optional stuff is skipped. Subrule loops like expr+ or expr* are handled by ANTLR by performing an error handler sync() at the start of the subrule and during each iteration (Recovering from Errors in Subrules, The Definitive ANTLR 4 Reference). I essentially want to treat the optional expr? like a subrule loop that might iterate, at most, one time. My "fix" where I modified the state setter forces that error handler sync() into the code right before the start of the optional subrule.

Thanks,

Jerry

Terence Parr

unread,
Jun 3, 2015, 2:20:23 PM6/3/15
to antlr-di...@googlegroups.com
how about (expr | ), which might do same thing. :)
Ter

JW

unread,
Jun 3, 2015, 2:44:29 PM6/3/15
to antlr-di...@googlegroups.com
Ah, good one! That does work. I know you know this, but for anyone following:

The generated parser resulting from this grammar has a switch statement on the lookahead token for the alternatives, including the empty alternative. Because none of these cases match my ~FORCE_ERROR~ token, the default case of the switch throws a NoViableAltException while the parser is still in a state that lets it return all of the expected tokens from the alternatives. Very cool.

I have to decide, however, if I want to modify my grammar specifically to support my intended user interface. The JavaScript target will support the users in entering valid queries through the web browser, but a C# target will eventually execute the query on the server. I don't want to have two versions of the grammar for these different purposes, and I may not want to risk other developers modifying the grammar in ways that will suddenly break my user interface (someone decides that writing expr? looks cleaner than a subrule with an empty alternative).

Thanks,

Jerry
Reply all
Reply to author
Forward
Message has been deleted
0 new messages