I'm posting a followup here in case it's useful to anyone else or to give people the opportunity to point out the error of my ways. :)
I kept trying to get the expected tokens I wanted from the example grammar I posted in this thread. I had some initial success generating my parser with the -Xforce-atn option, but quickly found that it stopped working after various minimal changes to the grammar, so I abandoned that route.
My end goal is to provide better error reporting for a small domain-specific query language. My input strings will never be very large like they could be for a programming language, so I can get away with a lot with respect to performance. The IntelliJ plugin feature that allows you to test a specific rule always seems to return the set of expected tokens that you would... expect. It performs parsing using a ParserInterpreter from the runtime, which seems to be the reason for the slightly different behavior from my generated parser.
I hacked together a JavaScript version of the ParserInterpreter and it was also able to return the "correct" expected tokens. I think I understand now why the ParserInterpreter gives me the expected tokens I want and the generated parser does not. Here is my simple example grammar again:
And here is just a small snippet from the "stat" function of the generated JavaScript parser:
    _la = this._input.LA(1);
    if(_la===ExampleParser.T__0) {
        this.state = 4;
        this.match(ExampleParser.T__0);
        this.state = 6;
        _la = this._input.LA(1);
        if(_la===ExampleParser.ID) {
            this.state = 5;
            this.expr();
        }
        this.state = 8;
        this.match(ExampleParser.T__1);
    }
T__0 and T__1 are the open and close parentheses in the stat rule. Because expr is optional in the stat rule, there are 2 possible transitions from state = 6. It can either transition to state 5 (to attempt to match an expr rule) or state 8 (to attempt to match the close parenthesis). If I give this generated parser the following input, it will only report the close parenthesis ")" as an expected token:
( ~FORCE_ERROR~
It is obvious in the code why this is the case. The ~FORCE_ERROR~ token is encountered by the lookahead, so the parser immediately skips checking the expr rule and proceeds to state 8, where the subsequent match method will report the error. It reports the error from state = 8, missing the fact that ID from the expr rule could have been a valid token as well.
The ParserInterpreter avoids this behavior because it moves through the states of the ATN to simulate parsing, but every time it visits a state that has more than one possible transition, it calls the sync() method on the error handler. The sync() method does a lookahead to see if the next token is valid for the given parser state. For the specific behavior I'm trying to get, where I force an error at the end of the input with the bogus ~FORCE_ERROR~ token, the sync cannot recover and reports the error while the parser is in the earliest state where the error could be recognized (state = 6 in the example above). This allows the ParserInterpreter to report the error prior to the state transition, also allowing it to compute the "correct" expected tokens - ")" and ID.
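To make the difference concrete, here is a toy model of the two reporting strategies. It is only a sketch and does not use the ANTLR runtime: the expectedAt table is hand-written to mirror states 6 and 8 from the snippet above, and the function names reportLikeGeneratedParser and reportLikeInterpreter are made up for illustration.

```javascript
// Toy model, NOT the ANTLR runtime. The table is hand-written (an assumption)
// and lists which tokens may follow each state in the "stat" snippet.
const expectedAt = {
  6: ["ID", "')'"], // decision point: optional expr, or the closing parenthesis
  8: ["')'"],       // reached after the decision has already skipped expr
};

// Generated-parser style: pick a branch from the lookahead first,
// then let the later match() report the error from the new state.
function reportLikeGeneratedParser(lookahead) {
  let state = 6;
  if (lookahead === "ID") {
    return null; // would descend into expr; no error in this toy case
  }
  state = 8; // expr skipped, so the parser has already moved on
  return lookahead === "')'" ? null : { state, expected: expectedAt[state] };
}

// Interpreter style: check the lookahead at the decision state
// before choosing a branch, so the error is reported from state 6.
function reportLikeInterpreter(lookahead) {
  const state = 6;
  return expectedAt[state].includes(lookahead)
    ? null
    : { state, expected: expectedAt[state] };
}

console.log(reportLikeGeneratedParser("~FORCE_ERROR~")); // reported from state 8
console.log(reportLikeInterpreter("~FORCE_ERROR~"));     // reported from state 6
```

With the bogus token as lookahead, the first function reports only the close parenthesis, while the second reports both ID and the close parenthesis, which matches what I saw from the two parsers.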
To get this same behavior from my generated JavaScript parser, instead of using the ParserInterpreter, I monkeyed with the parser a little before using it to parse input (without actually modifying the generated parser's code). In my HTML page that loads the parser, I'm doing something like this:
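    var myParser = require('./ExampleParser');
    // Override the "state" property on the generated parser's prototype.
    Object.defineProperty(myParser.ExampleParser.prototype, "state", {
        get: function ()
        {
            return this._stateNumber;
        },
        set: function (state)
        {
            this._stateNumber = state;
            // -1 is the invalid state number, so there is no ATN state to look up.
            if (state !== -1)
            {
                var atnState = this.atn.states[state];
                // More than one outgoing transition: sync before a branch is chosen.
                if (atnState.transitions.length > 1)
                {
                    this._errHandler.sync(this);
                }
            }
        }
    });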
This replaces the default "setter" method for the parser's state. Now whenever the generated JavaScript parser changes the parser's state, it will perform the error handler's synchronization if the state has multiple transitions, just like the ParserInterpreter does. And, finally, I get back the set of expected tokens I want to report to the user for my specific scenario.
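If you want to check the setter logic without loading a generated parser, the same idea can be wrapped in a small helper and exercised against a stand-in object. This is only a sketch: installSyncOnDecision and FakeParser are hypothetical names I made up, and the stand-in copies just the three members the setter touches (_stateNumber, atn.states and _errHandler.sync). The extra guard for a missing ATN state is my own addition.

```javascript
// Hypothetical helper: overrides the "state" setter of a parser class so the
// error handler syncs whenever the parser enters a state with several transitions.
function installSyncOnDecision(ParserClass) {
  Object.defineProperty(ParserClass.prototype, "state", {
    configurable: true,
    get: function () { return this._stateNumber; },
    set: function (state) {
      this._stateNumber = state;
      if (state < 0) return; // -1 is the invalid state number
      var atnState = this.atn.states[state];
      // Guard against a missing state (an extra check, not in the original snippet).
      if (atnState && atnState.transitions.length > 1) {
        this._errHandler.sync(this);
      }
    }
  });
}

// Stand-in parser, used only to exercise the helper without the ANTLR runtime.
function FakeParser() {
  var self = this;
  this._stateNumber = -1;
  this.syncedStates = [];
  this.atn = { states: { 6: { transitions: [{}, {}] }, 8: { transitions: [{}] } } };
  this._errHandler = { sync: function (p) { self.syncedStates.push(p.state); } };
}

installSyncOnDecision(FakeParser);
var p = new FakeParser();
p.state = 6; // two transitions: sync is called
p.state = 8; // one transition: no sync
console.log(p.syncedStates); // [ 6 ]
```

With the real parser the call would be installSyncOnDecision(myParser.ExampleParser), made once before parsing any input.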
I don't know if this is a good or bad approach, so feedback is always welcome. I just know that it's working pretty well for me now.
Thanks,
Jerry