How to suggest next token based on grammar?

2,073 views
Skip to first unread message

Otis Gospodnetic

unread,
Apr 24, 2014, 11:28:41 PM4/24/14
to antlr-di...@googlegroups.com
Hi,

I have ANTLR4 grammar (for something like SQL) and would like to use it to build a UI that offers suggestions as user types.

For example:
if one starts typing "SELECT " then I want to offer a list of various functions that can follow that SELECT, like "count()".

Does anyone have any pointers?

Thanks,
Otis

Mike Lischke

unread,
Apr 25, 2014, 2:52:25 AM4/25/14
to antlr-di...@googlegroups.com

Otis, the keywords for such a feature are "code completion" or "auto completion". There have been previous discussions about this. Best is to search for those keywords in conjunction with ANTLR. There's no ready-to-use solution AFAIK, but e.g. Sam Harwell listed some suggestions on Stackoverflow.com (search there).

I have ANTLR4 grammar (for something like SQL) and would like to use it to build a UI that offers suggestions as user types.

For example:
if one starts typing "SELECT " then I want to offer a list of various functions that can follow that SELECT, like "count()".

Terence Parr

unread,
Apr 25, 2014, 9:03:09 AM4/25/14
to antlr-di...@googlegroups.com
hi Otis! nice to hear from you. At any point, you can ask the parser for the set of tokens that could come next but it is not always what is expected from autocompletion.

Ter
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Otis Gospodnetic

unread,
Apr 25, 2014, 9:44:27 AM4/25/14
to antlr-di...@googlegroups.com
Hi Ter,

On Friday, April 25, 2014 9:03:09 AM UTC-4, the_antlr_guy wrote:
hi Otis! nice to hear from you. At any point, you can ask the parser for the set of tokens that could come next

Aha!  That's what I'm after.  Can you please point me to something, some method name or some link or something to check out to see how to get this "set of tokens that could come next"?

 
but it is not always what is expected from autocompletion.

Hm, interesting, though I can't imagine the situation where one wouldn't want to see autocompletion offering the next set of possible tokens.  Do you have an example in mind that would help me understand?
 
Thanks,
Otis

Matthieu Moy

unread,
Apr 25, 2014, 10:01:09 AM4/25/14
to antlr-di...@googlegroups.com
Otis Gospodnetic <otis.gos...@gmail.com> writes:

> Hm, interesting, though I can't imagine the situation where one
> wouldn't want to see autocompletion offering the next set of possible
> tokens. Do you have an example in mind that would help me understand?

The most useful autocompletion is IMHO to autocomplete identifiers.
Then, the grammar will tell you "identifier expected next", but you
can't do much with only this information.

--
Matthieu Moy
http://www-verimag.imag.fr/~moy/

Mike Lischke

unread,
Apr 25, 2014, 10:44:14 AM4/25/14
to antlr-di...@googlegroups.com

>
>> Hm, interesting, though I can't imagine the situation where one
>> wouldn't want to see autocompletion offering the next set of possible
>> tokens. Do you have an example in mind that would help me understand?
>
> The most useful autocompletion is IMHO to autocomplete identifiers.
> Then, the grammar will tell you "identifier expected next", but you
> can't do much with only this information.

Exactly! And that's why the parser's follow set is almost useless for this task. Additionally, while doing auto completion you almost always have an incorrect grammar which the parser is usually not well forgiving (maybe ANTLR 4 is better in this regard, but in general this holds true and it certainly behaves differently with certain parse strategies, like backtracking).

So, what one really needs is the list of possible parser rules (non-terminals) at a given point, provided that the grammar is organized such that a rule name uniquely determines the context information at that point (say, in an SQL command you start auto completion after a schema name and a dot which would then show all possible table names, which could be represented as a rule "table_identifier" in the grammar).

This however requires to parse your grammar, put it in a set of data structures that you can walk for a given input to collect that list (up to the caret position). A similar approach is the list of parse trees ANTLRWorks2 generates, IIRC. See https://github.com/tunnelvisionlabs/antlrworks2/tree/master/org-antlr-works-editor/src/org/antlr/works/editor/grammar/completion for code details.

Mike
--
www.soft-gems.net

Terence Parr

unread,
Apr 25, 2014, 11:32:05 AM4/25/14
to antlr-di...@googlegroups.com
hi. As others have said, it’s really about picking good identifiers.  The way intellij handles it, they call one of your functions with the identifier prefix you have typed and a location within their parse tree. From there, it’s a matter of walking the parse tree or looking things up in a simple table to compute the set of all visible identifiers, whether they be types or method names or whatever. Then you just have to check which one start with that prefix or have that substring contained within.

Does that give you an idea how to do it yourself?

Ultimately, I hope to get a “build me a plug-in” button within my ANTLR 4 intellij plug-in. All the pieces of there, more or less, but unlike xtext it would not do much more than syntax highlighting and syntax checking, at least out-of-the-box.

Ter

Terence Parr

unread,
Apr 25, 2014, 11:46:29 AM4/25/14
to antlr-di...@googlegroups.com
On Fri, Apr 25, 2014 at 6:44 AM, Otis Gospodnetic <otis.gos...@gmail.com> wrote:
Hi Ter,

On Friday, April 25, 2014 9:03:09 AM UTC-4, the_antlr_guy wrote:
hi Otis! nice to hear from you. At any point, you can ask the parser for the set of tokens that could come next

Aha!  That's what I'm after.  Can you please point me to something, some method name or some link or something to check out to see how to get this "set of tokens that could come next"?

I believe it's just a parser method like getExpectedTokensWithinCurrentRule() 

 
but it is not always what is expected from autocompletion.

Hm, interesting, though I can't imagine the situation where one wouldn't want to see autocompletion offering the next set of possible tokens.  Do you have an example in mind that would help me understand?
 

The xtext guys and I had a conversation the other day and I can't remember what they were telling me  but it made sense ;)

Ter 


--
Dictation in use. Please excuse homophones, malapropisms, and nonsense. 

Sam Harwell

unread,
Apr 25, 2014, 1:21:45 PM4/25/14
to antlr-di...@googlegroups.com
The framework used by ANTLRWorks 2 is designed to properly handle code completion within an IDE, without requiring the user to write a specially crafted (specifically LL(1)) grammar, and without sacrificing accuracy. Rather than produce a deterministic parse tree, the code modifies ANTLR's prediction algorithms to produce a non-deterministic "forest" of all possible parse trees up to, but no farther than, a specific location in the input (e.g. where the caret is). The text after the caret is treated as "anything", as opposed to other alternatives like treating it as EOF. It also allows for some level of ambiguity, such as treating 'private' like either a keyword or identifier in Java, since the user might be in the process of typing 'privateVariable'. The current implementation of this functionality is not as efficient as it could be, but in real-world applications like GoWorks it has not proven to be a problem for speed or memory consumption (at least not yet).

In order to produce accurate code completion without writing a parser that allows ambiguity, you must write your grammar as LL(1).

Sam

Bojan V

unread,
Jan 20, 2015, 5:37:04 AM1/20/15
to antlr-di...@googlegroups.com

Parser has also method getExpectedTokens() which can be used, but it retrieves just mandatory tokens, how to retrieve both non mandatory and mandatory tokens?

Mike Lischke

unread,
Jan 20, 2015, 7:22:09 AM1/20/15
to antlr-di...@googlegroups.com

Parser has also method getExpectedTokens() which can be used, but it retrieves just mandatory tokens, how to retrieve both non mandatory and mandatory tokens?

I'd also be interested how you would know what to return (a var name, a class name etc.) if you only get an IDENTIFIER token from the parse tree.


Bojan V

unread,
Jan 20, 2015, 10:12:40 AM1/20/15
to antlr-di...@googlegroups.com
I guess in that case you would have to check which was the last rule that parser run into

Terence Parr

unread,
Jan 20, 2015, 7:07:10 PM1/20/15
to antlr-di...@googlegroups.com
do you mean in terms of what an IDENTIFIER can generate?
Ter

Mike Lischke

unread,
Jan 21, 2015, 3:44:24 AM1/21/15
to antlr-di...@googlegroups.com
do you mean in terms of what an IDENTIFIER can generate?

Well, the main problem here is that the parse tree only gives you a generic IDENTIFIER token which can be anything. For code completion you'd need to know however what exactly is that identifier. With id qualification it can even be that you reference a much higher structure (e.g. refering to a schema in SQL when writing a column reference).

However, I stil think a parser is not the right tool for code completion (as I have expressed in an earlier posting), because a parser requires the input to be correct, while for code completion you almost always have incorrect syntax.

I have implemented a completely new code completion approach using ANTLR 3.4 in the upcoming MySQL Workbench 6.3 release and plan to write a blog post about the details. The essence is: parse the grammar and create a data structure you can traverse. Use the lexer to go from token to token and collect the candidates from this data structure (considering mandatory and optional tokens/rules). Use the rule names from your grammar to know exactly what type of info you need to present (e.g. schemas, tables, columns, engine, charset etc. etc.). This approach works very precisely, even with a complex language like (My)SQL.
Mike


Sebastian Zarnekow

unread,
Jan 21, 2015, 3:58:33 AM1/21/15
to antlr-di...@googlegroups.com
For Xtext we did something similar as Mike did:
We take the document from start to the cursor position. The cursor may be within a token, so we use the starting position of that token and the exact cursor position and do two passes on these two inputs. During parsing, we record the path through the grammar that was taken such that we know exactly what that thing was doing, including error recovery operations etc. The parser is also configured with a token stream that will bail out as soon as it tries to read beyond EOF. Essentially we thereby create an event and introspect the parser's state and the path that it took when this happens.
Now there are basically two scenarios: The parser was using lookAhead 1: We are good to go and have our expectation. The parser had a bigger lookAhead. If that happens, we take the remaining tokens and use these to restart the parser for each alternative that it may have used from there. So basically we respin the parser and force it into each decision as long as it doesn't reach LA 1.
This approach allows to use the exact same semantics as Antlr uses during production parsing. Instead of manually using the lexer and matching tokens against some custom data structure, we still let Antlr do that heavy lifting.

Best,
Sebastian

Mike Lischke

unread,
Jan 21, 2015, 4:10:10 AM1/21/15
to antlr-di...@googlegroups.com
Hey Sebastian,

We take the document from start to the cursor position. The cursor may be within a token, so we use the starting position of that token and the exact cursor position and do two passes on these two inputs. During parsing, we record the path through the grammar that was taken such that we know exactly what that thing was doing, including error recovery operations etc. The parser is also configured with a token stream that will bail out as soon as it tries to read beyond EOF. Essentially we thereby create an event and introspect the parser's state and the path that it took when this happens. 

Interesting approach. How do you know what path the parser has taken? When you manually traverse you can just use a LIFO stack to collect the sequence (which I actually do in MySQL Workbench to be able to travel up the stack to a known rule I can handle as a whole, e.g. "table_reference", when the caret is currently in an identifier rule).

But even then you need to consider possible alternatives. Many paths can lead to the same input up to a certain point, so you have to walk all possible alternatives that match up to the caret position (and I mean not only alts in the current rule, but any rule higher up). Which creates the parse forest Sam spoke about (only that I don't really create the forest, but do everything in the traverse process).


Bojan V

unread,
Jan 29, 2015, 6:11:38 AM1/29/15
to antlr-di...@googlegroups.com
Is it possible to override Parser.getExpectedTokens method so it retrieves both mandatory and non mandatory tokens? So basic question is how to get non mandatory tokens?

Terence Parr

unread,
Feb 11, 2015, 1:05:38 PM2/11/15
to antlr-di...@googlegroups.com
Hi Sebastian, let's talk about this approach more at xtext day or at the language implementation meet up. I just implemented, in an experimental branch, for v4 a mechanism that will retry the input with decision overrides. That allows me to find all possible interpretations of the input when I find an ambiguity. I'm doing is primarily to help people identify ambiguities in the grammar, but it seems it could also serve the purpose of finding the set of expected symbols.

I think the idea would be to, as you have done, parse up until the current token in all possible ways. Given partial input, even an unambiguous grammar can yield ambiguities.

Ter

Terence Parr

unread,
Feb 11, 2015, 1:10:08 PM2/11/15
to antlr-di...@googlegroups.com
Hi Mike, What do you think of the approach I mentioned where I simply re-parse from the beginning to the cursor and collect the ambiguities with a diagnostic listener? The ideas than to retry the input again, once for each possible path for each detected ambiguity.  This could be expensive but even for large input parsing with an efficient grammar is no big deal. On my 2010 mac with a single core, ANTLR v4's Java grammar can parse 132M of Java 6 lib source in about five seconds.

Ter

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

Mike Lischke

unread,
Feb 12, 2015, 3:30:24 AM2/12/15
to antlr-di...@googlegroups.com
Hi Ter,

Hi Mike, What do you think of the approach I mentioned where I simply re-parse from the beginning to the cursor and collect the ambiguities with a diagnostic listener? The ideas than to retry the input again, once for each possible path for each detected ambiguity. 

I'm not sure why you need 2 runs, but I'm not very familiar with ANTLR4. Maybe that's the way to go there. However, you still have 2 problems that cannot be solved by any of the approaches using a parser:

1) Input is often invalid when invoking code completion which may cause the parser to bail out at any time without giving you back any useful information (e.g. a parse tree).

2) For keywords listing the next valid terminals is a good approach, but this is only half the battle. You often have cases where for instance identifiers are expected, but without context you cannot say if you need a variable, constant, function name or whatever. In order to solve that you need domain knowlege. The easiest approach IMO without having to hand code too much is to use the grammar as such and put meaning on certain rules. E.g. I have a parser rule table_ref which can be id.id.id, with optional parts. When an id is hit you have to go up the parse tree until you find one of a set of rules that have special meaning (like the table_ref rule). Once you have that you know exactly what to present to the user. I have rules like object_name + object_ref for every possible object in my grammar which makes it simple to know where a reference (and hence a list of possible references) is needed.

When I say "parse tree" in point 2) I don't really mean a parser tree as generated by the parser (because you may not have one), but instead a walker that acts similar like a parser but doesn't require the entire input to match. I believe in ANTLR4 there is some functionality (ATN) to generate tree alts (from what I understood in Sam's code) but that is not available in ANTL3 so I parse my grammar file and generate a representation in memory for it. Then I can easily walk that with a few recursive functions to any point in the input, collecting all possible candidates on the way.


Terence Parr

unread,
Feb 23, 2015, 1:40:13 PM2/23/15
to antlr-di...@googlegroups.com
Yep, the invalid input problem will give trouble to a formal parser. Hmm...I'll have to investigate. What I need is a smallish grammar and some sample input to demo the issues.
Ter

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Bojan V

unread,
Sep 29, 2015, 8:51:01 AM9/29/15
to antlr-discussion
Otis,

you still there? Can you tell us your experience regarding this topic?
I have been playing with antlr4 and trying to achieve autocomplete for our language, I had some partial success. Our idea was to give antlr input text until caret to parse it, and than depending on last tree node that antlr run into decide what to offer to the user (or in some cases we would just take expected tokens when antlr runs into an error). As I mentioned - we had partial success, reason is following: when you type text in your editor, most of the time syntax is not correct, and when you give antlr such text to parse sometimes parser gets confused and goes into direction that you don't expect and as such parse tree is no longer usefull
Reply all
Reply to author
Forward
Message has been deleted
0 new messages