Building autocompletion for an ANTLR parser

Federico Tomassetti

May 20, 2017, 2:57:17 AM
to antlr-discussion
Hi everyone,
I am currently working on improving my implementation of autocompletion for ANTLR. It is included in the open-source project Kanvas (a framework to build editors based on ANTLR).

The way I find the types of the tokens that can follow the caret is by interpreting the ATN data structure, following the transitions and remembering the states I have traversed, so that I can understand which states are incompatible with the path I have taken so far.

That part works and I can figure out which types of tokens can follow. I call this part "syntactic autocompletion" and it works for all kinds of tokens but identifiers. I wrote the algorithm once and I can apply it without changes to all possible ANTLR parsers.
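
To give an idea of that traversal, here is a much simplified sketch in Kotlin against the ANTLR Java runtime (not my actual Kanvas code; the naive visited set is only there to make the recursion terminate):

import org.antlr.v4.runtime.atn.ATNState

// Collect the token types reachable from an ATN state: follow epsilon transitions
// and gather the labels of the non-epsilon ones. The visited set is a simplification;
// a real implementation also has to track the rule-invocation stack.
fun followingTokenTypes(state: ATNState, visited: MutableSet<Int> = mutableSetOf()): Set<Int> {
    val result = mutableSetOf<Int>()
    if (!visited.add(state.stateNumber)) return result
    for (transition in state.transitions) {
        if (transition.isEpsilon) {
            // epsilon transitions (rule entry, block start, etc.) consume no input
            result.addAll(followingTokenTypes(transition.target, visited))
        } else {
            // label() is the set of token types this transition can match
            transition.label()?.let { result.addAll(it.toList()) }
        }
    }
    return result
}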

For identifiers I would need to understand which symbols are visible in a certain context and propose only those. This is of course language specific.

First of all, I track which rule I am currently in (still by analyzing the ATN data structure) and I know which tokens I have traversed, so in some cases this is enough to figure out which symbols can be referenced.

For example, I have a language to describe state machines in which I can specify transitions like:

"on <EVENTNAME> -> <STATENAME>"

So when I am in a transition and I know an ID should follow, I look at the preceding symbol: if it is "on" I propose all the events; if it is the arrow "->" I propose all the states.

How do I know all the events and states? I grab them from the last version of the AST that I was able to build. This is not ideal and it does not work properly all the time but it seems to me decent enough for now (if you have better ideas I would be happy to hear them).
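
In code, that dispatch is roughly the following (a sketch; the data classes and property names are just illustrative, not my actual AST model):

data class Event(val name: String)
data class State(val name: String)
data class StateMachine(val events: List<Event>, val states: List<State>)

// Suggest names based on the token right before the caret inside a transition.
fun transitionSuggestions(precedingToken: String, lastGoodAst: StateMachine): List<String> =
    when (precedingToken) {
        "on" -> lastGoodAst.events.map { it.name }   // "on <EVENTNAME>": propose event names
        "->" -> lastGoodAst.states.map { it.name }   // "-> <STATENAME>": propose state names
        else -> emptyList()
    }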

Now the problem is when I need to give suggestions that are specific to the position of the caret. In a language where I can define functions, when an ID can follow the caret inside a function I want to suggest the parameters:

fun foo(Int myParam) Int {
   // here I would suggest myParam
}
// here I would NOT suggest myParam

I am not sure what is the best way to implement this: I would probably need to build incomplete ASTs.

My typical scenario is that a user types:

fun foo(Int myParam) Int { // incomplete function
// here I would expect it to suggest myParam

Have you implemented a solution to similar problems? Do you have any suggestions?

Thanks!

Mike Lischke

May 20, 2017, 6:53:11 AM
to antlr-di...@googlegroups.com
Hey Federico, I didn't know you were still working on your auto completion code :-)

I am currently working on improving my implementation of autocompletion for ANTLR. It is included in the open-source project Kanvas (a framework to build editors based on ANTLR).

I wish we could join forces and you would work with me on the Visual Studio Code extension for ANTLR4 (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4), instead of writing your own editor (but you would have to switch from Kotlin to TypeScript).

The extension I'm working on is based on 2 node modules I also wrote, namely:

- Code Completion Core for ANTLR4: https://github.com/mike-lischke/antlr4-c3
- ANTLR4 Grammar Parsing Service (graps): https://github.com/mike-lischke/antlr4-graps



The way I find the types of the tokens that can follow the caret is by interpreting the ATN data structure, following the transitions and remembering the states I have traversed, so that I can understand which states are incompatible with the path I have taken so far.

When I started with my implementation I also consulted your code, but had quite some trouble with it. Especially the recursion-per-state approach is a serious problem for complex grammars, not to mention the slowness I saw. Speed is actually one of the biggest concerns here; my own implementation is also too slow for complex grammars, which is why I implemented an ATN graph viewer in my vscode extension (http://www.soft-gems.net/files/atn.gif). That will hopefully help me better understand how to process the ATN.

A hint: don't simply track the states you have visited and ignore them later on. This will not work with multiple occurrences of the same subrule or with recursive rules.


That part works and I can figure out which types of tokens can follow. I call this part "syntactic autocompletion" and it works for all kinds of tokens but identifiers. I wrote the algorithm once and I can apply it without changes to all possible ANTLR parsers.

For identifiers I would need to understand which symbols are visible in a certain context and propose only those. This is of course language specific.

There is a way to make even this language neutral, provided a few conditions are met. Read the description of antlr4-c3 (https://github.com/mike-lischke/antlr4-c3/blob/master/readme.md) to understand how to solve that (a summary is below).


How do I know all the events and states? I grab them from the last version of the AST that I was able to build. This is not ideal and it does not work properly all the time but it seems to me decent enough for now (if you have better ideas I would be happy to hear them).

One of the premises for useful auto completion suggestions is that the code is syntactically correct up to the caret position. Otherwise you can never be sure what the user actually intended. That also means, however, that you should be able to collect all symbols up to this point using a normal tree visitor. Symbols are usually managed in a symbol table and can be hierarchically organized; for instance you can have one for each source file plus a number of static ones, e.g. for functions from included libraries and runtimes. The antlr4-c3 module also contains a symbol table implementation.
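
As an illustration of the collection step, here is a small sketch in Kotlin against the ANTLR Java runtime (not the TypeScript symbol table itself; the declaration rule indices and the identifier token type are assumptions about your grammar):

import org.antlr.v4.runtime.ParserRuleContext
import org.antlr.v4.runtime.tree.ErrorNode
import org.antlr.v4.runtime.tree.ParseTreeListener
import org.antlr.v4.runtime.tree.TerminalNode

// Collects the names declared before the caret by walking the parse tree.
class DeclarationCollector(
    private val declarationRules: Set<Int>,   // rule indices that declare a symbol (e.g. RULE_param)
    private val idTokenType: Int,             // the identifier token type of your lexer
    private val caretTokenIndex: Int
) : ParseTreeListener {
    val declaredNames = mutableListOf<String>()

    override fun enterEveryRule(ctx: ParserRuleContext) {
        // only consider declarations that start before the caret
        if (ctx.ruleIndex in declarationRules && ctx.start.tokenIndex <= caretTokenIndex) {
            // take the first identifier of the declaration as the symbol name
            ctx.getToken(idTokenType, 0)?.let { declaredNames.add(it.text) }
        }
    }

    override fun exitEveryRule(ctx: ParserRuleContext) {}
    override fun visitTerminal(node: TerminalNode) {}
    override fun visitErrorNode(node: ErrorNode) {}
}

Walk the tree with ParseTreeWalker.DEFAULT.walk(collector, parseTree) and feed the collected names into the symbol table.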


Now the problem is when I need to give suggestions that are specific to the position of the caret. In a language where I can define functions, when an ID can follow the caret inside a function I want to suggest the parameters:

fun foo(Int myParam) Int {
   // here I would suggest myParam
}
// here I would NOT suggest myParam

I am not sure what is the best way to implement this: I would probably need to build incomplete ASTs.

My approach is different and works 100% of the time for everything that can be derived from the grammar. The problem is how to make domain knowledge available to the auto completion engine without creating a lot of extra code to handle language-specific parts. The solution I found is very simple: your grammar can hold that knowledge. You not only structure your grammar according to logical aspects, but in addition include rules for all entities (function names, parameter names, class names, you name it) you want to have auto completion for. Then, when you walk the ATN and reach the caret position, you only need to go up the rule call chain until you find one of these rules. You can then return the rule index as a candidate, and the caller knows exactly what type of symbol must be shown at that point. The surrounding editor code only has to configure the c3 engine by telling it which rules (rule indices, actually) are relevant and should be returned when found. I have explained that a bit more in the readme.md file mentioned above.
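
The consumer-side mapping then boils down to something like the following sketch (in Kotlin rather than the module's TypeScript; the RULE_* constants stand in for your generated parser's rule indices):

const val RULE_FUNCTION_REF = 7   // stand-in for MyParser.RULE_functionRef
const val RULE_VARIABLE_REF = 8   // stand-in for MyParser.RULE_variableRef

// Rules the completion engine should report as candidates instead of plain tokens.
val preferredRules = setOf(RULE_FUNCTION_REF, RULE_VARIABLE_REF)

// Map the candidate rules reported at the caret to concrete suggestions.
fun suggestionsFor(
    candidateRules: Set<Int>,
    visibleFunctions: List<String>,
    visibleVariables: List<String>
): List<String> {
    val result = mutableListOf<String>()
    if (RULE_FUNCTION_REF in candidateRules) result += visibleFunctions
    if (RULE_VARIABLE_REF in candidateRules) result += visibleVariables
    return result
}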


My typical scenario is that a user types:

fun foo(Int myParam) Int { // incomplete function
// here I would expect it to suggest myParam

Have you implemented a solution to similar problems? Do you have any suggestions?

One problem that might at first look very difficult to solve is scoping. Not all symbols are visible everywhere, hence you need a way to query *visible* symbols at a given point. This too can easily be done by a symbol table implementation. Contrary to what the name suggests, it's not really a table but a tree (some symbols, like blocks, function bodies or modules, can hold other symbols). When you visit your parse tree to collect symbols you *know* in which scope you are, hence you can add found symbols at the right position in the hierarchy, and querying visible symbols from a given symbol (position) is then trivial. My antlr4-c3 implementation contains functions for that: https://github.com/mike-lischke/antlr4-c3/blob/master/src/SymbolTable.ts#L283. It returns all symbols of a given type (e.g. all class names) visible from this symbol's position, including those from dependent symbol tables (thus traversing the entire hierarchy).
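
A much simplified sketch of such a scope tree (the real SymbolTable is richer; the class and method names here are only illustrative):

open class Symbol(val name: String)
class VariableSymbol(name: String) : Symbol(name)
class FunctionSymbol(name: String) : Symbol(name)

class Scope(val parent: Scope? = null) {
    private val symbols = mutableListOf<Symbol>()

    fun define(symbol: Symbol) { symbols.add(symbol) }
    fun createChildScope(): Scope = Scope(this)

    // Everything of the requested type visible from here: this scope's own symbols
    // plus, recursively, those of the enclosing scopes.
    fun <T : Symbol> visibleSymbolsOfType(type: Class<T>): List<T> =
        symbols.filterIsInstance(type) + (parent?.visibleSymbolsOfType(type) ?: emptyList())
}

Querying functionBodyScope.visibleSymbolsOfType(VariableSymbol::class.java) from inside the body of foo would then return myParam, while the same query from the outer scope would not, matching your example above.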

In order to see how to use all that, look either at the module's tests (https://github.com/mike-lischke/antlr4-c3/tree/master/test), which include code completion for a simple expression parser as well as for a C++14 grammar, or study the antlr4-graps code, which uses the c3 module.

Federico Tomassetti

May 20, 2017, 8:24:28 AM
to antlr-discussion


On Saturday, 20 May 2017 12:53:11 UTC+2, Mike Lischke wrote:
Hey Federico, I didn't know you were still working on your auto completion code :-)

Hi Mike,
thank you for your detailed answer! It was a pleasure to read it.
I hadn't worked on autocompletion for a while: my solution was good enough for my previous use case; now it is time to step up :)
 

I am currently working on improving my implementation of autocompletion for ANTLR. It is included in the open-source project Kanvas (a framework to build editors based on ANTLR).

I wish we could join forces and you would work with me on the Visual Studio Code extension for ANTLR4 (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4), instead of writing your own editor (but you would have to switch from Kotlin to TypeScript).

I would love to join forces. I am now writing a solution for desktop editors but I see the need also for web editors. Do you think your project could be adapted to work both in Visual Studio Code and in a web editor?
Your project is not using the Language Server Protocol, right?
I could work on both my project and yours.

Also, Kotlin compiles to JavaScript... just saying... :)
 

The extension I'm working on is based on 2 node modules I also wrote, namely:

- Code Completion Core for ANTLR4: https://github.com/mike-lischke/antlr4-c3
- ANTLR4 Grammar Parsing Service (graps): https://github.com/mike-lischke/antlr4-graps



The way I find the types of the tokens that can follow the caret is by interpreting the ATN data structure, following the transitions and remembering the states I have traversed, so that I can understand which states are incompatible with the path I have taken so far.

When I started with my implementation I also consulted your code, but had quite some trouble with it. Especially the recursion-per-state approach is a serious problem for complex grammars, not to mention the slowness I saw. Speed is actually one of the biggest concerns here; my own implementation is also too slow for complex grammars, which is why I implemented an ATN graph viewer in my vscode extension (http://www.soft-gems.net/files/atn.gif). That will hopefully help me better understand how to process the ATN.

I have spent some more time understanding the ATN. Previously I was not considering the ruleIndex of states to prune branches; now I do, and performance should be better for larger pieces of code. If you have some examples, maybe I could try to measure it.
 

A hint: don't simply track the states you have visited and ignore them later on. This will not work with multiple occurrences of the same subrule or with recursive rules.


That part works and I can figure out which types of tokens can follow. I call this part "syntactic autocompletion" and it works for all kinds of tokens but identifiers. I wrote the algorithm once and I can apply it without changes to all possible ANTLR parsers.

For identifiers I would need to understand which symbols are visible in a certain context and propose only those. This is of course language specific.

There is a way to make even this language neutral, provided a few conditions are met. Read the description of antlr4-c3 (https://github.com/mike-lischke/antlr4-c3/blob/master/readme.md) to understand how to solve that (a summary is below).


How do I know all the events and states? I grab them from the last version of the AST that I was able to build. This is not ideal and it does not work properly all the time but it seems to me decent enough for now (if you have better ideas I would be happy to hear them).

One of the premises for useful auto completion suggestions is that the code is syntactically correct up to the caret position. Otherwise you can never be sure what the user actually intended. That also means, however, that you should be able to collect all symbols up to this point using a normal tree visitor. Symbols are usually managed in a symbol table and can be hierarchically organized; for instance you can have one for each source file plus a number of static ones, e.g. for functions from included libraries and runtimes. The antlr4-c3 module also contains a symbol table implementation.


Now the problem is when I need to give suggestions that are specific to the position of the caret. In a language where I can define functions, when an ID can follow the caret inside a function I want to suggest the parameters:

fun foo(Int myParam) Int {
   // here I would suggest myParam
}
// here I would NOT suggest myParam

I am not sure what is the best way to implement this: I would probably need to build incomplete ASTs.

My approach is different and works 100% of the time for everything that can be derived from the grammar. The problem is how to make domain knowledge available to the auto completion engine without creating a lot of extra code to handle language-specific parts. The solution I found is very simple: your grammar can hold that knowledge. You not only structure your grammar according to logical aspects, but in addition include rules for all entities (function names, parameter names, class names, you name it) you want to have auto completion for. Then, when you walk the ATN and reach the caret position, you only need to go up the rule call chain until you find one of these rules. You can then return the rule index as a candidate, and the caller knows exactly what type of symbol must be shown at that point. The surrounding editor code only has to configure the c3 engine by telling it which rules (rule indices, actually) are relevant and should be returned when found. I have explained that a bit more in the readme.md file mentioned above.

I have considered using your approach of introducing rules that could be used as hints for auto-completion (e.g., ParamRef or VariableRef). I think it would help, but on one hand I can see users misusing it and later opening tickets :D, and on the other hand there are the scoping rules anyway, which make things not so easy to solve. I have functions for scope resolution that I use later in interpreters, compilers, validators, etc., and I would like to reuse them for auto-completion.
 


My typical scenario is that a user types:

fun foo(Int myParam) Int { // incomplete function
// here I would expect it to suggest myParam

Have you implemented a solution to similar problems? Do you have any suggestions?

One problem that might at first look very difficult to solve is scoping. Not all symbols are visible everywhere, hence you need a way to query *visible* symbols at a given point. This too can easily be done by a symbol table implementation. Contrary to what the name suggests, it's not really a table but a tree (some symbols, like blocks, function bodies or modules, can hold other symbols). When you visit your parse tree to collect symbols you *know* in which scope you are, hence you can add found symbols at the right position in the hierarchy, and querying visible symbols from a given symbol (position) is then trivial. My antlr4-c3 implementation contains functions for that: https://github.com/mike-lischke/antlr4-c3/blob/master/src/SymbolTable.ts#L283. It returns all symbols of a given type (e.g. all class names) visible from this symbol's position, including those from dependent symbol tables (thus traversing the entire hierarchy).

In order to see how to use all that, look either at the module's tests (https://github.com/mike-lischke/antlr4-c3/tree/master/test), which include code completion for a simple expression parser as well as for a C++14 grammar, or study the antlr4-graps code, which uses the c3 module.

This morning I worked a little more on my approach. Currently I do the following:
- interpret the ATN with my own code to find the possible next tokens at the point of the caret
- get the parse tree from ANTLR (the generated Java classes) and translate it into my AST (my Kotlin classes)
- find the element of my AST corresponding to the position of the caret and ask it which symbols are visible

For example, I have this language that permits defining nested functions. This is the auto-completion code I use when encountering an ID token:

when (autocompletionSurroundingInformation.rulesStack.last()) {
    "function" -> {
        // just suggesting a name for the function about to be defined;
        // I could check the AST to verify it is unique
        res.add("myFun")
    }
    "param" -> {
        // same as before, just suggesting a name for a new param
        res.add("myParam")
    }
    "expression" -> {
        // here I could have value references or function calls.
        // If I had used your approach of having different rules I could differentiate
        // them; however I am just using labels and I do not see labels in the ATN,
        // so they are just expressions and I suggest all visible variables and functions.
        // Note that allVisibleValues and allVisibleFunctions are not trivial functions:
        // they are defined on my AST classes.

        currentNode?.allVisibleValues(ast!!.childParentMap())?.forEach { res.add(it.name!!) }
        currentNode?.allVisibleFunctions(ast!!.childParentMap())?.forEach { res.add(it.name!!) }
    }
}

I will look at your code to get inspired, but I am seriously interested in working on this together. We can discuss it here or by email.

Federico
 

Mike Lischke

May 21, 2017, 12:02:02 PM
to antlr-di...@googlegroups.com
Hi Federico,

I wish we could join forces and you would work with me on the Visual Studio Code extension for ANTLR4 (https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4), instead of writing your own editor (but you would have to switch from Kotlin to TypeScript).

I would love to join forces. I am now writing a solution for desktop editors but I see the need also for web editors. Do you think your project could be adapted to work both in Visual Studio Code and in a web editor?

Hmm, the two node modules c3 and graps certainly. The vscode extension, however, obviously not, as it is made specifically for vscode.

Your project is not using the Language Server Protocol, right?

No, not at the moment. I wanted to start simple and so far everything is fast enough for a direct implementation. The server protocol is more targeted at multi-file processing (e.g. linting a bunch of files or refactoring over multiple files).

I could work on both my project and yours.

Would be awesome.


Also, Kotlin compiles to JavaScript... just saying... :)

Ah, cool :-)

I have considered using your approach of introducing rules that could be used as hints for auto-completion (e.g., ParamRef or VariableRef). I think it would help, but on one hand I can see users misusing it and later opening tickets :D, and on the other hand there are the scoping rules anyway, which make things not so easy to solve. I have functions for scope resolution that I use later in interpreters, compilers, validators, etc., and I would like to reuse them for auto-completion.

That's perfect. If you already have rules defined that also make up the relevant entities for code completion, you essentially have everything in place. It's not a conflict, but a perfect complement.


This morning I worked a little more on my approach. Currently I do the following:
- interpret the ATN with my own code to find the possible next tokens at the point of the caret
- get the parse tree from ANTLR (the generated Java classes) and translate it into my AST (my Kotlin classes)
- find the element of my AST corresponding to the position of the caret and ask it which symbols are visible

I find it awkward to create an AST when you have a parse tree. An AST is essentially the input stream together with some artificial hierarchy (I have used that extensively myself with ANTLR3), but you get all that from the parse tree anyway (and more). In fact, with a parse tree your grammar does more than just define how to match your language: it can be used to define structure/hierarchy explicitly. Really, you don't need an AST anymore in ANTLR4.



For example, I have this language that permits defining nested functions. This is the auto-completion code I use when encountering an ID token:

when (autocompletionSurroundingInformation.rulesStack.last()) {
    "function" -> {
        // just suggesting a name for the function about to be defined;
        // I could check the AST to verify it is unique
        res.add("myFun")
    }
    "param" -> {
        // same as before, just suggesting a name for a new param
        res.add("myParam")
    }
    "expression" -> {
        // here I could have value references or function calls.
        // If I had used your approach of having different rules I could differentiate
        // them; however I am just using labels and I do not see labels in the ATN,
        // so they are just expressions and I suggest all visible variables and functions.
        // Note that allVisibleValues and allVisibleFunctions are not trivial functions:
        // they are defined on my AST classes.

        currentNode?.allVisibleValues(ast!!.childParentMap())?.forEach { res.add(it.name!!) }
        currentNode?.allVisibleFunctions(ast!!.childParentMap())?.forEach { res.add(it.name!!) }
    }
}

This is essentially the only language-specific part the code completion core cannot do implicitly. You have to resolve the suggested items (in c3, the rule indices) to symbol types. That's usually a very trivial process, but it nonetheless must be done in consumer code.


I will look at your code to get inspired, but I am seriously interested in working on this together. We can discuss it here or by email.

Best is probably to just look at how things work in vscode. For a start you can also use an interactive node shell and play with the c3 + graps classes. Just mail me directly when you need help (e.g. with how to develop/debug an extension in vscode).

Btw, I have a patch pending for ANTLR4 which allows exporting interpreter data (symbols and the serialized ATN) to a separate file. This is mostly useful for setting up interpreters in targets other than Java (for grammar debugging). This info can also be used for code completion (but then you cannot run predicates, since you have no full parser). I'm still thinking about an approach to completely move target code out of ANTLR grammars and have a different way to define predicates, which would also allow running them in a debugger or during code completion.

Mike

Federico Tomassetti

May 23, 2017, 8:18:57 AM
to antlr-discussion



This morning I worked a little more on my approach. Currently I do the following:
- interpret the ATN with my own code to find the possible next tokens at the point of the caret
- get the parse tree from ANTLR (the generated Java classes) and translate it into my AST (my Kotlin classes)
- find the element of my AST corresponding to the position of the caret and ask it which symbols are visible

I find it awkward to create an AST when you have a parse tree. An AST is essentially the input stream together with some artificial hierarchy (I have used that extensively myself with ANTLR3), but you get all that from the parse tree anyway (and more). In fact, with a parse tree your grammar does more than just define how to match your language: it can be used to define structure/hierarchy explicitly. Really, you don't need an AST anymore in ANTLR4.

Well, for me an AST works best for different reasons. One of them is that I really do not like the classes generated by ANTLR. I prefer to write my own, with all classes sharing a common ancestor, and I do some transformations between the parse tree and the AST. If the language is simple they stay similar, but if it is not they can differ quite a bit. Also, for complex languages I may want to do transformations on the AST to simplify interpreting or compiling.




For example, I have this language that permits defining nested functions. This is the auto-completion code I use when encountering an ID token:

when (autocompletionSurroundingInformation.rulesStack.last()) {
    "function" -> {
        // just suggesting a name for the function about to be defined;
        // I could check the AST to verify it is unique
        res.add("myFun")
    }
    "param" -> {
        // same as before, just suggesting a name for a new param
        res.add("myParam")
    }
    "expression" -> {
        // here I could have value references or function calls.
        // If I had used your approach of having different rules I could differentiate
        // them; however I am just using labels and I do not see labels in the ATN,
        // so they are just expressions and I suggest all visible variables and functions.
        // Note that allVisibleValues and allVisibleFunctions are not trivial functions:
        // they are defined on my AST classes.

        currentNode?.allVisibleValues(ast!!.childParentMap())?.forEach { res.add(it.name!!) }
        currentNode?.allVisibleFunctions(ast!!.childParentMap())?.forEach { res.add(it.name!!) }
    }
}

This is essentially the only language-specific part the code completion core cannot do implicitly. You have to resolve the suggested items (in c3, the rule indices) to symbol types. That's usually a very trivial process, but it nonetheless must be done in consumer code.



I will look at your code to get inspired, but I am seriously interested in working on this together. We can discuss it here or by email.

Best is probably to just look at how things work in vscode. For a start you can also use an interactive node shell and play with the c3 + graps classes. Just mail me directly when you need help (e.g. with how to develop/debug an extension in vscode).

Btw, I have a patch pending for ANTLR4 which allows exporting interpreter data (symbols and the serialized ATN) to a separate file.

Very interesting!
 
This is mostly useful for setting up interpreters in targets other than Java (for grammar debugging). This info can also be used for code completion (but then you cannot run predicates, since you have no full parser).

Well, I try to never use predicates anyway

Mike Lischke

May 23, 2017, 8:44:29 AM
to antlr-di...@googlegroups.com

I find it awkward to create an AST when you have a parse tree. An AST is essentially the input stream together with some artificial hierarchy (I have used that extensively myself with ANTLR3), but you get all that from the parse tree anyway (and more). In fact, with a parse tree your grammar does more than just define how to match your language: it can be used to define structure/hierarchy explicitly. Really, you don't need an AST anymore in ANTLR4.

Well, for me an AST works best for different reasons. One of them is that I really do not like the classes generated by ANTLR. I prefer to write my own, with all classes sharing a common ancestor, and I do some transformations between the parse tree and the AST. If the language is simple they stay similar, but if it is not they can differ quite a bit. Also, for complex languages I may want to do transformations on the AST to simplify interpreting or compiling.

OK, I see. Of course your concrete use case dictates how to structure things.


Well, I try to never use predicates anyway
 

I do, for implementing versioned grammars. It allows me to enable and disable grammar parts depending on a version number. That's important when your grammar has to support language variations that came in over time.

