Help Improving Kotlin Grammar Memory Usage

Geoffrey Challen

unread,

Sep 24, 2022, 1:54:33 PM9/24/22

to antlr-discussion

We're using the official Kotlin ANTLR4 grammar (https://kotlinlang.org/docs/reference/grammar.html) for a few code analysis tasks in a long running server application. However, we've noticed some problems with the way that the grammar is structured that lead to poor performance—specifically very high memory usage.

(Note that my knowledge of this area is limited, so apologies in advance if I use any incorrect terminology.)

Specifically, the Kotlin grammar includes many pass-through rules that end up being reached even when specific language features are not used. For example, here's the definition for the rule that matches the Kotlin elvis operator:

elvisExpression

: infixFunctionCall (NL* elvis NL* infixFunctionCall)*

;

However, this is also the only way to reach the infixFunctionCall rule, which then passes through to the rangeExpression rule, and so and so forth.

The result is that even parsing a short cascade if ends up generating a massive parse tree because getting from one if to the next involves traversal through dozens of rules that aren't matching but are required to reach the rule that does. Even a moderate cascade if can easily blow the stack, and we're noticing that repeatedly utilizing the Kotlin grammar is causing the ANTLR4 caches to grow quickly and eventually produce memory pressure and eventually a crash. I'm assuming that there is some reason that the grammar is designed in this way. But in practice the design is causing performance and stability problems. (This toolchain is used to do analysis of student submissions in CS1, so the use of moderately-sized cascade if's isn't all that uncommon.)

(This also makes certain analysis tasks more irritating, since the presence of a elvisExpression rule match in the parse tree does not indicate that there was actually an elvis expression used in the code. But performance is the higher-order concern at this point.)

So a few questions. First, are there any potential performance optimizations that we can turn on that would help us continue to use the existing Kotlin grammar? I noticed references to a performance optimized ANTLR fork, but it's not clear if it addresses this specific issue.

Second, assuming that some degree of grammar optimization is required, is there an extremely mechanical way of modifying the existing grammar to address these performance problems, perhaps by reducing the number of rules or chain length? I obviously don't want to do anything that would affect correctness, and we probably will need to occasionally incorporate upstream changes. If this could be automated, that would be amazing :-).

FWIW we're also using the Java grammar, which doesn't seem to have any of the same issues, but is also structurally quite different.

Ken Domino

unread,

Sep 24, 2022, 10:05:45 PM9/24/22

to antlr-discussion

This grammar defines the syntax for expressions in a classic anti-pattern for Antlr. Replace the expression rules in the parse tree chain with a rule for expression that orders the alts in the order of operator precedence (which is the reverse of the order of nodes in the chain):

expression
: primaryExpression
| expression postfixUnarySuffix+
| unaryPrefix+ expression
| expression NL* asOperator NL* type_
| expression multiplicativeOperator NL* expression
| expression additiveOperator NL* expression
| expression '..' NL* expression
| expression simpleIdentifier NL* expression
| expression NL* elvis NL* expression
| expression isOperator NL* type_
| expression inOperator NL* expression
| expression comparisonOperator NL* expression
| expression equalityOperator NL* expression
| expression NL* '&&' NL* expression
| expression NL* '||' NL* expression
;

postfixUnaryExpression
: primaryExpression
| expression postfixUnarySuffix+
;
prefixUnaryExpression
: unaryPrefix* postfixUnaryExpression
;

You'll have to include postfixUnaryExpression and prefixUnaryExpression because these are entry points from other parts of the grammar.

This is pretty much off the top of my head, so I would be careful to verify that it works exactly as intended.

--Ken

Geoffrey Challen

unread,

Sep 26, 2022, 10:12:00 PM9/26/22

to antlr-discussion

Thanks so much for your reply! I applied these changes and I definitely think that they helped.

FWIW, after a bit more digging I discovered probably the core of the problem, which was the definition for functionDeclaration. It had options both to match with and without a body, and this was where ANTLR4 was spending most of its time. After some grammar refactoring to avoid empty method bodies unless certain conditions were true (inside interface declarations, or prefixed with abstract), the nastiest performance bottleneck is gone. (Of course, it will still rear its ugly head when parsing interface declarations, but hopefully you don't find massive if-else chains in default interface methods).

ivan.ko...@gmail.com

unread,

Sep 27, 2022, 12:21:35 PM9/27/22

to antlr-discussion

I'd recommend using parsetree-free approach (such option exists in parser) to get rid of creating intermediate objects and creating AST on the fly. It reduces memory consumption a lot.

вторник, 27 сентября 2022 г. в 04:12:00 UTC+2, geoffrey...@gmail.com:

Geoffrey Challen

unread,

Sep 27, 2022, 12:47:03 PM9/27/22

to antlr-discussion

Thanks! Unfortunately we need the parse tree for later processing, but I did find an option to trim it which seems to work and seems to perhaps reduce memory usage.

Reply all

Reply to author

Forward