Article on improving the performance of ANTLR parsers


Federico Tomassetti

unread,
Dec 16, 2020, 4:58:09 AM12/16/20
to antlr-discussion
Hi,
Several clients have asked us over time to help them improve the performance of their parsers, so we put together a list of things we normally look into and wrote an article on that:
https://tomassetti.me/improving-the-performance-of-an-antlr-parser/

It would be very useful for us to receive comments and suggestions to improve the article and learn more about optimizing parsers.

And perhaps people new to ANTLR could find it useful.

Cheers,
Federico

Walter Dörwald

unread,
Dec 16, 2020, 1:21:57 PM12/16/20
to antlr-discussion

It is surprising that the Python runtime is so much slower than the Java runtime.

Has there been any work done to get the Python runtime faster?

The Python code doesn't look that pythonic to me, which might be a contributing factor to the lack of speed. For example, in ListTokenSource.py, ListTokenSource.nextToken() has the following lines:

if len(self.tokens) > 0:
    previousStop = self.tokens[len(self.tokens) - 1].stop

A more pythonic (and faster) version of this would be:

if self.tokens:
    previousStop = self.tokens[-1].stop

The index access via -1 is three times faster:

Python 3.9.1 (default, Dec 10 2020, 10:36:35)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
🐍 — 1 » x = list("foobarbaz")
🐍 — 2 » x
🐍 — 2 « ['f', 'o', 'o', 'b', 'a', 'r', 'b', 'a', 'z']
🐍 — 3 » %timeit x[len(x)-1]
188 ns ± 3.39 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
🐍 — 4 » %timeit x[-1]
66.4 ns ± 0.821 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Searching for [len( reveals several spots in the code where that change might be applied. I don't know whether these are performance-critical spots or whether these micro optimisations would change that much, but they should be safe to apply.

There might be many more of these low-hanging fruits.
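To make the pattern concrete, here is a minimal sketch of that rewrite as a pair of standalone functions (the function names and the bare list are invented for illustration; the real code operates on the token list inside ListTokenSource):

```python
def last_token_old_style(tokens):
    # Style currently found in the runtime: explicit length checks.
    if len(tokens) > 0:
        return tokens[len(tokens) - 1]
    return None


def last_token_pythonic(tokens):
    # Equivalent, faster, and more idiomatic: truthiness + negative index.
    if tokens:
        return tokens[-1]
    return None
```

Both behave identically on empty and non-empty lists, so the substitution should be safe wherever the object is a real list.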

Would there be interest in a patch that applies these kinds of changes? Where does the source code of the Python runtime live? (https://pypi.org/project/antlr4-python3-runtime/ only points to https://www.antlr.org/.)

Servus,
Walter

eric.v...@wanadoo.fr

unread,
Dec 16, 2020, 7:32:55 PM12/16/20
to antlr-discussion
Hi,

1) many Python developers don't think Python is that slow because they use algorithms implemented in C, with Python being just a (very convenient) wrapper.
ANTLR unfortunately demonstrates that the exact same code runs between 20 and 30 times slower on average. This is not news, see:

2) there is no ambition to make the code pythonic. It is much more important for the maintainers to be able to compare code across targets than to comply with an arbitrary coding standard.
Unless there is an observable improvement, it is better to stick to the reference implementation (Java).

3) there is definitely interest in any PR that improves performance. The code lives here: 
and here

rtm...@googlemail.com

unread,
Dec 18, 2020, 7:48:17 AM12/18/20
to antlr-discussion
If it's only 20 or 30 times slower that would be fine, but last I looked it was far worse: orders of magnitude (https://github.com/antlr/antlr4/issues/1219). Has it improved since then?

I did start looking at it, but digging through a code base I'm unfamiliar with, implementing an algorithm I'm unfamiliar with, with little assistance, would have taken too long.


jan

eric.v...@wanadoo.fr

unread,
Dec 18, 2020, 10:06:26 PM12/18/20
to antlr-discussion
@rtm
There was at the time a bug affecting performance which has been fixed I believe in 4.6, so you might want to give it another go?

Mike Lischke

unread,
Dec 19, 2020, 1:17:06 PM12/19/20
to antlr-discussion
Hi Federico, Gabriele,


Several clients have asked us over time to help them improving the performance of their parsers, so we put together a list of things we normally look into and wrote an article on that:
https://tomassetti.me/improving-the-performance-of-an-antlr-parser/

It would be very useful for us to receive comments and suggestions to improve the article and learn more about optimizing parsers.

That's an idea I've also had in mind for quite some time. ANTLR4 parser optimisation is something that frequently comes up, if only for your own software. As such your blog post is very welcome!

Below are some thoughts I collected while reading it:

- In general you tend to write pretty lengthy articles. That's not bad per se, but they could use some more structure. You could either split the individual sections into separate parts (pages) of the overall posting or at least provide a way to easily skip over certain areas (e.g. a table of contents with linked titles, collapsible sections or similar).

- At least from my standpoint the discussion of different runtimes is something I would not want in a posting about performance optimisation. Of course you will see big differences in execution, depending on the target language used (and if you mention it at all then it should definitely state what the fastest runtimes are: C++ and C#). But first, you can rarely change the runtime at will, so it is usually not an option for performance tuning, and second, every other optimisation you can do (mostly grammar tuning) applies to all runtimes equally.

- The C++ runtime never had memory-related performance problems that later got resolved. It's a fact that the C++ runtime has to use certain memory management functionality that can be slow (relatively speaking, of course; even the slowest approach is orders of magnitude faster than any scripting language like JavaScript or Python).

- Talking about ASTs in the context of ANTLR4 based parsers is a bit misleading. IMO there's enough confusion about parse trees vs. ASTs, e.g. on Stackoverflow. ASTs are neither needed to optimise parser performance, nor are they even created by ANTLR4 based parsers.

- This sentence: "That is to say, we cannot tell you how using a different runtime will impact your grammar." is probably wrong. I guess you wanted to write "parser" instead of "grammar".

- The "Some Example Numbers" section is a bit strange, as if you want to prove that Python is slower than Kotlin. But that's not the subject of this article. I believe the vast majority of your readers will believe that Python is slower. It just makes the entire article longer with no benefit for the reader.

After that we come to the meat of the article, which I enjoyed reading.

- It would be interesting to see real numbers that prove that reducing ambiguity really improves performance. The ALL(*) algorithm, as I understood it, just picks one alt, depending on certain rules. There's no time penalty when resolving ambiguities. I'd be happy to learn something new here, if I'm wrong.

- Complexity of the grammar is certainly a point! When you mention a path the parser chooses, you should probably mention which path you actually mean: the path through the ATN (which requires additional explanation in your article, which would be more interesting for hard core optimisation than, say, switching the runtime).

- I also think the article could benefit from an abstract description of the parsing process (unlimited look ahead, possibly guided by predicates), which would help in the section further down where you mention the effect of predicates.

- This sentence "If the two alternatives are an over-complicated rule or a semantic predicate, the semantic predicate can be the better option." is a bit confusing. A predicate alone cannot be an alternative. It always is just a gateway to an alternative. Additionally, if a guarded alternative appears after an unguarded one, it will not speed up the parsing process, regardless of the complexity of the unguarded alt, because the alt evaluation happens top down in a linear fashion.

- The section "Are semantic predicates good or bad?" could use some additional content, because predicates are a very important aspect when it comes to parsing speed.

- I would expect that you answer the question you asked in the title.

- In parser rules a predicate must appear on the left hand side to have any impact in the parsing process (see also https://github.com/antlr/antlr4/blob/master/doc/predicates.md#semantic-predicates). But that has only as much performance impact as time is needed to evaluate the predicate. However, alternatives containing a left-hand predicate will not be stored in the DFA cache, to be able to re-evaluate them on each parse run. That could have an additional impact if a rule with gated alternatives has many alternatives. Would be good to do measurements with and w/o predicates to find reliable numbers.
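A minimal sketch of that placement rule (rule, token, and predicate names are invented for illustration):

```antlr
// The predicate is on the left edge of the alternative, so it participates
// in prediction; alternatives gated this way are excluded from the DFA
// cache and re-evaluated on every parse run.
statement:
    {allowExtensions}? EXTENSION_SYMBOL extensionBody
    | standardStatement
    ;
```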

- For lexer rules it's absolutely essential to put the predicate at the end, because here the effect of no caching is really heavy. All lexer rules are handled as if they were alts in a single big outer rule. So, if only one lexer rule contains a predicate on the left hand side, no lexer rule is cached at all! When you consider that all your keywords are also lexer rules, you can easily imagine how thousands of lexer rules behave, performance-wise, if they are not cached. Side note: the DFA cache was one of the major performance improvements in ANTLR4.
  
  For the C++ runtime this became an extra burden because of the way memory is handled. I described this in my announcement blog post: https://soft-gems.net/the-antlr4-c-target-is-here/. But that is a runtime-specific issue that can easily be avoided by using predicates as recommended (as you have mentioned in your blog post).

- This sentence "The downside is that it is practically impossible to know beforehand whether moving the semantic predicate inside the lexer rule will improve performance or worsen it." can be improved by pointing out that putting the predicate on the left hand side is really bad.

- I doubt that "It can also improve it because it can quickly kill the exploration of a rule that is not valid in most cases." is true, given the above information. You can use gated alts (or conditional lexer rules), but the shortcut benefit is far less than the no-caching penalty. Also, you should make clearer that lexer predicates are executed for each iteration in a lexer rule and should therefore be especially quick to evaluate. Example: "ID: [a-zA-Z]+ {isKeyword($text)}?;" If you have an ID with 1000 characters, the predicate "isKeyword" is executed 1000 times.


The section "Handling Expressions":

- Expression rules are especially sensitive because usually expressions can be nested with no limit. This actually has some severe implications:

- ANTLR4 based parsers enter a new stack frame for each evaluation of a rule. If you have a large set of expression rules (implementing precedence as explained in the article), which call each other recursively, you might quickly reach the maximum stack depth of your environment. The worst result of that effect is that, with large enough expressions, your entire app can crash unpredictably! Depending on the target language you don't have much control over how to deal with such a situation. Increasing the stack depth is possible in some languages (e.g. C++), but not on all platforms. Threads (except for the main thread) usually have a pretty small stack size (IIRC only 256KB or 512KB).

- Because of that effect it might sometimes not be possible to parse in a separate thread. This might cause performance problems too.
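The stack-depth hazard is easy to reproduce outside ANTLR. This hypothetical sketch mimics a recursive-descent parser entering one stack frame per nesting level; in Python the overflow surfaces as a catchable RecursionError, whereas a C++ parser would typically crash outright:

```python
def parse_nested_expr(depth):
    # Stand-in for one expression-rule invocation per nesting level,
    # e.g. parsing "((((...))))" with `depth` parentheses.
    if depth == 0:
        return 0
    return 1 + parse_nested_expr(depth - 1)


def try_parse(depth):
    try:
        return parse_nested_expr(depth)
    except RecursionError:
        # A real parser would have to abort (or crash) at this point.
        return None
```

With CPython's default recursion limit (around 1000 frames), even modestly nested input exhausts the stack.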

- Rewriting nested expression rules as a single (direct left recursive) rule is only possible for pretty simple expression types. As soon as you have more than a few operators things really get messy, and your argument that hierarchical expression rules are ugly and/or cumbersome is something I dare to disagree with :-) Of course it also depends on the person working with the grammar, but IMO it's much easier to understand how precedence is organised when you use hierarchical expression rules.

- Direct left recursive rules are no improvement from a memory or speed standpoint, because internally they are converted to non-left-recursive rules and enter a new stack frame on each invocation. Additionally, it might be more difficult to understand operator precedence with them.
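For comparison, here is roughly what the two styles look like side by side (token and rule names invented; a sketch, not taken from a real grammar):

```antlr
// Direct left recursion: precedence is implied by alt order only.
expr:
    expr (MULT_OPERATOR | DIV_OPERATOR) expr
    | expr (PLUS_OPERATOR | MINUS_OPERATOR) expr
    | atom
    ;

// Hierarchical form: precedence is explicit in the rule chain. This is
// close to what ANTLR generates internally from the rule above.
addExpr: mulExpr ((PLUS_OPERATOR | MINUS_OPERATOR) mulExpr)*;
mulExpr: atom ((MULT_OPERATOR | DIV_OPERATOR) atom)*;
```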

- "The more complicated design has a better performance because it is not ambiguous." is a confusing sentence. Where's the ambiguity here? Direct left recursive rules are not ambiguous. Ambiguity is the situation where more than a single alt can be taken with the same input.

- The example taken from the documentation doesn't help the grammar writer to find a better way here, as it demonstrates how operator precedence is implemented in left recursive rules. If you want to explain what happens internally with left-recursive rules you could instead provide the algorithm how to manually convert direct-left-recursive rules to non-recursive rules. I have described this before on SO: https://stackoverflow.com/questions/41788100/antlr4-mutual-left-recursion-grammar/41789097#41789097. But again: this won't help much to improve performance.

The section "Ensuring you are parsing the whole input"

- Tbh. I don't understand why this section exists in this blog post. EOF at a rule's end doesn't affect performance at all.

- The hint about the lexer that could slow down because of a missing EOF is highly misleading, since EOF is itself a (built-in) token and only used in the parser grammar.

- A missing EOF doesn't parse input differently, it only ignores parts that follow what's defined by your parser rules. Up to this point the parser run is exactly the same with and w/o EOF. It might fool you into believing your input is correct, though, even if it contains more than what is allowed.

- Adding a catch-all lexer rule can also have an adverse effect, because it will change error handling. Without it you might get an error like "invalid input [" (if you normally don't handle square brackets). With the "ANY" rule, this changes to "unexpected ANY". Which one helps more to diagnose the error is pretty clear. Note also: the catch-all rule must not be used in your parser, or errors might not be reported at all.
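A small sketch of both points (rule names invented):

```antlr
// EOF anchors the start rule, so trailing unmatched input is reported
// as an error instead of being silently ignored.
script: statement* EOF;

// Catch-all lexer rule: placed last, it suppresses token recognition
// errors, turning e.g. "invalid input '['" into "unexpected ANY".
// It must not be referenced by any parser rule.
ANY: .;
```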

The section "Changing the Prediction Mode"

This is mostly fine, still I have a few comments:

- You may want to make clearer that the BailErrorStrategy and the prediction mode are two different optimisation measures:

- The bail-out strategy is quicker because it throws an exception on error that is not caught in the runtime (all other parse exceptions are caught and forwarded as error callbacks via the error listeners). That way the parser will not make any recovery attempt and directly stops the parsing process. Important here is that you have to catch the ParseCancellationException in your code, or it might bubble up further and crash your application (if there's no global catch-all exception handler by default, like in C++). But with that special exception parsing is stopped as quickly as possible if there's an error.

- The prediction mode allows the parser to do simpler prediction if your grammar allows that. You will get faster results in case of correct input. If you mostly have to deal with correct input then the two-step approach (first SLL, then LL prediction mode) is actually faster. SLL can report a syntax error even if there's none (because of the simpler handling), in which case you have to confirm with LL prediction mode whether there's really an error. That means if you mostly have errors in the input then directly hopping to LL (w/o first trying SLL) is the better option. But also in this case you can get away with SLL alone, if your grammar has no ambiguities.
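The two-step approach can be sketched with the Python runtime roughly like this (MyLexer, MyParser, and startRule are placeholders for your generated classes and entry rule; the internal attribute names are worth double-checking against your runtime version):

```python
from antlr4 import CommonTokenStream, InputStream
from antlr4.atn.PredictionMode import PredictionMode
from antlr4.error.ErrorStrategy import BailErrorStrategy, DefaultErrorStrategy
from antlr4.error.Errors import ParseCancellationException

def two_step_parse(text):
    lexer = MyLexer(InputStream(text))
    tokens = CommonTokenStream(lexer)
    parser = MyParser(tokens)
    # Step 1: fast SLL prediction, bailing out on the first error.
    parser._interp.predictionMode = PredictionMode.SLL
    parser._errHandler = BailErrorStrategy()
    try:
        return parser.startRule()
    except ParseCancellationException:
        # SLL flagged an error; confirm with full LL prediction,
        # which also restores normal error reporting/recovery.
        tokens.seek(0)
        parser.reset()
        parser._interp.predictionMode = PredictionMode.LL
        parser._errHandler = DefaultErrorStrategy()
        return parser.startRule()
```

If the input is usually correct, the SLL pass succeeds almost always and the LL fallback only runs for genuinely erroneous (or genuinely ambiguous) input.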

- Your last sentence in this section makes no sense. Parse strategy and prediction mode have nothing to do with input sizes. Both are mostly related to error handling.

The sections "Profiling the parser" and "Profiling the lexer" are a great addition to this article. Maybe you can go into more detail in a separate blog post about all the values you get from profiling? Would be really helpful to understand their meaning, the usual values you get, the limits that are still ok etc.

The section "Caching"

- This seems to be a wishy-washy section as it doesn't provide much info to actually solve a performance problem. If you had shown an example of, say, how to save the parse state for each line (aka incremental parsing) it would have been an absolutely great addition (because incremental parsing is currently not supported by ANTLR4 based parsers and the GitHub patch which would add it is still pending IIRC).

- Also there you mention the AST again, which is not helping with performance and is rather noise, since ANTLR4 doesn't generate an AST.


Some additional optimisations and good style

Below are some thoughts I collected over time, which I wanted to make into an article about what you should do to improve your grammar, what you better should not do and what is good grammar style. Maybe you want to enhance your article to become the resource I had in mind originally?

⦁ It's not so obvious when you only look at your grammar, but with the right consolidation of rule sub parts you can greatly improve speed. That idea helped me to remove a bottleneck in my grammar where many keywords were mentioned as alternatives for an identifier (the well known keyword-as-identifier problem). Look at the following rules:

indexTypeClause:
    USING_SYMBOL indexType
    | TYPE_SYMBOL indexType
;

indexTypeClauseOpt:
    (USING_SYMBOL | TYPE_SYMBOL) indexType
;


The first shows a typical case where different keywords define alternatives with a common tail. For the first rule you get this ATN:


This is a pretty normal network. The parser has to visit two branches to find the one that matches. Now look at the picture from the second rule:



There's only a single branch to take and the check for the keyword to match is a simple find-in-a-set operation. Now imagine a rule with 100 such keywords (lexer tokens), each in its own alt, which would result in 100 branches to check. With a simple set of parentheses around just the lexer tokens you can get a pretty remarkable speed increase, in particular if that rule is used often (like `identifier`). Here's a real-world example from my MySQL grammar:

interval:
    intervalTimeStamp
    | (
        SECOND_MICROSECOND_SYMBOL
        | MINUTE_MICROSECOND_SYMBOL
        | MINUTE_SECOND_SYMBOL
        | HOUR_MICROSECOND_SYMBOL
        | HOUR_SECOND_SYMBOL
        | HOUR_MINUTE_SYMBOL
        | DAY_MICROSECOND_SYMBOL
        | DAY_SECOND_SYMBOL
        | DAY_MINUTE_SYMBOL
        | DAY_HOUR_SYMBOL
        | YEAR_MONTH_SYMBOL
    )
;


Note: this only works with single lexer tokens in each alt. As soon as there's more than that single token, it will be moved out into its own branch in the ATN.


⦁ You can take this idea further, by collecting all common leading or trailing rule parts into a single entity to avoid lengthy alt checks. Here's a nice example from the MySQL grammar (with a twist):

bitExpr:
    simpleExpr
    | bitExpr op = BITWISE_XOR_OPERATOR bitExpr
    | bitExpr op = (
        MULT_OPERATOR
        | DIV_OPERATOR
        | MOD_OPERATOR
        | DIV_SYMBOL
        | MOD_SYMBOL
    ) bitExpr
    | bitExpr op = (PLUS_OPERATOR | MINUS_OPERATOR) bitExpr
    | bitExpr op = (PLUS_OPERATOR | MINUS_OPERATOR) INTERVAL_SYMBOL expr interval
    | bitExpr op = (SHIFT_LEFT_OPERATOR | SHIFT_RIGHT_OPERATOR) bitExpr
    | bitExpr op = BITWISE_AND_OPERATOR bitExpr
    | bitExpr op = BITWISE_OR_OPERATOR bitExpr
;


Several alts use this approach here, but you need to be careful in left-recursive rules, because if you combine operators with different precedence levels you might change the precedence completely. That's why the alt for MULT/DIV etc. is not combined with PLUS/MINUS, even though they share a common prefix.


⦁ Avoid leading optional values. What this means is that you should not start a rule or alt with an optional value. Check this rule:

simpleIdentifier:
    identifier (dotIdentifier dotIdentifier?)?
;

That results in this ATN:


You can see there's a check for both optional dotIdentifiers. If there's no dot then no further check for the second optional part is done.

You could also write that as:

simpleIdentifier:
    (identifier DOT_SYMBOL)? (identifier DOT_SYMBOL)? identifier
;


However, the second form requires 2 checks instead of one with some optional parts and one more cached path in the DFA cache:


Not so much a performance hog, but it can save you from ambiguity trouble, especially when this rule is used in another rule where an identifier could also be matched.


⦁ Speaking of ambiguities: there's a case that can get you into trouble which is sometimes hard to solve: rules that can match nothing:

test:
    identifier*
;

You should rather make the call to it optional instead. Never combine both: a rule with a potential empty match and an optional call. However, ANTLR4 will warn about that.
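In other words (sketch, names invented):

```antlr
// Risky: `test` itself can match the empty string.
test: identifier*;

// Better: `test` always consumes something; the caller makes it optional.
test: identifier+;
caller: HEADER_SYMBOL test?;
```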


That's what I had to add from my experience.


Thanks again Federico and Gabriele for writing this great article,




rtm...@googlemail.com

unread,
Dec 19, 2020, 2:00:31 PM12/19/20
to antlr-discussion
Just a small point,

The section "Ensuring you are parsing the whole input"

- Tbh. I don't understand why this section exists in this blog post. EOF at a rule's end doesn't affect performance at all.

It may well be out of place in the article but forgetting about the EOF bit me pretty hard and it is important to know.

[snipped]


- A missing EOF doesn't parse input differently, it only ignores parts that follow what's defined by your parser rules. Up to this point the parser run is exactly the same with and w/o EOF. It might fool you into believing your input is correct, though, even if it contains more than what is allowed.

In my case the absence of EOF did cause it to parse the input differently (or seemed to) when it failed. It reported some very strange failures of the input which did not correspond to the grammar. It was very misleading. Adding the EOF brought back sanity.
From memory it would report errors earlier in the input than it 'should' have. It wasn't antlr's fault, mine only, but it was several hours wasted.

cheers

jan

Mike Cargal

unread,
Dec 19, 2020, 2:36:25 PM12/19/20
to antlr-discussion
So, this sounds like having the EOF at a rule's end impacts error reporting and recovery, not so much the performance of parsing syntactically correct input. It makes sense that ANTLR might do a better job of recovery if it has the EOF in the grammar to sync up to when trying to do error recovery.

BTW Mike, really nice writeup.

Mike Lischke

unread,
Dec 20, 2020, 5:33:29 AM12/20/20
to 'rtm...@googlemail.com' via antlr-discussion
Hey Jan,


The section "Ensuring you are parsing the whole input"

- Tbh. I don't understand why this section exists in this blog post. EOF at a rule's end doesn't affect performance at all.

It may well be out of place in the article but forgetting about the EOF bit me pretty hard and it is important to know.

It's off topic for a performance optimisation article, but if it changes to a general advice article (as I suggested) then this kind of information certainly has a place there!


[snipped]

- A missing EOF doesn't parse input differently, it only ignores parts that follow what's defined by your parser rules. Up to this point the parser run is exactly the same with and w/o EOF. It might fool you into believing your input is correct, though, even if it contains more than what is allowed.

In my case the absence of EOF did cause it to parse the input differently (or seemed to) when it failed. It reported some very strange failures of the input which did not correspond to the grammar. It was very misleading. Adding the EOF brought back sanity.
From memory it would report errors earlier in the input than it 'should' have. It wasn't antlr's fault, mine only, but it was several hours wasted.

Interesting, I'd love to see a (simple) example which reproduces the effect of changed parsing with and w/o EOF. The path through the ATN should be the same, but maybe the prediction step changes in some way?

Thanks,

Mike

Federico Tomassetti

unread,
Dec 21, 2020, 6:14:53 AM12/21/20
to Mike Lischke, antlr-discussion, Federico Tomassetti, Gabriele Tomassetti, Alessio Stalla, Maurizio Taverna
Hi Mike,
First of all, let me say that I am extremely grateful for your answer. It is amazing, very useful, and I appreciate the time you have dedicated to it: thank you!

On Sat, Dec 19, 2020 at 7:10 PM Mike Lischke <mi...@lischke-online.de> wrote:
Hi Federico, Gabriele,


Several clients have asked us over time to help them improve the performance of their parsers, so we put together a list of things we normally look into and wrote an article on that:
https://tomassetti.me/improving-the-performance-of-an-antlr-parser/

It would be very useful for us to receive comments and suggestions to improve the article and learn more about optimizing parsers.

That's an idea I've also had in mind for quite some time. ANTLR4 parser optimisation is something that frequently comes up, if only for your own software. As such your blog post is very welcome!

Happy to read this! We have also noticed this is a recurring theme, so we thought it could be useful to collect in one place different strategies and comments, some obvious, others less so, that we end up discussing over and over. It was also an occasion for us to reflect on this topic.

And thanks to this conversation it is also a way to collect the wisdom of the community and store it in one place.
 

Below are some thoughts I collected while reading it:

- In general you tend to write pretty lengthy articles. That's not per se bad, but it could use some more structure. You could either split the individual sections into own parts (pages) of the overall posting or at least provide a way to easily skip over certain areas (e.g. a table of contents with linked titles, collapsible sections or similar).

Yes, the articles are intended to be long on purpose. One reason is that it is easier to refer to them: if we have one article on optimization we can easily find it and share it when someone asks something on this topic. If we had 3-5 smaller articles on the topic it would be more difficult to find them. We would also need to ask the person what they had already tried and do some investigation to find the right article to share. Having a single long article is, I think, also better for SEO.

The drawback is that it can be overwhelming. Your advice is good, and we added a collapsible table of contents at the beginning of all articles. Perhaps it is not very visible because it is collapsed by default.
 

- At least from my standpoint the discussion of different runtimes is something I would not want in a posting about performance optimisation. Of course you will see big differences in execution, depending on the target language used (and if you mention it at all then it should definitely state what the fastest runtimes are: C++ and C#). But first, you can rarely change the runtime at will, so it is usually not an option for performance tuning, and second, every other optimisation you can do (mostly grammar tuning) applies to all runtimes equally.

I see your point, and I agree that very often there are constraints that prevent switching to a different runtime. However, for some of our clients this was the option that made the most sense, because they were at the beginning of their projects, so they did not have a lot of investment in code using the parse tree, and they had made the choice of runtime based on the language some developer preferred, without giving it much thought. In those cases, changing runtime was possible and made a lot of difference. Perhaps if we make it easier to navigate through the article it will be less annoying to have these rarely used options.
 

- The C++ runtime never had memory-related performance problems that later got resolved. It's a fact that the C++ runtime has to use certain memory management functionality that can be slow (relatively speaking, of course; even the slowest approach is orders of magnitude faster than any scripting language like JavaScript or Python).

We will correct the article, sorry about that
 

- Talking about ASTs in the context of ANTLR4 based parsers is a bit misleading. IMO there's enough confusion about parse trees vs. ASTs, e.g. on Stackoverflow. ASTs are neither needed to optimise parser performance, nor are they even created by ANTLR4 based parsers.

Yes, I agree with the confusion and we will clarify it better in the article. What we meant is that typically we take the parse tree produced by ANTLR and elaborate it to obtain an AST (for example, we build such AST using Kolasu when working on the JVM). When talking about caching we suggest caching the AST eventually built from the parse tree and not the parse tree itself. For example, in Kolasu there is support to serialize the AST to JSON or XML.
 

- This sentence: "That is to say, we cannot tell you how using a different runtime will impact your grammar." is probably wrong. I guess you wanted to write "parser" instead of "grammar".

Thank you, we will correct it
 

- The "Some Example Numbers" section is a bit strange, as if you want to prove that Python is slower than Kotlin. But that's not the subject of this article. I believe the vast majority of your readers will believe that Python is slower. It just makes the entire article longer with no benefit for the reader.

Well, you would be surprised by what people do not know :D

Often we find it very difficult to answer the question "how much faster will my parser be when switching to Kotlin?", but we understand that this is a question people are very interested in. As technicians, we would tend to say "we do not really know"; as consultants, we understand that even a very approximate answer is necessary for people to decide whether or not to explore this possibility.

I think we also have a different view of the article in this case: we tend to see it as a sort of encyclopedia of everything people may want to know about the optimization topic, so we expect that some sections will be irrelevant for some readers, while you look at it as an article to read from beginning to end, and surely for many readers, especially people with a lot of expertise like you, most content would be rather obvious or redundant.
 

After that we come to the meat of the article, which I enjoyed reading.

- It would be interesting to see real numbers that prove that reducing ambiguity really improves performance. The ALL(*) algorithm, as I understood it, just picks one alt, depending on certain rules. There's no time penalty when resolving ambiguities. I'd be happy to learn something new here, if I'm wrong.

I agree that having more numbers in general, and in this section in particular, would be very useful. Our hope is that this article can evolve over time, for example because of your comments or because of experiments we do on specific projects, or because of things we learn, so we plan to go back and go more in depth in some cases.
 
- Complexity of the grammar is certainly a point! When you mention a path the parser chooses, you should probably mention which path you actually mean: the path through the ATN (which requires additional explanation in your article, which would be more interesting for hard core optimisation than, say, switching the runtime).

Yes, you are right
 

- I also think the article could benefit from an abstract description of the parsing process (unlimited look ahead, possibly guided by predicates), which would help in the section further down where you mention the effect of predicates.

True. I am not sure whether that part, being more technical, could be a separate article or fit into this one, but at some point we definitely made the most frequent error technicians make: taking some information for granted.
 

- This sentence "If the two alternatives are an over-complicated rule or a semantic predicate, the semantic predicate can be the better option." is a bit confusing. A predicate alone cannot be an alternative. It always is just a gateway to an alternative. Additionally, if a guarded alternative appears after an unguarded one, it will not speed up the parsing process, regardless of the complexity of the unguarded alt, because the alt evaluation happens top down in a linear fashion.

We will rephrase it, thank you
 

- The section "Are semantic predicates good or bad?" could use some additional content, because predicates are a very important aspect when it comes to parsing speed.

Agreed
 

- I would expect that you answer the question you asked in the title.

You are right
 

- In parser rules a predicate must appear on the left-hand side to have any impact on the parsing process (see also https://github.com/antlr/antlr4/blob/master/doc/predicates.md#semantic-predicates). But that has only as much performance impact as the time needed to evaluate the predicate. However, alternatives containing a left-hand predicate will not be stored in the DFA cache, so that they can be re-evaluated on each parse run. That could have an additional impact if a rule with gated alternatives has many alternatives. It would be good to do measurements with and without predicates to find reliable numbers.

Very good point, we will add it
 

- For lexer rules it's absolutely essential to put the predicate at the end, because here the effect of no caching is really heavy. All lexer rules are handled as if they were alts in a single big outer rule. So, if only one lexer rule contains a predicate on the left-hand side, no lexer rule is cached at all! When you consider that all your keywords are also lexer rules, you can easily imagine how thousands of lexer rules behave, performance-wise, if they are not cached. Side note: the DFA cache was one of the major performance improvements in ANTLR4.

Thank you! This is something very valuable to add to the article
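To make the placement point concrete, here is a hedged sketch (the rule names and the `allowId`/`isKeyword` helper methods are invented for illustration, not taken from the article):

```antlr
// Predicate on the left-hand side: it gates the lexer decision, so this
// alternative cannot be stored in the DFA cache. Since all lexer rules act
// as alts of one big implicit rule, this disables caching for all of them.
BAD_ID: {allowId()}? [a-zA-Z]+;

// Predicate at the end: the rule is matched normally and the predicate only
// validates the result, so the DFA cache stays usable. Keep it cheap, since
// it can be evaluated once per iteration of the loop.
GOOD_ID: [a-zA-Z]+ {!isKeyword(getText())}?;
```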
 
  
  For the C++ runtime this became an extra burden because of the way memory is handled. I described this in my announcement blog post: https://soft-gems.net/the-antlr4-c-target-is-here/. But that is a runtime-specific issue that can easily be avoided by using predicates as recommended (as you have mentioned in your blog post).

We will link to it as it provides a good explanation
 

- This sentence "The downside is that it is practically impossible to know beforehand whether moving the semantic predicate inside the lexer rule will improve performance or worsen it." can be improved by taking into account that a predicate on the left-hand side is really bad.

- I doubt that "It can also improve it because it can quickly kill the exploration of a rule that is not valid in most cases." is true, given the above information. You can use gated alts (or conditional lexer rules), but the shortcut benefit is far less than the no-caching penalty. You should also make it clearer that lexer predicates are executed for each iteration in a lexer rule and should therefore be especially quick to evaluate. Example: "ID: [a-zA-Z]+ {isKeyword($text)}?;" If you have an ID with 1000 characters, the predicate "isKeyword" is executed 1000 times.

Very good point, we should explain that indeed. This also reinforces the point you made to add a section describing the lexing and parsing process
 


The section "Handling Expressions":

- Expression rules are especially sensitive because usually expressions can be nested with no limit. This actually has some severe implications:

- ANTLR4-based parsers enter a new stack frame for each evaluation of a rule. If you have a large set of expression rules (implementing precedence as explained in the article), which call each other recursively, you might quickly reach the maximum stack depth of your environment. The worst result of that effect is that, with large enough expressions, your entire app can crash unpredictably! Depending on the target language, you do not have much control over how to deal with such a situation. Increasing the stack depth is possible in some languages (e.g. C++), but not on all platforms. Threads (except for the main thread) usually have a pretty small stack size (IIRC only 256KB or 512KB).

- Because of that effect it might sometimes not be possible to parse in a separate thread. This might cause performance problems too.

We will add this consideration, thank you
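The stack-depth hazard is easy to reproduce outside ANTLR. Below is a toy recursive-descent parser (a sketch, not generated code; the grammar `expr: '(' expr ')' | 'x'` is made up) that, like an ANTLR4-generated parser, spends one stack frame per level of nesting:

```python
def parse_expr(text, pos=0):
    """Recursive-descent parse of: expr: '(' expr ')' | 'x' ;"""
    if pos < len(text) and text[pos] == '(':
        pos = parse_expr(text, pos + 1)  # one stack frame per nesting level
        assert text[pos] == ')'
        return pos + 1
    assert text[pos] == 'x'
    return pos + 1

def can_parse(text):
    try:
        return parse_expr(text) == len(text)
    except RecursionError:
        return False  # deep nesting exhausted the interpreter stack

shallow = "(" * 10 + "x" + ")" * 10
deep = "(" * 100_000 + "x" + ")" * 100_000
print(can_parse(shallow))  # True
print(can_parse(deep))     # False (default recursion limit is ~1000)
```

A real generated parser crashes the process instead of raising a catchable error in some targets, which is exactly the danger described above.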
 

- Rewriting nested expression rules as a single (direct left-recursive) rule is only possible for pretty simple expression types. As soon as you have more than a few operators, things really get messy, and your argument that hierarchical expression rules are ugly and/or cumbersome is something I dare to disagree with :-) Of course it also depends on the person working with the grammar, but IMO it's much easier to understand how precedence is organised when you use hierarchical expression rules.

I think this is a matter of preference too :D

On the one hand, it is true that we can afford to have ANTLR grammars that are a bit more distant from the way we think about the concepts, because we will get an AST later anyway and will "flatten" the hierarchy during that transformation. But I personally tend to write grammars closer to the AST, because for me it is easier to think in those terms.

I think it would be useful to add a paragraph about this point, explaining the limits of what we suggested and the fact that there are good reasons to choose the opposite approach.
 

- Direct left recursive rules are no improvement from a memory or speed standpoint, because internally they are converted to non-left-recursive rules and enter a new stack frame on each invocation. Additionally, it might be more difficult to understand operator precedence with them.

We should add that, too
 

- "The more complicated design has a better performance because it is not ambiguous." is a confusing sentence. Where's the ambiguity here? Direct left-recursive rules are not ambiguous. Ambiguity is the situation where more than a single alt can be taken with the same input.

We will rephrase it
 

- The example taken from the documentation doesn't help the grammar writer find a better way here, as it demonstrates how operator precedence is implemented in left-recursive rules. If you want to explain what happens internally with left-recursive rules, you could instead provide the algorithm for manually converting direct left-recursive rules to non-recursive ones. I have described this before on SO: https://stackoverflow.com/questions/41788100/antlr4-mutual-left-recursion-grammar/41789097#41789097. But again: this won't help much to improve performance.

I think it would be useful, and it once again reinforces the case for explaining the parsing process
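For illustration, the textbook version of that conversion looks like this (the rule names are invented here, and this is the standard transformation, not necessarily the exact algorithm from the SO answer):

```antlr
// Direct left recursion:
expr: expr PLUS term | term;

// Equivalent non-left-recursive form, rewritten as a loop.
// This is roughly what ANTLR4 does internally with such rules:
expr: term (PLUS term)*;
```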
 

The section "Ensuring you are parsing the whole input"

- Tbh. I don't understand why this section exists in this blog post. EOF at a rule's end doesn't affect performance at all.

I will double-check that with Gabriele
 

- The hint about the lexer that could slow down because of a missing EOF is highly misleading, since EOF is itself a (built-in) token and only used in the parser grammar.

- A missing EOF doesn't parse the input differently, it only ignores whatever follows what's defined by your parser rules. Up to this point the parser run is exactly the same with and without EOF. It might fool you into believing your input is correct, though, even if it contains more than what is allowed.

- Adding a catch-all lexer rule can also have an adverse effect, because it will change error handling. Without it you might get an error like "invalid input [" (if you normally don't handle square brackets). With the "ANY" rule, this changes to "unexpected ANY". Which one helps better to diagnose the error is pretty clear. Note also: the catch-all rule must not be used in your parser, or an error might actually not be reported.

The section "Changing the Prediction Mode"

This is mostly fine, still I have a few comments:

- You may want to make clearer that BailoutStrategy and prediction mode are two different optimisation measures:

- The bail-out strategy is quicker because it throws an exception on error that is not caught in the runtime (all other parse exceptions are caught and forwarded as error callbacks via the error listeners). That way the parser will not attempt any recovery and will directly stop the parsing process. Important here: you have to catch the ParseCancellationException in your code, or it might bubble up further and crash your application (if there's no global catch-all exception handler by default, like in C++). But with that special exception, parsing stops as quickly as possible if there's an error.

- The prediction mode allows simpler parsing if your grammar permits it. You will get faster results for correct input. If you mostly have to deal with correct input, then the two-step approach (first SLL, then LL prediction mode) is actually faster. SLL can report a syntax error even if there's none (because of the simpler handling), in which case you have to confirm with LL prediction mode whether there's really an error. That means if your input mostly contains errors, then jumping directly to LL (without first trying SLL) is the better option. Even then you can get away with SLL alone, if your grammar has no ambiguities.

Very good point, we will add your refined explanation, if that is ok for you
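As a sketch of that two-step approach in Java (assuming a generated parser; `startRule` and the `parser`/`tokens` variables are placeholders):

```java
// Fast path: SLL prediction, bail out on the first syntax error.
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
parser.setErrorHandler(new BailErrorStrategy());
ParseTree tree;
try {
    tree = parser.startRule();
} catch (ParseCancellationException e) {
    // SLL reported an error; retry with full LL to confirm it is real.
    tokens.seek(0);                 // rewind the token stream
    parser.reset();
    parser.setErrorHandler(new DefaultErrorStrategy());
    parser.getInterpreter().setPredictionMode(PredictionMode.LL);
    tree = parser.startRule();      // slow path, with normal error recovery
}
```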
 

- Your last sentence in this section makes no sense. Parse strategy and prediction mode have nothing to do with input sizes. Both are mostly related to error handling.

The sections "Profiling the parser" and "Profiling the lexer" are a great addition to this article. Maybe you can go into more detail in a separate blog post about all the values you get from profiling? Would be really helpful to understand their meaning, the usual values you get, the limits that are still ok etc.

The section "Caching"

- This seems to be a wishy-washy section, as it doesn't provide much information to actually solve a performance problem. If you had shown an example of, say, how to save the parse state for each line (aka incremental parsing), it would have been an absolutely great addition (because incremental parsing is currently not supported by ANTLR4-based parsers, and the GitHub patch which would add it is still pending, IIRC).

In this case, we are back to the consulting vs. development approach: with one of our clients this solution proved useful, as it could be applied immediately and with limited effort. We are also working on the other techniques, but that particular language is quite tricky (RPG: it can be either positional or free-form, a lot of fun to parse :D) and, for the use case of our clients (a ton of long programs which change rarely), this approach helped immediately and basically at no cost.
 

- Also, you mention the AST again there, which doesn't help with performance and is rather noise, since ANTLR4 doesn't generate an AST.

Yes, this is because in that case we cache the AST, obtained as the output of a transformation of the parse tree, but we did not explain it well, leading to confusion.
 


Some additional optimisations and good style

Below are some thoughts I collected over time, which I wanted to turn into an article about what you should do to improve your grammar, what you had better not do, and what is good grammar style. Maybe you want to enhance your article to become the resource I had in mind originally?

We would love that! Thank you! Obviously we would make it clear you are the author of these contributions
 

⦁ It's not so obvious when you only look at your grammar, but with the right consolidation of rule sub-parts you can greatly improve speed. That idea helped me remove a bottleneck in my grammar where many keywords were mentioned as alternatives for an identifier (the well-known keyword-as-identifier problem). Look at the following rules:

indexTypeClause:
    USING_SYMBOL indexType
    | TYPE_SYMBOL indexType
;

indexTypeClauseOpt:
    (USING_SYMBOL | TYPE_SYMBOL) indexType
;


The first shows a typical case where different keywords define alternatives with a common tail. For the first rule you get this ATN:


How did you get this visualization of the ATN?
Yes! I have noticed that in the past, exactly with the MULT/DIV and PLUS/MINUS cases, but I did not go so deep in my analysis. The way you formulate it is very clear and interesting.
 


⦁ Avoid leading optional values. What this means is that you should not start a rule or alt with an optional value. Check this rule:

simpleIdentifier:
    identifier (dotIdentifier dotIdentifier?)?
;

That results in this ATN:


You can see there's a check for both optional dotIdentifiers. If there's no dot then no further check for the second optional part is done.

You could also write that as:

simpleIdentifier:
    (identifier DOT_SYMBOL)? (identifier DOT_SYMBOL)? identifier
;


However, the second form requires two checks (with some optional parts) instead of one, and one more cached path in the DFA cache:


Not so much a performance hog, but it can save you from ambiguity trouble, especially when this rule is used in another rule where an identifier could also be matched.


⦁ Speaking of ambiguities: there's a case that can get you into trouble which is sometimes hard to solve: rules that can match nothing:

test:
    identifier*
;

You should make the call to it optional instead. Never combine both: a rule with a potential empty match and an optional call. However, ANTLR4 will warn about that.


That's what I had to add from my experience.

This is extremely valuable, thank you so much, Mike!

We now have some work to do to improve the article based on your suggestions, but we think it will make it much better and that it will become a resource we can all use for quite some time.
 


Thanks again Federico and Gabriele for writing this great article,


These comments have been extremely useful for us. We would like to add an acknowledgment section to the article to thank you. We are very thankful for what you have taught us and for all the time you have invested in this. We hope that not just us, but many others, can benefit from your work.

Cheers,
Federico

--
Software Architect & Founder at Strumenta (https://strumenta.com)
Technical blog at https://tomassetti.me
GitHub https://github.com/ftomassetti
Twitter @ftomasse
linkedin.com/in/federicotomassetti

Mike Lischke

unread,
Dec 21, 2020, 7:37:50 AM12/21/20
to 'rtm...@googlemail.com' via antlr-discussion, Federico Tomassetti, Gabriele Tomassetti, Alessio Stalla, Maurizio Taverna
Hi Federico,

First of all, let me say that I am extremely grateful for your answer. It is amazing, very useful, and I appreciate the time you have dedicated to it: thank you!

I'm glad you find my input useful, but as I wrote earlier, I had this idea of a best practices documentation in my mind for a long time, but never came around actually writing it. You did, and that's the true innovation.



- At least from my standpoint, the discussion of different runtimes is something I would not want in a post about performance optimisation. Of course you will see big differences in execution depending on the target language (and if you mention it at all, then it should definitely state which the fastest runtimes are: C++ and C#). But first, you can rarely change the runtime at will, so it is usually not an option for performance tuning, and second, every other optimisation you can do (mostly grammar tuning) applies to all runtimes equally.

I see your point, and I agree that very often there are constraints that prevent switching to a different runtime. However, for some of our clients this was the option that made the most sense, because they were at the beginning of their projects: they did not have a lot of investment in code using the parse tree, and they had chosen the runtime based on the language some developer preferred, without giving it much thought. In those cases, changing runtime was possible and made a lot of difference. Perhaps if we make the article easier to navigate, it will be less annoying to have these rarely used options.

You obviously see much better than I ever can, what your clients want and/or need, so if you see fit then be it so :-)



- The "Some Example Numbers" section is a bit strange, as if you wanted to prove that Python is slower than Kotlin. But that's not the subject of this article. I believe the vast majority of your readers will already believe that Python is slower. It just makes the entire article longer with no benefit for the reader.

Well, you would be surprised by what people do not know :D

Hehe, I bet!



The section "Ensuring you are parsing the whole input"

- Tbh. I don't understand why this section exists in this blog post. EOF at a rule's end doesn't affect performance at all.

I will double-check that with Gabriele


Please also consider the replies from Jan and Mike Cargal, who mentioned that a missing EOF can at least have an impact on error reporting. But having EOF at the end of the top-level rule is one of the most basic good-style rules (and on SO a missing EOF was the culprit in quite a few ANTLR4 questions in the past). So this information is very important to mention in a best-practice article. I just tried to illustrate the relationship between EOF and performance tuning.


Very good point, we will add your refined explanation, if that is ok for you

Absolutely, yes.



Some additional optimisations and good style

Below are some thoughts I collected over time, which I wanted to make into an article about what you should do to improve your grammar, what you better should not do and what is good grammar style. Maybe you want to enhance your article to become the resource I had in mind originally?

We would love that! Thank you! Obviously we would make it clear you are the author of these contributions

I hope more people will contribute (maybe Sam Harwell, who is the true expert here).


How did you get this visualization of the ATN?

This is one of the features of my Visual Studio Code extension for ANTLR4: https://marketplace.visualstudio.com/items?itemName=mike-lischke.vscode-antlr4, which also allows you to debug grammars and print syntax diagrams (aka railroad diagrams, https://github.com/mike-lischke/vscode-antlr4/blob/master/doc/graphical-visualizations.md).

 
These comments have been extremely useful for us. We would like to add an acknowledgment section to the article to thank you. We are very thankful for what you have taught us and for all the time you have invested in this. We hope that not just us, but many others, can benefit from your work.

It's my pleasure, especially as I really appreciate your work for the community by publishing such articles!

Regards,

Federico Tomassetti

unread,
Jan 5, 2021, 5:13:44 AM1/5/21
to antlr-discussion
Thank you for your great contributions, in particular Mike.
We just published the updated version of the article, addressing your feedback and including your contributions. It is available here:

https://tomassetti.me/improving-the-performance-of-an-antlr-parser/

Cheers,
Federico

Mike Lischke

unread,
Jan 5, 2021, 11:17:53 AM1/5/21
to 'rtm...@googlemail.com' via antlr-discussion
Hi Federico,


Thank you for your great contributions, in particular Mike.
We just published the updated version of the article, addressing your feedback and including your contributions. It is available here:

https://tomassetti.me/improving-the-performance-of-an-antlr-parser/

Great article indeed. Maybe you should add a paragraph about the parser warm-up phase and why it is necessary to run the same input at least twice when measuring performance?

Background: the DFA cache is dynamic. It's filled with ATN configurations whenever the prediction process (the ALL(*) algorithm) finds a new path through the ATN. So on the first parse run all the configs must be generated and cached, which is much slower (sometimes around 10 times) than all following parse runs with the same input, where the ATN simulator can just use the cached values. Downside: this cache grows unbounded and can consume quite a lot of RAM (though that's cheap these days). There's a method to clear this cache, but that's really only a temporary solution to lower the memory load, and it will actually slow down parsing, because it essentially starts a new warm-up phase.

On the other hand, this cache could serve as a (very special) performance improvement if you were able to make it persistent and load it the next time the application starts. The parser would then not have to warm up every time. However, firstly, this helps mostly if you restart your application frequently (e.g. a web app or an on-demand parsing service) and, secondly, there's currently no code in the ANTLR4 runtime to achieve that. You would have to extend the runtime yourself to add the necessary get and set methods in the ATN simulator.

The DFA cache is shared between all instances of the same parser. For that reason it is the only structure in the entire runtime which is thread-safe (which also adds to the warm-up, because of access serialisation). That also means that a parser's performance can be impacted if another parser inserts new values into that cache in parallel. But that's a rather rare scenario.
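A rough way to observe the warm-up effect in Java (a sketch; `startRule` and the `parser`/`tokensForSameInput` names are placeholders for your generated parser, and the cache-clearing method is `clearDFA` on the ATN simulator in the Java runtime):

```java
long t0 = System.nanoTime();
parser.startRule();                          // cold run: fills the DFA cache
long t1 = System.nanoTime();

parser.setTokenStream(tokensForSameInput);   // fresh stream over the same input
parser.startRule();                          // warm run: reuses cached configs
long t2 = System.nanoTime();

System.out.printf("cold: %dms, warm: %dms%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);

// Frees the shared cache, at the cost of a new warm-up phase afterwards:
parser.getInterpreter().clearDFA();
```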

Federico Tomassetti

unread,
Jan 6, 2021, 7:59:26 AM1/6/21
to antlr-discussion
Hi Mike,
Another very valuable addition!
We will look into adding also that one to the article.

Cheers,
Federico

ivan.ko...@gmail.com

unread,
Jan 8, 2021, 6:38:36 AM1/8/21
to antlr-discussion
 Talking about ASTs in the context of ANTLR4 based parsers is a bit misleading. IMO there's enough confusion about parse trees vs. ASTs, e.g. on Stackoverflow. ASTs are neither needed to optimise parser performance, nor are they even created by ANTLR4 based parsers.

Yes, ANTLR generates a parse tree instead of an AST, but the latter can also be built by hand, without generating the parse tree, using listeners and bottom-up AST building; see PlSqlAntlrListenerConverter for an example. To skip parse-tree building, the option `BuildParseTree` should be set to `false`.

Such an approach may reduce memory pressure and improve performance, especially for huge files (on the order of a million lines), because in general an AST has far fewer nodes than a parse tree.

And I think it's worth mentioning in the context of the performance of ANTLR-based parsers.
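A sketch of that approach in Java (all `My*` names are placeholders; the listener class stands for the hand-written bottom-up AST builder implementing the generated listener interface):

```java
MyLexer lexer = new MyLexer(CharStreams.fromString(input));
MyParser parser = new MyParser(new CommonTokenStream(lexer));
parser.setBuildParseTree(false);       // no parse-tree nodes are allocated
MyAstBuilder builder = new MyAstBuilder();
parser.addParseListener(builder);      // callbacks fire during the parse itself
parser.startRule();
MyAst ast = builder.getResult();       // bottom-up AST with far fewer nodes
```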

ivan.ko...@gmail.com

unread,
Nov 7, 2021, 8:47:05 AM11/7/21
to antlr-discussion
> Avoid leading optional values. What this means is that you should not start a rule or alt with an optional value. Check this rule:

Hi! It seems like this part is written incorrectly, or at least the suggested samples are not correct.

Consider the following samples and their ATNs:

trailing: ID (dotId dotId?)?;

trailing.png

leading: (idDot idDot?)? ID;

leading.png

Their ATNs are almost the same, except for the extra ε in the second case. And these rules are equivalent, unlike yours.

Your second sample

simpleIdentifier:
(identifier DOT_SYMBOL)? (identifier DOT_SYMBOL)? identifier
;

is not equivalent to the first:

simpleIdentifier:
identifier (dotIdentifier dotIdentifier?)?
;

Considering the name of the advice, "Avoid leading optional values", it should be:

simpleIdentifier:
(identifierDot identifierDot?)? identifier
;

As I wrote before, ATNs for such rules are almost the same and it hardly affects performance.

I suggest changing the name of your advice to something like "Use nested optional values instead of sequential" and fixing the description with samples.

Mike Lischke

unread,
Nov 28, 2021, 6:19:04 AM11/28/21
to ANTLR discussion group
Hi Ivan,

> Avoid leading optional values. What this means is that you should not start a rule or alt with an optional value. Check this rule:

Hi! It seems like this part is written incorrectly, or at least suggested samples are not correct.

Consider the following samples and their ATNs:

trailing: ID (dotId dotId?)?;

<trailing.png>

leading: (idDot idDot?)? ID;

<leading.png>

Their ATNs are almost the same except for the extra ε in the second case. And these rules are equivalent unlike yours.

Right, with your example the differences in the ATN are marginal. However, keep in mind that my example consists of two optional parts, not an optional part nested inside an optional part.


Your second sample

simpleIdentifier:
(identifier DOT_SYMBOL)? (identifier DOT_SYMBOL)? identifier
;

is not equivalent to the first:

simpleIdentifier:
identifier (dotIdentifier dotIdentifier?)?
;

Where did I say they would be equivalent? They are obviously not.


Consider the name of the rule "Avoid leading optional values" it should be:

simpleIdentifier:
(identifierDot identifierDot?)? identifier
;

As I wrote before, ATNs for such rules are almost the same and it hardly affects performance.

I think the title is still good. I not only mentioned the additional check that is necessary with my example (which I also described as something with low impact), but I additionally wrote that leading optional parts can cause hard-to-find ambiguities, especially when a rule with leading optional parts is used in another rule where a path exists that ends with the same elements with which the subrule begins.

I think it's safe to state that the use of leading optional values makes parsing slower, because the prediction engine has to check more paths to see whether a subrule matches than without the optional parts. How much this impacts the overall performance depends heavily on the overall structure of the grammar, of course.

Regards,

