ANTLR4 performance, controlling prediction and choosing SLL vs ALL


ABW

Jul 27, 2017, 7:02:52 AM
to antlr-discussion
I was advised that my questions are too broad for StackOverflow and I should try here.
Original questions: https://stackoverflow.com/questions/45286418/antlr4-performance-controlling-prediction-and-chosing-sll-vs-all
I'm just trying to find out if there are any secret techniques to tweak ANTLR4 behaviour or at least to get to know more about its inner workings so I can increase performance of my grammar.

A little background. I'm working on a Verilog/SystemVerilog simulator. There are some constructs that the tool still does not support, and implementing them has to start with being able to parse said constructs. So far it uses ANTLR 2.7.7. Over the years the grammar grew and got complicated to the point that it became really hard to maintain. It was time to rewrite it from scratch. As it happened, a month before the decision was made ANTLR4 got official support for a C++ target. So why not try something new?

Writing a complete working grammar with ANTLR 4.7 was a blast. I even quickly added a simple pretty printer to learn how to use a visitor instead of embedding actions in the grammar. Then I chose one real-life, mid-size design (1.3MB, about 200k tokens after preprocessing) and started to analyze the diagnostic output and the data provided by the ANTLR plugin for IntelliJ IDEA to figure out what to change in order to make parsing faster (unfortunately I had to clone the grammar and change it a bit for it to work with Java, because despite the lack of embedded actions it turned out to be impossible to keep it target neutral). The parse of the test design at first took 131s and 1.3GB of memory; after a couple of iterations I took it down to 114s and 1GB (this is just the parse, with no listener and no tree building). For comparison, the old grammar with all the embedded actions that build the VPI representation of the design (a sort of equivalent of an AST) takes 2.2s and 65MB of memory (the whole test with simulation time takes 36s). So there is a lot of work to do.

The Java version of the new grammar works faster (24.3s in the last iteration) but takes more memory (1.4GB). Since the last iteration of grammar changes I can use the -SLL option too, and that runs in 18s (but takes 1.6GB for some reason). I haven't tried SLL mode in C++ yet.

There are places in the grammar I still have to fix; especially the ambiguous parts are in the heart of expression primary, so I'm confident the grammar can be changed to work a lot faster (whether fast enough is yet to be seen). There are however places where ANTLR behaves very strangely, e.g. one call to a frequently used rule suddenly decides to look ahead up to the end of the input, where all ambiguities should be resolved on a closing parenthesis two tokens from the start. I suspect these are responsible for the massive memory consumption. Unfortunately I haven't been able to find a short example of such behaviour to debug or post yet. These oddities prompted me to look for ways of controlling ANTLR's prediction process, and the questions I linked at the start.

Mike Lischke

Jul 27, 2017, 8:20:16 AM
to antlr-di...@googlegroups.com

> Writing complete working grammar with ANTLR4.7 was a blast. I even quickly added simple pretty printer to learn how to use visitor instead of embedding actions in the grammar. Then I chose one real life mid size design (1.3MB, about 200k tokens after preprocessing) and started to analyze diagnostic output and data provided by ANTLR plugin for IntelliJ IDEA to figure out what to change in order to make parsing faster (unfortunately I had to clone the grammar and change it a bit for it to work with Java, because despite lack of embedded actions it turned out to be impossible to keep it target neutral). The parse on test design at first took 131s and 1.3GB of memory, after couple of iterations I took it down to 114s and 1GB (this is just parse with no listener nor tree building). For comparison the old grammar with all the embedded actions that build VPI representation of the design (a sort of equivalent of AST) takes 2.2s and 65MB of memory (the whole test with simulation time takes 36s). So there is a lot of work to do.

Parsing with ANTLR4 generated parsers is very sensitive to both the grammar and the input. The ALL(*) algorithm can use unlimited lookahead and caches most of that in memory to speed up following parse calls (that warmup phase is therefore usually significantly slower than the following parse runs). And if something is not cached (which can happen under certain circumstances) it makes parsing slow because of frequent recreation of temporary structures. One case where this happens is when you use predicates on the left-hand edge of alts. This is especially a problem in the lexer, as all lexer rules are treated as alts of a single global rule. Hence even a single left-edge predicate can prevent caching of the DFA and other data for the entire set of lexer rules (Sam, please correct me if I explain that wrongly). So this is something you should avoid at all cost (easy: use a trailing predicate instead).
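
For example (a made-up lexer rule and predicate, purely to illustrate the placement):

// Avoid: a predicate at the left edge is evaluated during prediction and can
// keep the DFA for the whole set of lexer rules from being cached.
LCURLY : {inTemplateMode()}? '{' ;

// Prefer: match the token first and let the trailing predicate veto the result.
LCURLY : '{' {inTemplateMode()}? ;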

What I also found to be a problem is something like:

rule
: complexRule otherRule1
| complexRule otherRule2
| complexRule otherRule3
| complexRule otherRule4
| complexRule otherRule5
;

which does a lot of unnecessary processing for the alts. Put such things in a block with a single leading "complexRule". Such constructs can often be seen in expression rules. Speaking of expressions: for testing you could try very simple variants just to see if that makes a big difference with a large test input.
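
To make that concrete, the block form I mean looks roughly like this (assuming the alternatives really only differ in their tails):

rule
: complexRule
  ( otherRule1
  | otherRule2
  | otherRule3
  | otherRule4
  | otherRule5
  )
;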

>
> The Java version of new grammar works faster (24.3s in last iteration) but takes more memory (1.4GB). Since last iteration of grammar changes I can use -SLL option too and that runs in 18s (but takes 1.6GB for some reason). I haven't tried SLL mode in C++ yet.

All these numbers, have they been taken after warmup (the first parse run)?

>
> There are places in the grammar I still have to fix, especially ambiguous parts are in the heart of expression primary, so I'm confident the grammar can be changed to work a lot faster (if fast enough is yet to be seen).

Yes, expression rules are usually a source of trouble because of their recursive nature. Try to limit recursion.

> There are however places where ANTLR behaves very strange, f.e. one call to frequently used rule suddenly decided to lookahead up to the end of input where all ambiguities should be resolved on closing parenthese two tokens from the start.

Of course, avoiding ambiguities would be the best option, but sometimes the language is designed in a way that makes this impossible. Just like with expressions: try going back to a simpler form of your grammar (with only a subset of features) and see if that can be made fast. Then add back the ambiguous stuff and try to optimize.

Is it possible to get your grammar and test input for own tests?

Mike
--
www.soft-gems.net

ABW

Jul 28, 2017, 8:13:42 AM
to antlr-discussion
> Parsing with ANTLR4 generated parsers is very sensitive to both the grammar and the input.
That makes it all the more surprising that I can't find a guide that would say "do this" and "avoid that" for ANTLR4 grammars. In old ANTLR everything was clear - if you needed a syntactic predicate, that was the point where you could have performance problems for some inputs. In new ANTLR everything falls into the magic box of ParserATNSimulator::adaptivePredict. The magic is great when it works, but when it stops working or runs too slow the real problems start. It is also much easier to misuse magic than a plain solution.

> All these numbers, have they been taken after warmup (the first parse run)?
In my case there is never a second parse run. For the Java version I'm just running the TestRig built into ANTLR with no options or with just -SLL. For C++ the flow is the following: lexer(s) -> flyweight-like file structure -> preprocessor -> filter -> actual parser-specific token source -> parser. Everything up to the parser token source is shared between the old ANTLR2 and new ANTLR4 implementations. In a normal simulator run, once the parser finishes its work all the generated VPI structures are run through basic checks and serialized to a library, the parser is destroyed and that's it. Further stages load all they need from libraries (they can in fact be run in a completely separate process). In other words, one simulator execution equals one parse (or no parse at all if no sources changed since the last run). There are other, rarer options for running multiple parses in one go, but the default is to glue all input files together (all macros and directives are globally visible, exactly as if all files were concatenated), so even if there are multiple input files there is always just one parse.


> Is it possible to get your grammar and test input for own tests?
It's not my call when it comes to grammar, but even if it was I wouldn't dare to show it to anyone besides my coworkers before I'm close to content with it :o)
As for the test, I'm currently using the preprocessed UVM library with some trivial testbench - the original sources for the library can be taken from here: http://www.accellera.org/downloads/standards/uvm (Class Library Code)
The raw preprocessed file I'm using looks horrible, but my pretty printer works well enough - attaching the zipped file in the pretty version. I guess it is useless anyway without a grammar (and as a warning - if you tried to build a grammar directly from its LRM description https://standards.ieee.org/getieee/1800/download/1800-2012.pdf it simply won't work).

event_pool.zip

ABW

Jul 28, 2017, 8:38:21 AM
to antlr-discussion
Small update just in case: I'm using UVM version 1.1d, not the newest 1.2

Mike Lischke

Jul 29, 2017, 5:35:13 AM
to antlr-di...@googlegroups.com
> > Parsing with ANTLR4 generated parsers is very sensitive to both the grammar and the input.
> The more surprising it is that I can't find the guide that would tell "do this" and "avoid that" for ANTLR4 grammars.

Yes, true. I thought about starting a list for that more than once. But where to keep it? Sam could say a lot about speed optimization, much more than anyone else, and this knowledge should be saved.

> In old ANTLR all was clear - if you needed syntactic predicate this was the point where you could have performance problems for some inputs.

Terence wrote somewhere that ANTLR4 is not primarily optimized for speed, but for convenience. After all it's an academic project that just happens to fit quite well for many real-world products. A fixed lookahead like in ANTLR3 is certainly faster than the dynamic lookahead in ANTLR4 (and less memory consuming), but fails for certain languages with ambiguities (and there are unfortunately many).

> In new ANTLR everything falls into magic box of ParserATNSimulator::adaptivePredict. The magic is great when it works, but when it stops working or runs too slow the real problem starts. It is also much easier to misuse magic than plain solution.

Misusing that is likely not a problem here, since it is an internal function. However, from a programming standpoint it's a nightmare. The prediction is highly recursive and can create huge stacks for complex input. Due to the way ATN configs and states are shared, there is an extra challenge for languages without fully automatic memory management (like C++), which requires extra (time-costly) code to avoid memory leaks. Btw, I believe the use of shared_ptr in the C++ runtime can still be optimized, and the closure()/closure_()/closureCheckingStopState() triumvirate should be rewritten as a loop (not only in C++, however).
>
> > All these numbers, have they been taken after warmup (the first parse run)?
> In my case there is never a second parse run.

Well, even multiple instances of a parser count as multiple runs (as long as the entire application stays alive). The ATN (and the generated DFA) are static and shared among all parser instances for the same grammar. So, once you have parsed certain input and the internal structure has been created for it, it will be much faster on the next parse run (regardless of which parser instance does it). In order to make this thread safe there are a few locks, which also cost a bit of time. I recommend doing some tests with a single parser on a single thread and running it twice on the same input, measuring only the second run. This should give you a good idea of the normal parsing speed.
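
For example, a minimal sketch in Java of what I mean (SVLexer, SVParser and source_text stand in for your generated names and start rule):

// imports: org.antlr.v4.runtime.*, java.nio.file.Paths
CharStream input = CharStreams.fromPath(Paths.get("design.sv"));
SVLexer lexer = new SVLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SVParser parser = new SVParser(tokens);

parser.source_text();               // warmup run: builds the shared DFA/ATN config caches

tokens.seek(0);                     // rewind the (already buffered) token stream
parser.reset();
long start = System.nanoTime();
parser.source_text();               // warm run: mostly walks already cached DFA states
System.out.println("warm parse: " + (System.nanoTime() - start) / 1000000 + " ms");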

> For Java version I'm just running TestRig built into ANTLR with no options or with just -SLL. For C++ the flow is the following: lexer(s) -> flyweight like file structure -> preprocessor -> filter -> actual parser specific token source -> parser. Everything up to parser token source is shared between old ANTLR2 and new ANTLR4 implementations. In normal simulator run once parser finishes its work all the generated VPI structures are run through basic checks and serialized to library, parser is destroyed and that's it.

I hope this doesn't involve releasing the static data (by closing the app that parsed the input). If so, you are using ANTLR4 in a worst-case scenario.

> Further stages load all they need from libraries (they can in fact be run in a completely separate process). In other words one simulator execution equals one parse (or no parse at all if no sources changed since last run). There are other rare options for running multiple parses in one go but default is to glue all input files together (all macros and directives are globally visible exactly as if all files were concatenated) so even if there are multiple input files there is always just one parse.

If that happens in a process that is never closed (and hence all static data stays alive) it should not be a problem.

Mike
--
www.soft-gems.net

Sam Harwell

Aug 2, 2017, 2:00:13 AM
to antlr-di...@googlegroups.com
The complexity of adaptivePredict really cannot be overstated. The algorithm is not only wildly complex, but has an embedded concurrent cache which uses wait-free fast paths (the reference implementation is wait free when the DFA is already constructed; my fork is wait free for that plus the construction of new DFA states for the first two outgoing edges (90+% of cases)).

Most performance problems can be best resolved by refactoring the grammar with the help of the profiler and a large set of input examples. In practice, almost all grammars and inputs do not encounter performance problems when using the reference release of ANTLR. Its internals are simpler and more widely used and understood, so it's less likely you'll be impacted by bugs. If you get to a state where performance is not a problem, my recommendation would be to stick with that. The main items to resolve in the profiler are:

1. Context-sensitive decisions. These are cases where SLL and LL prediction give different answers. If you have any of these, then the two-stage parsing strategy, which provides tremendous benefits in many applications, will not work at all (a sketch of that strategy follows this list).
2. Long lookahead. Sometimes it's not
3. Certain poorly structured subrules. For example, you can write a comma separated list as ((rule COMMA)* rule), but it's much more efficient when written as (rule (COMMA rule)*).
4. Failure to left-factor within a rule. For example:

// Avoid this:
rule : something ELSE | something;
// Use this instead:
rule : something (ELSE | ) ;
// Which is equivalent to this:
rule : something ELSE? ;
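
The usual setup for the two-stage strategy from item 1 looks roughly like this (Java; SVParser, tokens and source_text stand in for your own parser, token stream and start rule):

// imports: org.antlr.v4.runtime.*, org.antlr.v4.runtime.atn.PredictionMode,
//          org.antlr.v4.runtime.misc.ParseCancellationException
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
parser.removeErrorListeners();                   // don't report errors during the SLL attempt
parser.setErrorHandler(new BailErrorStrategy()); // bail out instead of trying to recover
try {
    parser.source_text();                        // stage 1: fast SLL parse
}
catch (ParseCancellationException e) {
    tokens.seek(0);                              // rewind the token stream
    parser.reset();
    parser.addErrorListener(ConsoleErrorListener.INSTANCE);
    parser.setErrorHandler(new DefaultErrorStrategy());
    parser.getInterpreter().setPredictionMode(PredictionMode.LL);
    parser.source_text();                        // stage 2: full LL parse with real error reporting
}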

In cases where performance is a problem and refactoring in this manner is not an option, you may have a good candidate for using my fork. It's less documented and much more complex than the reference implementation of ANTLR, and sometimes requires expert tuning to get it to behave well for any specific grammar. It's also only available for Java, C#, and TypeScript. An overview of most of the unique offerings from my fork is available in this document:
https://github.com/tunnelvisionlabs/antlr4/blob/master/doc/optimized-fork.md

Sam

Sam Harwell

Aug 2, 2017, 2:07:11 AM
to antlr-di...@googlegroups.com
Additional tips:

* Do not assume the lexer is fast without measuring. I've seen lexers that were 95% of total parse time because a predicate placement caused the DFA to be disabled.
* Try to avoid using predicates. If you have to use a predicate in a lexer, do not place it on the left edge.
* Avoid using non-greedy constructs. Instead, for a lexer use the strategies seen here: https://github.com/tunnelvisionlabs/antlr4-grammar-postgresql/blob/master/src/com/tunnelvisionlabs/postgresql/PostgreSqlLexer.g4. For a parser, just avoid them because they are horrifically slow.
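
As an illustration of the lexer strategy (a minimal sketch, not taken from that grammar):

// Instead of the convenient but slow non-greedy form:
STRING : '"' .*? '"' ;

// spell out what is allowed inside the token, so no non-greedy subrule is needed:
STRING : '"' ( '\\' . | ~["\\\r\n] )* '"' ;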

Sam
Message has been deleted
Message has been deleted
Message has been deleted
Message has been deleted
Message has been deleted
Message has been deleted

ABW

Aug 2, 2017, 8:52:17 PM
to antlr-discussion
Sorry for the delay. I wanted to test something before replying.


>>> Parsing with ANTLR4 generated parsers is very sensitive to both the grammar and the input.
>> The more surprising it is that I can't find the guide that would tell "do this" and "avoid that" for ANTLR4 grammars.
> Yes true. I thought about starting a list for that more than once. But where to keep it? Sam could say a lot about speed optimization, much more than anyone else and this knowledge should be saved.
I think the best would be either a supplement to the ANTLR4 book or something like the ALL(*) tech report. Or simply an article/wiki linked from the main ANTLR page.


>> In old ANTLR all was clear - if you needed syntactic predicate this was the point where you could have performance problems for some inputs.
> Terence wrote somewhere that ANTLR4 is not primarily optimized for speed, but for convenience. After all it's an academic project that just happen to fit quite well for many real world products.
Are you implying that I should drop ANTLR4 and stick to the ANTLR2 (or perhaps try ANTLR3)? :o)
Well, there is no point in convenience when I can't actually use it. Raw convenience is good for scripting languages like Python and applications that need to be developed fast, do whatever they need to do, and never be touched again. For any lasting application reasonable performance is necessary, if only so your regression tests don't take all day to run.


>> In new ANTLR everything falls into magic box of ParserATNSimulator::adaptivePredict. The magic is great when it works, but when it stops working or runs too slow the real problem starts. It is also much easier to misuse magic than plain solution.
> Misusing that is likely not a problem here, since it is an internal function.
By misusing I mean writing grammar in a way that unnecessarily stresses the prediction magic.


> Btw, I believe the use of shared_ptr in the C++ runtime can still be optimized and the closure()/closure_()/closureCheckingStopState() triumvirate should be rewritten as a loop (not only in C++ however).
Yes, please! The first thing I noticed (when I was running the first grammar tests, still in debug) was how long it takes to unwind the closure() recursion. I'm used to running multi-megabyte designs in debug builds, but with ANTLR4 it takes several seconds to parse through a couple of lines of input, with well over 90% of the time spent in closure().
When I started complaining about the performance, while I was trying to optimize the grammar, my coworker looked into the C++ runtime and, working on my original grammar, managed to reduce parse time somewhat and cut memory consumption in half with purely technical changes (removing virtuality from classes that don't need it, replacing memory-costly hash maps with regular ones, vectors with arrays where the size is known in advance, etc. - I haven't actually looked into the details or tried it yet), and as far as I know he hasn't even used block allocators yet (all these micro same-size objects just beg for such a solution). He also found some C++-target-specific bugs (I mean code that looks similar to its Java counterpart but does not actually perform the same action).


>>> All these numbers, have they been taken after warmup (the first parse run)?
>> In my case there is never a second parse run.
> Well, even multiple instances of a parser count as multiple runs (as long as the entire application stays alive).
I actually can't think of any application I've ever worked on that could realistically benefit from caching parser data between parses. Even in the case of web applications that might occasionally do some parsing, it is still better not to spend precious memory to speed up peripheral work - parsing is a utility, you parse, get the results and forget about it. To utilize the cache in full one would have to have an application that parses constantly, with parsing as its main purpose.

> ... This should give you a good idea about the normal parsing speed.
Like I said, the normal parsing speed for me is the first and only parse. Imagine e.g. an application like grep using ANTLR4 - there is no grep daemon residing in the system keeping cached parser data between invocations of grep; each run is a separate process and a single, first parser run. It is the same with the simulator. The "reuse of the parser" is achieved by not parsing at all on the second run and using the results of the first parse. Just like when you build a multi-project solution in Visual Studio you don't need to compile everything every time, just the things you've changed, using the .obj/.lib/.dll results of previous runs for the things that didn't change.


> I hope this doesn't involve releasing the static data (by closing the app that parsed the input). If so you are using ANTLR4 in a worst case scenario.
I wasn't releasing the cache - thanks for reminding me. It would be sitting idle while I need that memory for the main work. Sadly it turns out the C++ target is even slower than I thought, because I have to account for the cache destruction time (it takes about one extra second in my test example).

Actually I'm still quite optimistic when it comes to performance. Every time I figure out how to change a certain ambiguous rule so it is expressed better, the time associated with its execution pretty much disappears. The only problem is that I'd rather know in advance which rules to focus on instead of relying on luck that I'll run into an example that shows a particular rule is coded poorly. There is still one thing that I think (hope) is a bug and not a feature in ANTLR4 (in general, not just the C++ target). I've managed to cut the example file and gut the grammar to the point that pretty much any change in the grammar either causes syntax errors or makes the problem disappear. In my full example there are two places that cause unexpectedly long lookahead. In the attached example I've used the shorter one, because the rule where it happens is quite simple. The other place behaves in just as unexplained a way, causing lookahead that eats more than half of the input, spawns 2/3 of the DFA states and "wins" in the "total k"/"max SLL k"/"DFA k" and similar IntelliJ plugin categories by a solid 20 times over the next competing decision/rule. I'm pretty sure this problem is the reason for the majority of both the memory consumption and the parse time. There was a chance that the problem only shows up in the plugin and does not occur in reality (meaning performance is not dragged down by a single problem), but I've run it under a debugger and exactly during execution of prediction on this decision the DFA count increases significantly. The thing I find most strange is that removing any of the pieces marked with !!! in the attached grammar (none involved in the problematic decision and most not even used) makes the problem disappear.

PS. Google is messing with this post, I hope it does not end up in multiple copies (actually every time I try and end up with "there was an error posting to the group" there is a new "Message has been deleted"... great, trying without the attachment now)

ABW

Aug 2, 2017, 8:56:50 PM
to antlr-discussion
Maybe a smaller attachment, without the screenshot from IntelliJ (showing the results of source_text testing sorted by the "max SLL k" column) and the build script, will get through...
lookahead_bug.zip

ABW

Aug 2, 2017, 9:36:47 PM
to antlr-discussion
> * Do not assume the lexer is fast without measuring. I've seen lexers that were 95% of total parse time because a predicate placement caused the DFA to be disabled.
For the Java version I'm using a simplified lexer (the same as in my bug attachment) that works only because the input is already preprocessed; for C++ the lexer is hand-written and the time it takes is negligible (milliseconds).

> * Try to avoid using predicates.
I'm using them as checks in some places (placed after the decision has already been made) - most of them check whether a certain token that can take multiple values has the expected value. E.g. there is a non-keyword 'randomize' - you can use it everywhere as a regular identifier, but when applied to a class method it means the built-in randomize functionality. There is special syntax associated with it; that is, it can be called much like any other method (and then there is no need to treat it in a special way), but when that special syntax is used it can only work with the randomize method (and therefore it is the parser's duty to make sure the method identifier equals 'randomize'). I've noticed I'm not the only one dealing with such things; I'd very much like the functionality proposed here: https://github.com/antlr/antlr4/issues/1965
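
For illustration, a stripped-down sketch of the kind of rule I mean (hypothetical rule and token names, Java-target actions):

// 'randomize' is not a keyword, so the call target is a plain identifier; the
// predicate sits after the token, i.e. after the decision, and only validates it.
randomize_call
: ref=IDENTIFIER {$ref.text.equals("randomize")}? '(' argument_list? ')' ('with' constraint_block)?
;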

ABW

Aug 2, 2017, 10:02:07 PM
to antlr-discussion
 > 1. Context-sensitive decisions. These are cases where SLL and LL prediction give different answers. If you have any of these, then the two-stage parsing strategy which provides tremendous benefits in many applications will not work at all.
I'm already convinced that I have to aim at making the grammar work purely in SLL. The two-stage strategy in its original form (parse everything from the start on exception) is out of the question; it would be different if it were possible to execute such a strategy on selected subrules only. Too bad I have no idea how to spot context-sensitive rules. So far I've eliminated one, and I'm still not sure what it was that made it context sensitive when other rules look similar and are not.

> 2. Long lookahead.
As I see it, long lookahead is not in itself a bad thing when it is unavoidable. E.g. I have a couple of places where there is an alternative between expression and data_type - not only do they have ambiguous matching inputs (e.g. a reference can point to an object as well as a named type), they have an even wider set of ambiguous prefixes (a big subset of data_type can be a prefix of a cast or an assignment pattern expression). In the old grammar I have a syntactic predicate on the whole expression in these places, so ANTLR4 will not look ahead more than the old one does, but at least it will follow the predicted path without the need to repeat the process in the "non-guessing" run. Of course there is also the lookahead problem that I consider a bug, mentioned in previous posts.


> 3. Certain poorly structured subrules. For example, you can write a comma separated list as ((rule COMMA)* rule), but it's much more efficient when written as (rule (COMMA rule)*).
Well, for a comma-separated list it might be obvious, but what to do in a case like this:
reference: ( scope DOT )* ID;
where the scope rule contains ID as one of its alternatives?


> 4. Failure to left-factor within a rule.
When left-factoring you are giving away the ability to label alternatives, so it is not entirely neutral for the grammar.

Mike Lischke

Aug 3, 2017, 3:48:55 AM
to antlr-di...@googlegroups.com
>> In old ANTLR all was clear - if you needed syntactic predicate this was the point where you could have performance problems for some inputs.
> Terence wrote somewhere that ANTLR4 is not primarily optimized for speed, but for convenience. After all it's an academic project that just happen to fit quite well for many real world products.
> Are you implying that I should drop ANTLR4 and stick to the ANTLR2 (or perhaps try ANTLR3)? :o)

I'm suggesting nothing here. Just wanted to point out that speed was not the first concern for ANTLR4's design.


> When I started complaining about the performance, while I was trying to optimize the grammar my coworker looked into the C++ runtime and working on my original grammar managed to reduce parse time somewhat and cut memory consumption in half just by purely technical changes (removing virtuality from classes that don't need it,

There is still ongoing work to improve this and similar cases (e.g. https://github.com/antlr/antlr4/pull/1930, https://github.com/antlr/antlr4/pull/1919). Some optimizations already have been included in the ANTLR4 repo.

> replacing memory costly hashmaps with regular ones

This is something you should be very careful with. The way objects are found in such structures (map, set) is often based on the objects' properties, not the objects themselves. You can easily break the code by messing with that. Getting these right was one of the things that cost me a lot of time.

> , vectors with arrays where size is known in advance etc. - I haven't actually looked into the details yet or tried it) and as far as I know he haven't even used block allocators yet (all these micro same-size objects just beg for such solution). He also found some C++ target specific bugs (I mean code that looks similar to its Java counterpart but does not actually perform the same action).

Let him make a pull request on Github, so others can review the code and benefit from his work.

> Well, even multiple instances of a parser count as multiple runs (as long as the entire application stays alive).
> I actually can't think of any application I've ever worked on that could realistically benefit from caching parser data between parses. Even in case of web applications that might occassionally do some parsing it is still better not to spend precious memory to speed up peripheral work - parsing is a utility, you parse, get the results and forget about it. In order to utilize the cache in full one would have to have an application that parses constantly and it is its main purpose.

I'm afraid parsing is more than that. Some work needs one-time preparations that produce data which must be stored somewhere, hence you need a cache. If you fight with your tool because you think it should behave differently than it does, you are constantly in trouble. You can then only switch to a different tool or play by its rules.


> ... This should give you a good idea about the normal parsing speed.
> Like I said, the normal parsing speed for me is on the first and only parse. Imagine f.e. application like grep using ANTLR4 - there is no grep daemon that would reside in the system keeping cached parser data between invocations of grep, each run is a separate process and a single first parser run. It is the same with simulator. The "reuse of parser" is achieved by not parsing at all the second run and using the results of the first parse. Just like when you build a multi-project solution in Visual Studio you don't need to compile everything every time, just the things you've changed using .obj/.lib/.dll results of previous parser runs for things that didn't change.

That's actually a good example *pro* a cache. VS can optimize speed here because it keeps things around that have been processed before. The same happens in an ANTLR4-based parser. If a path through the ATN hasn't been walked before, the parser needs to store it. The next time it is taken, the stored data can be used.


> I hope this doesn't involve releasing the static data (by closing the app that parsed the input). If so you are using ANTLR4 in a worst case scenario.
> I wasn't releasing the cache. Thanks for reminding me. It would be sitting idly while I need that memory for main work. Sadly it turns out the C++ target is even slower than I thought because I have to account for cache destruction time (it takes about one extra second in my test example).

Just to be clear here: I'm not talking about a cache you have to maintain. What I mean is the internal data produced by a parser and shared among all parsers (for a given grammar). This happens internally and is nothing you have to take care of.


> Actually I'm still quite optimistic when it comes to performance. Every time I figure out how to change certain ambiguous rule so it is expressed better the time associated with its execution pretty much disappears.

As said before: the parser is very sensitive to grammar and input structure. Speed varies greatly depending on that.


ABW

Aug 7, 2017, 12:25:10 PM
to antlr-discussion
My optimism faded.

I reached the point where I can't do anything more. Only two decisions show ambiguities in my test case. One is on the 'else' in the only if-if-else case in the test. I don't think it is a legitimate ambiguity though, as both paths involve the exact same rule and decision, only at different call depths, and since ANTLR will always take the deepest path anyway it should not consider the other path at all (that assertion would probably need a formal proof). The other ambiguities are in the expression rule; I'm not really sure why, as I see none. I can eliminate them by expressing operator priorities the traditional way (with separate rules for each level), but that solution is actually slower (as expected).

Multiple rules that exhibit the uncontrolled lookahead presented in my earlier example occupy the top 6-7 positions in pretty much all important profiler categories, from time spent, through number of DFA states, to lookahead depth. My latest results are around 8s + 1GB for Java SLL mode and 26s + 700MB for C++ SLL. When I use the changes in the C++ runtime from my coworker the time is pretty much unchanged and memory drops to 452MB. This is still at least ten times too slow and also 6 times too memory consuming (just a reminder - this is raw parsing, compared to the old ANTLR's 2.2s + 65MB doing all the actions that build the VPI data). I could live with some slowdown and memory consumption, but the differences are just too big.

For the time being I'm going to try ANTLR3. Sadly I'm sure there are plenty of features of ANTLR4 that I'll miss :o(


> Let him make a pull request on Github, so others can review the code and benefit from his work.
That was the original plan, although he used our coding convention, and now that I'm turning away from ANTLR4 he might not be so eager to correct it. I'm also not sure if he kept thread safety, as we don't need it for the simulator.
I have some remarks of my own:
 - avoid for (auto elemCopy : container) - use for (auto &elemRef : container) unless you really want to create a copy of each element inside the loop or the elements are of a primitive type or a plain pointer (in that case it is still better to use for (auto *elemPtr : container) for clarity)
 - I've seen this a lot in Java for some reason, however in C++ using dynamic_cast more often than not indicates a problem with the class interface
 - linear search - unless the routine is barely used or the container is always nearly empty, linear search is a bad idea; e.g. the context token getters are too convenient to use in a wrong way; if they and the underlying linear search routine didn't exist, the user would have to either use explicit labels for quick access or roll through the children manually - a good reason to ask "do I really want to do that?"; linear search is also present in internal routines like ATNState::addTransition
 - ATNState::removeTransition - in its current form in C++ it leaks the transition and does not work like its Java counterpart; considering how it is used it should really be popTransition, otherwise (if it were frequently used with an arbitrary index) it would suggest the use of a different type of transition container
 - PredictionContext::combineCommonParents - in its current form in C++ it is a costly no-op (uniqueParents uses the default std::less instead of the defined equality comparator, so all parents are always considered unique unless they already are shared pointers to the same object); also, while keeping the general code scheme similar to Java helps maintain multiple targets, there is no point in doing so down to the last line; I'd rather code it like this (one uniqueness search per element and one pass over parents):
   bool PredictionContext::combineCommonParents(std::vector<Ref<PredictionContext>> &parents) {
     PredictionContextCache uniqueParents;
     for (auto &parent : parents) {
       auto insertInfo = uniqueParents.emplace(parent);
       if (!insertInfo.second) //duplicate
         parent = *(insertInfo.first); //replace with first occurence of equal one
     }
     return uniqueParents.size() < parents.size();
   }


> Some work needs one time preparations that produces data which must be store somewhere, hence you need a cache. If you fight with your tool because you think it should behave differently than it does, you are constantly in trouble.
I wouldn't mind the cache if not for its humongous size. I find the idea that you can cache everything and not manage it very dangerous. Especially in the case of applications that could benefit from said cache the most - some kind of parser server that you just send your input to and get results back from. Such an application would have to stop all its parsers once in a while and drop the whole cache, unless ANTLR managed it internally. E.g. I'd expect ANTLR to maintain a list of prediction paths sorted by use count (frequently used = better) and length (a long path that is used is better than a short one that is used, but when not used it is worse than a short one that is not used) and prune when the cache exceeds a certain size. Also I'd expect that paths longer than a set number of decisions should have a "kill me" action at their end. After all, the longer the path, the smaller the chance it will be walked again after the initial input that caused it to spawn has passed, therefore it makes no sense to ever put it in the cache. I also understand that the above might be hard or even impossible, especially if different paths share parts of their chain. That, however, should put the whole idea into question, because an ever-growing cache is hardly acceptable for any application.


> Just to be clear here: I'm not talking about a cache you have to maintain. What I mean is the internal data produced by a parser and shared among all parser (for a given grammar). This happens internally and is nothing you have to take care of.
Any memory that is taken up after I'm no longer benefiting from its content needs to be released. In that sense I have to maintain it :o)

ABW

Aug 8, 2017, 8:12:51 AM
to antlr-discussion
Something just clicked in my mind; maybe it was the barrier that I had to overcome to start understanding a bit of what is going on inside ANTLR. But I'd better ask to make sure it clicked right :o)

ANTLR is not actually caching data - it caches code. It starts with no parser, and with each piece of input one giant DFA is grown that represents a parser tailored for accepting the input seen so far. When a new piece of input happens to fit into the existing DFA - good, it means there is a parser ready to do the work; if not, new states and transitions are added to the single graph and only then can the input actually be parsed. Am I getting it right now? So far I thought the parser was built statically, and only in places that are too problematic for static analysis does it use help from multiple separate small DFAs built dynamically for specific decisions.

Mike Lischke

Aug 9, 2017, 3:37:27 AM
to antlr-di...@googlegroups.com
> Multiple rules that exhibit the uncontrolled lookahead presented in my earlier example occupy 6-7 top positions in pretty much all important profiler categories, from time spent, through number of DFA states, lookahead depth etc. My latest results are around 8s + 1GB for Java SLL mode and 26s + 700MB for C++ SLL. When I use changes in C++ runtime from my coworker the time is pretty much unchanged, memory drops to 452MB. This is still at least ten times too slow and also 6 times too memory consuming (just a reminder - this is raw parsing while compared to old ANTLR 2.2s + 65MB doing all the actions that build VPI data). I could live with some slowdown and memory consumption but the differences are just too big.

Surprising that parsing with the C++ runtime is even slower than with Java, but I assume you are again only considering the first run (which in fact *can* be slower in C++). The differences between the warmup run and the following runs can be huge (7s vs 16ms), as I showed in this message: https://groups.google.com/forum/#!searchin/antlr-discussion/1$2B2*3-4$2F5|sort:relevance/antlr-discussion/edFmDYpby8M/3UHE179KCQAJ.


> For the time being I'm going to try ANTLR3. Sadly I'm sure there are plenty of features of ANTLR4 that I'll miss :o(

Well, after all you should of course use the tool that fits best to solve your problem, no doubt.


> I have some remarks of my own:
>  - avoid for (auto elemCopy : container) - use for (auto &elemRef : container) unless you really want to create a copy of each element inside loop or the elements are of primitive type or pure pointer (in that case it is still better to use for (auto *elemPtr : container) for clarity)

If you still see a range-based for loop without the reference operator, it is probably an oversight.

>  - I've seen this a lot in Java for some reason, however in C++ using dynamic_cast more often than not indicates a problem with class interface

There are also a number of unnecessary classes/interfaces and virtual functions (I optimized a few away already in the C++ target). But we target authors usually try to stay close to the Java target to avoid confusion (documentation still applies, examples work the same way etc). 

>  - linear search - unless the routine is barely used or the container is always nearly empty linear search is a bad idea; f.e. context token getters are too convenient to use in a wrong way, if they and the underlying linear search routine didn't exist, user would have to either use explicit labels for quick access or roll through children manually - a good reason to ask "do I really want to do that?"; linear search is also present in internal routines like ATNState::addTransition

A typical case of simplicity vs. speed. However, it still remains to be determined how much such iterations affect the runtime. My impression is rather that locks (also and in particular those used in shared_ptr) slow down things a lot.

>  - ATNState::removeTransition - in its current form in C++ it leaks transition and does not work like its Java counterpart; considering how it is used it should really be popTransition, otherwise (if it was frequently used with any index) it would suggest use of different type of transition container

Oh, that's in fact a bug. The removed transition should be returned. Thanks for spotting that. This method is only used for deserialization of the ATN (a one-time step at startup) and only when generating bypass transitions, so it's probably good as it is. The transitions list is frequently queried and that's what should be fast.

>  - PredictionContext::combineCommonParents - in its current form in C++ it is a costly no-op (the uniqueParents uses default std::less instead of defined equality comparator so all parents are always unique unless they are shared pointers on the same object already); also while keeping general code scheme similar to Java helps maintaining multiple targets, there is no point in doing so down to the last line; I'd rather code it like this (one uniqueness search per element and one pass on parents):
>    bool PredictionContext::combineCommonParents(std::vector<Ref<PredictionContext>> &parents) {
>      PredictionContextCache uniqueParents;
>      for (auto &parent : parents) {
>        auto insertInfo = uniqueParents.emplace(parent);
>        if (!insertInfo.second) //duplicate
>          parent = *(insertInfo.first); //replace with first occurence of equal one
>      }
>      return uniqueParents.size() < parents.size();
>    }



Thanks, that's one of the still unoptimized parts (there are more). The runtime is complex and the primary goal was to get it working without leaking memory.


> I wouldn't mind the cache if not for its humongous size. I find the idea that you can cache everything and not manage it very dangerous. Especially in case of applications that could benefit from said cache the most - some kind of parser server that you just send your input to and get the results. Such application would have to stop all its parsers once in a while and remove all cache, unless ANTLR was managing it internally. F.e. I'd expect ANTLR to maintain a list of prediction paths sorted by their use count (frequently used = better) and length (longer that is used is better than short that is used, but when not used it is worse than short that is not used) and prune when cache exceeds certain size.

That's more of an algorithm question, not so much the runtime implementation.

> Also I'd expect that paths that are longer than set number of decisions should have "kill me" action at their end. After all the longer the path the less chance it will be walked again after passing initial input that caused it to spawn therefore it makes no sense to ever put that in cache.

That's similar to something I suggested before. Currently there's no guarantee on the length of the lookahead, except when EOF is encountered. This is also why it is so difficult to implement parsing from unbuffered data (e.g. over an HTTP connection). If we could stop the lookahead at a certain depth and solve ambiguities on a first-come-first-served basis (or via predicates, as we did before ANTLR4), that could help in certain extreme situations.


Mike Lischke

Aug 9, 2017, 4:00:11 AM
to antlr-di...@googlegroups.com
> Something just clicked in my mind, maybe it was the barrier that I had to overcome to start understanding a bit of what is going on inside ANTLR. But I better ask to make sure it clicked right :o)

> ANTLR is not actually caching data - it caches code.

It's caching data, more precisely it caches ATN configurations + DFA states which allow for a faster walk through the state machine.

> It starts with no parser and with each piece of input one giant DFA is grown that represents a parser tailored for accepting the input seen so far.

Yes, something along this line. However, it starts with the static ATN, which is written as a blob into the generated parsers and lexers and deserialized at application startup. Then for each combination of input symbols a set of ATN configurations is created, if not done before, and stored in a DFA state. This creation is what makes up the warmup phase (in fact you can have multiple such phases; whenever input comes in that hasn't been seen yet, this creation process is triggered).

> When new piece of input happens to fit in existing DFA - good, it means there is a parser ready to do the work,

Not a parser, but a path. There's never more than the one parser you start with. It's the path data for the ATN execution that's missing at the beginning and generated as you go. And in extreme cases (like what Sam and I mentioned, when you have predicates on the left-hand side of a lexer rule) the result of this generation is thrown away after execution and recreated on the next run (when parsing enters the same state machine path), which can slow things down a lot. I described this effect also in my blog post here: http://www.soft-gems.net/index.php/tools/49-the-antlr4-c-target-is-here (see the Development paragraph).

> if not, new states and transitions are added to the single graph and only then the input can actually be parsed. Am I getting it correctly now?

Yes, mostly. You can see this pretty well in the work loop in `executeATN` (https://github.com/antlr/antlr4/blob/master/runtime/Cpp/runtime/src/atn/ParserATNSimulator.cpp#L161). It first checks if there is an existing DFA state. If there's none, it's created first and then used for the actual parsing step. The heavy lifting of finding the target state is then done here: https://github.com/antlr/antlr4/blob/master/runtime/Cpp/runtime/src/atn/ParserATNSimulator.cpp#L270. That's where I see quite some potential for optimization as well. One idea I have had for quite some time is to somehow manage DFA states and ATN configs centrally, so that we can work with raw pointers in the DFA (to avoid copying shared_ptr around), but the Java target relies on automatic garbage collection here, which is hard to simulate in C++ (states and configs are simply overwritten when no longer needed and freed by GC then).

