About Antlr Performance

John Brown

unread,

Jun 22, 2021, 2:25:28 AM6/22/21

to antlr-discussion

Background:

Recently, I try to translate my old grammar rules, which are written in flex and bison, to antlr. However I met performance issue when I finish translation. the *.g4 file is about 1700s tokens and 1600s statements, 9400s branches of statements.

When I compiled the g4 file, the antlr generated ATN graph that was nedges:68285, states:60948. The edges number is such enormous that antlr can not serialize the edges data (it only supports 0-65535), so I had to modify the serialization code to fit my project.

Situation:

It works, but there are some performance issues I have to deal with. The parser performance is so low.

I generated java code and used grun to test. It took 8s or even 10s to parse a short sql sentence(e.g. select * from tablename where i=1 or j=2). It only take milliseconds when I use flex and bison to parse. Meanwhile, I try to generate javascript to parse, the solution time is unbearable(10m or even longer), sometimes Chrome got OOM and crashed.

The below is the flame graph when grun was parsing.

There are a lot of recursion when parsing. especially in the function: ParserATNSimulator.closureCheckingStopState and ParserATNSimulator.closure. It seems to take a bunch of time. Are there some methods for me to solve those performance issues?

Thx

Federico Tomassetti

unread,

Jun 22, 2021, 3:11:05 AM6/22/21

to antlr-di...@googlegroups.com

Hi John,
While I do not have an answer for your specific problem, we collected some ideas we had and some great suggestions by Mike Lischke on the topic of performance optimization for ANTLR parser and we wrote them here: https://tomassetti.me/improving-the-performance-of-an-antlr-parser/

Maybe some ideas could help you. In any case, I would be very grateful for any feedback you would have and curious to hear other suggestions you would receive.

Cheers,
Federico

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/38830e35-9656-4f89-b635-f8d6ba690a96n%40googlegroups.com.

--

Website at https://tomassetti.me

GitHub https://github.com/ftomassetti

John Brown

unread,

Jun 22, 2021, 4:57:23 AM6/22/21

to antlr-discussion

hi Federico,

Thank you very much for your reply. I saw this website before I posted this thread. And I changed some sub rules format and used grun with -SLL option for caching as the website mentioned. But unfortunately, There was not significant improvement. I wonder that is there any possible way to speed the process up to 1s or at the least 2 seconds. Because ,anyway, flex and bison , they can parse the sentence in milliseconds in the same rules.

yours

John

Mike Lischke

unread,

Jun 22, 2021, 5:07:42 AM6/22/21

to ANTLR discussion group

Hi John,

Background:
Recently, I try to translate my old grammar rules, which are written in flex and bison, to antlr. However I met performance issue when I finish translation. the *.g4 file is about 1700s tokens and 1600s statements, 9400s branches of statements.

When I compiled the g4 file, the antlr generated ATN graph that was nedges:68285, states:60948. The edges number is such enormous that antlr can not serialize the edges data (it only supports 0-65535), so I had to modify the serialization code to fit my project.

That sounds crazy, indeed. I am working with SQL grammars (MySQL, SQLite) a lot and never saw such numbers. Can you provide a link to the grammar, so I can take a look? For which RDBMS is that?

In my MySQL grammar (C++ target) I reach parsing times in the order of milliseconds for ordinary queries.

As you can see your query parses in 263µs, after warmup (25ms in the first run) So, it's not ANTLR4 alone (if at all), which makes your parsing slow.

Mike
--
www.soft-gems.net

John Brown

unread,

Jun 22, 2021, 6:01:09 AM6/22/21

to antlr-discussion

Hi Mike,

Thanks for replying. For some security reason, I can not share the grammar rule of my company's product with you (sorry for that). Our RDBMS rule is more complicated than MYSQL, it is kind of like a superset for MYSQL's grammar.

How many rules and sub-rules does MYSQL have? If my rule does need such a large number of rules. Is it normal for antlr to take seconds to parse?

Yours

John

Mike Lischke

unread,

Jun 22, 2021, 6:24:33 AM6/22/21

to antlr-di...@googlegroups.com

Thanks for replying. For some security reason, I can not share the grammar rule of my company's product with you (sorry for that). Our RDBMS rule is more complicated than MYSQL, it is kind of like a superset for MYSQL's grammar.

How many rules and sub-rules does MYSQL have? If my rule does need such a large number of rules. Is it normal for antlr to take seconds to parse?

Without seeing the grammar it's really hard to say anything useful for your problem. The sheer size certainly has some impact, but I think it's more a matter of how rules are organised (ambiguities, duplicate parts etc.). My MySQL grammar has 707 parser rules and 846 lexer rules (including keywords). Comparing that to the values you gave in your first post that significantly less. So I assume you are trying too much in your grammar (for example doing semantic work, like determining number of parameters in function calls). But as I said, without grammar we cannot do anything.

Mike
--
www.soft-gems.net

John Brown

unread,

Jun 22, 2021, 6:55:01 AM6/22/21

to antlr-discussion

I agree with you. It is impossible to analyze an accurate result without the grammar file.

Because bison generates statements file called *.output in BNF form with '-v' option, meanwhile, Antlr uses BNF to define the rule too. So I leveraged that output file and translate it to the Antlr grammar rule. Maybe I should try to shrink some rules and check whether there is any problem you just mentioned, like ambiguities or duplicate parts. Thank you again for your attention.

Ken Domino

unread,

Jun 23, 2021, 1:58:51 PM6/23/21

to antlr-discussion

If you are generating for the JavaScript target, you should know there is--in my opinion--an extremely serious bug in the JavaScript target: https://github.com/antlr/antlr4/issues/3135. In this bug, the ATN graphs per each rule are mutated as the parse is performed! Modifying the ATN graph of a rule is equivalent to mutating the grammar as you parse the grammar. I personally have observed the slowdown several times, and have verified it due to this bug. I would not trust the target whatsoever, and I disabled the testing of the JavaScript target CI build for https://github.com/antlr/grammars-v4. I was going to work on a PR fix for this but I hadn't had time to complete it. --Ken

eric vergnaud

unread,

Jun 23, 2021, 2:24:03 PM6/23/21

to antlr-discussion

Ken,

please refrain from making definitive comments such as "I would not trust the target whatsoever", when thousands of users are perfectly fine with it (that includes myself)

Not saying the JS runtime is bug-free but so far, only you have observed this bug and although it may slow down parsing in some edge cases, I am not aware of any incorrect parsing.

Plus are your comment not off-topic (antlr vs lex/bison) ?

You've done great work identifying bugs which we were able to fix together.

Is it too much to ask you to rather focus on proposing a PR :-) which we can review together ?

Ken Domino

unread,

Jun 23, 2021, 4:31:03 PM6/23/21

to antlr-discussion

Sorry, the JS target overall is execellent work, and I'm a great fan of Antlr. When I have time to finish other work, I have a few PRs for Antlr to complete. I guess what I'm trying to say is that if porting a grammar from Bison to Antlr, why not just first try a comparable target that Bison supports (C++ or Java)? --Ken

rtm...@googlemail.com

unread,

Jun 27, 2021, 2:34:57 PM6/27/21

to antlr-discussion

Hi, and sorry to hear your troubles.

I am doing an SQL-like language in C# and I'm getting nothing so awful as this. My small test suite runs in well under a second (probably close to instantaneous but for the overhead of visual studio loading DLLs & other stuff) and each test is considerably more complex than that your simple select. There are 16 tests.

I can't suggest anything but to ask, have you tried it in another compiled language? Can you try C# if possible? Just to see if the sloth goes away.

Without seeing the grammar I'm afraid it's going to be difficult to help.

jan

Message has been deleted

Wanadoo

unread,

Jun 30, 2021, 4:32:13 AM6/30/21

to antlr-di...@googlegroups.com

John, the JS runtime is usually twice faster than the Python one so it seems you are definitely hitting some issue which affects performance.

But 3s in Java tells me you might be better off re-writing your grammar from scratch, or by enhancing an existing SQL grammar.

Envoyé de mon iPhone

Le 30 juin 2021 à 01:48, John Brown <zbyty...@gmail.com> a écrit :

Hi Jan,
Thanks for your advice. My environment is a UNIX-like system, so I did not try C# runtime. But I did some other runtime tests for my rules.

First, as I said above, I translate my grammar rules from a statement output file generated by Bison. That file does have some duplicated rules and adds a lot of "."(wildcard) in the original rules as Mike said. So I removed them, cut duplicated branches then generated a normal g4 file for Antlr. The edge and node of ATN is reduced to about 24k.

Now it's faster than before. However, it's still not as fast as I want. My test maybe not such strict. I used the same sentence, run in different runtimes on the same computer.

Cpp: about 2s, Java: about 3s, Python3: about 29s, and JS: about 15mins on Chrome(not OOM anymore (¬‿¬) ).

John

--
You received this message because you are subscribed to a topic in the Google Groups "antlr-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antlr-discussion/U2uSnV6zqyo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/c5712c3c-f1bb-42d2-8e13-b269cd0c63f8n%40googlegroups.com.

rtm...@googlemail.com

unread,

Jul 2, 2021, 8:28:34 AM7/2/21

to antlr-discussion

@John Brown.
I'll try to help a little more in a few days when I can sit in a chair (backache at the mo!)
Have you looked at, or based your grammar off, any of the SQL grammars in the collection of them at <https://github.com/antlr/grammars-v4/tree/master/sql>?
Perhaps conparing yours with them might show something up.

To others: maybe smething pathalogical's happening and perhaps it's hidden a bit, suppose A should be a sequence of c's :

    A : B+ ;
    B : c+ ;

this is sort of

    A : c++ ;

Does anyone know if this sends Antlr a bit dizzy, or anything like it?

cheers

jan

On Tuesday, June 22, 2021 at 7:25:28 AM UTC+1 John Brown wrote:

rtm...@googlemail.com

unread,

Jul 2, 2021, 8:36:17 AM7/2/21

to antlr-discussion

Last thing, this looks like a business problem not a technical one, because you can't share the code. Perhaps you could talk to your boss and get someone antlr-competent here (as in, not me) to help out under and NDA, with some payment to them. It might be the cheapest and quickest solution.

cheers

jan

Reply all

Reply to author

Forward