ANTLR4 parser performance, ambiguities

536 views
Skip to first unread message

Moritz Becker

unread,
Dec 3, 2014, 2:15:49 PM12/3/14
to antlr-di...@googlegroups.com
Hi,
we are experiencing performance problems due to ambiguities in our grammar (see parser grammar and lexer grammar attached).
The main problem is (we think) that we have different expression rules for different data types (like string_expression, arithmetic_expression etc.) and in all these expression rules we use the state_field_path_expression rule at some point on the far left side of a production. The scalar_expression rule bundles all these expression rules and therefore we have ambiguities once the parser enters the scalar_expression rule with a state_field_path_expression on the input.
We don't know what would be the best way to sort out these ambiguities while avoiding duplicate code in the grammar.
Can you give some advice please?
JPQL_lexer.g4
JPQLSelectExpression.g4

Terence Parr

unread,
Dec 3, 2014, 2:19:33 PM12/3/14
to antlr-di...@googlegroups.com
hi. Are you having problems identifying the ambiguities or knowing how to resolve them? Identifying hotspots in ambiguities in the grammar is easy with the intellij plug-in for antlr 4.

Ter
> --
> You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <JPQL_lexer.g4><JPQLSelectExpression.g4>

David Whitten

unread,
Dec 3, 2014, 2:43:24 PM12/3/14
to antlr-di...@googlegroups.com
Well, you already said you factored out the inner calls.
I expect that you are going to have to lift the non-terminal
state_field_path_expression
to a higher rule so there is only one use of it.

Instead of having state_field_path_expression referenced from
scalar_expression, string_expression, arithmetic_expression etc, just
prefix them with the state_field_path_expression and remove that
reference from within them.

Or you can try to convince Ter to do some deep factoring on grammar
trees to recognize that the
rule is referenced on each path.

Moritz Becker

unread,
Dec 3, 2014, 3:18:20 PM12/3/14
to
We are already working with the intellij plug-in :) it's pretty good
I previously posted a wrong version of the grammar, find the right one attached.

So you mean e.g.

string_expression_or_state_field_path
 
: state_field_path_expression
 
| string_expression
 
;

string_expression
 
: general_path_element
 
| String_literal
 
| Input_parameter
 
| CONCAT '('string_expression',' string_expression (',' string_expression)*')'
 
| SUBSTRING '('string_expression',' arithmetic_expression (',' arithmetic_expression)?')'
 
| TRIM '('((LEADING | TRAILING | BOTH)? (trim_character)? FROM)? string_expression')'
 
| LOWER '('string_expression')'
 
| UPPER '('string_expression')'
 
| aggregate_expression
 
| case_expression
 
| function_invocation
 
;


and then we would use string_expression in scalar_expression and string_expression_or_state_field_path everywhere else? We already tried this and it works well in the case of string_expression but we got stuck with this approach for arithmetic_expression e.g. because of the left recursion

arithmetic_expression
 
: arithmetic_term
 
| arithmetic_expression op=( '+' | '-' ) arithmetic_term
 
;

We somehow need the state_field_path_expression in the recursion but when we use arithmetic_expression_or_state_field_path in this case then we also have the ambiguity.
What do you mean by deep factoring? Use semantic attributes to deactivate the rules?


Am Mittwoch, 3. Dezember 2014 20:19:33 UTC+1 schrieb the_antlr_guy:
hi.  Are you having problems identifying the ambiguities or knowing how to resolve them? Identifying hotspots in ambiguities in the grammar is easy with the intellij plug-in for antlr 4.

Ter
On Dec 3, 2014, at 11:15 AM, Moritz Becker <moritz....@gmail.com> wrote:

> Hi,
> we are experiencing performance problems due to ambiguities in our grammar (see parser grammar and lexer grammar attached).
> The main problem is (we think) that we have different expression rules for different data types (like string_expression, arithmetic_expression etc.) and in all these expression rules we use the state_field_path_expression rule at some point on the far left side of a production. The scalar_expression rule bundles all these expression rules and therefore we have ambiguities once the parser enters the scalar_expression rule with a state_field_path_expression on the input.
> We don't know what would be the best way to sort out these ambiguities while avoiding duplicate code in the grammar.
> Can you give some advice please?
>
> --
> You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <JPQL_lexer.g4><JPQLSelectExpression.g4>

JPQL_lexer.g4
JPQLSelectExpression.g4

Sam Harwell

unread,
Dec 3, 2014, 11:33:23 PM12/3/14
to antlr-di...@googlegroups.com
This type of refactoring is supported (automated) by my optimized fork, and does not change the shape of the parse trees.

For example, I use it here to address a performance issue in my Go parser without requiring me to use manually left-factored rules that deviate from the language specification.
https://github.com/tunnelvisionlabs/goworks/blob/master/goworks.editor/src/org/tvl/goworks/editor/go/parser/GoParser.g4#L437

The output of the parsing process is identical with and without the @leftfactor{expression} line, but the performance of the automatically left-factored version is dramatically improved. Careful selection of the placement and choice of rule for the @leftfactor{} action is very important, as it does introduce a small amount of overhead for the parse tree (re)construction process. It's important to make sure it's only used where the lookahead benefits clearly outweigh the overhead of changing the way trees are constructed.

Thanks,
Sam

Moritz Becker

unread,
Dec 5, 2014, 6:54:43 AM12/5/14
to antlr-di...@googlegroups.com
I tried your approach with the current master version of the optimized fork and changed the grammar like this:

scalar_expression @leftfactor{state_field_path_expression}
                   
: arithmetic_expression
                   
| string_expression
                   
| enum_expression
                   
| datetime_expression
                   
| boolean_expression
                   
| coalesce_expression
                   
| nullif_expression
                   
| type_discriminator
                   
| identifier
                   
| Input_parameter
                   
| case_expression
                   
;

But I do not experience any performance improvement...
I have attached a test project, I would be glad if you could take a look and tell me what I am doing wrong. I have built your ANTLR fork and verified that it is located in my local repository.
leftfactor-test.zip

Sam Harwell

unread,
Dec 5, 2014, 7:46:27 AM12/5/14
to antlr-di...@googlegroups.com

Hi Moritz,

 

I was able to run your code, but I was not able to reproduce a clear performance problem. With or without the @leftfactor action, and without any other special “tuning”, I’m observing an average time of 50ms to parse the sample expression using the “reference” build of ANTLR 4.3, and 30ms using the “optimized” build of ANTLR 4.4 (these are the latest builds of each in Maven Central). Can you provide some additional sample inputs that will highlight the areas that are causing problems for your application?

 

Note that to use the “optimized” build, you set the groupId in your pom.xml to read “com.tunnelvisionlabs” instead of “org.antlr” (for both the antlr4-runtime and antlr4-maven-plugin artifacts), and then you rebuild your project. If you set up the sample on GitHub somewhere I can send a pull request showing the changes I would make to simplify the project structure a bit. We could also use that project as a testing ground for the performance issues.

 

Thanks,

Sam

Moritz Becker

unread,
Dec 9, 2014, 2:33:05 AM12/9/14
to antlr-di...@googlegroups.com
Hi Sam,

I have set up a repository: leftfactortest
I tried to re-run it with the proper dependency but for me, the problem is still present. The test run takes 600-700ms on my machine (Intel Xeon @ 3,4 GHz) with both the normal and the optimized ANTLR version on JDK jdk1.7.0_45.

Interestingly, the Intellij Antlr Plugin takes just 200-300ms to parse the same expression. However, the profiler tells me that 98% of the total parse time is prediction time.

Sam Harwell

unread,
Dec 9, 2014, 10:28:54 AM12/9/14
to antlr-di...@googlegroups.com

Hi Moritz,

 

I sent you a pull request with some grammar/project simplifications.

 

Regarding the test: You are running a single test with a small input fragment. Your test timing currently includes the complete overhead of initializing ANTLR 4, deserializing the lexer and parser ATNs, and constructing the initially empty DFA cache. This only reflects the behavior of ANTLR 4 if you are writing an application whose sole purpose is to parse a single small expression like this exactly one time. After the first time, the DFA cache starts to be used instead of constructing a new one, bringing with it a dramatic performance improvement.

Terence Parr

unread,
Dec 9, 2014, 11:14:11 AM12/9/14
to antlr-di...@googlegroups.com
Wow. the interpreter claims to be faster? Hmm…
Ter
Reply all
Reply to author
Forward
0 new messages