DFA warm-up and speed


Mária Jurčovičová (Meri)

Jun 8, 2015, 5:00:35 AM
to antlr-di...@googlegroups.com
Hi all,

I ported an antlr3 project to antlr4 and the result turned out to be too slow due to parser warm-up. Compiling the first file in unit tests can take over 2 seconds. Compiling the second file still takes longer than in antlr3, but that performance loss is acceptable.

Use case: My project will run either from the command line or as part of a build process and is unlikely to compile many files in a single run. I am fine with the antlr4 version being slightly slower, but a two-second penalty for every run matters.

What I did so far: I tried to persist `_sharedContextCache` to a file and load it again before the first start. It did not affect the speed. It is the building of the `_decisionToDFA` DFA array that takes all that time.

Persisting the DFA is more work, as it has a more complicated object structure.
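
For reference, this is roughly how those two structures are wired into a parser instance in the Java runtime. A minimal sketch only: the LessLexer/LessParser class names, the input file, and the stylesheet start rule are hypothetical stand-ins for the generated code, and injecting a previously persisted copy is not something the runtime supports out of the box.

import java.io.IOException;
import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.atn.ATN;
import org.antlr.v4.runtime.atn.ParserATNSimulator;
import org.antlr.v4.runtime.atn.PredictionContextCache;
import org.antlr.v4.runtime.dfa.DFA;

public class WarmupWiring {
    public static void main(String[] args) throws IOException {
        // Hypothetical generated classes; substitute the real lexer/parser names.
        LessLexer lexer = new LessLexer(new ANTLRFileStream("example.less"));
        LessParser parser = new LessParser(new CommonTokenStream(lexer));

        // The generated parser already shares _decisionToDFA and _sharedContextCache
        // statically across instances, so the warm-up cost is paid once per JVM.
        // Supplying them explicitly (e.g. after loading a persisted copy) would mean
        // replacing the interpreter with one built around those objects:
        ATN atn = parser.getATN();
        DFA[] decisionToDFA = new DFA[atn.getNumberOfDecisions()];
        for (int i = 0; i < decisionToDFA.length; i++) {
            decisionToDFA[i] = new DFA(atn.getDecisionState(i), i);
        }
        PredictionContextCache cache = new PredictionContextCache();
        parser.setInterpreter(new ParserATNSimulator(parser, atn, decisionToDFA, cache));

        parser.stylesheet();   // hypothetical start rule
    }
}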

My questions are: 
* Is there something else I can do to speed up parser warm-up?
* If not, is the DFA class structure stable? Does it make sense to spend time trying to persist/load that too?
* Would there be interest in a patch if I made the DFA persist/load and warm-up turned out faster?

In case it is relevant, my grammars can be found here:

With Regards,
Maria Jurcovicova

Terence Parr

Jun 8, 2015, 11:44:03 AM
to antlr-di...@googlegroups.com
Java on-the-fly compilation cost is likely the issue
Ter
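
A quick way to check how much of the two seconds is one-time warm-up (adaptive DFA construction plus JIT) versus steady-state parsing is to parse the same file twice in one JVM and time each pass. A sketch only; LessLexer, LessParser, and the stylesheet start rule are hypothetical names for the generated code.

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class WarmupTiming {
    public static void main(String[] args) throws Exception {
        for (int pass = 1; pass <= 2; pass++) {
            long start = System.nanoTime();
            LessLexer lexer = new LessLexer(new ANTLRFileStream(args[0]));
            LessParser parser = new LessParser(new CommonTokenStream(lexer));
            parser.stylesheet();
            long ms = (System.nanoTime() - start) / 1_000_000;
            // Pass 1 pays DFA construction and JIT warm-up; pass 2 reuses both.
            System.out.println("pass " + pass + ": " + ms + " ms");
        }
    }
}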

Gerald Rosenberg

Jun 15, 2015, 3:16:52 AM
to antlr-di...@googlegroups.com
Before trying the extraordinary, you may wish to optimize your grammar first. Much of your performance problem may be due to an overly literal translation of the v3 grammar to v4.

It would probably be helpful to start by looking at the 'standard' v4 LESS grammar in the github repository (https://github.com/antlr/grammars-v4/tree/master/less).

A couple of comments on your existing grammar: you have at least one parser rule -- the term rule -- that starts with a predicate. That can have a substantial performance impact on lookahead.

The two parser whitespace rules conflict - I don't think the second will ever be reached.

ws
    : (WS | NEW_LINE)* ;
mandatory_ws
    : (WS | NEW_LINE)+ ;

For LESS, whitespace should probably not be significant in the parser. If it is not significant and is dropped in the lexer, it looks like you can reduce the number of tokens going to the parser by 10 to 25%.

Finally, the individual letter fragments could be reduced to a simple range, given that random whitespace is not allowed in keywords. Also, it looks like you are missing a 'u' in the unicode forms. V4 handles unicode better/differently than V3 does.

Jim Idle

Jun 15, 2015, 8:41:46 PM
to antlr-di...@googlegroups.com
On Mon, Jun 15, 2015 at 3:16 PM, Gerald Rosenberg <gbrose...@gmail.com> wrote:
Before trying the extraordinary, you may wish to optimize your grammar first. Much of your performance problem may be due to an overly literal translation of the v3 grammar to v4.

It would probably be helpful to start by looking at the 'standard' v4 LESS grammar in the github repository (https://github.com/antlr/grammars-v4/tree/master/less).

A couple of comments on your existing grammar: you have at least one parser rule -- the term rule -- that starts with a predicate. That can have a substantial performance impact on lookahead.

The two parser whitespace rules conflict - I don't think the second will ever be reached.

These are parser rules, not lexer rules, so they will be called, or not called.
 

ws
    : (WS | NEW_LINE)* ;
mandatory_ws
    : (WS | NEW_LINE)+ ;

For LESS, whitespace should probably not be significant in the parser. If it is not significant and is dropped in the lexer, it looks like you can reduce the number of tokens going to the parser by 10 to 25%.

However, I also cannot see anything in this grammar that requires whitespace to hit the parser. But I do see some rules that probably suffer from having to predict spaces everywhere, such as:

declaration
    : property ( | ws PLUS ( | ws UNDERSCORE ) ) ws COLON (ws expression_full)?
    ;
 
The presence of a mandatory_ws rule suggests that all the other ws rules are there just to support that. However, this can probably be enforced semantically, and in v4 it might not actually be a requirement (though lexer modes may take care of it if it is).
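
One way to enforce mandatory whitespace semantically while keeping WS/NEW_LINE on the hidden channel is to ask the token stream whether hidden tokens sit next to a given parser token. A sketch only; the helper class and the predicate wiring are hypothetical, not part of the existing grammar.

import java.util.List;
import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.Token;

public final class WsChecks {
    private WsChecks() {}

    // True if at least one hidden-channel token (e.g. WS or NEW_LINE) directly
    // follows the given token. A semantic predicate in the grammar could call
    // this with the parser's token stream and the previous token.
    public static boolean followedByWs(BufferedTokenStream tokens, Token tok) {
        List<Token> hidden =
                tokens.getHiddenTokensToRight(tok.getTokenIndex(), Token.HIDDEN_CHANNEL);
        return hidden != null && !hidden.isEmpty();
    }
}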

There are also a lot of rules that can probably be merged, such as rules that do or do not require a comma, and so on.

I think that the performance is down to the grammar and not v4, and that you won't get too far trying to persist the cache. Why don't you profile your generated code and see where the time is being spent? While you could create a perfectly formed grammar, you will probably find that you just need to optimize a few constructs/rules.
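
For the profiling, the v4 runtime has a built-in decision profiler that shows where prediction time goes, which will point at the expensive rules directly. A sketch only; the generated class names and start rule are hypothetical.

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.atn.DecisionInfo;

public class ProfileParse {
    public static void main(String[] args) throws Exception {
        LessLexer lexer = new LessLexer(new ANTLRFileStream(args[0]));    // hypothetical
        LessParser parser = new LessParser(new CommonTokenStream(lexer)); // hypothetical
        parser.setProfile(true);   // collect per-decision statistics during the parse
        parser.stylesheet();       // hypothetical start rule

        // Print the decisions that spend measurable time in prediction (nanoseconds).
        for (DecisionInfo d : parser.getParseInfo().getDecisionInfo()) {
            if (d.timeInPrediction > 0) {
                System.out.printf("decision %d: %d invocations, %.2f ms in prediction%n",
                        d.decision, d.invocations, d.timeInPrediction / 1_000_000.0);
            }
        }
    }
}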

Definitely start by getting rid of the ws rule.

Jim



Gerald Rosenberg

Jun 18, 2015, 8:06:40 PM
to antlr-di...@googlegroups.com

On Monday, June 15, 2015 at 5:41:46 PM UTC-7, Jim Idle wrote:
On Mon, Jun 15, 2015 at 3:16 PM, Gerald Rosenberg <gbrose...@gmail.com> wrote:
The two parser whitespace rules conflict - I don't think the second will ever be reached.

These are parser rules, not lexer rules, so they will be called, or not called.
 

ws
    : (WS | NEW_LINE)* ;
mandatory_ws
    : (WS | NEW_LINE)+ ;


 I cannot believe I did that.  I even identified them as parser rules!  <sigh/>  Must have been a Senior Moment(tm).

Mária Jurčovičová

Jun 19, 2015, 8:27:33 AM
to antlr-di...@googlegroups.com
Thank you all for the answers; it seems the next step for me is to profile and tweak the grammar. The rest of this mail tries to answer the points about whitespace in less.js and the official antlr4 grammar.

1.) Whitespace in less is relevant in some situations. For example, the following two statements are different:

* `2 -1` - a list with two elements, `2` and `-1`
* `2 - 1` - an expression that should compile to `1`

The following selectors are different too:
* `div :not(:enabled)` - any child of div that is not enabled
* `div:not(:enabled)` - any div that is not enabled

My first v3 grammar sent whitespace to the hidden channel, but then I had to use too many semantic predicates in situations like these.

I do not like the idea of dealing with spaces in Java behind the parser, because having them in the grammar (whether as predicates or in rules) is easier to read. Anecdotally, every time I moved stuff from the grammar into the Java behind it, the compiler became harder to read/maintain. Moving code from Java into the parser usually paid off in the long term.

2.) Looking at the official grammar, it supports only a subset of less.js - for example, it has neither detached rulesets nor support for all at-rules (less.js also supports arbitrary at-rules).

I ran it on the W3C selectors test cases and the parser failed to parse 114 cases out of 175. When run on less expressions, 12 out of 18 tests failed. On the plus side, it is really fast and does not have the first-file penalty.

3.) I have not profiled the generated parser yet; that is what I am going to do next.

With regards,
Maria Jurcovicova
