CheckStyle Project: slow performance with adaptivePredict

R Veach

Apr 27, 2016, 10:43:56 AM
to antlr-discussion
Hi,

I am a developer helping out with the CheckStyle project: https://github.com/checkstyle/checkstyle/
We use ANTLR grammars to parse Java and Javadoc in our utility, which programmatically analyzes user code for style issues.

We have noticed a slowdown in our Javadoc grammar compared to our Java grammar. After examining the hotspots, we have determined that the specific method causing the issue is adaptivePredict in ANTLR.

"you have a decision in your grammar which requires a very large amount of lookahead and/or is ambiguous or context-sensitive"

I have a general idea of what the problem is, but I am unsure how to resolve it for our project.

We were hoping your team might have some guidance for us on how to fix the grammar to avoid this performance issue.



In the meantime, we did come up with an alternative solution that modifies the generated Java code to skip over the adaptivePredict call. While this is more of a hack, we have seen great speed improvements with it and no negative side effects in our unit tests or real-world usage.

The difference in the generated Java code: http://rnveach.github.io/checkstyle/1064/diff-report.html

Until we fix the issue with the grammar, is there anything we should be concerned about in applying a fix of this nature?
With that fix, the setState, errHandler.sync, and adaptivePredict calls all become dead code that won't be run. Are there any side effects or harm in skipping those three lines?


Any help or guidance is greatly appreciated.
Thank you for your time.

Jim Idle

Apr 27, 2016, 10:16:41 PM
to antlr-di...@googlegroups.com

Did you try using SLL mode first?

parser.getInterpreter().setSLL(true);
try {
     parser.yourrule();
     ...
}
catch (RuntimeException ex) {
     ...
}
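
For reference, a fuller version of that two-stage pattern might look something like the sketch below. It is only a sketch against the ANTLR 4 Java runtime, switching prediction modes via setPredictionMode; JavadocParser and javadoc() are placeholders for whatever your generated parser and entry rule are called:

    import org.antlr.v4.runtime.BailErrorStrategy;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.DefaultErrorStrategy;
    import org.antlr.v4.runtime.ParserRuleContext;
    import org.antlr.v4.runtime.atn.PredictionMode;
    import org.antlr.v4.runtime.misc.ParseCancellationException;

    // Try the cheaper SLL prediction first and bail on the first syntax
    // error; only fall back to full LL if the SLL pass actually fails.
    ParserRuleContext parseJavadoc(JavadocParser parser, CommonTokenStream tokens) {
        parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
        parser.setErrorHandler(new BailErrorStrategy());
        try {
            return parser.javadoc();
        }
        catch (ParseCancellationException ex) {
            tokens.seek(0);      // rewind the token stream
            parser.reset();      // and the parser state
            parser.setErrorHandler(new DefaultErrorStrategy());
            parser.getInterpreter().setPredictionMode(PredictionMode.LL);
            return parser.javadoc();
        }
    }

If SLL succeeds for most inputs, you avoid the expensive full-context prediction almost all of the time.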

However, I think that it might be as simple as a little left factoring. As with markup languages such as HTML, a lot of your constructs start with the same sequence of tokens, for instance OPEN, so you can use:

tags: OPEN (td | .... 

And take the OPEN out of each of the tags.

That simple factoring is probably not what your issue is, but look for all the places you can do that. Out of interest I will copy your parser and lexer and look at the generated code. It might be obvious where your issue is and I’ll report back.

Jim



Jim Idle

Apr 27, 2016, 11:01:58 PM
to antlr-di...@googlegroups.com
So your text rule:

    text : (CHAR | WS)+ ;

is fine by itself, but you then insert it all over the place, which is not a dumb idea, but it is creating a lot of ambiguities that I think are leading to your issues. For instance here:


    javadoc: (
            htmlElement
            | ({!isNextJavadocTag()}? LEADING_ASTERISK)
            | htmlComment
            | CDATA
            | NEWLINE
            | text
            | javadocInlineTag
         )*
         (LEADING_ASTERISK? WS* javadocTag)*
         EOF;

Your ()* loop can see a CHAR or WS as part of the text rule, but it can also see the WS in your WS* rule, so I think this ends up making it bloody complicated (technical jargon :).

In this case, and perhaps all others, I think you can just not use the text rule and put the WS and CHAR tokens in that loop, then make your LEADING_ASTERISK easier on the soul.

    javadoc: (
            htmlElement
            | ({!isNextJavadocTag()}? LEADING_ASTERISK)
            | htmlComment
            | CDATA
            | NEWLINE
            | WS
            | CHAR
            | javadocInlineTag
         )*
         ((LEADING_ASTERISK WS*)? javadocTag)*
         EOF;

And indeed the generated code looks better.

I don't have time to go through all your uses of the text rule. But if they are all in a loop, then you can possibly just change it to:

text : WS | CHAR ;

And let the loop higher up worry about it. Indeed if I make this mod and then examine the generated code, it looks (intuitively) a lot better. 

But I suspect just ridding yourself of the text rule and optimising the loops that were calling it as per above will solve your issues.

Another example is this:

    attribute:    HTML_TAG_NAME (NEWLINE | LEADING_ASTERISK | WS)*
              EQUALS (NEWLINE | LEADING_ASTERISK | WS)*
              (ATTR_VALUE | text | HTML_TAG_NAME);

The ()* after EQUALS can consume WS but so can the sequence that follows as it has the text rule.

So the issue is not the text rule per se, but the fact that you have it all over the place. If you don't want to go through and factor out all the text calls, then you could just remove them and call a function to consume CHAR and WS before you start each loop. 

And one final thought. Do you even need the WS and CHAR to show up in the parser? A quick glance implies not. Just skip() the WS at least and remove it from your parser. I think it will all work as planned. That's more or less what your hack is doing anyway.

I deleted all the WS references and left text consuming CHAR+ (could even just have CHAR I think). This also seemed to look a lot better. If you don't need the CHAR, then just remove the text rule altogether and add skip to CHAR and WS. That was the best looking code I got out of it.

One last observation is that your tags seem to repeat exactly the same alts, ls...javadocInlineTag. Why not factor those out into one rule?

Hope that helps,

Jim


R Veach

Apr 28, 2016, 9:53:31 AM
to antlr-discussion

If you don't need the CHAR, then just remove the text rule altogether and add skip to CHAR and WS. 
 
We actually do need text and the rest. We don't just check the style of HTML code; we also examine the text the user has entered in the Javadoc. We have certain checks to make sure they aren't using specific phrases, or that they end sentences with a period, etc. We do need WS in order to find breaks between words; otherwise we would have to examine the column positions a lot.
My grammar hack is still consuming the characters, not skipping them.
 
you can possibly just change it to:
text : WS | CHAR ;
 
This transforms our text trees from:
 
|       |--TEXT -> My  [1:0]
|       |   |--CHAR -> M [1:0]
|       |   |--CHAR -> y [1:1]
|       |   `--WS ->   [1:2]
into:
 
|       |--TEXT -> M [1:0]
|       |   `--CHAR -> M [1:0]
|       |--TEXT -> y [1:1]
|       |   `--CHAR -> y [1:1]
|       |--TEXT ->   [1:2]
|       |   `--WS ->   [1:2]
using our JUnit test 'com.puppycrawl.tools.checkstyle.MainTest.testPrintTreeJavadocOption'.
This kind of change makes the tree a bit harder to work with when we want to analyze sentences, as each text node now holds a single character instead of one or more words.
We will look into this too, but I think we are hoping not to have to merge words ourselves.
 
tags seem to repeat exactly the same alts ls...javadocInLineTag. Why not factor those out in to one rule? 
 
I didn't write the grammar, and am pretty new to ANTLR itself. So I can't say why this was written this way. :)

 
I don't have time to go through all your uses of the text rule.
optimising the loops that were calling it as per above will solve your issues.

I completely understand and we thank you for your insight. I never would have guessed to look into the ()*s.
We will look over our grammar and try to rework it as you have suggested but of course this will take some time. The grammar is a year old and it is quite large.

In the meantime, do you see any issues with releasing the modified version of the grammar, with the Java code embedded, to alleviate the performance problems for now?
Like I said, we haven't seen any issues with it (no invalid trees), but since we don't know the ANTLR core, maybe it is doing something bad internally that our tests aren't looking for.

 
 

Jim Idle

Apr 28, 2016, 10:17:20 PM
to antlr-di...@googlegroups.com
I see no real problem with the hack to be honest, as basically you are making it do what you 'know' to be true, though I might write it as a call to an embedded function that uses input.consume() to read off the tokens until LA(1) tells you that it is not WS or CHAR.
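
For illustration only, such a helper could sit in the parser's @members section and look roughly like the sketch below; skipTextTokens is a made-up name, _input is the generated parser's token stream field, and WS and CHAR are the generated token type constants:

    // Hypothetical helper for the @members block of the parser grammar:
    // eat WS/CHAR tokens straight off the token stream instead of letting
    // the prediction machinery loop over them.
    private void skipTextTokens() {
        int next = _input.LA(1);
        while (next == WS || next == CHAR) {
            _input.consume();
            next = _input.LA(1);
        }
    }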

Overall, you can of course improve the grammar now that you guys have a bit of experience with it.

You could get clever and detect CHAR and WS on another channel, inferring the text by looking at the indexes of the tokens that bound it. But I think that just reworking some of what you have is very doable and tricks like that have a habit of being more complicated than they are worth.
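
If you did go the channel route, recovering the text is mostly a matter of slicing the original character stream by token indexes. A minimal sketch, assuming WS and CHAR are sent to a hidden channel by the lexer and that you already hold the two tokens bounding the region:

    import org.antlr.v4.runtime.CharStream;
    import org.antlr.v4.runtime.Token;
    import org.antlr.v4.runtime.misc.Interval;

    // Recover the raw text lying between two parser-visible tokens by
    // slicing the character stream with their start/stop indexes.
    static String textBetween(CharStream chars, Token left, Token right) {
        return chars.getText(Interval.of(left.getStopIndex() + 1,
                                         right.getStartIndex() - 1));
    }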

Jim




Roman Ivanov

Apr 29, 2016, 9:28:23 AM
to antlr-discussion
Hi Jim,

thanks a lot for your answers.

> I might write it as a call to an embedded function that used input.consume() to read off the tokens until LA(1) tells you that it is not WS or CHAR.

Can you give us a link to an example so we can clearly understand your idea?

> You could get clever and detect CHAR and WS on another channel, inferring the text by looking at the indexes of the tokens that bound it.

Please give us an example and a link to the docs so we can understand what you mean and try to change the grammar according to best practices.

thanks,
Roman Ivanov
