ANTLR4-generated Java 7 grammar very slow


Thomas Feng

Sep 6, 2013, 4:03:24 AM9/6/13
to antlr-di...@googlegroups.com
ANTLR Folks,

I'm new to this forum. Hopefully someone here can help me out of a performance issue.

I've built a Java 7 parser using the Java.g4 grammar (https://github.com/antlr/grammars-v4/blob/master/java/Java.g4). I find that the grammar is extremely slow given a large Java file.

This is what I experienced:

1. I downloaded ANTLR 4.1 and compiled Java.g4 into a parser, with no special command-line options.

2. I wrote the following Test.java to invoke the parser:

import java.io.IOException;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;

class Test {
    public static void main(String[] args) throws IOException {
        ANTLRFileStream stream = new ANTLRFileStream("CompletionTests_1_5.java.txt");
        JavaLexer lexer = new JavaLexer(stream);
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        JavaParser parser = new JavaParser(tokenStream);
        parser.compilationUnit();
    }
}

3. As an experiment, I downloaded a big Java file from Eclipse JDT and saved it as CompletionTests_1_5.java.txt (https://github.com/eclipse/eclipse.jdt.core/blob/master/org.eclipse.jdt.core.tests.model/src/org/eclipse/jdt/core/tests/model/CompletionTests_1_5.java).

4. I ran the compiled Test.class. It took ridiculously long (several minutes, much longer than the Java compiler itself) to parse that Java file of about 14,400 lines.

5. I tried to optimize Java.g4 by left-factoring rules like "expression : expression '.' Identifier | expression '.' 'this' | ..." into "expression : expression '.' (Identifier | 'this' | ...)", but it didn't help: parsing was still very slow, and it also used about 1 GB of RAM while parsing that file.

Can someone suggest what is causing the performance issue in Java.g4? I'm not familiar with ANTLR 4 and don't know what else affects its performance.

Thanks!

Terence Parr

Sep 6, 2013, 12:07:21 PM9/6/13
to antlr-di...@googlegroups.com
Hi. Look at the book for two-stage parsing, where you try SLL first and fall back to full LL only if SLL fails. It's Section 13.7, "Maximizing Parser Speed." :)

Note we can parse *all* of the JDK 1.7 library cold in about 7 s on my box.
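[The two-stage pattern from the book can be sketched roughly as below. JavaLexer and JavaParser are the classes ANTLR generates from Java.g4, so this fragment only compiles against generated code plus the ANTLR 4 runtime; treat it as a sketch of the technique, not the book's verbatim listing.]

```java
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.PredictionMode;
import org.antlr.v4.runtime.misc.ParseCancellationException;

class TwoStageTest {
    public static void main(String[] args) throws Exception {
        ANTLRFileStream stream = new ANTLRFileStream(args[0]);
        JavaLexer lexer = new JavaLexer(stream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        JavaParser parser = new JavaParser(tokens);

        // Stage 1: fast SLL prediction; bail out on the first syntax
        // error instead of trying to recover.
        parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
        parser.removeErrorListeners();
        parser.setErrorHandler(new BailErrorStrategy());
        try {
            parser.compilationUnit();
        } catch (ParseCancellationException ex) {
            // Stage 2: SLL reported an error, which may be a false
            // negative. Rewind and reparse with full LL, which is
            // slower but gives the definitive answer.
            tokens.seek(0);
            parser.reset();
            parser.addErrorListener(ConsoleErrorListener.INSTANCE);
            parser.setErrorHandler(new DefaultErrorStrategy());
            parser.getInterpreter().setPredictionMode(PredictionMode.LL);
            parser.compilationUnit();
        }
    }
}
```

For valid input, stage 2 never runs, so you pay the LL cost only on files that actually contain (or appear to contain) syntax errors.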

Ter

Thomas Feng

Sep 6, 2013, 3:17:48 PM9/6/13
to antlr-di...@googlegroups.com
Thanks Ter!

I have verified that adding the following line speeds up the Java parser dramatically, so speed is no longer an issue. (I also read the section of the book you referred to.)

  parser.getInterpreter().setPredictionMode(PredictionMode.SLL);

However, according to the book, if SLL fails it could be a false negative, so we still need to retry parsing with LL. When we do that, it requires a lot of memory and time.

I'm trying to run the parser on a web server, and obviously this will expose the server to attacks (with large files that cannot be parsed with SLL). Is there any way to get around the problem (i.e., to get accurate parse results while not being vulnerable to attacks with large files)?

Thanks a lot for your answers!

Terence Parr

Sep 16, 2013, 5:32:49 PM9/16/13
to antlr-di...@googlegroups.com
On Fri, Sep 6, 2013 at 12:17 PM, Thomas Feng <tf...@berkeley.edu> wrote:
> Thanks Ter!
>
> I have verified that adding the following line speeds up the Java parser dramatically, so speed is no longer an issue. (I also read the section of the book you referred to.)
>
>   parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
>
> However, according to the book, if SLL fails it could be a false negative, so we still need to retry parsing with LL. When we do that, it requires a lot of memory and time.

Maybe, maybe not. And it will only do so occasionally.

> I'm trying to run the parser on a web server, and obviously this will expose the server to attacks (with large files that cannot be parsed with SLL). Is there any way to get around the problem (i.e., to get accurate parse results while not being vulnerable to attacks with large files)?

I think the answer is to optimize the grammar so that it rarely or never needs full LL. Sam and I will eventually have a nice tool that explains where the hotspots are, but in the meantime you can use the DiagnosticErrorListener to learn where it is failing over to full LL.
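[A minimal sketch of that suggestion, starting from the parser built in Test.java above: attach a DiagnosticErrorListener and parse once with exact ambiguity detection, and it will report each decision that is ambiguous or that attempts full-context (LL) prediction, pointing at the grammar hotspots.]

```java
// Report ambiguities and full-context fallbacks while parsing.
parser.removeErrorListeners();
parser.addErrorListener(new DiagnosticErrorListener());
parser.getInterpreter().setPredictionMode(
        PredictionMode.LL_EXACT_AMBIG_DETECTION);
parser.compilationUnit();
```

This run is slower than normal parsing, so use it offline for diagnosis, not in production.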

Ter 



--
Dictation in use. Please excuse homophones, malapropisms, and nonsense. 