I'm new to this forum. Hopefully I can find some help here to get me out of a performance issue.
I've built a Java 7 parser using the Java.g4 grammar (https://github.com/antlr/grammars-v4/blob/master/java/Java.g4). I find that parsing is extremely slow on large Java files.
This is what I experienced:
1. I downloaded ANTLR 4.1 and compiled Java.g4 into a parser, with no special command-line options.
2. I wrote the following Test.java to invoke the parser:
import java.io.IOException;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;

class Test {
    public static void main(String[] args) throws IOException {
        ANTLRFileStream stream = new ANTLRFileStream("CompletionTests_1_5.java.txt");
        JavaLexer lexer = new JavaLexer(stream);
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        JavaParser parser = new JavaParser(tokenStream);
        parser.compilationUnit();
    }
}
3. As an experiment, I downloaded a big Java file from Eclipse JDT and saved it as CompletionTests_1_5.java.txt (https://github.com/eclipse/eclipse.jdt.core/blob/master/org.eclipse.jdt.core.tests.model/src/org/eclipse/jdt/core/tests/model/CompletionTests_1_5.java).
4. I ran the compiled Test.class. Parsing that roughly 14,400-line file took ridiculously long, on the order of minutes, much longer than the Java compiler itself takes.
5. I tried to optimize Java.g4 by left-factoring alternatives, rewriting things like "expression : expression '.' Identifier | expression '.' 'this' | ..." into "expression : expression '.' (Identifier | 'this' | ...)". It didn't help: parsing was still very slow, and it used about 1 GB of RAM while parsing that file.
Can someone suggest what is causing the performance problem in Java.g4? I'm not familiar with ANTLR4's internals and don't know what else affects its performance.
Thanks!
Thanks Ter!
I have verified that adding the following speeds up the Java parser dramatically; speed is no longer an issue. (I also read the section of the book you referred to.)
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
However, according to the book, an SLL failure can be a false alarm, so we still need to retry the parse in full LL mode, and that retry can cost a lot of memory and time.
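For concreteness, here is how I understand the two-stage scheme from the book (just my sketch, assuming the JavaLexer/JavaParser generated from Java.g4; the bail strategy makes the SLL stage fail fast so the expensive LL stage only runs on files SLL cannot handle):

```java
import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.BailErrorStrategy;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.DefaultErrorStrategy;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.atn.PredictionMode;
import org.antlr.v4.runtime.misc.ParseCancellationException;

class TwoStageParse {
    static ParserRuleContext parse(String fileName) throws java.io.IOException {
        JavaLexer lexer = new JavaLexer(new ANTLRFileStream(fileName));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        JavaParser parser = new JavaParser(tokens);

        // Stage 1: fast SLL prediction; bail out on the first syntax
        // error instead of trying to recover.
        parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
        parser.setErrorHandler(new BailErrorStrategy());
        try {
            return parser.compilationUnit();
        } catch (ParseCancellationException e) {
            // Stage 2: the SLL failure may be a false alarm; rewind the
            // token stream and reparse with full LL and normal recovery.
            tokens.seek(0);
            parser.reset();
            parser.getInterpreter().setPredictionMode(PredictionMode.LL);
            parser.setErrorHandler(new DefaultErrorStrategy());
            return parser.compilationUnit();
        }
    }
}
```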
I'm trying to run the parser in a web server, and obviously this exposes the server to attacks with large files that cannot be parsed with SLL. Is there any way I can get around the problem (i.e., get an accurate parse result while not being vulnerable to attacks with large files)?
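The only workaround I can think of so far is rejecting oversized inputs up front, before parsing at all, roughly like this (a sketch; the 1 MB threshold is an arbitrary placeholder I made up, and this obviously sacrifices large legitimate files):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SizeGuard {
    // Arbitrary placeholder limit; would need tuning for the
    // server's actual memory and time budget.
    static final long MAX_BYTES = 1_000_000;

    // Throws if the file is larger than the configured limit,
    // so the caller never hands an oversized file to the parser.
    static void checkSize(Path file) throws IOException {
        long size = Files.size(file);
        if (size > MAX_BYTES) {
            throw new IllegalArgumentException(
                "Refusing to parse " + file + ": " + size + " bytes exceeds limit");
        }
    }
}
```

But that feels crude, so I'd prefer a real answer to the SLL/LL retry cost.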