serialization of parsed structure in ANTLR4

Leo Antoli

Jan 19, 2013, 12:34:12 PM

to antlr-di...@googlegroups.com

Hi,

I'm starting to use ANTLR4. I was wondering whether it is possible (and sensible) to serialize the ParserRuleContext (and its object graph) for the entry rule so I don't need to parse the source files again and again.

I'm building some tools to analyse a few thousand Java files using the Java grammar: https://raw.github.com/antlr/grammars-v4/master/java/Java.g4

I need to go through the classes several times to get different information. Parsing is "very" expensive; on my computer it takes about 10 seconds per 100 files.

I have code like this:

ANTLRInputStream input = new ANTLRInputStream(stream);
JavaLexer lexer = new JavaLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
JavaParser parser = new JavaParser(tokens);
CompilationUnitContext tree = parser.compilationUnit();

I'd like to be able to serialize "tree" to a file or DB so I can get Java structure in other executions/tools without parsing again.

I'm not sure whether this would improve speed dramatically or not; for instance, the serialized data may be so big that deserializing it and rebuilding the object graph is expensive enough that it's not worth it.

Thanks a lot.

Regards,

Leo

Terence Parr

Jan 19, 2013, 1:28:54 PM

to antlr-di...@googlegroups.com

Hi. Set the parsing mode to SLL or use the two-stage thingie mentioned in the book's
advanced section. MUCH faster.
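
For readers without the book at hand, the two-stage idea boils down to "try the fast strategy, and redo the file with the slow one only if the fast one gives up". The sketch below shows just that control flow. It deliberately does not depend on ANTLR: `TwoStageParse`, `fastStage` and `fullStage` are hypothetical names, and the comments describe what each stage would roughly correspond to in ANTLR 4 rather than the book's exact code.

```java
import java.util.function.Function;

/** Control flow of two-stage parsing, with the ANTLR calls abstracted away. */
class TwoStageParse {

    /**
     * Tries the fast strategy first and falls back to the slow, fully general
     * one only if the fast one gives up on this input.
     */
    static <I, T> T parse(I input, Function<I, T> fastStage, Function<I, T> fullStage) {
        try {
            // Stage 1: in ANTLR this would be SLL prediction mode combined with
            // an error strategy that throws on the first syntax error
            // (BailErrorStrategy) instead of trying to recover.
            return fastStage.apply(input);
        } catch (RuntimeException fastStageGaveUp) {
            // Stage 2: the input is either invalid or needs full LL. In ANTLR
            // you would rewind the token stream, reset the parser, switch to
            // PredictionMode.LL and parse again with normal error reporting.
            return fullStage.apply(input);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: the fast stage rejects any input containing '?'.
        Function<String, String> fast = s -> {
            if (s.contains("?")) {
                throw new IllegalStateException("fast stage bailed out");
            }
            return "fast:" + s;
        };
        Function<String, String> full = s -> "full:" + s;

        System.out.println(parse("class A {}", fast, full)); // fast:class A {}
        System.out.println(parse("a ? b : c", fast, full));  // full:a ? b : c
    }
}
```

Since most files should get through the first stage, the slow path is only paid for the few inputs that really need it.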

No built-in serialization, but you could implement it easily enough.
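
One way to "implement it" is to copy the parse tree into a small plain-data structure of your own and serialize that with standard Java serialization. The sketch below is a minimal illustration under that assumption: `TreeSnapshot` and `Node` are hypothetical names, and the step that walks a ParserRuleContext and fills in the nodes (for example via `getChildCount()`, `getChild(i)` and `getText()`) is not shown, so the tree here is built by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TreeSnapshot {

    /** Hypothetical plain-data mirror of one parse tree node. */
    static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;

        final String label; // rule name for inner nodes, token text for leaves
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        /** LISP-style rendering, e.g. (a b (c d)). */
        String toStringTree() {
            if (children.isEmpty()) {
                return label;
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) {
                sb.append(' ').append(child.toStringTree());
            }
            return sb.append(')').toString();
        }
    }

    /** Serializes the whole tree into a byte array (could equally be a file or a DB blob). */
    static byte[] save(Node root) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds the tree from bytes produced by save(). */
    static Node load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Node) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Node class missing on the classpath", e);
        }
    }

    public static void main(String[] args) {
        Node tree = new Node("compilationUnit")
                .add(new Node("typeDeclaration").add(new Node("class")).add(new Node("A")));
        Node copy = load(save(tree));
        System.out.println(copy.toStringTree()); // (compilationUnit (typeDeclaration class A))
    }
}
```

Default Java serialization recurses once per nesting level, so very deep trees may need a custom format. Whether loading is actually faster than re-parsing is exactly the open question from the original post and would have to be measured.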

Ter
--
Dictation in use. Please excuse homophones, malapropism, and nonsense.

Leo Antoli

Jan 19, 2013, 2:10:35 PM

to antlr-di...@googlegroups.com

Thanks.

I parsed 1000 Java files with SLL, LL and LL_EXACT_AMBIG_DETECTION, and they all seem to take about the same time: each run is very close to 30 seconds.

BTW, in the book it says: parser.getInterpreter().setSLL(true);

but setSLL doesn't exist in ParserATNSimulator in antlr 4.0-rc-1

I think it should be something like:

parser.getInterpreter().setPredictionMode(PredictionMode.SLL);

Is there an estimated date for the first ANTLR 4 release?

Regards,

Leo


Terence Parr

Jan 19, 2013, 2:21:51 PM

to antlr-di...@googlegroups.com

Hoping for a Monday/Tuesday release :)
T

Terence Parr

Jan 19, 2013, 2:23:12 PM

to antlr-di...@googlegroups.com

Btw, that timing is very strange. I can parse all of the JDK source
library in about 5s without warm-up in SLL mode. It's 19s in full LL mode.

Sam Harwell

Jan 19, 2013, 2:40:29 PM

to antlr-di...@googlegroups.com

I should add that this data set is over 7000 files, over 70MB of Java source code.

LL and LL_EXACT_AMBIG_DETECTION have the possibility to perform similarly. If SLL is performing the same as those two, you're doing something wrong (or you're ***extremely*** bottlenecked on your storage device, enough to hide a 4:1 processor performance discrepancy).

--
Sam Harwell
Owner, Lead Developer
http://tunnelvisionlabs.com


Terence Parr

Jan 19, 2013, 2:45:46 PM

to antlr-di...@googlegroups.com

or not enough mem?
T

Sam Harwell

Jan 19, 2013, 2:48:11 PM

to antlr-di...@googlegroups.com

Memory constraints would affect both SLL and LL, so both would be slower, but there would still be a large gap between them.

My guess is he didn't enable SLL prediction mode properly, but I would need to see the code to be sure.


Leo Antoli

Jan 19, 2013, 4:38:23 PM

to antlr-discussion

Thank you guys for the quick responses.

I'm using a MBP with SSD so storage device performance shouldn't be a problem. It has 8 GB (2 GB for JVM) so memory performance shouldn't be a problem either.
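
A quick way to rule memory in or out is to ask the JVM how much heap it was actually given, since the intended setting and the effective one can differ. The snippet below is a small sketch using only `java.lang.Runtime`; the class name `HeapCheck` is made up for illustration, and the printed numbers depend entirely on how the JVM was started.

```java
class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Heap currently occupied by objects (allocated minus free), in MiB.
        long usedMiB = (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
        System.out.println("max heap  : " + maxHeapMiB() + " MiB");
        System.out.println("used heap : " + usedMiB + " MiB");
    }
}
```

Run it with the same JVM options as the parsing tool. A maximum far below the expected 2 GB would suggest the heap setting is not reaching the process.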

In order to check I'm doing it right, could you point me to some Java source files (or their URLs) for which the 4:1 ratio shows up?

Could it be a performance problem in the grammar? https://raw.github.com/antlr/grammars-v4/master/java/Java.g4

Could it be that lexer time is much longer than parser time?

The code I'm measuring is:

ANTLRInputStream input = new ANTLRInputStream(sourceFile);
JavaLexer lexer = new JavaLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
JavaParser parser = new JavaParser(tokens);
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
ParseTree tree = parser.compilationUnit();

Thanks.

Regards,

Leo


Terence Parr

Jan 19, 2013, 5:25:38 PM

to antlr-di...@googlegroups.com

Weird! You are doing the right thing:

parser.getInterpreter().setPredictionMode(PredictionMode.SLL);

Take a look at my test rig for the JavaLR.g4 grammar, which should be the
same as the one in the examples dir on GitHub:

https://github.com/antlr/antlr4/blob/master/tool/playground/TestJavaLR.java

Actually we can just use grun:

~/antlr/code/antlr4/tool/playground $ time grun JavaLR compilationUnit JavaParser.java

real 0m16.607s
user 0m22.172s
sys 0m0.603s

~/antlr/code/antlr4/tool/playground $ time grun JavaLR compilationUnit -SLL JavaParser.java

real 0m1.715s
user 0m4.349s
sys 0m0.168s

Note that I am parsing the JavaParser file, not the JavaLRParser file,
because it has some really long expressions which trigger the full
ALL(*) algorithm.


Sam Harwell

Jan 19, 2013, 5:55:03 PM

to antlr-di...@googlegroups.com

There is a unit test “TestPerformance” which is highly configurable and designed for use with the Java parser. You can test with the LR (new style) or standard (old style) Java grammars, and have many options to control the behavior. I recommend the following initial configuration to gauge the performance on your input files:

· Set the JDK_SOURCE_ROOT environment variable or Java property (passed to the VM with the -D flag) to the root folder containing the source files you’d like to parse.

· Set the TOP_PACKAGE field in TestPerformance to the top-level package you’d like to parse. If this is the empty string, then all packages under the JDK_SOURCE_ROOT (recursively) will be parsed for the test.

· Set TestPerformance.COMPUTE_CHECKSUM to false.

The test loads all source files into memory so variations in drive speed do not affect the test results. You also have easy access to:

· Multiple parsing strategies, including SLL/LL and the hybrid two-stage parsing technique

· Single- and multi-threaded parsing evaluation

· Statistics concerning DFA size

· The ability to compensate results for JIT overhead during startup and normalize for GC overhead

· The ability to independently evaluate overhead of constructing a parse tree on the overall process

· The ability to independently evaluate overhead of running a (blank) listener over the resulting parse tree

· Control over the DFA caches (preserve throughout the test, flush after each pass, or flush after every file)
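
The two measurement ideas above that matter most for the numbers in this thread, keeping sources in memory and discounting JIT warm-up, can be sketched in a few lines. This is not the TestPerformance code, only a rough illustration: `ParseBenchmark`, `bestPassMillis` and the `parseOne` callback are hypothetical names, and the callback stands in for whatever lexing and parsing you want to time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class ParseBenchmark {

    /**
     * Runs parseOne over every in-memory source. Warm-up passes are executed
     * but not timed, so JIT compilation does not inflate the result.
     * Returns the fastest timed pass in milliseconds.
     */
    static long bestPassMillis(List<String> sources, Consumer<String> parseOne,
                               int warmupPasses, int timedPasses) {
        for (int i = 0; i < warmupPasses; i++) {
            sources.forEach(parseOne);
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < timedPasses; i++) {
            long start = System.nanoTime();
            sources.forEach(parseOne);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
            best = Math.min(best, elapsedMillis);
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical input: a real run would first read every file into memory,
        // so that disk speed is excluded from the timed section.
        List<String> sources = Arrays.asList("class A {}", "class B { int x; }");
        // Stand-in for "lex and parse this source text".
        Consumer<String> parseOne = src -> src.split("\\s+");

        System.out.println("best pass: " + bestPassMillis(sources, parseOne, 3, 5) + " ms");
    }
}
```

Running the same harness once with the SLL setup and once with the LL setup as `parseOne` gives a like-for-like comparison of the two modes.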

https://github.com/antlr/antlr4/blob/master/tool/test/org/antlr/v4/test/TestPerformance.java

The grammars used for this test are “Java-LR.g4” and “Java.g4” in this folder:

https://github.com/antlr/antlr4/tree/master/tool/test/org/antlr/v4/test


Leo Antoli

Jan 20, 2013, 6:28:24 PM

to antlr-discussion

Ter / Sam,

Sorry, I was setting SLL in the wrong place, so it was not applied to all the files.

SLL turns out to be about 10x faster than LL: for 1,000 Java source files it's about 3 secs compared to about 30 secs.

Is it OK to use this Java grammar: https://raw.github.com/antlr/grammars-v4/master/java/Java.g4 ?

Or is it better to use the test grammars “Java-LR.g4” and “Java.g4” in this folder: https://github.com/antlr/antlr4/tree/master/tool/test/org/antlr/v4/test

Or some other one?

I was also wondering about an estimated date for ANTLRWorks for ANTLR 4 (for instance, ANTLRWorks 1.5 can be used with the Java.g4 from the antlr/grammars-v4 GitHub repository, provided that the hidden channel and right associations are removed from the grammar).

Thanks a lot for your help. And thanks for ANTLR, I hadn't used it before and I'm loving it.

Regards,

Leo
