Accumulating ATNConfig objects during a large batch parsing run

193 views

timhilco

unread,
Jun 27, 2013, 2:53:21 PM6/27/13
to antlr-di...@googlegroups.com
I'm trying to lex and parse a large number of source files (~30,000 COBOL source files). I've built a Java program that looks at a directory and parses each of the source files. After processing a large number of files, I'm getting an OutOfMemoryError (with a heap dump).

I've created some heap dumps and run a profiler against the code, and it seems to be accumulating a large number of 'ATNConfig' objects that don't get garbage collected after I parse a single file. I've tried lots of things to remove the references to these objects (nulling the Lexer and Parser between files, etc.), but nothing seems to work.

I'm not an expert, so I'm not sure I know what I'm talking about, but things seem to point to the ParserATNSimulator class and its addDFAEdge method. The 'configs' field in DFAState seems to be holding an ATNConfigSet which contains the ATNConfig objects.

I would like to get rid of these objects between parsing each file, so I don't run out of memory.

Looking for solutions to avoid the out-of-memory condition.

Tim Hilgenberg

Terence Parr

unread,
Jun 27, 2013, 3:03:05 PM6/27/13
to antlr-di...@googlegroups.com
for lots of files i give it -Xmx2G. what have you tried?
T
> --
> You received this message because you are subscribed to the Google
> Groups "antlr-discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to antlr-discussi...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Sam Harwell

unread,
Jun 27, 2013, 3:07:51 PM6/27/13
to antlr-di...@googlegroups.com

The DFA required to parse your actual input is exceeding the memory available to the JVM. Here are some potential solutions to this situation.

1. Refactoring your grammar to reduce lookahead for decisions with long lookahead will reduce the number of DFA states created, which will in turn reduce the memory required to hold the constructed DFA. Both of these characteristics will improve your parser's performance.

2. You can manually clear the DFA between files to prevent accumulation of states in the DFA. This will obliterate parser performance – it could easily cause your parser to take 100x as long to parse the same input. At a minimum, I recommend that you clear the DFA only after many files have been parsed, e.g. clear the DFA after every 1000 files.
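Option 2 can be sketched as a plain-Java batch loop. This is a minimal, self-contained illustration, not ANTLR code: `shouldClearDfa` and the 1000-file interval are hypothetical choices, the parse call is elided, and only the commented-out `Arrays.fill` line reflects the actual ANTLR 4.0 clearing call.

```java
public class BatchParseSketch {
    // Hypothetical helper: clear the shared DFA cache after every `interval` files.
    static boolean shouldClearDfa(int filesParsed, int interval) {
        return filesParsed > 0 && filesParsed % interval == 0;
    }

    public static void main(String[] args) {
        int clears = 0;
        for (int i = 1; i <= 3000; i++) {
            // parse file i here; each parse grows the shared DFA cache
            if (shouldClearDfa(i, 1000)) {
                // With ANTLR 4.0 the actual clearing call would be:
                // Arrays.fill(parser.getInterpreter().decisionToDFA, null);
                clears++;
            }
        }
        System.out.println(clears); // clears at files 1000, 2000, 3000 -> prints 3
    }
}
```

The trade-off is exactly the one described above: each clear throws away warmed-up lookahead caches, so the next files parse slowly until the DFA re-adapts.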

We are working on some additional solutions which are not yet available in the public distribution.

1. In an experimental branch, I changed the way ATN configurations are stored in the ATNConfigSet. This change caused a 22-fold reduction in the number of ATNConfig instances stored in the DFA. In the reference code, we were averaging 43.81 ATNConfig instances per ATNConfigSet. In the experimental branch, the number of ATNConfigSet instances remained the same, but each had an average of only 1.98 ATNConfig instances.

2. In another experimental branch, additional optimizations are performed during the construction of the DFA, which has been shown to reduce the number of DFA states created by 4x or more, with a matching reduction in both the number of ATNConfigSet instances and ATNConfig instances.

3. In another experimental branch, an automated grammar transformation pass during parser generation makes it possible to reduce lookahead for certain costly decisions without changing the grammar or the resulting parse tree. Careful application of this feature to a particular grammar resulted in a 1000:1 reduction in the number of DFA states stored in the DFA for that grammar. This feature outperformed item #1 in the previous list, without requiring the grammar or parse tree to change.

Thank you,

--

Sam Harwell

Owner, Lead Developer

http://tunnelvisionlabs.com

Sam Harwell

unread,
Jun 27, 2013, 3:21:49 PM6/27/13
to antlr-di...@googlegroups.com

I forgot to mention how to manually clear the DFA.

In ANTLR 4.0:

Arrays.fill(parser.getInterpreter().decisionToDFA, null);

In ANTLR 4.1 (coming very soon):

ATN atn = parser.getATN();
for (int i = 0; i < parser.getInterpreter().decisionToDFA.length; i++) {
    parser.getInterpreter().decisionToDFA[i] = new DFA(atn.getDecisionState(i), i);
}
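The behavioral difference between the two snippets can be modeled without ANTLR on the classpath. In this sketch, FakeDFA is a hypothetical stand-in for org.antlr.v4.runtime.dfa.DFA: the 4.1-style reset replaces each array slot with a fresh, empty DFA for its decision, rather than nulling the slot and letting the simulator rebuild it lazily as in 4.0.

```java
public class DfaResetSketch {
    // Hypothetical stand-in for ANTLR's DFA: one growing state cache per decision.
    static class FakeDFA {
        final int decision;
        int cachedStates; // defaults to 0, i.e. an empty cache
        FakeDFA(int decision) { this.decision = decision; }
    }

    // 4.1-style reset: every slot gets a fresh, empty DFA for its decision,
    // so the array itself never holds null entries.
    static void resetAll(FakeDFA[] decisionToDFA) {
        for (int i = 0; i < decisionToDFA.length; i++) {
            decisionToDFA[i] = new FakeDFA(i);
        }
    }

    public static void main(String[] args) {
        FakeDFA[] decisionToDFA = new FakeDFA[3];
        resetAll(decisionToDFA);
        // simulate cache growth while parsing a batch of files
        for (FakeDFA d : decisionToDFA) d.cachedStates = 50_000;
        resetAll(decisionToDFA);
        System.out.println(decisionToDFA[0].cachedStates); // prints 0
    }
}
```

Either way, the old per-decision caches become unreachable and can be garbage collected; the memory drop is the same.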

 

--

Sam Harwell

Owner, Lead Developer

http://tunnelvisionlabs.com

 

Terence Parr

unread,
Jun 27, 2013, 3:29:59 PM6/27/13
to antlr-di...@googlegroups.com
you can do this wiping "every once in a while" and speed will pop back
up within 20 files as it adapts to new input.

Ter


Loring Craymer

unread,
Jun 27, 2013, 7:07:05 PM6/27/13
to antlr-di...@googlegroups.com
And to add to what Ter and Sam are saying: you don't want to get rid of those objects; they would just be recreated, and the recreation would slow down the parse. These objects relate to the COBOL grammar and the recognition engine, not to the files being parsed.
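This is why nulling the Lexer and Parser between files doesn't help: as far as I can tell, the generated Java parsers hold the DFA array (_decisionToDFA) in a static field, so the cache belongs to the grammar class, not to any one parser instance. A toy model of that ownership (FakeParser is a hypothetical stand-in, not ANTLR code):

```java
import java.util.ArrayList;
import java.util.List;

public class SharedDfaSketch {
    // Hypothetical stand-in for a generated parser: the DFA cache is a static
    // field, i.e. owned by the class (the grammar), shared by all instances.
    static class FakeParser {
        static final List<String> DFA_CACHE = new ArrayList<>();
        void parse(String file) { DFA_CACHE.add("states learned from " + file); }
    }

    public static void main(String[] args) {
        FakeParser p = new FakeParser();
        p.parse("a.cbl");
        p = null;            // dropping the parser instance...
        System.gc();         // ...and even hinting a GC...
        FakeParser q = new FakeParser();
        // ...does not shrink the cache: a fresh instance still sees it.
        System.out.println(FakeParser.DFA_CACHE.size()); // prints 1
    }
}
```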

--Loring

timhilco

unread,
Jun 28, 2013, 11:30:30 AM6/28/13
to antlr-di...@googlegroups.com
Your suggestions fixed the problem:

1. I increased the memory to 2G
2. Manually cleared the DFA after processing a large number of files

Through the profiler, I can see the number of ATNConfig objects growing and then dropping once I clear the DFA.

Thanks for the help. COBOL isn't a very easy language to parse, and we've added O-O extensions to the language, which makes it even harder to deal with.