I have a rather large input to Neotoma (around 850 lines) and it produces:
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 1271244 bytes of memory (of type "heap").
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
The problem occurs when using Cygwin; on Linux I have no issues apart from quite long run times, so I am not blocked from working.
I then tried another input with a few more lines that actually parses, which leads me to think that the complexity of the input, rather than its size, is the more likely root cause of this problem - especially since the failing input is one of the most complicated sources of all.
Since the input is a proprietary artefact I cannot give out information on the grammar or the input, but I would like to hear if you have ideas that could speed up the performance of the produced parsers, since I think there could be something to gain by structuring the clauses of the rules in an optimal way or by preferring specific constructs over others.
Cheers,
Torben
--
http://www.linkedin.com/in/torbenhoffmann
Sean
On Tuesday, January 5, 2010, Torben Hoffmann
Anyway, to the point -- tonight I discovered a major memory hog in
neotoma, related to memoization. For a while I have thought it was
building the AST and having lots of copies of it that caused the memory
leak, but except for very large inputs that is likely not the case. The
real problem is that with every memoization of a result, I'm also
memoizing the input. Because the input is a list, you're getting at
most NumOfRules*InputLength copies in various sizes, all memoized in the
ETS table. Talk about waste!
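To make the waste concrete, here is a minimal sketch of a packrat-style memo table (module, function and key names are hypothetical, not Neotoma's actual internals). The point is in the stored result: each entry carries the *remaining input*, so with a list input every entry copies the tail of the list into the ETS table, while with a binary a sub-binary is only a small reference into the shared original.

```erlang
-module(memo_sketch).
-export([demo/0]).

%% Look up a cached result for {Rule, Index}; on a miss, run Fun and
%% store its result (which includes the remaining input) in the table.
memo(Table, Rule, Index, Fun) ->
    case ets:lookup(Table, {Rule, Index}) of
        [{_, Result}] ->
            Result;                                  % cache hit
        [] ->
            Result = Fun(),
            true = ets:insert(Table, {{Rule, Index}, Result}),
            Result
    end.

demo() ->
    T = ets:new(memo, [set]),
    Input = <<"abcdef">>,
    %% With a binary input the "rest" stored in the table is a cheap
    %% sub-binary; with a list it would be a copied tail.
    Parse = fun() -> {ok, $a, binary:part(Input, 1, byte_size(Input) - 1)} end,
    R1 = memo(T, char, 0, Parse),
    %% Second call must be served from the cache, never re-parsed:
    R1 = memo(T, char, 0, fun() -> exit(should_not_run) end),
    ets:delete(T),
    R1.
```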
The simple solution when dealing with large input from an outside source
is to use a binary, which will be saved into a separate heap in the VM
and shared when possible. Subsections of the binary are simple
pointer-like structures internally, resulting in many fewer copies.
This should reduce memory usage dramatically in many cases. A nice
side-effect of switching to a binary as the internal input-stream
representation is that I can tackle another thing that's been on my list
- unicode support - with very little extra effort.
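The sharing behaviour described above can be seen directly in the shell; the following standalone sketch (nothing here is Neotoma code) matches the head off a 64 KB binary and gets back a sub-binary that references the original data rather than copying it, whereas the equivalent list costs two words per character on the process heap.

```erlang
-module(binary_sketch).
-export([demo/0]).

demo() ->
    Input = binary:copy(<<"x">>, 1 bsl 16),   % 64 KB refc binary, off-heap
    <<_First:8, Rest/binary>> = Input,        % sub-binary: no data copied
    true = byte_size(Rest) =:= (1 bsl 16) - 1,
    %% The same content as a list lives entirely on the heap,
    %% one cons cell (two words) per character:
    List = binary_to_list(Input),
    true = length(List) =:= 1 bsl 16,
    ok.
```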
It'll take a few hours of work to get this fix in, but look for it very
soon (or figure it out yourself and send me a patch). Only about 5
internal parsing functions will be affected, but your transformation
code will need to expect binaries instead of lists for string and
charclass matches.
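For transformation code, the change would look something like the following (a hypothetical transform clause, not from any real grammar): a clause that used to receive the matched text as a list now receives a binary and must match accordingly.

```erlang
-module(transform_sketch).
-export([number/1]).

%% Before the change a transform clause might have matched a list:
%%   number(Digits) when is_list(Digits) -> list_to_integer(Digits).
%% After the change the matched text arrives as a binary:
number(Digits) when is_binary(Digits) ->
    list_to_integer(binary_to_list(Digits)).
```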
Cheers,
Sean
>> and do not give back. Best to use lookahead (!,&) when possible
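A classic illustration of that advice, in generic PEG notation (this is a standard idiom, not taken from the grammar under discussion): negative lookahead (`!`) peeks at the next character without consuming it, so the repetition stops cleanly at the newline instead of committing past it.

```
% consume a line comment without overshooting the newline:
comment <- "%" (!"\n" .)* "\n"
```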