I have a rather large input to Neotoma (around 850 lines) and it produces:
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 1271244 bytes of memory (of type "heap").
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
The problem occurs when using Cygwin; on Linux I have no issues apart from quite long run times, so I am not blocked from working.
I then tried another input with a few more lines that actually parses, which leads me to think that the complexity of the input, rather than its size, is the more likely root cause of this problem - especially since the failing input is one of the most complicated sources of all.
Since the input is a proprietary artefact I cannot give out information on the grammar or the input, but I would like to hear if you have ideas that could speed up the performance of the produced parsers, since I think there could be something to gain by structuring the clauses of the rules in an optimal way or by preferring specific constructs over others.
Cheers,
Torben
--
http://www.linkedin.com/in/torbenhoffmann
Sean
On Tuesday, January 5, 2010, Torben Hoffmann
Anyway, to the point -- tonight I discovered a major memory hog in
neotoma, related to memoization. For a while I have thought it was
building the AST and having lots of copies of it that caused the memory
leak, but except for very large inputs that is likely not the case. The
real problem is that with every memoization of a result, I'm also
memoizing the input. Because the input is a list, you're getting at
most NumOfRules*InputLength copies in various sizes, all memoized in the
ETS table. Talk about waste!
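To make the waste concrete, here is a minimal sketch of a packrat-style memo table (module, function and key names are hypothetical, not Neotoma's actual internals). The point is in the stored result: each entry carries the *remaining input*, so with a list input every entry copies the tail of the list into the ETS table, while with a binary a sub-binary is only a small reference into the shared original.

```erlang
-module(memo_sketch).
-export([demo/0]).

%% Look up a cached result for {Rule, Index}; on a miss, run Fun and
%% store its result (which includes the remaining input) in the table.
memo(Table, Rule, Index, Fun) ->
    case ets:lookup(Table, {Rule, Index}) of
        [{_, Result}] ->
            Result;                                  % cache hit
        [] ->
            Result = Fun(),
            true = ets:insert(Table, {{Rule, Index}, Result}),
            Result
    end.

demo() ->
    T = ets:new(memo, [set]),
    Input = <<"abcdef">>,
    %% With a binary input the "rest" stored in the table is a cheap
    %% sub-binary; with a list it would be a copied tail.
    Parse = fun() -> {ok, $a, binary:part(Input, 1, byte_size(Input) - 1)} end,
    R1 = memo(T, char, 0, Parse),
    %% Second call must be served from the cache, never re-parsed:
    R1 = memo(T, char, 0, fun() -> exit(should_not_run) end),
    ets:delete(T),
    R1.
```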
The simple solution when dealing with large input from an outside source
is to use a binary, which will be saved into a separate heap in the VM
and shared when possible. Subsections of the binary are simple
pointer-like structures internally, resulting in many fewer copies.
This should reduce memory usage dramatically in many cases. A nice
side-effect of switching to a binary as the internal input-stream
representation is that I can tackle another thing that's been on my list
- unicode support - with very little extra effort.
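The sharing behaviour described above can be seen directly in the shell; the following standalone sketch (nothing here is Neotoma code) matches the head off a 64 KB binary and gets back a sub-binary that references the original data rather than copying it, whereas the equivalent list costs two words per character on the process heap.

```erlang
-module(binary_sketch).
-export([demo/0]).

demo() ->
    Input = binary:copy(<<"x">>, 1 bsl 16),   % 64 KB refc binary, off-heap
    <<_First:8, Rest/binary>> = Input,        % sub-binary: no data copied
    true = byte_size(Rest) =:= (1 bsl 16) - 1,
    %% The same content as a list lives entirely on the heap,
    %% one cons cell (two words) per character:
    List = binary_to_list(Input),
    true = length(List) =:= 1 bsl 16,
    ok.
```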
It'll take a few hours of work to get this fix in, but look for it very
soon (or figure it out yourself and send me a patch). Only about 5
internal parsing functions will be affected, but your transformation
code will need to expect binaries instead of lists for string and
charclass matches.
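For transformation code, the change would look something like the following (a hypothetical transform clause, not from any real grammar): a clause that used to receive the matched text as a list now receives a binary and must match accordingly.

```erlang
-module(transform_sketch).
-export([number/1]).

%% Before the change a transform clause might have matched a list:
%%   number(Digits) when is_list(Digits) -> list_to_integer(Digits).
%% After the change the matched text arrives as a binary:
number(Digits) when is_binary(Digits) ->
    list_to_integer(binary_to_list(Digits)).
```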
Cheers,
Sean
>> and do not give back. Best to use lookahead (!,&) when possible
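A classic illustration of that advice, in generic PEG notation (this is a standard idiom, not taken from the grammar under discussion): negative lookahead (`!`) peeks at the next character without consuming it, so the repetition stops cleanly at the newline instead of committing past it.

```
% consume a line comment without overshooting the newline:
comment <- "%" (!"\n" .)* "\n"
```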