Performance for NLP Application


Adam

Dec 31, 2011, 9:14:58 PM
to lepl
Hi,

I've been experimenting with LEPL as an alternative to NLTK's CFG
parsing library for a natural language processing application I'm
working with and I'm pretty pleased with the results. I know NLP isn't
exactly the ideal use case for LEPL, but with a few tricks I'm getting
better performance (after compilation) from LEPL than from NLTK. I'm
parsing a (very) limited subset of English and for my use case I'm
getting better performance than I am from NLTK, and I like the
syntax and visibility much better (NLTK gives you zero of the latter).

There are two problems I wanted to get an expert opinion on, though
neither are deal breakers. First for some context: I have a grammar
which I assume is pretty large by the average LEPL user's standards:
about 6000 leaves and 200 non-leaf productions. My initial naive
conversion from my NLTK grammar to LEPL wouldn't even complete in the
~ 15 mins I gave it to parse a simple sentence. It was much faster
when I cut the leaves down to a small fraction of the original 6000,
so I determined that the number of leaf nodes was the problem (I ruled
out ambiguity as well).

The "trick" I alluded to above that worked well was to take all non-
ambiguous tokens (which was most of them) and group them together into
a single regexp. For example:

VerbToken -> <Token(?:wash|dry|run|jump)>
AdjectiveToken -> <Token(?:nice|happy|green)>

I left ambiguous tokens as their own tokens, eg.

CleanToken -> <Token(clean)>

Then one level up, I have the productions:

Verb -> VerbToken | CleanToken
Adjective -> AdjectiveToken | CleanToken
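
In actual LEPL code it looks roughly like this (a sketch only - the word
lists are illustrative stand-ins for the real vocabulary and the names are
simplified):

    from lepl import Token

    # unambiguous words share one token per part of speech...
    verb_token = Token(r'(?:wash|dry|run|jump)')
    adjective_token = Token(r'(?:nice|happy|green)')

    # ...while ambiguous words keep their own token
    clean_token = Token(r'clean')

    # one level up, the productions union the shared token with the
    # ambiguous ones
    verb = verb_token | clean_token
    adjective = adjective_token | clean_token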

Like I said, this worked well, however I am open to other suggestions.
My questions are:

(1) More importantly, my parser takes up about 900mb of memory. It
claims this memory when the parser is first compiled (when it is first
asked to parse something), and hangs on to it for the lifetime of the
running instance. Subsequent parses don't seem to increase memory
usage significantly. Any idea why it's using so much memory?

(2) The compilation step is still pretty slow, it takes about a
minute. This is fine for production use for my application since I
expect each compilation to be amortized by thousands of parses over
the lifecycle of the app, but it is annoying for development,
especially when I am working on something that has nothing to do with
the parser and need to restart my python instance. Any suggestions for
this? I'm open to hacks for this one, since it's just a development
convenience. I briefly tried pickling the parser to disk so I could
recycle it when it was unchanged but, unsurprisingly, didn't get too
far with this.
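
The pickling attempt was basically just this sketch (filename and variable
names hypothetical); it presumably fails because the compiled parser is
built out of closures and generators, which pickle won't handle:

    import pickle

    # try to cache the compiled parser between runs of the app
    with open('parser.pickle', 'wb') as cache:
        pickle.dump(matcher.get_parse(), cache)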

I'm really impressed with LEPL so far even though I'm pushing its
limits. Solutions to either of these questions would make it even
better.

Thanks!
Adam

andrew cooke

Jan 1, 2012, 7:17:30 AM
to le...@googlegroups.com
On Sat, Dec 31, 2011 at 06:14:58PM -0800, Adam wrote:
> Hi,
>
> I've been experimenting with LEPL as an alternative to NLTK's CFG
> parsing library for a natural language processing application I'm
> working with and I'm pretty pleased with the results. I know NLP isn't
> exactly the ideal use case for LEPL, but with a few tricks I'm getting
> better performance (after compilation) from LEPL than from NLTK. I'm
> parsing a (very) limited subset of English and for my use case I'm
> getting better performance than I am from NLTK, and I like the
> syntax and visibility much better (NLTK gives you zero of the latter).

hi,

thanks. can you explain what you mean by visibility? (i haven't used nltk)

[more below]

> There are two problems I wanted to get an expert opinion on, though
> neither are deal breakers. First for some context: I have a grammar
> which I assume is pretty large by the average LEPL user's standards:
> about 6000 leaves and 200 non-leaf productions. My initial naive
> conversion from my NLTK grammar to LEPL wouldn't even complete in the
> ~ 15 mins I gave it to parse a simple sentence. It was much faster
> when I cut the leaves down to a small fraction of the original 6000,
> so I determined that the number of leaf nodes was the problem (I ruled
> out ambiguity as well).

the tokenizer is a dfa written in python. i would have expected the two
approaches you describe to be pretty much equivalent, since the system
collects all the tokens and makes a regexp from them (so what you are doing is
reproducing by hand what the system does, roughly - the main difference
being what labels are associated with what matches).

in other words, it sounds like something very strange is happening - probably
some kind of bug (in the sense of not being implemented correctly).

this part of lepl (the regexp support) is being completely rewritten at the
moment. this is a very long process (it's complex and i keep getting
sidetracked by other projects), but it would be good to have what you are
doing as a test for the new code.  would it be possible to provide me with a
slow and fast version of the code and some input data? preferably with
confirmation that you have read and agree with the certificate of origin
described at http://acooke.org/lepl/licence.html#contributions (so that i can
include the code in lepl itself as part of the test suite)?

hmmm. actually, thinking some more, the dfa would need to preserve separate
state for the different matches in the slow case. so i guess it is
understandable. it would be a really interesting test for the new code...

[more below]

> The "trick" I alluded to above that worked well was to take all non-
> ambiguous tokens (which was most of them) and group them together into
> a single regexp. For example:
>
> VerbToken -> <Token(?:wash|dry|run|jump)>
> AdjectiveToken -> <Token(?:nice|happy|green)>
>
> I left ambiguous tokens as their own tokens, eg.
>
> CleanToken -> <Token(clean)>
>
> Then one level up, I have the productions:
>
> Verb -> VerbToken | CleanToken
> Adjective -> AdjectiveToken | CleanToken
>
> Like I said, this worked well, however I am open to other suggestions.
> My questions are:
>
> (1) More importantly, my parser takes up about 900mb of memory. It
> claims this memory when the parser is first compiled (when it is first
> asked to parse something), and hangs on to it for the lifetime of the
> running instance. Subsequent parses don't seem to increase memory
> usage significantly. Any idea why it's using so much memory?

i haven't looked at this kind of memory use (i have worried more about the use
of memory when processing data). i imagine it could be improved, but it would
need profiling and that kind of work will need to wait for the new changes
(which are taking forever to write - i hope to have something out later this
year).

> (2) The compilation step is still pretty slow, it takes about a
> minute. This is fine for production use for my application since I
> expect each compilation to be amortized by thousands of parses over
> the lifecycle of the app, but it is annoying for development,
> especially when I am working on something that has nothing to do with
> the parser and need to restart my python instance. Any suggestions for
> this? I'm open to hacks for this one, since it's just a development
> convenience. I briefly tried pickling the parser to disk so I could
> recycle it when it was unchanged but, unsurprisingly, didn't get too
> far with this.

well, you can disable various things in the config. see all the
no_... options at http://acooke.org/lepl/advanced.html#configuration

but the bigger problem there is that you are using tokens. they require a
fair amount of processing for compilation (you can't disable that part or
things won't work).

so if you don't need tokens when playing around you can use .clear() which
should reduce compilation time considerably.

for a first attempt, try matcher.config.clear().lexer() which should give the
minimal time needed with tokens.
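
something like this, for example (just a sketch - 'matcher' stands for
whatever your top-level grammar matcher is):

    # fastest compile, but tokens won't work:
    matcher.config.clear()

    # minimal config that still supports Token():
    matcher.config.clear().lexer()

    # compilation happens here (or on the first call to parse)
    parser = matcher.get_parse()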

> I'm really impressed with LEPL so far even though I'm pushing its
> limits. Solutions to either of these questions would make it even
> better.

glad you like it and apologies to you (and everyone on the list) that
development these days is so slow.

andrew

>
> Thanks!
> Adam
>

Adam

Jan 1, 2012, 3:50:38 PM
to lepl


On Jan 1, 4:17 am, andrew cooke <and...@acooke.org> wrote:
> On Sat, Dec 31, 2011 at 06:14:58PM -0800, Adam wrote:
> > Hi,
>
> > I've been experimenting with LEPL as an alternative to NLTK's CFG
> > parsing library for a natural language processing application I'm
> > working with and I'm pretty pleased with the results. I know NLP isn't
> > exactly the ideal use case for LEPL, but with a few tricks I'm getting
> > better performance (after compilation) from LEPL than from NLTK. I'm
> > parsing a (very) limited subset of English and for my use case I'm
> > getting better performance than I am from NLTK, and I like the
> > syntax and visibility much better (NLTK gives you zero of the latter).
>
> hi,
>
> thanks.  can you explain what you mean by visibility?  (i haven't used nltk)

Two things in particular: LEPL helped me find a couple of hidden instances
of left recursion. Granted, they were still tricky to track down using LEPL,
but at least LEPL told me that that was the problem; NLTK just froze my
computer. The other visibility feature I've used in LEPL is the ability
to find partial and multiple matches. Perhaps not intended for
debugging, but that's how it's been useful for me. With NLTK you give
the parser a string that represents your grammar and the string you want
to match, and you get back either a single tree or an exception (or
ridiculous swapping).
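
For example, something along these lines (a rough sketch; 'sentence' stands
for my top-level matcher and the input is made up):

    # every complete parse, which makes ambiguity visible rather than hidden
    for tree in sentence.parse_all("the green apples run"):
        print(tree)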


> > There are two problems I wanted to get an expert opinion on, though
> > neither are deal breakers. First for some context: I have a grammar
> > which I assume is pretty large by the average LEPL user's standards:
> > about 6000 leaves and 200 non-leaf productions. My initial naive
> > conversion from my NLTK grammar to LEPL wouldn't even complete in the
> > ~ 15 mins I gave it to parse a simple sentence. It was much faster
> > when I cut the leaves down to a small fraction of the original 6000,
> > so I determined that the number of leaf nodes was the problem (I ruled
> > out ambiguity as well).
>
> the tokenizer is a dfa written in python.  i would have expected the two
> approaches you describe to be pretty much equivalent, since the system
> collects all the tokens and makes a regexp from them (so what you are doing is
> reproducing by hand what the system does, roughly - the main difference
> being what labels are associated with what matches).
>
> in other words, it sounds like something very strange is happening - probably
> some kind of bug (in the sense of not being implemented correctly).
>
> this part of lepl (the regexp support) is being completely rewritten at the
> moment.  this is a very long process (it's complex and i keep getting
> sidetracked by other projects), but it would be good to have what you are
> doing as a test for the new code.  would it be possible to provide me with a
> slow and fast version of the code and some input data?  preferably with
> confirmation that you have read and agree with the certificate of origin
> described at http://acooke.org/lepl/licence.html#contributions (so that i can
> include the code in lepl itself as part of the test suite)?

It would be pretty difficult, unfortunately. The parser is a small cog in
a larger system that isn't very well isolated. I think any grammar with
thousands of leaf nodes would have this problem, though, so I could try to
come up with a simpler example.

As for profiling the memory use in the new version, I look forward to that.
900mb seems pretty excessive.


> > (2) The compilation step is still pretty slow, it takes about a
> > minute. This is fine for production use for my application since I
> > expect each compilation to be amortized by thousands of parses over
> > the lifecycle of the app, but it is annoying for development,
> > especially when I am working on something that has nothing to do with
> > the parser and need to restart my python instance. Any suggestions for
> > this? I'm open to hacks for this one, since it's just a development
> > convenience. I briefly tried pickling the parser to disk so I could
> > recycle it when it was unchanged but, unsurprisingly, didn't get too
> > far with this.
>
> well, you can disable various things in the config.  see all the
> no_... options at http://acooke.org/lepl/advanced.html#configuration
> but the bigger problem there is that you are using tokens.  they require a
> fair amount of processing for compilation (you can't disable that part or
> things won't work).
>
> so if you don't need tokens when playing around you can use .clear() which
> should reduce compilation time considerably.
>
> for a first attempt, try matcher.config.clear().lexer() which should give the
> minimal time needed with tokens.

Oh, the reason I'm using the tokenizer is a different problem that you may
or may not be aware of. When I used the Regexp() matcher to try to mirror
the above trick without the tokenizer, I got an error saying something
about the regular expression being too long. It was a while ago; I can
reproduce it if you want the exact exception.

Thanks again!

Adam

andrew cooke

Jan 1, 2012, 5:16:09 PM
to le...@googlegroups.com

OK, no problem - I can reconstruct examples from your description.

The problem with a too-long regexp sounds like it's coming from Python's
re package (the Regexp() matcher). If you use NfaRegexp() or DfaRegexp()
you will use the same code used in the tokenizer and shouldn't hit that
problem.
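
to be concrete (both snippets are sketches only): the limit is in python's
re itself - a single alternation over thousands of words can exceed its
compiled code size on some builds:

    import re

    # stand-in vocabulary joined into one big alternation
    pattern = '|'.join('word%d' % i for i in range(50000))
    try:
        re.compile(pattern)
        print('compiled fine on this build')
    except OverflowError as error:
        print(error)  # "regular expression code size limit exceeded"

whereas NfaRegexp()/DfaRegexp() use lepl's own engine, so they should be a
drop-in replacement for Regexp() here (word list illustrative):

    from lepl import DfaRegexp, NfaRegexp

    verb = DfaRegexp('wash|dry|run|jump')
    adjective = NfaRegexp('nice|happy|green')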

Andrew

Adam Berlinsky-Schine

Jan 11, 2012, 10:20:53 PM
to le...@googlegroups.com
I got around to trying this, and I do get the size exception with
NfaRegexp/DfaRegexp. Here it is:

File "../crawler/test/parser_test.py", line 293, in test_parsable_sentences
gp.parse("apples")
File "/Users/atom/workspace/crawler/lib/grammar_parser.py", line 464, in parse
result_list = self.parser.parse(new_sentence)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/config.py",
line 858, in parse
return self.get_parse()(input_, **kargs)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py",
line 257, in single
return next(raw(arg, **kargs))[0]
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py",
line 146, in trampoline
value = next(value.generator)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/support.py",
line 416, in _match
for results in self._cached_matcher(self, stream_in):
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/combine.py",
line 368, in match
for result in matcher._untagged_match(stream_in):
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/memo.py",
line 124, in _untagged_match
result = next(descriptor[2].generator)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/support.py",
line 440, in _match
results = self._cached_matcher(self, stream_in)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/support.py",
line 301, in _cached_matcher
self.__cached_matcher = self.factory(*args, **kargs)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/core.py",
line 194, in Regexp
pattern = compile_(pattern)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py",
line 190, in compile
return _compile(pattern, flags)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py",
line 243, in _compile
p = sre_compile.compile(pattern, flags)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_compile.py",
line 523, in compile
groupindex, indexgroup
OverflowError: regular expression code size limit exceeded

----------------------------------------------------------------------
Ran 1 test in 60.485s

FAILED (errors=1)

andrew cooke

Jan 12, 2012, 6:03:43 AM
to le...@googlegroups.com

Can you provide the code? That error is still coming from a Regexp (see the
stack trace). Thanks, Andrew

Adam Berlinsky-Schine

Jan 12, 2012, 11:59:32 AM
to le...@googlegroups.com
My bad - I tried again and it does get rid of the regex size exceeded
error; I must have missed a case yesterday. But switching to
DfaRegexp/NfaRegexp (or getting rid of tokens) makes my parser very
slow. It's all right though; my bigger complaint was the memory usage,
and I'd rather let you focus on getting the next version out, with the
memory profiling work :)

Adam

andrew cooke

Jan 12, 2012, 12:24:20 PM
to le...@googlegroups.com

yeah, that doesn't surprise me (the slow part). the hope is that the next
release will have faster regexps, at least under pypy. we'll see...