hi,
thanks. can you explain what you mean by visibility? (i haven't used nltk)
[more below]
> There are two problems I wanted to get an expert opinion on, though
> neither are deal breakers. First for some context: I have a grammar
> which I assume is pretty large by the average LEPL user's standards:
> about 6000 leaves and 200 non-leaf productions. My initial naive
> conversion from my NLTK grammar to LEPL wouldn't even complete in the
> ~ 15 mins I gave it to parse a simple sentence. It was much faster
> when I cut the leaves down to a small fraction of the original 6000,
> so I determined that the number of leaf nodes was the problem (I ruled
> out ambiguity as well).
the tokenizer is a dfa written in python. i would have expected the two
approaches you describe to be pretty much equivalent, since the system
collects all the tokens and makes a regexp from them (so what you are doing is
reproducing by hand what the system does, roughly - the main difference
being what labels are associated with what matches).
in other words, it sounds like something very strange is happening - probably
some kind of bug (in the sense of not being implemented correctly).
this part of lepl (the regexp support) is being completely rewritten at the
moment. this is a very long process (it's complex and i keep getting
sidetracked by other projects), but it would be good to have what you are
doing as a test for the new code. would it be possible to provide me with a
slow and fast version of the code and some input data? preferably with
confirmation that you have read and agree with the certificate of origin
described at http://acooke.org/lepl/licence.html#contributions (so that i can
include the code in lepl itself as part of the test suite)?
hmmm. actually, thinking some more, the dfa would need to preserve separate
state for the different matches in the slow case. so i guess it is
understandable. it would be a really interesting test for the new code...
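for example (an untested sketch - the words and the verb_slow/verb_fast names
below are just mine for illustration, not your actual grammar), the two shapes
i have in mind are:

    from lepl import Token, Or

    # slow: one Token per leaf word (imagine thousands of these)
    verb_slow = Or(*[Token(w) for w in ['wash', 'dry', 'run', 'jump']])

    # fast: the unambiguous words folded into a single regexp token,
    # with ambiguous words like 'clean' kept as their own token
    verb_token = Token('(?:wash|dry|run|jump)')
    clean_token = Token('clean')
    verb_fast = verb_token | clean_token

in principle the lexer ends up building much the same combined regexp either
way, which is why the speed difference surprised me at first.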
[more below]
> The "trick" I alluded to above that worked well was to take all non-
> ambiguous tokens (which was most of them) and group them together into
> a single regexp. For example:
>
> VerbToken -> <Token(?:wash|dry|run|jump)>
> AdjectiveToken -> <Token(?:nice|happy|green)>
>
> I left ambiguous tokens as their own tokens, eg.
>
> CleanToken -> <Token(clean)>
>
> Then one level up, I have the productions:
>
> Verb -> VerbToken | CleanToken
> Adjective -> AdjectiveToken | CleanToken
>
> Like I said, this worked well, however I am open to other suggestions.
> My questions are:
>
> (1) More importantly, my parser takes up about 900mb of memory. It
> claims this memory when the parser is first compiled (when it is first
> asked to parse something), and hangs on to it for the lifetime of the
> running instance. Subsequent parses don't seem to increase memory
> usage significantly. Any idea why it's using so much memory?
i haven't looked at this kind of memory use (i have worried more about the use
of memory when processing data). i imagine it could be improved, but it would
need profiling and that kind of work will need to wait for the new changes
(which are taking forever to write - i hope to have something out later this
year).
> (2) The compilation step is still pretty slow, it takes about a
> minute. This is fine for production use for my application since I
> expect each compilation to be amortized by thousands of parses over
> the lifecycle of the app, but it is annoying for development,
> especially when I am working on something that has nothing to do with
> the parser and need to restart my python instance. Any suggestions for
> this? I'm open to hacks for this one, since it's just a development
> convenience. I briefly tried pickling the parser to disk so I could
> recycle it when it was unchanged but, unsurprisingly, didn't get too
> far with this.
well, you can disable various things in the config. see all the
no_... options at http://acooke.org/lepl/advanced.html#configuration
but the bigger problem there is that you are using tokens. they require a
fair amount of processing for compilation (you can't disable that part or
things won't work).
so if you don't need tokens when playing around you can use .clear() which
should reduce compilation time considerably.
for a first attempt, try matcher.config.clear().lexer() which should give the
minimal time needed with tokens.
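something like this (untested - i'm assuming your top-level matcher is called
sentence; substitute whatever you actually have):

    # keep only the lexer rewriting that tokens need; skip the optional rewriters
    sentence.config.clear().lexer()
    parser = sentence.get_parse()   # this is the expensive compilation step
    results = parser('some test input')

and the no_... methods on config let you switch things off one at a time if you
want to see what is actually costing you the minute.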
> I'm really impressed with LEPL so far even though I'm pushing its
> limits. Solutions to either of these questions would make it even
> better.
glad you like it and apologies to you (and everyone on the list) that
development these days is so slow.
andrew
>
> Thanks!
> Adam
>
The problem with a too-long regexp sounds like it's coming from Python's
re package (the Regexp() matcher). If you use the NfaRegexp() or DfaRegexp()
matchers you will use the same code used in the tokenizer and shouldn't hit
that problem.
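For example (a rough sketch; leaf_words and verb are just stand-ins for
illustration, not your actual grammar):

    from lepl import DfaRegexp

    leaf_words = ['wash', 'dry', 'run', 'jump']   # imagine ~6000 of these
    pattern = '|'.join(leaf_words)

    # Regexp(pattern) would go through Python's re module, which is what raises
    # "OverflowError: regular expression code size limit exceeded";
    # DfaRegexp (or NfaRegexp) compiles the pattern with LEPL's own engine
    verb = DfaRegexp(pattern)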
Andrew
File "../crawler/test/parser_test.py", line 293, in test_parsable_sentences
gp.parse("apples")
File "/Users/atom/workspace/crawler/lib/grammar_parser.py", line 464, in parse
result_list = self.parser.parse(new_sentence)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/config.py",
line 858, in parse
return self.get_parse()(input_, **kargs)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py",
line 257, in single
return next(raw(arg, **kargs))[0]
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py",
line 146, in trampoline
value = next(value.generator)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/support.py",
line 416, in _match
for results in self._cached_matcher(self, stream_in):
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/combine.py",
line 368, in match
for result in matcher._untagged_match(stream_in):
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/memo.py",
line 124, in _untagged_match
result = next(descriptor[2].generator)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/support.py",
line 440, in _match
results = self._cached_matcher(self, stream_in)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/support.py",
line 301, in _cached_matcher
self.__cached_matcher = self.factory(*args, **kargs)
File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/core.py",
line 194, in Regexp
pattern = compile_(pattern)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py",
line 190, in compile
return _compile(pattern, flags)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py",
line 243, in _compile
p = sre_compile.compile(pattern, flags)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_compile.py",
line 523, in compile
groupindex, indexgroup
OverflowError: regular expression code size limit exceeded
----------------------------------------------------------------------
Ran 1 test in 60.485s
FAILED (errors=1)
Adam