I've been experimenting with LEPL as an alternative to NLTK's CFG
parsing library for a natural language processing application I'm
working with and I'm pretty pleased with the results. I know NLP isn't
exactly the ideal use case for LEPL, but with a few tricks I'm getting
better performance (after compilation) from LEPL than from NLTK. I'm
parsing a (very) limited subset of English and for my use case I'm
getting better performance than I am from NLTK, and the I like the
syntax and visibility much better (NLTK gives you zero of the latter).
There are two problems I wanted to get an expert opinion on, though
neither are deal breakers. First for some context: I have a grammar
which I assume is pretty large by the average LEPL user's standards:
about 6000 leaves and 200 non-leaf productions. My initial naive
conversion from my NLTK grammar to LEPL wouldn't even complete in the
~ 15 mins I gave it to parse a simple sentence. It was much faster
when I cut the leaves down to a small fraction of the original 6000,
so I determined that the number of leaf nodes was the problem (I ruled
out ambiguity as well).
The "trick" I alluded to above that worked well was to take all non-
ambiguous tokens (which was most of them) and group them together into
a single regexp. For example:
Like I said, this worked well, however I am open to other suggestions.
My questions are:
(1) More importantly, my parser takes up about 900mb of memory. It
claims this memory when the parser is first compiled (when it is first
asked to parse something), and hangs on to it for the lifetime of the
running instance. Subsequence parses don't seem to increase memory
usage significantly. Any idea why it's using so much memory?
(2) The compilation step is still pretty slow, it takes about a
minute. This is fine for production use for my application since I
expect each compilation to be amortized by thousands of parses over
the lifecycle of the app, but it is annoying for development,
especially when I am working on something that has nothing to do with
the parser and need to restart my python instance. Any suggestions for
this? I'm open to hacks for this one, since it's just a development
convenience. I briefly tried pickling the parser to disk so I could
recycle it when it was unchanged but, unsurprisingly, didn't get too
far with this.
I'm really impressed with LEPL so far even though I'm pushing its
limits. Solutions to either of these questions would make it even
better.
On Sat, Dec 31, 2011 at 06:14:58PM -0800, Adam wrote: > Hi,
> I've been experimenting with LEPL as an alternative to NLTK's CFG > parsing library for a natural language processing application I'm > working with and I'm pretty pleased with the results. I know NLP isn't > exactly the ideal use case for LEPL, but with a few tricks I'm getting > better performance (after compilation) from LEPL than from NLTK. I'm > parsing a (very) limited subset of English and for my use case I'm > getting better performance than I am from NLTK, and the I like the > syntax and visibility much better (NLTK gives you zero of the latter).
hi,
thanks. can you explain what you mean by visibility? (i haven't used nltk)
[more below]
> There are two problems I wanted to get an expert opinion on, though > neither are deal breakers. First for some context: I have a grammar > which I assume is pretty large by the average LEPL user's standards: > about 6000 leaves and 200 non-leaf productions. My initial naive > conversion from my NLTK grammar to LEPL wouldn't even complete in the > ~ 15 mins I gave it to parse a simple sentence. It was much faster > when I cut the leaves down to a small fraction of the original 6000, > so I determined that the number of leaf nodes was the problem (I ruled > out ambiguity as well).
the tokenizer is a dfa written in python. i would have expected the two approaches you describe to be pretty much equivalent, since the system collects all the tokens and makes a regexp from them (so what you are doing is reproducing by hand what the system does, roughly - the main difference being what labels are associated with what matches).
in other words, it sounds like something very strange is happening - probably some kind of bug (in the sense of not being impleented correctly).
this part of lepl (the regexp support) is being completely rewritten at the moment. this is a very long process (it's complex and i keep getting sidetracked by other projects), but it would be good to have what you are doing as a test for the new code. would it be possible to provide me wth a slow and fast version of the code and some input data? preferably with confirmation that you have read and agree with the certificate of origin described at http://acooke.org/lepl/licence.html#contributions (so that i can include the code in lepl itself as part of the test suite)?
hmmm. actually, thinking some more, the dfa would need to preserve separate state for the different matches in the slow case. so i guess it is understandable. it would be a really interesting test for the new code...
> The "trick" I alluded to above that worked well was to take all non- > ambiguous tokens (which was most of them) and group them together into > a single regexp. For example:
> Like I said, this worked well, however I am open to other suggestions. > My questions are:
> (1) More importantly, my parser takes up about 900mb of memory. It > claims this memory when the parser is first compiled (when it is first > asked to parse something), and hangs on to it for the lifetime of the > running instance. Subsequence parses don't seem to increase memory > usage significantly. Any idea why it's using so much memory?
i haven't looked at this kind of memory use (i have worried more about the use of memory when processing data). i imagine it could be improved, but it would need profiling and that kind fo work will need to wait for the new changes (which are taking forever to write - i hope to have something out later this year).
> (2) The compilation step is still pretty slow, it takes about a > minute. This is fine for production use for my application since I > expect each compilation to be amortized by thousands of parses over > the lifecycle of the app, but it is annoying for development, > especially when I am working on something that has nothing to do with > the parser and need to restart my python instance. Any suggestions for > this? I'm open to hacks for this one, since it's just a development > convenience. I briefly tried pickling the parser to disk so I could > recycle it when it was unchanged but, unsurprisingly, didn't get too > far with this.
but the bigger problem there is that you are using tokens. they require a fair amount of processing for compilation (you can't disable that part or things won't work).
so if you don't need tokens when playing around you can use .clear() whoch should reduce compilation time considerably.
for a first attempt, try ,atcher.config.clear().lexer() which should give the minimal time needed with tokens.
> I'm really impressed with LEPL so far even though I'm pushing its > limits. Solutions to either of these questions would make it even > better.
glad you like it and apologies to you (and everyone on the list) that development these days is so slow.
> -- > You received this message because you are subscribed to the Google Groups "lepl" group. > To post to this group, send email to lepl@googlegroups.com. > To unsubscribe from this group, send email to lepl+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/lepl?hl=en.
On Jan 1, 4:17 am, andrew cooke <and...@acooke.org> wrote:
> On Sat, Dec 31, 2011 at 06:14:58PM -0800, Adam wrote:
> > Hi,
> > I've been experimenting with LEPL as an alternative to NLTK's CFG
> > parsing library for a natural language processing application I'm
> > working with and I'm pretty pleased with the results. I know NLP isn't
> > exactly the ideal use case for LEPL, but with a few tricks I'm getting
> > better performance (after compilation) from LEPL than from NLTK. I'm
> > parsing a (very) limited subset of English and for my use case I'm
> > getting better performance than I am from NLTK, and the I like the
> > syntax and visibility much better (NLTK gives you zero of the latter).
> hi,
> thanks. can you explain what you mean by visibility? (i haven't used nltk)
2 things in particular: LEPL helped me find a couple hidden instances
of left recursion. Granted it was still tricky to find using LEPL, but
at least LEPL told me that that was the problem; NLTK just froze my
computer. The other visibility window I've used in LEPL is the ability
to find partial and multiple matches. Perhaps not intended for
debugging, but that's how it's been useful for me. With NLTK you give
the parser a string that represents your grammar, the string you want
to match, and you get either a single tree or an exception (or
ridiculous swapping).
> > There are two problems I wanted to get an expert opinion on, though
> > neither are deal breakers. First for some context: I have a grammar
> > which I assume is pretty large by the average LEPL user's standards:
> > about 6000 leaves and 200 non-leaf productions. My initial naive
> > conversion from my NLTK grammar to LEPL wouldn't even complete in the
> > ~ 15 mins I gave it to parse a simple sentence. It was much faster
> > when I cut the leaves down to a small fraction of the original 6000,
> > so I determined that the number of leaf nodes was the problem (I ruled
> > out ambiguity as well).
> the tokenizer is a dfa written in python. i would have expected the two
> approaches you describe to be pretty much equivalent, since the system
> collects all the tokens and makes a regexp from them (so what you are doing is
> reproducing by hand what the system does, roughly - the main difference
> being what labels are associated with what matches).
> in other words, it sounds like something very strange is happening - probably
> some kind of bug (in the sense of not being impleented correctly).
> this part of lepl (the regexp support) is being completely rewritten at the
> moment. this is a very long process (it's complex and i keep getting
> sidetracked by other projects), but it would be good to have what you are
> doing as a test for the new code. would it be possible to provide me wth a
> slow and fast version of the code and some input data? preferably with
> confirmation that you have read and agree with the certificate of origin
> described athttp://acooke.org/lepl/licence.html#contributions(so that i can
> include the code in lepl itself as part of the test suite)?
It would be pretty difficult unfortunately. The parser a small cog in
a larger system that unfortunately isn't very well isolated. I think
any grammar with thousands of leaf nodes would have this problem
though, I could try to come up with a simpler example.
> hmmm. actually, thinking some more, the dfa would need to preserve separate
> state for the different matches in the slow case. so i guess it is
> understandable. it would be a really interesting test for the new code...
> [more below]
> > The "trick" I alluded to above that worked well was to take all non-
> > ambiguous tokens (which was most of them) and group them together into
> > a single regexp. For example:
> > Like I said, this worked well, however I am open to other suggestions.
> > My questions are:
> > (1) More importantly, my parser takes up about 900mb of memory. It
> > claims this memory when the parser is first compiled (when it is first
> > asked to parse something), and hangs on to it for the lifetime of the
> > running instance. Subsequence parses don't seem to increase memory
> > usage significantly. Any idea why it's using so much memory?
> i haven't looked at this kind of memory use (i have worried more about the use
> of memory when processing data). i imagine it could be improved, but it would
> need profiling and that kind fo work will need to wait for the new changes
> (which are taking forever to write - i hope to have something out later this
> year).
I look forward to that. 900mb seems pretty excessive.
> > (2) The compilation step is still pretty slow, it takes about a
> > minute. This is fine for production use for my application since I
> > expect each compilation to be amortized by thousands of parses over
> > the lifecycle of the app, but it is annoying for development,
> > especially when I am working on something that has nothing to do with
> > the parser and need to restart my python instance. Any suggestions for
> > this? I'm open to hacks for this one, since it's just a development
> > convenience. I briefly tried pickling the parser to disk so I could
> > recycle it when it was unchanged but, unsurprisingly, didn't get too
> > far with this.
> well, you can disable various things in the config. see all the
> no_... options athttp://acooke.org/lepl/advanced.html#configuration > but the bigger problem there is that you are using tokens. they require a
> fair amount of processing for compilation (you can't disable that part or
> things won't work).
> so if you don't need tokens when playing around you can use .clear() whoch
> should reduce compilation time considerably.
> for a first attempt, try ,atcher.config.clear().lexer() which should give the
> minimal time needed with tokens.
Oh, the reason I'm using the tokenizer is because of a different
problem that you may or may not be aware of. When using the Regexp()
matcher to try to mirror the above trick without using the tokenizer I
got an error saying something about the regular expression being too
long. It was a while ago, I can reproduce it if you want the exact
exception.
OK, no problem - I can reconstruct examples from your description.
The problem with a tool long regexp sounds like it's comig from Python's re package (the Regexp() matcher). If you use the NfaRegexp() or DfaRegexp() you will use the same code used in the tokenizer and shouldn't hit that problem.
On Sun, Jan 01, 2012 at 12:50:38PM -0800, Adam wrote:
> On Jan 1, 4:17 am, andrew cooke <and...@acooke.org> wrote: > > On Sat, Dec 31, 2011 at 06:14:58PM -0800, Adam wrote: > > > Hi,
> > > I've been experimenting with LEPL as an alternative to NLTK's CFG > > > parsing library for a natural language processing application I'm > > > working with and I'm pretty pleased with the results. I know NLP isn't > > > exactly the ideal use case for LEPL, but with a few tricks I'm getting > > > better performance (after compilation) from LEPL than from NLTK. I'm > > > parsing a (very) limited subset of English and for my use case I'm > > > getting better performance than I am from NLTK, and the I like the > > > syntax and visibility much better (NLTK gives you zero of the latter).
> > hi,
> > thanks. can you explain what you mean by visibility? (i haven't used nltk)
> 2 things in particular: LEPL helped me find a couple hidden instances > of left recursion. Granted it was still tricky to find using LEPL, but > at least LEPL told me that that was the problem; NLTK just froze my > computer. The other visibility window I've used in LEPL is the ability > to find partial and multiple matches. Perhaps not intended for > debugging, but that's how it's been useful for me. With NLTK you give > the parser a string that represents your grammar, the string you want > to match, and you get either a single tree or an exception (or > ridiculous swapping).
> > > There are two problems I wanted to get an expert opinion on, though > > > neither are deal breakers. First for some context: I have a grammar > > > which I assume is pretty large by the average LEPL user's standards: > > > about 6000 leaves and 200 non-leaf productions. My initial naive > > > conversion from my NLTK grammar to LEPL wouldn't even complete in the > > > ~ 15 mins I gave it to parse a simple sentence. It was much faster > > > when I cut the leaves down to a small fraction of the original 6000, > > > so I determined that the number of leaf nodes was the problem (I ruled > > > out ambiguity as well).
> > the tokenizer is a dfa written in python. i would have expected the two > > approaches you describe to be pretty much equivalent, since the system > > collects all the tokens and makes a regexp from them (so what you are doing is > > reproducing by hand what the system does, roughly - the main difference > > being what labels are associated with what matches).
> > in other words, it sounds like something very strange is happening - probably > > some kind of bug (in the sense of not being impleented correctly).
> > this part of lepl (the regexp support) is being completely rewritten at the > > moment. this is a very long process (it's complex and i keep getting > > sidetracked by other projects), but it would be good to have what you are > > doing as a test for the new code. would it be possible to provide me wth a > > slow and fast version of the code and some input data? preferably with > > confirmation that you have read and agree with the certificate of origin > > described athttp://acooke.org/lepl/licence.html#contributions(so that i can > > include the code in lepl itself as part of the test suite)?
> It would be pretty difficult unfortunately. The parser a small cog in > a larger system that unfortunately isn't very well isolated. I think > any grammar with thousands of leaf nodes would have this problem > though, I could try to come up with a simpler example.
> > hmmm. actually, thinking some more, the dfa would need to preserve separate > > state for the different matches in the slow case. so i guess it is > > understandable. it would be a really interesting test for the new code...
> > [more below]
> > > The "trick" I alluded to above that worked well was to take all non- > > > ambiguous tokens (which was most of them) and group them together into > > > a single regexp. For example:
> > > Like I said, this worked well, however I am open to other suggestions. > > > My questions are:
> > > (1) More importantly, my parser takes up about 900mb of memory. It > > > claims this memory when the parser is first compiled (when it is first > > > asked to parse something), and hangs on to it for the lifetime of the > > > running instance. Subsequence parses don't seem to increase memory > > > usage significantly. Any idea why it's using so much memory?
> > i haven't looked at this kind of memory use (i have worried more about the use > > of memory when processing data). i imagine it could be improved, but it would > > need profiling and that kind fo work will need to wait for the new changes > > (which are taking forever to write - i hope to have something out later this > > year).
> I look forward to that. 900mb seems pretty excessive.
> > > (2) The compilation step is still pretty slow, it takes about a > > > minute. This is fine for production use for my application since I > > > expect each compilation to be amortized by thousands of parses over > > > the lifecycle of the app, but it is annoying for development, > > > especially when I am working on something that has nothing to do with > > > the parser and need to restart my python instance. Any suggestions for > > > this? I'm open to hacks for this one, since it's just a development > > > convenience. I briefly tried pickling the parser to disk so I could > > > recycle it when it was unchanged but, unsurprisingly, didn't get too > > > far with this.
> > well, you can disable various things in the config. see all the > > no_... options athttp://acooke.org/lepl/advanced.html#configuration > > but the bigger problem there is that you are using tokens. they require a > > fair amount of processing for compilation (you can't disable that part or > > things won't work).
> > so if you don't need tokens when playing around you can use .clear() whoch > > should reduce compilation time considerably.
> > for a first attempt, try ,atcher.config.clear().lexer() which should give the > > minimal time needed with tokens.
> Oh, the reason I'm using the tokenizer is because of a different > problem that you may or may not be aware of. When using the Regexp() > matcher to try to mirror the above trick without using the tokenizer I > got an error saying something about the regular expression being too > long. It was a while ago, I can reproduce it if you want the exact > exception.
> Thanks again!
> Adam
> -- > You received this message because you are subscribed to the Google Groups "lepl" group. > To post to this group, send email to lepl@googlegroups.com. > To unsubscribe from this group, send email to lepl+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/lepl?hl=en.
I got around to trying this, and I do get the size exception with NfaRegexp/DfaRegexp. Here it is:
File "../crawler/test/parser_test.py", line 293, in test_parsable_sentences gp.parse("apples") File "/Users/atom/workspace/crawler/lib/grammar_parser.py", line 464, in parse result_list = self.parser.parse(new_sentence) File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/config.py ", line 858, in parse return self.get_parse()(input_, **kargs) File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py ", line 257, in single return next(raw(arg, **kargs))[0] File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py ", line 146, in trampoline value = next(value.generator) File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", line 416, in _match for results in self._cached_matcher(self, stream_in): File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/combi ne.py", line 368, in match for result in matcher._untagged_match(stream_in): File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/memo. py", line 124, in _untagged_match result = next(descriptor[2].generator) File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", line 440, in _match results = self._cached_matcher(self, stream_in) File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", line 301, in _cached_matcher self.__cached_matcher = self.factory(*args, **kargs) File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/core. py", line 194, in Regexp pattern = compile_(pattern) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re. py", line 190, in compile return _compile(pattern, flags) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re. py", line 243, in _compile p = sre_compile.compile(pattern, flags) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre _compile.py", line 523, in compile groupindex, indexgroup OverflowError: regular expression code size limit exceeded
---------------------------------------------------------------------- Ran 1 test in 60.485s
On Sun, Jan 1, 2012 at 2:16 PM, andrew cooke <and...@acooke.org> wrote:
> OK, no problem - I can reconstruct examples from your description.
> The problem with a tool long regexp sounds like it's comig from Python's > re package (the Regexp() matcher). If you use the NfaRegexp() or DfaRegexp() > you will use the same code used in the tokenizer and shouldn't hit that > problem.
> Andrew
> On Sun, Jan 01, 2012 at 12:50:38PM -0800, Adam wrote:
>> On Jan 1, 4:17 am, andrew cooke <and...@acooke.org> wrote: >> > On Sat, Dec 31, 2011 at 06:14:58PM -0800, Adam wrote: >> > > Hi,
>> > > I've been experimenting with LEPL as an alternative to NLTK's CFG >> > > parsing library for a natural language processing application I'm >> > > working with and I'm pretty pleased with the results. I know NLP isn't >> > > exactly the ideal use case for LEPL, but with a few tricks I'm getting >> > > better performance (after compilation) from LEPL than from NLTK. I'm >> > > parsing a (very) limited subset of English and for my use case I'm >> > > getting better performance than I am from NLTK, and the I like the >> > > syntax and visibility much better (NLTK gives you zero of the latter).
>> > hi,
>> > thanks. can you explain what you mean by visibility? (i haven't used nltk)
>> 2 things in particular: LEPL helped me find a couple hidden instances >> of left recursion. Granted it was still tricky to find using LEPL, but >> at least LEPL told me that that was the problem; NLTK just froze my >> computer. The other visibility window I've used in LEPL is the ability >> to find partial and multiple matches. Perhaps not intended for >> debugging, but that's how it's been useful for me. With NLTK you give >> the parser a string that represents your grammar, the string you want >> to match, and you get either a single tree or an exception (or >> ridiculous swapping).
>> > > There are two problems I wanted to get an expert opinion on, though >> > > neither are deal breakers. First for some context: I have a grammar >> > > which I assume is pretty large by the average LEPL user's standards: >> > > about 6000 leaves and 200 non-leaf productions. My initial naive >> > > conversion from my NLTK grammar to LEPL wouldn't even complete in the >> > > ~ 15 mins I gave it to parse a simple sentence. It was much faster >> > > when I cut the leaves down to a small fraction of the original 6000, >> > > so I determined that the number of leaf nodes was the problem (I ruled >> > > out ambiguity as well).
>> > the tokenizer is a dfa written in python. i would have expected the two >> > approaches you describe to be pretty much equivalent, since the system >> > collects all the tokens and makes a regexp from them (so what you are doing is >> > reproducing by hand what the system does, roughly - the main difference >> > being what labels are associated with what matches).
>> > in other words, it sounds like something very strange is happening - probably >> > some kind of bug (in the sense of not being impleented correctly).
>> > this part of lepl (the regexp support) is being completely rewritten at the >> > moment. this is a very long process (it's complex and i keep getting >> > sidetracked by other projects), but it would be good to have what you are >> > doing as a test for the new code. would it be possible to provide me wth a >> > slow and fast version of the code and some input data? preferably with >> > confirmation that you have read and agree with the certificate of origin >> > described athttp://acooke.org/lepl/licence.html#contributions(so that i can >> > include the code in lepl itself as part of the test suite)?
>> It would be pretty difficult unfortunately. The parser a small cog in >> a larger system that unfortunately isn't very well isolated. I think >> any grammar with thousands of leaf nodes would have this problem >> though, I could try to come up with a simpler example.
>> > hmmm. actually, thinking some more, the dfa would need to preserve separate >> > state for the different matches in the slow case. so i guess it is >> > understandable. it would be a really interesting test for the new code...
>> > [more below]
>> > > The "trick" I alluded to above that worked well was to take all non- >> > > ambiguous tokens (which was most of them) and group them together into >> > > a single regexp. For example:
>> > > Like I said, this worked well, however I am open to other suggestions. >> > > My questions are:
>> > > (1) More importantly, my parser takes up about 900mb of memory. It >> > > claims this memory when the parser is first compiled (when it is first >> > > asked to parse something), and hangs on to it for the lifetime of the >> > > running instance. Subsequence parses don't seem to increase memory >> > > usage significantly. Any idea why it's using so much memory?
>> > i haven't looked at this kind of memory use (i have worried more about the use >> > of memory when processing data). i imagine it could be improved, but it would >> > need profiling and that kind fo work will need to wait for the new changes >> > (which are taking forever to write - i hope to have something out later this >> > year).
>> I look forward to that. 900mb seems pretty excessive.
>> > > (2) The compilation step is still pretty slow, it takes about a >> > > minute. This is fine for production use for my application since I >> > > expect each compilation to be amortized by thousands of parses over >> > > the lifecycle of the app, but it is annoying for development, >> > > especially when I am working on something that has nothing to do with >> > > the parser and need to restart my python instance. Any suggestions for >> > > this? I'm open to hacks for this one, since it's just a development >> > > convenience. I briefly tried pickling the parser to disk so I could >> > > recycle it when it was unchanged but, unsurprisingly, didn't get too >> > > far with this.
>> > well, you can disable various things in the config. see all the >> > no_... options athttp://acooke.org/lepl/advanced.html#configuration >> > but the bigger problem there is that you are using tokens. they require a >> > fair amount of processing for compilation (you can't disable that part or >> > things won't work).
>> > so if you don't need tokens when playing around you can use .clear() whoch >> > should reduce compilation time considerably.
>> > for a first attempt, try ,atcher.config.clear().lexer() which should give the >> > minimal time needed with tokens.
>> Oh, the reason I'm using the tokenizer is because of a different >> problem that you may or may not be aware of. When using the Regexp() >> matcher to try to mirror the above trick without using the tokenizer I >> got an error saying something about the regular expression being too >> long. It was a while ago, I can reproduce it if you want the exact >> exception.
>> Thanks again!
>> Adam
>> -- >> You received this message because you are subscribed to the Google Groups "lepl" group. >> To post to this group, send email to lepl@googlegroups.com. >> To unsubscribe from this group, send email to lepl+unsubscribe@googlegroups.com. >> For
On Wed, Jan 11, 2012 at 07:20:53PM -0800, Adam Berlinsky-Schine wrote: > I got around to trying this, and I do get the size exception with > NfaRegexp/DfaRegexp. Here it is:
> File "../crawler/test/parser_test.py", line 293, in test_parsable_sentences > gp.parse("apples") > File "/Users/atom/workspace/crawler/lib/grammar_parser.py", line 464, in parse > result_list = self.parser.parse(new_sentence) > File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/config.py ", > line 858, in parse > return self.get_parse()(input_, **kargs) > File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py ", > line 257, in single > return next(raw(arg, **kargs))[0] > File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py ", > line 146, in trampoline > value = next(value.generator) > File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", > line 416, in _match > for results in self._cached_matcher(self, stream_in): > File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/combi ne.py", > line 368, in match > for result in matcher._untagged_match(stream_in): > File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/memo. py", > line 124, in _untagged_match > result = next(descriptor[2].generator) > File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", > line 440, in _match > results = self._cached_matcher(self, stream_in) > File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", > line 301, in _cached_matcher > self.__cached_matcher = self.factory(*args, **kargs) > File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/core. py", > line 194, in Regexp > pattern = compile_(pattern) > File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re. py", > line 190, in compile > return _compile(pattern, flags) > File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re. py", > line 243, in _compile > p = sre_compile.compile(pattern, flags) > File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre _compile.py", > line 523, in compile > groupindex, indexgroup > OverflowError: regular expression code size limit exceeded
> ---------------------------------------------------------------------- > Ran 1 test in 60.485s
> FAILED (errors=1)
> On Sun, Jan 1, 2012 at 2:16 PM, andrew cooke <and...@acooke.org> wrote:
> > OK, no problem - I can reconstruct examples from your description.
> > The problem with a tool long regexp sounds like it's comig from Python's > > re package (the Regexp() matcher). If you use the NfaRegexp() or DfaRegexp() > > you will use the same code used in the tokenizer and shouldn't hit that > > problem.
> > Andrew
> > On Sun, Jan 01, 2012 at 12:50:38PM -0800, Adam wrote:
> >> On Jan 1, 4:17 am, andrew cooke <and...@acooke.org> wrote: > >> > On Sat, Dec 31, 2011 at 06:14:58PM -0800, Adam wrote: > >> > > Hi,
> >> > > I've been experimenting with LEPL as an alternative to NLTK's CFG > >> > > parsing library for a natural language processing application I'm > >> > > working with and I'm pretty pleased with the results. I know NLP isn't > >> > > exactly the ideal use case for LEPL, but with a few tricks I'm getting > >> > > better performance (after compilation) from LEPL than from NLTK. I'm > >> > > parsing a (very) limited subset of English and for my use case I'm > >> > > getting better performance than I am from NLTK, and the I like the > >> > > syntax and visibility much better (NLTK gives you zero of the latter).
> >> > hi,
> >> > thanks. can you explain what you mean by visibility? (i haven't used nltk)
> >> 2 things in particular: LEPL helped me find a couple hidden instances > >> of left recursion. Granted it was still tricky to find using LEPL, but > >> at least LEPL told me that that was the problem; NLTK just froze my > >> computer. The other visibility window I've used in LEPL is the ability > >> to find partial and multiple matches. Perhaps not intended for > >> debugging, but that's how it's been useful for me. With NLTK you give > >> the parser a string that represents your grammar, the string you want > >> to match, and you get either a single tree or an exception (or > >> ridiculous swapping).
> >> > > There are two problems I wanted to get an expert opinion on, though > >> > > neither are deal breakers. First for some context: I have a grammar > >> > > which I assume is pretty large by the average LEPL user's standards: > >> > > about 6000 leaves and 200 non-leaf productions. My initial naive > >> > > conversion from my NLTK grammar to LEPL wouldn't even complete in the > >> > > ~ 15 mins I gave it to parse a simple sentence. It was much faster > >> > > when I cut the leaves down to a small fraction of the original 6000, > >> > > so I determined that the number of leaf nodes was the problem (I ruled > >> > > out ambiguity as well).
> >> > the tokenizer is a dfa written in python. i would have expected the two > >> > approaches you describe to be pretty much equivalent, since the system > >> > collects all the tokens and makes a regexp from them (so what you are doing is > >> > reproducing by hand what the system does, roughly - the main difference > >> > being what labels are associated with what matches).
> >> > in other words, it sounds like something very strange is happening - probably > >> > some kind of bug (in the sense of not being impleented correctly).
> >> > this part of lepl (the regexp support) is being completely rewritten at the > >> > moment. this is a very long process (it's complex and i keep getting > >> > sidetracked by other projects), but it would be good to have what you are > >> > doing as a test for the new code. would it be possible to provide me wth a > >> > slow and fast version of the code and some input data? preferably with > >> > confirmation that you have read and agree with the certificate of origin > >> > described athttp://acooke.org/lepl/licence.html#contributions(so that i can > >> > include the code in lepl itself as part of the test suite)?
> >> It would be pretty difficult unfortunately. The parser a small cog in > >> a larger system that unfortunately isn't very well isolated. I think > >> any grammar with thousands of leaf nodes would have this problem > >> though, I could try to come up with a simpler example.
> >> > hmmm. actually, thinking some more, the dfa would need to preserve separate > >> > state for the different matches in the slow case. so i guess it is > >> > understandable. it would be a really interesting test for the new code...
> >> > [more below]
> >> > > The "trick" I alluded to above that worked well was to take all non- > >> > > ambiguous tokens (which was most of them) and group them together into > >> > > a single regexp. For example:
> >> > > Like I said, this worked well, however I am open to other suggestions. > >> > > My questions are:
> >> > > (1) More importantly, my parser takes up about 900mb of memory. It > >> > > claims this memory when the parser is first compiled (when it is first > >> > > asked to parse something), and hangs on to it for the lifetime of the > >> > > running instance. Subsequence parses don't seem to increase memory > >> > > usage significantly. Any idea why it's using so much memory?
> >> > i haven't looked at this kind of memory use (i have worried more about the use > >> > of memory when processing data). i imagine it could be improved, but it would > >> > need profiling and that kind fo work will need to wait for the new changes > >> > (which are taking forever to write - i hope to have something out later this > >> > year).
> >> I look forward to that. 900mb seems pretty excessive.
> >> > > (2) The compilation step is still pretty slow, it takes about a > >> > > minute. This is fine for production use for my application since I > >> > > expect each compilation to be amortized by thousands of parses over > >> > > the lifecycle of the app, but it is annoying for development, > >> > > especially when I am working on something that has nothing to do with > >> > > the parser and need to restart my python instance. Any suggestions for > >> > > this? I'm open to hacks for this one, since it's just a development > >> > > convenience. I briefly tried pickling the parser to disk so I could > >> > > recycle it when it was unchanged but, unsurprisingly, didn't get too > >> > > far with this.
> >> > well, you can disable various things in the config. see all the > >> > no_... options athttp://acooke.org/lepl/advanced.html#configuration > >> > but the bigger problem there is that you are using tokens. they require a > >> > fair amount of processing for compilation (you can't disable that part or > >> > things won't work).
> >> > so if you don't need tokens when playing around you can use .clear() whoch > >> > should reduce compilation time considerably.
> >> > for a first attempt, try ,atcher.config.clear().lexer() which should give the > >> > minimal time needed with tokens.
> >> Oh, the reason I'm using the tokenizer is because of a different
My bad, I tried again and it does get rid of the regex size exceeded error, I must have missed a case yesterday. But switching to DfaRegexp/NfaRegexp (or getting rid of tokens) makes my parser very slow. It's alright though, my bigger complaint was the memory usage, I'd rather let you focus on getting the next version out that has memory profiling support :)
On Thu, Jan 12, 2012 at 3:03 AM, andrew cooke <and...@acooke.org> wrote:
> Can you provide the code? That error is still coming from a Regexp (see the > stack trace). Thanks, Andrew
> On Wed, Jan 11, 2012 at 07:20:53PM -0800, Adam Berlinsky-Schine wrote: >> I got around to trying this, and I do get the size exception with >> NfaRegexp/DfaRegexp. Here it is:
>> File "../crawler/test/parser_test.py", line 293, in test_parsable_sentences >> gp.parse("apples") >> File "/Users/atom/workspace/crawler/lib/grammar_parser.py", line 464, in parse >> result_list = self.parser.parse(new_sentence) >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/config.py ", >> line 858, in parse >> return self.get_parse()(input_, **kargs) >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py ", >> line 257, in single >> return next(raw(arg, **kargs))[0] >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py ", >> line 146, in trampoline >> value = next(value.generator) >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", >> line 416, in _match >> for results in self._cached_matcher(self, stream_in): >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/combi ne.py", >> line 368, in match >> for result in matcher._untagged_match(stream_in): >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/memo. py", >> line 124, in _untagged_match >> result = next(descriptor[2].generator) >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", >> line 440, in _match >> results = self._cached_matcher(self, stream_in) >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", >> line 301, in _cached_matcher >> self.__cached_matcher = self.factory(*args, **kargs) >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/core. py", >> line 194, in Regexp >> pattern = compile_(pattern) >> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re. py", >> line 190, in compile >> return _compile(pattern, flags) >> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re. py", >> line 243, in _compile >> p = sre_compile.compile(pattern, flags) >> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre _compile.py", >> line 523, in compile >> groupindex, indexgroup >> OverflowError: regular expression code size limit exceeded
>> ---------------------------------------------------------------------- >> Ran 1 test in 60.485s
>> FAILED (errors=1)
>> On Sun, Jan 1, 2012 at 2:16 PM, andrew cooke <and...@acooke.org> wrote:
>> > OK, no problem - I can reconstruct examples from your description.
>> > The problem with a tool long regexp sounds like it's comig from Python's >> > re package (the Regexp() matcher). If you use the NfaRegexp() or DfaRegexp() >> > you will use the same code used in the tokenizer and shouldn't hit that >> > problem.
>> > Andrew
>> > On Sun, Jan 01, 2012 at 12:50:38PM -0800, Adam wrote:
>> >> On Jan 1, 4:17 am, andrew cooke <and...@acooke.org> wrote: >> >> > On Sat, Dec 31, 2011 at 06:14:58PM -0800, Adam wrote: >> >> > > Hi,
>> >> > > I've been experimenting with LEPL as an alternative to NLTK's CFG >> >> > > parsing library for a natural language processing application I'm >> >> > > working with and I'm pretty pleased with the results. I know NLP isn't >> >> > > exactly the ideal use case for LEPL, but with a few tricks I'm getting >> >> > > better performance (after compilation) from LEPL than from NLTK. I'm >> >> > > parsing a (very) limited subset of English and for my use case I'm >> >> > > getting better performance than I am from NLTK, and the I like the >> >> > > syntax and visibility much better (NLTK gives you zero of the latter).
>> >> > hi,
>> >> > thanks. can you explain what you mean by visibility? (i haven't used nltk)
>> >> 2 things in particular: LEPL helped me find a couple hidden instances >> >> of left recursion. Granted it was still tricky to find using LEPL, but >> >> at least LEPL told me that that was the problem; NLTK just froze my >> >> computer. The other visibility window I've used in LEPL is the ability >> >> to find partial and multiple matches. Perhaps not intended for >> >> debugging, but that's how it's been useful for me. With NLTK you give >> >> the parser a string that represents your grammar, the string you want >> >> to match, and you get either a single tree or an exception (or >> >> ridiculous swapping).
>> >> > > There are two problems I wanted to get an expert opinion on, though >> >> > > neither are deal breakers. First for some context: I have a grammar >> >> > > which I assume is pretty large by the average LEPL user's standards: >> >> > > about 6000 leaves and 200 non-leaf productions. My initial naive >> >> > > conversion from my NLTK grammar to LEPL wouldn't even complete in the >> >> > > ~ 15 mins I gave it to parse a simple sentence. It was much faster >> >> > > when I cut the leaves down to a small fraction of the original 6000, >> >> > > so I determined that the number of leaf nodes was the problem (I ruled >> >> > > out ambiguity as well).
>> >> > the tokenizer is a dfa written in python. i would have expected the two >> >> > approaches you describe to be pretty much equivalent, since the system >> >> > collects all the tokens and makes a regexp from them (so what you are doing is >> >> > reproducing by hand what the system does, roughly - the main difference >> >> > being what labels are associated with what matches).
>> >> > in other words, it sounds like something very strange is happening - probably >> >> > some kind of bug (in the sense of not being impleented correctly).
>> >> > this part of lepl (the regexp support) is being completely rewritten at the >> >> > moment. this is a very long process (it's complex and i keep getting >> >> > sidetracked by other projects), but it would be good to have what you are >> >> > doing as a test for the new code. would it be possible to provide me wth a >> >> > slow and fast version of the code and some input data? preferably with >> >> > confirmation that you have read and agree with the certificate of origin >> >> > described athttp://acooke.org/lepl/licence.html#contributions(so that i can >> >> > include the code in lepl itself as part of the test suite)?
>> >> It would be pretty difficult unfortunately. The parser a small cog in >> >> a larger system that unfortunately isn't very well isolated. I think >> >> any grammar with thousands of leaf nodes would have this problem >> >> though, I could try to come up with a simpler example.
>> >> > hmmm. actually, thinking some more, the dfa would need to preserve separate >> >> > state for the different matches in the slow case. so i guess it is >> >> > understandable. it would be a really interesting test for the new code...
>> >> > [more below]
>> >> > > The "trick" I alluded to above that worked well was to take all non- >> >> > > ambiguous tokens (which was most of them) and group them together into >> >> > > a single regexp. For example:
>> >> > > Like I said, this worked well, however I am open to other suggestions. >> >> > > My questions are:
>> >> > > (1) More importantly, my parser takes up about 900mb of memory. It >> >> > > claims this memory when the parser is first compiled (when it is first >> >> > > asked to parse something), and hangs on to it for the lifetime of the >> >> > > running instance. Subsequence parses don't seem to increase memory >> >> > > usage significantly. Any idea why it's using so much memory?
>> >> > i haven't looked at this kind of memory use (i have worried more about the use >> >> > of memory when processing data). i imagine it could be improved, but it would >> >> > need profiling and that kind fo work will need to wait for the new changes >> >> > (which are taking forever to write - i hope to have something out later this >> >> > year).
>> >> I look forward to that. 900mb seems pretty excessive.
>> >> > > (2) The compilation step is still pretty slow, it takes about a >> >> > > minute. This is fine for production use for my application since I >> >> > > expect each compilation to be amortized by thousands of parses over >> >> > > the lifecycle of the app, but it is annoying for development, >> >> > > especially when I am working on something that has nothing to do with >> >> > > the parser and need to restart my python instance. Any suggestions for >> >> > > this? I'm open to hacks for this one, since it's just a development >> >> > > convenience. I briefly tried pickling the parser to disk so I could >> >> > > recycle it when it was unchanged but, unsurprisingly, didn't get too >> >> > > far with this.
On Thu, Jan 12, 2012 at 08:59:32AM -0800, Adam Berlinsky-Schine wrote: > My bad, I tried again and it does get rid of the regex size exceeded > error, I must have missed a case yesterday. But switching to > DfaRegexp/NfaRegexp (or getting rid of tokens) makes my parser very > slow. It's alright though, my bigger complaint was the memory usage, > I'd rather let you focus on getting the next version out that has > memory profiling support :)
> Adam
> On Thu, Jan 12, 2012 at 3:03 AM, andrew cooke <and...@acooke.org> wrote:
> > Can you provide the code? That error is still coming from a Regexp (see the > > stack trace). Thanks, Andrew
> > On Wed, Jan 11, 2012 at 07:20:53PM -0800, Adam Berlinsky-Schine wrote: > >> I got around to trying this, and I do get the size exception with > >> NfaRegexp/DfaRegexp. Here it is:
> >> File "../crawler/test/parser_test.py", line 293, in test_parsable_sentences > >> gp.parse("apples") > >> File "/Users/atom/workspace/crawler/lib/grammar_parser.py", line 464, in parse > >> result_list = self.parser.parse(new_sentence) > >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/config.py ", > >> line 858, in parse > >> return self.get_parse()(input_, **kargs) > >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py ", > >> line 257, in single > >> return next(raw(arg, **kargs))[0] > >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/core/parser.py ", > >> line 146, in trampoline > >> value = next(value.generator) > >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", > >> line 416, in _match > >> for results in self._cached_matcher(self, stream_in): > >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/combi ne.py", > >> line 368, in match > >> for result in matcher._untagged_match(stream_in): > >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/memo. py", > >> line 124, in _untagged_match > >> result = next(descriptor[2].generator) > >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", > >> line 440, in _match > >> results = self._cached_matcher(self, stream_in) > >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/suppo rt.py", > >> line 301, in _cached_matcher > >> self.__cached_matcher = self.factory(*args, **kargs) > >> File "/Library/Python/2.7/site-packages/LEPL-5.0.0-py2.7.egg/lepl/matchers/core. py", > >> line 194, in Regexp > >> pattern = compile_(pattern) > >> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re. py", > >> line 190, in compile > >> return _compile(pattern, flags) > >> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re. py", > >> line 243, in _compile > >> p = sre_compile.compile(pattern, flags) > >> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre _compile.py", > >> line 523, in compile > >> groupindex, indexgroup > >> OverflowError: regular expression code size limit exceeded
> >> ---------------------------------------------------------------------- > >> Ran 1 test in 60.485s
> >> FAILED (errors=1)
> >> On Sun, Jan 1, 2012 at 2:16 PM, andrew cooke <and...@acooke.org> wrote:
> >> > OK, no problem - I can reconstruct examples from your description.
> >> > The problem with a tool long regexp sounds like it's comig from Python's > >> > re package (the Regexp() matcher). If you use the NfaRegexp() or DfaRegexp() > >> > you will use the same code used in the tokenizer and shouldn't hit that > >> > problem.
> >> > Andrew
> >> > On Sun, Jan 01, 2012 at 12:50:38PM -0800, Adam wrote:
> >> >> On Jan 1, 4:17 am, andrew cooke <and...@acooke.org> wrote: > >> >> > On Sat, Dec 31, 2011 at 06:14:58PM -0800, Adam wrote: > >> >> > > Hi,
> >> >> > > I've been experimenting with LEPL as an alternative to NLTK's CFG > >> >> > > parsing library for a natural language processing application I'm > >> >> > > working with and I'm pretty pleased with the results. I know NLP isn't > >> >> > > exactly the ideal use case for LEPL, but with a few tricks I'm getting > >> >> > > better performance (after compilation) from LEPL than from NLTK. I'm > >> >> > > parsing a (very) limited subset of English and for my use case I'm > >> >> > > getting better performance than I am from NLTK, and the I like the > >> >> > > syntax and visibility much better (NLTK gives you zero of the latter).
> >> >> > hi,
> >> >> > thanks. can you explain what you mean by visibility? (i haven't used nltk)
> >> >> 2 things in particular: LEPL helped me find a couple hidden instances > >> >> of left recursion. Granted it was still tricky to find using LEPL, but > >> >> at least LEPL told me that that was the problem; NLTK just froze my > >> >> computer. The other visibility window I've used in LEPL is the ability > >> >> to find partial and multiple matches. Perhaps not intended for > >> >> debugging, but that's how it's been useful for me. With NLTK you give > >> >> the parser a string that represents your grammar, the string you want > >> >> to match, and you get either a single tree or an exception (or > >> >> ridiculous swapping).
> >> >> > > There are two problems I wanted to get an expert opinion on, though > >> >> > > neither are deal breakers. First for some context: I have a grammar > >> >> > > which I assume is pretty large by the average LEPL user's standards: > >> >> > > about 6000 leaves and 200 non-leaf productions. My initial naive > >> >> > > conversion from my NLTK grammar to LEPL wouldn't even complete in the > >> >> > > ~ 15 mins I gave it to parse a simple sentence. It was much faster > >> >> > > when I cut the leaves down to a small fraction of the original 6000, > >> >> > > so I determined that the number of leaf nodes was the problem (I ruled > >> >> > > out ambiguity as well).
> >> >> > the tokenizer is a dfa written in python. i would have expected the two > >> >> > approaches you describe to be pretty much equivalent, since the system > >> >> > collects all the tokens and makes a regexp from them (so what you are doing is > >> >> > reproducing by hand what the system does, roughly - the main difference > >> >> > being what labels are associated with what matches).
> >> >> > in other words, it sounds like something very strange is happening - probably > >> >> > some kind of bug (in the sense of not being impleented correctly).
> >> >> > this part of lepl (the regexp support) is being completely rewritten at the > >> >> > moment. this is a very long process (it's complex and i keep getting > >> >> > sidetracked by other projects), but it would be good to have what you are > >> >> > doing as a test for the new code. would it be possible to provide me wth a > >> >> > slow and fast version of the code and some input data? preferably with > >> >> > confirmation that you have read and agree with the certificate of origin > >> >> > described athttp://acooke.org/lepl/licence.html#contributions(so that i can > >> >> > include the code in lepl itself as part of the test suite)?
> >> >> It would be pretty difficult unfortunately. The parser a small cog in > >> >> a larger system that unfortunately isn't very well isolated. I think > >> >> any grammar with thousands of leaf nodes would have this problem > >> >> though, I could try to come up with a simpler example.
> >> >> > hmmm. actually, thinking some more, the dfa would need to preserve separate > >> >> > state for the different matches in the slow case. so i guess it is > >> >> > understandable. it would be a really interesting test for the new code...
> >> >> > [more below]
> >> >> > > The "trick" I alluded to above that worked well was to take all non- > >> >> > > ambiguous tokens (which was most of them) and group them together into > >> >> > > a single regexp. For example:
> >> >> > > Like I said, this worked well, however I am open to other suggestions. > >> >> > > My questions are:
> >> >> > > (1) More importantly, my parser takes up about 900mb of memory. It > >> >> > > claims this memory when the parser is first compiled (when it is first > >> >> > > asked to parse something), and hangs on to it for the lifetime of the > >> >> > > running instance. Subsequence parses don't seem to increase memory > >> >> > > usage significantly. Any idea why it's using so much memory?
> >> >> > i haven't looked at this kind of memory use (i have worried more about the use > >> >> > of memory when processing data). i imagine it could be improved, but it would > >> >> > need profiling and that kind fo work will need to wait for the new changes > >> >> > (which are taking forever to write - i hope to have something out later this > >> >> > year).
> >> >> I look forward to that. 900mb seems pretty excessive.
> >> >> > > (2) The compilation step is still pretty slow, it takes about a > >> >> > > minute. This is fine for production