On 28 Dec 2010, at 18:26, Max wrote:
> I'm writing an FCFG grammar for a subset of English as part of a small
> DRT-based semantic analysis project. NLTK has allowed me to quickly
> get started and build a fairly large grammar. However, now that the
> grammar is pushing ~450 ambiguous, non-lexical rules, the NLTK feature
> chart parser seems to be struggling - short sentences can take
> anywhere from 2 to 30 seconds to parse.
I'm responsible for most of the parsing algorithms in NLTK. I have optimized the context-free implementations as much as I can, and they are quite fast now. But the feature grammar parser is 100x-1000x slower, and the bottleneck is unification.
I have tried to optimize the unification implementation, and by stripping away almost everything I can get it twice as fast, but it's still about 100x too slow. If you want it faster, you have to do some tweaking yourself.
> Lexical rules do not seem to affect parsing performance at all:
> switching from 1K to 500K rules containing no nonterminals on the
> right hand side hardly made any difference (although loading such a
> grammar takes a whole minute).
As I suggest below, try to use the left-hand sides as context-free nonterminals (i.e. strings), not feature structures. Also, try saving the resulting grammar as a pickled object; it might speed up loading.
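For the pickling, something like the following might work (untested; 'grammar.fcfg' and 'grammar.pickle' are placeholder names, and you should check that nltk.data.load and FeatureChartParser look like this in your NLTK version, and that the loaded grammar object pickles cleanly):

    import pickle
    import nltk

    # One-off conversion: parse the .fcfg text once and cache the grammar object.
    grammar = nltk.data.load('file:grammar.fcfg')
    with open('grammar.pickle', 'wb') as f:
        pickle.dump(grammar, f)

    # Later runs can skip the slow text-parsing step entirely.
    with open('grammar.pickle', 'rb') as f:
        grammar = pickle.load(f)
    parser = nltk.parse.FeatureChartParser(grammar)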
> The grammar in question uses features quite sparingly, only having a
> single level and only using them for enforcing number/verb form
> matching and such rather than building up semantic representation
> (that is done in a separate procedural step afterwards).
>
> Now I'm looking for ways to improve the parsing performance, either by
> optimizing the grammar, improving the NLTK parser or switching to
> another one. Any pointers towards any of these goals would be quite
> helpful.
I think you have four options:
1. Re-implement the unification algorithm in C (or Cython). I guess you need to be a good C hacker for this, as well as have a good knowledge of Robinson's unification algorithm.
2. Inline unification in the chart parser. For this you need to know a lot about the recent research on unification-based parsing. And be a real Python hacker.
3. Use an existing parser written in another language, and write a wrapper. I don't know of any free C parsers (yacc/bison cannot handle features and are not good for ambiguous grammars), but you could have a look at LKB (written in Lisp) or ALE (written in Prolog). Note that both of these systems use a different grammar format from NLTK's, so you would have to convert your grammar.
If LKB and ALE are too big for your purposes, it's easy to implement an efficient unification parser in Prolog (e.g., SWI-Prolog) in 10-20 lines of code. Then you can use pyprolog or PySwip to talk to SWI-Prolog from Python; there is a sketch of the Python side after this list.
4. Translate your grammar to context-free form by expanding all features. This might lead to an explosion in the number of rules, but since you write that you don't use features that much, I guess the problem won't be that big.
The basic algorithm is quite straightforward: first you find out which values each feature can have, and then you simply loop through each production in the grammar and instantiate each unification variable with all its possible values, yielding a new production for every instantiation. The nonterminals in the new productions should be strings and not FeatureStructures (e.g., you can use the __repr__ of the instantiated nonterminal as the new context-free nonterminal). There is a simplified sketch of this after the list.
If the resulting grammar explodes, there are some techniques you can try. See, e.g., M. Rayner, J. Dowding and B.A. Hockey (2001). "A Baseline Method for Compiling Typed Unification Grammars into Context Free Language Models". Available from here:
http://www.issco.unige.ch/en/research/projects/regulus/publications.shtml
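To make the expansion idea in option 4 concrete, here is a rough, self-contained sketch. It does not use NLTK's actual FeatStruct/Production classes; nonterminals are just (category, feature dict) pairs and values starting with '?' are variables, so take it as an illustration of the algorithm rather than working NLTK code:

    from itertools import product

    # Simplified illustration of the feature-expansion idea, NOT using
    # NLTK's FeatStruct/Production classes.  Nonterminals are
    # (category, {feature: value}) pairs; values starting with '?' are
    # unification variables.  All names below are made up.

    # The possible values of each feature, collected by scanning the grammar first.
    FEATURE_VALUES = {'NUM': {'sg', 'pl'}, 'PER': {'1', '2', '3'}}

    def flatten(cat, feats):
        """Turn an instantiated nonterminal into a plain string, which then
        serves as the context-free nonterminal."""
        return cat + '[' + ','.join('%s=%s' % kv for kv in sorted(feats.items())) + ']'

    def expand(production):
        """Yield one context-free production per instantiation of the
        unification variables in `production`."""
        lhs, rhs = production
        symbols = [lhs] + rhs
        # Each variable's domain is the intersection of the value sets of
        # the features it occurs under.
        domains = {}
        for _, feats in symbols:
            for feat, val in feats.items():
                if val.startswith('?'):
                    domains[val] = domains.get(val, FEATURE_VALUES[feat]) & FEATURE_VALUES[feat]
        variables = sorted(domains)
        for values in product(*(sorted(domains[v]) for v in variables)):
            bindings = dict(zip(variables, values))
            inst = [(cat, {f: bindings.get(v, v) for f, v in feats.items()})
                    for cat, feats in symbols]
            yield flatten(*inst[0]), [flatten(*s) for s in inst[1:]]

    # Example: S[NUM=?n] -> NP[NUM=?n] VP[NUM=?n] becomes two plain CFG rules.
    prod = (('S', {'NUM': '?n'}), [('NP', {'NUM': '?n'}), ('VP', {'NUM': '?n'})])
    for lhs, rhs in expand(prod):
        print(lhs, '->', ' '.join(rhs))

A real implementation would of course have to read the variables and value sets off the NLTK grammar objects and handle lexical productions too, but the loop structure stays the same.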
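And for option 3, here is a minimal sketch of the Python side, assuming SWI-Prolog and PySwip are installed and that you have written a small Prolog grammar in a file grammar.pl (a made-up name) exposing a predicate parse/2 from a token list to a parse tree term:

    from pyswip import Prolog

    prolog = Prolog()
    prolog.consult('grammar.pl')        # load the (hypothetical) Prolog grammar

    # The sentence as a Prolog list of atoms.
    tokens = "[the, dog, barks]"
    for solution in prolog.query('parse(%s, Tree)' % tokens):
        print(solution['Tree'])         # one solution per parse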
So, whichever option you choose, it will involve some NLTK hacking. If you do any of these things, please sign up to nltk...@googlegroups.com and share your experience and your code.
> I've posted a similar question on StackOverflow
> (http://stackoverflow.com/questions/4543008/efficient-context-free-grammar-parser-preferably-python-friendly),
> but so far that hasn't produced many useful results.
I'm not surprised: it's mostly hackers who answer questions on StackOverflow, and most hackers know nothing about NLP problems (which can be seen from the answers you got).
best,
/Peter Ljunglöf
PS. Option no. 4 has been on my TODO list for a while now, but I haven't had the time to actually implement it yet... it would be great if you happened to do something about it :)