Hi Sujit,
On 16 Apr 2013, at 20:03, sujitpal <
suji...@comcast.net>:
> I was hoping for some pointers to this. I can't believe PCFG parsing is so arcane?
No, PCFG parsing is not arcane in the NLP community. If you want, there is a lot of reading you can do yourself, such as chapter 14 of the standard textbook by Jurafsky & Martin.
> Is there some information I haven't provided that will help?
Not directly, but to be able to answer your question I need to try your code myself, and that takes some time. See further down in this mail.
> Would appreciate being told if so. I am currently stuck on this, would really appreciate some help and pointers regarding my two questions:
> 1) how can I debug my grammar quickly and (relatively) painlessly?
It's difficult. Trained grammars are often large and impossible to read. You just have to trust them...
But one standard thing is to divide your data into a training set and a test set. Never ever test on the training data!
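To make the split concrete, here is a minimal sketch (the variable names, the fixed seed, and the 90/10 ratio are all just illustrative; your list would contain the tree-tagged questions, not integers):

```python
import random

# Stand-in for your 1723 tree-tagged questions (illustrative only).
sentences = list(range(1723))

random.seed(42)            # fixed seed so the split is reproducible
random.shuffle(sentences)

split = int(0.9 * len(sentences))   # e.g. a 90/10 split
train_set = sentences[:split]       # induce the grammar from this part only
test_set = sentences[split:]        # evaluate the parser on this part only
```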
The PCFG implementation in NLTK is not state-of-the-art, neither the induction nor the parser.
> 2) any pointers to C or Java parsers that work with large grammars (preferably in the format generated by NLTK, but not necessarily, I can write the Java code to convert it into the structure the parser wants).
If you google "PCFG parser", the second hit is the Stanford Parser, which is one of the best ones. You can also try the MaltParser, which is another one.
> More important, is this approach valid, ie, given your experience, would you use this approach to find phrases by type (rather than shallow chunking or something else, for example)?
It depends on so many things - it might be good, it might not. But 1,700 questions is not a big treebank; perhaps it's better to use a pre-trained model (which both the Stanford Parser and MaltParser have). If you create a test set (disjoint from the training set), you can test it out yourself.
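One rough way to test it out is to measure coverage: the fraction of held-out sentences that get any parse at all. A minimal sketch with NLTK's ViterbiParser (the grammar and sentences below are toy stand-ins for your induced grammar and test set; note that in NLTK 3.x parse() returns an iterator, and a ValueError is raised when a token is outside the grammar's vocabulary):

```python
import nltk
from nltk.parse import ViterbiParser

def coverage(parser, test_sentences):
    """Fraction of held-out token lists that receive at least one parse."""
    parsed = 0
    for tokens in test_sentences:
        try:
            if next(iter(parser.parse(tokens)), None) is not None:
                parsed += 1
        except ValueError:
            # raised when a token is not covered by the grammar at all
            pass
    return parsed / len(test_sentences)

# Toy grammar, just to show the shape of the call.
grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [1.0]
    VP -> V NP [1.0]
    Det -> 'the' [1.0]
    N -> 'dog' [0.5] | 'cat' [0.5]
    V -> 'chased' [1.0]
""")
parser = ViterbiParser(grammar)
print(coverage(parser, [
    "the dog chased the cat".split(),   # covered by the toy grammar
    "the dog barked loudly".split(),    # 'barked', 'loudly' unknown
]))
```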
> I am trying to build a grammar by taking a tree-tagged set of 1723 questions, then using the grammar as input to a tree parser on unseen sentences. The objective is to use parts of the resulting tree as input to a search engine for a question answering system. To test the code, I built the grammar off the first 10 sentences and was able to generate parse trees out of the (first 10) training sentences using both ChartParser and ViterbiParser. However, if I build the grammar out of the entire training set and try to build parse trees out of the first 10 as before, the parser.parse() method returns None.
I tried your code, both with 10 and 1000 training sentences. The grammars become very different, since you remove rare words. So the first sentence for the small grammar is like this:
<UNK> the <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> ?
but for the large grammar it's this:
In the late <UNK> British <UNK> were used to <UNK> which <UNK> ?
So the grammars become quite incomparable.
You shouldn't use the ChartParser on a PCFG, since it just ignores all the probabilities and builds all possible trees. For the small grammar, the first sentence has 1,040,346 parse trees. I didn't even try the large grammar - it takes too much memory.
The ViterbiParser is not very efficient, but it works on both the large and the small grammar. And it never returns None for me - it always gives a parse tree, just as it is supposed to. So I can't reproduce your problem.
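The difference between the two parsers is easy to see on a toy ambiguous grammar (made up for illustration; the code below uses the NLTK 3.x API, where both parsers return iterators of trees):

```python
import nltk
from nltk.parse import ChartParser, ViterbiParser

# Toy PCFG with a classic PP-attachment ambiguity (illustrative only;
# your induced grammar is far larger).
grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.6] | NP PP [0.4]
    VP -> V NP [0.7] | VP PP [0.3]
    PP -> P NP [1.0]
    Det -> 'the' [1.0]
    N -> 'dog' [0.5] | 'park' [0.5]
    V -> 'saw' [1.0]
    P -> 'in' [1.0]
""")

tokens = "the dog saw the dog in the park".split()

# ChartParser ignores the probabilities and enumerates every tree
# ('in the park' can attach to the object NP or to the VP):
all_trees = list(ChartParser(grammar).parse(tokens))

# ViterbiParser returns only the single most probable tree:
best = list(ViterbiParser(grammar).parse(tokens))

print(len(all_trees), len(best))
```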
> My code is below (also on <a href="
http://pastebin.com/eHZPExmY">pastebin</a>). I am using Python 2.7.3 and NLTK 2.0.4. I am also marking words with count < 3 as <UNK> so the parser is more robust wrt unseen words in the test set. My questions are:
>
> 1) why are the parsers behaving differently with the larger grammar? The <a href="
http://pastebin.com/1uWJzDHg">smaller grammar</a> has 45 productions and the <a href="
http://pastebin.com/P8KR7wnk">full grammar</a> has 907. I suspect a bad rule being generated but I don't have the necessary expertise to detect it. I can post both grammars inline if it helps, or alternatively, if you notice something and/or pointers on things to look for would be appreciated.
As I said, I didn't notice anything strange in my quick test. The returned trees are different, but then they should be.
> 2) From other threads, I understand that NLTK is not the recommended approach to parse large grammars. Are there any recommended alternatives that can reuse the PCFG grammar generated by NLTK? Code snips on how to call (if bindings exist from within NLTK) would help a lot.
There is an NLTK binding to the MaltParser. Try:
>>> help(nltk.parse.malt)
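The binding shells out to the external MaltParser jar, so you need to download MaltParser and a pre-trained model first. A rough sketch of the call (NLTK 3.x API; the paths and the model name are illustrative, adjust them to your installation):

```python
import nltk.parse.malt

# With the jar and a pre-trained model installed, usage looks roughly
# like this (paths and model name are made up for illustration):
#
#   mp = nltk.parse.malt.MaltParser('/path/to/maltparser-1.7.2',
#                                   '/path/to/engmalt.linear-1.7.mco')
#   graph = mp.parse_one('What is the capital of Sweden ?'.split())
#   print(graph.tree())
#
# Without the jar you can at least inspect the binding:
help(nltk.parse.malt.MaltParser)
```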
Good luck!
/Peter