Problems with PCFG parsing

sujitpal

Apr 13, 2013, 10:04:12 AM
to nltk-...@googlegroups.com
Hello,

I am trying to build a grammar by taking a tree-tagged set of 1723 questions, then using the grammar as input to a tree parser on unseen sentences. The objective is to use parts of the resulting tree as input to a search engine for a question answering system. To test the code, I built the grammar off the first 10 sentences and was able to generate parse trees out of the (first 10) training sentences using both ChartParser and ViterbiParser. However, if I build the grammar out of the entire training set and try to build parse trees out of the first 10 as before, the parser.parse() method returns None.

My code is below (also on pastebin: http://pastebin.com/eHZPExmY). I am using Python 2.7.3 and NLTK 2.0.4. I am also marking words with count < 3 as <UNK> so the parser is more robust wrt unseen words in the test set. My questions are:

1) Why are the parsers behaving differently with the larger grammar? The smaller grammar (http://pastebin.com/1uWJzDHg) has 45 productions and the full grammar (http://pastebin.com/P8KR7wnk) has 907. I suspect a bad rule is being generated, but I don't have the expertise to detect it. I can post both grammars inline if that helps; alternatively, if you notice something, any pointers on what to look for would be appreciated.

2) From other threads, I understand that NLTK is not recommended for parsing large grammars. Are there recommended alternatives that can reuse the PCFG grammar generated by NLTK? Code snippets showing how to call them (if bindings exist from within NLTK) would help a lot.

####
from __future__ import division
import nltk
import nltk.parse.pchart

def wrap(sentence):
  return "(ROOT " + sentence + ")"

def unkize(words, rareWords):
  # replace rare words with the <UNK> token
  return map(lambda word: "<UNK>" if word in rareWords else word, words)

# accumulate rare words
wordsFD = nltk.FreqDist()
ftrain = open("parse_train.dat", 'rb')
for sentence in ftrain.readlines()[0:10]:
  sentence = wrap(sentence.strip())
  tree = nltk.Tree(sentence)
  tree.collapse_unary(collapsePOS=True)
  tree.chomsky_normal_form()
  for word in tree.leaves():
    wordsFD.inc(word)
ftrain.close()
rareWords = set(filter(lambda word: wordsFD[word] < 3, wordsFD))

# build grammar
productions = []
ftrain = open("parse_train.dat", 'rb')
for sentence in ftrain.readlines()[0:10]:
  sentence = wrap(sentence.strip())
  tree = nltk.Tree(sentence)
  tree.collapse_unary(collapsePOS=True)
  tree.chomsky_normal_form()
  print "train sentence=", tree.leaves()
  print "orig tree=", tree
  for production in tree.productions():
    # replace rare terminals with <UNK> in lexical productions
    if production.is_lexical() and len(production.rhs()) == 1:
      termWord = production.rhs()[0]
      if termWord in rareWords:
        modProduction = nltk.Production(production.lhs(), ["<UNK>"])
        productions.append(modProduction)
        continue
    productions.append(production)
ftrain.close()
root = nltk.Nonterminal("ROOT")
grammar = nltk.induce_pcfg(root, productions)
print grammar

# parse the (training) sentences with both parsers
parsers = [
  nltk.ChartParser(grammar),
  nltk.ViterbiParser(grammar)
]
ftrain = open("parse_train.dat", 'rb')
for sentence in ftrain.readlines()[0:10]:
  sentence = wrap(sentence.strip())
  tree = nltk.Tree(sentence)
  words = unkize(tree.leaves(), rareWords)
  print "test sentence: ", words
  for parser in parsers:
    print "parsed tree [ " + parser.__class__.__name__, \
      "]:", parser.parse(words)
ftrain.close()
####

In case you would like to run the code to see the problem, the parse_train.dat file (training set of tree-tagged questions) is available here: http://pastebin.com/5iDmKGuT

Thanks in advance for any help you can provide.

-sujit

sujitpal

Apr 16, 2013, 2:03:38 PM
to nltk-...@googlegroups.com
Hello NLTK-ers,

I was hoping for some pointers on this. I can't believe PCFG parsing is so arcane?

Is there some information I haven't provided that will help?

If so, please let me know. I am currently stuck and would really appreciate some help and pointers regarding my two questions:
1) how can I debug my grammar quickly and (relatively) painlessly?
2) any pointers to C or Java parsers that work with large grammars (preferably accepting the format generated by NLTK, but not necessarily; I can write the Java code to convert it into the structure the parser wants)?

More importantly, is this approach valid? That is, given your experience, would you use this approach to find phrases by type (rather than, say, shallow chunking or something else)?

Once again, thanks in advance for any advice/answers you can provide.

-sujit

peter ljunglöf

Apr 17, 2013, 3:03:12 AM
to nltk-...@googlegroups.com
Hi Sujit,

On 16 Apr 2013, at 20:03, sujitpal <suji...@comcast.net> wrote:

> I was hoping for some pointers on this. I can't believe PCFG parsing is so arcane?

No, PCFG parsing is not arcane in the NLP community. If you want, there is a lot of reading you can do yourself, such as chapter 14 in the standard textbook by Jurafsky & Martin.

> Is there some information I haven't provided that will help?

Not directly, but to be able to answer your question, I need to try your code myself. And that takes some time. See later in this mail.

> If so, please let me know. I am currently stuck and would really appreciate some help and pointers regarding my two questions:
> 1) how can I debug my grammar quickly and (relatively) painlessly?

It's difficult. Trained grammars are often large and impossible to read. You just have to trust them...

But one standard thing is to divide your data into a training set and a test set. Never ever test on the training data!

The PCFG implementation in NLTK is not state-of-the-art, neither the induction nor the parser.

> 2) any pointers to C or Java parsers that work with large grammars (preferably accepting the format generated by NLTK, but not necessarily; I can write the Java code to convert it into the structure the parser wants)?

If you google "PCFG parser", the second hit is the Stanford Parser, which is one of the best. You can also try the MaltParser, which is another option.

> More importantly, is this approach valid? That is, given your experience, would you use this approach to find phrases by type (rather than, say, shallow chunking or something else)?

It depends on so many things - it might be good, it might not. But 1700 questions is not a big treebank; perhaps it's better to use a pre-trained model (which both the Stanford Parser and MaltParser have). If you create a test set (disjoint from the training set), you can test it out yourself.

> I am trying to build a grammar by taking a tree-tagged set of 1723 questions, then using the grammar as input to a tree parser on unseen sentences. The objective is to use parts of the resulting tree as input to a search engine for a question answering system. To test the code, I built the grammar off the first 10 sentences and was able to generate parse trees out of the (first 10) training sentences using both ChartParser and ViterbiParser. However, if I build the grammar out of the entire training set and try to build parse trees out of the first 10 as before, the parser.parse() method returns None.

I tried your code, both with 10 and 1000 training sentences. The grammars become very different, since you remove rare words. So the first sentence for the small grammar is like this:

<UNK> the <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> ?

but for the large grammar it's this:

In the late <UNK> British <UNK> were used to <UNK> which <UNK> ?

So the grammars become quite incomparable.

You shouldn't use the ChartParser on a PCFG, since it just skips all probabilities and builds all possible trees. For the small grammar, the first sentence has 1,040,346 parse trees. I didn't even try the large grammar - it takes too much memory.

The ViterbiParser is not a very efficient one, but it works on both the large and the small grammar. And it never returns None for me, it always gives a parse tree. Just as it is supposed to. So I can't reproduce your problem.

> My code is below (also on pastebin: http://pastebin.com/eHZPExmY). I am using Python 2.7.3 and NLTK 2.0.4. I am also marking words with count < 3 as <UNK> so the parser is more robust wrt unseen words in the test set. My questions are:
>
> 1) Why are the parsers behaving differently with the larger grammar? The smaller grammar (http://pastebin.com/1uWJzDHg) has 45 productions and the full grammar (http://pastebin.com/P8KR7wnk) has 907. I suspect a bad rule is being generated, but I don't have the expertise to detect it. I can post both grammars inline if that helps; alternatively, if you notice something, any pointers on what to look for would be appreciated.

As I said, I didn't notice anything strange in my quick test. The returned trees are different, but then they should be.

> 2) From other threads, I understand that NLTK is not recommended for parsing large grammars. Are there recommended alternatives that can reuse the PCFG grammar generated by NLTK? Code snippets showing how to call them (if bindings exist from within NLTK) would help a lot.

There is an NLTK binding to the MaltParser. Try:

>>> help(nltk.parse.malt)

good luck!
/Peter

sujitpal

Apr 17, 2013, 1:33:20 PM
to nltk-...@googlegroups.com
Hi Peter,

Thank you for the very detailed response and all the pointers. I do have a separate test set, but I was trying to figure out if the code was wrong, and I figured that a parser trained on a grammar derived from the training set should /definitely/ be able to parse a sentence in the training set (otherwise I would suspect my code).

Yes, the <UNK> stuff does actually confuse things: since I am using a fixed frequency cutoff, smaller training sets end up having more words converted to <UNK>. As a baseline I tried without the <UNK> functionality as well, with both 10 sentences and the full training set. I will try to emulate your strategy by gradually increasing the number of questions to find out where it breaks.

Also, thanks for the pointer to the Stanford parser. I had seen both Malt and Stanford discussed on the mailing list, so I wasn't sure which one is more popular or considered better; I will probably try the Stanford parser. Some people at my company have used it, so I may get some help there as well.

I will post a followup with more details on this thread once I am closer to a solution. Thanks again for your help.

Regards,
Sujit

