Pruning Tree using nltk.tree

1,364 views
Skip to first unread message

Vivian

unread,
Mar 26, 2013, 9:54:27 PM3/26/13
to nltk-...@googlegroups.com
Hi All,
    I recently ran into nltk.tree and found it is very useful for parsing the CFG tree.
    But I come across a problem about pruning the tree.
   For example,  I want to prune the following tree to find the lowest common ancestor for  "John married Marry". I start with pruning all the branches to the left of "John", but every time when I removed a node, the index for each nodes changes, so I lost the index.

    I would really appreciate if anyone could post some codes snippet or hints.

Thanks!
Vivian


    

Vivian

unread,
Mar 26, 2013, 9:58:37 PM3/26/13
to nltk-...@googlegroups.com
I found the email list did not show the pics of the parse tree...
So I just add to my previous post

The original sentence is "They said John married Marry last month", and I want to get a pruned tree only contains "John married Marry".
The parse tree for the full sentence is:
(ROOT
  (S
    (NP (PRP They))
    (VP (VBD said)
      (SBAR
        (S
          (NP (NNP John))
          (VP (VBD married)
            (NP (NNP Marry) (JJ last) (NNS Month))))))
    (. .)))

And I want a pruned tree like this:
       (S
          (NP (NNP John))
          (VP (VBD married)
            (NP (NNP Marry))))

I would really appreciate if someone could help me with this!

Vivian

Alexis Dimitriadis

unread,
Mar 27, 2013, 6:25:11 AM3/27/13
to nltk-...@googlegroups.com
Hi Vivian,

Instead of removing branches, you can use `tree.treeposition_spanning_leaves(index1, index2)` which does exactly what you're looking for. It seems to behave like list indices, so index2 should be one larger than the index of the second word. This will return a tree "position", which is a tuple. You can use it as if it was a simple index to get the subtree you want:
pos = tree.treeposition_spanning_leaves(index_john, index_mary+1)
print tree[pos]

Alexis

peter ljunglöf

unread,
Mar 27, 2013, 3:26:22 AM3/27/13
to nltk-...@googlegroups.com
Hi Vivian,

you're lucky -- there's a method for that! Almost...

>>> t = Tree('(ROOT (S (NP (PRP They)) (VP (VBD said) (SBAR (S (NP (NNP John)) (VP (VBD married) (NP (NNP Marry) (JJ last) (NNS month))))))))')
>>> t.leaves()
['They', 'said', 'John', 'married', 'Marry', 'last', 'month']
>>> t.leaves()[2:5]
['John', 'married', 'Marry']
>>> pos = t.treeposition_spanning_leaves(2,5)
>>> pos
(0, 1, 1, 0)
>>> t[pos]
Tree('S', [Tree('NP', [Tree('NNP', ['John'])]), Tree('VP', [Tree('VBD', ['married']), Tree('NP', [Tree('NNP', ['Marry']), Tree('JJ', ['last']), Tree('NNS', ['month'])])])])
>>> t[pos].leaves()
['John', 'married', 'Marry', 'last', 'month']

So, what .treeposition_spanning_leaves() gives you is the smallest subtree that covers your substring. But it doesn't prune any branches from that subtree, and since "last month" is in that subtree, it's still there.

To prune away "last month", you can take one word at the time, like this:

>>> t.leaves()[4]
'Marry'
>>> t.leaves()[5]
'last'
>>> t.leaf_treeposition(4)
(0, 1, 1, 0, 1, 1, 0, 0)
>>> t.leaf_treeposition(5)
(0, 1, 1, 0, 1, 1, 1, 0)

Find the node position where these two differ, i.e., (0, 1, 1, 0, 1, 1, 1), and remove that node:

>>> print t
(ROOT
(S
(NP (PRP They))
(VP
(VBD said)
(SBAR
(S
(NP (NNP John))
(VP (VBD married) (NP (NNP Marry) (JJ last) (NNS Month))))))))
>>> del t[(0, 1, 1, 0, 1, 1, 1)]
>>> print t
(ROOT
(S
(NP (PRP They))
(VP
(VBD said)
(SBAR
(S
(NP (NNP John))
(VP (VBD married) (NP (NNP Marry) (NNS Month))))))))

Repeat until the tree is pruned. When doing this on the left side, the index of your substring decreases every time you remove a node. You can solve this by counting from the right instead. I.e.:

>>> t.leaves()[-5]
'said'
>>> t.leaves()[-4]
'John'
>>> t.leaf_treeposition(len(t.leaves())-5)
(0, 1, 0, 0)
>>> t[0, 1, 0, 0]
'said'
>>> t.leaf_treeposition(len(t.leaves())-4)
(0, 1, 1, 0, 0, 0, 0)
>>> t[0, 1, 1, 0, 0, 0, 0]
'John'

Best is probably to start with .treeposition_spanning_leaves() to get the minimal subtree and then start pruning that one. Also, note that NLTK trees are mutable, which means that when you remove nodes, the original tree is also affected. If you don't want that you should use t.copy(deep=True) before you start pruning.

Good luck!
/Peter

n.shir...@somaiya.edu

unread,
Mar 21, 2018, 6:15:23 PM3/21/18
to nltk-users
Hello Vivian, 
I am working on nltk trees for my research work, I would like to know the process of pruning the tree that is mentioned (Deleting nodes which are tagged). I am trying to implement a parse and trim method for sentence compression and thats why I need it. 
Thank you!
Regards

           
Reply all
Reply to author
Forward
0 new messages