Viterbi parser providing no trees, but Chart parser does. What am I doing wrong?

Robert Slowley

Mar 9, 2010, 6:08:54 PM

to nltk-users

I've just started using NLTK, and it's not quite doing what I'd
expect. I have a PCFG and a sentence. When I parse the sentence with a
ChartParser (calling nbest_parse()) I get two valid trees, which is
expected. However, when I parse it with a ViterbiParser (calling
parse()) I get no tree back :-/ I'd expect one tree to be returned
(the most likely tree).

I assume I'm making some obvious newbie mistake. I'd really appreciate
any assistance on why things are not working as expected.

Here follows my Python source, and the output it produces:

import nltk

# Load the PCFG from a text file (one production per line).
grammar_text = open('GENIA.pgrammar', 'r').read()

pcfg_grammar = nltk.parse_pcfg(grammar_text)
print "Grammar:"
print pcfg_grammar
print

# The sentence is pre-tokenized, so splitting on whitespace is enough.
sent = ("Glucocorticoid resistance in the squirrel monkey is associated "
        "with overexpression of the immunophilin FKBP51 .").split()
chart_parser = nltk.ChartParser(pcfg_grammar)
trees = chart_parser.nbest_parse(sent)

print "Parse trees for Chart Parser:"
for tree in trees:
    print tree

print

viterbi_parser = nltk.ViterbiParser(pcfg_grammar)
trees = viterbi_parser.parse(sent)
print "Parse tree for Viterbi Parser:"
for tree in trees:
    print tree

-----

Running “nltk-pcfg.py”…
Python 2.6.3

Grammar:
Grammar with 23 productions (start state = S)
S -> NP VP PERIOD [1.0]
NP -> [0.142857142857]
NP -> NN [0.142857142857]
NP -> NN NN [0.142857142857]
NP -> DT NN NN [0.285714285714]
NP -> NP PP [0.285714285714]
PP -> IN NP [1.0]
VP -> VBN NP PP [0.5]
VP -> VBZ VP [0.5]
NN -> 'FKBP51' [0.142857142857]
NN -> 'immunophilin' [0.142857142857]
NN -> 'overexpression' [0.142857142857]
NN -> 'squirrel' [0.142857142857]
NN -> 'resistance' [0.142857142857]
NN -> 'monkey' [0.142857142857]
NN -> 'Glucocorticoid' [0.142857142857]
VBN -> 'associated' [1.0]
VBZ -> 'is' [1.0]
PERIOD -> '.' [1.0]
DT -> 'the' [1.0]
IN -> 'of' [0.333333333333]
IN -> 'in' [0.333333333333]
IN -> 'with' [0.333333333333]

Parse trees for Chart Parser:
(S
  (NP
    (NP (NN Glucocorticoid) (NN resistance))
    (PP (IN in) (NP (DT the) (NN squirrel) (NN monkey))))
  (VP
    (VBZ is)
    (VP
      (VBN associated)
      (NP (NP ) (PP (IN with) (NP (NN overexpression))))
      (PP (IN of) (NP (DT the) (NN immunophilin) (NN FKBP51)))))
  (PERIOD .))
(S
  (NP
    (NP (NN Glucocorticoid) (NN resistance))
    (PP (IN in) (NP (DT the) (NN squirrel) (NN monkey))))
  (VP
    (VBZ is)
    (VP
      (VBN associated)
      (NP )
      (PP
        (IN with)
        (NP
          (NP (NN overexpression))
          (PP (IN of) (NP (DT the) (NN immunophilin) (NN FKBP51)))))))
  (PERIOD .))

Parse tree for Viterbi Parser:
-----

Thank you,
-Rob

Robert Slowley

Mar 9, 2010, 6:12:45 PM

to nltk-users

> viterbi_parser = nltk.ViterbiParser(pcfg_grammar)
> trees = viterbi_parser.parse(sent)
> print "Parse tree for Viterbi Parser:"
> for tree in trees:
>     print tree

... of course as soon as I post this it becomes obvious. parse here
returns a single tree, so I don't want the for loop, I want:

viterbi_parser = nltk.ViterbiParser(pcfg_grammar)
tree = viterbi_parser.parse(sent)

print "Parse tree for Viterbi Parser:"

print tree

Which works perfectly.

Thank you !
-Rob

Robert Slowley

Mar 9, 2010, 6:14:36 PM

to nltk-users

> Which works perfectly.
Gah. No it doesn't, it just outputs [].

:-/

So please ignore my last message, and if anyone can help with the
problem outlined in the first message I'd appreciate it.

-Rob

Robert Slowley

Mar 9, 2010, 6:16:08 PM

to nltk-users

If I set trace to 2 I get this:

Inserting tokens into the most likely constituents table...
Insert: |=..............| Glucocorticoid
Insert: |.=.............| resistance
Insert: |..=............| in
Insert: |...=...........| the
Insert: |....=..........| squirrel
Insert: |.....=.........| monkey
Insert: |......=........| is
Insert: |.......=.......| associated
Insert: |........=......| with
Insert: |.........=.....| overexpression
Insert: |..........=....| of
Insert: |...........=...| the
Insert: |............=..| immunophilin
Insert: |.............=.| FKBP51
Insert: |..............=| .
Finding the most likely constituents spanning 1 text elements...
Insert: |=..............| NN -> 'Glucocorticoid' [0.142857142857]
Insert: |=..............| NP -> NN [0.142857142857]
Insert: |.=.............| NN -> 'resistance' [0.142857142857]
Insert: |.=.............| NP -> NN [0.142857142857]
Insert: |..=............| IN -> 'in' [0.333333333333]
Insert: |...=...........| DT -> 'the' [1.0]
Insert: |....=..........| NN -> 'squirrel' [0.142857142857]
Insert: |....=..........| NP -> NN [0.142857142857]
Insert: |.....=.........| NN -> 'monkey' [0.142857142857]
Insert: |.....=.........| NP -> NN [0.142857142857]
Insert: |......=........| VBZ -> 'is' [1.0]
Insert: |.......=.......| VBN -> 'associated' [1.0]
Insert: |........=......| IN -> 'with' [0.333333333333]
Insert: |.........=.....| NN -> 'overexpression' [0.142857142857]
Insert: |.........=.....| NP -> NN [0.142857142857]
Insert: |..........=....| IN -> 'of' [0.333333333333]
Insert: |...........=...| DT -> 'the' [1.0]
Insert: |............=..| NN -> 'immunophilin' [0.142857142857]
Insert: |............=..| NP -> NN [0.142857142857]
Insert: |.............=.| NN -> 'FKBP51' [0.142857142857]
Insert: |.............=.| NP -> NN [0.142857142857]
Insert: |..............=| PERIOD -> '.' [1.0]
Finding the most likely constituents spanning 2 text elements...
Insert: |==.............| NP -> NN NN [0.142857142857]
Insert: |....==.........| NP -> NN NN [0.142857142857]
Insert: |........==.....| PP -> IN NP [1.0]
Insert: |............==.| NP -> NN NN [0.142857142857]
Finding the most likely constituents spanning 3 text elements...
Insert: |...===.........| NP -> DT NN NN [0.285714285714]
Insert: |...........===.| NP -> DT NN NN [0.285714285714]
Finding the most likely constituents spanning 4 text elements...
Insert: |..====.........| PP -> IN NP [1.0]
Insert: |..........====.| PP -> IN NP [1.0]
Finding the most likely constituents spanning 5 text elements...
Insert: |.=====.........| NP -> NP PP [0.285714285714]
Insert: |.........=====.| NP -> NP PP [0.285714285714]
Finding the most likely constituents spanning 6 text elements...
Insert: |======.........| NP -> NP PP [0.285714285714]
Insert: |........======.| PP -> IN NP [1.0]
Finding the most likely constituents spanning 7 text elements...
Finding the most likely constituents spanning 8 text elements...
Finding the most likely constituents spanning 9 text elements...
Finding the most likely constituents spanning 10 text elements...
Finding the most likely constituents spanning 11 text elements...
Finding the most likely constituents spanning 12 text elements...
Finding the most likely constituents spanning 13 text elements...
Finding the most likely constituents spanning 14 text elements...
Finding the most likely constituents spanning 15 text elements...
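
One way to read this trace: the table is filled for spans of 1 up to 15
text elements, and the production NP -> [] never shows up in it. An
empty constituent covers zero text elements, so a loop that starts at
width 1 has no slot for it, and VP -> VBN NP PP (with the empty NP)
can never be completed. The sketch below is a toy reimplementation in
plain Python 3, not NLTK's code. The names viterbi_table and best_match
and the three-word grammar are made up for illustration, and whether
NLTK's ViterbiParser fails for exactly this reason is an assumption
that merely fits the trace.

```python
# Toy bottom-up Viterbi table (illustrative only, not NLTK's implementation).
# Productions are (lhs, rhs, probability); terminals are the token strings.
TOY_GRAMMAR = [
    ("VP", ("VBN", "NP", "PP"), 1.0),
    ("NP", (), 0.5),                 # the empty (epsilon) production
    ("NP", ("NN",), 0.5),
    ("PP", ("IN", "NP"), 1.0),
    ("VBN", ("associated",), 1.0),
    ("IN", ("with",), 1.0),
    ("NN", ("overexpression",), 1.0),
]


def best_match(best, rhs, start, end):
    """Best probability of covering [start, end) with the symbols in rhs,
    where every child must span at least one token. Returns None if impossible."""
    if not rhs:
        return 1.0 if start == end else None
    first, rest = rhs[0], rhs[1:]
    result = None
    # Leave at least one token for each remaining child.
    for mid in range(start + 1, end - len(rest) + 1):
        p_first = best.get((start, mid, first))
        if p_first is None:
            continue
        p_rest = best_match(best, rest, mid, end)
        if p_rest is None:
            continue
        if result is None or p_first * p_rest > result:
            result = p_first * p_rest
    return result


def viterbi_table(tokens, productions):
    """Map (start, end, symbol) to the probability of its best analysis."""
    n = len(tokens)
    best = {(i, i + 1, tok): 1.0 for i, tok in enumerate(tokens)}
    for width in range(1, n + 1):          # widths start at 1, never 0
        for start in range(n - width + 1):
            end = start + width
            changed = True
            while changed:                 # repeat so unary chains are found
                changed = False
                for lhs, rhs, prob in productions:
                    p = best_match(best, rhs, start, end)
                    if p is not None and p * prob > best.get((start, end, lhs), 0.0):
                        best[(start, end, lhs)] = p * prob
                        changed = True
    return best


tokens = ["associated", "with", "overexpression"]
table = viterbi_table(tokens, TOY_GRAMMAR)
print((1, 3, "PP") in table)   # PP over "with overexpression"
print((0, 3, "VP") in table)   # VP over the whole input
```

In this toy run the PP over "with overexpression" is found, but no VP
covers the whole input, which mirrors the trace above where nothing is
inserted for spans of 7 or more text elements.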

Steven Bird

Mar 9, 2010, 6:49:44 PM

to nltk-...@googlegroups.com

Robert, please post your GENIA.pgrammar file here as well. -Steven Bird

Robert Slowley

Mar 9, 2010, 6:58:11 PM

to nltk-users

> Robert, please post your GENIA.pgrammar file here as well. -Steven Bird

Here it is:

S -> NP VP PERIOD [1.0]

NP -> [0.142857142857143]
NP -> NN [0.142857142857143]
NP -> NN NN [0.142857142857143]
NP -> DT NN NN [0.285714285714286]
NP -> NP PP [0.285714285714286]

PP -> IN NP [1.0]

VP -> VBN NP PP [0.5]
VP -> VBZ VP [0.5]

NN -> 'FKBP51' [0.142857142857143]
NN -> 'immunophilin' [0.142857142857143]
NN -> 'overexpression' [0.142857142857143]
NN -> 'squirrel' [0.142857142857143]
NN -> 'resistance' [0.142857142857143]
NN -> 'monkey' [0.142857142857143]
NN -> 'Glucocorticoid' [0.142857142857143]

VBN -> 'associated' [1.0]
VBZ -> 'is' [1.0]
PERIOD -> '.' [1.0]
DT -> 'the' [1.0]

IN -> 'of' [0.333333333333333]
IN -> 'in' [0.333333333333333]
IN -> 'with' [0.333333333333333]

I'm wondering if it has to do with the rule with an empty on the RHS?
NP -> [0.142857142857143]

Is that the correct way to represent such a thing in NLTK?
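
A quick sanity check that can be run on a grammar file like this one,
independent of NLTK: read each line as LHS, RHS and probability, confirm
that the empty rule really comes out with a zero-length RHS, and confirm
that the probabilities for each LHS sum to 1. This is only a sketch in
plain Python 3. The regular expression and the helper name read_rules
are assumptions made for illustration and say nothing about how
parse_pcfg itself reads the text.

```python
import math
import re
from collections import defaultdict

# A few rules copied from the grammar above.
GRAMMAR_TEXT = """
NP -> [0.142857142857143]
NP -> NN [0.142857142857143]
NP -> NN NN [0.142857142857143]
NP -> DT NN NN [0.285714285714286]
NP -> NP PP [0.285714285714286]
PP -> IN NP [1.0]
IN -> 'of' [0.333333333333333]
IN -> 'in' [0.333333333333333]
IN -> 'with' [0.333333333333333]
"""

# LHS, then "->", then an optional RHS, then the probability in brackets.
RULE = re.compile(r"^(\w+)\s*->\s*(.*?)\s*\[([0-9.]+)\]$")


def read_rules(text):
    """Return a list of (lhs, rhs_tuple, probability) from grammar text."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = RULE.match(line)
        if match is None:
            raise ValueError("cannot read rule: %r" % line)
        lhs, rhs, prob = match.groups()
        rules.append((lhs, tuple(rhs.split()), float(prob)))
    return rules


rules = read_rules(GRAMMAR_TEXT)

# Sum the probabilities per left-hand side.
totals = defaultdict(float)
for lhs, rhs, prob in rules:
    totals[lhs] += prob

empty_rules = [(lhs, prob) for lhs, rhs, prob in rules if rhs == ()]
print("empty rules:", empty_rules)
for lhs in sorted(totals):
    print(lhs, "sums to 1:", math.isclose(totals[lhs], 1.0, rel_tol=1e-9))
```

If both checks pass, the file is at least a well-formed probability
distribution per LHS, so any remaining problem lies in how the parser
uses the rules rather than in how they are written.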

Robert Slowley

Mar 9, 2010, 7:05:33 PM

to nltk-users

The sentence this should be a grammar for (from the GENIA treebank) is
this one:

<sentence id="S1">
  <cons cat="S">
    <cons cat="NP" id="i1" role="SBJ">
      <cons cat="NP">
        <tok cat="NN">Glucocorticoid</tok>
        <tok cat="NN">resistance</tok>
      </cons>
      <cons cat="PP">
        <tok cat="IN">in</tok>
        <cons cat="NP">
          <tok cat="DT">the</tok>
          <tok cat="NN">squirrel</tok>
          <tok cat="NN">monkey</tok>
        </cons>
      </cons>
    </cons>
    <cons cat="VP">
      <tok cat="VBZ">is</tok>
      <cons cat="VP">
        <tok cat="VBN">associated</tok>
        <cons cat="NP" ref="i1" null="NONE"/>
        <cons cat="PP">
          <tok cat="IN">with</tok>
          <cons cat="NP">
            <cons cat="NP">
              <tok cat="NN">overexpression</tok>
            </cons>
            <cons cat="PP">
              <tok cat="IN">of</tok>
              <cons cat="NP">
                <tok cat="DT">the</tok>
                <tok cat="NN">immunophilin</tok>
                <tok cat="NN">FKBP51</tok>
              </cons>
            </cons>
          </cons>
        </cons>
      </cons>
    </cons>
    <tok cat="PERIOD">.</tok>
  </cons>
</sentence>

When I say "empty on the RHS" I think the correct terminology is
epsilon, not empty... That's meant to represent the following line
from the Treebank's XML:
<cons cat="NP" ref="i1" null="NONE"/>
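
To show how a null constituent like that turns into an epsilon rule,
here is a rough sketch that walks a fragment of the XML above and counts
one production per element, then turns the counts into relative
frequencies. It uses only the Python 3 standard library. The helper name
productions is made up, and estimating probabilities as count divided by
the total for that LHS is an assumption about how such a grammar would
typically be built, not a description of the script actually used here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Fragment of the sentence above, containing the null NP.
SAMPLE = """
<cons cat="VP">
  <tok cat="VBN">associated</tok>
  <cons cat="NP" ref="i1" null="NONE"/>
  <cons cat="PP">
    <tok cat="IN">with</tok>
    <cons cat="NP">
      <tok cat="NN">overexpression</tok>
    </cons>
  </cons>
</cons>
"""


def productions(node):
    """Yield (lhs, rhs) for this element and all of its descendants."""
    if node.tag == "tok":
        # Lexical rule: category -> word
        yield node.get("cat"), (node.text,)
        return
    children = list(node)
    # A <cons> with no children yields an empty right-hand side.
    yield node.get("cat"), tuple(child.get("cat") for child in children)
    for child in children:
        yield from productions(child)


counts = Counter(productions(ET.fromstring(SAMPLE.strip())))

# Relative-frequency estimate: count(lhs -> rhs) / count(lhs).
lhs_totals = Counter()
for (lhs, rhs), count in counts.items():
    lhs_totals[lhs] += count
probs = {rule: count / lhs_totals[rule[0]] for rule, count in counts.items()}

for (lhs, rhs), p in sorted(probs.items()):
    print(lhs, "->", " ".join(rhs), "[%s]" % p)
```

In this fragment NP is seen twice, once empty and once as NP -> NN, so
each of those two rules gets probability 0.5.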

Steven Bird

Mar 9, 2010, 7:25:38 PM

to nltk-...@googlegroups.com

On 10 March 2010 10:58, Robert Slowley <robert...@gmail.com> wrote:
> I'm wondering if it has to do with the rule with an empty on the RHS?
> NP -> [0.142857142857143]

No, that's fine. You can see that empty productions work fine:

----
from nltk import parse_pcfg, ViterbiParser

pcfg_grammar = parse_pcfg("""
S -> [0.5]
S -> 'foo' [0.5]
""")

parser = ViterbiParser(pcfg_grammar)
print parser.nbest_parse(['foo'])
----

Prints:
[ProbabilisticTree('S', ['foo']) (p=0.5)]

So that's not the problem.

Note that a Viterbi parser only stores the most likely analysis for
each given span. Does that explain why there's no result?

-Steven Bird
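
To make "most likely analysis" concrete: under a PCFG the probability of
a tree is the product of the probabilities of the productions it uses.
The sketch below scores the two trees printed by the chart parser
earlier in this thread, using exact fractions. It is plain Python 3 and
does not call NLTK. The helper names read_tree and tree_prob are made
up, and reading 0.142857..., 0.285714... and 0.333333... as 1/7, 2/7
and 1/3 is an assumption.

```python
import re
from fractions import Fraction

# Phrase-level rules from the grammar in this thread, as exact fractions.
RULE_PROB = {
    ("S", ("NP", "VP", "PERIOD")): Fraction(1),
    ("NP", ()): Fraction(1, 7),
    ("NP", ("NN",)): Fraction(1, 7),
    ("NP", ("NN", "NN")): Fraction(1, 7),
    ("NP", ("DT", "NN", "NN")): Fraction(2, 7),
    ("NP", ("NP", "PP")): Fraction(2, 7),
    ("PP", ("IN", "NP")): Fraction(1),
    ("VP", ("VBN", "NP", "PP")): Fraction(1, 2),
    ("VP", ("VBZ", "VP")): Fraction(1, 2),
}
# Every word under a given tag has the same probability in this grammar.
TAG_PROB = {"NN": Fraction(1, 7), "IN": Fraction(1, 3), "VBN": Fraction(1),
            "VBZ": Fraction(1), "DT": Fraction(1), "PERIOD": Fraction(1)}

TREE_1 = """(S (NP (NP (NN Glucocorticoid) (NN resistance))
  (PP (IN in) (NP (DT the) (NN squirrel) (NN monkey))))
  (VP (VBZ is) (VP (VBN associated)
    (NP (NP ) (PP (IN with) (NP (NN overexpression))))
    (PP (IN of) (NP (DT the) (NN immunophilin) (NN FKBP51)))))
  (PERIOD .))"""

TREE_2 = """(S (NP (NP (NN Glucocorticoid) (NN resistance))
  (PP (IN in) (NP (DT the) (NN squirrel) (NN monkey))))
  (VP (VBZ is) (VP (VBN associated) (NP )
    (PP (IN with) (NP (NP (NN overexpression))
      (PP (IN of) (NP (DT the) (NN immunophilin) (NN FKBP51)))))))
  (PERIOD .))"""


def _read(tokens, pos):
    """Read one bracketed node starting at tokens[pos]; return (node, next_pos)."""
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = _read(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1


def read_tree(text):
    """Turn a bracketed tree string into nested (label, children) tuples."""
    node, _ = _read(re.findall(r"\(|\)|[^\s()]+", text), 0)
    return node


def tree_prob(node):
    """Product of the probabilities of all productions used in the tree."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return TAG_PROB[label]            # lexical rule, e.g. NN -> 'monkey'
    p = RULE_PROB[(label, tuple(child[0] for child in children))]
    for child in children:
        p *= tree_prob(child)
    return p


p1 = tree_prob(read_tree(TREE_1))
p2 = tree_prob(read_tree(TREE_2))
print(p1 == p2)
```

Both trees use exactly the same productions, only arranged differently,
so they come out with the same probability. A parser that keeps a single
best analysis per span would therefore return at most one of the two.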

Robert Slowley

Mar 9, 2010, 7:28:28 PM

to nltk-users

Which parsers should I be considering then if not the Viterbi parser?

The example grammar here is quite small, the actual grammar is very
large (31,000 rules for the whole treebank), so I'm assuming I need a
probabilistic parser to parse any sentence in reasonable time.

Steven Bird

Mar 9, 2010, 7:33:51 PM

to nltk-...@googlegroups.com

NLTK's probabilistic parsers are not (yet) serious implementations.
We're talking about fixing this in the coming months. For now you
should use one of the high performance statistical parsers, such as
the Stanford Parser. -Steven Bird


Victor Miclovich

Mar 9, 2010, 7:35:59 PM

to nltk-...@googlegroups.com

and hopefully a plugin that will interface with the Stanford parser should be out then... ;)

Robert Slowley

Mar 9, 2010, 7:39:57 PM

to nltk-users

What do you mean by 'not serious implementation' ?

What limitations are there?

Which parsers are the statistical ones?

Robert Slowley

Mar 14, 2010, 9:30:30 AM

to nltk-users

Hi Steven -

What do you mean by 'not serious implementation' ?

What limitations are there?

Which parsers are the statistical ones?

I'm using this for some university coursework and I have been told to
use the NLTK probabilistic parsers - are you saying that these are
very buggy and don't work for valid grammars?

-Rob

On Mar 10, 12:33 am, Steven Bird <stevenbi...@gmail.com> wrote:
> NLTK's probabilistic parsers are not (yet) serious implementations.
> We're talking about fixing this in the coming months. For now you
> should use one of the high performance statistical parsers, such as
> the Stanford Parser. -Steven Bird
>
