I assume I'm making some obvious newbie mistake. I'd really appreciate
any assistance on why things are not working as expected.
Here follows my Python source, and the output it produces:
import nltk

# Load the probabilistic grammar from disk.
grammar_text = open('GENIA.pgrammar', 'r').read()
pcfg_grammar = nltk.parse_pcfg(grammar_text)
print "Grammar:"
print pcfg_grammar
print

sent = ("Glucocorticoid resistance in the squirrel monkey is associated "
        "with overexpression of the immunophilin FKBP51 .").split()

# Chart parser: ignores the probabilities and lists every parse it finds.
chart_parser = nltk.ChartParser(pcfg_grammar)
trees = chart_parser.nbest_parse(sent)
print "Parse trees for Chart Parser:"
for tree in trees:
    print tree

# Viterbi parser: expected to give the most likely parse.
viterbi_parser = nltk.ViterbiParser(pcfg_grammar)
trees = viterbi_parser.parse(sent)
print "Parse tree for Viterbi Parser:"
for tree in trees:
    print tree
-----
Running “nltk-pcfg.py”…
Python 2.6.3
Grammar:
Grammar with 23 productions (start state = S)
S -> NP VP PERIOD [1.0]
NP -> [0.142857142857]
NP -> NN [0.142857142857]
NP -> NN NN [0.142857142857]
NP -> DT NN NN [0.285714285714]
NP -> NP PP [0.285714285714]
PP -> IN NP [1.0]
VP -> VBN NP PP [0.5]
VP -> VBZ VP [0.5]
NN -> 'FKBP51' [0.142857142857]
NN -> 'immunophilin' [0.142857142857]
NN -> 'overexpression' [0.142857142857]
NN -> 'squirrel' [0.142857142857]
NN -> 'resistance' [0.142857142857]
NN -> 'monkey' [0.142857142857]
NN -> 'Glucocorticoid' [0.142857142857]
VBN -> 'associated' [1.0]
VBZ -> 'is' [1.0]
PERIOD -> '.' [1.0]
DT -> 'the' [1.0]
IN -> 'of' [0.333333333333]
IN -> 'in' [0.333333333333]
IN -> 'with' [0.333333333333]
Parse trees for Chart Parser:
(S
  (NP
    (NP (NN Glucocorticoid) (NN resistance))
    (PP (IN in) (NP (DT the) (NN squirrel) (NN monkey))))
  (VP
    (VBZ is)
    (VP
      (VBN associated)
      (NP (NP ) (PP (IN with) (NP (NN overexpression))))
      (PP (IN of) (NP (DT the) (NN immunophilin) (NN FKBP51)))))
  (PERIOD .))
(S
  (NP
    (NP (NN Glucocorticoid) (NN resistance))
    (PP (IN in) (NP (DT the) (NN squirrel) (NN monkey))))
  (VP
    (VBZ is)
    (VP
      (VBN associated)
      (NP )
      (PP
        (IN with)
        (NP
          (NP (NN overexpression))
          (PP (IN of) (NP (DT the) (NN immunophilin) (NN FKBP51)))))))
  (PERIOD .))
Parse tree for Viterbi Parser:
-----
Thank you,
-Rob
viterbi_parser = nltk.ViterbiParser(pcfg_grammar)
tree = viterbi_parser.parse(sent)
print "Parse tree for Viterbi Parser:"
print tree
Which works perfectly.
Thank you!
-Rob
:-/
So please ignore my last message, and if anyone can help with the
problem outlined in the first message I'd appreciate it.
-Rob
Inserting tokens into the most likely constituents table...
Insert: |=..............| Glucocorticoid
Insert: |.=.............| resistance
Insert: |..=............| in
Insert: |...=...........| the
Insert: |....=..........| squirrel
Insert: |.....=.........| monkey
Insert: |......=........| is
Insert: |.......=.......| associated
Insert: |........=......| with
Insert: |.........=.....| overexpression
Insert: |..........=....| of
Insert: |...........=...| the
Insert: |............=..| immunophilin
Insert: |.............=.| FKBP51
Insert: |..............=| .
Finding the most likely constituents spanning 1 text elements...
Insert: |=..............| NN -> 'Glucocorticoid' [0.142857142857]
Insert: |=..............| NP -> NN [0.142857142857]
Insert: |.=.............| NN -> 'resistance' [0.142857142857]
Insert: |.=.............| NP -> NN [0.142857142857]
Insert: |..=............| IN -> 'in' [0.333333333333]
Insert: |...=...........| DT -> 'the' [1.0]
Insert: |....=..........| NN -> 'squirrel' [0.142857142857]
Insert: |....=..........| NP -> NN [0.142857142857]
Insert: |.....=.........| NN -> 'monkey' [0.142857142857]
Insert: |.....=.........| NP -> NN [0.142857142857]
Insert: |......=........| VBZ -> 'is' [1.0]
Insert: |.......=.......| VBN -> 'associated' [1.0]
Insert: |........=......| IN -> 'with' [0.333333333333]
Insert: |.........=.....| NN -> 'overexpression' [0.142857142857]
Insert: |.........=.....| NP -> NN [0.142857142857]
Insert: |..........=....| IN -> 'of' [0.333333333333]
Insert: |...........=...| DT -> 'the' [1.0]
Insert: |............=..| NN -> 'immunophilin' [0.142857142857]
Insert: |............=..| NP -> NN [0.142857142857]
Insert: |.............=.| NN -> 'FKBP51' [0.142857142857]
Insert: |.............=.| NP -> NN [0.142857142857]
Insert: |..............=| PERIOD -> '.' [1.0]
Finding the most likely constituents spanning 2 text elements...
Insert: |==.............| NP -> NN NN [0.142857142857]
Insert: |....==.........| NP -> NN NN [0.142857142857]
Insert: |........==.....| PP -> IN NP [1.0]
Insert: |............==.| NP -> NN NN [0.142857142857]
Finding the most likely constituents spanning 3 text elements...
Insert: |...===.........| NP -> DT NN NN [0.285714285714]
Insert: |...........===.| NP -> DT NN NN [0.285714285714]
Finding the most likely constituents spanning 4 text elements...
Insert: |..====.........| PP -> IN NP [1.0]
Insert: |..........====.| PP -> IN NP [1.0]
Finding the most likely constituents spanning 5 text elements...
Insert: |.=====.........| NP -> NP PP [0.285714285714]
Insert: |.........=====.| NP -> NP PP [0.285714285714]
Finding the most likely constituents spanning 6 text elements...
Insert: |======.........| NP -> NP PP [0.285714285714]
Insert: |........======.| PP -> IN NP [1.0]
Finding the most likely constituents spanning 7 text elements...
Finding the most likely constituents spanning 8 text elements...
Finding the most likely constituents spanning 9 text elements...
Finding the most likely constituents spanning 10 text elements...
Finding the most likely constituents spanning 11 text elements...
Finding the most likely constituents spanning 12 text elements...
Finding the most likely constituents spanning 13 text elements...
Finding the most likely constituents spanning 14 text elements...
Finding the most likely constituents spanning 15 text elements...
S -> NP VP PERIOD [1.0]
NP -> [0.142857142857143]
NP -> NN [0.142857142857143]
NP -> NN NN [0.142857142857143]
NP -> DT NN NN [0.285714285714286]
NP -> NP PP [0.285714285714286]
PP -> IN NP [1.0]
VP -> VBN NP PP [0.5]
VP -> VBZ VP [0.5]
NN -> 'FKBP51' [0.142857142857143]
NN -> 'immunophilin' [0.142857142857143]
NN -> 'overexpression' [0.142857142857143]
NN -> 'squirrel' [0.142857142857143]
NN -> 'resistance' [0.142857142857143]
NN -> 'monkey' [0.142857142857143]
NN -> 'Glucocorticoid' [0.142857142857143]
VBN -> 'associated' [1.0]
VBZ -> 'is' [1.0]
PERIOD -> '.' [1.0]
DT -> 'the' [1.0]
IN -> 'of' [0.333333333333333]
IN -> 'in' [0.333333333333333]
IN -> 'with' [0.333333333333333]
I'm wondering if it has to do with the rule with an empty on the RHS?
NP -> [0.142857142857143]
Is that the correct way to represent such a thing in NLTK?
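A quick sanity check that can be run without NLTK: the sketch below reads the grammar listing above as plain text, confirms that the probabilities for each left-hand side sum to 1, and reports which symbols have a rule with nothing on the right-hand side. The regular expression and variable names are my own, and the check says nothing about how NLTK's grammar reader itself treats the empty rule.

```python
import re
from collections import defaultdict

# The grammar exactly as listed above.
GRAMMAR = """\
S -> NP VP PERIOD [1.0]
NP -> [0.142857142857143]
NP -> NN [0.142857142857143]
NP -> NN NN [0.142857142857143]
NP -> DT NN NN [0.285714285714286]
NP -> NP PP [0.285714285714286]
PP -> IN NP [1.0]
VP -> VBN NP PP [0.5]
VP -> VBZ VP [0.5]
NN -> 'FKBP51' [0.142857142857143]
NN -> 'immunophilin' [0.142857142857143]
NN -> 'overexpression' [0.142857142857143]
NN -> 'squirrel' [0.142857142857143]
NN -> 'resistance' [0.142857142857143]
NN -> 'monkey' [0.142857142857143]
NN -> 'Glucocorticoid' [0.142857142857143]
VBN -> 'associated' [1.0]
VBZ -> 'is' [1.0]
PERIOD -> '.' [1.0]
DT -> 'the' [1.0]
IN -> 'of' [0.333333333333333]
IN -> 'in' [0.333333333333333]
IN -> 'with' [0.333333333333333]
"""

# LHS, optional RHS, then the probability in square brackets.
RULE = re.compile(r"^(\w+) -> (.*?)\s*\[([0-9.]+)\]$")

totals = defaultdict(float)   # LHS -> summed probability
empty_lhs = []                # LHS symbols that have an empty (epsilon) rule

for line in GRAMMAR.splitlines():
    match = RULE.match(line.strip())
    lhs, rhs, prob = match.group(1), match.group(2), float(match.group(3))
    totals[lhs] += prob
    if rhs == "":
        empty_lhs.append(lhs)

for lhs in sorted(totals):
    print("%-7s sums to %.12f" % (lhs, totals[lhs]))
print("Symbols with an empty right-hand side: %s" % empty_lhs)
```

If every sum is 1 (up to rounding), the listing is at least a well-formed probability distribution per symbol, so the empty NP rule is not breaking the normalisation.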
<sentence id="S1">
<cons cat="S">
<cons cat="NP" id="i1" role="SBJ">
<cons cat="NP">
<tok cat="NN">Glucocorticoid</tok>
<tok cat="NN">resistance</tok>
</cons>
<cons cat="PP">
<tok cat="IN">in</tok>
<cons cat="NP">
<tok cat="DT">the</tok>
<tok cat="NN">squirrel</tok>
<tok cat="NN">monkey</tok>
</cons>
</cons>
</cons>
<cons cat="VP">
<tok cat="VBZ">is</tok>
<cons cat="VP">
<tok cat="VBN">associated</tok>
<cons cat="NP" ref="i1" null="NONE"/>
<cons cat="PP">
<tok cat="IN">with</tok>
<cons cat="NP">
<cons cat="NP">
<tok cat="NN">overexpression</tok>
</cons>
<cons cat="PP">
<tok cat="IN">of</tok>
<cons cat="NP">
<tok cat="DT">the</tok>
<tok cat="NN">immunophilin</tok>
<tok cat="NN">FKBP51</tok>
</cons>
</cons>
</cons>
</cons>
</cons>
</cons>
<tok cat="PERIOD">.</tok>
</cons>
</sentence>
When I say "empty on the RHS" I think the correct terminology is
epsilon, not empty... That's meant to represent the following line
from the Treebank's XML:
<cons cat="NP" ref="i1" null="NONE"/>
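For anyone wondering where the 1/7 on the empty NP rule comes from, here is a minimal sketch of relative-frequency estimation over the single sentence above. It uses only the standard library; the helper name `productions` and the compacted copy of the XML are mine, and this is not necessarily how the original grammar file was produced.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from fractions import Fraction

# The sentence from above, compacted but structurally unchanged.
SENTENCE_XML = """
<sentence id="S1"><cons cat="S">
  <cons cat="NP" id="i1" role="SBJ">
    <cons cat="NP"><tok cat="NN">Glucocorticoid</tok><tok cat="NN">resistance</tok></cons>
    <cons cat="PP"><tok cat="IN">in</tok>
      <cons cat="NP"><tok cat="DT">the</tok><tok cat="NN">squirrel</tok><tok cat="NN">monkey</tok></cons>
    </cons>
  </cons>
  <cons cat="VP"><tok cat="VBZ">is</tok>
    <cons cat="VP"><tok cat="VBN">associated</tok>
      <cons cat="NP" ref="i1" null="NONE"/>
      <cons cat="PP"><tok cat="IN">with</tok>
        <cons cat="NP">
          <cons cat="NP"><tok cat="NN">overexpression</tok></cons>
          <cons cat="PP"><tok cat="IN">of</tok>
            <cons cat="NP"><tok cat="DT">the</tok><tok cat="NN">immunophilin</tok><tok cat="NN">FKBP51</tok></cons>
          </cons>
        </cons>
      </cons>
    </cons>
  </cons>
  <tok cat="PERIOD">.</tok>
</cons></sentence>
"""

def productions(node):
    """Yield (lhs, rhs) for this element and everything below it."""
    if node.tag == "tok":
        # Lexical rule: category -> word.
        yield node.get("cat"), (node.text,)
        return
    children = list(node)
    # A <cons> with no children gives an empty right-hand side.
    yield node.get("cat"), tuple(child.get("cat") for child in children)
    for child in children:
        for production in productions(child):
            yield production

root = ET.fromstring(SENTENCE_XML.strip())
counts = Counter(productions(root.find("cons")))

# Relative frequency: count(lhs -> rhs) / count(lhs).
lhs_totals = Counter()
for (lhs, rhs), count in counts.items():
    lhs_totals[lhs] += count
probs = dict((prod, Fraction(count, lhs_totals[prod[0]]))
             for prod, count in counts.items())

for (lhs, rhs), p in sorted(probs.items()):
    print("%s -> %s [%s]" % (lhs, " ".join(rhs), p))
```

The sentence contains seven NP nodes, one of which is the childless `null="NONE"` element, which is how the empty rule ends up with probability 1/7.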
No, that's fine. You can see that empty productions work fine:
----
from nltk import parse_pcfg, ViterbiParser
pcfg_grammar = parse_pcfg("""
S -> [0.5]
S -> 'foo' [0.5]
""")
parser = ViterbiParser(pcfg_grammar)
print parser.nbest_parse(['foo'])
----
Prints:
[ProbabilisticTree('S', ['foo']) (p=0.5)]
So that's not the problem.
Note that a Viterbi parser only stores the most likely analysis for
each given span. Does that explain why there's no result?
-Steven Bird
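One possible reading of the trace earlier in the thread: it only ever reports constituents "spanning 1 text elements" or more, and the 'foo' example above is parsed with S -> 'foo', so it never needs the empty rule. The toy parser below is a sketch written for this thread, not NLTK's code. It assumes a table that holds the best probability per (start, end, symbol) for spans of at least one token, and uses a cut-down grammar of my own with an empty NP rule. Under that assumption the empty rule can never fire, and compiling it out by hand (a VP -> VBN PP rule weighted 0.5 * 0.5) is enough to get a parse.

```python
def viterbi(rules, tokens, start="S"):
    """Best probability of `start` spanning all tokens, or 0.0 if none.

    `rules` is a list of (lhs, rhs_tuple, probability). Only spans of
    width >= 1 are ever considered, as in the trace above.
    """
    n = len(tokens)
    best = {}  # (i, j, symbol) -> best probability found so far
    for i, token in enumerate(tokens):
        best[(i, i + 1, token)] = 1.0

    def match(rhs, i, j):
        # Best way to cover tokens[i:j] with the symbols in rhs,
        # each symbol taking at least one token.
        if not rhs:
            return 1.0 if i == j else 0.0
        p_best = 0.0
        for k in range(i + 1, j + 1):
            p = best.get((i, k, rhs[0]), 0.0)
            if p:
                p_best = max(p_best, p * match(rhs[1:], k, j))
        return p_best

    for width in range(1, n + 1):
        for i in range(n - width + 1):
            j = i + width
            changed = True
            while changed:  # repeat so unary rules (NP -> NN) can chain
                changed = False
                for lhs, rhs, p in rules:
                    q = p * match(rhs, i, j)
                    if q > best.get((i, j, lhs), 0.0):
                        best[(i, j, lhs)] = q
                        changed = True
    return best.get((0, n, start), 0.0)

# Toy grammar (hypothetical, much smaller than the one in this thread).
RULES = [
    ("S", ("NP", "VP", "PERIOD"), 1.0),
    ("NP", (), 0.5),                      # the empty rule
    ("NP", ("NN",), 0.5),
    ("PP", ("IN", "NP"), 1.0),
    ("VP", ("VBN", "NP", "PP"), 0.5),
    ("VP", ("VBZ", "VP"), 0.5),
    ("NN", ("resistance",), 0.5),
    ("NN", ("overexpression",), 0.5),
    ("VBN", ("associated",), 1.0),
    ("VBZ", ("is",), 1.0),
    ("IN", ("with",), 1.0),
    ("PERIOD", (".",), 1.0),
]
# Same grammar with the empty NP compiled out of the VP rule by hand.
PATCHED = RULES + [("VP", ("VBN", "PP"), 0.5 * 0.5)]

TOKENS = "resistance is associated with overexpression .".split()
print("with empty rule only: %r" % viterbi(RULES, TOKENS))
print("empty NP compiled out: %r" % viterbi(PATCHED, TOKENS))
```

The first call finds no S over the whole sentence because the NP between "associated" and "with" would have to cover zero tokens. Whether NLTK's ViterbiParser behaves this way for the same reason is something I have not verified.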
The example grammar here is quite small, but the actual grammar is very
large (31,000 rules for the whole treebank), so I'm assuming I need a
probabilistic parser to parse any sentence in reasonable time.
What limitations are there?
Which parsers are the statistical ones?
What do you mean by 'not serious implementation'?
I'm using this for some university coursework and I have been told to
use the NLTK probabilistic parsers - are you saying that these are
very buggy and don't work for valid grammars?
-Rob
On Mar 10, 12:33 am, Steven Bird <stevenbi...@gmail.com> wrote:
> NLTK's probabilistic parsers are not (yet) serious implementations.
> We're talking about fixing this in the coming months. For now you
> should use one of the high performance statistical parsers, such as
> the Stanford Parser. -Steven Bird
>