PCFG default probability

Russell Weber

Nov 14, 2011, 12:51:46 PM

to nltk-...@googlegroups.com

Is there a way to define a default probability in PCFGs for rare or unseen words? I am trying to induce a grammar from the provided corpus, but the resulting grammar does not cover all the words I need.
Example:

grammar = induce_pcfg(S, productions)
print grammar

parser = nltk.parse.ViterbiParser(grammar)
s = "I am a sentence that should be tokenized and parsed for all that I am worth!"
tokens = word_tokenize(s)
for t in parser.nbest_parse(tokens):

"input words: %r." % missing)
ValueError: Grammar does not cover some of the input words: "'tokenized', 'parsed'".

Obviously "tokenized" and "parsed" are rare words, but instead of raising an error, I would like the parser to use a default low probability as a placeholder for rare words. Viterbi should still come up with a good parse as long as not too many rare words appear in sequence, since all back pointers through the network will still depend on the constant low default. Is there a way of doing this?

Steven Bird

Nov 14, 2011, 9:26:26 PM

to nltk-...@googlegroups.com

A lightweight solution is to define a vocabulary in advance (e.g. the
most common N words) and to map all other words to a special token
UNK. Add productions of the form C -> UNK for each pre-terminal
category C. Then let the induction step take care of working out how
likely unknown words are, for each category. -Steven Bird
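
The vocabulary step described above can be sketched in plain Python, independent of NLTK. This is only an illustration under assumptions: the helper names `build_vocabulary` and `map_unknown`, the toy sentences, and the cutoff of 5 are all hypothetical and not part of the original suggestion. The same mapping would be applied to the training words before induction and to the input tokens before parsing.

```python
from collections import Counter

UNK = "UNK"  # placeholder token; assumed not to occur as a real word in the corpus


def build_vocabulary(sentences, n):
    """Return the set of the n most frequent words in the tokenized sentences."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, _ in counts.most_common(n)}


def map_unknown(tokens, vocabulary):
    """Replace every token that is outside the vocabulary with UNK."""
    return [tok if tok in vocabulary else UNK for tok in tokens]


# Toy training data (hypothetical); a real run would use the corpus sentences.
training = [
    ["the", "dog", "saw", "the", "cat"],
    ["the", "cat", "saw", "a", "dog"],
    ["a", "dog", "barked"],
]

vocab = build_vocabulary(training, 5)

# Map an input sentence before handing it to the parser.
mapped = map_unknown(["the", "dog", "parsed", "a", "cat"], vocab)
print(mapped)
```

With the rare words collapsed this way, a grammar that contains C -> UNK productions can cover any input sentence, because every out-of-vocabulary token reaches the parser as UNK.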

teancum

Nov 15, 2011, 2:17:58 PM

to nltk-users

I am not quite sure that I understand this solution. Shouldn't there
be an easier way to tell Viterbi, or to note in our PCFG, to
treat all other symbols with a default low probability? Maybe some
type of regular expression?

John K Pate

Nov 15, 2011, 3:13:42 PM

to nltk-...@googlegroups.com

On Tue, 2011-11-15 at 11:17 -0800, teancum wrote:
> I am not quite sure that I understand this solution. Shouldn't there
> be an easier way to tell Viterbi, or to note in our PCFG, to
> treat all other symbols with a default low probability? Maybe some
> type of regular expression?

Even if there is a way, it's not necessarily a good idea. How do you
determine what that low probability should be? Steven Bird's solution is
simple. Take the least common words in your training set (say, those
which appear only once or twice), and replace them with "UNK." The PCFG
can then learn something about the behavior of rare words from your
training set.
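
As a rough illustration of why this works, the sketch below replaces words seen at most once with UNK and then computes relative-frequency estimates for the lexical rules, which is the kind of estimate PCFG induction produces. The function names, the toy (tag, word) pairs, and the threshold are assumptions made for the example; it is plain Python and does not call NLTK.

```python
from collections import Counter

UNK = "UNK"  # placeholder token for rare words


def replace_rare(tagged_words, max_count=1):
    """Replace words that occur at most max_count times with UNK."""
    counts = Counter(word for _, word in tagged_words)
    return [
        (tag, word if counts[word] > max_count else UNK)
        for tag, word in tagged_words
    ]


def lexical_probabilities(tagged_words):
    """Estimate P(tag -> word) as count(tag -> word) / count(tag)."""
    pair_counts = Counter(tagged_words)
    tag_counts = Counter(tag for tag, _ in tagged_words)
    return {
        (tag, word): count / tag_counts[tag]
        for (tag, word), count in pair_counts.items()
    }


# Toy (preterminal, word) pairs (hypothetical data).
tagged = [
    ("DT", "the"), ("NN", "dog"), ("VBD", "saw"), ("DT", "the"), ("NN", "cat"),
    ("DT", "the"), ("NN", "dog"), ("VBD", "chased"), ("DT", "the"), ("NN", "ferret"),
]

probs = lexical_probabilities(replace_rare(tagged))
for (tag, word), p in sorted(probs.items()):
    print("%s -> %s [%.2f]" % (tag, word, p))
```

The probability of UNK differs per category here: it is learned from how often rare words actually appear under each preterminal, rather than being a single hand-picked constant.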

John

==
http://homepages.inf.ed.ac.uk/s0930006/

teancum

Nov 15, 2011, 8:43:24 PM

to nltk-users

"define a vocabulary in advance (e.g. the
most common N words) and to map all other words to a special token
UNK. Add productions of the form C -> UNK for each pre-terminal
category C."

I guess what I am saying is that I have no idea how to do this
with NLTK, which is why I am asking.

Let's start with the Penn Treebank example for inducing a grammar:

#code
import nltk
from nltk import induce_pcfg
from nltk.corpus import treebank

productions = []  # collects the productions of every tree
items = treebank.fileids()  # treebank.files()

for item in items:  # treebank.items:
    for tree in treebank.parsed_sents(item):
        # perform optional tree transformations, e.g.:
        tree.collapse_unary(collapsePOS=False)  # Remove branches A-B-C into A-B+C
        tree.chomsky_normal_form(horzMarkov=2)  # Remove A->(B,C,D) into A->B,C+D->D

        productions += tree.productions()

S = nltk.Nonterminal('S')

grammar = induce_pcfg(S, productions)
print(grammar)

#!code
How do I get a vocabulary list and then "map all other words to a
special token UNK"? And how do I then collect all of the preterminals
so that I can add the C -> UNK productions to the productions list,
given that the sets of terminals and nonterminals in a weighted
grammar are only specified implicitly by its productions?
