NLTK perplexity score smoothing


Shengrong Liu

Aug 2, 2013, 4:04:25 PM
to nltk-...@googlegroups.com
Hi all,
My original code is:

tt1=NgramModel(2, my_bigrams, estimator = None)

How should I smooth the model?
Should I just change the code to:

estimator1 = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

tt1=NgramModel(2, my_bigrams, estimator = estimator1)

However, a new problem occurs. If I remember correctly, the perplexity score for a trigram model should be smaller than the perplexity score for a bigram model.
The two perplexity scores from my code are 299.25 and 299.22, so I know something is wrong.

Can anyone help me figure out why this happens?

Thanks 

Tim McNamara

Aug 3, 2013, 4:26:48 PM
to nltk-...@googlegroups.com
Quick checks:

It looks like you are still feeding in the my_bigrams variable, which presumably holds bigrams rather than trigrams?

Are you sure you want to be feeding NgramModel a list of ngrams? I believe it would be better to give it the word list and allow it to build the ngrams itself.
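To make that second check concrete, here is a minimal sketch in plain Python, so it does not depend on any particular NLTK version. The helper name `ngrams_from_words` and the sample sentence are made up for illustration; the only point is that bigrams and trigrams should both be derived from the same word list.

```python
def ngrams_from_words(words, n):
    """Return every n-gram (as a tuple) over a list of tokens.

    Hypothetical helper for illustration, not part of NLTK.
    """
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Toy token list, assumed purely for illustration.
words = "the cat sat on the mat".split()

# Both n-gram orders are built from the same underlying word list.
bigrams = ngrams_from_words(words, 2)
trigrams = ngrams_from_words(words, 3)

print(bigrams[:2])
print(trigrams[:2])
```

If the same pre-built bigram list is handed to both a 2-gram and a 3-gram model, the two models may end up seeing essentially the same input, which could be one reason for nearly identical scores.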


Generally:

How to smooth the model is going to depend heavily on the input data and on choices such as feature selection, tokenisation, stemming and many others. It is very difficult to give a specific answer to this question.

What are your options when a mailing list can't help? You can probably work out an answer yourself by getting a deeper understanding of what is happening.

You can take a look at the source code that generates the perplexity score relatively easily[0]. You'll see that it's just an expansion of the text's entropy score[1]. According to its docstring, NgramModel.entropy computes "the average log probability of each word in the text." So changing the probability distribution should change the entropy value. However, if you are already providing ngrams, perhaps there is not that much work for it to do?

[0] http://nltk.org/_modules/nltk/model/ngram.html#NgramModel.perplexity
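
The relationship between entropy and perplexity can be sketched without NLTK. This is not NLTK's implementation, just a rough illustration under stated assumptions: a toy bigram model, Lidstone smoothing written out by hand as (count + gamma) / (total + gamma * vocabulary size), and perplexity taken as 2 raised to the average negative log2 probability. All function names and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_bigram_counts(words):
    """Count, for each one-word context, how often each next word follows."""
    counts = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        counts[prev][cur] += 1
    return counts


def lidstone_prob(counts, prev, cur, gamma, vocab_size):
    """Hand-written Lidstone estimate of P(cur | prev)."""
    ctx = counts.get(prev, Counter())  # unseen context -> empty counts
    total = sum(ctx.values())
    return (ctx[cur] + gamma) / (total + gamma * vocab_size)


def entropy(counts, words, gamma, vocab_size):
    """Average negative log2 probability per predicted word."""
    pairs = list(zip(words, words[1:]))
    log_sum = sum(
        math.log2(lidstone_prob(counts, prev, cur, gamma, vocab_size))
        for prev, cur in pairs
    )
    return -log_sum / len(pairs)


def perplexity(counts, words, gamma, vocab_size):
    """Perplexity as 2 ** entropy."""
    return 2 ** entropy(counts, words, gamma, vocab_size)


# Toy data, assumed purely for illustration.
train = "the cat sat on the mat the cat ate".split()
test = "the cat sat on the mat".split()
vocab_size = len(set(train) | set(test))

counts = train_bigram_counts(train)
pp_low = perplexity(counts, test, 0.2, vocab_size)   # light smoothing
pp_high = perplexity(counts, test, 1.0, vocab_size)  # heavier smoothing

print(pp_low, pp_high)
```

On this toy data the two gamma values give different perplexities, which is the behaviour you would expect when the estimator really is being applied. If swapping the estimator barely moves the score, that points back at the input being fed to the model.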

Tim McNamara
@timClicks

