Perplexity Estimates in LDA Model


Ryan Mills

Aug 6, 2014, 11:42:27 PM8/6/14
to gen...@googlegroups.com
I'm doing topic modelling on a corpus using Gensim's LDA implementation. When I compare perplexity across different numbers of topics, I observe that as the number of topics increases from 5 to 60, the likelihood on both the training and test sets increases, but the amount of increase is very small:

num_topics  Likelihood_Train  Likelihood_Test
5           -229377021.9      -58513103.75
10          -224322296.5      -57476512.33
20          -219480128.3      -56550233.29
30          -217518306.7      -56260050.09
40          -215907408.8      -56081111.45
50          -214815993.7      -55982093.45
60          -213963838.1      -55885838.52
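To put the "very small" in perspective, here is a quick sanity check on the test-set numbers above, expressing each model's likelihood as a relative gain over the 5-topic baseline:

```python
# Test-set log likelihoods from the table above, keyed by num_topics.
test_ll = {
    5: -58513103.75, 10: -57476512.33, 20: -56550233.29,
    30: -56260050.09, 40: -56081111.45, 50: -55982093.45,
    60: -55885838.52,
}

baseline = test_ll[5]
for k in sorted(test_ll):
    # Relative improvement in log likelihood over the 5-topic model.
    rel = (test_ll[k] - baseline) / abs(baseline)
    print(f"{k:>2} topics: {rel:+.2%}")
```

Even at 60 topics the gain is under 5% relative to 5 topics.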

I'm not sure how I should interpret this. I'm setting the iterations and gamma_threshold values to more extreme values than their defaults (500 and 0.00001, respectively), so that I can be sure the model converges properly.

Ryan

Radim Řehůřek

Aug 7, 2014, 6:40:31 AM8/7/14
to gen...@googlegroups.com
Hello Ryan,


On Thursday, August 7, 2014 6:42:27 AM UTC+3, Ryan Mills wrote:
I'm doing topic modelling on a corpus using Gensim's LDA implementation. When I compare perplexity across different numbers of topics, I observe that as the number of topics increases from 5 to 60, the likelihood on both the training and test sets increases, but the amount of increase is very small:

You don't mention how you compute those numbers. I assume you're using the same evaluation corpus and the same likelihood formula in each case?

Gensim's LdaModel class implements online variational LDA from https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf

The particular equation gensim uses for the (lower bound of) perplexity comes from formula (16) in that article. It's exposed as `model.bound(eval_corpus)` for the bound itself, or `model.log_perplexity(eval_corpus)` for a nicer per-word perplexity estimate (which calls `bound()` internally).
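Note that `log_perplexity()` returns the per-word likelihood bound in bits (gensim reports the corresponding perplexity as 2^(-bound) in its log output), so converting it into an actual perplexity number is a one-liner. A minimal sketch; the bound value below is a made-up example, not from your runs:

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound (the kind of value
    returned by LdaModel.log_perplexity) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Hypothetical per-word bound of -7.5 bits -> perplexity of about 181
print(perplexity_from_bound(-7.5))
```

With a trained model you'd feed in the result of `model.log_perplexity(test_corpus)`; lower perplexity is better.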
The `iterations` and `gamma_threshold` parameters you mention only guide the convergence of the internal variational inference loop. For better overall results, either increase the amount of your training data, or increase the `passes` parameter (to train over the same dataset multiple times).

HTH,
Radim


 

