Perplexity Estimates in LDA Model


Ryan Mills

Aug 6, 2014, 11:42:27 PM8/6/14
to gen...@googlegroups.com
I'm doing topic modelling on a corpus using Gensim's LDA implementation. When I compare perplexity across different numbers of topics, I observe that as the number of topics increases from 5 to 60, the likelihood on both the training and test sets increases, but the amount of increase is very small:

num_topics  Likelihood_Train  Likelihood_Test
5           -229377021.9      -58513103.75
10          -224322296.5      -57476512.33
20          -219480128.3      -56550233.29
30          -217518306.7      -56260050.09
40          -215907408.8      -56081111.45
50          -214815993.7      -55982093.45
60          -213963838.1      -55885838.52
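To put the "very small" in perspective, here is a quick sanity check on the test-set numbers above, expressing each model's likelihood as a relative gain over the 5-topic baseline:

```python
# Test-set log likelihoods from the table above, keyed by num_topics.
test_ll = {
    5: -58513103.75, 10: -57476512.33, 20: -56550233.29,
    30: -56260050.09, 40: -56081111.45, 50: -55982093.45,
    60: -55885838.52,
}

baseline = test_ll[5]
for k in sorted(test_ll):
    # Relative improvement in log likelihood over the 5-topic model.
    rel = (test_ll[k] - baseline) / abs(baseline)
    print(f"{k:>2} topics: {rel:+.2%}")
```

Even at 60 topics the gain is under 5% relative to 5 topics.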

I'm not sure how I should interpret this. I'm setting the iterations and gamma_threshold values to more extreme values than their defaults (500 and 0.00001, respectively), so that I can be sure the model converges properly.

Ryan

Radim Řehůřek

Aug 7, 2014, 6:40:31 AM8/7/14
to gen...@googlegroups.com
Hello Ryan,


On Thursday, August 7, 2014 6:42:27 AM UTC+3, Ryan Mills wrote:
I'm doing topic modelling on a corpus using Gensim's LDA implementation. When I compare perplexity across different numbers of topics, I observe that as the number of topics increases from 5 to 60, the likelihood on both the training and test sets increases, but the amount of increase is very small:

You don't mention how you compute those numbers. I assume you're using the same evaluation corpus and the same likelihood formula in each case?

Gensim's LdaModel class implements online variational LDA from https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf

The particular equation gensim uses for the (lower bound of) perplexity comes from formula (16) in that article. It's exposed as `model.bound(eval_corpus)` for the bound itself, or `model.log_perplexity(eval_corpus)` for a nicer per-word perplexity estimate (which calls `bound()` internally).
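Note that `log_perplexity()` returns the per-word likelihood bound in bits (gensim reports the corresponding perplexity as 2^(-bound) in its log output), so converting it into an actual perplexity number is a one-liner. A minimal sketch; the bound value below is a made-up example, not from your runs:

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound (the kind of value
    returned by LdaModel.log_perplexity) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Hypothetical per-word bound of -7.5 bits -> perplexity of about 181
print(perplexity_from_bound(-7.5))
```

With a trained model you'd feed in the result of `model.log_perplexity(test_corpus)`; lower perplexity is better.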
The `iterations` and `gamma_threshold` parameters you mention only guide the convergence of the internal variational inference loop. For better overall results, either increase the amount of your training data, or increase the `passes` parameter (to train over the same dataset multiple times).

HTH,
Radim


 

