Comparing perplexity across num_topics?


Kuang

Oct 15, 2014, 10:25:06 AM
to gen...@googlegroups.com
Hello,

After reading earlier discussions, I found an answer to a question about
choosing a reasonable number of topics based on the return value of
bound():

"The values coming out of `bound()` depend on the number of topics (as well as
number of words), so they're not comparable across different num_topics (or
different test corpora)."

My interpretation is that the outputs of bound() are comparable only when
(1) num_topics is fixed and (2) the same corpus is used.

My question (and I believe many others have the same one) is: how can I
generate perplexity values from the same corpus that are comparable across
different num_topics, and produce figures similar to Figure 9 in Blei et al.
(2003)? I'd like to identify an appropriate number of topics for my topic model.
Are there other ways of doing this besides calculating perplexities?
Thank you very much!
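
P.S. For concreteness, the kind of sweep I have in mind is roughly the
following (just a sketch; dictionary, train_corpus and test_corpus stand for
my own data, and the num_topics values are arbitrary):

import numpy as np
from gensim import models

# train one model per candidate num_topics and score the same held-out corpus
for num_topics in (10, 20, 50, 100):
    lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=num_topics)
    test_words = sum(cnt for doc in test_corpus for _, cnt in doc)
    perword_bound = lda.bound(test_corpus) / test_words
    print(num_topics, np.exp2(-perword_bound))  # per-word perplexity estimate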

Radim Řehůřek

Oct 16, 2014, 3:07:44 AM
to gen...@googlegroups.com, rayhua...@gmail.com
Hello Kuang,

`bound()` has been refactored since then, so you should be able to compare its scores across different numbers of topics.

Comparing on "same corpus" still stands (though if your test corpora is large & representative enough, the difference in scores shouldn't be large anyway, so that's no problem either).

HTH,
Radim

Kuang

Oct 16, 2014, 10:30:05 AM
to gen...@googlegroups.com
Hello Radim,

Thank you for your kind reply. I am asking because, like some other posters,
I have also seen the (per-word) perplexity increase as num_topics increases,
which is counterintuitive.

I also tried Hoffman's online LDA code on Wikipedia data and found that the
perplexity values (computed from the bound values his code returns) likewise
increase with the number of topics. This made me think that the perplexity
values are probably not comparable across num_topics. Any comments?
Thanks a lot.

(btw, I ended up using Christopher Grainger's approach to calculate the
values of symmetric KL divergence)
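
A minimal sketch of the general idea (just my understanding of it, not
Grainger's exact code; topic_word is a placeholder for the model's topic-word
distributions, e.g. the normalised rows of lda.state.get_lambda()):

import numpy as np
from itertools import combinations

def sym_kl(p, q):
    # symmetric KL divergence between two discrete distributions
    p = np.asarray(p, dtype=float) + 1e-12
    q = np.asarray(q, dtype=float) + 1e-12
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# average pairwise divergence between topics; higher values suggest more
# distinct (less overlapping) topics for a given num_topics
pairs = combinations(range(len(topic_word)), 2)
avg_divergence = np.mean([sym_kl(topic_word[i], topic_word[j]) for i, j in pairs])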

sincerely,
Kuang



Alta de Waal

Oct 21, 2014, 3:02:17 AM
to gen...@googlegroups.com, rayhua...@gmail.com
Dear Kuang and Radim,

Radim, you say the code has been refactored since then, but I ran my code a few weeks ago and still got the counterintuitive result of perplexity increasing with the number of topics. Perplexity is also very sensitive to the number of passes.

Here is my code:

import numpy as np
from gensim import corpora, models

# texts, train_index, test_index and nTopics are defined earlier in my script
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
train_corpus = [corpus[i] for i in train_index]
test_corpus = [corpus[j] for j in test_index]

# passes defaults to 1; the results change noticeably when I increase it
lda = models.ldamodel.LdaModel(train_corpus, num_topics=nTopics, id2word=dictionary,
                               update_every=1, chunksize=20, eval_every=10)

# per-word bound on the training set, converted to perplexity as 2**(-bound)
train_words = sum(cnt for document in train_corpus for _, cnt in document)
train_perwordbound = lda.bound(train_corpus) / train_words
train_perplexity = np.exp2(-train_perwordbound)
print('LDA Train perplexity: ' + str(train_perplexity))

# same calculation on the held-out set
test_words = sum(cnt for document in test_corpus for _, cnt in document)
test_perwordbound = lda.bound(test_corpus) / test_words
test_perplexity = np.exp2(-test_perwordbound)
print('LDA Test perplexity: ' + str(test_perplexity))
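
Incidentally, if your gensim version has it, log_perplexity() computes the same
per-word bound internally, which gives an easy cross-check (a quick sketch
reusing the variables above):

# log_perplexity() returns the per-word bound; gensim itself logs the
# corresponding 2**(-bound) perplexity estimate at INFO level
test_bound = lda.log_perplexity(test_corpus)
print('LDA Test perplexity (via log_perplexity): ' + str(np.exp2(-test_bound)))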

Regards,
Alta