Perplexity on test set - increases with number of topics


Alta de Waal

Sep 29, 2014, 1:54:27 PM
to gen...@googlegroups.com
Hi,

I am testing the Gensim LDA model on the Reuters-21578 dataset. I split the dataset into a 90/10 training and test set. My gensim code is as follows:

''

from gensim import corpora, models
import numpy as np

# texts, train_index, test_index and nTopics are defined earlier (90/10 split)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
train_corpus = [corpus[i] for i in train_index]
test_corpus = [corpus[j] for j in test_index]

lda = models.ldamodel.LdaModel(train_corpus, num_topics=nTopics, id2word=dictionary,
                               update_every=1, chunksize=20, eval_every=10)

# per-word variational bound, converted to perplexity as 2 ** (-bound / words)
train_words = sum(cnt for document in train_corpus for _, cnt in document)
train_perwordbound = lda.bound(train_corpus) / train_words
train_perplexity = np.exp2(-train_perwordbound)
print 'LDA Train perplexity: ' + str(train_perplexity)

test_words = sum(cnt for document in test_corpus for _, cnt in document)
test_perwordbound = lda.bound(test_corpus) / test_words
test_perplexity = np.exp2(-test_perwordbound)
print 'LDA Test perplexity: ' + str(test_perplexity)

''
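For reference, the train_index/test_index lists used above aren't shown; a minimal sketch of how such a 90/10 split could be built (the function name and seed are my own, not from the original post):

```python
import random

def split_indices(n_docs, test_frac=0.1, seed=0):
    # Shuffle document indices and hold out test_frac of them as the test set.
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    n_test = int(n_docs * test_frac)
    return idx[n_test:], idx[:n_test]  # (train_index, test_index)

# Reuters-21578 has 21578 documents
train_index, test_index = split_indices(21578)
print(len(train_index))  # 19421
print(len(test_index))   # 2157
```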

So I am using the default gensim settings for the LDA model. My problem is that perplexity increases with the number of topics, whereas I would expect perplexity to decrease as the number of topics increases. Does it have something to do with the fact that I am calculating perplexity from the variational lower bound?
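One detail worth noting about the arithmetic: since the variational bound is a lower bound on the per-word log-likelihood, the perplexity 2^(-bound/words) computed from it is an upper bound on the true perplexity, and a looser bound inflates it. A pure-Python sketch with made-up numbers (the bound values here are hypothetical, just to show the relationship):

```python
def perplexity_from_bound(log2_bound, n_words):
    # Per-word perplexity from a total (log2) likelihood bound:
    # perplexity = 2 ** (-bound / n_words)
    return 2.0 ** (-log2_bound / n_words)

# A tighter (higher) bound gives lower perplexity, so a looser
# variational bound makes the reported perplexity look worse.
loose = perplexity_from_bound(-90000.0, 10000)  # -9 bits/word
tight = perplexity_from_bound(-80000.0, 10000)  # -8 bits/word
print(loose)  # 512.0
print(tight)  # 256.0
```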


Regards,

Alta de Waal 

Oleg Nagornyy

Feb 25, 2015, 8:50:07 AM
to gen...@googlegroups.com
Hello Alta!
I faced the same issue (as I understand it, it's a common problem), so I'd like to ask: did you solve it? Or would it be better to switch to a different tool?