Perplexity on test set - increases with number of topics


Alta de Waal

Sep 29, 2014, 1:54:27 PM
to gen...@googlegroups.com
Hi,

I am testing the Gensim LDA model on the Reuters-21578 dataset. I split the dataset into a 90/10 training and test set. My gensim code is as follows:

''

from gensim import corpora, models
import numpy as np

# texts, train_index, test_index and nTopics are defined earlier (90/10 split)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
train_corpus = [corpus[i] for i in train_index]
test_corpus = [corpus[j] for j in test_index]

lda = models.ldamodel.LdaModel(train_corpus, num_topics=nTopics, id2word=dictionary,
                               update_every=1, chunksize=20, eval_every=10)

# per-word variational bound, converted to perplexity as 2 ** (-bound / words)
train_words = sum(cnt for document in train_corpus for _, cnt in document)
train_perwordbound = lda.bound(train_corpus) / train_words
train_perplexity = np.exp2(-train_perwordbound)
print 'LDA Train perplexity: ' + str(train_perplexity)

test_words = sum(cnt for document in test_corpus for _, cnt in document)
test_perwordbound = lda.bound(test_corpus) / test_words
test_perplexity = np.exp2(-test_perwordbound)
print 'LDA Test perplexity: ' + str(test_perplexity)

''
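For reference, the train_index/test_index lists used above aren't shown; a minimal sketch of how such a 90/10 split could be built (the function name and seed are my own, not from the original post):

```python
import random

def split_indices(n_docs, test_frac=0.1, seed=0):
    # Shuffle document indices and hold out test_frac of them as the test set.
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    n_test = int(n_docs * test_frac)
    return idx[n_test:], idx[:n_test]  # (train_index, test_index)

# Reuters-21578 has 21578 documents
train_index, test_index = split_indices(21578)
print(len(train_index))  # 19421
print(len(test_index))   # 2157
```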

So I am using the default gensim settings for the LDA model. My problem is that perplexity increases with the number of topics, whereas I would expect perplexity to decrease as the number of topics increases. Does it have something to do with the fact that I am calculating perplexity from the variational lower bound?
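One detail worth noting about the arithmetic: since the variational bound is a lower bound on the per-word log-likelihood, the perplexity 2^(-bound/words) computed from it is an upper bound on the true perplexity, and a looser bound inflates it. A pure-Python sketch with made-up numbers (the bound values here are hypothetical, just to show the relationship):

```python
def perplexity_from_bound(log2_bound, n_words):
    # Per-word perplexity from a total (log2) likelihood bound:
    # perplexity = 2 ** (-bound / n_words)
    return 2.0 ** (-log2_bound / n_words)

# A tighter (higher) bound gives lower perplexity, so a looser
# variational bound makes the reported perplexity look worse.
loose = perplexity_from_bound(-90000.0, 10000)  # -9 bits/word
tight = perplexity_from_bound(-80000.0, 10000)  # -8 bits/word
print(loose)  # 512.0
print(tight)  # 256.0
```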


Regards,

Alta de Waal 

Oleg Nagornyy

Feb 25, 2015, 8:50:07 AM
to gen...@googlegroups.com
Hello Alta!
I faced the same issue (as I understand it, it's a common problem), so I'd like to ask: did you solve it? Or would it be better to switch to a different tool?