# Imports assumed from the surrounding script
from gensim import corpora, models
import numpy as np

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
train_corpus = [corpus[i] for i in train_index]
test_corpus = [corpus[j] for j in test_index]
lda = models.ldamodel.LdaModel(train_corpus, num_topics=nTopics, id2word=dictionary,
                               update_every=1, chunksize=20, eval_every=10)
train_words = sum(cnt for document in train_corpus for _, cnt in document)
train_perwordbound = lda.bound(train_corpus)/train_words
train_perplexity = np.exp2(-train_perwordbound)
print('LDA train perplexity: ' + str(train_perplexity))
test_words = sum(cnt for document in test_corpus for _, cnt in document)
test_perwordbound = lda.bound(test_corpus)/test_words
test_perplexity = np.exp2(-test_perwordbound)
print('LDA test perplexity: ' + str(test_perplexity))
So I am using the default gensim settings for the LDA model. My problem is that the perplexity increases with the number of topics, whereas I would expect it to decrease as the number of topics grows. Could this be related to the fact that I am calculating perplexity from the variational lower bound rather than from the true log-likelihood?
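For what it is worth, the mapping from per-word bound to perplexity is monotone, which is useful when sanity-checking the numbers. Here is a minimal numpy sketch (the bound values are made up for illustration, not from a real model):

```python
import numpy as np

# Hypothetical per-word lower bounds (illustrative values only).
# A looser (more negative) bound translates directly into higher perplexity.
perword_bounds = [-9.5, -10.0, -10.5]

# perplexity = 2^(-per-word bound), matching the np.exp2 call in the snippet above
perplexities = [np.exp2(-b) for b in perword_bounds]

print(perplexities)  # increasing: roughly [724.1, 1024.0, 1448.2]
```

So if the bound itself becomes looser as num_topics grows (for example because the variational approximation degrades with more parameters), the reported perplexity will rise even if the underlying model fit improves.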
Regards,
Alta de Waal