KL-distance after LDA


Sergey Slavsky

Aug 1, 2016, 10:54:33 AM
to gensim
I'm trying to calculate the KL-distance between texts in my corpus. I ran LDA topic modeling:

from gensim import models

model = models.LdaModel(corpus, num_topics=30)
array = model[corpus]

This gives an array where each text should have 30 numbers, one per topic.
But somehow most of these numbers are zeros, so I can't use the formula for KL-distance: there would be zeros in the denominator.
I tried to vary the alpha parameter of the model, but the zeros only disappear at alpha greater than 200, and by then there is very little difference between the topics.
How should I solve this problem? 

Thanks
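
[For illustration, a minimal numpy sketch of the problem, assuming dense topic vectors; the epsilon smoothing shown here is one common workaround, not something from the thread:]

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q), with epsilon smoothing so zero entries in q
    no longer blow up the log term."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Add a tiny floor and renormalise back to proper distributions.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Toy topic distributions: q has a zero, which breaks unsmoothed KL.
p = [0.4, 0.3, 0.3]
q = [0.5, 0.5, 0.0]
print(kl_divergence(p, q))  # finite, thanks to the smoothing
```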

Cameron Fen

Aug 1, 2016, 2:44:15 PM
to gensim
Do you need to use KL divergence?  It's not symmetric, which is sort of annoying, and it's not a true metric.  You could use Hellinger distance instead.  See this SO.

Bhargav Srinivasa

Aug 1, 2016, 2:52:15 PM
to gensim
Hello, I've written a notebook on using distance metrics with Gensim. You should be able to find whatever you're looking for there, including getting past 0s in KL - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/distance_metrics.ipynb

Both Hellinger and KL have been implemented within Gensim now. 

Sergey Slavsky

Aug 2, 2016, 6:22:04 AM
to gensim
Thanks for your notebook. But it only covers the KL-distance between topics, not between texts. I think I have found the solution, though: when creating the model you need to set a low minimum_probability, which defaults to 0.01.

lda = models.LdaModel(corpus, num_topics=n_topics, minimum_probability=0.00001)
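
[A numpy sketch of why this helps, under the assumption that topics pruned by the minimum_probability cutoff are what produce the zeros; `to_dense` and the sparse example below are hypothetical helpers, not gensim API:]

```python
import numpy as np

def to_dense(sparse_doc_topics, num_topics, floor=1e-5):
    """Turn a sparse [(topic_id, prob), ...] list into a dense vector,
    filling topics that fell below the probability cutoff with a small
    floor, then renormalising to a proper distribution."""
    vec = np.full(num_topics, floor)
    for topic_id, prob in sparse_doc_topics:
        vec[topic_id] = prob
    return vec / vec.sum()

# Hypothetical sparse output for one document with num_topics=5;
# topics 0, 2 and 4 were pruned and would otherwise be exact zeros.
doc = [(1, 0.6), (3, 0.39)]
dense = to_dense(doc, num_topics=5)
print(dense.sum())  # a proper distribution with no exact zeros
```

With no exact zeros in the dense vectors, the KL-distance denominator is never zero.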

Bhargav Srinivasa

Aug 2, 2016, 10:33:39 AM
to gensim
Cool - if you could, it would be awesome to add a line to the notebook mentioning this!