How to compare the topical similarity between two documents from their LDA topic distributions?


Victor Wang

Mar 26, 2019, 11:03:46 AM
to Gensim


I have trained an LDA model on a corpus. Now that I have the topic distribution for each document, how can I measure how similar two documents are in terms of their topics?

Should I calculate the Euclidean or cosine distance between the two vectors of topic probabilities?

And using this measure, can I say that, for example, DOC 1 is more similar to DOC 2 than to DOC 3, or that DOC 1 and DOC 2 are topically more similar to each other than DOC 3 and DOC 4 are? Thank you!

rupen sharma

Mar 27, 2019, 1:01:21 AM
to gen...@googlegroups.com
Hi,

I did what you are trying to do with Gensim LSA, using cosine similarity. It works as expected for the kinds of comparisons you mentioned in your message.
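For reference, here is a minimal sketch of that kind of cosine comparison in plain Python. It assumes each document's topic vector has already been converted to a dense list of equal length; the three vectors are made-up numbers for illustration, not output from a real model.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Hypothetical 4-topic vectors for three documents (illustrative values only).
doc1 = [0.70, 0.20, 0.05, 0.05]
doc2 = [0.60, 0.30, 0.05, 0.05]
doc3 = [0.05, 0.05, 0.20, 0.70]

sim_12 = cosine_similarity(doc1, doc2)
sim_13 = cosine_similarity(doc1, doc3)

# A higher score means the two vectors point in a more similar direction.
print("DOC 1 vs DOC 2:", sim_12)
print("DOC 1 vs DOC 3:", sim_13)
print("DOC 1 closer to DOC 2 than to DOC 3:", sim_12 > sim_13)
```

Because every pair is scored on the same scale, the scores can be ranked against each other, which is what the "DOC 1 is more similar to DOC 2 than to DOC 3" kind of statement needs.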

Now, I'm experimenting and struggling with Doc2Vec.




--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Victor Wang

Mar 29, 2019, 3:21:58 PM
to gen...@googlegroups.com
I have some new findings.

Per the gensim tutorial, the Kullback–Leibler divergence may be more appropriate than cosine similarity here:
"Cosine similarity is a standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, different similarity measures may be more appropriate."

There is a built-in function to calculate the Kullback–Leibler divergence:


gensim.matutils.kullback_leibler(vec1, vec2, num_features=None)

Calculate Kullback-Leibler distance between two probability distributions using scipy.stats.entropy.

Parameters:
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • num_features (int, optional) – Number of features in the vectors.

Returns: Kullback-Leibler distance between vec1 and vec2. Value in range [0, +∞) where values closer to 0 mean less distance (higher similarity).

Return type: float
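To make the quantity concrete, here is a hand-written sketch of the Kullback-Leibler formula in plain Python. It is not gensim's implementation. The helper name `kl_divergence` and the two example distributions are my own assumptions, and the vectors are assumed to be dense lists that each sum to 1.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two dense probability distributions.

    Terms where p_i == 0 contribute nothing; if q_i == 0 while p_i > 0
    the divergence is infinite.
    """
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue
        if b == 0:
            return math.inf
        total += a * math.log(a / b)
    return total

# Hypothetical 2-topic distributions (illustrative values only).
p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl_divergence(p, q)   # KL(p || q)
backward = kl_divergence(q, p)  # KL(q || p)

print("KL(p || q):", forward)
print("KL(q || p):", backward)
```

The two printed values differ, so the result depends on which document is passed first. That matters when ranking document pairs with this measure.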


zks...@ualr.edu

Mar 29, 2019, 3:30:16 PM
to Gensim
Note that KL divergence is an asymmetric measure. Jensen-Shannon divergence is a symmetric form that might be useful here.
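A minimal plain-Python sketch of the Jensen-Shannon idea: average the two distributions, then average the KL divergence of each one from that midpoint. The helper names and the example vectors are hypothetical, and the inputs are assumed to be dense lists that each sum to 1. Recent gensim versions also appear to provide gensim.matutils.jensen_shannon, but please check your installed version before relying on it.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 are skipped."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue
        if b == 0:
            return math.inf
        total += a * math.log(a / b)
    return total

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q from their midpoint m."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical 4-topic distributions (illustrative values only).
doc1 = [0.70, 0.20, 0.05, 0.05]
doc2 = [0.60, 0.30, 0.05, 0.05]
doc3 = [0.05, 0.05, 0.20, 0.70]

js_12 = js_divergence(doc1, doc2)
js_21 = js_divergence(doc2, doc1)
js_13 = js_divergence(doc1, doc3)

print("JS(doc1, doc2):", js_12)
print("JS(doc2, doc1):", js_21)  # same value: argument order does not matter
print("JS(doc1, doc3):", js_13)
```

Unlike KL, the result is the same in both directions, and with natural logarithms it never exceeds ln 2. A smaller value means the two documents are topically closer.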

Victor Wang

Mar 29, 2019, 3:33:15 PM
to gen...@googlegroups.com
Good point!


DarkFyre

Mar 31, 2019, 10:46:42 PM
to Gensim
Sorry, this might be a little off topic, but I am trying to do something somewhat similar and I have a question:
Did you train your model on normal word2vec, without embedding the similarity in the training dataset?