How to compare the topical similarity between two documents from their LDA topic distributions?


Victor Wang

Mar 26, 2019, 11:03:46 AM
to Gensim


I have trained an LDA model on a corpus. Now that I have the topic distribution for each document, how can I compare how topically similar two documents are?

Should I calculate the Euclidean or cosine distance between the two vectors of topic probabilities?

And using this measure, can I say that, for example, DOC1 is more similar to DOC2 than to DOC3, or that DOC1 and DOC2 are topically more similar to each other than DOC3 and DOC4 are? Thank you!
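
For concreteness, here is a minimal sketch of what I'm trying to do, using cosine as one candidate measure; the tiny corpus, the model settings, and the variable names are only illustrative:

# Compare two documents' LDA topic distributions with cosine similarity.
from gensim import corpora, models, matutils

texts = [
    ["human", "computer", "interface", "user"],
    ["graph", "trees", "minors", "survey"],
    ["user", "interface", "system", "computer"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

# Sparse (topic_id, probability) pairs; minimum_probability=0 keeps every topic.
dist0 = lda.get_document_topics(corpus[0], minimum_probability=0)
dist2 = lda.get_document_topics(corpus[2], minimum_probability=0)

print(matutils.cossim(dist0, dist2))  # closer to 1.0 means a more similar topic mix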

rupen sharma

Mar 27, 2019, 1:01:21 AM
to gen...@googlegroups.com
Hi,

I did what you are trying to do using Gensim LSA and used cosine similarity. It works as expected, as you mentioned in your mail.

Now, I'm experimenting and struggling with Doc2Vec.
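
A rough sketch of the LSA (LSI) plus cosine-similarity approach I described above; the corpus and names are just illustrative, not my actual code:

# Index documents in LSI space and query them with cosine similarity.
from gensim import corpora, models, similarities

texts = [["user", "interface", "system"], ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus], num_features=lsi.num_topics)

query = lsi[dictionary.doc2bow(["user", "system"])]
print(list(index[query]))  # cosine similarity of the query against each indexed document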





Victor Wang

Mar 29, 2019, 3:21:58 PM
to gen...@googlegroups.com
I have some new findings. 

Per the gensim tutorial, Kullback–Leibler divergence may be more appropriate than cosine similarity here: "Cosine similarity is a standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, different similarity measures may be more appropriate."

There are some built-in functions to calculate the Kullback–Leibler divergence. 


gensim.matutils.kullback_leibler(vec1, vec2, num_features=None)

Calculate Kullback-Leibler distance between two probability distributions using scipy.stats.entropy.

Parameters:
  • vec1 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • vec2 ({scipy.sparse, numpy.ndarray, list of (int, float)}) – Distribution vector.
  • num_features (int, optional) – Number of features in the vectors.

Returns:
  Kullback-Leibler distance between vec1 and vec2. Value in range [0, +∞) where values closer to 0 mean less distance (higher similarity).

Return type:
  float
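
A quick sketch of calling it on two made-up dense topic distributions:

import numpy as np
from gensim import matutils

doc1 = np.array([0.70, 0.20, 0.10])  # P(topic | doc1)
doc2 = np.array([0.60, 0.25, 0.15])  # P(topic | doc2)

# Lower values mean the distributions are closer (higher topical similarity).
print(matutils.kullback_leibler(doc1, doc2))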


zks...@ualr.edu

Mar 29, 2019, 3:30:16 PM
to Gensim
Note that KL divergence is an asymmetric measure. Jensen-Shannon divergence is a symmetric form that might be useful here.
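
For example, a small sketch assuming a gensim version that ships matutils.jensen_shannon (the topic vectors here are made up):

import numpy as np
from gensim import matutils

doc1 = np.array([0.70, 0.20, 0.10])
doc2 = np.array([0.60, 0.25, 0.15])

# Jensen-Shannon is symmetric, so the argument order does not matter,
# and it is bounded (by ln 2 when natural logarithms are used).
print(matutils.jensen_shannon(doc1, doc2))
print(matutils.jensen_shannon(doc2, doc1))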


Victor Wang

Mar 29, 2019, 3:33:15 PM
to gen...@googlegroups.com
Good point!


DarkFyre

Mar 31, 2019, 10:46:42 PM
to Gensim
Sorry, this might be a little off topic, but I am trying to do something similar to this and I have a question.
Did you train your model with plain word2vec, without embedding the similarity in the training dataset?