Comparing Similarity of LDA Topics


Harry Baker

Feb 14, 2016, 4:48:16 PM
to gensim
Hi,

I'm trying to develop a way to compare the similarity of different LDA topics, and am wondering what the best method of implementing it would be. My research involves topic modeling historical documents, so we're really interested in seeing how similar themes change over time, as well as easily identifying topics that match themes we've previously identified.

I was doing some research into it, and it seemed like KL divergence would be a good place to start, since we're trying to find relative difference of two different distributions. Does gensim have any packages or functions to compute KL divergence?

Harry

Radim Řehůřek

Feb 16, 2016, 11:25:12 PM
to gensim
Hello Harry,

it's a one-liner in Python, so there's no explicit package for it. Search this forum for "Hellinger distance"; it's been discussed before.
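For reference, that one-liner might look like this (plain NumPy sketch; `p` and `q` stand for any two discrete probability distributions over the same vocabulary):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))
```

Identical distributions give 0 and distributions with disjoint support give 1, so the result is a bounded, symmetric distance.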

Best,
Radim

Harry Baker

Feb 25, 2016, 12:21:56 AM
to gensim
Most of the discussion seems to be around using KL divergence to compare documents to documents, or queries to documents. Would it be possible to use it to compare topics to topics? I've been looking through the similarities classes, and it seems like they are designed to compare queries to entire documents.

Radim Řehůřek

Feb 25, 2016, 12:54:49 AM
to gensim
On Thursday, February 25, 2016 at 2:21:56 PM UTC+9, Harry Baker wrote:
Most of the discussion seems to be around using KL divergence to compare documents to documents, or queries to documents. Would it be possible to use it to compare topics to topics?

These methods work by comparing probability distributions -- whether these distributions "semantically" represent documents or topics is irrelevant. So yes, topics-to-topics is possible.

I've been looking through the similarities class, and it seems like they are designed to compare queries to entire documents.


The Similarity classes in gensim do not implement KL divergence / Hellinger distance at all. They only work with cosine similarity. That's mostly because it's a simple one-liner; unless you need performance, there's no reason to complicate the computation.
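For the KL divergence originally asked about, the one-liner is just as short (plain NumPy sketch; note that KL is asymmetric and goes to infinity on zero entries in `q`, which is one reason the symmetric, bounded Hellinger distance is often preferred for comparing topics):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions.
    `eps` guards against log(0) and division by zero."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float((p * np.log(p / q)).sum())
```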

Have a look at the code in this Stackoverflow answer for example:


Best,
Radim

Harry Baker

Feb 25, 2016, 1:56:56 PM
to gensim
Ok, thank you, that's helpful. I'm having some trouble converting LDA topics into dense arrays.

Here's my code:

dense1 = gensim.matutils.sparse2full(lda.show_topic(x), 50)
dense2 = gensim.matutils.sparse2full(lda.show_topic(y), 50)

sim = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())

topicMatrix.append(sim)


Which is throwing the error:

line 217, in sparse2full
    result[list(doc)] = list(itervalues(doc))
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Since a topic is a sparse vector, shouldn't it convert directly into a dense array, or do I need to modify it first?

Radim Řehůřek

Feb 25, 2016, 7:28:02 PM
to gensim
I don't think `show_topic(topic_id)` does what you think: it returns `(word, probability)` pairs where `word` is a string, not an integer id, so `sparse2full` can't use it as an index. Try `sorted(lda.get_topic_terms(topic_id, topn=lda.num_terms))` instead, which returns `(word_id, probability)` pairs:


By the way, the IPython interactive shell is a great way to explore Python pipelines and see what's going on, step by step.

Hope that helps!
Radim

Harry Baker

Mar 9, 2016, 8:41:30 PM
to gensim
That worked perfectly, thank you! And thanks for all of your help through this.

I do have a few more questions about how words are represented in dense vectors, though. Is there any significance to the order of the elements of a sparse vector when it's converted to a dense vector? I want to be able to sort the vector by most significant words, and I'm worried that if I compare two topics that have been sorted by significance (rather than by word ID), I won't be comparing corresponding words at all.

I'd also like to be able to compare only the top N words of a topic, which is another reason I'm worried about this. I'd like to convert only these top N words into a dense vector, but I'm afraid that comparing this to another dense vector will give garbage results.

Harry

Radim Řehůřek

Mar 9, 2016, 9:32:19 PM
to gensim
That's right, when you compare dense vectors, you must compare them in the same order of features/dimensions. If you re-sort their elements by magnitude, you'll get a garbage comparison, comparing apples to pears.

How you realize this comparison depends on your comparator (~similarity metric). Probably the easiest way is to not convert sparse to dense, but rather compare the sparse vectors of `(feature_id, feature_weight)` directly?
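That sparse comparison might be sketched like this (Hellinger again, aligning features by id so element order doesn't matter; note that if the two vectors are truncated to different top-N words, the missing probability mass will distort the result):

```python
import math

def hellinger_sparse(vec1, vec2):
    """Hellinger distance between two sparse (feature_id, weight) lists.
    Features are matched by id, so the order of elements is irrelevant."""
    d1, d2 = dict(vec1), dict(vec2)
    return math.sqrt(0.5 * sum(
        (math.sqrt(d1.get(i, 0.0)) - math.sqrt(d2.get(i, 0.0))) ** 2
        for i in set(d1) | set(d2)))

# Same distribution listed in a different order: distance 0.
dist = hellinger_sparse([(0, 0.4), (1, 0.6)], [(1, 0.6), (0, 0.4)])
```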

Best,
Radim

Harry Baker

Apr 4, 2016, 7:56:12 PM
to gensim
Hi, sorry I'm replying late, I've been on break.

Would comparing two sparse vectors return the same value as comparing their dense equivalents? When I calculate the Hellinger distance on dense vectors it returns values between 0 and 1, which I expected, but when I compare the equivalent sparse vectors I get values ranging from 1 to 50.

Radim Řehůřek

Apr 5, 2016, 5:02:40 AM
to gensim
Sure! The internal vector representation (sparse/dense) is just a technicality; the math works the same on either.

If two methods for "comparing" give you (noticeably) different results when run on sparse vs. dense vectors, that points toward a bug in the implementation of either the dense or the sparse "comparing" (or both).

Radim

Harry Baker

Apr 10, 2016, 8:07:40 PM
to gensim
Ok, thank you for clearing that up. I think I have a pretty solid implementation within a single model. The next step of our project involves comparing topics from different models, and I'm wondering whether there has been any work on that. I haven't been able to find anything on it, and I was wondering if it would even be possible.

It seems like it would be possible to create a combined ID list covering both models' vocabularies, and then rebuild each topic's vector against that master list.

Gordon Mohr

Apr 10, 2016, 11:40:09 PM
to gensim
I recently noticed an interesting paper that trains dense vectors for LDA topics into the same space as word-vectors. They then describe those LDA topics by the closest words, and suggest these better highlight what makes the topics unique. See:

"Topic2Vec: Learning Distributed Representations of Topics"

Conceivably, also, the distances between topic-vectors would then be a measure of their similarity. And if the topic-vectors were created in the same space, or projected into each others' space, comparison of topics from different models might also work.

There's no direct support for this in gensim, but it might be possible to get a similar effect by either: 

* In a Word2Vec session, generating alternate versions of text where words are (sometimes) replaced with tokens representing the topics. (This would very closely approximate their technique.)

* In a Doc2Vec session, supply doc-tags for texts that represent the top LDA topics of that text. (This isn't quite their technique, but might similarly push the topic-vectors to similar positions vis-a-vis nearby words.)

- Gordon

Harry Baker

Apr 11, 2016, 1:11:41 PM
to gensim
That article was very interesting, thank you for showing it to me. In the long run topic2vec would be perfect for our project. I have to admit that I'm struggling to understand a lot of the underlying math behind how topic2vec would work, or how I could mimic topic2vec using LDA and word2vec. 

If I understand you correctly, I would need to train a word2vec model at the same time as the LDA model, using the same inputs for each? Or would I train a word2vec model using the topics from the LDA model as documents? I'm not completely sure what you mean by "generating alternate versions of text". How would replacing words with a token representing a topic work, since words are vectors and topics are probability distributions?


Harry

Gordon Mohr

Apr 11, 2016, 6:50:40 PM
to gensim
If I understand the topic2vec paper properly, they're doing LDA first, completely independent of word-vectors. At the end of that process, they have LDA topics, and scores indicating words most associated with certain topics. 

They then do a word2vec training, on the same texts, but every place where word2vec would normally attempt to predict a word, they *also* try to predict the LDA topic most associated with the word. 

So let's say a training sentence was "The cat leapt to the branch". And let's say that 'cat' was found most-associated with LDA 'topic_3', and 'leapt' most-associated with 'topic_27', and 'branch' most-associated with 'topic_11'. Their word2vec training is then somewhat like, instead of just training on the original raw sentence "The cat leapt to the branch", training on all the alternate/expanded variants:

    The cat leapt to the branch
    The topic_3 leapt to the branch
    The cat topic_27 to the branch
    The cat leapt to the topic_11

(This may not be exactly right, but it's the same gist, so might get similar results via a mere preprocessing step on the corpus, without modifying the word2vec code. And the Yahoo queryCategorizr paper also seems to be doing something very similar.)

At the end of the process, all the `topic_#` pseudowords wind up with word-vectors, and their relative distances to other topics and other words may be useful in the same way word-vectors are. 
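That preprocessing step might look something like this (a sketch; the `word2topic` mapping from each word to its most-associated topic id is assumed to come from the LDA scores described above):

```python
def expand_with_topic_tokens(sentence, word2topic):
    """Yield the original sentence plus one variant per mapped word,
    with that word swapped for its 'topic_N' pseudoword."""
    yield list(sentence)
    for i, word in enumerate(sentence):
        if word in word2topic:
            variant = list(sentence)
            variant[i] = 'topic_%d' % word2topic[word]
            yield variant

# Hypothetical word -> most-associated-topic mapping from an LDA model.
word2topic = {'cat': 3, 'leapt': 27, 'branch': 11}
sentence = ['the', 'cat', 'leapt', 'to', 'the', 'branch']
variants = list(expand_with_topic_tokens(sentence, word2topic))
```

Feeding `variants` (for every sentence in the corpus) to an unmodified Word2Vec reproduces the four training lines above.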

The Doc2Vec-like way to approximate the same effect could be to supply the top LDA topics as extra tags on TaggedDocument examples. So instead of vanilla Doc2Vec examples like:

    TaggedDocument(tags=['doc_7'], words=['the', 'cat', 'leapt', 'to', 'the', 'branch'])

...you'd have instead...

    TaggedDocument(tags=['doc_7', 'topic_3', 'topic_27', 'topic_11'], words=['the', 'cat', 'leapt', 'to', 'the', 'branch'])

Hopefully, the various 'topic_#' tags should again arrange in some useful constellation. If using a Doc2Vec mode that co-trains words in the same vector-space (`dm=1` or `dm=0, dbow_words=1`), they might also arrange in a way that renders them comparable to words. And per the claims of the topic2vec paper, those words closest to topic-vectors might be more helpful in understanding the subtle differences between LDA topics. 

Not sure any of these approaches would offer great results – just that they're plausible, given what's suggested by the topic2vec and queryCategorizr results. 

- Gordon

Francisco Pereira

Apr 11, 2016, 9:47:27 PM
to gen...@googlegroups.com
Thank you very much for the very clear explanation and example, Gordon! This actually helps with a completely different research purpose (in brain imaging, of all places :).

Francisco


Harry Baker

Apr 12, 2016, 12:32:26 PM
to gensim
Ok, awesome, that makes a lot of sense. So when you train on "The topic_3 leapt to the branch", are you literally using the string 'topic_3' as the replacement word? Or are you inputting some other manifestation of the topic itself?

Gordon Mohr

Apr 12, 2016, 7:09:00 PM
to gensim
You'd be aiming to provide Word2Vec with a transformed corpus – and Word2Vec only takes (as training data) lists-of-strings. So yes, you'd literally replace the original word-string with some new pseudoword-string, like 'topic_3'. 

You may want to use a convention that makes them stick out from real words even more – '#topic_3#', whatever. But at heart this tactic is just tricking unchanged Word2Vec code into modeling your other artifacts as if they were words, by mixing them into the original texts as actual string words. 

- Gordon

Nicholas Ampazis

Apr 15, 2016, 3:57:33 AM
to gensim
You might also want to check out lda2vec https://github.com/cemoody/lda2vec

Harry Baker

Apr 18, 2016, 8:34:40 PM
to gensim
Thank you for the link Nicholas, I'll definitely take a look into that.

I do have some questions about how I can grab the topic most associated with a given word. I've tried using `ldaTopic[word]`, but I'm getting errors that my input isn't in a proper BoW format. I don't understand why this is happening, because it throws an error even when I give it the id of the word in the model's dictionary. What's the proper way to convert a word to a BoW format that the model will recognize? Can I grab the id value from the model itself, or do I need to convert it using the dictionary the LDA model was created from?

Harry

Timothy James

Mar 27, 2018, 3:05:44 PM
to gensim
Harry, Did you figure out a solution to this? I'd be interested in seeing how you did it. Thanks!