Best method to compare topic similarity across multiple models/corpora

Raphaël Millière

Sep 3, 2017, 11:31:19 AM
to gensim
Hello everyone,

I've been building LDA models of narrative reports in natural language for a research project. I have several smallish corpora (from 200 to 1,400 docs each – I know, that's tiny!) that I'd like to compare, but I don't know how to do that beyond looking at each LDA model separately (for instance with pyLDAvis). My academic background is not in CS, and I'm still a bit new to NLP.


What are some good ways to compare topics across corpora/topic models? For instance, is it possible to estimate how much two LDA models overlap? Or are there other ways to assess the topic similarity of several corpora?


Thanks in advance for your help!

jhop

Sep 5, 2017, 2:23:43 PM
to gensim
There are at least a few different similarity/distance measures that can be applied to two different vectors. In this case, your vectors are topic models, which are composed of <word, weight> pairs.

See here for a list of different similarity measures: https://en.wikipedia.org/wiki/Category:Similarity_and_distance_measures

A common one to use for topic models is cosine similarity, but you could choose another based on your needs. For LDA topic models, cosine similarity will yield a score in the range [0, 1], with 0 meaning dissimilar and 1 meaning identical.

There is some hand-waving going on here, in that cosine similarity doesn't take vector magnitudes into account. For example, the lines described by the endpoints ((0,0), (1,0)) and ((0,0), (2,0)) on an x,y Cartesian plane make the same angle of 0 degrees and have a cosine of 1, so their similarity will be 1, even though the first line has length 1 and the second has length 2.

Another issue: say you have one topic model vector <(boat, 0.50), (water, 0.50), (sunken, 0)> and another <(boat, 0.33), (water, 0.33), (sunken, 0.33)>. Note the significant difference for the word "sunken" – the first vector essentially doesn't include "sunken" at all, while the second one does. It is up to you whether you want to count that as a difference, and how much weight to give it.

All similarity measurements are statistical measures that should not be divorced from your reasoning about the questions you're asking and the answers you expect. You may want to experiment with different similarity measures on small, toy topic vectors that you make up, or use more than one measure. This is the intersection of art and statistics.
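To make that last point concrete, here's a minimal, untested sketch in plain numpy (assuming the word order (boat, water, sunken) for both vectors):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Word order: (boat, water, sunken)
topic_a = [0.50, 0.50, 0.00]
topic_b = [0.33, 0.33, 0.33]

print(cosine_similarity(topic_a, topic_b))  # ~0.82, despite the mismatch on "sunken"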

If you have only one corpus and 100 different topic models for that corpus, and want to compare each topic model to each other topic model within that single corpus, that would result in 100 * 99 / 2 = 4,950 unique topic model comparisons (or 100 * 101 / 2 if you also count comparing each topic model with itself). You might assume that LDA topic model comparisons within the same corpus should yield very little overlap, but you might be surprised to find the contrary.

If you have two corpora and 100 different topic models for each corpus, and want to compare each topic model in the first corpus to each other topic model within the second corpus, that would result in 100 * 100 unique topic model comparisons.

After you have a full set of comparisons, you could threshold and then rank-order the similarities. For example, choose a similarity threshold of 0.85, meaning that 0.85 is "similar enough" to warrant calling two topic models similar in colloquial terms, and then rank order similarities like "Topic 1 from Corpus A is most similar to Topic 27 from Corpus B", etc. Something more complex: determine presence or absence of topic model clusters.
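For example, something along these lines (untested sketch; `sim_matrix` stands for whatever topic-by-topic similarity matrix you end up computing, with rows for Corpus A topics and columns for Corpus B topics):

import numpy as np

def ranked_matches(sim_matrix, threshold=0.85):
    # Keep only pairs at or above the threshold, then sort from most to least similar.
    pairs = []
    for i, row in enumerate(sim_matrix):
        for j, sim in enumerate(row):
            if sim >= threshold:
                pairs.append((i, j, float(sim)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Made-up 3x3 matrix, just to show the output format.
sim_matrix = np.array([[0.91, 0.40, 0.10],
                       [0.20, 0.88, 0.30],
                       [0.05, 0.15, 0.60]])
for a_topic, b_topic, sim in ranked_matches(sim_matrix):
    print(f"Topic {a_topic} from Corpus A ~ Topic {b_topic} from Corpus B (cosine {sim:.2f})")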

Raphaël Millière

Sep 5, 2017, 4:17:42 PM
to gensim
Thanks a lot, that's very helpful!

Would you be aware of any tutorial/example of this? I'm not very advanced with python, so I'm not sure how to iterate cosine similarity measurements (or another similarity measure) for all topics of model 1 with all topics of model 2.

In addition, I wonder if it might not be simpler to do one of the two following things:
  • Doing one big topic model of all corpora together, using the new author-topic modeling implementation. As noted in the blog post: “The ‘authors’ can represent any kind of label on documents, in particular when there are several labels per document and several documents per label; for example, tags on internet posts could be used.” So instead of comparing several topic models, I could compare the topic distribution for documents across labels (labeling each sub-corpus).
  • Doing a dynamic LDA model, with my sub-corpora as labels (instead of temporal labels). Taking inspiration from this: “one advantage of this method is that it is easy to aggregate by a factor other than time… to see the overall distribution of topics within different subsets of the corpus.” See in particular this graph.
Do you think either of these solutions would work?
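From the author-topic tutorial, I imagine the first option would look roughly like this (untested sketch; `docs` and `labels` stand in for my tokenized reports and the name of the sub-corpus each one belongs to):

from gensim.corpora import Dictionary
from gensim.models import AuthorTopicModel

# docs: list of tokenized documents; labels: sub-corpus name for each document
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Map each label ("author") to the indices of its documents.
author2doc = {}
for idx, label in enumerate(labels):
    author2doc.setdefault(label, []).append(idx)

# num_topics=20 is just a guess here.
model = AuthorTopicModel(corpus=corpus, num_topics=20,
                         id2word=dictionary, author2doc=author2doc)

# Topic distribution per sub-corpus label, which could then be compared.
for label in author2doc:
    print(label, model.get_author_topics(label))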

jhop

Sep 6, 2017, 3:32:16 PM
to gensim
See this post for an implementation of pairwise cosine similarity in python: https://stackoverflow.com/questions/18424228/cosine-similarity-between-2-number-lists
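Adapted to your case, a rough, untested sketch of the all-pairs comparison between two gensim LDA models might look like this (the file names are placeholders, and it assumes both models were trained with the *same* Dictionary so that their topic-term columns line up; otherwise you would first have to map them onto a shared vocabulary):

import numpy as np
from gensim.models import LdaModel

lda_a = LdaModel.load("model_corpus_a")   # placeholder file names
lda_b = LdaModel.load("model_corpus_b")

topics_a = lda_a.get_topics()             # shape: (num_topics_a, num_terms)
topics_b = lda_b.get_topics()             # shape: (num_topics_b, num_terms)

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Rows of the result: topics of model A; columns: topics of model B.
sim_matrix = normalize(topics_a) @ normalize(topics_b).T

for i in range(sim_matrix.shape[0]):
    j = int(np.argmax(sim_matrix[i]))
    print(f"Topic {i} of model A is closest to topic {j} of model B "
          f"(cosine {sim_matrix[i, j]:.2f})")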

I'm not very familiar with author-topic modeling. This seems to be the seminal paper, published in 2004, so it's a bit old by now: http://www.datalab.uci.edu/author-topic/398.pdf

Section 4.2, "Evaluating Predictive Power", in that paper was interesting to me.

If I understand correctly (I'm not sure that I do), author-topic modeling in the manner you described will blend all corpora together as an early step, effectively creating a single corpus, and then at least partly identify topic models that would help generate *all* the corpora you included *together* at the same time (the topic-modeling part of author-topic modeling), while also "training" another set of author/corpus models based on the author/corpus labels. Something seems fishy/indirect to me about this approach given your initial goal. In particular, I don't like the idea of blending the corpora together. But it might be an informative experiment nonetheless.

In my opinion, dynamic LDA in the manner you described, if I understand correctly, would also blend all corpora together as an early step, so my opinion about it is the same as for author-topic modeling. Dynamic LDA was also designed to account for the time dimension. I'm naturally suspicious of using an approach originally designed to satisfy a purpose other than the one I need (use only in an emergency or for fun), but that's not a strict rule.

If either of the other two options you're considering appeals mostly because you don't know exactly how to do pairwise cosine similarity, my advice is to learn how to do pairwise cosine similarity. You will have an easier time explaining your results – "easier" is relative, not necessarily "easy". There is already enough conceptually complex "magic" within LDA to make the interpretation and value of comparing LDA topic models between two different corpora questionable, without adding the additional magic of author-topic modeling or dynamic LDA. Simply tuning standard LDA is not straightforward, as you have already experienced (small corpora, number of topics, etc.).

R. M.

Sep 6, 2017, 7:21:49 PM
to gensim
Many thanks for this detailed reply once again!

The reason why I'm considering doing one topic model of the total corpus (including all subcorpora that I want to compare) is that I'm worried that several independent topic models of different subcorpora might be difficult to compare reliably. For instance, in this thread, Gordon Mohr says about doc2vec models that 

Models trained on different datasets aren't directly or easily comparable, because there's quite a bit of randomness in initialization and training. There's no 'true' vector for a word or document – just some useful value that's hammered out via the iterative training, in tug-of-war against all the other training examples. So only vectors trained inside the same session/model are directly comparable. The best approach would likely be to train a model on the union of all your texts – making all the vectors pairwise comparable. Then, also, comparisons of average of subsets of vectors may be meaningful.

Is LDA any different in this regard? You seem to think that training separate models on the different subcorpora would be a better way to compare them than training one big model and looking at the top topics of each subcorpus, for instance. Could you explain why you think that, and why Gordon's worry in the quote does not apply here?

Let me give you a more concrete idea of my research project. I have scraped a database of narrative reports of people who have taken psychoactive drugs. I have 20 subcorpora, each corresponding to all the reports for a specific drug. My end goal is to compare the effects of each drug with LDA, given that LDA models of this corpus/subcorpora rather reliably include topics related to various kinds of effects (visual, auditory, cognitive, etc.).

Another thing to consider, in my case, is that my overall corpus is quite small. The biggest substance-specific subcorpus is a little over 1000 docs. The smallest are around 100 docs. Thus, unsurprisingly, training an LDA model over the total corpus (every drug) gives better topics, because the input data is larger. I get OK topics by training a different model for each substance-specific subcorpus, but topics are not as clear-cut and there is more noise.

Given all this, would you still recommend going with different LDA models and calculating cosine similarity? One issue is which topics of each model to compare, if I have different models for different drugs, in order to get a real grasp on how the effects of the drugs differ. 

Thanks in advance, this thread has already been very helpful for me.

jhop

Sep 7, 2017, 5:11:55 PM
to gensim
1. It's important to note there is a significant difference between the way doc2vec works, and the way that LDA works, and for that matter the way that LSI works. Statements made in the context of one approach may not be valid in another.

2. Imagine you have a jar filled with red and blue marbles. There are 1000 marbles mixed up in the jar in total, with 900 red and 100 blue. If you sample marbles one at a time, you have roughly a 90% chance of picking a red marble and a 10% chance of picking a blue marble. Red marbles will dominate.

This is a gross simplification of what LDA does with a corpus: it samples and associates words at the document level in order to produce a set of topic models, and then progressively adjusts the topic models and topic model weights across the corpus (both the words within topic models and the topic models themselves possess weights), in order to reverse-engineer a procedure that could reproduce, from scratch, "documents" similar to those found in the corpus, given the topic models and their weights. Again, this is a gross simplification, just to give a flavor of what's happening.

Now, if you have a single/combined corpus with 1000 documents, where 900 documents address economics and 100 documents address physics, and apply standard LDA, economics will dominate the set of topic models and topic model weights across the corpus. Physics would be mostly washed out. A corpus re-generated from the LDA topic models would look similar to the original corpus, except the words within each document would be totally jumbled. Does this mean that physics topic models would be absent? No. They just won't be nearly as "good" as the economics topic models, because of the relative difference in the amount of data, and the problem is exacerbated by the fact that LDA can't tell the difference between the two subjects (economics, physics) a priori, since documents from the two topic areas were put in the same corpus. Some re-generated documents would also be likely to contain content from both economics and physics.

3. You cannot escape the sampling problem above without normalization of some sort between your topic domains (in your case, narratives about the effects of different drugs, separated by drug), and the obvious approach for providing that normalization is to have one corpus for each drug, with corresponding LDA topic models.

4. In order to be effective, LDA requires far more data than you seem to have on hand. There is no hard rule, but I personally would not use LDA on anything less than 10k documents at a bare minimum, where the documents have significant length. For example, tweets would be absolutely horrible. Typical news articles would be so-so.

5. I understand the temptation to fit your problem and data to the tools and methods on hand (LDA and its kin), but it's far easier to defend a decision to fit your tools and methods to the problem and data. You seem to have "little data". LDA requires "big data", or at least bigger than what you have. If it were my project, based on what you've written, I'd take an entirely different approach and feed the documents through a decision tree classifier like C4.5, which would allow you to reason about why each document is predicted to belong to a particular drug, and not another drug (identify discriminator words or groups of discriminator words between effects of drugs), and accommodate your relatively small dataset. Such a "topic model" comparison need not be quantitative - it could be binary or qualitative. C4.5 does not output "topic models" per se.
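To make that concrete, here's a rough, untested sketch using scikit-learn (note the swap: sklearn's DecisionTreeClassifier is CART rather than C4.5, but it serves the same purpose of exposing discriminator words; `texts` are the raw report strings and `labels` the drug names, both placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

vectorizer = CountVectorizer(stop_words="english", min_df=5)
X = vectorizer.fit_transform(texts)

# max_depth is an arbitrary cap to keep the tree readable.
clf = DecisionTreeClassifier(max_depth=10, random_state=0)
clf.fit(X, labels)

# The printed rules show which words (and count thresholds) split one drug's reports from another's.
print(export_text(clf, feature_names=list(vectorizer.get_feature_names_out())))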

Gordon Mohr

Sep 7, 2017, 7:43:23 PM
to gensim
I'd echo everything jhop said. 

For a Doc2Vec approach, in order for different doc-vectors to be comparable, they'd have to be part of the same interleaved training – so you'd definitely want to shuffle all your docs together into one combined corpus, rather than treating them as 20 different corpora/models. The combined corpus (<20,000 docs) still sounds kind of small, so if there's any way to get additional diverse documents of similar domain value, that could help a lot.

If you get a doc-vector per document, and all goes well, you *might* then find that the average of all vectors for one drug has some interesting distance/arrangement relationships with the other drug-averages, or that documents that are closer to other drugs' clusters, or whose nearest neighbors are mostly other drugs, give hints about essential similarities or contrasts. (I'd even imagine many of the reports include comparative references to other drugs' effects, blurring somewhat the 1:1 relationship of doc-to-drug.)
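Purely as a sketch of that averaging idea (untested; `docs` is a list of tokenized reports and `drug_labels` the drug name for each, both placeholders; the attribute is `dv` in gensim 4.x, `docvecs` in older versions):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each document by its index so its vector can be looked up later.
tagged = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(docs)]
model = Doc2Vec(tagged, vector_size=100, min_count=5, epochs=20)

# Average the doc-vectors of each drug to get a rough per-drug "centroid".
centroids = {}
for drug in set(drug_labels):
    idxs = [i for i, d in enumerate(drug_labels) if d == drug]
    centroids[drug] = np.mean([model.dv[str(i)] for i in idxs], axis=0)

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pairwise similarity between drug centroids.
drugs = sorted(centroids)
for i, a in enumerate(drugs):
    for b in drugs[i + 1:]:
        print(a, b, round(cos(centroids[a], centroids[b]), 3))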

To the extent some categories might be underrepresented – fewer, smaller docs and thus far less influence on the final model's representativeness – you *might* consider trying to overweight those docs, for example by repeating them someplace else in the corpus. (Not sure that'd help, but it'd be worth experimentation once you have some repeatable, quantitative way to evaluate to what extent your process is delivering expected modeling power.)

Separately: the "Word Mover's Distance" (WMD) calculation models texts by the vectors of their individual words – with word-vectors perhaps imported from elsewhere – and can then do an impressive job of detecting very similar expressions that use different words (given that it's still all just bootstrapped from co-occurrences). The best demo of it I've seen compared restaurant reviews; your drug narratives might be similar. WMD for pairs of documents is expensive to calculate, and more so for longer docs – but maybe with a small corpus that's manageable.
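A minimal, untested sketch of a WMD comparison in gensim (the pretrained-vector file is just an example; gensim's wmdistance also needs an extra dependency installed, pyemd or POT depending on the version):

from gensim.models import KeyedVectors

# Any pretrained word vectors will do; this file name is just one common example.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

doc1 = "the visuals were intense and colorful".split()
doc2 = "i saw vivid shifting colors everywhere".split()

# Note this is a distance, not a similarity: smaller means closer.
print(wv.wmdistance(doc1, doc2))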

Pairwise WMD comparisons and "most-similar" lists/clusters between docs (or sentences, or even sub-sentence phrases) might help indicate common themes within each category, or between categories. Highly speculatively, similarities of sentences/fragments to some generic hand-crafted archetype fragments like "seeing"/"feeling"/"hearing"/etc *might* also help categorize descriptions along those axes. 

- Gordon

R. M.

Sep 8, 2017, 10:25:52 AM
to gensim
Thanks a lot to you both, that's enlightening.

I've uploaded a visualization of an LDA model of my whole corpus (> 9000 docs, non drug-specific) here if you want to get an idea of what it looks like. As you can see, some topics seem coherent (interior environment of the trip, visual effects, profound effects on consciousness, bad trips, etc.). However, certain drugs dominate the corpus, such as salvia and mushrooms, thus skewing the model. And for my purposes (comparing the effects of drugs) I now understand that LDA probably won't do.

I'm very intrigued by WMD. I suspect the restaurant rating example you mention is this. It's very impressive indeed, and it could potentially work very well to cluster documents by drug effect, then see how this correlates with the corpus (drug) to which documents belong. 

I've seen that WMD is now implemented in textacy using Spacy, maybe that would be more efficient than the Gensim implementation?

(On a side note, I found this post by Sujit Pal — building on that one by Spacy creator Matthew Honnibal — intriguing. Do you think something like that could work with my 9000 docs corpus?)

Sorry for asking so many questions. I'm new to NLP and CS is not my background, but I'm very eager to learn how to use these tools in my field (cognitive science & philosophy of mind) to compare narrative reports without relying too much on human qualitative analysis. I've already learned a lot from this thread.

Ivan Menshikh

Sep 8, 2017, 11:45:17 AM
to gensim
Hello,

You can look at the LDA model difference example, the topic network and topic cluster visualizations, and the TensorBoard visualization for inspiration; all of these are already implemented in gensim.
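For instance, a minimal, untested sketch of the model-difference idea (file names are placeholders; it's safest if both models were built with the same dictionary):

from gensim.models import LdaModel

lda_a = LdaModel.load("model_corpus_a")
lda_b = LdaModel.load("model_corpus_b")

# Topic-by-topic distance matrix between the two models, plus word-level annotation.
mdiff, annotation = lda_a.diff(lda_b, distance="jaccard", num_words=50)
print(mdiff.shape)   # (num_topics_a, num_topics_b) matrix of topic distances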

Gordon Mohr

Sep 8, 2017, 5:57:28 PM
to gensim
FYI, the visualization link you gave doesn't render in the browser. 

I don't know anything about Spacy's WMD implementation; it's probably good given that project's focus on efficient code, but you'd have to try it to know how it compares against gensim.

Same goes for the other 'embed/encode/attend/predict' suggestions; you'd have to try it against your corpus and goals. Deeper networks often need more data, which might be a concern, but the post seems to report its demo results on the not-very-big 20newsgroups set (which is 18,000 docs, not the 180,000 mentioned there).

- Gordon

R. M.

Sep 8, 2017, 6:41:07 PM
to gensim
Thanks! This link should work for the LDA model of the whole corpus. Any feedback on that would also be useful.

I'll try WMD with Spacy and Gensim, as well as the embed/encode/attend/predict if I can get it to work.

R. M.

Sep 8, 2017, 6:43:18 PM
to gensim
Thanks a lot Ivan! That's terrific! I'll have a look at all of your links.

Hannes Kettner

Sep 18, 2017, 4:39:26 PM
to gensim
Hi Gordon,

First off, thanks a lot for your detailed replies! Having also been involved in the project described here (as a first experience with NLP), I find it incredibly valuable to have this kind of feedback.
 
It would be possible to obtain another 15k reports in which either a drug was used that is not among the 20 most frequently used, or more than one drug was used at a time. These reports would hence be similar to our drug-specific corpora, but for now we are not interested in analysing them. However, considering your remark:

The combined corpus (<20000 docs) still sounds kind of small, so if there's any way to get additional diverse documents of similar domain value, that could help a lot. 
 
they might come in handy as part of the training corpus.

Now, as I understood it, the doc2vec approach you suggested would look like this:
  • training a model on all reports (ca. 25k, of which 10k would be of interest),
  • retrieving doc-vecs for all docs of interest,
  • averaging doc-vecs for each class of drug,
  • comparing these averages to the other drug-averages.
This had me wondering whether it might be possible, as a next step, to perform some kind of feature analysis on the most diverging vector dimensions (I wasn't able to find anything on this). The goal would be not only to obtain a similarity measure, but also to explore in which ways the corpora actually differ. Hoping for at least some of the doc2vec dimensions to be anywhere near as interpretable as LDA topics is probably unrealistic?

Clustering words into concepts as a prior step to classifying documents based on the weights of the resulting concepts, as was done in this publication, looks like an interesting way to add interpretability to the representational strength of doc2vec, but our corpora seem too semantically similar, too small, and too different in size for such an approach. Are there any other ways to add interpretability to doc-vector representations?
Thanks in advance!

Best
- Hannes

Gordon Mohr

Sep 18, 2017, 9:32:58 PM
to gensim
You're welcome!

Adding more documents is among the things that *may* help, as long as the documents help 'shade in' the same variations of word meaning, even if they aren't as cleanly categorized as the main set of interest. But you'd have to test both ways to be sure.

After you've created doc-vecs for each document, averaging all the documents of the same substance-category might be one interesting technique, to find a sort of 'centroid' for that group. But other approaches might also be of interest. You might go document-by-document: calculate the preponderance of N-nearest-neighbors' categories, for each document, and see which categories tend to 'border' each other that way. If your training method co-trains word-vectors, and the word-vectors become strong enough, the word-vector to doc-vector similarities between foundational concepts (like 'vision'-'see' etc) and categories *might* become meaningful.
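A rough, untested sketch of that document-by-document tally, continuing the earlier Doc2Vec sketch (documents tagged by their index, with `drug_labels` as the parallel list of drug names):

from collections import Counter

N = 10
border_counts = {drug: Counter() for drug in set(drug_labels)}

# For each report, tally which drugs its N nearest neighbors belong to.
for i, drug in enumerate(drug_labels):
    neighbors = model.dv.most_similar(str(i), topn=N)   # [(tag, similarity), ...]
    for tag, _sim in neighbors:
        border_counts[drug][drug_labels[int(tag)]] += 1

# Which categories tend to "border" each other.
for drug, counts in border_counts.items():
    print(drug, counts.most_common(3))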

The individual orthogonal coordinate dimensions of a Doc2Vec vector are *not* typically as cleanly interpretable as LDA topics. (That is, there's nothing forcing any one (1, 0, 0, …, 0)-direction-from-origin to map to a human concept.) But other relative angles, or neighborhoods, in the full space might vaguely map to human descriptions, like sentiments or senses or themes. See for example the "document analogies" example in the "Document Embedding With Paragraph Vectors" <https://arxiv.org/abs/1507.07998> paper, which finds a neighborhood of Japanese pop-music stars by starting with a US star, then 'subtracting' an 'American' direction and 'adding' a 'Japanese' direction. Whether your data can induce similar relations is something you'd have to test in your own experiments.

I haven't looked closely at the paper you link, but from a skim it looks like it shifts the modeling to again be more like LSI or LDA, where documents are mixtures of word-defined clusters. Those models may be better for some purposes – it all depends on the corpus and desired scale/aspects of analysis.

- Gordon

Hannes Kettner

Sep 19, 2017, 3:35:49 PM
to gensim
Thank you once more for your input – the publication you linked has exactly what I had been looking for! Time and again it blows my mind how semantic and even pragmatic human concepts can be represented and operated on in a vector space... Even mixing word vectors and paragraph vectors – all the possibilities this opens up! Can't wait to try this out!