LDA versus LSA for computing document similarities


Alejandro

Nov 2, 2011, 3:28:00 PM
to gen...@googlegroups.com
I am trying to use LDA to compute the similarity between documents, and I am observing results that are less satisfactory than the ones I get when I use LSA. I am basically following the steps described in the "Experiments on the English Wikipedia" tutorial (the actual code I'm using is more complicated, but what follows captures the essence of it):

- Run wikicorpus.py using the English Wikipedia dump. I get the following files:

enwiki_tfidf.mm
enwiki_tfidf.mm.index
enwiki_bow.mm
enwiki_bow.mm.index   
enwiki_wordids.txt

- Load the id2word dictionary and the corpus in TF-IDF and BOW format

from gensim import corpora, models, similarities

id2word = corpora.Dictionary.load_from_text('enwiki_wordids.txt')
mm_tfidf = corpora.MmCorpus('enwiki_tfidf.mm')
mm_bow = corpora.MmCorpus('enwiki_bow.mm')

- Build the TFIDF, LSA and LDA-online models

model_tfidf = models.TfidfModel(mm_bow, id2word=id2word, normalize=True)
model_lsi = models.lsimodel.LsiModel(corpus=mm_tfidf, id2word=id2word, num_topics=400)
model_lda = models.ldamodel.LdaModel(corpus=mm_tfidf, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)

- Create the index with the document I want to compare to

text_filter = corpora.wikicorpus.filter_wiki(open('army.wiki').read())
text = corpora.wikicorpus.tokenize(text_filter)
corpus_ind = [model_tfidf[id2word.doc2bow(text)]] # text is a tokenized version of the document I want to compare to
index_lsi = similarities.MatrixSimilarity(model_lsi[corpus_ind])
index_lda = similarities.MatrixSimilarity(model_lda[corpus_ind])

- Compute the similarity between the document in the index and the other documents

import string

model = model_lsi # choose between model_lsi and model_lda
index = index_lsi # choose between index_lsi and index_lda
for doc in docs: # docs contains a set of documents
    doc = doc.translate(string.maketrans("", ""), string.punctuation) # strip punctuation (Python 2)
    vec_bow = id2word.doc2bow(doc.lower().split())
    vec_model = model[model_tfidf[vec_bow]] # convert the query to model space
    sims = index[vec_model]
    print sims[0]

I put the 'army' article in the index, and compare it with an extract from the same article, an extract from the article 'Chile', and an extract from the article 'gun'. These are the results for both models:

LSA
Extract from army article -> 0.842584
Extract from Chile article -> 0.120402
Extract from gun article -> 0.253231

LDA
Extract from army article -> 0.115992
Extract from Chile article -> 0.046896
Extract from gun article -> 0.156798

I like the results I get using the LSA model: the article is very similar to an extract of itself, and is more similar to the 'gun' article than to the 'Chile' article. However, the similarities obtained using the LDA model are not as nice.

So my questions are:

- Am I doing something wrong here?
- Is LDA trickier to use for computing similarities between documents? I realize that I am using 100 topics in LDA versus 400 topics in LSA; might that explain the difference? I also read in another post that tuning the LDA hyperparameters is non-trivial.

Alejandro

Timmy Wilson

Nov 2, 2011, 7:35:34 PM
to gen...@googlegroups.com
> model_lda = models.ldamodel.LdaModel(corpus=mm_tfidf, id2word=id2word,
> num_topics=100, update_every=1, chunksize=10000, passes=1)

i'll take a stab at this

perhaps model_lda did not converge?

is there a way to test/quantify convergence?

Alejandro

Nov 2, 2011, 10:17:37 PM
to gen...@googlegroups.com
On Wednesday, November 2, 2011 5:35:34 PM UTC-6, Timmy Wilson wrote:
> > model_lda = models.ldamodel.LdaModel(corpus=mm_tfidf, id2word=id2word,
> > num_topics=100, update_every=1, chunksize=10000, passes=1)
>
> i'll take a stab at this
>
> perhaps model_lda did not converge?
>
> is there a way to test/quantify convergence?


By the end, the log says:

2011-11-02 03:54:06,754 : INFO : PROGRESS: iteration 0, at document #3268634/3268634
2011-11-02 03:54:35,442 : INFO : 8634/8634 documents converged within 50 iterations
2011-11-02 03:54:35,534 : INFO : merging changes from 8634 documents into a model of 3268634 documents

The word "converged" is there, but I am not sure if that means that the whole thing converged.

Alejandro.

Timmy Wilson

Nov 3, 2011, 7:07:09 AM
to gen...@googlegroups.com
how do we know how well the model (topics + topic assignments)
approximates the real documents?

there's convergence -- but i don't know if that's the same thing -- is
it possible to have convergence w/out having a good 'original document
approximation'?

matrix factorization methods try to minimize a cost function that
quantifies the topic + topic assignment approximation

is there anything similar for lda?

Radim

Nov 3, 2011, 12:44:25 PM
to gensim
Hi Alejandro,

thx for reporting on your experiments! Much appreciated.


> - Build the TFIDF, LSA and LDA-online models
> model_tfidf = models.TfidfModel(mm_bow, id2word=id2word, normalize=True)
> model_lsi = models.lsimodel.LsiModel(corpus=mm_tfidf, id2word=id2word,
> num_topics=400)
> model_lda = models.ldamodel.LdaModel(corpus=mm_tfidf, id2word=id2word,
> num_topics=100, update_every=1, chunksize=10000, passes=1)

The LDA model works over word counts (integers) -- but here, you run it
over tf-idf weights (real-valued). The underlying probability model
doesn't make sense this way... although it works numerically and may
even give good (better?) results. Just saying :-)


> LSA
> Extract from army article -> 0.842584
> Extract from Chile article -> 0.120402
> Extract from gun article -> 0.253231
>
> LDA
> Extract from army article -> 0.115992
> Extract from Chile article -> 0.046896
> Extract from gun article -> 0.156798

The similarity numbers are not necessarily comparable across methods.
By default, gensim uses cosine similarity (~the angle between the topic
vectors), so the range is guaranteed to be [-1, 1], but that doesn't
mean a score of 0.1 with one method means the same thing as 0.1 with
another. A score of 0.5 could mean "extremely similar" under one method
(with its internal parameters) and "no similarity to speak of" under
another. The absolute scores are only comparable within the same model.
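
For reference, that cosine can be computed directly on two sparse
gensim vectors (a minimal sketch; the two example documents here are
made up):

from gensim import matutils

vec1 = model_lsi[model_tfidf[id2word.doc2bow("the army".split())]]
vec2 = model_lsi[model_tfidf[id2word.doc2bow("a gun".split())]]
print matutils.cossim(vec1, vec2) # dot(v1, v2) / (|v1| * |v2|), always in [-1, 1]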

For LDA, you can also try using a different measure, like the Hellinger
distance = sqrt(0.5 * sum([sqrt(v1) - sqrt(v2)]^2)). See also here:
http://groups.google.com/group/gensim/msg/28d3a0d1d947b90a
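
For example, something along these lines (an untested sketch; `lda` is
your trained LdaModel, and `bow1`, `bow2` are two documents in
bag-of-words format):

import numpy as np
from gensim import matutils

def hellinger_distance(lda, bow1, bow2):
    # densify the sparse (topic_id, probability) vectors the model returns
    dense1 = matutils.sparse2full(lda[bow1], lda.num_topics)
    dense2 = matutils.sparse2full(lda[bow2], lda.num_topics)
    # 0.0 = identical distributions; 1.0 = distributions with no overlap
    return np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2)) ** 2).sum())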

Best,
Radim

Radim

Nov 3, 2011, 1:23:39 PM
to gensim
Hi,

you typically measure convergence by observing the likelihood of some
(training or, better, held-out) data.

If you use the default algo in gensim, which is online LDA, on a
reasonably large dataset (at least dozens of updates, better hundreds
or thousands), you're fine. The online algo converges pretty quickly
and training until complete convergence doesn't make much sense for
online data anyway. In gensim, you can see the approximate variational
error with `LdaModel.bound()`.
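
For example (a rough sketch; `heldout.mm` is a hypothetical held-out
corpus that you set aside yourself before training):

heldout = corpora.MmCorpus('heldout.mm')
print model_lda.bound(heldout) # higher (= less negative) means a tighter fit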


> there's convergence -- but i don't know if that's the same thing -- is
> it possible to have convergence w/out having a good 'original document
> approximation'?


Timmy has also touched on a more general question of how to evaluate
the quality of the model (as opposed to just its convergence). There is
no good answer here, because unsupervised methods, aka clustering,
don't have -- by definition -- any gold-standard labels to compare
against... or else they would be supervised. That of course doesn't
stop people from trying and publishing papers on the topic :-) But
beware that clustering quality is intimately connected to the task
you're attempting to solve with the clustering, and generic
"clustering quality" scores have to be taken with a heap of salt
outside of their concrete evaluation setup.

Best,
Radim


Radim

Nov 3, 2011, 2:33:42 PM
to gensim
Oh and congrats Timmy on your post #666 on this mailing list! MUAHAHA
(devilish laughter :)

Alejandro

Nov 3, 2011, 3:40:30 PM
to gen...@googlegroups.com
On Thursday, November 3, 2011 10:44:25 AM UTC-6, Radim wrote:
> > - Build the TFIDF, LSA and LDA-online models
> > model_tfidf = models.TfidfModel(mm_bow, id2word=id2word, normalize=True)
> > model_lsi = models.lsimodel.LsiModel(corpus=mm_tfidf, id2word=id2word,
> > num_topics=400)
> > model_lda = models.ldamodel.LdaModel(corpus=mm_tfidf, id2word=id2word,
> > num_topics=100, update_every=1, chunksize=10000, passes=1)
>
> LDA model works over word counts (integers) -- but here, you run it
> over tf-idf (real-valued). The underlying probability model doesn't
> make sense this way... although it works numerically and may even give
> good (better?) results. Just saying :-)


OK. I am running the code again, this time using the mm_bow corpus. I'll report my results after it finishes.

Related to this: is what you have at http://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation correct? It seems you are using the tf-idf corpus to build the LDA model:

>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)


> > LSA
> > Extract from army article -> 0.842584
> > Extract from Chile article -> 0.120402
> > Extract from gun article -> 0.253231
> >
> > LDA
> > Extract from army article -> 0.115992
> > Extract from Chile article -> 0.046896
> > Extract from gun article -> 0.156798
>
> The similarity numbers are not necessarily comparable across methods.
> By default, gensim uses cosine similarity (~the angle between the topic
> vectors), so the range is guaranteed to be [-1, 1], but that doesn't
> mean a score of 0.1 with one method means the same thing as 0.1 with
> another. A score of 0.5 could mean "extremely similar" under one method
> (with its internal parameters) and "no similarity to speak of" under
> another. The absolute scores are only comparable within the same model.


However, if both models are "correct" (whatever that means), I think that if I sort the documents from most similar to least similar, I should get the same order for both models, which is not the case with these results (as you said before, I'm building the LDA model in the wrong way, so I understand that may be the problem). Am I right?

Alejandro.

Radim

Nov 4, 2011, 2:01:43 PM
to gensim
Hi Alejandro,


> > LDA model works over word counts (integers) -- but here, you run it
> > over tf-idf (real-valued). The underlying probability model doesn't
> > make sense this way... although it works numerically and may even give
> > good (better?) results. Just saying :-)
>
> OK. I am running the code again this time using the mm_bow corpus. I'll
> report my results after it finish.
>
> Related to this: is what you have at
> http://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation
> correct? It seems you are using the tf-idf corpus to build the LDA model:

Oh yes, I am using the same word weighting as you. My remark was meant
only as a theoretical disclaimer, in case you're not familiar with the
math behind LDA :)


> However, if both models are "correct" (whatever that means), I think
> that if I sort the documents from most similar to least similar, I
> should get the same order for both models, which is not the case with
> these results (as you said before, I'm building the LDA model in the
> wrong way, so I understand that may be the problem). Am I right?

No, the two models (LSA vs. LDA) don't necessarily produce the same
ranking of similar documents. The same pair of documents may have very
different similarity scores under LSA vs. LDA.

Can you post your new LDA topics, once the training is done? I'm
interested in seeing the difference.

Best,
Radim

Timmy Wilson

Nov 4, 2011, 10:18:33 PM
to gen...@googlegroups.com
> Oh and congrats Timmy on your post #666 on this mailing list! MUAHAHA
> (devilish laughter :)

thank you Radim -- and thank you gensim community for the prior 665
messages -- i couldn't have done it w/out you!

i've been going to therapy -- trying to curb my 'evilishness' --
http://www.youtube.com/watch?v=jMIDpJ8H7H0

> In gensim, you can see the approximate variational error with `LdaModel.bound()`.

Radim -- do you mind helping me understand what this method does -- i'm
trying to go through the code, but i'm too new -- a little guidance
would go a long way

Radim

Nov 5, 2011, 12:34:51 PM
to gensim
On Nov 5, 3:18 am, Timmy Wilson <tim...@smarttypes.org> wrote:
> > Oh and congrats Timmy on your post #666 on this mailing list! MUAHAHA
> > (devilish laughter :)
>
> thank you Radim -- and thank you gensim community for the prior 665
> messages -- i couldn't have done it w/out you!
>
> i've been going to therapy -- trying to curb my 'evilishness' --
> http://www.youtube.com/watch?v=jMIDpJ8H7H0


Haha. In this context, the truly Evil thing is ignoring all question
marks (among other things) in these naive bag-of-words models :)


> > In gensim, you can see the approximate variational error with `LdaModel.bound()`.
>
> Radim -- do you mind helping me understand what this method does -- i'm
> trying to go through the code, but i'm too new -- a little guidance
> would go a long way


The online LDA algo uses variational inference, which means it really
optimizes the wrong objective function (yuck). But it has its
advantages, one of which is that we can pick a cozy family of
distributions over which to optimize, making inference and assessing
convergence easier. What the `bound` method does is quantify the error
thus introduced, by means of a so-called "evidence lower bound".
Basically, increasing this bound means our approximate posterior came
closer to the true posterior. Once this quantity stops changing, we
might as well stop training (our variational parameters lead to a
distribution that is already as close as it can get to the true
posterior, for our choice of the variational family).
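
In symbols, the standard identity (nothing gensim-specific; q is the
variational distribution and p(z | x) the true posterior) is

  \log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) \ \geq\ \mathrm{ELBO}(q),

and since the KL divergence is non-negative, maximizing the ELBO over
the variational family is the same as minimizing the KL distance to the
true posterior.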

The particular method in gensim is based on Matt Hoffman's onlinelda
implementation. See the article "Hoffman, Blei, Bach: Online
Learning for Latent Dirichlet Allocation" for the concrete derivation.

HTH,
Radim


Alejandro

Nov 7, 2011, 8:12:23 PM
to gen...@googlegroups.com
On Friday, November 4, 2011 12:01:43 PM UTC-6, Radim wrote:
> Can you post your new LDA topics, once the training is done? I'm
> interested in seeing the difference.


These are the results. I trained the LDA model using the BOW corpus instead of the TFIDF corpus. I am computing the similarity using cosine similarity and Hellinger similarity. I am also reporting my previous LSA results.

####### LSA #######

Extract from army article -> 0.842584
Extract from Chile article -> 0.120402
Extract from gun article -> 0.253082

####### LDA with cosine similarity #######

Extract from army article -> -0.012667
Extract from Chile article -> 0.047606
Extract from gun article -> 0.040735

####### LDA with Hellinger similarity #######

Extract from army article -> 0.551207
Extract from Chile article -> 0.868623
Extract from gun article -> 0.707079

IMO, none of the LDA results looks good. I wonder if I need to increase the number of topics to improve them.

The code I used is available here: https://github.com/aweinstein/scrapcode/tree/master/gensim.

Alejandro.

Radim

Nov 8, 2011, 1:59:33 PM
to gensim
On Nov 8, 2:12 am, Alejandro <alejandro.weinst...@gmail.com> wrote:
> These are the results. I trained the LDA model using the BOW corpus instead
> of the TFIDF corpus. I am computing the similarity using cosine similarity
> and Hellinger similarity. I am also reporting my previous LSA results.
>
> ####### LSA #######
> Extract from army article -> 0.842584
> Extract from Chile article -> 0.120402
> Extract from gun article -> 0.253082
>
> ####### LDA with cosine similarity #######
> Extract from army article -> -0.012667
> Extract from Chile article -> 0.047606
> Extract from gun article -> 0.040735
>
> ####### LDA with hellinger similarity #######
> Extract from army article -> 0.551207
> Extract from Chile article -> 0.868623
> Extract from gun article -> 0.707079

Ah interesting, thanks for posting this.

Note that the Hellinger distance is a distance, not a similarity. With
a similarity, the greater the number the better -- with a distance it's
the opposite, and a distance of 0.0 is best. So after converting the
distance to similarity with sim = 1 - dist:

extract   LSA        LDA
army      0.842584   0.448793
Chile     0.120402   0.131377
gun       0.253082   0.292921

which isn't so bad.
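
In code, the conversion and re-ranking is just (a trivial sketch, using
the numbers from your post; sim = 1 - dist makes sense here because the
Hellinger distance is bounded by 1):

dists = {'army': 0.551207, 'Chile': 0.868623, 'gun': 0.707079}
sims = dict((name, 1.0 - dist) for name, dist in dists.items())
for name in sorted(sims, key=sims.get, reverse=True):
    print name, sims[name] # army 0.448793, gun 0.292921, Chile 0.131377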

What are you using the topic modelling for?

Best,
Radim

Alejandro Weinstein

Nov 14, 2011, 9:19:15 AM
to gen...@googlegroups.com
On Tue, Nov 8, 2011 at 11:59 AM, Radim <radimr...@seznam.cz> wrote:
> Note that the Hellinger distance is a distance, not similarity. With
> similarity, the greater the number the better -- with distance, it's
> the opposite, and 0.0 distance is the best. So after converting the
> distance to similarity with sim=1-dist:
>
> extract LSA        LDA
> army    0.842584   0.448793
> Chile   0.120402   0.131377
> gun     0.253082   0.292921
>
> which isn't so bad.

You're right! It looks better. I still like the LSA results more, but
I guess that's subjective.

> What are you using the topic modelling for?

So far I am doing some exploratory analysis. I want to see the
relationship between the links between articles and their vector space
representations.

Alejandro.

Radim

Nov 14, 2011, 4:54:12 PM
to gensim

> > What are you using the topic modelling for?
>
> So far I am doing some exploratory analysis. I want to see the
> relationship between links between articles and their vector space
> representation.

Nice; please share your findings when you're done -- that is an
interesting topic. (Is the goal to fight online spam?)

Radim

adalex

Aug 25, 2016, 11:18:00 AM
to gensim
Hi Radim,

I found this very old post, in which you say that the LDA model works with word counts, which are integers.


"LDA model works over word counts (integers) -- but here, you run it
over tf-idf (real-valued). The underlying probability model doesn't
make sense this way... although it works numerically and may even give
good (better?) results. Just saying :-)

"

I am interested to know if this statement is still valid, because most of the time I train topics after the corpus has been transformed with tf-idf, and the resulting numbers are real-valued.

I am also trying to understand how the topic distribution is assigned to a new document once the model is trained. For a new document I just have a list of tuples in the format (wd_idx, wd_tfidf). How are topics assigned to this new document?
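
In other words, I have something like this (a rough sketch of my
situation; the ids, weights and the `lda` variable are made up for
illustration):

new_doc = [(12, 0.41), (374, 0.19), (2051, 0.08)] # (wd_idx, wd_tfidf) tuples
topics = lda[new_doc] # gensim folds the document in and returns (topic_id, probability) pairs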

Andrea

hypnoticpoisons

Sep 1, 2016, 2:14:42 PM
to gensim
Do you have a good explanation, or is there a good tutorial on the math behind LDA? I understood LSA.