LSI vs LDA


PaulR

Jan 24, 2012, 2:46:51 AM
to gensim
Both LSI and LDA give a mechanism for determining topics from a
corpus, and expressing documents in terms of those topics. Both have
an online implementation (which I'm interested in at the moment).

I wondered if people had any feel for how they differ in terms of the
results?

Also - are there significant performance differences between the
gensim implementations of the two?

Radim

Jan 24, 2012, 3:06:49 AM
to gensim
On Jan 24, 8:46 am, PaulR <p...@rudin.co.uk> wrote:
> Both LSI and LDA give a mechanism for determining topics from a
> corpus, and expressing documents in terms of those topics. Both have
> an online implementation (which I'm interested in at the moment).
>
> I wondered if people had any feel for how they differ in terms of the
> results?

Visually: LDA topics typically "look better", hands down. More
coherent and easier to interpret.

Actual quality of doc-doc similarity: notoriously difficult to judge
objectively, but there doesn't seem to be a fundamental difference.
People on this mailing list have mentioned they actually preferred the
LSI results, but that's just anecdotal.

> Also - are there significant performance differences between the
> gensim implementations of the two?

Not "significant", but LSI is faster by a constant factor. LSI also
has the nice property that the topics are nested -- once you compute
400 topics, you get a model for 200 topics for free, with no extra
computation. This is not true for LDA, where you'll have to re-train
when changing the dimensionality.

HTH,
Radim

PaulR

Jan 24, 2012, 3:42:33 AM
to gensim
I do have a large quantity of (something like) doc-doc similarity
judgements for a large document collection so I might try to do some
measurements w.r.t. these judgements. It's actually slightly strange
data - it's basically a large collection of sets of documents from the
corpus - each set telling me that its members are somehow similar. But
they're not disjoint and it could be that A and B occur in one set,
but A might occur in other sets without B.

I guess the other thing is to make a similarity model that weights
topics according to agreement with these judgements and then try to
learn weights for the topics that minimize the error. Of course in
that case I could just start with the tfidf data, rather than
bothering with a learned topic model. But perhaps it's a more
tractable problem with fewer dimensions, and there are things in the
topics that are interesting.
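[Editor's note: the measurement Paul describes — scoring a similarity model against overlapping sets of "somehow similar" documents — could be sketched roughly like this. This is a hypothetical illustration, not code from the thread; the function names and the pairwise-comparison metric are made up, and `sim` stands in for whatever doc-doc similarity the topic model produces.]

```python
# Sketch of the evaluation idea: each judgement is a set of document ids
# whose members are "somehow similar". Pairs that co-occur in some set are
# positives; pairs that never co-occur are negatives. We count how often a
# similarity function ranks a positive pair above a negative one.
from itertools import combinations

def cooccurring_pairs(judgement_sets):
    """All unordered doc-id pairs that appear together in at least one set."""
    pairs = set()
    for s in judgement_sets:
        pairs.update(frozenset(p) for p in combinations(sorted(s), 2))
    return pairs

def pairwise_accuracy(judgement_sets, all_docs, sim):
    """Fraction of (positive, negative) comparisons the model ranks correctly."""
    positives = cooccurring_pairs(judgement_sets)
    negatives = [frozenset(p) for p in combinations(sorted(all_docs), 2)
                 if frozenset(p) not in positives]
    correct = total = 0
    for pos in positives:
        a, b = tuple(pos)
        for neg in negatives:
            c, d = tuple(neg)
            correct += sim(a, b) > sim(c, d)
            total += 1
    return correct / total if total else 0.0
```

Note this handles the non-disjointness Paul mentions for free: a pair counts as a positive if it co-occurs in *any* set, regardless of what other sets its members appear in.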

Radim

Jan 24, 2012, 4:10:13 AM
to gensim
Paul,

On Jan 24, 9:42 am, PaulR <p...@rudin.co.uk> wrote:
> I do have a large quantity of (something like) doc-doc similarity
> judgements for a large document collection so I might try to do some
> measurements w.r.t. these judgements. It's actually slightly strange
> data - it's basically a large collection of sets of documents from the
> corpus - each set telling me that its members are somehow similar. But
> they're not disjoint and it could be that A and B occur in one set,
> but A might occur in other sets without B.

I'm sure evaluation of topic quality would be highly appreciated by
many (including me). Just please be sure to describe the experiment
setup completely, because details matter.


> I guess the other thing is to make a similarity model that weights
> topics according to agreement with these judgements and then try to
> learn weights for the topics that minimize the error. Of course in
> that case I could just start with the tfidf data, rather than
> bothering with a learned topic model. But perhaps it's a more
> tractable problem with fewer dimensions, and there are things in the
> topics that are interesting.

Interesting idea! I guess it depends on how well this training data
correlates with your real goal, and how much time you have for original
research :)
I would definitely be interested in hearing the outcome.

Radim

PaulR

Jan 24, 2012, 4:37:59 AM
to gensim
I guess this has some utility in terms of practical applications. If
you're just doing document-document similarity then it doesn't matter
- the whole topic infrastructure can be a black box that end users are
completely unaware of. But if you're trying to explain similarity
results, doing explicit querying by topic, or (for example) doing some
kind of faceting on the results, then it's nice to have an
interpretation of the topics.

On Jan 24, 8:06 am, Radim <radimrehu...@seznam.cz> wrote:
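[Editor's note: the usual way to get the "interpretation of the topics" Paul mentions is to list each topic's highest-weighted terms. A minimal sketch of that idea in plain Python; `top_terms`, the vocabulary, and the weights below are made-up examples, not output from a trained model.]

```python
# Rank a topic's terms by absolute weight (LSI weights can be negative,
# so we sort on magnitude) and keep the top n as the topic's "label".
def top_terms(topic_weights, vocab, n=3):
    """Return the n vocabulary terms with the largest absolute weight."""
    ranked = sorted(range(len(vocab)),
                    key=lambda i: abs(topic_weights[i]), reverse=True)
    return [vocab[i] for i in ranked[:n]]

vocab = ["tree", "graph", "node", "cat", "dog"]
topic = [0.7, 0.6, -0.5, 0.1, 0.05]   # hypothetical topic-term weights
```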

Karel Antonio Verdecia Ortiz

Jan 24, 2012, 8:47:34 AM
to gen...@googlegroups.com
On 24/01/12 03:06, Radim wrote:

> LSI also has the nice property that the topics are nested -- once you
> compute 400 topics, you get a model for 200 topics for free, with no
> extra computation.

How can I get the nested topics?



Radim

Jan 25, 2012, 4:35:31 AM
to gensim
On Jan 24, 2:47 pm, Karel Antonio Verdecia Ortiz <kverde...@uci.cu>
wrote:
> On 24/01/12 03:06, Radim wrote:
>
> > LSI also has the nice property that the topics are nested -- once you
> > compute 400 topics, you get a model for 200 topics for free, with no
> > extra computation.
>
> How can I get the nested topics?

By setting `trained_lsi_model.num_topics = 200`, where 200 <= the
`num_topics` you used when training the model.

Best,
Radim
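[Editor's note: the trick works because LSI topics come out ordered by significance (singular value), so truncating a trained projection behaves like a smaller model. A toy illustration of that property in plain Python; `ToyLsiProjection` is made up for this sketch and is not gensim's implementation.]

```python
# Stand-in for a trained LSI projection: `topics` plays the role of the
# projection matrix rows, already sorted from most to least significant.
class ToyLsiProjection:
    def __init__(self, topics):
        self.topics = topics              # hypothetical pre-sorted topic vectors
        self.num_topics = len(topics)

    def transform(self, doc_vec):
        """Project doc_vec onto the first num_topics topic vectors."""
        return [sum(t_i * d_i for t_i, d_i in zip(t, doc_vec))
                for t in self.topics[:self.num_topics]]

full = ToyLsiProjection([[1, 0, 0], [0, 1, 0], [0, 0, 1]])  # "400-topic" model
full.num_topics = 2          # shrink to a "200-topic" model, no retraining
```

Because the topics are pre-sorted, lowering `num_topics` gives exactly the projection a smaller model would have produced, which is the nesting Radim describes.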

PaulR

Jan 25, 2012, 5:05:28 AM
to gensim
I could be wrong but I suspect that there's some confusion over what
"nested" means here.

Karel - there's no hierarchy of topics; it's just that the topics you
get from a 200-topic model are the same as the 200 most significant
topics from a 400-topic model trained on the same data.

Radim

Jan 25, 2012, 5:22:36 AM
to gensim
Ah, thanks for your warning, Paul. Indeed I meant "nested" in the
sense Paul describes, not in terms of any topics hierarchy (LSI topics
are all orthogonal by construction).

Best,
Radim

Karel Antonio Verdecia Ortiz

Jan 25, 2012, 8:20:45 AM
to gen...@googlegroups.com
On 25/01/12 05:22, Radim wrote:

> thanks for your warning, Paul. Indeed I meant "nested" in the
> sense Paul describes, not in terms of any topics hierarchy (LSI topics
> are all orthogonal by construction).
Thanks for your responses.