applying LSA/LDA on top of BOW or TFIDF?

Dieter Plaetinck

unread,
Jan 13, 2012, 7:57:58 AM1/13/12
to gensim
Hi,
I'm currently trying out LSA (and maybe LDA later, but first I want to
see how LSA goes).
However, I find it unclear whether LSA/LDA should be run on BOW corpora
or on TFIDF.

* http://radimrehurek.com/gensim/ demonstrates LSA on top of BOW,
while LSA is run on top of TFIDF at
http://radimrehurek.com/gensim/tut2.html#transformation-interface
Also, that page says "Latent Semantic Indexing, LSI (or sometimes LSA)
transforms documents from either bag-of-words or (preferably)
TfIdf-weighted space into a latent space of a lower dimensionality."

* In the discussion "LDA versus LSA for computing document similarities",
more specifically this post: http://comments.gmane.org/gmane.comp.ai.gensim/659
Radim confirms again that LDA should be run on BOW, despite the
corresponding official example running LDA on top of TFIDF:
http://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation
What's the reason?


Other than being somewhat confused by the docs, I guess my main
question is why it's preferred to do LSI on top of tfidf instead
of on BOW?

Dieter

Senthil

unread,
Jan 13, 2012, 1:01:45 PM1/13/12
to gen...@googlegroups.com
While we wait for Radim's answer, here is my experience:

I have tried LSA using both approaches (bow and tfidf). My experiments used a corpus of about 600K documents. I found the accuracy of tfidf surprisingly high (in terms of relevant results returned) compared with the bow approach. I guess tfidf is just a better, less noisy representation of a document than plain word counts.

On a related note, I am wondering how one computes tf-idf incrementally (if it is possible at all). I have asked about this in this thread: http://groups.google.com/group/gensim/browse_thread/thread/1ef91e336c0080c6

niefpaarschoenen

unread,
Jan 13, 2012, 7:29:28 PM1/13/12
to gensim
The idea of tf-idf is to remove the effect of function words from the
analysis. Function words typically show up a lot in all documents,
thus have a high document frequency and a low tf-idf. If your goal is
to find semantic relationships between content words, tf-idf is
definitely the way to go!

Computing tf-idf incrementally is not too hard. First you loop over all
documents and store the document frequency of every term. Then you loop
over all documents again and weight each term frequency by the
corresponding inverse document frequency. At least that's what I
understood from the TfidfModel code. No need to store the complete
matrix in memory.
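That two-pass scheme can be sketched in plain Python. This is a toy stand-in, not gensim's actual code (the function name is made up); I'm assuming gensim's default weighting, roughly count * log2(N/df) followed by L2 normalization:

```python
import math
from collections import Counter

def tfidf_two_pass(docs):
    """Two-pass tf-idf over `docs`, a list of token lists.

    Pass 1 streams the corpus once to collect document frequencies;
    pass 2 streams it again to weight each document. Neither pass
    needs the full term-document matrix in memory.
    """
    # Pass 1: document frequency of every term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n_docs = len(docs)

    # Pass 2: weight term counts by idf = log2(N / df),
    # then L2-normalize each document vector.
    out = []
    for doc in docs:
        w = {t: c * math.log2(n_docs / df[t])
             for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        out.append({t: v / norm for t, v in w.items()})
    return out
```

With streaming input you would replace `len(docs)` by a counter incremented in pass 1; the important point is that only the df table (one number per vocabulary term) is ever held in memory.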


Radim

unread,
Jan 15, 2012, 2:40:13 PM1/15/12
to gensim
Like the guys said, TF-IDF is a primitive heuristic that aims to improve
information retrieval by promoting content words. There are many
other similar ad-hoc (or less ad-hoc) transformations; just take your
pick.

Re. LDA -- in theory, it only works over plain bag-of-words counts
(integers); the theory doesn't make sense over floats. But the
float/integer distinction makes no difference to the LDA implementation
in gensim, so I tried it over tfidf too, and personally found the tfidf
results better :) But I didn't do any rigorous evaluation of this, so
ymmv; best to see how it behaves on your own data.

Best,
Radim

swhi...@choicestream.com

unread,
Jun 10, 2015, 11:49:31 AM6/10/15
to gen...@googlegroups.com
Stumbling onto this post. I also have a question about applying LDA on top of TFIDF. While I understand it doesn't make sense in "theory", I wanted to try it out and see the results. But when I go to project a new, unseen document onto the topics, I still need to apply the TFIDF transformation first, correct?
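To make the question concrete, this is the pipeline I have in mind, sketched in plain Python rather than with gensim's classes (the class name is made up for illustration): the idf weights are learned once from the training corpus and then reused unchanged on the unseen document.

```python
import math
from collections import Counter

class TfidfTransform:
    """Toy stand-in for a fitted tf-idf model (in the spirit of
    gensim's TfidfModel): idf weights are computed once from the
    training corpus and then reused on unseen documents before
    they are projected through the topic model."""

    def __init__(self, train_docs):
        self.n = len(train_docs)
        self.df = Counter()
        for doc in train_docs:
            self.df.update(set(doc))

    def __getitem__(self, doc):
        # Terms unseen in training, or present in every training
        # document (idf = 0), are simply dropped.
        w = {t: c * math.log2(self.n / self.df[t])
             for t, c in Counter(doc).items()
             if 0 < self.df[t] < self.n}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        return {t: v / norm for t, v in w.items()}
```

With gensim the analogous step, as I understand it, would be `tfidf_model[dictionary.doc2bow(new_doc)]` before handing the vector to the LDA/LSI model.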

swhi...@choicestream.com

unread,
Jun 10, 2015, 11:51:45 AM6/10/15
to gen...@googlegroups.com
Quick follow-up to that. In my particular application there are several words that are used EXTREMELY often (which is why I wanted to use TFIDF to penalize them). If TFIDF does not make sense, or just adds a layer of complexity, what are some other techniques one could use to penalize those words in LDA? Right now they dominate my topic results.
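One alternative I have considered is simply dropping the high-document-frequency terms before training, along the lines of gensim's `Dictionary.filter_extremes(no_above=...)`. A plain-Python sketch of the idea (function name made up):

```python
from collections import Counter

def filter_high_df(docs, no_above=0.5):
    """Drop tokens that appear in more than `no_above` of the
    documents, a stand-in for gensim's Dictionary.filter_extremes.
    Keeps dominant words out of LDA topics without introducing
    tf-idf weights into the model."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    keep = {t for t, c in df.items() if c / n <= no_above}
    return [[t for t in doc if t in keep] for doc in docs]
```

The same cutoff would then also have to be applied to any unseen documents before projecting them.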

Yesu Feng

unread,
Mar 31, 2016, 1:14:58 PM3/31/16
to gensim, die...@plaetinck.be
Hi,

I have recently tried the gensim LDA/LSI models. I have a corpus of 1500 documents which, after removing stopwords etc., gives a dictionary of ~6000 tokens. LSI worked out great based on a tf-idf transformation; however, LDA on the tf-idf-transformed corpus had no success, giving nonsensical topics. Then I tried again with plain bow, and it works. However, there are words that are widely shared across documents and also have a high within-document frequency, so it would be ideal to weight their counts with tf-idf before running LDA. Otherwise, it seems to me that the topic weighting given by LDA puts more weight on the topics represented by those frequent words. I am wondering if anyone has a remedy for that. Also, most people here saw tf-idf improve their LDA results; I am not sure why mine failed. Any suggestions on where to check?

I used models.LdaModel. For the plain bow corpus I used the default alpha and eta, and tried num_topics of 20, 30, 40, and 50 with iterations = 200 and passes = 20; that generally worked. For the tf-idf corpus I searched a bit over these parameters, as well as alpha and eta, but with no success.

Yesu