On the use of gensim for retrieval purposes: A case study


Shivani

Aug 6, 2012, 7:56:18 PM
to gen...@googlegroups.com
Hello Radim and Gensim fans,

I am emailing because I am hitting a wall while trying to use gensim's LSA model for retrieval experiments. I am getting terrible results with gensim's LSA model compared to R's lsa package, so I went on a little exploration as to why this is.

Here are the parameters of my test:

Input: a 486 x 11191 document-term matrix
Option: tf-idf weighting of the documents before computing the LSI model
LSA implementations tried: SciPy's SVD, gensim's LSA, R's LSA
Outputs examined: the singular values, and the rankings of the relevant files for a particular query

First, I looked at the singular values (s) from each model to spot any stark differences.

Surprisingly, gensim's SVD is an exact replica of SciPy's SVD. I manually computed the cosine between the mapped query and the mapped documents in both cases and got exactly the same result (figure attached).

But if you look at the ranks of the relevant documents, it seems like gensim's (or even SciPy's) SVD is performing very poorly compared to R's methods.

                            doc1     doc2     doc3
Gensim LSA with tf-idf       60.0     19.0     60.0
Gensim LSA without tf-idf    80.0    136.0    380.0
Python's SVD                139.0     30.0    139.0
R's LSA with tf-idf         146        2        8
R's LSA without tf-idf      342        1       13

I am not able to come up with a good explanation for this. Any insights would be very helpful.

Thanks a bunch

Shivani Rao
comparision_s_values.jpg

Radim Řehůřek

Aug 7, 2012, 5:06:01 PM
to gensim
Hello Shivani,

> First, I looked at the singular values (s) from each model to spot any
> stark differences.

If you got different factorizations, that means the preprocessing
(=the input matrix) was different. So rather than comparing SVD,
you're comparing different preprocessing pipelines.
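
A quick way to check is to diff the two vocabularies directly. A sketch (the filenames are made up; it assumes each tool can export its term list as plain text, one term per line):

gensim_terms = set(line.strip() for line in open('gensim_terms.txt'))  # made-up export of gensim's vocabulary
r_terms = set(line.strip() for line in open('r_terms.txt'))            # made-up export of R's vocabulary

print(len(gensim_terms ^ r_terms))          # terms present in one vocabulary but not the other
print(sorted(gensim_terms - r_terms)[:20])  # a sample of gensim-only terms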

> Surprisingly, gensim's SVD is an exact replica of SciPy's SVD. I manually
> computed the cosine between the mapped query and the mapped documents in
> both cases and got exactly the same result (figure attached).
>
> But if you look at the ranks of the relevant documents, it seems like
> gensim's (or even SciPy's) SVD is performing very poorly compared to R's
> methods.
>
>                             doc1     doc2     doc3
> Gensim LSA with tf-idf       60.0     19.0     60.0
> Gensim LSA without tf-idf    80.0    136.0    380.0
> Python's SVD                139.0     30.0    139.0
> R's LSA with tf-idf         146        2        8
> R's LSA without tf-idf      342        1       13

What do these numbers mean?

If R's results really are superior in your scenario, it may be
worthwhile breaking down its processing pipeline (see where the
numbers start to diverge and why) and adding the good bits to gensim.

Best,
Radim


>
> I am not able to come up with a good explanation for this. Any insights
> would be very helpful.
>
> Thanks a bunch
>
> Shivani Rao
>
> comparision_s_values.jpg

Senthil

Aug 7, 2012, 5:09:42 PM
to gen...@googlegroups.com, gensim
Yep, I am curious about the preprocessing pipeline as well; it would be great if you could post details on it.

Shivani

Aug 8, 2012, 4:06:41 PM
to gen...@googlegroups.com
Hello Radim and Senthil,

I looked it up and you are right, Radim: the feature space is a little different, though not by much; in all, 367 terms differ. Nevertheless, such poor retrieval results require more investigation, so I tried 3 variations of the basic gensim retrieval setup. Although they should in theory give the same rankings, they all give different rankings.

This is how I use gensim for retrieval: after creating an LSA model, the query is mapped to the LSA space and its similarity to each document is computed. The similarities serve as scores that are then used to rank the documents. The ranks of the relevant documents give an idea of how good the model and/or the similarity function is. The elements of the tables below are the ranks of the "relevant" documents for the query; the lower the rank, the better the algorithm.

I tried playing around with the way I compute the scores of the documents vis-a-vis the query; it turns out there are several ways. TextCorpus.nm is the Matrix Market file for the corpus, QueryFile is the location of the query, and K is the number of topics.

a) The most basic option:

import gensim
from gensim import similarities

myC = gensim.corpora.MmCorpus(resultsdir + 'TextCorpus.nm')  # read the corpus (Matrix Market format)
Queries = open(QueryFile).readlines()                        # read the query, a single document
Qtext = [q.lower().split() for q in Queries]
myQueries = [myCorpus.dictionary.doc2bow(text) for text in Qtext]  # myCorpus.dictionary: the word->id mapping built earlier
gensim.models.lsimodel.P2_EXTRA_ITERS = 4
lsi = gensim.models.LsiModel(corpus=myC, id2word=myCorpus.dictionary,
                             num_topics=K, distributed=False, power_iters=4)  # compute LSI
index = similarities.Similarity(resultsdir + "indexsim", lsi[myC], K)  # build the similarity index
scores = index[lsi[myQueries]]  # compute scores

b) Variation 1: compute the similarity without using the index

import numpy
from numpy import dot, diag

lsi_maps_corpus = gensim.matutils.corpus2dense(lsi[myC], K).T   # dense version of the mapped corpus
lsi_maps_q = gensim.matutils.corpus2dense(lsi[myQueries], K).T  # dense version of the mapped query
dotproduct = dot(lsi_maps_corpus, lsi_maps_q.T)                 # dot products
denom = diag(dot(lsi_maps_corpus, lsi_maps_corpus.T)).T         # denominator for normalization
scores = numpy.divide(dotproduct.T, denom).T                    # normalize
scores[numpy.where(denom == 0)] = 0                             # take care of NaNs

c) Variation 2: compute the mapping and similarity yourself, using the U and s from gensim's LSI model

from numpy import linalg

U = lsi.projection.u  # gensim's computed U (terms x topics)
s = lsi.projection.s  # gensim's computed singular values
# `dense` is the dense term-document matrix and `q` the dense query vector, built earlier in the attached script
V = dot(dot(linalg.inv(diag(s)), U.transpose()), dense)  # creating my own mapping of the documents
q_lsi = dot(dot(linalg.inv(diag(s)), U.transpose()), q)  # creating my own mapping of the query
dotproduct = dot(V.T, q_lsi)                             # dot products
denom = diag(dot(V.T, V))                                # normalization
scores = numpy.divide(dotproduct.T, denom).T
scores[numpy.where(denom == 0)] = 0                      # get rid of NaNs

I have attached an archive to this post with the text corpus, the query file, the dictionary, and the script to run the retrieval example.

basic option: ranks of relevant documents
[  60.  139.  380.]
Variation 1: ranks of relevant documents
[ 384.  199.  430.]
Variation 2: ranks of relevant documents
[ 72.  31.  94.]

I am not able to figure out why essentially the same pipeline gives 3 different answers.

More analysis will follow

Shivani

Shivani

Aug 8, 2012, 4:21:05 PM
to gen...@googlegroups.com
Oops, here are the relevant files:
example_retrieval.tar.gz

Radim Řehůřek

Aug 8, 2012, 4:59:21 PM
to gen...@googlegroups.com
Hello Shivani, and thanks for investigating.

It looks like each of your versions uses a different formula to compute the similarities -- that's why you're getting different scores :) 

Gensim follows the Deerwester et al. formula, where to compare LSA document vectors D1, D2 you stretch both by the singular values first: D1 * S^2 * D2'. Your "variation 2" seems to compute D1 * D2', without the scaling.
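
To illustrate that formula with made-up numbers (just a sketch, not gensim's internal code):

import numpy as np

s = np.array([3.0, 2.0, 1.0])    # made-up singular values
d1 = np.array([0.5, 0.1, 0.4])   # made-up LSA coordinates of two documents
d2 = np.array([0.4, 0.3, 0.2])

scaled = np.dot(d1 * s, d2 * s)  # Deerwester-style comparison: D1 * S^2 * D2'
plain = np.dot(d1, d2)           # "variation 2" above: D1 * D2', no scaling
print(scaled, plain)             # the two notions of similarity generally differ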

Also, you're using plain bag-of-words counts, but it is generally better to regularize the documents first, using e.g. TF-IDF.
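
In gensim that is just one extra step in the pipeline, roughly like this (a sketch reusing the myC, myCorpus.dictionary, myQueries and K names from your snippet (a), so it is not runnable on its own):

import gensim

tfidf = gensim.models.TfidfModel(myC)  # fit idf weights on the bag-of-words corpus
lsi = gensim.models.LsiModel(tfidf[myC], id2word=myCorpus.dictionary, num_topics=K)
index = gensim.similarities.MatrixSimilarity(lsi[tfidf[myC]], num_features=K)
scores = index[lsi[tfidf[myQueries]]]  # queries go through the same tf-idf + LSI mapping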

I'm curious, how did you evaluate the relevant documents for your ranking? Can you post some human-friendly info about what these three documents are? (and what are the other documents suggested by the other variations).

We can certainly look into this further, but let's clear up the evaluation procedure first.

Best,
Radim

Mai Al-Duailij

Aug 8, 2012, 5:47:02 PM
to gen...@googlegroups.com
I am experimenting with LSA (using both a small and a large corpus), then verifying the results against already known similarities, and I'm getting great results.
--
"The desire of knowledge, like the thirst of riches, increases ever with the acquisition of it."
-- Laurence Sterne

Shivani

Aug 8, 2012, 11:49:12 PM
to gen...@googlegroups.com
Hello Mai,

Glad to hear that you are getting great results. By themselves, gensim's LSA rankings look alright, but for the exact same matrix R's rankings are better. Did you compare gensim's retrieval with other tools out there? If so, what has your experience been?

Radim,

I am attaching the updated files. I have added code snippets that do tf-idf modelling and use that for LSA instead. The rankings do improve, but they are still not better than the rankings I am obtaining with R. I am working on software libraries: the query is a bug report's title and the database is the set of source files, so the source files are the rows of the document-term matrix. The relevant files are indexed at [39,129,280]; these are the "fixed files" scraped from Bugzilla for that bug. I could send you the parsed source files along with the query title, but I am afraid it would not be very informative. I have attached a file called bug_1788282.xml to give you a sample of the kind of data I am dealing with.

Last but not least, I am also attaching two additional files in the data folder: the dictionary (dict_gensim_format.txt) and the corpus in Blei format (blei_format.txt). These are exported from R at the end of its processing pipeline, just before computing the LSA model. It is possible to initialize a gensim corpus from this data, as sketched below; creating a model from it ensures that the preprocessing is the same for both tools.
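
Roughly, the exported files can be loaded on the gensim side like this (a sketch; the exact paths inside the archive may differ, and the Blei-format loader also expects the accompanying vocabulary file):

import gensim

myC = gensim.corpora.BleiCorpus('data/blei_format.txt')  # corpus exported from R in Blei (LDA-C) format
myDict = gensim.corpora.Dictionary.load_from_text('data/dict_gensim_format.txt')  # dictionary in gensim's text format
lsi = gensim.models.LsiModel(corpus=myC, id2word=myDict, num_topics=K)  # K as in the earlier snippets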

Here is a comparison of the ranks of the relevant documents:

                   R's data on gensim    gensim data      R's algorithm
tf-idf             30, 8, 49             30, 8, 63        2, 10, 45
lsa basic          5, 172, 162           60, 139, 380     1, 13, 342
lsa on tf-idf      22, 33, 31            19, 30, 136      2, 8, 146
lsa variation 1    234, 184, 303         384, 199, 430
lsa variation 2    174, 57, 217          72, 31, 94

I also noticed that R's lsa code maps documents and queries back to the term space and then computes the similarity there. I tried that and got even worse results... I will send those out soon. Should I try with that distance and see what is going wrong?

Any ideas will be helpful

Shivani
gensim_tuts.tar.gz

Mai Al-Duailij

Aug 9, 2012, 12:31:53 AM
to gen...@googlegroups.com
Hi,
I compared gensim with another LSA tool on a small corpus but didn't get good results from that tool. The tool uses Python's SVD to decompose the term-document matrix. I guess the key difference between gensim and that tool is the preprocessing step: in that tool the common words are removed, while in gensim the words that appear only once are removed.
So I got good results with gensim and a small corpus (the large-corpus results are not as good as the small-corpus ones).
Hope that helps.

Radim Řehůřek

Aug 9, 2012, 3:24:45 AM
to gen...@googlegroups.com
Hello,

On Thursday, August 9, 2012 6:31:53 AM UTC+2, mai wrote:
Hi,
I compared gensim with another LSA tool on a small corpus but didn't get good results from that tool. The tool uses Python's SVD to decompose the term-document matrix. I guess the key difference between gensim and that tool is the preprocessing step: in that tool the common words are removed, while in gensim the words that appear only once are removed.

In gensim, the minimum frequency threshold is configurable (and the default is 5). If you wish to use an English stop list, it's in `gensim.parsing.preprocessing.STOPWORDS`.
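
A minimal sketch of doing both when building the dictionary (the toy documents below are made up just to keep it runnable):

from gensim import corpora
from gensim.parsing.preprocessing import STOPWORDS

raw_docs = ["open the file and read it", "the file could not be opened", "read the manual"]  # toy documents
texts = [[w for w in doc.lower().split() if w not in STOPWORDS] for doc in raw_docs]
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=2, no_above=1.0)  # drop rare terms; no_below=5 is the usual default
bows = [dictionary.doc2bow(text) for text in texts]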

HTH,
Radim

Radim Řehůřek

Aug 9, 2012, 3:28:26 AM
to gensim
Great, thanks Shivani. I'll have a look this weekend.

-rr

Radim Řehůřek

Aug 12, 2012, 7:30:45 PM
to gensim
Hello again Shivani,

I looked at the files and ran your script (LSA over blei_format.txt),
but didn't see anything out of the ordinary.

I'm curious at which point the R results start to differ... do you
have the option of dumping the SVD decomposition from R? Can you find
out what the projected vectors look like (at least for the query and
docs 39, 129, 280), and what similarity function R uses to compare
them?
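
On the gensim side, dumping the corresponding pieces is straightforward, roughly (a sketch reusing lsi, myC, myQueries and K from your script, with 0-based document indices assumed):

import gensim

print(lsi.projection.s)                                       # gensim's singular values
docs = gensim.matutils.corpus2dense(lsi[myC], num_terms=K)    # projected documents, one column per document
print(docs[:, [39, 129, 280]].T)                              # the three relevant documents
print(gensim.matutils.corpus2dense(lsi[myQueries], num_terms=K).T)  # the projected query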

The pipeline is pretty straightforward, I'm sure we'll get to the
bottom of this soon :)

Cheers,
Radim



Radim Řehůřek

Aug 13, 2012, 11:07:41 AM
to gensim, raosh...@gmail.com
Or perhaps let's start from tf-idf -- it'll be simpler to dissect, and
the results are still appreciably different between R/gensim.

-rr

Shivani

Aug 27, 2012, 10:41:06 AM
to gen...@googlegroups.com
Hello Radim,
I am sorry, I was out because I had a paper deadline to meet. I am back now. I will send you R's U and S matrices, and also look at the similarity function.

If you look at my previous email: tf-idf on the gensim-created term-document matrix and on R's term-document matrix gives very similar rankings, even though the underlying data differ slightly. So the obvious difference is in the similarity function, I am guessing.


>                    R's data on gensim    gensim data      R's algorithm
>
> tf-idf             30, 8, 49             30, 8, 63        2, 10, 45

Another email will follow soon

Cheers,
Shivani

Shivani

Aug 28, 2012, 3:39:48 PM
to gen...@googlegroups.com
Hello Radim and gensim fans,

I was able to look at just the tf-idf mappings created by gensim and by R, and compare the two.

Amazingly, they are essentially the same matrices. The root mean square difference between the two came to 4.2e-22, which I assume is due to numerical error.

The query can be either (a) mapped to tf-idf space before matching against the documents, or (b) matched directly (a rough sketch of both follows the rankings below).

So here are the different rankings I got

a) Ranking of the three documents based on Gensim's similarity between tf_idf[query] and tf_idf[corpus] -> 30 8 49
b) Ranking of the three documents based on Gensim's similarity between query and tf_idf[corpus] -> 25 15 40
c) Ranking of the three documents based on R's similarity between query and tf_idf[corpus] -> 45 2 10
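
Roughly, (a) and (b) were computed along these lines (a sketch using the variable names from my earlier script, not the exact code I ran):

import gensim

tfidf = gensim.models.TfidfModel(myC)
index = gensim.similarities.MatrixSimilarity(tfidf[myC], num_features=len(myDict))
scores_a = index[tfidf[myQueries]]  # (a) query mapped into tf-idf space first
scores_b = index[myQueries]         # (b) raw bag-of-words query against the tf-idf index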

So this is definitely about the similarity function, isn't it?
I will repeat the analysis with the lsi mappings created by the two tools and write another email.

Shivani

Radim Řehůřek

Aug 29, 2012, 7:46:05 AM
to gensim
Thanks for checking, Shivani!

> So this is definitely about the similarity function, isn't it?
> I will repeat the analysis with the lsi mappings created by the two tools
> and write another email.

Isn't it easier to inspect the similarity function in R directly (the
code) and see what's going on? Or maybe it's even described in the
documentation -- do you have a link?

This round-about detective work seems somewhat cumbersome :)

Best,
Radim

Shivani

Aug 29, 2012, 11:43:50 AM
to gen...@googlegroups.com
Hello Radim,

Actually, I know the similarity function used in R very well, because I coded it myself. It is nothing but the cosine of the two vectors.
That is the reason this is so mind-boggling to me.

Here is the code snippet:
denom = sqrt(sum(tcrossprod_simple_triplet_matrix(dtm, as.matrix(dtm))))
scores = tcrossprod_simple_triplet_matrix(dtm, as.matrix(dtmq)) / denom

I did try a code snippet in Python that does the exact same thing, and I am getting different rankings compared to the index function.

Shivani

Radim Řehůřek

Aug 29, 2012, 2:02:25 PM
to gensim
> Actually, I know the similarity function used in R very well, because I
> coded it myself. It is nothing but the cosine of the two vectors.
> That is the reason this is so mind-boggling to me.
>
> Here is the code snippet:
> denom = sqrt(sum(tcrossprod_simple_triplet_matrix(dtm, as.matrix(dtm))))
> scores = tcrossprod_simple_triplet_matrix(dtm, as.matrix(dtmq)) / denom

Is it possible that you are normalizing the length of the index
vectors, but not the length of the query vector? (gensim normalizes
both)
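
For a quick sanity check of what gensim computes, matutils.cossim gives the cosine of two sparse (id, weight) vectors, with both sides length-normalized (toy vectors below):

from gensim import matutils

doc_vec = [(0, 1.0), (2, 3.0)]    # toy sparse gensim vectors: (term_id, weight)
query_vec = [(0, 2.0), (1, 1.0)]
print(matutils.cossim(doc_vec, query_vec))  # cosine similarity; both vectors are normalized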

Shivani

Aug 29, 2012, 4:59:32 PM
to gen...@googlegroups.com
Hello Radim,
The query's length is a scalar that scales the scores of all the documents uniformly, so I think (and have also verified experimentally) that it does not affect the rankings of the documents.

In the case of gensim I tried two variants:
a) gensim's index and similarity function -> rankings 30 8 49
b) gensim's tf-idf mapping, but a dot product followed by scaling (inverse of document length) -> rankings 24 15 50
Here is the code snippet that does this:

dense_tfidf = gensim.matutils.corpus2dense(tf_idf[myC],len(myDict))
dotproduct = dot(dense_tfidf.T,q)
denom = diag(dot(dense_tfidf.T,dense_tfidf)).T
scores = numpy.divide(dotproduct.T,denom).T

What is the difference attributed to?

Regards,
Shivani

Radim Řehůřek

Aug 31, 2012, 12:38:21 PM
to gensim
Hello Shivani,

> The query's length is a scalar that scales the scores of all the documents
> uniformly, so I think (and have also verified experimentally) that it does
> not affect the rankings of the documents.

ah yes, you are right of course.


> In the case of gensim I tried two variants:
> a) gensim's index and similarity function -> rankings 30 8 49
> b) gensim's tf-idf mapping, but a dot product followed by scaling (inverse
> of document length) -> rankings 24 15 50
> Here is the code snippet that does this:
>
> dense_tfidf = gensim.matutils.corpus2dense(tf_idf[myC],len(myDict))
> dotproduct = dot(dense_tfidf.T,q)
> denom = diag(dot(dense_tfidf.T,dense_tfidf)).T
> scores = numpy.divide(dotproduct.T,denom).T
>
> What is the difference attributed to?

To the missing square root, in `denom`.
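
In other words, with the variable names from your snippet (a sketch against the quoted code above, not tested on your data):

import numpy

# cosine denominator: per-document vector norms, i.e. the square root of the squared norms
denom = numpy.sqrt(numpy.diag(numpy.dot(dense_tfidf.T, dense_tfidf)))
scores = numpy.dot(dense_tfidf.T, q) / denom  # the query norm would only rescale all scores uniformly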

But that still doesn't explain why a cosine between two vectors (an
extremely straightforward and well-defined function) should return a
different result in R vs. gensim. Can you check that your R version of
cosine similarity indeed does what it should? Maybe post an example of
two concrete input vectors, so we can cross-check?
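
For instance, a tiny numpy reference you could compare the R output against, on the same made-up vectors:

import numpy as np

a = np.array([1.0, 0.0, 2.0])
b = np.array([2.0, 1.0, 0.0])
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # plain cosine similarity of a and b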

And since apparently it returns superior results, can you then tell me
the difference? :) Like I said, we can add it to gensim as an
alternative scoring method, if it proves general enough.

Best,
Radim