TruncatedSVD in Gensim


Joshua Teer

Mar 16, 2018, 4:53:53 PM
to gensim
Hi all, 
I'm working on a project using truncated SVD and have had some success with the sklearn.decomposition.TruncatedSVD API. I'd like to scale this up to take advantage of gensim's lower RAM requirements, but I can't find which gensim APIs produce the results I'm looking for.

I've used the following:
gensim.models.lsimodel.stochastic_svd
where the corpus input is in gensim's Matrix Market format.

I've also created V as described in Q3 of the gensim FAQ on LSI:
V = gensim.matutils.corpus2dense(lsi[X], len(lsi.projection.s)).T / lsi.projection.s

but I'm not sure how to replicate sklearn's output.


Could one of you suggest a gensim method that would most efficiently produce the same output as that sklearn API? 
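For reference, the sklearn side I'm trying to replicate looks roughly like this (toy data and made-up variable names, not my real corpus):

```python
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import random as sparse_random

# Toy sparse term-document matrix (documents as rows), a stand-in for my real data.
X = sparse_random(100, 50, density=0.1, random_state=42)

svd = TruncatedSVD(n_components=10, random_state=42)
doc_vectors = svd.fit_transform(X)  # shape: (n_documents, n_components)
```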

Ivan Menshikh

Mar 20, 2018, 2:53:07 AM
to gensim
Hello Joshua,
if I understand correctly, you need a (documents × components) matrix. What's wrong with your current V?

Joshua Teer

Mar 20, 2018, 10:15:09 AM
to gen...@googlegroups.com
Hi Ivan, 
I think the thing that's throwing me off is that the numbers are all about 0.01 or 0.02 off from the output of the sklearn library.

As a side note: if I wanted to take the dot product of lsi[X] and a matrix of the same size, how would I go about doing that?

--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/UUXWnmaU5D8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ivan Menshikh

Mar 21, 2018, 3:23:47 AM
to gensim
About the dot product: convert lsi[x] to a dense matrix, like

from gensim.matutils import corpus2dense

matrix = corpus2dense(lsi[x], number_of_lsi_topics)  # shape: (number_of_lsi_topics, number_of_documents)

and then compute whatever operation you need on it.

Also, by default LsiModel doesn't multiply by the singular values (i.e. the output is unscaled). To fix this, request scaled output: lsi.__getitem__(x, scaled=True)


Radim Řehůřek

Mar 21, 2018, 4:26:17 AM
to gensim
Note that by converting lsi[X] to a dense in-memory matrix, you'll lose the main advantage of large-scale SVD: streamed, out-of-core computation. Everything will be in RAM, putting a limit on the size of your corpus.

-rr

Joshua Teer

Mar 21, 2018, 9:56:15 AM
to gen...@googlegroups.com
I've been trying to jury-rig similarity = gensim.similarities.docsim.Similarity(lsi[corpus_1]) (which, if I'm understanding lsi[corpus_1] correctly, is m × n) and then index similarity[corpus_2], where that corpus is another matrix (m × b). Ideally I'd like to map matrix two (converted into corpus_2) into the LSI space created from the dimensionality reduction of corpus_1 (see point 3 below for more).


The overall vision of the project is to:
1. TFIDF - use TFIDF to up-weight terms that are less frequent.
2. LSI - dimensionally reduce that term-document matrix into some arbitrary, reduced feature space.
3. Dot product - map the documents into that space via the dot product of my LSI matrix and the TFIDF-weighted term-document matrix.
4. Query - take a new query, map it into the same term vector, apply the TFIDF weighting to its words, map that term vector into the feature space created in step 2, and take the dot product of that vector and the matrix created in step 3.

This returns the documents most similar to the query, and I've had good results at a smaller scale, but I simply cannot find a way to scale it up. The parts where I'm hitting serious memory limits are the LSI (initializing the zeros matrix takes all the RAM by itself, though I haven't yet tried looping through my documents in smaller chunks with add_documents; I'll try that next) and the dot product. If you have any thoughts on using gensim.similarities.docsim.Similarity or anything else to simulate the dot product, that'd be greatly appreciated.

Ivan Menshikh

Mar 22, 2018, 2:08:22 AM
to gensim
Hello,

if you really need a large search system, use https://scaletext.com/ (commercial); or, if this is a pet project, try https://github.com/spotify/annoy. It's an approximate nearest-neighbour indexer with mmap support (no need to keep all vectors in RAM). It works pretty well; I've used it many times and consider annoy a pretty good choice!