I've been trying to jury-rig similarity = gensim.similarities.docsim.Similarity(lsi[corpus_1]) (which, if I'm understanding lsi[corpus_1] correctly, is m x n) and then index similarity[corpus_2], where corpus_2 is another matrix (m x b). Ideally I'd like to map the second matrix (converted into corpus_2) into the LSI space created by the dimensionality reduction of corpus_1 (see step 3 below for more).
The overall vision of the project is:
1. TFIDF - Use TF-IDF to give more weight to less frequent terms
2. LSI - Reduce that TF-IDF-weighted term-document matrix into a lower-dimensional feature space
3. Dot Product - Project the documents into that space via the dot product of the LSI matrix and the TF-IDF-weighted term-document matrix
4. Query - Take a new query, map it into the same term vector space, apply the TF-IDF weighting to its words, project that term vector into the feature space created in step 2, then take the dot product of that vector and the matrix created in step 3
This returns the documents most similar to the query, and I've had good success with the results on a smaller scale, but I simply cannot find a way to scale it up. The parts where I'm hitting serious memory limits are the LSI step (initializing the zeros matrix takes all the RAM by itself, though I haven't yet tried looping through all my documents in smaller chunks using add_documents; I'll try that next) and the dot product. If you guys have any thoughts on using gensim.similarities.docsim.Similarity, or anything else to simulate the dot product, that'd be greatly appreciated.
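To show what I mean by simulating the dot product, here's a plain-numpy sketch of the chunked approach (the function name and chunk size are my own invention, not gensim API): instead of one giant query-by-document similarity matrix, only a (chunksize x n_docs) block exists at any time, and only the top-k hits per query are kept.

```python
import numpy as np

def top_k_similar(doc_vecs, query_vecs, k=2, chunksize=256):
    """For each query row, return the indices of the k document rows with the
    highest dot product. Queries are processed in chunks, so at most a
    (chunksize x n_docs) block of similarities is materialized at once."""
    hits = []
    for start in range(0, query_vecs.shape[0], chunksize):
        # Dot product of one chunk of queries against all documents
        block = query_vecs[start:start + chunksize] @ doc_vecs.T
        # Keep only the k best document indices per query in this chunk
        top = np.argsort(-block, axis=1)[:, :k]
        hits.extend(top.tolist())
    return hits
```

If the rows are L2-normalized first, the dot products are cosine similarities; and if doc_vecs itself won't fit in RAM, it could be backed by np.memmap on disk and sliced the same way.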