On May 5, 5:54 pm, Radim <radimrehu...@seznam.cz> wrote:
> Lovely! So clear and clean that I'm tempted to include it in core
> gensim. But on the other hand, it's so clear and clean that it doesn't
> need to be included in core gensim :) What a dilemma.
>
> I'm thinking of adding a "best practices" section to gensim
> documentation, with little tips and helper code snippets. Yours will
> be a perfect candidate.
I realised it's possible to make it even simpler by supporting indexed
access on the indirect corpus, which removes the need for the save and
load. Performance was no worse; in fact it improved slightly, perhaps
because the save/load overhead is gone (tested with a few hundred rows
at a time).
So the class file is:

from gensim.corpora import IndexedCorpus


class SubCorpus(IndexedCorpus):
    """
    A corpus which returns a subset of rows from a larger,
    indexed corpus.
    """

    def __init__(self, indexedCorpus, docIdList):
        # keep references only; no documents are copied or saved
        self.bigcorpus = indexedCorpus
        self.idList = docIdList

    def __iter__(self):
        """
        Return one document at a time.
        """
        for docId in self.idList:
            yield self.bigcorpus[int(docId)]

    def __len__(self):
        """
        Return corpus length as number of row ids.
        """
        return len(self.idList)

    def __getitem__(self, docno):
        """
        Return the document at position docno within the subset.
        """
        return self.bigcorpus[int(self.idList[docno])]
and the main code is:

# create a smaller matrix from the larger one
mm = SubCorpus(bigcorpus, artIdList)

# transform corpus to LSI space and index it
index = similarities.MatrixSimilarity(lsi[mm], numFeatures=400)

vec = mm[0]  # use the first document of the subset as the query
vec_lsi = lsi[vec]  # convert the query to LSI space

# perform a similarity query against the corpus
sims = index[vec_lsi]
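
For anyone who wants to try the idea without gensim installed, here is a minimal sketch. It rests on several assumptions: a plain Python list stands in for the indexed corpus, the LSI step is skipped, and a brute-force cosine similarity replaces MatrixSimilarity. The names ListSubCorpus, cosine and toy_corpus are made up for illustration and are not part of gensim.

```python
import math


class ListSubCorpus:
    """View over selected rows of any sequence that supports integer indexing.

    Hypothetical stand-in for SubCorpus: it does not inherit from gensim's
    IndexedCorpus, so it runs without gensim installed.
    """

    def __init__(self, big_corpus, doc_id_list):
        # Keep references only; no documents are copied or written to disk.
        self.big_corpus = big_corpus
        self.id_list = doc_id_list

    def __iter__(self):
        for doc_id in self.id_list:
            yield self.big_corpus[int(doc_id)]

    def __len__(self):
        return len(self.id_list)

    def __getitem__(self, docno):
        return self.big_corpus[int(self.id_list[docno])]


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (invented): four documents as sparse (id, weight) lists.
toy_corpus = [
    [(0, 1.0), (1, 1.0)],
    [(1, 2.0)],
    [(0, 1.0), (2, 3.0)],
    [(2, 1.0)],
]

# Assumption: row ids may arrive as strings, which is why int() is applied.
sub = ListSubCorpus(toy_corpus, ["2", "0", "3"])

query = sub[0]                              # row 2 of toy_corpus
sims = [cosine(query, doc) for doc in sub]  # one score per selected row
```

The subset holds only references into the big corpus, so building it costs nothing beyond the id list, and both iteration and indexed access go straight through to the underlying rows.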
Regards
Stephen