I get the error when I use the shorter string as the index and compare the longer string against it.
For this example I have two strings: one is a short db-entry text (39 words after cleaning), tfidfString, and the other is the plain text of the Wikipedia page for "Alzheimer's disease" (1674 words after cleaning), tfidfAlz. The corpus consists of 404 articles (built Dictionary(28295 unique tokens) from 404 documents (total 396978 corpus positions)).
This is what happens if I use the db entry as the index and try to compute the similarity of the Wiki page to it:
>>> index = similarities.MatrixSimilarity([tfidfString])
2011-10-01 22:34:05,551 : INFO : scanning corpus to determine the number of features
2011-10-01 22:34:05,566 : INFO : creating matrix for 1 documents and 26630 features
>>> sims = index[tfidfAlz]
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    sims = index[tfidfAlz]
  File "C:\Python26\lib\site-packages\gensim-0.8.0-py2.6.egg\gensim\interfaces.py", line 201, in __getitem__
    result = self.get_similarities(query)
  File "C:\Python26\lib\site-packages\gensim-0.8.0-py2.6.egg\gensim\similarities\docsim.py", line 378, in get_similarities
    query = matutils.sparse2full(query, self.num_features)
  File "C:\Python26\lib\site-packages\gensim-0.8.0-py2.6.egg\gensim\matutils.py", line 128, in sparse2full
    result[doc.keys()] = doc.values() # overwrite some of the zeroes with explicit values
IndexError: index 27559 out of bounds 0<=index<26630
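As far as I can tell, the cause is that MatrixSimilarity, when built from just the documents you pass it, infers num_features from the highest token id it sees in those documents. Indexing only tfidfString gives 26630 features, but the Wiki query contains token id 27559, which is out of range when sparse2full builds the dense query vector. A minimal numpy sketch of that mechanism (the function and the toy vectors below are mine, not gensim's):

```python
import numpy as np

def sparse2full(doc, num_features):
    """Convert a sparse (term_id, weight) list into a dense vector,
    mimicking what gensim.matutils.sparse2full does."""
    result = np.zeros(num_features, dtype=np.float32)
    ids = [term_id for term_id, _ in doc]
    weights = [weight for _, weight in doc]
    result[ids] = weights  # IndexError if any term_id >= num_features
    return result

index_doc = [(0, 0.5), (26629, 0.5)]   # highest id seen while building the index
query_doc = [(10, 0.3), (27559, 0.9)]  # contains an id the index never saw

# num_features is inferred from the index documents only -> 26630
num_features = max(term_id for term_id, _ in index_doc) + 1

sparse2full(index_doc, num_features)   # works: every id is in range
try:
    sparse2full(query_doc, num_features)  # same failure as in the traceback
except IndexError as err:
    print("IndexError:", err)
```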
This is what happens if I do it the other way around (use the Wiki page as the index and compute the similarity of the db entry to it):
>>> index = similarities.MatrixSimilarity([tfidfAlz])
2011-10-01 22:33:36,410 : INFO : scanning corpus to determine the number of features
2011-10-01 22:33:36,467 : INFO : creating matrix for 1 documents and 28208 features
>>> sims = index[tfidfString]
>>> print list(enumerate(sims))
[(0, 0.17138466)]
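The reverse direction seems to work only by luck: the Wiki page happens to contain higher token ids (giving 28208 features), so every id in the short db entry falls inside the matrix. I believe the robust fix is to pin the feature count to the dictionary size instead of letting it be inferred, e.g. something like similarities.MatrixSimilarity([tfidfString], num_features=len(dictionary)) (check whether your gensim 0.8.0 constructor accepts num_features). A sketch of what that amounts to, with made-up tf-idf vectors standing in for tfidfString and tfidfAlz:

```python
import numpy as np

NUM_FEATURES = 28295  # len(dictionary): fixed vector length for every document

def to_dense(doc, num_features=NUM_FEATURES):
    """Pad a sparse (term_id, weight) document to the full dictionary size."""
    vec = np.zeros(num_features, dtype=np.float32)
    for term_id, weight in doc:
        vec[term_id] = weight
    return vec

def cosine(a, b):
    """Cosine similarity, which is what MatrixSimilarity reports."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical sparse tf-idf vectors; the real ones come from the tfidf model
db_entry = [(5, 0.8), (26629, 0.6)]
wiki_page = [(5, 0.4), (27559, 0.9), (28207, 0.2)]

# with a shared feature count, either document can be index or query
sim = cosine(to_dense(db_entry), to_dense(wiki_page))
```

With both vectors padded to the same dictionary-sized length, no token id can be out of range, and the similarity is the same whichever document serves as the index.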