Using TF-IDF to compare pairs of documents


dvdgrs

Oct 1, 2011, 8:56:26 AM10/1/11
to gensim
Hi!

I am trying to use the TF-IDF metric to compare two documents. I am very
new to VSM and related business, so Gensim seems like the perfect lib
for me. I've plowed through the tutorials, but am quite unsure how to
proceed in my case: I want to compute the similarity between two
documents using the TF-IDF metric. As a training corpus (to extract
features) I want to use a large corpus of BioMedCentral articles: my
application is in the biomedical domain.

My current approach:
- Clean (tokenize + remove stopwords) the corpus (a list containing
the articles; currently I use an in-memory list, but I'll have to
switch to corpus streaming since I'm handling over 100k articles, but
that's for later)
- Create a dictionary from the cleaned corpus
- Initialize the TF-IDF model using the training corpus:
models.TfidfModel(cleanedCorpus)
- Clean string1, convert it to the BOW representation
(dictionary.doc2bow(string1)), and then to TF-IDF space: tfidf[string1]
- Same for string2

I then do:
index = similarities.MatrixSimilarity([tfidf1])
sims = index[tfidf2]
print list(enumerate(sims))
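
For reference, here is a self-contained toy sketch of the same pipeline
(dictionary -> TF-IDF -> cosine similarity) in plain Python. Gensim does all
of this internally; the three-document corpus and all the tokens here are
made up, just to make the steps concrete:

```python
import math
from collections import Counter

# Toy stand-in for the cleaned training corpus (made-up documents).
corpus = [
    ["amyloid", "plaque", "neuron", "brain"],
    ["protein", "neuron", "synapse"],
    ["amyloid", "protein", "brain", "brain"],
]

# Dictionary: map every token to an integer word id.
dictionary = {tok: i for i, tok in
              enumerate(sorted({t for doc in corpus for t in doc}))}

# IDF weights from document frequencies over the training corpus.
df = Counter(tok for doc in corpus for tok in set(doc))
idf = {tok: math.log(len(corpus) / df[tok], 2) for tok in dictionary}

def to_tfidf(tokens):
    """Tokens -> bag-of-words counts -> unit-length TF-IDF vector {id: weight}."""
    bow = Counter(t for t in tokens if t in dictionary)  # unseen words dropped
    vec = {dictionary[t]: c * idf[t] for t, c in bow.items() if idf[t] > 0}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {i: w / norm for i, w in vec.items()}

def cosine(a, b):
    """Dot product of two unit-normalized sparse vectors = cosine similarity."""
    return sum(w * b.get(i, 0.0) for i, w in a.items())

v1 = to_tfidf(["amyloid", "plaque", "brain"])
v2 = to_tfidf(["amyloid", "brain", "brain", "synapse"])
print(round(cosine(v1, v2), 3))
```

Because both vectors are normalized to unit length before the dot product,
the raw word counts of the two input strings never need to match.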

If my method is correct (which I'm not sure of), I have one issue: the
vector representations of the two strings don't have the same length.
When comparing the shorter one to the longer one, I get an error. How
should I handle the different lengths? I feel a bit bad about asking
such basic stuff, but I'm not very experienced in Python or in VSM
matters. Thanks for any help/clarification! And thanks for Gensim :).

Radim

Oct 1, 2011, 3:17:42 PM10/1/11
to gensim
Hello David,

no need to feel bad! If there's anything unclear, it's the
documentation's fault, not yours :)

Besides, your procedure seems entirely kosher. What is the error you
are receiving?

Gensim should handle vector normalization automatically (inside the
Sparse/MatrixSimilarity class), so the problem must lie elsewhere.
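
To make "vector normalization" concrete, here is a toy illustration (not
gensim's actual code, and the weights are invented) of why documents of very
different lengths stay comparable after L2 normalization:

```python
import math

def unit(vec):
    """L2-normalize a sparse {word_id: weight} vector, the way documents
    are normalized before cosine similarities are taken."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {i: w / norm for i, w in vec.items()}

# Same word proportions, very different raw weights
# (think: a short abstract vs. a long article on the same topic).
short_doc = {0: 2.0, 5: 1.0}
long_doc = {0: 20.0, 5: 10.0}

u_short, u_long = unit(short_doc), unit(long_doc)
# After normalization the two vectors coincide (up to float rounding),
# so document length by itself does not change cosine similarities.
print(all(abs(u_short[i] - u_long[i]) < 1e-12 for i in u_short))
```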

Best,
Radim

dvdgrs

Oct 1, 2011, 4:47:24 PM10/1/11
to gensim
The error happens when I use the shorter string as the index and
compare the longer string to it.
For this example I have two strings: one is a short (39 words when
cleaned) db-entry text (tfidfString) and the other is the plaintext of
the Wikipedia page for "Alzheimer's disease" (1674 words when
cleaned). The corpus consists of 404 articles (built
Dictionary(28295 unique tokens) from 404 documents (total 396978
corpus positions)).

This is what happens if I use the db-entry as the index and try to
compute the similarity of the Wiki page to it:

>>> index = similarities.MatrixSimilarity([tfidfString])
2011-10-01 22:34:05,551 : INFO : scanning corpus to determine the
number of features
2011-10-01 22:34:05,566 : INFO : creating matrix for 1 documents and
26630 features
>>> sims = index[tfidfAlz]

Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    sims = index[tfidfAlz]
  File "C:\Python26\lib\site-packages\gensim-0.8.0-py2.6.egg\gensim\interfaces.py", line 201, in __getitem__
    result = self.get_similarities(query)
  File "C:\Python26\lib\site-packages\gensim-0.8.0-py2.6.egg\gensim\similarities\docsim.py", line 378, in get_similarities
    query = matutils.sparse2full(query, self.num_features)
  File "C:\Python26\lib\site-packages\gensim-0.8.0-py2.6.egg\gensim\matutils.py", line 128, in sparse2full
    result[doc.keys()] = doc.values() # overwrite some of the zeroes with explicit values
IndexError: index 27559 out of bounds 0<=index<26630

This is what happens if I do it the other way around (compute the
similarity of the wiki page to the db-entry):

>>> index = similarities.MatrixSimilarity([tfidfAlz])
2011-10-01 22:33:36,410 : INFO : scanning corpus to determine the
number of features
2011-10-01 22:33:36,467 : INFO : creating matrix for 1 documents and
28208 features
>>> sims = index[tfidfString]
>>> print list(enumerate(sims))
[(0, 0.17138466)]

Radim

Oct 1, 2011, 5:11:08 PM10/1/11
to gensim
I see.

Short answer: use `index = similarities.MatrixSimilarity([tfidfString],
num_features=len(dictionary))`... where len(dictionary)=28295 in your
case.

Longer answer: the matrix used internally by MatrixSimilarity needs to
know its shape in advance, i.e., how many features (= unique words) are
used. If you don't pass `num_features` explicitly, MatrixSimilarity
assumes num_features = the highest word id that appears in the supplied
corpus, plus one. In your case no word with id >= 26630 appears in
tfidfString, so if you then supply a query document with word ids
outside the range 0 <= id < 26630 -- such as word id 27559 in
`tfidfAlz` -- you'll get the error you see. So pass num_features
explicitly.

Note to self: maybe disable this "default" num_features, and force
user to always supply num_features explicitly? Like it is in
`Similarity` class.

HTH,
Radim

dvdgrs

Oct 1, 2011, 5:19:32 PM10/1/11
to gensim
Ah right, that makes sense. So does the error, now :).
Thanks for your help!