The SparseMatrixSimilarity and DenseMatrixSimilarity classes from the gensim.similarities.docsim module go to great lengths to ensure that both the indexed documents and the queries are normalized to unit length, so that querying the index returns cosine similarities.
This is perhaps the right thing to do when we have raw bag-of-words vectors obtained by calling dictionary.doc2bow(). However, modern vector space weighting functions such as the TF-IDF dtb.nnn or Okapi BM25 use their own normalization functions, often different for query and document vectors, and the scoring function is the raw dot product, not the cosine similarity. Any further normalization by the *MatrixSimilarity classes is undesirable and will degrade accuracy.
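To illustrate why the extra normalization matters, here is a minimal sketch in plain Python (not gensim code). The weights are made up for the example, and I assume they were already produced by a weighting scheme that is meant to be scored with the raw dot product. The helper names dot and cosine are my own.

```python
import math


def dot(u, v):
    """Raw dot product of two sparse vectors stored as {term_id: weight}."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cosine(u, v):
    """Dot product after scaling both vectors to unit Euclidean length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Toy, invented weights; assume the weighting scheme already normalized them.
query = {0: 1.0, 1: 1.0}
documents = {
    "doc_a": {0: 3.0, 1: 3.0, 2: 4.0},  # high weights on the query terms
    "doc_b": {0: 1.0, 1: 1.0},          # low weights, but same direction as the query
}

# Rank document names from best to worst under each scoring function.
by_dot = sorted(documents, key=lambda name: dot(query, documents[name]), reverse=True)
by_cosine = sorted(documents, key=lambda name: cosine(query, documents[name]), reverse=True)

print(by_dot)
print(by_cosine)
```

With these toy numbers the two scoring functions disagree on which document comes first, so forcing unit-length normalization on top of an already weighted corpus can change the ranking that the weighting scheme intended.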
Here is what I need to do to prevent the normalization of queries and documents in SparseMatrixSimilarity:
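from gensim.matutils import corpus2csc
from gensim.similarities import SparseMatrixSimilarity

# Create the index without a corpus so that no documents get normalized
index = SparseMatrixSimilarity(None)
# Turn off the normalization of queries
index.normalize = False
# Build the document-by-term matrix by hand from the weighted corpus
index.index = corpus2csc(
    documents,
    num_docs=len(documents),
    num_terms=len(dictionary),
).T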
This is extremely difficult to get right and will surprise even experienced users. Here is what I would expect:
index = SparseMatrixSimilarity(
    documents,
    normalize_queries=False,
    normalize_documents=False,
)
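To make the expected semantics of the two flags concrete, here is a rough sketch in plain Python. It is not gensim's implementation: the functions unit and scores are hypothetical, and the default values of True are my assumption, chosen to mirror the current always-normalize behavior.

```python
import math


def unit(vector):
    """Scale a sparse {term_id: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else dict(vector)


def scores(documents, query, normalize_queries=True, normalize_documents=True):
    """Hypothetical reference semantics: dot products with optional normalization."""
    if normalize_queries:
        query = unit(query)
    result = []
    for doc in documents:
        if normalize_documents:
            doc = unit(doc)
        result.append(sum(w * doc.get(t, 0.0) for t, w in query.items()))
    return result


# Toy, invented weights standing in for an already weighted corpus and query.
documents = [{0: 3.0, 1: 3.0, 2: 4.0}, {0: 1.0, 1: 1.0}]
query = {0: 2.0, 1: 2.0}

raw = scores(documents, query, normalize_queries=False, normalize_documents=False)
cos = scores(documents, query)  # both flags on: cosine similarity
query_only = scores(documents, query, normalize_queries=True, normalize_documents=False)

print(raw, cos, query_only)
```

In this sketch, normalizing only the query rescales every score by the same factor and leaves the ranking alone, whereas normalizing the documents can reorder them. That is why I would like the two switches to be independent.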