Normalization in gensim.similarities.*MatrixSimilarity considered harmful

Vít Novotný

Mar 7, 2022, 5:38:18 AM
to Gensim
The MatrixSimilarity and SparseMatrixSimilarity classes from the gensim.similarities.docsim module go to great lengths to ensure that both the indexed documents and the queries are normalized to unit length, so that we get cosine similarities when we query the index.

This is perhaps the right thing to do when we have raw bag-of-words vectors obtained by calling dictionary.doc2bow(). However, modern vector space weighting schemes such as TF-IDF dtb.nnn or Okapi BM25 apply their own normalization, often a different one for query vectors than for document vectors, and their scoring function is the raw dot product, not the cosine similarity. Any further normalization by the *MatrixSimilarity classes is undesirable and will degrade accuracy.
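
To make this concrete, here is a toy sketch in plain Python (no gensim involved). The weights are made up and merely stand in for what a weighting scheme such as BM25 might produce; the names dot, cosine, and rank are only for illustration. It shows that normalizing to unit length can flip the ranking that the raw dot product gives:

```python
import math


def dot(u, v):
    """Raw dot product of two sparse vectors (dicts of term id -> weight)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cosine(u, v):
    """Dot product after scaling both vectors to unit L2 length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Made-up, already weighted vectors (assumption: not real BM25 output).
documents = {"a": {0: 2.0, 1: 2.0}, "b": {0: 1.0}}
query = {0: 1.0}


def rank(score):
    """Document names sorted from best to worst under the given scoring function."""
    return sorted(documents, key=lambda name: score(query, documents[name]), reverse=True)


dot_ranking = rank(dot)        # "a" scores 2.0, "b" scores 1.0
cosine_ranking = rank(cosine)  # "a" drops below "b" once both are unit length

print(dot_ranking, cosine_ranking)
```

Document "a" wins under the dot product because of its higher weight for the query term, but loses under cosine similarity because its extra term inflates its length.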

Both MatrixSimilarity and SparseMatrixSimilarity automatically set the normalize attribute (which the SimilarityABC base class checks at query time) to True, which causes all queries to be normalized. Furthermore, both classes automatically normalize the documents during construction.

Here is what I need to do to prevent the normalization of queries and documents in SparseMatrixSimilarity:

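from gensim.matutils import corpus2csc
from gensim.similarities import SparseMatrixSimilarity

# Passing None skips index construction, and with it document normalization.
index = SparseMatrixSimilarity(None)
# Turn off query normalization.
index.normalize = False
# Build the document-by-term matrix by hand, leaving the weights untouched.
index.index = corpus2csc(
    documents,
    num_docs=len(documents),
    num_terms=len(dictionary),
).T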

This is extremely difficult to get right and will surprise even experienced users. Here is what I would expect:

index = SparseMatrixSimilarity(
    documents,
    normalize_queries=False,
    normalize_documents=False,
)
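
As a rough sketch of the semantics I have in mind for the two flags, here is a tiny pure-Python stand-in. ToyIndex and unit are hypothetical names, not gensim code, and sparse vectors are plain dicts of term id to weight:

```python
import math


def unit(vec):
    """Scale a sparse vector (dict of term id -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return dict(vec) if norm == 0 else {t: w / norm for t, w in vec.items()}


class ToyIndex:
    """Hypothetical stand-in for a similarity index; not gensim code."""

    def __init__(self, documents, normalize_queries=True, normalize_documents=True):
        self.normalize_queries = normalize_queries
        # Documents are normalized once, at construction time, only if requested.
        self.documents = [unit(d) if normalize_documents else dict(d) for d in documents]

    def __getitem__(self, query):
        # Queries are normalized at lookup time, only if requested.
        if self.normalize_queries:
            query = unit(query)
        return [sum(w * doc.get(t, 0.0) for t, w in query.items()) for doc in self.documents]


docs = [{0: 3.0, 1: 4.0}, {0: 2.0}]
query = {0: 2.0}

# Defaults give cosine similarities; both flags off gives raw dot products.
cosine_scores = ToyIndex(docs)[query]
raw_scores = ToyIndex(docs, normalize_queries=False, normalize_documents=False)[query]

print(cosine_scores, raw_scores)
```

Keeping both flags True by default would preserve the current cosine behavior, while setting both to False would pass pre-weighted vectors through unchanged.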