Normalization in gensim.similarities.*MatrixSimilarity considered harmful

Vít Novotný

Mar 7, 2022, 5:38:18 AM

to Gensim

The MatrixSimilarity and SparseMatrixSimilarity classes from the gensim.similarities.docsim module go to great lengths to ensure that both the indexed documents and the queries are normalized to unit length, so that querying the index returns cosine similarities.

This is perhaps the right thing to do when we have raw bag-of-words vectors that we obtained by calling dictionary.doc2bow(). However, modern vector space weighting schemes such as TF-IDF dtb.nnn or Okapi BM25 apply their own normalization, often different for query and document vectors, and their scoring function is the raw dot product, not the cosine similarity. Any further normalization by the *MatrixSimilarity classes is undesirable and will degrade accuracy.
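To make the difference concrete, here is a minimal pure-Python sketch that does not use gensim at all. The weights are made up for illustration and only stand in for the output of some weighting scheme; the point is that scaling vectors to unit length can change which document ranks first compared with the raw dot product.

```python
import math

def dot(u, v):
    """Raw dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def unit(v):
    """Scale a sparse vector to unit Euclidean length (cosine normalization)."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm else dict(v)

# Made-up weights (an assumption), standing in for already-weighted vectors.
docs = {
    "short": {"cat": 1.0},
    "long": {"cat": 2.0, "dog": 2.0, "fish": 2.0},
}
query = {"cat": 1.0}

# Score every document twice: raw dot product vs. cosine similarity.
dot_scores = {name: dot(query, d) for name, d in docs.items()}
cos_scores = {name: dot(unit(query), unit(d)) for name, d in docs.items()}

best_by_dot = max(dot_scores, key=dot_scores.get)
best_by_cos = max(cos_scores, key=cos_scores.get)
print(best_by_dot, best_by_cos)  # prints "long short"
```

The weighting scheme gave "long" the higher score for "cat", but the extra unit-length normalization penalizes its other terms and flips the ranking.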

Both MatrixSimilarity and SparseMatrixSimilarity automatically set the SimilarityABC.normalize attribute to True, which causes every query to be normalized. Furthermore, both classes automatically normalize the documents during construction.

Here is what I need to do to prevent the normalization of queries and documents in SparseMatrixSimilarity:

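from gensim.matutils import corpus2csc
from gensim.similarities import SparseMatrixSimilarity

# Passing None as the corpus skips index construction, so no documents get normalized.
index = SparseMatrixSimilarity(None)
# Stop the index from normalizing queries.
index.normalize = False
# Build the document-by-term matrix by hand, keeping the original weights.
index.index = corpus2csc(
    documents,
    num_docs=len(documents),
    num_terms=len(dictionary),
).T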

This is extremely difficult to get right and will surprise even experienced users. Here is what I would expect:

index = SparseMatrixSimilarity(
    documents,
    normalize_queries=False,
    normalize_documents=False,
)
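
To spell out the semantics I have in mind for the two flags, here is a toy sketch. ToyIndex and its helper are hypothetical names and not gensim code, and documents are plain {term: weight} dicts rather than gensim's bag-of-words lists. With both flags left at True it returns cosine similarities; with both set to False it returns raw dot products.

```python
import math

def _unit(vec):
    """Scale a {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

class ToyIndex:
    """Hypothetical stand-in for a similarity index; not gensim's implementation."""

    def __init__(self, documents, normalize_queries=True, normalize_documents=True):
        self.normalize_queries = normalize_queries
        # Documents are normalized once, at construction time, only if requested.
        self.documents = [_unit(d) if normalize_documents else dict(d) for d in documents]

    def __getitem__(self, query):
        # Queries are normalized at lookup time, only if requested.
        if self.normalize_queries:
            query = _unit(query)
        return [sum(w * d.get(t, 0.0) for t, w in query.items()) for d in self.documents]

documents = [{"cat": 3.0, "dog": 4.0}, {"cat": 2.0}]
query = {"cat": 2.0}

# Defaults reproduce the current behavior: cosine similarities.
cosine_scores = ToyIndex(documents)[query]
# Both flags off: the weights are used exactly as given.
raw_scores = ToyIndex(documents, normalize_queries=False, normalize_documents=False)[query]
print(cosine_scores, raw_scores)
```

Keeping both defaults at True would leave existing code unaffected, while letting users of schemes like BM25 opt out of each normalization separately.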
