Get the Top N Most Similar Vectors from Gensim SparseMatrixSimilarity for a Specific Query

Andrea Ciufo

May 3, 2021, 12:30:58 PM

to Gensim

I am stuck with this problem.

I computed the cosine similarity between a query vector and every document in my corpus:

index = similarities.SparseMatrixSimilarity(tfidf[BoW_corpus], num_features = feature_cnt)

tfidf_query = tfidf[query_vector]

sim = index[tfidf_query]

I don't know how to extract and save, for example, the 10 most similar documents in their original, "untokenized" form (I'm not sure that is the correct term).

You can reproduce the entire code in this Colab notebook:

https://colab.research.google.com/drive/1wbYuufncV6LaiDBHstk01m8LltuheEec#scrollTo=XSEKeGpons_D

You can also find my question on Stack Overflow :)

https://stackoverflow.com/questions/67358081/get-top-n-most-similar-vector-from-sparsematrixsimilarity-gensim-based-on-a-spec

To all the community members, thank you for your patience :)

Andrea

Radim Řehůřek

May 5, 2021, 4:45:49 AM

to Gensim

Hi Andrea,

Gensim doesn't store your original "untokenized" documents at all.

So if you want to retrieve them, you have to do that outside of Gensim. For example:

corpus = [doc1, doc2, doc3, …]

bow_corpus = …whatever processing you use…

index.num_best = 10

for doc_no, score in index[tfidf[query_vector]]:
    print("original document:", corpus[doc_no])

In this example, you'd be using corpus as your outside-of-Gensim storage of your original tokens.

Alternatively, you could store the documents in a database etc. The link is always the document position within the corpus, that's what the Gensim index returns. See also https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html

Hope that helps,

Radim

Andrea Ciufo

May 13, 2021, 3:24:02 AM

to Gensim

Thank you, Radim.

I am studying the documentation and the link you shared.

I tried your solution in my Colab notebook, but it doesn't work. It raises an error:

for doc_no, score in index[tfidf[query_vector]]:
    print("original document:", corpus[doc_no])

TypeError: cannot unpack non-iterable numpy.float32 object

I then tried the following, but I feel the results are not good:

https://colab.research.google.com/drive/1wbYuufncV6LaiDBHstk01m8LltuheEec#scrollTo=4ibS28Rq49AC

# testing a loop on sorted output

sims = sorted(enumerate(sim), key=lambda item: -item[1])

for doc_position, doc_score in sims:
    print(doc_score, tokenized_lines[doc_position])
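Note that the sorted loop above walks through every document, not only the best few. One possible way to keep just the top N from a plain score array, sketched here with the standard library only, is heapq.nlargest. The helper name top_n and the sample values in sim and tokenized_lines are hypothetical and stand in for the real data.

```python
import heapq


def top_n(scores, n):
    """Return the n highest (position, score) pairs, best first."""
    return heapq.nlargest(n, enumerate(scores), key=lambda item: item[1])


# Hypothetical similarity scores, one per document position.
sim = [0.10, 0.85, 0.00, 0.42, 0.67]
# Hypothetical tokenized documents in the same order as the scores.
tokenized_lines = [["first"], ["second"], ["third"], ["fourth"], ["fifth"]]

for doc_position, doc_score in top_n(sim, 3):
    print(doc_score, tokenized_lines[doc_position])
```

Slicing the sorted list, as in sims[:3], would give the same pairs; nlargest simply avoids sorting the whole array when N is small.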

Radim Řehůřek

May 14, 2021, 3:28:25 PM

to Gensim

Did you set num_best as in my example?

-rr

Andrea Ciufo

May 16, 2021, 3:40:50 AM

to Gensim

No, because I forgot at least two shots of coffee.

Now it works, thanks!

Why do I have to set the number of best results through

index.num_best = 10

I would like to understand the logic behind it.

Radim Řehůřek

May 17, 2021, 2:23:48 AM

to Gensim

Check out its documentation: https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.SparseMatrixSimilarity

-rr

Andrea Ciufo

May 18, 2021, 2:45:48 AM

to Gensim

Great, thanks!
