Get Top N Most Similar Vector from SparseMatrixSimilarity gensim based on a specific query

Andrea Ciufo

May 3, 2021, 12:30:58 PM
to Gensim
I am stuck with this problem.

I calculated the cosine similarity between a query vector and every document in my corpus:

index = similarities.SparseMatrixSimilarity(tfidf[BoW_corpus], num_features=feature_cnt)
tfidf_query = tfidf[query_vector]
sim = index[tfidf_query]

I don't know how to extract and save, for example, the 10 most similar vectors in their original "untokenized" format (I'm not sure this is the correct term).

Here on this colab you can reproduce the entire code



You can also find my question on Stack Overflow :)
https://stackoverflow.com/questions/67358081/get-top-n-most-similar-vector-from-sparsematrixsimilarity-gensim-based-on-a-spec

To all the community members, thank you for your patience :)
Andrea 

Radim Řehůřek

May 5, 2021, 4:45:49 AM
to Gensim
Hi Andrea,

Gensim doesn't store your original "untokenized" documents at all.

So if you want to retrieve them, you have to do that outside of Gensim. For example:

corpus = [doc1, doc2, doc3, …]
bow_corpus = …whatever processing you use…

index.num_best = 10
for doc_no, score in index[tfidf[query_vector]]:
     print("original document:", corpus[doc_no])

In this example, you'd be using corpus as your outside-of-Gensim storage of your original tokens.

Alternatively, you could store the documents in a database etc. The link is always the document position within the corpus, that's what the Gensim index returns. See also https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html

Hope that helps,
Radim

Andrea Ciufo

May 13, 2021, 3:24:02 AM
to Gensim
Thank you, Radim.
I am studying the documentation and the link you shared.

I tried your solution in my Colab notebook, but it doesn't work. It returns an error message:

for doc_no, score in index[tfidf[query_vector]]:
     print("original document:", corpus[doc_no])
TypeError: cannot unpack non-iterable numpy.float32 object
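This error is what you would expect if the query returns a plain array with one score per document rather than (doc_no, score) pairs, which is my understanding of what happens when num_best is not set. A small reproduction with a NumPy array of made-up scores, no Gensim involved:

```python
import numpy as np

# Made-up scores standing in for a dense query result (one float per document).
sims = np.array([0.12, 0.87, 0.40], dtype=np.float32)

try:
    # Each element is a single float, so it cannot be unpacked into two names.
    for doc_no, score in sims:
        pass
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)
```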

I tried this, but I feel the results are not good. 

#testing a loop on sorted output
sims = sorted(enumerate(sim), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, tokenized_lines[doc_position])
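The loop above prints every document in the corpus. To keep only the first N and show the original text instead of the tokens, the sorted list can be sliced and the positions looked up in a list of raw documents. A sketch with made-up values; raw_docs and top_n are hypothetical names, and sim here is an ordinary list rather than the array from the notebook:

```python
# Hypothetical stand-ins: original texts and one similarity score per document.
raw_docs = [
    "first original document",
    "second original document",
    "third original document",
    "fourth original document",
]
sim = [0.10, 0.85, 0.00, 0.40]
top_n = 2

# Pair each score with its position, sort by score descending, keep the first top_n.
ranked = sorted(enumerate(sim), key=lambda item: -item[1])[:top_n]

# Use the position to fetch the untokenized text.
top_docs = [(raw_docs[doc_position], doc_score) for doc_position, doc_score in ranked]

for text, score in top_docs:
    print(f"{score:.2f}  {text}")
```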

Radim Řehůřek

May 14, 2021, 3:28:25 PM
to Gensim
Did you set num_best as in my example?

-rr

Andrea Ciufo

May 16, 2021, 3:40:50 AM
to Gensim
No, because I forgot at least two shots of coffee. 

Now it works, thanks!

Why do I have to set the number of best matches through

index.num_best = 10?

I would like to understand the logic behind it.

Radim Řehůřek

May 17, 2021, 2:23:48 AM
to Gensim

Andrea Ciufo

May 18, 2021, 2:45:48 AM
to Gensim
Great, thanks!