I am calculating the similarity between a query:
query2 = 'Audit and control, Board structure, Remuneration, Shareholder rights, Transparency and Performance'
and a document (in my case, a company's annual report).
I am using GloVe vectors and computing the soft cosine similarity between them, but somehow I get a similarity score of 1 for two of the documents. How is that possible? I know for sure that those documents do not contain only the query words. Each document is a .txt file with cleaned text. If a document matched the query words exactly, a similarity of 1 would make sense, but I know it does not match exactly.
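For intuition, the soft cosine between tf-idf vectors x and y is x·S·y / (sqrt(x·S·x)·sqrt(y·S·y)), where S is the term-similarity matrix. A score of 1 therefore does not require identical documents; it is enough that S treats the terms that survive weighting as (near-)interchangeable. A minimal sketch with a hypothetical three-term vocabulary:

```python
import numpy as np

def soft_cosine(x, y, S):
    """Soft cosine similarity: x.S.y / (sqrt(x.S.x) * sqrt(y.S.y))."""
    return (x @ S @ y) / (np.sqrt(x @ S @ x) * np.sqrt(y @ S @ y))

x = np.array([1.0, 0.0, 2.0])  # two clearly different "documents"
y = np.array([0.0, 3.0, 1.0])

# With S = identity this is the ordinary cosine: well below 1.
print(soft_cosine(x, y, np.eye(3)))        # ≈ 0.283

# If S says every term is identical to every other term, the score
# saturates at exactly 1 for any two non-negative vectors.
print(soft_cosine(x, y, np.ones((3, 3))))  # 1.0
```

So a similarity of exactly 1 usually means the weighted vectors end up pointing in the same direction under S, not that the texts are literally identical.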
Code:
# corpus (list of tokenized documents), query (tokenized query) and
# titles (document file names) are assumed to be defined earlier.
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import (SoftCosineSimilarity,
                                 SparseTermSimilarityMatrix,
                                 WordEmbeddingSimilarityIndex)

# Load the pre-trained GloVe vectors only once.
if 'glove' not in locals():
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

def build_term_similarity_matrix(corpus, query):
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)
    return SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

# Note: this is a term similarity matrix, not a TfidfModel.
similarity_matrix = build_term_similarity_matrix(corpus, query)

def doc_similarity_scores(query, similarity_matrix):
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)
    query_tf = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)
    return index[query_tf]

document_sim_scores = doc_similarity_scores(query, similarity_matrix)
sorted_sim_scores = sort_similarity_scores_by_document(document_sim_scores)  # helper defined elsewhere

# For each query term, collect the most similar words per document.
doc_similar_terms = []
max_results_per_doc = 50
dictionary = Dictionary(corpus + [query])  # build once instead of once per term
for term in query:
    idx1 = dictionary.token2id[term]
    for document in corpus:
        results_this_doc = []
        for word in set(document):
            idx2 = dictionary.token2id[word]
            score = similarity_matrix.matrix[idx1, idx2]
            if score > 0.0:
                results_this_doc.append((word, score))
        results_this_doc.sort(reverse=True, key=lambda x: x[1])
        doc_similar_terms.append(results_this_doc[:max_results_per_doc])

for idx in sorted_sim_scores[:90]:
    similar_terms_string = ', '.join(result[0] for result in doc_similar_terms[idx])
    print(f'{idx} \t {document_sim_scores[idx]:0.3f} \t {titles[idx]} : {similar_terms_string}')
Results:
1.000 2019_q4_en_eur_con_00.txt
1.000 2017_q3_en_eur_con_00.txt
0.994 2018_ar_en_eur_con_00.txt
0.989 2019_ar_en_eur_con_00.txt
0.986 2020_q2_en_eur_con_00.txt
0.963 2014_ar_en_eur_con_00.txt
It is strange that when I give the model only one document, I get a similarity of 0.873:
0.873 2019_q4_en_eur_con_00.pdf.txt : accounting, commission, audited, disclosure, regulatory, reviewed, committee, report, board, assessment, department, preliminary, disclosures, disclosed, compliance, supervisory, supervision, management, guidelines, commissions, advisory, corrections, remuneration, boards
When I use all 31 documents (annual reports), I get a different result.
The question is also asked on Stack Overflow:
https://stackoverflow.com/questions/66533269/soft-cosine-similarity-1-between-query-and-a-document


[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('thorough', 0.45976236), ('report', 0.45536885), ('evaluation', 0.44741735), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('preliminary', 0.41754), ('disclosures', 0.41536415), ('submitted', 0.4147601)]
[('audit', 1.0), ('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('regulatory', 0.49845448), ('committee', 0.46439847), ('report', 0.45536885), ('evaluation', 0.44741735), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('committee', 0.46439847), ('thorough', 0.45976236), ('report', 0.45536885), ('agency', 0.44895616), ('evaluation', 0.44741735), ('board', 0.44654623), ('assessing', 0.43565267), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847)]
[('accounting', 0.5946457), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('committee', 0.46439847), ('report', 0.45536885), ('evaluation', 0.44741735), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('accounting', 0.5946457), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('submitted', 0.4147601), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('committee', 0.46439847), ('report', 0.45536885), ('evaluation', 0.44741735), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('examining', 0.51577145), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessing', 0.43565267), ('recommendations', 0.4335667), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('disclosed', 0.41097352), ('supervisory', 0.39804897), ('assessed', 0.3894063)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessing', 0.43565267), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('assessed', 0.3894063), ('commissions', 0.36198187)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessing', 0.43565267), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('assessed', 0.3894063), ('commissions', 0.36198187)]
[('audit', 1.0), ('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('agency', 0.44895616), ('evaluation', 0.44741735), ('board', 0.44654623), ('assessing', 0.43565267), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415)]
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('examining', 0.51577145), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('regulatory', 0.49845448), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('audit', 1.0), ('auditors', 0.7022769), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('inquiries', 0.49698716), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('submitted', 0.4147601), ('disclosed', 0.41097352)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('preliminary', 0.41754), ('disclosures', 0.41536415), ('disclosed', 0.41097352), ('supervisory', 0.39804897)]
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('oversight', 0.5876334), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('reviewed', 0.46478835), ('committee', 0.46439847)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('submitted', 0.4147601), ('supervisory', 0.39804897), ('commissions', 0.36198187)]
[('audit', 1.0), ('auditing', 0.6833676), ('accounting', 0.5946457), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('agency', 0.44895616), ('board', 0.44654623), ('assessing', 0.43565267), ('concluded', 0.43356472)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('agency', 0.44895616), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187)]
For example, for the word 'audit', the nonzero similarities for each document in the corpus are:
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847)] …

I don't know if the problem is the number of words, but for me this gist function does not work. I tried taking the document text from my own pandas DataFrame and running the function on it, but I still got back 0.0. Then I just replaced your variables with my own document text and query, and I still got back 0.0.
One thing I noticed: you are using
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex
and this gives me the error: ImportError: cannot import name 'WordEmbeddingSimilarityIndex' from 'gensim.similarities' (C:\Users\marek.keskull\Anaconda3\lib\site-packages\gensim\similarities\__init__.py)
Instead, I used: from gensim.models import WordEmbeddingSimilarityIndex
Why didn't you use lemmatization when processing your documents? Is there a reason behind that?
Why do you use a pre-trained Word2Vec model with SCM? Why not GloVe? Is there a difference?
Is there a way you can combine models together to get a better similarity score?
TfidfModel removes all terms with a TF-IDF score below 1e-12. Annoyingly, there seems to be no way to switch this off in the constructor. However, you can always pass eps=0 to __getitem__, which should prevent any removal.