Soft cosine similarity 1 between query and a document


Marek Kesküll

Mar 9, 2021, 10:13:40 AM
to Gensim

I am calculating the similarity between a query: 

query2 = 'Audit and control, Board structure, Remuneration, Shareholder rights, Transparency and Performance' and a document (in my case, a company's annual report).

I am using GloVe vectors and calculating the soft cosine similarity between the vectors, but somehow I get a similarity score of 1 with two documents. How is that possible? I know for sure that the document does not contain only these query words. The document is a .txt file with cleaned text. If the document matched these words exactly, the similarity could be 1, but I know it does not match exactly.

Code:

# Imports (Gensim 3.x; in Gensim >= 4.0.0, WordEmbeddingSimilarityIndex
# lives in gensim.similarities instead of gensim.models):
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, WordEmbeddingSimilarityIndex
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix

# corpus (list of tokenized documents), query (list of tokens), titles, and
# sort_similarity_scores_by_document are defined elsewhere.

if 'glove' not in locals():
    glove = api.load("glove-wiki-gigaword-50")

similarity_index = WordEmbeddingSimilarityIndex(glove)


def build_term(corpus, query):
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)
    return similarity_matrix


# Despite the name, this holds the SparseTermSimilarityMatrix.
tfidf_model = build_term(corpus, query)


def doc_similarity_scores(query, similarity_matrix):
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)
    query_tf = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)
    return index[query_tf]


document_sim_scores = doc_similarity_scores(query, tfidf_model)
sorted_sim_scores = sort_similarity_scores_by_document(document_sim_scores)

doc_similar_terms = []
max_results_per_doc = 50
dictionary = Dictionary(corpus + [query])  # identical on every iteration, so built once
for term in query:
    idx1 = dictionary.token2id[term]
    for document in corpus:
        results_this_doc = []
        for word in set(document):
            idx2 = dictionary.token2id[word]
            score = tfidf_model.matrix[idx1, idx2]
            if score > 0.0:
                results_this_doc.append((word, score))
        results_this_doc = sorted(results_this_doc, reverse=True, key=lambda x: x[1])
        results_this_doc = results_this_doc[:max_results_per_doc]
        doc_similar_terms.append(results_this_doc)

for idx in sorted_sim_scores[:90]:
    similar_terms_string = ', '.join([result[0] for result in doc_similar_terms[idx]])
    print(f'{idx} \t {document_sim_scores[idx]:0.3f} \t {titles[idx]}')

Results:

1.000   2019_q4_en_eur_con_00.txt

1.000   2017_q3_en_eur_con_00.txt

0.994   2018_ar_en_eur_con_00.txt

0.989   2019_ar_en_eur_con_00.txt

0.986   2020_q2_en_eur_con_00.txt

0.963   2014_ar_en_eur_con_00.txt

It is strange that when I put only one document into the model, I get a similarity of 0.873.

0.873 2019_q4_en_eur_con_00.pdf.txt : accounting, commission, audited, disclosure, regulatory, reviewed, committee, report, board, assessment, department, preliminary, disclosures, disclosed, compliance, supervisory, supervision, management, guidelines, commissions, advisory, corrections, remuneration, boards

When I use 31 documents (annual reports), I get a different result.

[screenshot Capture.PNG attached]

The question is also asked on Stack Overflow:

https://stackoverflow.com/questions/66533269/soft-cosine-similarity-1-between-query-and-a-document

Vít Novotný

Mar 12, 2021, 6:52:15 AM
to Gensim
Dear Marek,

I am the author of the implementation of the soft cosine similarity in Gensim. You ask how it is possible to get the similarity score of 1 with two different documents. Mathematically, this is not possible unless the embeddings for some words (such as Hello and Hi) are identical. Then, softcossim(hello_world, hi_world) would be 1, but this is very unlikely to happen in practice.

> similarity can be 1 but I know it does not match exactly

Can you please verify that similarity - 1.0 is zero? Since you are rounding to three decimal places, there could be a rounding error.
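For instance, a hypothetical score just below 1 still prints as 1.000 at three decimal places:

```python
score = 0.9996  # hypothetical soft cosine score just below 1

# At three decimal places the difference from 1 is invisible.
print(f'{score:0.3f}')     # prints 1.000
print(score - 1.0 == 0.0)  # False: the score is not exactly 1
```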

> It is strange that when i put only 1 document to the model, i get the similarity 0.873.

Does this mean that you call tfidf_model = build_term(corpus, query) with corpus of size 1? This would produce a smaller dictionary, a different tfidf model, and a smaller term similarity matrix, so the scores are expected to be different.

If you call build_term(corpus, query) with your whole corpus, but then do index = SoftCosineSimilarity(..., similarity_matrix) with just a single document, you should still be getting the soft cosine similarity of 1 (or close to 1). Can you please verify?

Best regards,
Vítek

On Tuesday, March 9, 2021 at 4:13:40 PM UTC+1, kesky...@gmail.com wrote:

Vít Novotný

Mar 12, 2021, 7:14:05 AM
to Gensim
Oh, could you also please print out the 25 pairs of (term, word) with nonzero similarities tfidf_model.matrix[idx1, idx2] for your document and their similarities?

On Friday, March 12, 2021 at 12:52:15 PM UTC+1, Vít Novotný wrote:

Marek Kesküll

Mar 13, 2021, 8:18:39 AM
to gen...@googlegroups.com
> similarity can be 1 but I know it does not match exactly
Can you please verify that similarity - 1.0 is zero? Since you are rounding to three decimal places, there could be a rounding error.

I rounded to 6 decimal places.

[screenshot attached]

> It is strange that when i put only 1 document to the model, i get the similarity 0.873.

>>Does this mean that you call tfidf_model = build_term(corpus, query) with corpus of size 1? This would produce a smaller dictionary, a different tfidf model, and a smaller term similarity matrix, so the scores are expected to be different.

Yes, it means that I call the tfidf model with a corpus of size 1.

>>If you call build_term(corpus, query) with your whole corpus, but then do index = SoftCosineSimilarity(..., similarity_matrix) with just a single document, you should still be getting the soft cosine similarity of 1 (or close to 1). Can you please verify?

I built the model with the whole corpus and then built the index with just a single document; I still got a similarity of 1.

[screenshot attached]



>> Oh, could you also please print out the 25 pairs of (term, word) with nonzero similarities tfidf_model.matrix[idx1, idx2] for your document and their similarities?

I'm not sure I understand what you are asking. There aren't only 25 pairs; idx = 25 is the index of the document.

[screenshot attached]

For example, for the word 'audit', the nonzero similarities for each document in the corpus are:
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('thorough', 0.45976236), ('report', 0.45536885), ('evaluation', 0.44741735), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('preliminary', 0.41754), ('disclosures', 0.41536415), ('submitted', 0.4147601)]
[('audit', 1.0), ('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('regulatory', 0.49845448), ('committee', 0.46439847), ('report', 0.45536885), ('evaluation', 0.44741735), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('committee', 0.46439847), ('thorough', 0.45976236), ('report', 0.45536885), ('agency', 0.44895616), ('evaluation', 0.44741735), ('board', 0.44654623), ('assessing', 0.43565267), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847)]
[('accounting', 0.5946457), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('committee', 0.46439847), ('report', 0.45536885), ('evaluation', 0.44741735), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('accounting', 0.5946457), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('submitted', 0.4147601), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('committee', 0.46439847), ('report', 0.45536885), ('evaluation', 0.44741735), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('examining', 0.51577145), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessing', 0.43565267), ('recommendations', 0.4335667), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('disclosed', 0.41097352), ('supervisory', 0.39804897), ('assessed', 0.3894063)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessing', 0.43565267), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('assessed', 0.3894063), ('commissions', 0.36198187)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessing', 0.43565267), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('assessed', 0.3894063), ('commissions', 0.36198187)]
[('audit', 1.0), ('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('agency', 0.44895616), ('evaluation', 0.44741735), ('board', 0.44654623), ('assessing', 0.43565267), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415)]
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('examining', 0.51577145), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('regulatory', 0.49845448), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187), ('corrections', 0.3525373)]
[('audit', 1.0), ('auditors', 0.7022769), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('inquiries', 0.49698716), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('concluded', 0.43356472), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('submitted', 0.4147601), ('disclosed', 0.41097352)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('preliminary', 0.41754), ('disclosures', 0.41536415), ('disclosed', 0.41097352), ('supervisory', 0.39804897)]
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('oversight', 0.5876334), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('reviewed', 0.46478835), ('committee', 0.46439847)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('submitted', 0.4147601), ('supervisory', 0.39804897), ('commissions', 0.36198187)]
[('audit', 1.0), ('auditing', 0.6833676), ('accounting', 0.5946457), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('agency', 0.44895616), ('board', 0.44654623), ('assessing', 0.43565267), ('concluded', 0.43356472)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187)]
[('accounting', 0.5946457), ('commission', 0.54735374), ('audited', 0.53008056), ('reviewed', 0.46478835), ('committee', 0.46439847), ('report', 0.45536885), ('agency', 0.44895616), ('board', 0.44654623), ('assessment', 0.42662713), ('department', 0.4245992), ('disclosures', 0.41536415), ('supervisory', 0.39804897), ('commissions', 0.36198187)]






Marek Kesküll

Mar 13, 2021, 8:29:37 AM
to Gensim
Also, can you please explain how the similarity score is calculated when a word exactly matches a query term, for example ('audit', 1.0)? Does that mean the document automatically gets a similarity of 1?

Vít Novotný

Mar 14, 2021, 5:32:00 AM
to gen...@googlegroups.com
On Sat, Mar 13, 2021 at 2:18 PM, Marek Kesküll <kesky...@gmail.com> wrote:
For example, for word audit for each document in corpus the nonzero similarities are:
[('audit', 1.0), ('auditors', 0.7022769), ('audits', 0.6958291), ('auditing', 0.6833676), ('accounting', 0.5946457), ('auditor', 0.5844135), ('reviewing', 0.5687027), ('commission', 0.54735374), ('audited', 0.53008056), ('review', 0.52923644), ('disclosure', 0.5214933), ('regulatory', 0.49845448), ('pricewaterhousecoopers', 0.4658655), ('reviewed', 0.46478835), ('committee', 0.46439847)] …
 
These all seem quite reasonable to me, but here we are looking at random documents. It would be interesting to see what the situation looks like for documents 25, 14, 16, and others with high soft cosine similarity to the query. We don't even have to look at just a single word (here “audit”), since the soft cosine similarity is highly interpretable and we can decompose it to a sum of word pair similarities. I created a Gist for you that interprets the similarity, like this:

>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>> # ... stopword removal, conversion to TF-IDF, creation of word similarity matrix
>>> interpret_soft_cosine_measure(sentence_obama, sentence_president, dictionary, similarity_matrix)
0.32 = 0.13 (obama:president) + 0.12 (illinois:chicago) + 0.06 (media:press)
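Under the hood, the interpretation is just the bilinear form behind the soft cosine measure split into its nonzero terms. A minimal numpy sketch of the same idea (toy vectors and a toy similarity matrix, not the Gist's actual code):

```python
import numpy as np

def interpret_scm(x, y, terms, S):
    """Decompose the soft cosine measure between bag-of-words vectors x and y
    into per-word-pair contributions that sum to the total score."""
    norm = np.sqrt(x @ S @ x) * np.sqrt(y @ S @ y)
    total = (x @ S @ y) / norm
    pairs = [(terms[i], terms[j], x[i] * S[i, j] * y[j] / norm)
             for i in range(len(x)) for j in range(len(y))
             if x[i] * S[i, j] * y[j] != 0]
    return total, sorted(pairs, key=lambda p: -p[2])

# Toy vocabulary and term similarity matrix.
terms = ['obama', 'president', 'speaks', 'greets']
S = np.array([[1.0, 0.6, 0.0, 0.0],
              [0.6, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.4],
              [0.0, 0.0, 0.4, 1.0]])
x = np.array([1.0, 0.0, 1.0, 0.0])  # "obama speaks"
y = np.array([0.0, 1.0, 0.0, 1.0])  # "president greets"

total, pairs = interpret_scm(x, y, terms, S)
# The pair contributions (obama:president, speaks:greets) add up to total.
```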

If you could run your query and your high-similarity documents through interpret_soft_cosine_measure, the results should be quite revealing.

Best regards,
Vítek

Marek Kesküll

Mar 14, 2021, 7:54:00 AM
to gen...@googlegroups.com
I don't know if the problem is in the number of words, but for me this Gist function does not work. I tried taking the document text from my own pandas dataframe and running the function on it, but I still got back 0.0.
Then I just tried to replace your variables with my own document text and the query; I still got back 0.0:

[screenshot attached]

One thing I noticed: you are using
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex
and this gives me the error: ImportError: cannot import name 'WordEmbeddingSimilarityIndex' from 'gensim.similarities' (C:\Users\marek.keskull\Anaconda3\lib\site-packages\gensim\similarities\__init__.py)

Instead, I used: from gensim.models import WordEmbeddingSimilarityIndex

I also created a Gist for you, so you could run this with my data and see whether it gives you any results. It is here

Marek Kesküll

Mar 14, 2021, 8:58:40 AM
to gen...@googlegroups.com
I ran this function on document number 25; the results:
0.03 = 0.00 (audit:management) + 0.00 (audit:commission) + 0.00 (rights:commission) + 0.00 (rights:non) + 0.00 (audit:accounting) + 0.00 (audit:supervisory) + 0.00 (rights:groups) + 0.00 (audit:report) + 0.00 (rights:granted) + 0.00 (rights:legal) + 0.00 (audit:committee) + 0.00 (rights:international) + 0.00 (rights:regard) + 0.00 (rights:public) + 0.00 (rights:support) + 0.00 (rights:nations) + 0.00 (audit:assessment) + 0.00 (audit:supervision) + 0.00 (rights:united) + 0.00 (rights:regarding) + 0.00 (rights:issues) + 0.00 (audit:commissions) + 0.00 (rights:union) + 0.00 (rights:decision) + 0.00 (audit:boards) + 0.00 (rights:authority) + 0.00 (rights:claims) + 0.00 (rights:government) + 0.00 (rights:supporting) + 0.00 (audit:reviewed) + 0.00 (rights:amendment) + 0.00 (audit:determine) + 0.00 (rights:persons) + 0.00 (rights:human) + 0.00 (rights:considers) + 0.00 (audit:audited) + 0.00 (rights:country) + 0.00 (rights:participation) + 0.00 (audit:disclosure) + 0.00 (rights:issue) + 0.00 (rights:recognized) + 0.00 (audit:regulatory) + 0.00 (rights:maintains) + 0.00 (rights:independence) + 0.00 (rights:act) + 0.00 (rights:intellectual) + 0.00 (rights:recognition) + 0.00 (rights:governments) + 0.00 (rights:responsible) + 0.00 (rights:rule) + 0.00 (rights:independent) + 0.00 (rights:concerning) + 0.00 (rights:adoption) + 0.00 (rights:organisation) + 0.00 (audit:department) + 0.00 (audit:preliminary) + 0.00 (audit:disclosures) + 0.00 (audit:disclosed) + 0.00 (rights:filed) + 0.00 (rights:affairs) + 0.00 (audit:compliance) + 0.00 (audit:guidelines) + 0.00 (audit:registration) + 0.00 (rights:exclusion) + 0.00 (audit:advisory) + 0.00 (audit:corrections)

Marek Kesküll

Mar 15, 2021, 12:44:28 PM
to Gensim
I also do not understand why models.tfidf removes query words. If eps is set to 0.01, it should still return the query words, even if the score is close to 0.

[screenshot attached]
So, the score will be higher if the term is used more frequently in a document but lower if the term is used in more documents. The idea is that terms with the highest tfidf score for a given document are the most distinguishing ones for that particular document.
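That trade-off can be illustrated in plain Python (a toy log-IDF weighting for illustration only, not Gensim's exact default scheme):

```python
import math

def tfidf(term, document, corpus):
    """Toy TF-IDF: frequency of the term in the document, discounted by
    the number of corpus documents the term appears in."""
    tf = document.count(term) / len(document)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df)

corpus = [['audit', 'audit', 'report'],
          ['report', 'board'],
          ['report', 'remuneration']]

# 'audit' is frequent in one document and rare in the corpus: high score.
# 'report' appears in every document: IDF is log(3/3) = 0, so score is 0.
print(tfidf('audit', corpus[0], corpus))
print(tfidf('report', corpus[0], corpus))
```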

Vít Novotný

Mar 17, 2021, 2:38:42 PM
to Gensim
On Sunday, March 14, 2021 at 12:54:00 PM UTC+1, kesky...@gmail.com wrote:
I don't know if the problem is in the number of words, but for me this Gist function does not work. I tried taking the document text from my own pandas dataframe and running the function on it, but I still got back 0.0.
Then I just tried to replace your variables with my own document text and the query; I still got back 0.0:

I am getting the same result. You can check that this is the same thing that Gensim would give you by running similarity_matrix.inner_product(sentence_obama, sentence_president, normalized=(True, True)).
What this means is that none of the terms in your query either hard-match or soft-match the terms in your document.

The first one is easy to verify: cossim(sentence_obama, sentence_president) is 0.0, so there really are no hard matches.

But why would there be no soft matches? Surely some of the many words in sentence_president would have nonzero similarity with some of the words in sentence_obama, right? Not necessarily! The similarity matrix is highly sparse and does not contain all word similarities provided by the word embeddings. By default, at most the 100 most similar words for each word are considered. In your example below, only 1.15% of similarity_matrix is non-zero (compare similarity_matrix.matrix.nnz with len(dictionary)**2). The algorithm deciding which 100 to keep is greedy (see the end of Section 3 in my paper for details), but it favours words with high IDF (rare, self-informative words), which is why we feed tfidf to SparseTermSimilarityMatrix. However, since the training corpus is just two sentences, the IDF estimates for the words will not be very accurate.
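The density check described above can be illustrated with a toy dense matrix (with Gensim you would compare similarity_matrix.matrix.nnz against len(dictionary)**2 instead):

```python
import numpy as np

# Toy 4x4 term similarity matrix; most off-diagonal entries were pruned
# by the top-k-neighbours construction, so they are exactly zero.
S = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

density = np.count_nonzero(S) / S.size
print(f'{100 * density:.2f}% of the matrix is non-zero')  # 37.50%
```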
 
[screenshot attached]

One thing I noticed: you are using
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex
and this gives me the error: ImportError: cannot import name 'WordEmbeddingSimilarityIndex' from 'gensim.similarities' (C:\Users\marek.keskull\Anaconda3\lib\site-packages\gensim\similarities\__init__.py)

Instead, I used: from gensim.models import WordEmbeddingSimilarityIndex

WordEmbeddingSimilarityIndex has moved from gensim.models to gensim.similarities between Gensim 3.8.3 and 4.0.0.

Vít Novotný

Mar 17, 2021, 2:46:07 PM
to Gensim
TF-IDF removes all terms with TF-IDF score less than 1e-12. Annoyingly, there seems to be no way to switch it off in the constructor. However, you can always set eps=0 in __getitem__, which should prevent any removal.

If we don't use TF-IDF, your gist produces the following results. Specifically, the soft cosine measure is still quite close to zero, but now it's slightly above zero.

>>> sentence_obama = dictionary.doc2bow(sentence_obama)
>>> sentence_president = dictionary.doc2bow(sentence_president)
>>> 
>>> cossim(sentence_obama, sentence_president)
0.07971335862064016
>>> similarity_matrix.inner_product(sentence_obama, sentence_president, normalized=(True, True))
0.0
>>> interpret_soft_cosine_measure(sentence_obama, sentence_president, dictionary, similarity_matrix)
0.00 = 0.00 (board) + 0.00 (board:management) + 0.00 (audit:board) + 0.00 (audit) + 0.00 (shareholder:profit) + 0.00 (shareholder:share) + 0.00 (board:supervisory) + 0.00 (audit:supervisory) + 0.00 (performance) + 0.00 (board:audit) + 0.00 (board:members) + 0.00 (shareholder:shares) + 0.00 (shareholder:shareholders) + 0.00 (board:member) + 0.00 (shareholder:corporate) + 0.00 (board:committee) + 0.00 (shareholder:equity) + 0.00 (audit:accounting) + 0.00 (audit:committee) + 0.00 (performance:rating) + 0.00 (audit:report) + 0.00 (board:general) + 0.00 (shareholder:pension) + 0.00 (board:commission) + 0.00 (board:company) + 0.00 (shareholder:banking) + 0.00 (shareholder:company) + 0.00 (board:finance) + 0.00 (shareholder:fund) + 0.00 (shareholder:transaction) + 0.00 (audit:commission) + 0.00 (audit:auditors) + 0.00 (board:chairman) + 0.00 (audit:remuneration) + 0.00 (shareholder:parent) + 0.00 (performance:results) + 0.00 (audit:agencies) + 0.00 (board:employees) + 0.00 (shareholder:acquisition) + 0.00 (board:reserve) + 0.00 (transparency:governance) + 0.00 (performance:quality) + 0.00 (board:boards) + 0.00 (audit:disclosed) + 0.00 (shareholder:merger) + 0.00 (audit:assessed) + 0.00 (performance:good) + 0.00 (performance:best) + 0.00 (audit:auditor) + 0.00 (performance:ratings) + 0.00 (transparency:standards) + 0.00 (performance:previous) + 0.00 (performance:stage) + 0.00 (performance:performed) + 0.00 (shareholder:subsidiaries) + 0.00 (transparency:guarantee) + 0.00 (board:association) + 0.00 (board:office) + 0.00 (board:auditors) + 0.00 (transparency:policies) + 0.00 (transparency:supervision) + 0.00 (audit:boards) + 0.00 (performance:positive) + 0.00 (board:unit) + 0.00 (board:approved) + 0.00 (shareholder:subsidiary) + 0.00 (transparency:compliance) + 0.00 (transparency:adequacy) + 0.00 (performance:strong) + 0.00 (audit:disclosure) + 0.00 (transparency:objective) + 0.00 (shareholder:investor) + 0.00 (performance:overall) + 0.00 (shareholder:ownership) + 0.00 
(audit:assessment) + 0.00 (board:state) + 0.00 (audit:audits) + 0.00 (board:independent) + 0.00 (audit:assessing) + 0.00 (audit:compliance) + 0.00 (transparency:disclosure) + 0.00 (performance:short) + 0.00 (audit:analysis) + 0.00 (board:union) + 0.00 (shareholder:dividend) + 0.00 (board:auditor) + 0.00 (performance:outstanding) + 0.00 (shareholder:associates) + 0.00 (transparency:scope) + 0.00 (performance:experience) + 0.00 (audit:procedures) + 0.00 (audit:assessments) + 0.00 (audit:evaluation) + 0.00 (board:authority) + 0.00 (board:employee) + 0.00 (board:department) + 0.00 (audit:review) + 0.00 (transparency:relevant) + 0.00 (shareholder:holdings) + 0.00 (board:review) + 0.00 (board:decision) + 0.00 (transparency:efficiency) + 0.00 (performance:presentation) + 0.00 (board:appointed) + 0.00 (performance:successful) + 0.00 (shareholder) + 0.00 (audit:department) + 0.00 (shareholder:brokerage) + 0.00 (shareholder:partners) + 0.00 (performance:performing) + 0.00 (audit:evaluating) + 0.00 (performance:credits) + 0.00 (board:branch) + 0.00 (performance:test) + 0.00 (board:approval) + 0.00 (shareholder:holding) + 0.00 (audit:findings) + 0.00 (audit:recommendations) + 0.00 (board:advisory) + 0.00 (audit:disclosures) + 0.00 (transparency:competence) + 0.00 (audit:detailed) + 0.00 (performance:sound) + 0.00 (shareholder:employer) + 0.00 (transparency:commitment) + 0.00 (performance:reviews) + 0.00 (transparency:assurance) + 0.00 (board:decided) + 0.00 (audit:advisory) + 0.00 (shareholder:firm) + 0.00 (performance:achievement) + 0.00 (performance:perform) + 0.00 (performance:combination) + 0.00 (performance:earned) + 0.00 (performance:category) + 0.00 (board:agreed) + 0.00 (transparency) + 0.00 (audit:regulatory) + 0.00 (transparency:regulatory) + 0.00 (board:administration) + 0.00 (board:plans) + 0.00 (board:managers) + 0.00 (audit:pricewaterhousecoopers) + 0.00 (audit:reviewed) + 0.00 (board:firm) + 0.00 (shareholder:partner) + 0.00 (board:governing) + 0.00 
(transparency:adherence) + 0.00 (performance:excellent) + 0.00 (transparency:enhancing) + 0.00 (audit:concluded) + 0.00 (transparency:cooperation) + 0.00 (audit:audited) + 0.00 (performance:competition) + 0.00 (transparency:enhance) + 0.00 (shareholder:institutional) + 0.00 (shareholder:profits) + 0.00 (performance:technical) + 0.00 (transparency:effectiveness) + 0.00 (shareholder:agreed) + 0.00 (transparency:determination) + 0.00 (transparency:transparent) + 0.00 (audit:auditing) + 0.00 (performance:pace) + 0.00 (performance:success) + 0.00 (board:offices) + 0.00 (board:consulting) + 0.00 (transparency:institutional) + 0.00 (shareholder:bankruptcy) + 0.00 (performance:achieved) + 0.00 (shareholder:insurance) + 0.00 (transparency:ensuring) + 0.00 (performance:shown) + 0.00 (performance:improvement) + 0.00 (shareholder:bid) + 0.00 (transparency:maintaining) + 0.00 (board:head) + 0.00 (transparency:intermediation) + 0.00 (board:agency) + 0.00 (audit:committees) + 0.00 (audit:agency) + 0.00 (transparency:flexibility) + 0.00 (board:committees) + 0.00 (board:filed) + 0.00 (shareholder:ceo) + 0.00 (audit:ethics) + 0.00 (board:ethics) + 0.00 (audit:accountants) + 0.00 (transparency:establishes) + 0.00 (performance:rated) + 0.00 (board:ceo) + 0.00 (audit:commissions) + 0.00 (transparency:auditing) + 0.00 (shareholder:liquidation) + 0.00 (audit:evaluates) + 0.00 (board:executive) + 0.00 (audit:reviewing) + 0.00 (performance:show) + 0.00 (performance:talent) + 0.00 (performance:shows) + 0.00 (board:national) + 0.00 (board:recommended) + 0.00 (performance:combined) + 0.00 (transparency:sustainability) + 0.00 (transparency:strengthen) + 0.00 (performance:showing) + 0.00 (performance:ever) + 0.00 (shareholder:creditors) + 0.00 (transparency:stability) + 0.00 (board:joint) + 0.00 (performance:production) + 0.00 (performance:credited) + 0.00 (transparency:budgetary) + 0.00 (transparency:insufficient) + 0.00 (audit:overseeing) + 0.00 (transparency:ensures) + 0.00 
(transparency:objectivity) + 0.00 (transparency:safeguards) + 0.00 (transparency:coordination) + 0.00 (board:senior) + 0.00 (transparency:conformity) + 0.00 (board:officer) + 0.00 (audit:submitted) + 0.00 (transparency:improving) + 0.00 (board:vice) + 0.00 (shareholder:bondholders) + 0.00 (board:safety) + 0.00 (board:commerce) + 0.00 (performance:performs) + 0.00 (performance:awards) + 0.00 (audit:corrections) + 0.00 (board:chaired) + 0.00 (transparency:leveraging) + 0.00 (transparency:macro) + 0.00 (shareholder:broker) + 0.00 (performance:tempo)

On Monday, March 15, 2021 at 5:44:28 PM UTC+1, kesky...@gmail.com wrote:

Marek Kesküll

May 31, 2021, 7:31:01 AM
to Gensim
  • Why didn't you use lemmatization when processing your documents? Is there a reason behind that?

  • Why do you use a pre-trained Word2Vec model with SCM? Why not GloVe? Is there a difference?

  • Is there a way you can combine models together to get a better similarity score?

Radim Řehůřek

May 31, 2021, 1:23:55 PM
to Gensim
Hi,


TF-IDF removes all terms with TF-IDF score less than 1e-12. Annoyingly, there seems to be no way to switch it off in the constructor. However, you can always set eps=0 in __getitem__, which should prevent any removal.

Explicit zero feature weights (eps=0) will break the invariant of sparse vector representation, and lead to segfaults. Don't do that.

If you need a dense (numpy array) representation including, for some reason, all the zeros (very RAM-inefficient!), use gensim.matutils.sparse2full.
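For illustration, a Gensim-style sparse vector is a list of (token_id, weight) pairs with zeros omitted; a minimal re-implementation of what sparse2full does (a sketch of the idea, not Gensim's actual code):

```python
import numpy as np

def sparse2full_sketch(sparse_vec, length):
    """Expand a list of (index, weight) pairs, with zeros omitted,
    into a dense numpy vector with explicit zeros."""
    dense = np.zeros(length, dtype=np.float32)
    for index, weight in sparse_vec:
        dense[index] = weight
    return dense

bow = [(0, 0.7), (3, 0.2)]  # sparse bag-of-words over a 5-word dictionary
dense = sparse2full_sketch(bow, 5)
```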

HTH,
Radim


Vít Novotný

Jun 21, 2021, 7:29:31 PM
to Gensim
On Monday, May 31, 2021 at 1:31:01 PM UTC+2, kesky...@gmail.com wrote:
  • Why didn't you use lemmatization when processing your documents? Is there a reason behind that?

Lemmatization requires that the word embeddings have been trained on a lemmatized corpus, which is rarely the case with pre-trained word embeddings.
  • Why do you use Word2Vec pre-trained model with SCM? Why not GloVe? Is there a difference?

Just as an example. You will have to see which model gives the best results for your domain.
  • Is there a way you can combine models together to get a better similarity score?

You can combine similarity matrices by (weighted) averaging:

combined_similarity_matrix = SparseTermSimilarityMatrix(0.1 * first_similarity_matrix.matrix + 0.9 * second_similarity_matrix.matrix)
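With plain numpy arrays standing in for the .matrix attributes, the weighted average looks like this (the 0.1/0.9 weights are illustrative and would be tuned on a validation set):

```python
import numpy as np

# Hypothetical term similarity matrices from two embedding models,
# built over the same dictionary so that rows and columns align.
first = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
second = np.array([[1.0, 0.4],
                   [0.4, 1.0]])

# Weighted average: the diagonal stays 1.0 and each off-diagonal entry
# is the weighted mean of the two models' word similarities.
combined = 0.1 * first + 0.9 * second
```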

You could also view the similarity matrices as sparse directed graphs between words and apply e.g. power iteration to compute a denser closure, where we infer the similarities of previously unconnected words by taking e.g. the harmonic mean of the shortest path between them.

Both of these techniques could be helpful for your domain, but few published experimental results exist at the moment. If you try them, I would be interested to hear whether they improve your performance.