Different values for cosine similarity calculated by gensim model.wv.n_similarity vs. sklearn.metrics.pairwise pairwise_distances

tedo.v...@gmail.com

Apr 23, 2019, 12:58:25 PM

to Gensim

Two different approaches using the same model (Word2Vec from Gensim) give different values for cosine similarity.

E.g.:

$ head -c 50 Gensim_CS.csv

1.0 0.8943 0.16969 0.15607 0.38753 0.46953 0.32108

$ head -c 51 Sklearn_CS.csv

1.0 0.95788 0.64737 0.63894 0.73894 0.77508 0.71154

I should explain how I convert the distances returned by sklearn into similarities:

1. The results are divided by the maximum of their absolute values.

2. Similarity = 1/(1+distance)

I even tried my favorite (angular) transformation:

similarity = 2*arccos(distance)/Pi

That didn't work either.

So, does model.wv.n_similarity compute cosine similarity after all, or not?

tedo.v...@gmail.com

Apr 23, 2019, 1:21:55 PM

to Gensim

One more point, on performance:

Using sklearn's pairwise_distances is approx. 16.5 times faster than gensim's model.wv.n_similarity, even though for model.wv.n_similarity I am using multiprocessing, calculating only the triangle values, and compiling with Cython...

tedo.v...@gmail.com

Apr 23, 2019, 1:32:53 PM

to Gensim

The Word2Vec model is normalized via w2v_model.init_sims(replace=True).
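
For context, a minimal NumPy sketch of what unit-length normalization implies. This is not gensim's code, and the array `vectors` is made-up illustration data: once every row has L2 norm 1, a plain dot product between two rows equals their cosine similarity.

```python
import numpy as np

# Hypothetical toy "word vectors"; real ones would come from the trained model.
vectors = np.array([[3.0, 4.0], [1.0, 0.0], [-2.0, 2.0]])

# L2-normalize each row so that it has unit length.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def cosine_similarity(a, b):
    """Cosine similarity from the definition: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# For unit vectors the denominator is 1, so the dot product is the cosine.
dot_of_units = float(np.dot(unit[0], unit[1]))
cos_of_raw = cosine_similarity(vectors[0], vectors[1])
```

Both `dot_of_units` and `cos_of_raw` describe the same angle, so they agree up to floating-point rounding.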

Gordon Mohr

Apr 23, 2019, 3:24:27 PM

to Gensim

It's hard to guess exact reasons for the discrepancy without seeing the exact code which generated it. Pick two vectors, show them, show your two methods' code and different output.

But note: "cosine similarity" is *not* equal to (1 / (1 + cosine_distance)). That (1 / (1 + distance)) formula is just one shortcut to turn arbitrary distances into 0.0 to 1.0 similarity-like values. Cosine similarity is something else entirely, and will have a -1.0 to 1.0 range:

https://en.wikipedia.org/wiki/Cosine_similarity
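
To make the distinction concrete, here is a small sketch; the vectors `a` and `b` are arbitrary illustration values, not taken from the thread. It assumes the usual convention that cosine distance is 1 minus cosine similarity, so the exact way back is `1 - distance`, while `1 / (1 + distance)` yields a different number.

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# Cosine similarity from the definition; here it is cos(45 degrees).
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
cos_dist = 1.0 - cos_sim

recovered = 1.0 - cos_dist         # exact inverse of the distance definition
squashed = 1.0 / (1.0 + cos_dist)  # generic distance-to-similarity squash

# Vectors pointing in opposite directions reach the lower bound of the range.
opposite = float(np.dot(a, -a) / (np.linalg.norm(a) * np.linalg.norm(-a)))
```

Here `recovered` matches `cos_sim`, `squashed` does not, and `opposite` shows that the range extends down to -1.0.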

You can see exactly what `n_similarity()` does in the source code:

https://github.com/RaRe-Technologies/gensim/blob/3514d3fb9224280edd8ddd14c46b722220df5436/gensim/models/keyedvectors.py#L957
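
A rough paraphrase of what that code does, assuming the gensim 3.x implementation at the link. This is a sketch, not the library source; `toy_vectors`, `unit`, and `n_similarity_sketch` are hypothetical names. Each word list is averaged into one vector, both averages are scaled to unit length, and the result is their dot product.

```python
import numpy as np

# Hypothetical stand-in for model.wv: a word -> vector mapping.
toy_vectors = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
    "car": np.array([1.0, 1.0]),
}


def unit(v):
    """Scale v to unit L2 length."""
    return v / np.linalg.norm(v)


def n_similarity_sketch(words1, words2, vectors=toy_vectors):
    """Cosine similarity between the mean vectors of two word lists."""
    if not (words1 and words2):
        raise ValueError("both word lists must be non-empty")
    mean1 = np.mean([vectors[w] for w in words1], axis=0)
    mean2 = np.mean([vectors[w] for w in words2], axis=0)
    return float(np.dot(unit(mean1), unit(mean2)))


score = n_similarity_sketch(["cat", "dog"], ["car"])
```

Note that the mean of several unit vectors is generally not unit length, which is why the averages are re-normalized before the dot product.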

- Gordon

Andrey Kutuzov

Apr 23, 2019, 4:13:59 PM

to gen...@googlegroups.com

Note also that sklearn's pairwise_distances() method uses Euclidean
distance by default, not cosine.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
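
A short sketch of the difference, using a made-up matrix `X`: calling `pairwise_distances` without a `metric` argument returns Euclidean distances, while `metric="cosine"` returns cosine distances, which convert to cosine similarity as `1 - distance`.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical data: three 2-d row vectors.
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])

euclid = pairwise_distances(X)                     # default metric="euclidean"
cos_dist = pairwise_distances(X, metric="cosine")  # 1 - cosine similarity
cos_sim = 1.0 - cos_dist
```

For rows 0 and 1, which are orthogonal, the Euclidean distance is sqrt(5) while the cosine distance is 1, so the two matrices are not interchangeable.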

--
Solve et coagula!
Andrey

tedo.v...@gmail.com

Apr 23, 2019, 5:34:09 PM

to Gensim

On Tuesday, April 23, 2019 at 21:24:27 UTC+2, Gordon Mohr wrote:

> It's hard to guess exact reasons for the discrepancy without seeing the exact code which generated it. Pick two vectors, show them, show your two methods' code and different output.
>
> But note: "cosine similarity" is *not* equal to (1 / (1 + cosine_distance)). That (1 / (1 + distance)) formula is just one shortcut to turn arbitrary distances into 0.0 to 1.0 similarity-like values. Cosine similarity is something else entirely, and will have a -1.0 to 1.0 range:
>
> https://en.wikipedia.org/wiki/Cosine_similarity

Yes, I know what cosine similarity is. I was a little confused about cosine distance. I checked the sklearn code, and it calculates distance = 1 - similarity.

> You can see exactly what `n_similarity()` does in the source code:
>
> https://github.com/RaRe-Technologies/gensim/blob/3514d3fb9224280edd8ddd14c46b722220df5436/gensim/models/keyedvectors.py#L957

A dot product, which is exactly cosine similarity in the case of normalized vectors.

tedo.v...@gmail.com

Apr 23, 2019, 5:35:38 PM

to Gensim

On Tuesday, April 23, 2019 at 22:13:59 UTC+2, Andrey Kutuzov wrote:

> Note also that sklearn's pairwise_distances() method uses Euclidean
> distance by default, not cosine.
> https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

Thanks, but that's irrelevant here, because of the parameters I use.

tedo.v...@gmail.com

Apr 23, 2019, 6:10:49 PM

to Gensim

After applying the distance -> similarity transformation (similarity = 1 - distance) to the sklearn results, I got similar values:

1.0 0.89897 0.1939 0.17753 0.36849 0.45357 0.28358

I think it is close enough.

tedo.v...@gmail.com

Apr 23, 2019, 6:13:40 PM

to Gensim

On Tuesday, April 23, 2019 at 19:21:55 UTC+2, tedo....@gmail.com wrote:

> One more point, on performance:
> Using sklearn's pairwise_distances is approx. 16.5 times faster than gensim's model.wv.n_similarity, even though for model.wv.n_similarity I am using multiprocessing, calculating only the triangle values, and compiling with Cython...

That one remains a performance issue. So it is much faster to use sklearn to calculate cosine similarity than gensim's model.wv.n_similarity.

Gordon Mohr

Apr 23, 2019, 9:32:50 PM

to Gensim

Without seeing the code demonstrating the difference, it's not clear what the reasons might be. (It might be something as simple as `pairwise_distances()` being called in a way that lets it use some native bulk array operation, while your code is calling `n_similarity()` in some loop, paying Python overhead on each calculation.)
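
To illustrate that guess (purely a sketch with random data, not a reconstruction of the poster's code): for unit-normalized rows, a single matrix product yields every pairwise cosine similarity at once, whereas a nested Python loop computes the same numbers one pair at a time and pays interpreter overhead on each.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))                  # hypothetical data
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows

# Bulk: one native matrix product computes all pairs.
bulk = X @ X.T

# Loop: one Python-level call per pair, same result.
n = len(X)
looped = np.empty((n, n))
for i in range(n):
    for j in range(n):
        looped[i, j] = np.dot(X[i], X[j])
```

The two matrices are numerically equal; only the amount of per-pair Python overhead differs, and that gap grows with the number of rows.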

- Gordon
