Different values for cosine similarity calculated by gensim model.wv.n_similarity vs. sklearn.metrics.pairwise pairwise_distances


tedo.v...@gmail.com

Apr 23, 2019, 12:58:25 PM
to Gensim
Two different approaches using the same model (Word2Vec from Gensim) give different values for cosine similarity.
E.g.:
$ head -c 50 Gensim_CS.csv
1.0 0.8943 0.16969 0.15607 0.38753 0.46953 0.32108
$ head -c 51 Sklearn_CS.csv 
1.0 0.95788 0.64737 0.63894 0.73894 0.77508 0.71154

I should explain how I convert the distances returned by sklearn into similarities:
1. The results are divided by the maximum of their absolute values.
2. Similarity = 1/(1+distance)

I even tried my favorite (angular) transformation:
similarity = 2*arccos(distance)/Pi
That didn't work either.

So, does model.wv.n_similarity compute cosine similarity after all, or not?

tedo.v...@gmail.com

Apr 23, 2019, 1:21:55 PM
to Gensim
One more point, on performance:
Using sklearn pairwise_distances is approx. 16.5 times faster than gensim model.wv.n_similarity, even if I use multiprocessing (for model.wv.n_similarity), calculate only the triangle values, and compile with Cython...

tedo.v...@gmail.com

Apr 23, 2019, 1:32:53 PM
to Gensim
The Word2Vec model is normalized via w2v_model.init_sims(replace=True).

Gordon Mohr

Apr 23, 2019, 3:24:27 PM
to Gensim
It's hard to guess exact reasons for the discrepancy without seeing the exact code which generated it. Pick two vectors, show them, show your two methods' code and different output.

But note: "cosine similarity" is *not* equal to 1 / (1 + cosine_distance). That 1 / (1 + distance) formula is just one shortcut for turning arbitrary distances into similarity-like values in the 0.0 to 1.0 range. Cosine similarity is something else entirely, and has a -1.0 to 1.0 range:
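
To make the distinction concrete, here is a minimal sketch that computes cosine similarity directly and compares it with the two distance-to-similarity conversions discussed in this thread. The vectors `u` and `v` and the helper name `cosine_similarity` are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ranges over [-1.0, 1.0]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two made-up vectors at a 45 degree angle, purely for illustration.
u = [1.0, 0.0]
v = [1.0, 1.0]

sim = cosine_similarity(u, v)
cos_dist = 1.0 - sim                # cosine distance, defined as 1 - similarity
shortcut = 1.0 / (1.0 + cos_dist)   # generic squashing of a distance into (0, 1]

print(f"cosine similarity:  {sim:.4f}")
print(f"1 - distance:       {1.0 - cos_dist:.4f}")
print(f"1 / (1 + distance): {shortcut:.4f}")
```

Only `1 - distance` recovers the original similarity; the `1 / (1 + distance)` shortcut yields a different number and can never go below zero.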


You can see exactly what `n_similarity()` does in the source code:
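
As a rough sketch of the idea, and not a copy of the gensim source: to the best of my understanding, `n_similarity()` in gensim 3.x averages the vectors of each word list and returns the cosine similarity between the two mean vectors. The dict `toy_vectors` and the function names below are hypothetical stand-ins for `model.wv`; check the source of your installed version for the actual behavior.

```python
import numpy as np

def unit(v):
    """Scale v to unit L2 length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def n_similarity_sketch(vectors, words_a, words_b):
    """Cosine similarity between the mean vectors of two word lists.

    `vectors` is a hypothetical dict mapping word -> 1-D numpy array,
    standing in for model.wv.
    """
    mean_a = np.mean([vectors[w] for w in words_a], axis=0)
    mean_b = np.mean([vectors[w] for w in words_b], axis=0)
    return float(np.dot(unit(mean_a), unit(mean_b)))

# Made-up 3-dimensional "word vectors", purely for illustration.
toy_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
}

result = n_similarity_sketch(toy_vectors, ["cat", "dog"], ["dog", "car"])
print(result)
```

Note that the mean is re-normalized before the dot product, so the result is a true cosine even when the averaged vectors are no longer unit length.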


- Gordon

Andrey Kutuzov

Apr 23, 2019, 4:13:59 PM
to gen...@googlegroups.com
Note also that sklearn's pairwise_distances() method uses Euclidean
distance by default, not cosine.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
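
To illustrate why the choice of metric matters, the sketch below computes both distance matrices by hand with NumPy on a small made-up array `X`. It does not call sklearn; with sklearn, the cosine variant is selected by passing `metric="cosine"` to `pairwise_distances()`.

```python
import numpy as np

# Three made-up row vectors, purely for illustration.
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])

# Euclidean distance matrix: ||x_i - x_j|| for every pair of rows.
diff = X[:, None, :] - X[None, :, :]
euclidean = np.sqrt((diff ** 2).sum(axis=-1))

# Cosine distance matrix: 1 - cos(x_i, x_j) for every pair of rows.
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
cosine = 1.0 - unit_rows @ unit_rows.T

print(np.round(euclidean, 4))
print(np.round(cosine, 4))
```

The Euclidean values depend on vector length and are unbounded, while the cosine distances depend only on direction and stay within 0.0 to 2.0, so the two matrices cannot be converted into the same similarities by one formula.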

--
Solve et coagula!
Andrey

tedo.v...@gmail.com

Apr 23, 2019, 5:34:09 PM
to Gensim


On Tuesday, April 23, 2019 at 21:24:27 UTC+2, Gordon Mohr wrote:
> It's hard to guess exact reasons for the discrepancy without seeing the exact code which generated it. Pick two vectors, show them, show your two methods' code and different output.
>
> But note: "cosine similarity" is *not* equal to 1 / (1 + cosine_distance). That 1 / (1 + distance) formula is just one shortcut for turning arbitrary distances into similarity-like values in the 0.0 to 1.0 range. Cosine similarity is something else entirely, and has a -1.0 to 1.0 range:



Yes, I know what cosine similarity is. I was a little bit confused about cosine distance. I checked the sklearn code, and it calculates distance = 1 - similarity.
The dot product is exactly the cosine similarity in the case of normalized vectors.
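
A small check of that statement, as a sketch with made-up random vectors (the names `raw` and `normalized` are illustrative and NumPy is assumed): after L2-normalizing each row, the plain dot product matches the full cosine formula applied to the original rows.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 8))   # 5 made-up vectors of dimension 8

# L2-normalize every row so that each vector has unit length.
normalized = raw / np.linalg.norm(raw, axis=1, keepdims=True)

a, b = raw[0], raw[1]
full_cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
plain_dot = np.dot(normalized[0], normalized[1])

# Cosine distance is 1 - similarity, so similarity = 1 - distance.
cosine_distance = 1.0 - plain_dot

print(np.isclose(full_cosine, plain_dot))
print(np.isclose(1.0 - cosine_distance, full_cosine))
```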

tedo.v...@gmail.com

Apr 23, 2019, 5:35:38 PM
to Gensim


On Tuesday, April 23, 2019 at 22:13:59 UTC+2, Andrey Kutuzov wrote:
> Note also that sklearn's pairwise_distances() method uses Euclidean
> distance by default, not cosine.
> https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html



Thanks, but that doesn't apply here, because I set the metric through the parameters.

tedo.v...@gmail.com

Apr 23, 2019, 6:10:49 PM
to Gensim
After applying the distance -> similarity transformation (similarity = 1 - distance) to the sklearn results, I got similar values:
1.0 0.89897 0.1939 0.17753 0.36849 0.45357 0.28358
I think that is close enough.

tedo.v...@gmail.com

Apr 23, 2019, 6:13:40 PM
to Gensim
On Tuesday, April 23, 2019 at 19:21:55 UTC+2, tedo....@gmail.com wrote:
> One more point, on performance:
> Using sklearn pairwise_distances is approx. 16.5 times faster than gensim model.wv.n_similarity, even if I use multiprocessing (for model.wv.n_similarity), calculate only the triangle values, and compile with Cython...

That one remains a performance issue. So it is much faster to use sklearn to calculate cosine similarity than gensim's model.wv.n_similarity.

Gordon Mohr

Apr 23, 2019, 9:32:50 PM
to Gensim
Without seeing the code demonstrating the difference, it's not clear what the reasons might be. (It might be something as simple as `pairwise_distances()` being called in a way that lets it use some native bulk array operation, while your code is calling `n_similarity()` in some loop, paying Python overhead on each calculation.)
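
The contrast described here can be sketched as follows. This is an illustration only, not the poster's code: the array `vecs` is random made-up data, the function names are hypothetical, and no timings are claimed. One function makes a separate Python-level call per pair, while the other computes the whole similarity matrix in a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(42)
vecs = rng.normal(size=(200, 50))                     # made-up vectors
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit-normalize rows

def similarities_loop(m):
    """One Python-level dot product per pair: n * n interpreter round trips."""
    n = len(m)
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = float(np.dot(m[i], m[j]))
    return out

def similarities_bulk(m):
    """All pairwise dot products at once, in a single native matrix product."""
    return m @ m.T

loop_result = similarities_loop(vecs)
bulk_result = similarities_bulk(vecs)
print(np.allclose(loop_result, bulk_result))
```

Both functions produce the same matrix. Wrapping each call with `time.perf_counter()` on your own data would show how much of the gap comes from per-call overhead rather than from the arithmetic.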

- Gordon