How to interpret negative similarity in word2vec

Simon Lindgren

May 15, 2019, 1:21:21 PM
to Gensim
With this code ...

>>> reddit_model.wv.most_similar(positive=["🍔"])
('🍟', 0.9034159183502197)
('🍕', 0.8422054052352905)


... I interpret this as fries being more similar to burger than pizza is.

When using "negative", like this ...

>>> reddit_model.wv.most_similar(negative=["🦄"])
('🏀', -0.0860629677772522)
('🔊', -0.09106328338384628)
('🏈', -0.13306763768196106)

... should I interpret football or basketball as being the least similar to unicorn?

Thanks!

Gordon Mohr

May 15, 2019, 10:56:50 PM
to Gensim
The cosine-similarities reported by `most_similar()` will be in the range -1.0 (diametrically-dissimilar) to 1.0 (exactly-similar).

Supplying words or vectors as `negative` examples in `most_similar()` just means those items are fully negated – reflected on every axis – before being combined with any `positive` examples. It's not typical to provide negative examples without any positive examples, as the mere negation of a vector may not have a clear interpretation. (Setting related vectors against each other in the `positive` and `negative` positions *may* be usefully interpretable, as in the classic analogy-solving example of `wordvecs.most_similar(positive=['king', 'woman'], negative=['man'])`, where it's the difference between 'woman' and 'man' that shifts the positive input 'king' into the general vicinity of a word like 'queen'.)
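
As a minimal sketch of that analogy pattern – assuming a trained `KeyedVectors` instance named `wordvecs` whose vocabulary contains these English words (with an emoji model like yours you'd substitute your own tokens):

    # 'king', shifted by the ('woman' - 'man') difference, typically lands near 'queen'.
    for word, similarity in wordvecs.most_similar(
            positive=['king', 'woman'], negative=['man'], topn=3):
        print(word, similarity)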

That said, I believe the following to be the case:

* `wordvecs.most_similar(negative=[some_word])` should give results equivalent to `wordvecs.most_similar(positive=[-wordvecs[some_word]])`

* the bottom-10 words from a full ranking of all words' similarity to a single positive example, such as:

    wordvecs.most_similar(positive=[some_word], topn=len(wordvecs))[-10:]

...should be the same words, in reverse order, as the top-10 words of that same single negative example:

    wordvecs.most_similar(negative=[some_word], topn=10)
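
A quick, informal way to check both of the above points empirically – assuming `wordvecs` is a trained `KeyedVectors` instance and `some_word` is a token in its vocabulary (e.g. `some_word = '🦄'`):

    # Point 1: a negated word should rank results the same as its negated raw vector.
    via_negative = wordvecs.most_similar(negative=[some_word], topn=10)
    via_negated_vector = wordvecs.most_similar(positive=[-wordvecs[some_word]], topn=10)
    print([w for w, _ in via_negative] == [w for w, _ in via_negated_vector])

    # Point 2: the bottom-10 of the full positive ranking, reversed, should match
    # the top-10 for the single negative example (up to floating-point ties).
    bottom_10 = wordvecs.most_similar(positive=[some_word], topn=len(wordvecs))[-10:]
    print([w for w, _ in reversed(bottom_10)] == [w for w, _ in via_negative])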

In your results, it is '🏀' that is "most similar" to the negation-of-'🦄'. But, with a similarity of just -0.08, it's *not* very close to that negation-of-'🦄' at all. That suggests to me that whole neighborhood, in that semi-hypersphere "direction" from the origin, is unpopulated – all your vectors have been pushed "onto one side" of the space. That seems to be natural in these models – perhaps especially when using typical `negative` settings that are greater-than-1 (like the default `5`) – but that muddies the possible interpretation of the "negation-of-a-vector" as a word's natural "opposite". See more discussion of these effects in a prior thread <https://groups.google.com/d/msg/gensim/o8cDWyihuKc/hruB7QLHHwAJ>.
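
One rough way to see that one-sidedness for yourself, assuming a recent gensim where `KeyedVectors` offers `get_normed_vectors()` and `get_vector(word, norm=True)` (older versions exposed `vectors_norm` and `word_vec(word, use_norm=True)` instead):

    wv = reddit_model.wv
    unit_vectors = wv.get_normed_vectors()          # all word vectors, unit length
    anti_unicorn = -wv.get_vector('🦄', norm=True)  # negated, unit-length query
    cosines = unit_vectors @ anti_unicorn
    # If almost no cosines are positive, essentially the whole vocabulary sits
    # "on one side" of the space, away from the negated direction.
    print(int((cosines > 0).sum()), 'of', len(cosines), 'words have positive similarity')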

- Gordon