word2vec understanding similarity functions


ziqi zhang

Apr 3, 2017, 3:49:42 AM4/3/17
to gensim
I notice there are several methods for computing similarity / searching for similar words, but the documentation is incomplete and I wonder what the difference is:

print(model.wv.most_similar(positive=['biology']))
print(model.wv.most_similar_cosmul(positive=['biology']))
print(model.wv.similarity('cell', 'blood'))
print(model.wv.n_similarity('cell','blood'))

using my model the results I get are (in order):

[('requisite', 0.36276543140411377), ('significantly_decreased', 0.35393762588500977), ...
[('requisite', 0.6813820600509644), ('significantly_decreased', 0.6769681572914124),...
-0.117000218559
0.321996696627


Also, the first two methods allow setting a top N of results to be returned. Is it possible instead to return all words with a score greater than a threshold?

Many thanks

ziqi zhang

Apr 3, 2017, 12:19:20 PM4/3/17
to gensim
Could I have some advice on this, please? I'd really like to know which method I should use for finding similar words and calculating similarity scores. As you can see, the different methods produce different output.

Thanks

Gordon Mohr

Apr 3, 2017, 4:30:22 PM4/3/17
to gensim
The `wv` property is a KeyedVectors object – with extensive per-method documentation for the relevant methods starting at:

    https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar

In short, the `_cosmul` variant uses a slightly-different comparison when combining multiple positive/negative examples (such as when asking about analogies), which one paper has shown performs better. (For a single positive target word, the values will differ, but the rank-ordering of the similar words returned shouldn't change.)
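The two scoring rules can be sketched with toy vectors. This is a simplified illustration of the additive vs. multiplicative combination, not gensim's actual code; the vocabulary and vectors below are made up, not from a real model:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def cos(a, b):
    return float(np.dot(unit(a), unit(b)))

# Toy random "word-vectors" standing in for a trained model's vocabulary.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8) for w in ['king', 'man', 'woman', 'queen', 'apple']}

def score_additive(word, positive, negative=()):
    # most_similar()-style: cosine against the mean of the unit-normed inputs,
    # with negative examples entering the mean with a -1 weight.
    vecs = [unit(vocab[p]) for p in positive] + [-unit(vocab[n]) for n in negative]
    return cos(vocab[word], np.mean(vecs, axis=0))

def score_cosmul(word, positive, negative=()):
    # most_similar_cosmul()-style: shift each cosine into [0, 1], multiply the
    # positive terms together, and divide by the product of the negative terms.
    pos = np.prod([(1 + cos(vocab[word], vocab[p])) / 2 for p in positive])
    neg = np.prod([(1 + cos(vocab[word], vocab[n])) / 2 for n in negative])
    return pos / (neg + 1e-6)

# With a single positive word, (1 + cos)/2 is a monotone function of cos,
# so both rules rank candidate words identically even though the raw
# scores differ.
candidates = ['man', 'woman', 'queen', 'apple']
rank_add = sorted(candidates, key=lambda w: -score_additive(w, ['king']))
rank_mul = sorted(candidates, key=lambda w: -score_cosmul(w, ['king']))
assert rank_add == rank_mul
```

That monotone relationship is why the single-positive rankings match while the scores themselves (as in your output above) sit on different scales.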

`similarity()` compares exactly 2 words, and `n_similarity()` compares two *lists* of words. If you feed it strings, as in your example, it will interpret them as lists of one-character tokens. So:

    model.wv.n_similarity('cell', 'blood')

...is interpreted as...

    model.wv.n_similarity(['c', 'e', 'l', 'l'], ['b', 'l', 'o', 'o', 'd'])

(I'm surprised your call isn't giving a KeyError, but perhaps you have words for each character in your model?)
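The underlying pitfall is plain Python string iteration: wherever a list of tokens is expected, a bare string decomposes into its characters. A quick illustration (the commented `model.wv` call shows the hypothetical intended usage, not something runnable here):

```python
# A string passed where a token list is expected silently becomes characters:
assert list('cell') == ['c', 'e', 'l', 'l']
assert list('blood') == ['b', 'l', 'o', 'o', 'd']

# So the intended two-word comparison should wrap each word in a list:
#   model.wv.n_similarity(['cell'], ['blood'])
```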

There's no option for a specific `similarity` score-cutoff, just the top-N most-similars. You would have to perform such a filter on the full set of pairwise similarities yourself. The existing code might be useful as a model.
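One workaround is to request a generously large `topn` and then filter the returned (word, score) pairs yourself; a sketch with made-up results:

```python
# Hypothetical (word, score) pairs as returned by most_similar(); in practice
# you would request a large topn first, e.g.:
#   results = model.wv.most_similar(positive=['biology'], topn=1000)
results = [('requisite', 0.363), ('significantly_decreased', 0.354), ('membrane', 0.210)]

# Keep only the words whose similarity clears a chosen threshold.
threshold = 0.3
above = [(word, score) for word, score in results if score >= threshold]
assert above == [('requisite', 0.363), ('significantly_decreased', 0.354)]
```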


- Gordon

ziqi zhang

Apr 3, 2017, 5:04:46 PM4/3/17
to gensim
Thank you so much for the pointers!

Sumeet Sandhu

Jul 22, 2017, 8:18:04 PM7/22/17
to gensim
I am using an older model generated with gensim 0.13.3, loaded in Python 2.7.13 with gensim 2.1.0.

I get different values with model.most_similar() versus model.n_similarity():

>>> model.most_similar(['editorial' ,'rule'])
[(u'rules', 0.6381771564483643), (u'criteria', 0.5107885003089905), (u'criterion', 0.492725133895874), (u'trigger', 0.4885510504245758), (u'policy', 0.484529972076416), (u'DRM/DMCA', 0.47995659708976746), (u'condition', 0.47984564304351807), (u'evaluation', 0.46047770977020264), (u'review', 0.4602917432785034), (u'policies', 0.4565350115299225)]

>>> [ model.n_similarity(['editorial','rule'],[w[0]]) for w in model.most_similar(['editorial','rule']) ]
[0.6019811969674348, 0.48154911757292618, 0.46389647223886166, 0.46570814004737227, 0.46567470188832039, 0.4726064578375645, 0.44584272549756965, 0.44441483919354241, 0.46469044441119817, 0.44388189457279986]

The sort orders of these two results are different, which makes me worry about using one versus the other.
My understanding is that both take a dot product of two vectors, where each vector is a normalized sum of word vectors, give or take an extra averaging step in one of them, which should wash out in the normalization.

What am I missing?

regards,
Sumeet

Gordon Mohr

Jul 24, 2017, 4:31:45 PM7/24/17
to gensim
The only relevant difference I see in the two paths is that `most_similar()` averages the already-unit-normed vectors for supplied multiple positive examples, while `n_similarity()` averages the raw vectors of supplied multiple examples. 

I wouldn't expect this to "wash out" in the normalization before the final cosine-similarity calculation: the magnitudes of raw word-vectors can vary a lot, and averaging the unnormalized vectors will yield a different result than averaging pre-normalized ones, with the higher-magnitude vectors having more influence in the first case. There are some indications that higher-magnitude word-vectors have less-ambiguous meanings; for example, words with multiple conflicting senses tend to have lower-magnitude vectors. Whether one approach or the other is better for your purposes will likely depend on your project goals and evaluation measures.
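The difference is easy to see with two toy vectors of very different magnitudes; this sketch mimics the two averaging orders (it illustrates the math, not gensim's actual code):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Two toy word-vectors with very different magnitudes.
a = np.array([3.0, 0.0])
b = np.array([0.0, 0.3])

# most_similar()-style: unit-norm each vector first, then average.
mean_of_normed = unit((unit(a) + unit(b)) / 2)
# n_similarity()-style: average the raw vectors, then normalize.
mean_of_raw = unit((a + b) / 2)

# The raw average is pulled toward the higher-magnitude vector `a`, so the
# two averaged directions differ; final normalization can't undo that.
assert not np.allclose(mean_of_normed, mean_of_raw)
```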

- Gordon