Superficially, gensim does it this way because that's what was done in the original word2vec.c package. Specifically, the `distance` executable averaged all the words input on a line, then returned the vocabulary words 'closest' to that combination, and the `word-analogy` executable averaged together two positive words, with one word negated, to set the target vector from which similarities were calculated.
More intuitively, the idea of 'words as directions in a high-dimensional space' requires an ability to conserve and combine those directions. Summing/averaging the full n-dimensional vectors plausibly does that. Averaging just the distances quickly loses the 'directionality', collapsing each comparison to a single value.
So for example perhaps the vectors for 'male' and 'monarch', when averaged in n-dimensional space, point in a very similar direction as 'king'. You could test if this is the case, or perhaps find the word 'king' if you didn't already expect it, by finding [pseudocode] `minimum(all_distances(mean('male', 'monarch')))`.
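A minimal sketch of that pseudocode, using a made-up 3-dimensional toy vocabulary (real word2vec vectors are learned; these values are invented purely to illustrate the mechanics of average-then-rank):

```python
import numpy as np

# Hypothetical toy vocabulary; values chosen so 'king' sits near the
# male+monarch midpoint, as the surrounding text imagines.
vocab = {
    "king":    np.array([0.9, 0.4, 0.1]),
    "male":    np.array([1.0, 0.0, 0.0]),
    "monarch": np.array([0.0, 1.0, 0.2]),
    "apple":   np.array([0.1, 0.1, 1.0]),
}
# Unit-normalize, as gensim does before similarity lookups.
vocab = {w: v / np.linalg.norm(v) for w, v in vocab.items()}

def nearest(target, exclude=()):
    # Candidate vectors are unit length, and the query's own norm is
    # the same for every candidate, so ranking by plain dot product
    # is equivalent to ranking by cosine distance.
    sims = {w: float(np.dot(target, v))
            for w, v in vocab.items() if w not in exclude}
    return max(sims, key=sims.get)

# Average the seed vectors first, then measure all distances once:
mixed = (vocab["male"] + vocab["monarch"]) / 2
print(nearest(mixed, exclude=("male", "monarch")))  # -> king
```

The `exclude` step mirrors gensim's `most_similar`, which also omits the input words themselves from the results.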
If you instead do `minimum(means(zip(all_distances('male'), all_distances('monarch'))))`, you won't be focusing in on the same mixed-meaning 'midpoint' direction between the two seed words; you'll also be finding words very close to one seed but far from the other (whose simple distance-average is also small). These might still be interesting words! (You'd have to try it.) But those words aren't usually what tasks like analogy-solving are looking for, in the high-dimensional space.
Another factor: if you have a large number m of seed words, averaging them once, then computing the v distances to the entire v-sized vocabulary, requires less calculation than computing the m*v distances, then performing the v averages.
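A back-of-envelope count makes the gap concrete, assuming each n-dimensional distance costs roughly n multiply-adds (the sizes below are hypothetical, chosen only to show the scaling):

```python
# m seed words, v vocabulary words, n dimensions (all made-up sizes).
m, v, n = 10, 100_000, 300

# Average first (m*n ops for the mean), then v distances of n ops each:
avg_then_distance = m * n + v * n

# m*v distances of n ops each, then v averages over m values each:
distance_then_avg = m * v * n + v * m

print(avg_then_distance)                      # 30_003_000
print(distance_then_avg)                      # 301_000_000
print(distance_then_avg / avg_then_distance)  # roughly m times more work
```

The ratio grows linearly with the number of seed words m, so the average-first order is also simply the cheaper one.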
- Gordon