sum or average word embeddings? I always get the same result.


David Batista

unread,
May 10, 2015, 6:15:48 AM
to gen...@googlegroups.com

I want to represent a small phrase (3 to 4 words) as a single vector, either by summing the individual word embeddings or by averaging them.

I do a simple SUM with no weights involved (or, if you prefer, a weight of 1 for each word). The average vector is obtained like the sum, except that at the end I divide each entry of the vector by the number of summed words. I'm using cosine similarity as defined here: http://en.wikipedia.org/wiki/Cosine_similarity

From the experiments I've done, the sum and the average always give the same cosine similarity, so either I have a BUG in the code or I'm missing something. The word2vec model was generated with Mikolov's tool, and I'm using gensim to read the model and do the averaging or summing.

vec1 ['founder', 'and', 'ceo']
vec2 ['co-founder', 'and', 'former', 'chairman']

SUM
dot(vec1,vec2) 5.4008677771
norm(vec1) 2.19382594282
norm(vec2) 2.87226958166
norm(vec1)*norm(vec2) 6.30125952303
cosine(vec1,vec2) 0.857109242583

AVG
dot(vec1,vec2) 0.450072314758
norm(vec1) 0.731275314273
norm(vec2) 0.718067395416
norm(vec1)*norm(vec2) 0.525104960252
cosine(vec1,vec2) 0.857109242583

A more detailed example and the code are available here:
http://stackoverflow.com/questions/30142345/word2vec-sum-or-average-word-embeddings
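A minimal sketch of what I'm doing, with made-up toy vectors instead of real word2vec output, that reproduces the effect:

```python
import numpy as np

# Toy stand-ins for word embeddings; real ones would come from a word2vec model.
words1 = [np.array([0.3, 0.1, 0.5]),   # "founder"
          np.array([0.2, 0.4, 0.1]),   # "and"
          np.array([0.6, 0.2, 0.3])]   # "ceo"
words2 = [np.array([0.4, 0.1, 0.4]),   # "co-founder"
          np.array([0.2, 0.4, 0.1]),   # "and"
          np.array([0.1, 0.3, 0.2]),   # "former"
          np.array([0.5, 0.1, 0.6])]   # "chairman"

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

s1, s2 = sum(words1), sum(words2)            # plain unweighted sum
a1, a2 = s1 / len(words1), s2 / len(words2)  # average = sum scaled by 1/n

print(cosine(s1, s2))
print(cosine(a1, a2))  # same value as the sum version
```

Dividing each sentence vector by its word count only rescales it, so both calls print the same number.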

Gordon Mohr

unread,
May 10, 2015, 3:41:00 PM
to gen...@googlegroups.com
Cosine similarity depends only on the *angle* (from the origin) between the two vectors. That angle isn't changed by any scaling of the vectors, so the choice of summing or averaging (at this final step) won't change cosine-similarity measures.

By analogy to the 2D domain, consider each of the following pairs of vectors:

(1,0), (0,1)
(2,0), (0,1)
(2,0), (0,2)
(1,0), (0,2)

In all cases, the angle between them is 90°.
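A quick check of those pairs in code (plain numpy, nothing gensim-specific):

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The four pairs from the 2D analogy; scaling never changes the angle.
pairs = [((1, 0), (0, 1)),
         ((2, 0), (0, 1)),
         ((2, 0), (0, 2)),
         ((1, 0), (0, 2))]

for a, b in pairs:
    print(cosine(a, b))  # 0.0 every time: the vectors are 90 degrees apart
```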

- Gordon

David Batista

unread,
May 10, 2015, 3:51:01 PM
to gen...@googlegroups.com
Yes, I just realized that. So I wonder: when someone says they average the word embeddings to get a single vector representation of a sentence, what similarity metric do they use?

The dot product? That is not bounded, as far as I know; of course, one can always apply a simple normalization by dividing by the maximum of all dot products. Anyway, I was wondering what other similarity metrics one can apply when using the average.
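To make the unboundedness concrete, here is a toy sketch (arbitrary example vectors, just for illustration) of how the raw dot product grows with vector magnitude while cosine stays fixed in [-1, 1]:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

for scale in (1, 10, 100):
    # dot product scales linearly with the vector's magnitude...
    print(np.dot(scale * a, b), cosine(scale * a, b))
    # ...while cosine is unchanged by the scaling
```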

--
./david


Gordon Mohr

unread,
May 10, 2015, 9:46:17 PM
to gen...@googlegroups.com
I believe people do use cosine similarity between such sums-of-all-word-vectors, but I don't think that gives very good results. For example, in the tutorial accompanying the Kaggle competition "Bag of Words Meets Bag of Popcorn", see "Part 3: More Fun With Word Vectors":


Their section "From Words To Paragraphs, Attempt 1: Vector Averaging" concludes: "We found that this produced results much better than chance, but underperformed Bag of Words by a few percentage points." Their next two attempts, also based on word-vectors, similarly fail to noticeably outperform the earlier bag-of-words approach.

Of course, this is just a binary sentiment-analysis task; for other tasks, the tradeoffs might be different. (For example, the Mikolov et al. subsampling-of-frequent-words option seems to improve the quality of word-vectors. But in my experiments on the IMDB sentiment task, using doc-vectors from Doc2Vec, *any* subsampling seems to weaken the power of the resulting doc-vectors to predict sentiment via logistic regression. Perhaps this is because the small, frequent words turn out to be helpful for sentiment, even though they might be noise on some other multi-topic/doc-similarity/recommendation task.)

- Gordon