I want to represent a small phrase (3 to 4 words) as a single vector, either by adding the individual word embeddings or by averaging them.
I do a simple SUM, no weights involved (or, if you prefer, each word has weight 1). The average vector is obtained like the sum, except that at the end I divide each entry of the vector by the number of summed words. I'm using cosine similarity as defined here: http://en.wikipedia.org/wiki/Cosine_similarity
In the experiments I've run I always get the same cosine similarity for both, so either I have a BUG in the code or I'm missing something. The word2vec model was generated with Mikolov's tool, and I'm using gensim to read the model and compute the sum or average.
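For reference, here is a sketch of the computation with toy NumPy vectors in place of the real word2vec embeddings (the random vectors and dimension are just placeholders, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder embeddings standing in for the real word2vec vectors
words1 = [rng.standard_normal(100) for _ in range(3)]  # 'founder', 'and', 'ceo'
words2 = [rng.standard_normal(100) for _ in range(4)]  # 'co-founder', 'and', 'former', 'chairman'

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Phrase vector as a plain unweighted sum of word vectors
sum1, sum2 = np.sum(words1, axis=0), np.sum(words2, axis=0)
# Phrase vector as the average: the sum divided by the word count
avg1, avg2 = np.mean(words1, axis=0), np.mean(words2, axis=0)

print(cosine(sum1, sum2))
print(cosine(avg1, avg2))  # same value: the average only rescales the sum
```

Note that the average differs from the sum only by a scalar factor (1/n per phrase), which cancels in the cosine ratio.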
vec1 ['founder', 'and', 'ceo']
vec2 ['co-founder', 'and', 'former', 'chairman']
SUM
dot(vec1, vec2)          5.4008677771
norm(vec1)               2.19382594282
norm(vec2)               2.87226958166
norm(vec1) * norm(vec2)  6.30125952303
cosine(vec1, vec2)       0.857109242583

AVG
dot(vec1, vec2)          0.450072314758
norm(vec1)               0.731275314273
norm(vec2)               0.718067395416
norm(vec1) * norm(vec2)  0.525104960252
cosine(vec1, vec2)       0.857109242583
A more detailed example and the code are available here:
http://stackoverflow.com/questions/30142345/word2vec-sum-or-average-word-embeddings
--
You received this message because you are subscribed to the Google Groups "gensim" group.