How do I subtract and add vectors with gensim KeyedVectors?


Simon Thormeyer

Jan 7, 2021, 9:25:29 AM
to Gensim

Hello everyone,

I also asked this question on Stack Overflow, but I don't know how active the gensim community is there, so I'm happy to have a second place to ask:

I need to add and subtract word vectors for a project in which I use gensim.models.KeyedVectors (from the word2vec-google-news-300 model).

Unfortunately, I've tried but can't manage to do it correctly.  

Let's look at the popular example queen ~= king - man + woman.
When I want to subtract *man* from *king* and add *woman*,
I can do this by

# model is loaded using gensim.models.KeyedVectors.load()
model.wv.most_similar(positive=["king", "woman"], negative=["man"])[0]


which, as expected, returns ('queen', 0.7118192911148071) for the model I use.  

Now, to achieve the same with adding and subtracting vectors, I've tried the following code:

vec1, vec2, vec3 = model.wv["king"], model.wv["man"], model.wv["woman"]
result = model.similar_by_vector(vec1 - vec2 + vec3)[0]



result in the code above is ('king', 0.7992597222328186), which is not what I'd expect.

What is my mistake?

Thank you!

Simon

Gordon Mohr

Jan 7, 2021, 1:07:50 PM
to Gensim
I've also posted this on your SO question at <https://stackoverflow.com/a/65617717/130288>, but:

You're generally doing the right thing, but note:

* the `most_similar()` method disqualifies from its results any of the named words provided - so even if `'king'` is (still) the closest word to the result, it will be ignored. Your formulation might very well have `'queen'` as the next-closest word, after ignoring the input words - which is all that the 'analogy' tests need.

* the `most_similar()` method also does its vector-arithmetic on versions of the vectors that are *normalized to unit length*, which can result in slightly different answers. If you change your uses of `model.wv['king']` to `model.get_vector('king', norm=True)`, you'll get the unit-normed vectors instead. 
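
To make the two points above concrete, here is a minimal pure-Python sketch. It is not gensim's code: the 2-D vectors are made up for illustration, and the helper names (`unit`, `cosine`, `analogy_target`, `nearest`) are hypothetical. It only mimics the two behaviours described, namely skipping the input words and optionally unit-normalizing before the arithmetic.

```python
import math

# Made-up 2-D toy vectors, purely illustrative (not from any real model).
VECTORS = {
    "king": (4.0, 2.0),
    "queen": (4.0, -2.0),
    "man": (1.0, 1.0),
    "woman": (1.0, 0.0),
    "apple": (-1.0, 3.0),
}


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy_target(positive, negative, normalize):
    """Sum the positive vectors and subtract the negative ones.

    With normalize=True every input is unit-normed first, which is the
    behaviour described above for most_similar().
    """
    total = [0.0, 0.0]
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            vec = unit(VECTORS[word]) if normalize else VECTORS[word]
            total = [t + sign * x for t, x in zip(total, vec)]
    return tuple(total)


def nearest(target, exclude=()):
    """Rank words by cosine similarity to target, skipping `exclude`."""
    scored = [(w, cosine(target, v)) for w, v in VECTORS.items()
              if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


inputs = ["king", "woman", "man"]

# Raw arithmetic, as in the question: king - man + woman.
raw_target = analogy_target(["king", "woman"], ["man"], normalize=False)
top_raw = nearest(raw_target)[0][0]
top_raw_filtered = nearest(raw_target, exclude=inputs)[0][0]

# Unit-normed arithmetic plus input filtering, as most_similar() does.
normed_target = analogy_target(["king", "woman"], ["man"], normalize=True)
top_normed_filtered = nearest(normed_target, exclude=inputs)[0][0]

print(top_raw, top_raw_filtered, top_normed_filtered)
```

With these toy numbers the unfiltered raw result is `'king'`, while both filtered variants give `'queen'`. With a real `KeyedVectors` you could get the same effect by asking `similar_by_vector()` for a few more results via `topn` and skipping the input words yourself.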

See also similar earlier answer: https://stackoverflow.com/a/65065084/130288

- Gordon