word2vec Incoherent Vocab and Syn0


Debora Nozza

Mar 24, 2017, 9:07:22 AM
to gensim
Hi!

I have a list of words and their embeddings as a numpy array and I want to import them as a gensim Word2Vec model.

Since the gensim model embeddings can be easily updated, I wrote model.wv.syn0 = my_embeddings.

However, it is not as easy to update the vocabulary keys with a new list of strings while keeping each word paired with its embedding. First, I tried to change the values of the vocabulary items, but I did not succeed. Then I decided to give the list of words as input to the model and follow the order that gensim used. (I also cannot understand what kind of sorting it uses: "Leap years" < "four" < "aria", given that each word appears only once.)
So I updated the embeddings following the order of model.wv.vocab, only to discover that model.wv.syn0 has a different order than model.wv.vocab.

I would expect the following condition to be True for all i, but it is not!
np.array_equal(model.wv[list(model.wv.vocab.keys())[i]], model.wv.syn0[i])

1. Is there an easier way to import a list of words and a numpy array into a gensim Word2Vec model?
2. Why are the orders of syn0 and vocab different, and how can I deal with it?

Thank you,
Debora

Gordon Mohr

Mar 26, 2017, 6:38:11 PM
to gensim
Note that a Python dict (like the `vocab` dictionary here) doesn't necessarily keep its keys in any particular order, such as sorted or by-original order-of-insertion. Further, the values in the `vocab` dictionary are `Vocab` objects, whose `index` property points to which array-index-position in `syn0` holds the corresponding word vector. 

So if for instance you had an existing model which already has 'four' in its vocabulary (perhaps because a `build_vocab()` on suitable text has been run), the existing `four` vector (perhaps just randomly initialized) is specifically at:

    model.wv.syn0[model.wv.vocab['four'].index]

If you have an alternate vector for `four`, that's where it would need to be assigned. 

The property `model.wv.index2word` also lists all the known words, in `syn0` slot order. 
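
To make the slot lookup concrete, here is a minimal sketch that does not import gensim at all. `VocabEntry`, the toy `vocab`/`index2word`/`syn0` objects, and the sample words and vectors are all hypothetical stand-ins that only mirror the layout described above; with a real model the same assignment would go through `model.wv.vocab[word].index`.

```python
import numpy as np


class VocabEntry:
    """Hypothetical stand-in for gensim's Vocab object; only `index` is modelled."""

    def __init__(self, index):
        self.index = index


# Toy stand-ins for model.wv.index2word, model.wv.vocab and model.wv.syn0.
index2word = ["four", "aria", "Leap years"]
vocab = {word: VocabEntry(i) for i, word in enumerate(index2word)}
syn0 = np.zeros((len(index2word), 3), dtype=np.float32)

# External data, deliberately in a different order from the slots above.
my_words = ["Leap years", "four", "aria"]
my_embeddings = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]], dtype=np.float32)

# Write each external vector into the slot named by that word's index.
for word, vector in zip(my_words, my_embeddings):
    syn0[vocab[word].index] = vector

# The check that holds: go through .index, never through dict key order.
aligned = all(
    np.array_equal(syn0[vocab[word].index], my_embeddings[pos])
    for pos, word in enumerate(my_words)
)

# index2word and vocab[...].index describe the same slot numbering.
slots_match = all(vocab[index2word[i]].index == i for i in range(len(index2word)))
```

The comparison in the original question fails because it pairs the i-th dict key with the i-th row of `syn0`; pairing each word with `vocab[word].index` (or the i-th row with `index2word[i]`) is the comparison that should always hold.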

More generally, if you are trying to import values from elsewhere, I would suggest using existing source code that does similar things as a model. For example, the existing `KeyedVectors.load_word2vec_format()` method builds a usable `KeyedVectors` instance (including syn0/vocab/index2word etc.) from a file on disk, so it does many of the steps that an import from any other data source would also need to do.
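
One possible way to lean on that loader, rather than reimplementing it, is to write the word list and numpy array out in the plain-text word2vec format and let gensim read the file back. This is only a sketch under assumptions: `save_word2vec_text` is a hypothetical helper, not part of gensim, and replacing spaces with underscores is just one convention for keeping multi-word entries such as "Leap years" from breaking the space-delimited format.

```python
import os
import tempfile

import numpy as np


def save_word2vec_text(path, words, vectors):
    """Write words and vectors in the plain-text word2vec format (hypothetical helper)."""
    vectors = np.asarray(vectors, dtype=np.float32)
    if vectors.ndim != 2 or len(words) != vectors.shape[0]:
        raise ValueError("need a 2-D array with exactly one row per word")
    with open(path, "w", encoding="utf-8") as out:
        # Header line: vocabulary size, then vector dimensionality.
        out.write("%d %d\n" % vectors.shape)
        for word, vector in zip(words, vectors):
            # Fields are space-delimited, so spaces inside a token are replaced
            # (an assumption made for this sketch; any convention would do).
            token = word.replace(" ", "_")
            values = " ".join(repr(float(x)) for x in vector)
            out.write(token + " " + values + "\n")


# Sample data, invented purely for illustration.
words = ["Leap years", "four", "aria"]
embeddings = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "my_vectors.txt")
save_word2vec_text(path, words, embeddings)

# Read the file back to inspect what was written.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

The resulting file should then be loadable with `KeyedVectors.load_word2vec_format(path, binary=False)`, which fills in `syn0`, `vocab` and `index2word` together so that the three stay consistent with each other.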


If you want a full Word2Vec model that's trainable, you'd want to be sure that any modifications you make still maintain the same sort of ending-state as is usually achieved by a `build_vocab()` model-prep over training text. 

- Gordon