KeyedVectors get_keras_embedding and word indexing

Alex Zhicharevich

May 2, 2019, 5:03:39 PM
to Gensim
The get_keras_embedding function looks great, since it saves a lot of boilerplate when loading word vector models from Gensim into Keras. However, Keras expects the input to the embedding layer to be a sequence of word indices, and I could not find an elegant way to translate words to their indices in the embedding matrix other than looping over the tokens and encoding each one with wv.vocab[word].index (as in the answer to this SO question: https://stackoverflow.com/questions/51492778/how-to-properly-use-get-keras-embedding-in-gensim-s-word2vec). That solution seems to miss the point of the get_keras_embedding method, since it just replaces one piece of boilerplate with another. I also looked at the Keras wrapper example at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/keras_wrapper.ipynb and could not understand how the indices coming out of the Keras tokenizer align with the correct embeddings in the classification examples. Am I missing something very basic here?
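
Concretely, the workaround from that SO answer looks roughly like this (a minimal sketch, assuming gensim 3.x with Keras installed; the toy corpus and parameters are made up):

```python
from gensim.models import Word2Vec

sentences = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system", "response", "time"]]
model = Word2Vec(sentences, size=50, min_count=1)

# Embedding layer whose weight-matrix rows follow gensim's internal word order.
embedding_layer = model.wv.get_keras_embedding(train_embeddings=False)

def encode(tokens, wv):
    # Map tokens to their gensim vocabulary indices, dropping OOV words,
    # so the resulting sequence indexes the same matrix as the layer above.
    return [wv.vocab[w].index for w in tokens if w in wv.vocab]

print(encode(["computer", "system", "unknown"], model.wv))
```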

Alex Zhicharevich

May 8, 2019, 7:38:23 AM
to Gensim
Hi,

To dive deeper into the above issue, I ran the code from the notebook at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/keras_wrapper.ipynb. In the 20_newsgroups example I checked whether the indices produced by the Keras tokenizer correspond to those in w2v.wv.vocab. From what I can see, there is no alignment, for example:

(tokenizer.word_index['good'], Keras_w2v_wv.vocab['good'].index) returns (76, 54)
(tokenizer.word_index['system'], Keras_w2v_wv.vocab['system'].index) returns (145, 109)
(tokenizer.word_index['high'], Keras_w2v_wv.vocab['high'].index) returns (34, 115)

As far as I understand, the example does train a word2vec model, but the Keras model ends up with effectively random initial word vectors because the indices don't match. A possible solution would be for the KeyedVectors object to provide a method that returns an object able to translate words to their corresponding embedding indices, such as an already-fitted Keras tokenizer or a Gensim dictionary, or alternatively a method that encodes text to indices directly.
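
In the meantime, one possible workaround is to build the weight matrix in the tokenizer's index order instead of relying on get_keras_embedding(), so the Keras indices and the matrix rows agree. A sketch (the helper name embedding_for_tokenizer is hypothetical; it assumes a fitted keras.preprocessing.text.Tokenizer and a trained KeyedVectors):

```python
import numpy as np
from keras.layers import Embedding

def embedding_for_tokenizer(tokenizer, wv, trainable=False):
    # Row 0 is reserved for padding; row i holds the vector of the word that
    # the tokenizer maps to index i (zeros for words missing from the w2v vocab).
    vocab_size = len(tokenizer.word_index) + 1
    weights = np.zeros((vocab_size, wv.vector_size))
    for word, idx in tokenizer.word_index.items():
        if word in wv.vocab:
            weights[idx] = wv[word]
    return Embedding(vocab_size, wv.vector_size,
                     weights=[weights], trainable=trainable)
```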

Radim Řehůřek

May 8, 2019, 12:20:07 PM
to Gensim
Hi Alex,

I'm afraid the person who contributed this functionality is not around any more, and I never used it myself. So I cannot comment much.

If you can think of a way to make the integration with Keras easier (or fix bugs), that'd be great. We're definitely open to that. Especially if you could also include integration tests, to avoid similar issues in the future.

Cheers,
Radim