Using gensim word2vec model in TensorFlow CNN?


Cameron English

Apr 26, 2017, 1:32:14 AM4/26/17
to gensim
Hi everyone,

I'm working on a CNN to detect repeated questions in a Quora-type setting, and the gensim word2vec implementation seemed perfect for creating word embeddings from my training set. However, I'm having a devil of a time actually feeding my word embeddings into my CNN (which wants the word embedding vector for each word in a given question).

Using KeyedVectors.load_word2vec_format('file', binary=True, unicode_errors='ignore') gives a UnicodeDecodeError, and I'm unsure how else to feed in the proper embeddings in a memory-feasible manner. Do any of you lovely people have any ideas? I'm sorry that this is rather basic, but I've been stuck for an embarrassing amount of time now. Thanks in advance!
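For context, the per-question input the CNN wants is just one embedding row stacked per word. A minimal numpy sketch of that lookup, with a toy word-to-vector dict standing in for the trained gensim model (the vocabulary and `question_matrix` helper here are hypothetical, for illustration only):

```python
import numpy as np

EMBEDDING_DIM = 4  # toy size; real word2vec models are typically 100-300 dimensions

# toy stand-in for a trained model's word -> vector mapping
vectors = {
    "how": np.arange(4, dtype=np.float32),
    "do": np.ones(4, dtype=np.float32),
    "i": np.zeros(4, dtype=np.float32),
}

def question_matrix(tokens, vectors, dim):
    """Stack one embedding row per token; out-of-vocabulary words become zero rows."""
    return np.stack([vectors.get(t, np.zeros(dim, dtype=np.float32)) for t in tokens])

m = question_matrix(["how", "do", "i", "fly"], vectors, EMBEDDING_DIM)
print(m.shape)  # one row per token: (4, 4)
```

The replies below build essentially this matrix for the whole vocabulary at once, so the network can do the lookup by word index instead.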

Radim Řehůřek

Apr 26, 2017, 1:59:33 AM4/26/17
to gensim, Lev Konstantinovskiy
No worries, we're happy to help. Lev, can you send the link to that "tensorflow integration" notebook, or usage examples?

And let's make them more prominent in the docs, since people have trouble finding them -- I tried and failed myself.

The docs need better structuring overall -- there's too much stuff, in multiple locations, and no clear and concise "table of contents / cookbook" beyond http://radimrehurek.com/gensim/tutorial.html.

Cheers,
Radim

Lev Konstantinovskiy

Apr 26, 2017, 12:31:01 PM4/26/17
to gensim
Hi Cameron,

Here is a way to put a gensim word2vec model into a Keras convnet using the great shorttext package. There is no such example for TF; there is only work in the opposite direction, loading a TF-trained word2vec model into gensim.

Hope it helps your Kaggle :)
Lev

Shiva Manne

May 6, 2017, 7:36:21 PM5/6/17
to gensim
Hey Cameron,

Guess you've probably figured it out by now, but just in case you didn't, here you go:

--------------------------------------------------------------
import gensim
import numpy as np
import tensorflow as tf

model = gensim.models.Word2Vec.load(pathToTrainedModel)

# grab the vocabulary size before the model is deleted below
vocab_size = len(model.wv.vocab) + 1  # +1 leaves an all-zero row (e.g. for padding)

# store the embeddings in a numpy array
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for i in range(len(model.wv.vocab)):
    embedding_matrix[i] = model.wv[model.wv.index2word[i]]

# free memory -- the matrix is all we need from here on
del model

# memory-efficient way to load the embeddings in tf (avoids several copies of the embeddings)
embedding_weights = tf.Variable(tf.constant(0.0, shape=[vocab_size, EMBEDDING_DIM]),
                                trainable=False,  # frozen, so training does not update the embeddings
                                name="embedding_weights")

embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, EMBEDDING_DIM])
embedding_init = embedding_weights.assign(embedding_placeholder)

with tf.Session() as sess:
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})
-----------------------------------------------------------------------------------
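In the graph above, the frozen embedding_weights variable would then typically feed the convnet via tf.nn.embedding_lookup, which is just row indexing. A numpy sketch of the equivalent operation (toy sizes, purely illustrative):

```python
import numpy as np

# toy embedding matrix: 5 words, 3 dimensions; row 4 is the all-zero padding row
embedding_matrix = np.arange(15, dtype=np.float32).reshape(5, 3)
embedding_matrix[4] = 0.0

# a "question" as word indices, padded to length 4 with the padding index
word_ids = np.array([2, 0, 3, 4])

# tf.nn.embedding_lookup(embedding_weights, word_ids) selects rows exactly like this
looked_up = embedding_matrix[word_ids]
print(looked_up.shape)  # (4, 3): one embedding row per word id
```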

Regards,
Shiva.

Ajay Babu

Sep 6, 2017, 3:23:01 AM9/6/17
to gensim
Hi, I tried this method and I'm getting the error below; a reply would be appreciated.

ValueError: could not broadcast input array from shape (300) into shape (128)
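That ValueError usually means EMBEDDING_DIM was set to 128 while the trained model's vectors are 300-dimensional, so the assignment into the pre-sized matrix fails. A numpy sketch reproducing the mismatch (the sizes here are taken from the error message; everything else is illustrative):

```python
import numpy as np

EMBEDDING_DIM = 128                      # what the embedding matrix was sized for
vector = np.ones(300, dtype=np.float32)  # what the trained model actually returns

embedding_matrix = np.zeros((10, EMBEDDING_DIM))
err = None
try:
    embedding_matrix[0] = vector  # shapes disagree, so numpy cannot broadcast
except ValueError as exc:
    err = str(exc)
print(err)  # "could not broadcast input array from shape ..."
```

Setting EMBEDDING_DIM to the model's actual vector size (here 300) makes the assignment a plain row copy, and the error goes away.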