Extracting word and subword embeddings from FastText native in gensim for Keras


Crl Srf

Dec 5, 2017, 7:46:02 PM
to gensim
Hi all,

After training a FastText model in gensim (the native implementation, not the wrapper), I want to use the embeddings as the first layer of a deep neural network in Keras.

Basically:
model.wv.syn0 holds the embeddings for the vocabulary words.
model.wv.syn0_ngrams holds the embeddings for the character n-grams.

It looks like we have two different dictionaries: one that maps the vocabulary words to integers, and one that maps the character n-grams to integers.

In Keras, the input layer should correspond to both the vocabulary words AND the character n-grams (subwords), right? 
So is it my responsibility to create a merged dictionary that would uniquely map words and character n-grams to integers, and also combine the embeddings from model.wv.syn0 and model.wv.syn0_ngrams?
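
For context, here is roughly what I am looking at (a minimal sketch, assuming gensim 3.x; `sentences` is a placeholder for my tokenized corpus):

```python
from gensim.models import FastText

# `sentences` is an iterable of tokenized sentences (placeholder here).
model = FastText(sentences, size=100, min_count=5)

print(model.wv.syn0.shape)         # (vocab size, 100): vectors for in-vocabulary words
print(model.wv.syn0_ngrams.shape)  # (n-gram buckets, 100): vectors for character n-grams

# Dictionary 1: vocabulary word -> integer index into syn0
word_index = {word: voc.index for word, voc in model.wv.vocab.items()}
# (the n-gram -> index mapping lives in a separate attribute)
```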

Thanks! 

Shiva Manne

Dec 6, 2017, 3:30:58 PM
to gensim
Hi Crl,

You do not need `model.wv.syn0_ngrams` for the first embedding layer. These are only useful for inferring/constructing vectors for out-of-vocabulary words. So there's no point in combining embeddings from `model.wv.syn0` and `model.wv.syn0_ngrams` for your Keras embedding layer.

If the text you are training your Keras model on has no unseen words -- that is, no words beyond those you trained the `FastText` model on -- you can use the first dictionary (the vocab-to-int mapping) directly with `model.wv.syn0`.
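
For example, something along these lines should work (a rough sketch, assuming gensim 3.x and Keras 2.x, with `model` being your trained `FastText` model):

```python
from keras.layers import Embedding

vocab_size, embedding_dim = model.wv.syn0.shape

# The vocab-to-int mapping used to convert your Keras input text to index sequences.
word_index = {word: voc.index for word, voc in model.wv.vocab.items()}

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[model.wv.syn0],  # initialise directly with the FastText word vectors
    trainable=False,          # or True if you want to fine-tune them
)
```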

There's also a chance that your training text (for the Keras model) contains words that are not present in the corpus used to train `FastText`. One possible solution, in this case, is to construct your own dictionary from the Keras training text and get vectors for all of these words. These vectors can then be saved and loaded into the Keras embedding layer.
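
A rough sketch of that approach (assuming gensim 3.x; `keras_vocab` is a hypothetical {word: index} dictionary you built from the Keras training text, with index 0 reserved for padding):

```python
import numpy as np
from keras.layers import Embedding

embedding_dim = model.wv.syn0.shape[1]
embedding_matrix = np.zeros((len(keras_vocab) + 1, embedding_dim))

for word, idx in keras_vocab.items():
    try:
        # For out-of-vocabulary words, model.wv[word] composes a vector
        # from the word's character n-grams (this is where syn0_ngrams is used).
        embedding_matrix[idx] = model.wv[word]
    except KeyError:
        pass  # none of the word's n-grams are known; leave the row as zeros

embedding_layer = Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_dim,
    weights=[embedding_matrix],
    trainable=False,
)
```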

Regards,
Shiva.

Crl Srf

Dec 6, 2017, 4:29:53 PM
to gensim
Hi Shiva, 
Thanks a lot for your reply! Indeed, in my case I want to first train FastText on a big corpus to get the embeddings, and then train in Keras on a different dataset that will probably have a large number of unseen words.

I have a follow-up question about that. During the training of FastText, what is the structure of the input layer of the neural network (for skip-gram or CBOW)? Do the nodes correspond only to the words, like in word2vec? Or are both words and subwords inputs in the input layer? My understanding was that I would have the same input layer + embeddings structure in the subsequent Keras model.

Thanks again!

Crl Srf

Dec 7, 2017, 12:51:21 AM
to gensim
By the way, building a new dictionary won't be suitable for scoring new text after training a supervised neural network... What if the words we encounter in the test set are not the same as in the training set? Hence the need to have all the words and subwords in the input layer...

lpm

Dec 15, 2018, 4:45:01 PM
to gensim
What did you end up doing to make it work?