How to initialize new models with pre-trained model weights?


Nouman Dilawar

unread,
Mar 14, 2016, 11:14:39 AM3/14/16
to gensim
I am using the gensim library in Python to train a word2vec model. I am trying to initialize my model's weights from some pre-trained word2vec model, such as the pretrained GoogleNews model. I have been struggling with this for a couple of weeks. Now I have found that gensim has the following function, which can initialize a model from another model. But is it applicable here?
 
reset_from(other_model)

Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.




I don't know whether this function can do the same thing or not. Please help!

Gordon Mohr

unread,
Mar 15, 2016, 5:58:25 PM3/15/16
to gensim
The `reset_from()` method takes a model that's already done a vocabulary-discovery scan (`build_vocab()`) over a corpus, and reuses references to its sharable data structures. It saves a little memory if you are training multiple models on the exact same corpus. It's not really applicable if you're loading pre-trained vectors from elsewhere.
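The sharing idea can be sketched in plain Python. This is a toy illustration, not gensim's actual internals; `ToyModel` and its methods are made up for this example:

```python
# Toy illustration of "borrow shareable pre-built structures": two model
# objects point at the *same* vocab dict, so the expensive vocabulary scan
# happens once and the memory is shared rather than copied.

class ToyModel:
    def __init__(self):
        self.vocab = None    # filled by an (expensive) corpus scan
        self.weights = {}    # per-model state, NOT shared

    def build_vocab(self, corpus):
        self.vocab = {}
        for sentence in corpus:
            for word in sentence:
                self.vocab[word] = self.vocab.get(word, 0) + 1

    def reset_from(self, other):
        self.vocab = other.vocab    # a reference, not a copy

corpus = [["fast", "car"], ["fast", "train"]]
a = ToyModel()
a.build_vocab(corpus)

b = ToyModel()
b.reset_from(a)    # b skips the scan and shares a's vocab

assert b.vocab is a.vocab      # literally the same object in memory
assert a.vocab["fast"] == 2
```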

The `load_word2vec_format()` method will load vectors from the Google word2vec.c export format(s).

However, such models are essentially read-only: you can look-up and compare word vectors, but not continue training. (The word2vec.c export format doesn't export the whole trained model.)
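The word2vec.c *text* export is just a header line plus one vector per line, which makes the limitation concrete. A minimal pure-Python reader (not gensim's implementation) recovers the final vectors and nothing else:

```python
import io

# A tiny file in the word2vec.c *text* export format: the first line is
# "<vocab_size> <dimensions>", then one "<word> <float> <float> ..." per line.
raw = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")

vocab_size, dims = (int(x) for x in raw.readline().split())
vectors = {}
for line in raw:
    word, *values = line.split()
    vectors[word] = [float(v) for v in values]

assert len(vectors) == vocab_size
assert len(vectors["king"]) == dims
# Only the final per-word vectors are here: no hidden-layer weights, no
# frequency counts, no other training state, so training cannot be resumed.
```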

Another option is the `intersect_word2vec_format()` method:


This method assumes your model has already discovered its own vocabulary (via `build_vocab()` over a corpus), but then scans a Google word2vec.c-format file, and for the words that already exist in the local model (the intersection of the two word sets), it loads the vector from the file instead. Further, it freezes that vector against further training (using an experimental feature of gensim Word2Vec in the `syn0_lockf` array). So it roughly sets you up for a situation where you'd want to use exactly the pretrained vectors, where they exist, but then (through your own training) bootstrap vectors for new words that were in your local scan but not the pretrained file. 
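The freezing idea can be sketched with numpy. This is a simplified toy update rule, not gensim's actual optimized training loop; only the per-word lock-factor scaling mirrors what the `syn0_lockf` array does:

```python
import numpy as np

# Simplified SGD-style update with per-word lock factors: a factor of 0.0
# freezes a vector (e.g. one imported from a pretrained file), while 1.0
# leaves it fully trainable.
vectors = np.array([[1.0, 0.0],     # word 0: imported pretrained vector
                    [0.2, 0.3]])    # word 1: locally initialized vector
lockf = np.array([0.0, 1.0])        # freeze word 0, train word 1

gradient = np.array([[0.5, 0.5],    # pretend gradients from one training step
                     [0.5, 0.5]])
alpha = 0.1                         # learning rate

# Each word's update is scaled by its lock factor before being applied.
vectors += alpha * lockf[:, np.newaxis] * gradient

assert np.allclose(vectors[0], [1.0, 0.0])    # frozen: unchanged
assert np.allclose(vectors[1], [0.25, 0.35])  # trainable: moved
```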

Are the results of this approach any good? I'm not sure; my limited experiments pretty much just made sure the code ran, and I've not seen anyone else write up significant experiments. In general it seems to me that people are way too eager to try to reuse Google's vectors, compared to understanding the vectors they can get from their own data.

- Gordon

Nouman Dilawar

unread,
Mar 16, 2016, 12:27:38 AM3/16/16
to gen...@googlegroups.com

Thank you sir, that is a somewhat relevant answer. So I will load a word's vector from the pretrained model if it exists; otherwise it will come from training a new model. So in this approach (`intersect_word2vec_format`) I will be using two models at the same time, right? I am surely going to give it a try.

I just want to ask one more question: while using pre-trained models, I am getting a lot of missing words, because I am dealing with a review dataset and a lot of words don't follow English standards and some are not in the dictionaries. Can you please guide me on how to deal with such words? Examples: (freeeeeeeeeeeezzzzing, 2moro, hmmmmm, etc.) and (food names: samosa, Dosa). How can I incorporate these words?


Gordon Mohr

unread,
Mar 16, 2016, 8:34:26 PM3/16/16
to gensim


On Tuesday, March 15, 2016 at 9:27:38 PM UTC-7, Nouman Dilawar wrote:

> Thank you sir, that is a somewhat relevant answer. So I will load a word's vector from the pretrained model if it exists; otherwise it will come from training a new model. So in this approach (`intersect_word2vec_format`) I will be using two models at the same time, right? I am surely going to give it a try.


No, just the one model. It discovered its own vocabulary (via `build_vocab()`), then you call `intersect_word2vec_format()` to merge in some of the vectors from a prior vector set. (Maybe that vector set is from an earlier trained model, but that file format doesn't store full models, just final vectors.)
 

> I just want to ask one more question: while using pre-trained models, I am getting a lot of missing words, because I am dealing with a review dataset and a lot of words don't follow English standards and some are not in the dictionaries. Can you please guide me on how to deal with such words? Examples: (freeeeeeeeeeeezzzzing, 2moro, hmmmmm, etc.) and (food names: samosa, Dosa). How can I incorporate these words?


Most common is to ignore them, or train your own vectors from your own source material. If you have enough of your own material, using words in the senses relevant to your problem domain, trying to re-use pre-trained vectors from another domain may add more complexity than is necessary. (Are word senses from news stories – as with the example Google vectors – relevant to other domains? Maybe not.)
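The "ignore them" option is just a vocabulary filter. A toy sketch with a made-up vocabulary set (with a real gensim model, `word in model` works the same way):

```python
# Toy stand-in for a pretrained model's vocabulary (a made-up set; with a
# real gensim model you would test `word in model` just like this).
known = {"the", "food", "was", "freezing", "great"}

tokens = ["the", "food", "was", "freeeeeeeeeeeezzzzing", "2moro", "great"]

# "Ignore them": keep only tokens the model can actually look up.
usable = [t for t in tokens if t in known]
dropped = [t for t in tokens if t not in known]

assert usable == ["the", "food", "was", "great"]
assert dropped == ["freeeeeeeeeeeezzzzing", "2moro"]
```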

- Gordon

Nouman Dilawar

unread,
Mar 17, 2016, 2:57:05 AM3/17/16
to gen...@googlegroups.com

Thank you Sir Gordon, one last question please :) ... Is there anywhere I can get the pretrained layer weights of some other model, and initialize a new model with those weights instead of new random weights? And then I may start training from there. Is it possible in gensim or any other way?

Gordon Mohr

unread,
Mar 17, 2016, 4:47:30 AM3/17/16
to gensim
I don't know of anyone distributing full pre-trained models. I'm also not sure it'd offer any benefit: if you have enough data to be doing any extra training, you might be fine using just that data.

I mentioned some options for continuing training of models in another thread (https://groups.google.com/d/msg/gensim/Z9fr0B88X0w/M00f-ASyHAAJ), but also as I note there, I don't know established precedents or even rules-of-thumb for balancing the influence of the prior model and your new training. All new training is diluting away the influence of the original data/training, and there's at least a plausible case that if you've done enough training on the new data to move the model to a new optimal equilibrium (the usual approximate goal of the inner SGD), you'll be left with negligible influence from the starting state. (So, you may have been just as well off starting from a random state.)

If your real goal is to extend your vocabulary with words that you don't think are suitably represented in your data, a better approach might be to learn a projection of some prior vectors into your own space (or vice versa). There's no code yet in gensim for this, but it is a wishlist item on the project wiki (and probably not too hard, especially given the example implementation linked from there).
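The projection idea can be sketched with ordinary least squares over words known in both spaces. This is a toy numpy example with synthetic vectors, not gensim code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 anchor words known in BOTH vector spaces, 3 dimensions each.
prior = rng.normal(size=(5, 3))     # the anchor words in the pretrained space
true_map = rng.normal(size=(3, 3))  # hidden ground-truth relation (toy only)
local = prior @ true_map            # the same words in your own space

# Learn W minimizing ||prior @ W - local|| over the shared anchor words.
W, *_ = np.linalg.lstsq(prior, local, rcond=None)

# Project a prior-only word (one your corpus never saw) into your space.
new_word_prior = rng.normal(size=(1, 3))
projected = new_word_prior @ W

assert np.allclose(W, true_map)     # exact here: the toy data is noise-free
assert projected.shape == (1, 3)
```

With real vectors the fit would be noisy, so the projection is an approximation whose quality depends on how many shared anchor words you have.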


Finally, while gensim wants the full model state to be in place to be trainable, and the loading of just-vectors (as if by `load_word2vec_format()`) doesn't properly create a model in which training can continue, you could try manually patching around that – modify the model to fill in the parts it's missing – and it might mostly work. At one point I had a bug where the inner-layer weights were inadvertently being zeroed after every training pass over a corpus. (The neural net was getting a partial lobotomy each epoch: 100% dropout!) Amazingly, the vectors still kept getting better each pass, just more slowly. It was as if the inner layer could be recovered, to a useful state consistent with the vectors so far, over the progress of any single pass. 

- Gordon

Nouman Dilawar

unread,
Mar 19, 2016, 8:21:29 PM3/19/16
to gen...@googlegroups.com

Thank you very much Gordon sir for your time and such wonderful answers.
