Updating embeddings of an existing model with new data that should contribute more strongly


Surender Nandal

Mar 12, 2016, 10:20:20 AM
to gensim
Hello everyone,

I have an existing trained word2vec model. Now I have a new dataset which is domain-specific.

I want to continue training my existing model on this new dataset, and I want these new sentences to contribute more strongly than the previously learned sentences.

Is it possible to do this? If yes, how can I achieve it using gensim?

Thanks & Regards
Surender Kumar

Gordon Mohr

Mar 15, 2016, 5:19:57 PM
to gensim
The simple answer is:

You can `save()` a full model, and later `load()` it to continue to `train()`, including with different data. (Though, when calling `train()` with different data than was initially provided to `build_vocab()`, you should also specify the `total_examples` or `total_words` parameters to `train()`, so that progress-readouts and alpha-decay are estimated properly. Also, any words that weren't found during the one initial `build_vocab()` scan will be ignored as unknown.)

The influence of examples presented during any `train()` is largely determined by the `alpha`/`min_alpha` values that are used during the training sessions, and in general examples presented last tend to have more influence. So vaguely, just naively trying the steps you've described will kind-of achieve your stated goals.

But, it may not give you good results. 

Everyone seems to want to do this, but I don't know of any published work that establishes theoretical or practical guidelines for balancing the influence of the original training data and the update. If you were to train on the new data until the model reaches equilibrium (optimal weights for the current data) – which is roughly what the choice of iterations/alpha is intended to do – most of the influence of the original training data may have been diluted away. (Words that were meaningfully close after the early data may have drifted far apart, perhaps just because some of them didn't appear in the current data at all.)

My hunch is that the best approach in such cases would be to create a combined dataset of all examples. If there are some examples you want to have more weight, repeat them (and tweak the number of repetitions to achieve your goals). Train on the combined set, and when you're done you'll have vectors for all words across all examples, and because they were trained in an interleaved fashion, they will be meaningfully comparable. (None will have drifted away from the others because of later training on a subset.)
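
The combine-and-repeat idea needs no special API at all; a minimal sketch (the corpora and the `REPEAT` factor are made up, and the right factor is something you'd have to tune empirically):

```python
import random

old_corpus = [["human", "interface", "computer"],
              ["survey", "user", "computer", "system"]]
new_corpus = [["system", "response", "time"]]

REPEAT = 3  # tune this to give the new sentences more weight

# Repeat the new sentences, then shuffle so old and new examples are
# interleaved rather than one subset dominating the end of training.
combined = old_corpus + new_corpus * REPEAT
random.shuffle(combined)

print(len(combined))  # 5 sentences: 2 old + 3 copies of the new one
```

The shuffle matters: since alpha decays over training and later examples tend to have more influence, interleaving keeps neither subset concentrated at the tail, so the final vectors stay meaningfully comparable across both datasets.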

If for some reason you can't do that, another choice would be to use an experimental feature in the gensim Word2Vec model that can lock some vectors against change: the `syn0_lockf` array. It's the same size as the array of word-vectors-in-training (`syn0`), and by default all its values are 1.0, which means the corresponding word-vector receives full updates during training. If you set some/all of the `syn0_lockf` values to 0.0 instead, those word-vectors will be frozen against changes during further training. So in the above naive scenario, you could freeze most of the words from the original data – perhaps especially foundational words for which you have a lot of examples and don't expect (or want) later smaller datasets to change. Then when you train on new data, those words may (theoretically) serve as 'semantic anchors'. The training tries to update them, but can't – so across training iterations the still-free-to-change words must continue to adjust to fit within the framework already set by those anchors. In that way, they might also remain more meaningfully-comparable to other words that are not even in the fresh data (and were thus de facto 'frozen' because no new examples mentioned them).

In all these cases, note that the known-words vocabulary doesn't expand beyond what was originally provided to `build_vocab()`, and so all-new words in new `train()` batches will be ignored-as-unknown. There's no support in the API of currently-released gensim to update the vocabulary – but if you reach into the model itself you could manually do it yourself. There's a pending Pull Request – https://github.com/piskvorky/gensim/pull/615 – based on other earlier work that adds some of these kinds of incremental word-adding capabilities, or could be used as a model for crafting your own approach.

- Gordon

Andrey Kutuzov

Mar 15, 2016, 7:54:58 PM
to gen...@googlegroups.com
Surender, in addition to what Gordon wrote, you might also want to check
this thread:
https://groups.google.com/forum/#!topic/gensim/CBPl4aXN7Ao
> *I want to have higher contribution of these new sentences over the
> previously learned sentences. *
>
> Is it possible to do this? if yes, then how can i achieve this using
> gensim?
>
> Thanks & Regards
> Surender Kumar
>

--
Solve et coagula!
Andrey

Surender Nandal

Mar 17, 2016, 6:03:15 AM
to gensim
Hello Gordon,

Thank you very much for such a nice and clear explanation. I will try training my models using your approaches and will let you know the results.