a right way to resume a Word2Vec model and continue the training process


Henry Chang

Apr 6, 2016, 9:19:54 PM4/6/16
to gensim

Hi,


I would like to understand the right way to resume a Word2Vec model and continue the training process. There is something I am still not clear about. Could you please help?


After

==
from gensim.models.word2vec import Word2Vec

all_sentences = [['first', 'sentence'], ['second', 'sentence'], ['third', 'sentence'], ['fourth', 'sentence']]
some_sentences = [['first', 'sentence'], ['second', 'sentence']]

model = Word2Vec(min_count=1)
model.build_vocab(all_sentences)
model.train(some_sentences)
print "similarity_1:"
print model.similarity('first','second')
print "most_similarity_1:"
print model.most_similar(positive=['first', 'sentence'], negative=['second'], topn=1)

model.save('mdlObj')

==

was executed, it returned

==

similarity_1:
-0.0450417522552
most_similarity_1:
[('fourth', -0.09383071959018707)]

==


Next,

==

from gensim.models.word2vec import Word2Vec

all_sentences = [['first', 'sentence'], ['second', 'sentence'], ['third', 'sentence'], ['fourth', 'sentence']]

model = Word2Vec(min_count=1)
model.build_vocab(all_sentences)
model.load('mdlObj')
print "similarity_2:"
print model.similarity('first','second')
print "most_similarity_2:"
print model.most_similar(positive=['first', 'sentence'], negative=['second'], topn=1)

other_sentences = [['third', 'sentence'], ['fourth', 'sentence']]
model.train(other_sentences)
print "similarity_3:"
print model.similarity('first','second')
print "most_similarity_3:"
print model.most_similar(positive=['first', 'sentence'], negative=['second'], topn=1)

==

was executed. It returned

==

similarity_2:
-0.0451681058758
most_similarity_2:
[('fourth', -0.09381453692913055)]
similarity_3:
-0.0451681058758
most_similarity_3:
[('fourth', -0.09404119849205017)]

==


Question 1: Is this the right way to resume a word2vec model and continue the training? In other words, I built the vocabulary tree based on all the sentences, loaded the saved model, and then trained on the other sentences. I expect to have the same word vectors for 'first', 'sentence', 'second', 'third' and 'fourth' after the above two executions, just like what we get from

==
from gensim.models.word2vec import Word2Vec

all_sentences = [['first', 'sentence'], ['second', 'sentence'], ['third', 'sentence'], ['fourth', 'sentence']]

model = Word2Vec(min_count=1)
model.build_vocab(all_sentences)
model.train(all_sentences)

==



Question 2: 

Should similarity_1 and similarity_2 be identical theoretically?

Should most_similarity_1 and most_similarity_2 be identical theoretically?

Are they not identical due to model loading and/or vocabulary tree rebuilding?


Question 3:

Should most_similarity_2 and most_similarity_3 be different, since additional sentences were trained?


Thanks for your help.


Best,

Henry

Gordon Mohr

Apr 7, 2016, 3:28:40 AM4/7/16
to gensim
There isn't really a 'right' way to resume training. Though the gensim interface allows you to keep calling `train()`, published work hasn't addressed good ways to update a model with new text examples. While it's plausible you might be able to improvise an approach to some advantage, there are a bunch of murky tradeoffs to navigate, and I wouldn't recommend it unless you're already well-versed in the algorithm's workings and ready to do some custom experimentation and evaluation.

By way of example, consider a Word2Vec model trained on a dataset 'X'. The goal of training is to make the wordvec-fed neural-net as powerful as possible in predicting that X dataset's text. The stochastic gradient descent, with decaying learning rate, plausibly achieves that, with sufficient training passes – further improvements, with regard to X, are very hard to achieve (or even impossible). 

If you then get dataset Y, do you then present those different examples with a large or small learning rate, compared to the earlier training? And whatever you pick, every increment of training on Y, while making the model objectively 'better' at predicting Y, is likely pulling it further away from being optimal at X. (The influence of 'X' is being diluted.) How long do you continue to train? If long enough so that more Y-training no longer improves the model's Y-predictiveness, then the model may very well have reached an optimal point for Y, but with zero lingering influence from the original X training. (You might have just as well trained only on 'Y', from random initialization.) 

The practice with the strongest theoretical backing would be, when Y data arrives, to shuffle it together with X and retrain a fresh model on all the data. Then at the end of training, given suitable choices of training-passes and learning-rate, you'll have a model that's plausibly optimal for the union of X and Y. You're using all the information equally. 
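
That recipe can be sketched in plain Python – combine the two corpora and shuffle once before handing everything to a fresh model (X and Y here are toy stand-ins; the real `build_vocab()`/`train()` calls would follow):

```python
import random

# Toy stand-ins for the X and Y corpora discussed above.
X = [['first', 'sentence'], ['second', 'sentence']]
Y = [['third', 'sentence'], ['fourth', 'sentence']]

# Combine and shuffle so neither dataset dominates any region of training.
combined = list(X) + list(Y)
random.seed(42)  # fixed seed only for reproducibility of this sketch
random.shuffle(combined)

# A fresh model would then be trained on `combined`:
#   model = Word2Vec(min_count=1)
#   model.build_vocab(combined)
#   model.train(combined)
```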

Other hybrid approaches *might* yield something valuable, but that's somewhat speculative and contingent on a lot of choices and caveats. (For example, a better approach than feeding Y examples to an X-trained model *might* be to train a Y model completely separately, then for words unique to Y, use a learned projection, based on common words, to re-map Y words into the X space. That'd be similar to the process described in the Word2Vec-for-translation paper <http://arxiv.org/abs/1309.4168> or section 2.2 of the Skip-Thought Vectors paper <http://arxiv.org/abs/1506.06726>. But I don't know for sure: this would need to be tested against other strategies, with respect to a particular goal for the final wordvecs, and I haven't yet seen that sort of analysis done and written up.)

Some notes regarding your example code:

* toy-sized examples tend not to illustrate Word2Vec behavior well, as the desired end-qualities are somewhat dependent on the word-distributions in bigger datasets, and the SGD training over many examples

* `Word2Vec.load()` is a class method that *returns* a loaded model; the invocation in your code isn't having any persistent effect, since you're ignoring the return value
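
A toy stand-in class (not gensim's) makes the classmethod semantics concrete – the instance you call `load()` on is untouched, and only the *returned* object matters:

```python
class ToyModel:
    """Illustration only: mimics how a classmethod loader behaves."""
    def __init__(self, state):
        self.state = state

    @classmethod
    def load(cls, path):
        # Builds and RETURNS a new object, like `Word2Vec.load()`;
        # it never modifies an instance it happens to be called on.
        return cls('loaded:' + path)

model = ToyModel('fresh')
model.load('mdlObj')              # return value discarded; `model` is unchanged
loaded = ToyModel.load('mdlObj')  # correct: keep the returned object
```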

* `train()` assumes it will get as many examples as were passed to the previous `build_vocab()`, and relies on that expectation for proper progress-reporting and linear decay of `alpha`. (You can override this expectation with the optional parameters to `train()`. This could matter with a real, larger dataset, but in this toy-sized example, with fewer examples than a single thread-batch size, it doesn't.)
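
For illustration, that linear decay can be written as a function of how far through the expected examples training has progressed (the defaults here assume the usual `alpha=0.025`, `min_alpha=0.0001`):

```python
def alpha_at(progress, alpha=0.025, min_alpha=0.0001):
    """Effective learning rate after a fraction `progress` (0.0 to 1.0)
    of the expected examples has been seen, decaying linearly."""
    return alpha - progress * (alpha - min_alpha)
```

If `train()` sees fewer or more examples than `build_vocab()` led it to expect, this `progress` estimate – and hence the decay – goes wrong.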

* Let's call the four sentences in your  `all_sentences` T1, T2, T3, T4. If you're using the latest gensim, the default number of training iterations (mimicking the original word2vec.c) will be 5. So your first training actually presents the training examples as: T1, T2, T1, T2, T1, T2, T1, T2, T1, T2. Then the second training presents: T3, T4, T3, T4, T3, T4, T3, T4, T3, T4. That's quite a different order than if `all_sentences` were trained at once: T1, T2, T3, T4, T1, T2, T3, T4, T1, T2, T3, T4, T1, T2, T3, T4, T1, T2, T3, T4. So training with the two subsets, serially, nudges the model differently than training with the full set. 
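
The two presentation orders can be sketched like so (sentences abbreviated to labels):

```python
def serial_passes(corpus, iters=5):
    """Presentation order when `corpus` gets `iters` full passes by itself."""
    order = []
    for _ in range(iters):
        order.extend(corpus)
    return order

first_half = ['T1', 'T2']
second_half = ['T3', 'T4']

# Two separate train() calls, 5 passes each:
serial = serial_passes(first_half) + serial_passes(second_half)

# One train() call over all four sentences, 5 passes:
together = serial_passes(first_half + second_half)
```

Both orders contain the same 20 examples, but the interleaving differs, so the sequence of SGD updates differs too.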

* Ideally, you'd also shuffle between passes – so that T4 isn't always being trained with a later/lower learning-rate than T1. (But in really big training sets, with plenty of diverse word examples at all places in the ordering, this might not make that much difference.) 
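
A minimal sketch of shuffling between passes, assuming you drive the epochs yourself rather than letting one `train()` call do all the iterations:

```python
import random

def shuffled_epochs(sentences, epochs=5, seed=1):
    """Yield an independently shuffled copy of `sentences` for each pass."""
    rng = random.Random(seed)
    for _ in range(epochs):
        epoch = list(sentences)
        rng.shuffle(epoch)
        yield epoch

passes = list(shuffled_epochs(['T1', 'T2', 'T3', 'T4']))
```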

* You may also want to take a look at the prior thread <https://groups.google.com/d/msg/gensim/7eiwqfhAbhs/4NuuncQlHwAJ>, for discussion of why runs with the same data may differ in results, unless you take very specific steps to enforce determinism.

- Gordon

Henry Chang

Apr 7, 2016, 1:30:17 PM4/7/16
to gensim
Hi Gordon,

Many thanks for answering my questions and explaining the details of gensim/Word2Vec to me.

Let's say all_sentences contains two large sets of sentences, X and Y.
After
==
model = Word2Vec(min_count=1)
model.build_vocab(X)
model.train(X)
model.save('mdlObj')
==
we can either go with
== (T1: from scratch)

model = Word2Vec(min_count=1)
model.build_vocab(all_sentences)
model.train(all_sentences)
==
or
== (T2: based on the trained model from X)
model = Word2Vec(min_count=1)
model.build_vocab(all_sentences)
model = model.load('mdlObj')
model.train(all_sentences)
==
Is there any advantage to using T2 instead of T1? Theoretically, should the neural network's weights have a better chance of converging faster?

I appreciate your help very much!

Best,
Henry

Gordon Mohr

Apr 7, 2016, 3:49:29 PM4/7/16
to gensim
That's hard to say. The training on `all_sentences` (aka 'Y' in your latest example?) will take the same amount of time in either case: the exact same number of examples will be provided to the model, in the same order. (The simple SGD being done here doesn't have any accelerated updating, or early stopping, when it notices more or less progress being made on prediction errors. It's not an adaptive process, and there's no checking as to whether the NN has really 'converged'; it's just assumed that it's gotten as good as it can in the cycles you've chosen to allot.) 

If the model state from the earlier 'X' training was 'good', then *maybe* having started from that already-good model will make the final state a little better. After all, that model has been trained with more total examples, over more time. But having left `alpha` and `min_alpha` at the defaults, it did all X examples 5 times (with a gradual decay of learning rate to near-0), then did all Y examples (with the learning rate jumping higher then decaying again – a sawtooth pattern that isn't the usual, theoretically-defensible way to do SGD). Maybe that's a little closer to what would have resulted had X & Y been presented together 5 times. 
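
The sawtooth is easy to see if you write out the two schedules (assuming the default `alpha=0.025`, `min_alpha=0.0001`):

```python
def linear_decay(n_updates, alpha=0.025, min_alpha=0.0001):
    """Alpha at each of `n_updates` evenly spaced steps, decayed linearly."""
    span = alpha - min_alpha
    return [alpha - i * span / (n_updates - 1) for i in range(n_updates)]

# Two serial train() calls: decay to near zero, jump back up, decay again.
sawtooth = linear_decay(5) + linear_decay(5)

# One combined training run: a single smooth decay over the same total updates.
smooth = linear_decay(10)
```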

But usually you'd want a process that gets unambiguously better with more training data or iterations (until optimal). By only continuing to iterate on the Y data, the model will only be getting better at modeling Y, at the same time it's getting incrementally worse at X (because the remnants of its influence are being eroded with each new non-X example). 

- Gordon

Henry Chang

Apr 7, 2016, 4:33:42 PM4/7/16
to gensim
Many thanks, Gordon.

My all_sentences meant X+Y.  Sorry for the confusion.

It seems adding the line "model = model.load('mdlObj')" before doing model.train(X+Y) is not necessarily better (T2). We may simply train X+Y from scratch (T1) instead.

Best Regards,
Henry


After
==
model = Word2Vec(min_count=1)
model.build_vocab(X)
model.train(X)
model.save('mdlObj')
==
we can either go with
== (T1: from scratch)
model = Word2Vec(min_count=1)
model.build_vocab(X+Y)
model.train(X+Y)

==
or
== (T2: based on the trained model from X)
model = Word2Vec(min_count=1)
model.build_vocab(X+Y)

model = model.load('mdlObj')
model.train(X+Y)
==