The simple answer is:
You can `save()` a full model, and later `load()` it to continue to `train()`, including with different data. (Though, when calling `train()` with different data than was initially provided to `build_vocab()`, you should also specify the `total_examples` or `total_words` parameters to `train()`, so that progress-readouts and alpha-decay are estimated properly. Also, any words that weren't found during the one initial `build_vocab()` scan will be ignored as unknown.)
The influence of examples presented during any `train()` call is largely determined by the `alpha`/`min_alpha` values in effect during those training sessions, and in general examples presented last tend to have more influence. So roughly speaking, naively trying the steps you've described will kind-of achieve your stated goals.
But, it may not give you good results.
Everyone seems to want to do this but I don't know of any published work that establishes theoretical or practical guidelines for balancing the influence of the original training/data and the update. If you were to train on the new data until the model reaches equilibrium (optimal weights for the current data) – which is roughly what the choice of iterations/alpha is intended to do – most of the influence of the original training/data may have been diluted away. (Words that were meaningfully close after the early data may have drifted far apart, perhaps just because some of them didn't appear in the current data at all.)
My hunch is that the best approach in such cases would be to create a combined dataset of all examples. If there are some examples you want to have more weight, repeat them (and tweak the number of repetitions to achieve your goals). Train on the combined set, and when you're done you'll have vectors for all words across all examples, and because they were trained in an interleaved fashion, they will be meaningfully comparable. (None will have drifted away from the others because of later training on a subset.)
If for some reason you can't do that, another choice would be to use an experimental feature in the gensim Word2Vec model that can lock some vectors against change: the `syn0_lockf` array. It's the same size as the array of word-vectors-in-training (`syn0`), and by default all its values are 1.0, which means the corresponding word-vector receives full updates during training. If you set some/all of the `syn0_lockf` values to 0.0 instead, those word-vectors will be frozen against changes during further training. So in the above naive scenario, you could freeze most of the words from the original data – perhaps especially foundational words for which you have a lot of examples and don't expect (or want) later smaller datasets to change. Then when you train on new data, those words may (theoretically) serve as 'semantic anchors'. The training tries to update them, but can't – so across training iterations the still-free-to-change words must continue to adjust to fit within the framework already set by those anchors. In that way, they might also remain more meaningfully-comparable to other words that are not even in the fresh data (and were thus de facto 'frozen' because no new examples mentioned them).
In all these cases, note that the known-words vocabulary doesn't expand beyond what was originally provided to `build_vocab()`, and so all-new words in new `train()` batches will be ignored as unknown. There's no support in the API of currently-released gensim for updating the vocabulary – but if you reach into the model itself you could do it manually. There's a pending Pull Request –
https://github.com/piskvorky/gensim/pull/615 – based on other earlier work that adds some of these kinds of incremental word-adding capabilities, or could be used as a model for crafting your own approach.
- Gordon