For now, I question the appropriateness of the recent work that's been called "Online Word2Vec".
"Online" is a bit of a misnomer - "online" often implies a model that can always take any increment of new input, and tends to gives meaningful (and improving) results for every increment. This recent work instead lets you supply a big new batch of examples to grow the known vocabulary. So a better term for this feature would be "vocabulary expansion".
Training with just the new examples might then improve *or deteriorate* the quality of the word-vectors, and while the model will technically let you compare words in the new batch with words in the old batch, the quality of such comparisons is hard to know, and probably gets *worse* the more a later batch is trained - against the usual expectation that more training always helps. (Word-vectors are only comparable to the extent they were trained against each other; all training on a later batch improves the words with respect to only those later examples, at the likely cost of worsening them with respect to the earlier examples.)
The testing that's occurred with this new feature has really only verified that new tokens are available with at-a-glance somewhat-meaningful vectors. The effect on existing tokens, or on relations with tokens that don't appear in later training batches, hasn't been evaluated. (I'm also not sure it's doing the best thing with respect to features like frequent-word downsampling.)
I don't know of any project write-ups on the right way to choose a new `alpha` learning-rate decay, or relative number of passes, for meaningful results. The right choices for these values will likely vary a lot based on the relative sizes and vocabulary overlap of your incremental batches. Given that reasonable best practices are not yet known, unrealistic expectations for what this vocab-expansion feature achieves may encourage wasteful fumbling, with no or negative benefit.
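To make those unknowns concrete, these are the knobs in question (a sketch, assuming a gensim-4.x-style `train()` accepting explicit `start_alpha`/`end_alpha`/`epochs`; every value shown is a placeholder, not a recommendation):

```python
from gensim.models import Word2Vec

old_corpus = [["the", "apple", "falls", "near", "the", "tree"]]
new_corpus = [["the", "smartphone", "falls", "near", "the", "couch"]]

model = Word2Vec(old_corpus, min_count=1)
model.build_vocab(new_corpus, update=True)

# The open questions: what learning-rate span, and how many passes?
# These placeholder values are illustrative only; nobody has published
# a principled way to pick them for a given batch size / vocab overlap.
model.train(
    new_corpus,
    total_examples=len(new_corpus),
    epochs=5,
    start_alpha=0.01,   # lower than the fresh-training default of 0.025?
    end_alpha=0.0001,
)
```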
I would only recommend people call Word2Vec/Doc2Vec `train()` with incremental subsets of data if they…
(1) …have a good understanding of what's happening behind the scenes and thus the limitations on strong interpretations of results; and…
(2) …have effective project-specific evaluation mechanisms to check whether this feature, under various parameter choices, is helping or hurting the resulting vectors (one crude sketch of such a check follows below).
(And if these do apply to anyone reading this who manages to profitably apply this feature, by all means please write up and share what you learn!)
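One crude form such a check could take (a sketch only; the probe pairs, corpora, and epoch count are stand-ins you'd replace with things that matter for your project):

```python
from copy import deepcopy
from gensim.models import Word2Vec

old_corpus = [["the", "apple", "falls", "near", "the", "tree"]]
new_corpus = [["the", "smartphone", "falls", "near", "the", "couch"]]
model = Word2Vec(old_corpus, min_count=1)

# Probe pairs whose similarity you have domain reasons to trust.
probe_pairs = [("apple", "tree"), ("falls", "near")]

before = {pair: model.wv.similarity(*pair) for pair in probe_pairs}

# Expand and incrementally train a copy, leaving the original intact.
expanded = deepcopy(model)
expanded.build_vocab(new_corpus, update=True)
expanded.train(new_corpus, total_examples=len(new_corpus), epochs=5)

after = {pair: expanded.wv.similarity(*pair) for pair in probe_pairs}
for pair in probe_pairs:
    print(pair, round(before[pair], 3), "->", round(after[pair], 3))
# If similarities you trust drift badly, the incremental pass hurt
# more than it helped, at least for the relations you care about.
```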
Another option might be to create a merged overall corpus that's expanded with synthetic words representing a word-in-a-single-year. For example, an original sentence in 1993 that's…
"The apple doesn't fall very far from the tree"
…might be included both as-is and with multiple era-specific transformations…
"The apple^1993 doesn't fall^1993 very far^1993 from the^1993 tree"
"The^1993 apple doesn't^1993 fall very^1993 far from^1993 the tree^1993"
…etc…
You'd then shuffle all data together, and ensure before training that 'apple', 'apple^1993', 'apple^1994', etc are all equivalently initialized. In that way, all words (era-oblivious and era-specific) are trained against each other, in a mixture of era-specific and era-oblivious contexts. The differences in era-specific words are then more likely to yield meaningful comparisons. (Though, I'd try to validate this assumption against words that are perhaps so old/common they're not expected to change, or against random subsets of words for which any indicated drift must just be an artifact of the subsetting.)
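A rough sketch of both the transformation and the equivalent initialization (assuming a gensim-4.x-style `KeyedVectors` with `key_to_index`/`vectors`; the helper name and per-year corpora are my own invention):

```python
import random
from gensim.models import Word2Vec

def era_variants(tokens, year):
    """Yield the sentence as-is, plus two complementary variants
    where alternating tokens get an era-specific suffix."""
    yield list(tokens)
    for offset in (0, 1):
        yield [f"{tok}^{year}" if i % 2 == offset else tok
               for i, tok in enumerate(tokens)]

# Hypothetical per-year corpora of tokenized sentences.
corpora = {1993: [["the", "apple", "doesn't", "fall", "very",
                   "far", "from", "the", "tree"]]}

merged = [variant
          for year, sents in corpora.items()
          for sent in sents
          for variant in era_variants(sent, year)]
random.shuffle(merged)

model = Word2Vec(min_count=1)
model.build_vocab(merged)

# Equivalent initialization: start each era-specific word at the same
# vector as its era-oblivious base form, so later differences reflect
# training rather than differing random starting points.
for key, idx in model.wv.key_to_index.items():
    if "^" in key:
        base = key.split("^")[0]
        if base in model.wv.key_to_index:
            model.wv.vectors[idx] = model.wv.vectors[model.wv.key_to_index[base]]

model.train(merged, total_examples=len(merged), epochs=model.epochs)
```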
- Gordon