As Ivan notes, to get a vector for a new document, you don't need to do a full bulk-training cycle – you can use `infer_vector()`, and you'll get a doc-vector that approximates what that new document would have received, had it been part of the last bulk-training process. That is, it'll be in a compatible coordinate-space.
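For example, something roughly like this (the model filename is just a placeholder, and tokenization should match whatever preprocessing was used for bulk training):

```python
# Minimal sketch of inferring a vector for a new document against an
# already-trained Doc2Vec model (hypothetical filename/text).
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load("big_model.d2v")  # the frozen, bulk-trained model

new_doc = "text of a document that arrived after bulk training"
tokens = simple_preprocess(new_doc)  # tokenize the same way as the training data

# The returned vector is in the same coordinate space as the bulk-trained
# doc-vectors, so it can be compared against them directly.
vec = model.infer_vector(tokens)
print(model.docvecs.most_similar([vec], topn=5))
```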
In fact, it's even possible for the inferred vectors, if you use a generous number of `steps`, to work better for downstream info-retrieval/similarity/classification tasks than those left in the model by bulk training. This is likely because the inferred vector gets all of its inference-adjustments from the final frozen model, rather than (as the bulk-trained vectors do) from the model over the full course of moving from untrained to fully-trained. And, while inference on a single text is single-threaded, the frozen model can be used by many processes, or replicated to many machines, making inference of large batches of documents somewhat more parallelizable/distributable than original training.
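As a rough sketch of the 'many processes' idea – the model path, `steps` value, and pool size below are just placeholders, and in more recent gensim (4.x) the `steps` parameter is named `epochs`:

```python
# Sketch of inferring vectors for a large batch of tokenized documents in
# parallel worker processes, each holding its own copy of the frozen model.
from concurrent.futures import ProcessPoolExecutor
from gensim.models.doc2vec import Doc2Vec

MODEL_PATH = "big_model.d2v"  # hypothetical path
_model = None  # one read-only model per worker process

def _init_worker():
    global _model
    _model = Doc2Vec.load(MODEL_PATH)

def _infer(tokens):
    # a generous number of inference steps often helps result quality
    return _model.infer_vector(tokens, steps=50)

def infer_batch(tokenized_docs, workers=4):
    with ProcessPoolExecutor(max_workers=workers, initializer=_init_worker) as pool:
        return list(pool.map(_infer, tokenized_docs, chunksize=100))
```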
(For this reason, it could be worth evaluating whether to re-infer vectors for the training data after training - depending on time/process costs, the achievable lift might justify the effort. There's not yet explicit support in gensim for some practices that might be worthwhile here – like fine-tuning the bulk vectors with some extra final inference, or starting inference from a supplied vector rather than a fresh random vector. But they're topics of ongoing discussion, for example at GitHub issue <
https://github.com/RaRe-Technologies/gensim/issues/515>.)
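If you did want to try re-inferring the training set, a minimal sketch might look like the following, assuming `train_corpus` is the same iterable of `TaggedDocument`s that was used for bulk training:

```python
# Rough sketch of re-inferring vectors for the training documents from the
# final frozen model, for comparison against the vectors left by bulk training.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("big_model.d2v")  # hypothetical path

reinferred = {}
for doc in train_corpus:  # iterable of TaggedDocument(words=[...], tags=[...])
    reinferred[doc.tags[0]] = model.infer_vector(doc.words, steps=50)

# `reinferred` can then be evaluated head-to-head against model.docvecs
# on your downstream similarity/classification task.
```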
Also, I'd presume that when you mention "expired" documents, you mainly mean that they are no longer interesting as live most-similar results. They might still be useful as training data. (For example, some rare terms might fall below `min_count` occurrences, or otherwise have thin examples of usage/meaning, within a limited sliding window of 'active' documents. But training on a longer history of older documents would model such terms better... even if you don't need to retain the doc-vectors for those expired documents.)
Combining all these observations, I'd lean toward a system that:
* occasionally does gigantic model-builds, on as much data as is thought domain-relevant (so as to have the largest domain vocabulary and richest training data). For example, this might happen weekly or monthly.
* from those big models, has a separate process to maintain a smaller subset of 'active' doc-vectors – either by extracting the subset from the bulk-trained vectors, or re-inferring as new documents arrive. (The gensim `KeyedVectors` class is an example of storing vectors by lookup keys, and supporting `most_similar()`, but without full training support/model-overhead. You could probably shoehorn your 'active' sets of vectors into it or a class like it – see the sketch after this list.) This active-set might be maintainable daily or hourly, or even (nearly) instantly.
* when the next "big model" rebuild happens, the resulting vectors won't usually or necessarily be comparable to those from a prior "big model" - the coordinate spaces weren't trained together, so while relative document similarities should be of similar quality, individual docs/neighborhoods will have moved around arbitrarily/randomly. So, you'd throw out any cached vectors in 'active' applications when rolling to a new big model. There might be ways of speeding the big builds, or achieving better quality in the same time, by starting the new model with some state from the previous generation – but that would be a matter of ad hoc customization; there's no explicit support for this yet. There could also be ways of enforcing coordinate-compatibility, by carrying some state forward, by locking some significant set of reference docs in the training set at fixed coordinates, or by locking some set of domain terms (in input-words or NN output-encodings) at fixed coordinates – but that again would require a lot of research/tinkering to implement and evaluate its potential value.
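To illustrate the 'active subset' idea from the 2nd bullet above, here's a rough sketch in gensim 4.x spelling (in the 3.x series the bulk doc-vectors live under `model.docvecs` rather than `model.dv`, and the `KeyedVectors` add-methods are named differently); the tags and model path are made up:

```python
# Rough sketch: keep only the 'active' doc-vectors in a lightweight
# KeyedVectors, separate from the full training model.
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("big_model.d2v")  # hypothetical path

active_tags = ["doc_101", "doc_102", "doc_103"]  # whatever is currently 'live'
active = KeyedVectors(model.vector_size)
active.add_vectors(active_tags, [model.dv[tag] for tag in active_tags])

# New arrivals can be inferred and added without touching the big model.
new_vec = model.infer_vector("tokens of a brand new document".split())
active.add_vector("doc_104", new_vec)

print(active.most_similar("doc_104", topn=3))
```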
- Gordon