The `most_similar()` checks currently use a full scan over all candidate doc-vectors, then sort and return the top-N. The good news is that you can limit the set of candidates to a contiguous range of the backing array: see the `clip_start` and `clip_end` parameters of `DocvecsArray.most_similar()`.
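To make the mechanism concrete, here's a minimal pure-Python sketch of what a clipped top-N similarity scan does. This is an illustration of the idea, not gensim's actual implementation (which operates on a numpy array):

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar_clipped(query, vectors, topn=3, clip_start=0, clip_end=None):
    """Scan only vectors[clip_start:clip_end]; return top-N (index, similarity)."""
    if clip_end is None:
        clip_end = len(vectors)
    sims = ((i, cosine(query, vectors[i])) for i in range(clip_start, clip_end))
    return heapq.nlargest(topn, sims, key=lambda pair: pair[1])
```

Clipping to the first 400K indexes means only those candidates are ever scored, so the other 1.6M never appear in the results.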
So, if you ensure that the 400K wind up at contiguous indexes, calling `most_similar()` clipped to that range should be the easiest way to achieve your goal.
For example, it should be enough to have your corpus iterable yield those 400K in sequence (to the corpus scan that happens pre-training, inside `build_vocab()`), either before or after all the others.
But also note that for maximum memory efficiency, you'd want to use plain Python ints, rather than composed strings like `TAG_i`, as your document tags. In that case, just make sure that whatever external steps assign those ints make them contiguous indexes.
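Putting the last two points together, a corpus wrapper along these lines would assign contiguous int tags with the core set first. The `core_texts`/`other_texts` names are hypothetical placeholders for your two document sources; in real use you'd wrap each yielded pair in gensim's `TaggedDocument`:

```python
def ordered_tagged_corpus(core_texts, other_texts):
    """Yield (words, [int_tag]) pairs with the core documents first,
    so their tags occupy the contiguous index range [0, len(core)).
    """
    tag = 0
    for source in (core_texts, other_texts):
        for text in source:
            # (words, tags) matches the shape of gensim's TaggedDocument
            yield text.split(), [tag]
            tag += 1
```

Because the generator is deterministic, re-iterating it (as training does over multiple passes) reproduces the same tag-to-document mapping each time.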
Other thoughts:
Regarding your idea (1), you're probably right that it's better to train on the larger set. But if the other 1.6M examples are somewhat different in their vocabulary/content, it's also possible they dilute rather than improve the modeling of the core 400K. If you have time and an easy way to evaluate the quality of the end results, I might try both including and excluding them.
Regarding (2), while there are no built-in methods to support such post-training trimming, it could certainly be done with some extra custom Python to perform surgery on the trained model's internals. (If you never want to look up or otherwise receive results from the 1.6M, they're just taking up memory: they're not even required for inferring new document vectors.)
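The surgery would amount to slicing the doc-vector array down to the core rows and rebuilding the tag lookup. A rough sketch of the idea, using plain lists rather than the model's actual internal structures (in older gensim versions the array lives at something like `model.docvecs.doctag_syn0`, but check your version before attempting this):

```python
def trim_docvecs(vectors, tags, keep_n):
    """Keep only the first keep_n document vectors (assumes the core docs
    occupy indexes 0..keep_n-1) and rebuild the tag-to-index lookup."""
    trimmed = vectors[:keep_n]
    kept_tags = tags[:keep_n]
    tag_to_index = {tag: i for i, tag in enumerate(kept_tags)}
    return trimmed, kept_tags, tag_to_index
```

This only works cleanly if the core docs were given the contiguous leading indexes, as suggested above; otherwise you'd need a fancier gather-and-remap step.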
Regarding (3), there's not currently any support to "continue" training with more examples. (There are ways you could try to force it, by improvising fix-ups on the internal model state, but whether that could achieve good results is unknown, and in any case the model tends to be most-influenced by the latest examples it has seen... so you wouldn't want the non-core data to come last.) You can ask the model to infer vectors for new texts, but that doesn't change/improve the model at all.
Be sure to try DBOW mode (`dm=0`) – some of the more interesting Doc2Vec results use that, rather than the default DM (`dm=1`) mode. (It also scores better in my attempts to reproduce the "Paragraph Vectors" IMDB sentiment experiment.)
- Gordon