Pure PV-DBOW mode (`dm=0, dbow_words=0`) is fast and often a strong performer on downstream tasks. It doesn't consult or create the traditional (input/'projection-layer') word-vectors at all, so whether those are zeroed-out, random, or pre-loaded from word-vectors created earlier makes no difference.
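For example, a minimal sketch of this mode in gensim (parameter and attribute names per recent gensim 4.x releases; the tiny corpus is just a placeholder for your own iterable of `TaggedDocument`s):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# placeholder corpus: in practice, any iterable of TaggedDocument over your data
corpus = [
    TaggedDocument(words=['machine', 'learning', 'for', 'text'], tags=['doc0']),
    TaggedDocument(words=['deep', 'learning', 'with', 'vectors'], tags=['doc1']),
]

# pure PV-DBOW: dm=0, dbow_words=0 -- word-vectors are never trained or consulted
model = Doc2Vec(dm=0, dbow_words=0, vector_size=50, min_count=1,
                epochs=20, workers=1)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

print(model.dv['doc0'])                        # trained doc-vector for a known tag
print(model.infer_vector(['unseen', 'text']))  # doc-vector for new text
```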
PV-DBOW with concurrent skip-gram training (`dm=0, dbow_words=1`) will interleave word-vector training with doc-vector training. It can start with random word-vectors, just like plain word-vector training, and learn all the word-vectors/doc-vectors together, based on the current training corpus. The word-vectors and doc-vectors will influence each other, for better or worse, via the shared hidden-to-output-layer weights. (The training is slower, the doc-vectors essentially have to 'share the coordinate space' with the word-vectors, and with typical `window` values the word-vectors, in aggregate, get far more training cycles.)
PV-DM (`dm=1`) inherently mixes word- and doc-vectors in every training example, but, like PV-DBOW+SG, it can also start with random word-vectors and learn everything it needs from the current corpus, concurrently during training.
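And, roughly, the two modes that do train word-vectors alongside doc-vectors (re-using the `corpus` from the sketch above):

```python
from gensim.models.doc2vec import Doc2Vec

# PV-DBOW with interleaved skip-gram word-vector training
dbow_sg_model = Doc2Vec(dm=0, dbow_words=1, vector_size=50, window=5,
                        min_count=1, epochs=20, workers=1)

# PV-DM: the doc-vector is combined with context word-vectors on every example
dm_model = Doc2Vec(dm=1, vector_size=50, window=5,
                   min_count=1, epochs=20, workers=1)

for m in (dbow_sg_model, dm_model):
    m.build_vocab(corpus)
    m.train(corpus, total_examples=m.corpus_count, epochs=m.epochs)
    # both m.dv (doc-vectors) and m.wv (word-vectors) are now populated,
    # having influenced each other during training
```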
In either PV-DBOW+SG or PV-DM, you could try to re-use word-vectors from an earlier session. I'd expect that starting the model like this, with some of its weights already in a somewhat-meaningful configuration, could give it something of a 'head-start' on achieving a useful doc-vector configuration. There's no separate phase where word-vectors are learned "first", so N iterations of Doc2Vec training will still take the same amount of time, but *maybe* the model would make a little more progress in the same number of iterations (or make do with fewer iterations).
However, there's also some chance you'd be impairing the doc-vectors for some purposes by bringing in state from a prior word2vec training session that had different predictive objectives, especially if the word-vectors come from a different corpus. You'd also want to account for the time/overhead of creating and optimizing those word-vectors in the first place.
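If you do want to experiment with such pre-seeding, a rough sketch of one hand-rolled way to do it (assuming gensim 4.x attribute names and a previously saved word2vec-format file; the filename is just a placeholder):

```python
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec

# word-vectors from an earlier session (placeholder filename)
old_wv = KeyedVectors.load_word2vec_format('earlier_word_vectors.bin', binary=True)

model = Doc2Vec(dm=0, dbow_words=1, vector_size=old_wv.vector_size, window=5,
                min_count=2, epochs=20, workers=4)
model.build_vocab(corpus)  # corpus: your iterable of TaggedDocuments

# overwrite the random initialization for any word the new model also knows
for word in model.wv.index_to_key:
    if word in old_wv:
        model.wv[word] = old_wv[word]

model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```

Note the copied vectors aren't frozen: they'll keep being adjusted, alongside the doc-vectors, over the course of training.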
I'd suspect such pre-seeding to be most helpful with smaller datasets, where you're seeding with word-vectors left over from a much larger (but still believed to be usefully 'compatible') dataset.
I doubt you'd want to train up word-vectors as a separate, optimized step of what's now a multi-step process. For example, given one large corpus, I'd expect 20 iterations of Doc2Vec starting from random initialization to give better results than 10 iterations of Word2Vec from random initialization followed by 10 iterations of Doc2Vec from reused-word-vector initialization. The single combined training lets everything co-improve together from the very beginning, and gives the doc-vectors relatively more attention.
Considering another possible scenario: let's say you already have a well-performing Doc2Vec model, trained on an older/smaller dataset, and you then want to train a new Doc2Vec model, with similar parameters, on larger/newer data from a similar domain. The case for re-using the older model's state as a starting point seems stronger to me here (compared to just importing other word-vectors): it's the same domain, the same training objective, and maybe even mostly-the-same documents. (But one issue I'd foresee is that the accumulated 'weight' of all the older/repeated documents might make the model less influenced by any novel documents, compared to the alternative of retraining from scratch – and especially so if you're leveraging the 'head-start' to skimp on additional full training epochs with the current dataset.)
- Gordon