If you're mixing PV-DBOW with simultaneous skip-gram word-training (as with gensim `Doc2Vec` mode `dm=0, dbow_words=1`), it's true that the (traditional input) word-vectors don't *directly* affect the doc-vectors. Each document example presented to the model during training triggers a forward-propagation through the NN that doesn't use any word-vectors. Then, the back-propagation only updates the hidden-to-output weights and the PV-DBOW doc-vector itself.
But there's indirect influence, because interleaved with the document-examples will be many word-to-word examples, which share the same hidden-to-output weights. The doc-to-word predictions are tried & improved against a model that's also doing word-to-word predictions, and vice-versa.
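For concreteness, here's a minimal sketch of setting up that combined mode with gensim's `Doc2Vec` (a toy corpus and arbitrary sizes; the parameter names follow the older API used in this thread, where `size`/`iter` correspond to `vector_size`/`epochs`, and `docvecs` to `dv`, in current gensim releases):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document carries a unique tag, which becomes the key of its doc-vector.
corpus = [
    TaggedDocument(words=["the", "quick", "brown", "fox"], tags=["doc0"]),
    TaggedDocument(words=["jumps", "over", "the", "lazy", "dog"], tags=["doc1"]),
]

# PV-DBOW (dm=0) plus interleaved skip-gram word-training (dbow_words=1):
# the doc-to-word and word-to-word examples share the same hidden-to-output weights.
model = Doc2Vec(corpus, dm=0, dbow_words=1, size=50, window=5,
                min_count=1, iter=10, workers=1)

doc_vec = model.docvecs["doc0"]   # the trained doc-vector
word_vec = model.wv["fox"]        # word-vectors are only trained because dbow_words=1
```

With `dbow_words=0` instead, the same call would still allocate word-vectors, but (as noted below) they'd never be consulted or adjusted during training.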
The Lau & Baldwin paper is a valuable attempt to elucidate & benchmark a bunch of options, but I find its analysis confused on a few issues, such as the implication that the "randomized embeddings" have any influence on pure PV-DBOW training (with default `dbow_words=0`). The word embeddings are ignored entirely in that case – the gensim code which allocates & randomly-initializes the word-vectors only runs because of shared initialization code-paths with other modes that haven't been conditionally skipped (as they theoretically could be).
It is definitely the case that adding interleaved skip-gram training works as a sort of corpus-expansion trick: every sliding-window of word-to-word predictions is like another mini-document, carved out of the existing documents. So adding simultaneous word-training winds up presenting the NN model with many more individual training-examples – by roughly a factor of the `window` parameter, which also multiplies training-time – and perhaps more varied shades-of-usage via the word-to-word frequencies.
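As a back-of-the-envelope illustration of that expansion, here's a hypothetical counting helper. It's simplified to match the reasoning above, ignoring gensim's per-target reduced-window sampling and any frequent-word downsampling:

```python
# Rough micro-example count for one document of n_tokens words, under the
# simplifying assumption that every target word contributes `window`
# word-to-word examples (hypothetical helper, for illustration only).
def micro_examples(n_tokens, window, dbow_words):
    doc_to_word = n_tokens                                # one doc-to-word example per token
    word_to_word = n_tokens * window if dbow_words else 0
    return doc_to_word + word_to_word

print(micro_examples(1000, window=10, dbow_words=0))  # 1000
print(micro_examples(1000, window=10, dbow_words=1))  # 11000, i.e. ~11x as many
```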
So for example, comparing...
`dm=0, iter=10`
...vs...
`dm=0, dbow_words=1, window=10, iter=10`
...the second set of parameters is going to result in about 11x more micro-examples (individual input-to-desired-output patterns) reaching the NN over the 10 corpus iterations, and thus about 11x more training time. It might thus result in a stronger model, but if your main interest is the quality of the doc-vectors, and you have the same 11x more time to spend on training that `dbow_words=1, window=10` would cost you, you might also want to test pure-DBOW as...
`dm=0, iter=110`
...so that the model innards are getting the same number of micro-examples as in the simultaneous word-training case, but now fully focused on the doc-vectors.
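Putting those three configurations side by side (a sketch only, reusing the toy `corpus` from the earlier snippet; sizes and worker counts are arbitrary):

```python
from gensim.models.doc2vec import Doc2Vec

# The three parameter sets compared above; training cost grows with the number
# of micro-examples each one feeds to the NN across the corpus passes.
configs = {
    "pure_dbow_10_passes":  dict(dm=0, iter=10),
    "dbow_plus_words":      dict(dm=0, dbow_words=1, window=10, iter=10),
    "pure_dbow_110_passes": dict(dm=0, iter=110),
}

models = {}
for name, params in configs.items():
    models[name] = Doc2Vec(corpus, size=100, min_count=1, workers=4, **params)

# Whichever downstream evaluation you trust (e.g. similarity checks of doc-vectors
# on held-out documents) would then show which trade-off suits your task.
```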
- Gordon