It all depends.
Is it a lot of data or a little? Are they long texts or short texts? What's the method of assessing quality? Are you assessing doc-vectors from the bulk-training, or from later inference? Have you tried other parameter values?
In the demo notebook that tries to replicate one of the experiments from the Paragraph Vectors paper (
https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb), you can see that for the purpose of coarse-grained (positive/negative) sentiment-prediction on that relatively-small dataset, DBOW vectors are better at every iteration (1-20) than the two DM approaches, mean or concatenation, that are also compared. (The DM approach you're using, sum-of-vectors, isn't tried there... if I recall correctly it was left out because it did comparably but slightly worse than DM/mean.)
A few things I've noticed that may apply to your usage – though they could easily be different for different data sources and end-tasks – are:
* using 'sample' frequent-word downsampling often improves both training speed *and* word-vector quality (on the analogies evaluation); its effect on doc-vector quality has been mixed
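The intuition behind `sample` can be sketched with the discard-probability formula from the original word2vec paper – gensim's actual bookkeeping differs slightly in detail, but `sample` plays the role of the threshold below:

```python
import math

def discard_probability(word_fraction, sample=1e-3):
    """Chance that an occurrence of a frequent word is skipped during training.

    word_fraction: the word's share of all corpus tokens.
    Words at or below the `sample` threshold are never discarded.
    """
    if word_fraction <= sample:
        return 0.0
    return 1.0 - math.sqrt(sample / word_fraction)

# A very frequent word (say, 5% of all tokens) is mostly skipped...
print(round(discard_probability(0.05), 2))   # -> 0.86
# ...while a rare word (0.01% of tokens) is always kept.
print(discard_probability(0.0001))           # -> 0.0
```

Skipping most occurrences of ubiquitous words is where both the speedup and the (word-vector) quality gain come from: training spends its effort on more informative co-occurrences.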
* leaving in very-infrequent words seems to work like noise that dilutes what can be learned. Consider the extreme case of every document (or context-window) having some unique word that only appears once. Training devotes effort to making that word predictive of its context, the same as it's trying to make the doc-vector predictive of the same context. So "some" of the power/wisdom of the training process flows into that nearly-useless single word, rather than into the doc-vector you actually want to be representative.
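That's essentially what the `min_count` parameter addresses at vocabulary-build time. A toy sketch of the filtering (pure Python, not gensim's internals):

```python
from collections import Counter

docs = [
    "this movie was great great fun".split(),
    "zxqv99 this movie was dull".split(),   # 'zxqv99' appears exactly once
    "great fun movie this was".split(),
]

min_count = 2
counts = Counter(word for doc in docs for word in doc)

# Words below min_count are dropped before training ever sees them,
# so no effort is wasted making them "predictive" of their contexts.
vocab = {w for w, c in counts.items() if c >= min_count}
trimmed = [[w for w in doc if w in vocab] for doc in docs]

print("zxqv99" in vocab)   # -> False
print("movie" in vocab)    # -> True
```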
* larger windows aren't necessarily better; in particular in DM mode (or in DBOW with simultaneous skip-gram word training, as enabled by the `dbow_words=1` parameter) larger windows mean proportionately more of the model's net-effort is being spent on updating words rather than doc-vectors. It might even be appropriate to think of the doc-vector, in those modes that also train word-vectors, as indicating some 'remainder-of-meaning' that *isn't* reflected by the words... so the most-powerful post-training representation of some text might be a mix of its word-vectors *and* its doc-vector.
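A back-of-envelope way to see the window effect – a deliberate simplification for illustration, not gensim's exact update math: in a DM training example the doc-vector is one input slot alongside roughly 2*window word slots, so its share of each update shrinks as the window grows:

```python
def docvec_share(window):
    """Rough fraction of a DM update's input slots occupied by the doc-vector,
    treating it as one slot alongside 2*window context-word slots."""
    return 1.0 / (2 * window + 1)

# The doc-vector's share of the model's effort falls as the window widens.
for w in (2, 5, 10):
    print(w, round(docvec_share(w), 3))
```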
* sometimes, with the right inference parameters, re-inferring vectors for texts at the end can yield "better" vectors than the in-model vectors left over from the bulk training. Why might that be? Consider your 10 training passes: every in-model doc-vector is the product of all 10 passes, but the earliest passes ran against an almost-random, barely-trained model that was still changing quite a bit. If you instead infer a new vector for the same text at the end, using 10 steps, all 10 steps run against the frozen, final, "best" model.
- Gordon