Because of the way typical training modes work, there might not be much benefit to such an option – at least not compared to just training word-vectors and doc-vectors in totally separate models. Consider:
In pure PV-DBOW (`dm=0`), word-vectors aren't trained at all.
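For illustration, a minimal sketch with gensim (4.x attribute names, and a throwaway toy corpus) showing that in this mode the doc-vectors get trained while the word-vectors stay at their random initialization:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# throwaway toy corpus, just to make the sketch runnable
docs = [
    TaggedDocument(words=["the", "quick", "brown", "fox"], tags=["doc0"]),
    TaggedDocument(words=["jumps", "over", "the", "lazy", "dog"], tags=["doc1"]),
]

# dm=0 -> PV-DBOW; dbow_words=0 (the default) means no interleaved skip-gram
# word-training, so the vectors in model.wv are allocated but never adjusted
model = Doc2Vec(docs, dm=0, dbow_words=0, vector_size=50, min_count=1, epochs=20)

print(model.dv["doc0"])  # trained doc-vector
print(model.wv["fox"])   # same dimensionality, but still just its random initialization
```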
In PV-DBOW interleaved with skip-gram word-training, the usual benefits sought are:
(1) word-vectors & doc-vectors that are in the 'same space' - same dimensionality, proximities mean similarities; and
(2) all the sliding word-context windows serve as sort-of micro-documents, perhaps working as a kind of corpus-extension that makes the doc-vector space more expressive.
Both of these would be lost with different-sized vectors, and any alternating training of doc-vectors (of one size) and word-vectors (of another size) would be essentially like training two wholly separate models – which is already easy enough to do as separate steps.
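As a sketch of benefit (1): interleaved training is enabled in gensim with `dbow_words=1`, and because word-vectors & doc-vectors then share one size and coordinate space, they can be compared against each other directly (gensim 4.x names again; `common_texts` is just gensim's bundled toy corpus):

```python
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words, tags=[str(i)]) for i, words in enumerate(common_texts)]

# dm=0, dbow_words=1: PV-DBOW doc-vector training interleaved with skip-gram word-training
model = Doc2Vec(docs, dm=0, dbow_words=1, vector_size=50, window=5, min_count=1, epochs=40)

# same dimensionality & coordinate space, so cross-comparisons are meaningful
print(model.wv.most_similar(positive=[model.dv["0"]], topn=3))
```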
In PV-DM with either summing or averaging of context-vectors (the default `dm=1` mode), the doc-vectors and word-vectors must be summable/averageable, and so must be the same size.
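In gensim that's controlled by `dm_mean`, which chooses between sum and average; either way, the element-wise combination only works when the sizes match (a sketch, same toy corpus as above):

```python
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words, tags=[str(i)]) for i, words in enumerate(common_texts)]

# dm=1 -> PV-DM; dm_mean=1 averages the doc-vector with the context word-vectors
# (dm_mean=0 would sum them) - either way they must have identical dimensionality
model = Doc2Vec(docs, dm=1, dm_mean=1, vector_size=50, window=5, min_count=1, epochs=40)

assert model.dv["0"].shape == model.wv["human"].shape  # both (50,)
```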
Only in PV-DM with a concatenative input layer (`dm=1, dm_concat=1`) would combined training of different-sized word-vectors and doc-vectors *possibly* make sense. However, this mode is still best considered experimental. Despite the claims of the original 'Paragraph Vectors' paper, it doesn't seem to offer a noticeable win over other modes. (The results reported there have never to my knowledge been reproduced.) It creates a giant, slow-to-train model. Perhaps, on giant datasets, or with far more training iterations, or with other modifications not detailed in published work, this mode is worthwhile. But for now it's of dubious value.
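To give a feel for why this mode's model is so large: with concatenation, the projection layer is the doc-vector plus all 2*window context word-vectors laid end-to-end, so its width grows with the window size. A sketch (assuming the internal `layer1_size` attribute, which holds that projection-layer width, still behaves as I recall):

```python
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words, tags=[str(i)]) for i, words in enumerate(common_texts)]

# dm=1, dm_concat=1: concatenate the doc-vector with each of the 2*window context
# word-vectors (rather than summing/averaging them), giving a much wider input layer
model = Doc2Vec(docs, dm=1, dm_concat=1, vector_size=50, window=5, min_count=1, epochs=40)

# expected width: (1 doc-vector + 2*window word-vectors) * vector_size = 11 * 50
print(model.layer1_size)  # 550 with these settings
```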
So, before adding new tunable options (like mixed-size word-and-doc-vectors) to this experimental mode, it'd be good to find some conditions (such as a particular dataset/task) where this mode is valuable, and can be realistically evaluated. (I suppose there's also some chance of a bug in gensim's implementation of this mode, which I wrote when trying to reproduce the paper's results. But I've reviewed the code quite closely several times, and it does seem to behave in the general ways one would expect. Also, I haven't yet come across other implementations of this mode, in Python or other languages, against which we could double-check its results. If an alternate implementation can be found, that comparison could be interesting.)
- Gordon