`Doc2Vec`, like other algorithms in the `Word2Vec` family, needs lots of training data to essentially "fill" its high-dimensional space usefully. I wouldn't expect "hundreds" of docs to be able to fill even the smallish-dimensional models you're trying.
There's no theoretical basis for this, but as a very, very rough rule-of-thumb, in `Word2Vec`, I wouldn't expect to train an N-dimensional dense embedding unless the vocabulary has at least N*N distinct words – and plenty of varied, subtly-contrasting usage examples of all those N*N words. (One or a few examples of a word don't tend to give it a good vector – there'll be just a few training-visits to those examples, and your model will reflect whatever those few, likely idiosyncratic, usages imply, rather than the more-generalizable vector that might be possible with a wider range of examples.)
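As a quick illustration of that heuristic, a sanity check might look like this (a sketch only, assuming a Gensim 4.x `Doc2Vec` model named `model` whose vocabulary has already been built):

```python
# Rough check of the N*N heuristic above – not a hard rule, just a warning sign.
vocab_size = len(model.wv)              # distinct words that survived min_count trimming
rough_minimum = model.vector_size ** 2  # heuristic: ~N*N distinct words for N dimensions
if vocab_size < rough_minimum:
    print(f"Vocabulary of {vocab_size} words looks thin for {model.vector_size} dimensions; "
          f"consider a smaller vector_size or more data.")
```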
So I wouldn't expect hundreds-of-docs to give a doc-vector space of even a meager 40 dimensions.
Also: `min_count=1` is almost always a bad idea with these algorithms. Again, one (or even just a few) usage-examples don't have much chance of training a 'good', generalizable representation of a word. But given things like the Zipfian distribution of word frequencies in typical natural-language text, there are a *lot* of such singleton/few-count words. So you may find much of the model's state, & training time, consumed by words with insufficient examples – and those noisy vectors even interfere with the vectors for other, more-common words. *Discarding* rare words, as with the default `min_count=5` (or even higher values when you have more data), usually gives better results.
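For illustration, a minimal Gensim 4.x setup that keeps the default `min_count` might look like this (a sketch only – `tokenized_docs`, a list of token-lists, is a hypothetical stand-in for your own corpus):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# `tokenized_docs` is hypothetical: your own list of token-lists, one per document.
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenized_docs)]

# Leave min_count at its default of 5 (or raise it with more data) so that
# rare words are discarded rather than diluting training with weak examples.
model = Doc2Vec(vector_size=40, min_count=5, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```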
My general sense (from limited experiments) is that `min_count=1` can be especially damaging to `Doc2Vec` trainings, where you usually have only one text per document-tag (id). Those single-appearance words, in their single-appearance docs, either compete with the doc-vector as explainers of their contexts (in PV-DM mode, or PV-DBOW with skip-gram words), or have an oversized influence on the doc-vector – because the data suggests they are 100%-suggestive of the matching doc-vector (when real relationships are almost always more subtle than that).
Further, while it's certainly possible to apply these algorithms to things other than true natural-language text, like XML, their performance could be very sensitive to the specifics of what's in the XML, and your tokenization choices. For example, does the XML contain real natural prose in its elements – or just more rigorously-formatted data? How are elements, attributes, & element-bodies each tokenized (if at all)? How many tokens are in one of your typical docs?
The more the XML is like natural language, the more I'd expect `Word2Vec`-like techniques to do something useful. But if it's just dumps of database tables, with things like scalar data or selections from narrow controlled-vocabularies, it might not work well without lots of other tuning.
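To make the tokenization question concrete, one possible (purely hypothetical) scheme – emitting element tags, attribute names/values, and text bodies as plain tokens – could be sketched like this:

```python
import xml.etree.ElementTree as ET

def xml_to_tokens(xml_string):
    """One of many possible tokenizations: element tags, attribute names/values,
    and whitespace-split text bodies all become plain tokens."""
    tokens = []
    for elem in ET.fromstring(xml_string).iter():
        tokens.append(elem.tag)
        for name, value in elem.attrib.items():
            tokens.extend([name, value])
        if elem.text and elem.text.strip():
            tokens.extend(elem.text.split())
    return tokens
```

Whether anything like this helps depends entirely on how much of the XML is real prose versus rigid structured data.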
I'm not sure what kind of testing you're implying here. What's the 'accuracy' you're evaluating? 0.92 accuracy doesn't seem too bad! And, without seeing your method of evaluation, there could be other problems, & potential improvements, in your approach. (There's not enough info here to conclude "underfitting", especially given the likely data-insufficiency issue.)
It's not clear what cross-validation would mean in this sort of scenario. If you're holding back 1 line (document) from each training, then what's the test, post-training, using that 1 held-back line, that tells you whether the model has succeeded or failed?
Also, `dm` is essentially a boolean value, `vector_size` an integer (usually best left as a multiple of 4 for a slight performance benefit), and `window` an integer (that has no meaning in `dm=0` PV-DBOW mode unless separate option `dbow_words=1` is toggled on). So any automated optimization process that reports floating-point values for these parameters is probably nonsense.
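If you do automate a parameter search, constrain it to legal, discrete values – something like this sketch:

```python
from itertools import product

# Only legal, discrete values: dm is 0/1, vector_size & window are integers.
param_grid = {
    'dm': [0, 1],
    'vector_size': [40, 80, 120],   # multiples of 4
    'window': [3, 5, 8],
}
for dm, vector_size, window in product(param_grid['dm'],
                                       param_grid['vector_size'],
                                       param_grid['window']):
    # ...train & score a Doc2Vec(dm=dm, vector_size=vector_size, window=window, ...)
    # with your own evaluation – a reported dm=0.37 or window=4.6 is never meaningful.
    print(dm, vector_size, window)
```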