The alpha values your code is passing to `train()` are insane.
A negative ending alpha means that by the end of training, the model will literally be trying to make itself worse with each training example. Documents with very-similar words may still wind up near each other, having gone through the same wild ride into opposite-land, but overall model utility will likely be weak, and docs that go through a later inference process won't land anywhere similar. (Inference will, by default, use the sane alpha values you specified at initialization... but that won't be much use on a model that's been anti-trained so much.)
But also: in your code, you don't even need to call `train()`. By supplying the corpus on the line that creates the model, the initialization will do all necessary training using the supplied corpus. (You only need to call `build_vocab()` and then `train()` later if you *didn't* supply a corpus at model initialization.) So the bulk-trained vectors in your model went through one training with sensible alpha values, then an extra training with nonsense values. On the other hand, the inferred vectors are going through one training with sensible alpha values – but on a model whose internal weights were last trained in the nonsense mode and then frozen. It's quite understandable the vectors wouldn't be comparable.
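For illustration, a rough sketch of the two equivalent patterns (assuming `corpus` is an iterable of `TaggedDocument` objects; the parameter values are just placeholders):

```python
from gensim.models.doc2vec import Doc2Vec

# Pattern A: supply the corpus at initialization – vocabulary-discovery and
# all training happen right here; no separate train() call is needed.
model = Doc2Vec(documents=corpus, vector_size=100, epochs=20)

# Pattern B: create the model without a corpus, then do the steps explicitly.
model = Doc2Vec(vector_size=100, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```

Do one or the other, not both – doing both just repeats training.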
I highly suggest always enabling logging at the INFO level. That would likely have made it clear that training was happening twice.
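For example, the usual Python incantation (the format string is just one common choice):

```python
import logging

logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
```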
You may get better results simply by not calling `train()` at all, since you've already done training in the instance initialization.
Separately:
* 4000 docs, especially if many are just 3-4 words each, is a very, very small corpus for `Doc2Vec`. Published work uses tens-of-thousands to millions of docs, each of dozens to hundreds or thousands of words. It might never work well in such a case, but also: it may work better with a smaller model (fewer dimensions), to avoid overfitting, or with a simpler mode like `dm=0` (PV-DBOW) – see the sketch after this list.
* it's unclear what `tags` were actually attached to each document in your setup. You're accessing the doc by both the raw int 572, and by a long string like '572: 06_246.xml | Silia v Minister for Immigration & Multicultural & Indigenous Affairs [2006] FCA 246 (1 March 2006)' – but then the results indicate the primary tags may be strings-of-integers, like "572". For clarity, you should pick one canonical ID for each document and stick with it.
* it looks like there may be significant duplication in the dataset: documents 572, 630, and 650, at least, are the same 4 words. That's usually not good: it makes the effective size of the corpus – its essential variety for distinguishing documents – even smaller than the overall document count. (It could make sense to train the model on deduplicated data, but then externally assign all documents with the same words the same vector.)
* it looks like the corpus may be sorted to group similar topics together. Training works better if there's not such clumping, so at least one initial shuffle could help. (And, after a shuffle, being consistent in how documents are tagged becomes extra-important, or you might be mixing pre-shuffle and post-shuffle document positions if using plain int indexes.)
* You shouldn't really think of the similarity values as 'percentages'. They range from -1.0 to 1.0, but their effective ranges within a model can be strongly influenced by other model parameters, and they're most meaningful only when compared to other values from the same model – not as any absolute idea of "how much overlap" two docs exhibit. So if, for example, you were testing different `vector_size` dimensionalities, one model might tell you 2 quite-similar documents have a similarity of X, while another model reports similarity Y for those same 2 docs, with X and Y very different. But if in both models they're still each other's nearest neighbor, and the relative rankings compared to other docs are sensible, there's no real meaning to the difference between X and Y.
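As a rough sketch pulling several of the points above together – dedupe, one canonical string tag per doc, an initial shuffle, a smaller PV-DBOW model – where `raw_docs` is a made-up placeholder for a list of `(canonical_id, word_list)` pairs with string IDs, and `model.dv` is the gensim-4.x doc-vectors attribute:

```python
import random
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Keep one training doc per unique word-sequence, remembering which
# original IDs share that text, so duplicates can get the same vector later.
unique = {}
for doc_id, words in raw_docs:
    unique.setdefault(tuple(words), []).append(doc_id)

# One canonical string tag per unique text.
tagged = [TaggedDocument(words=list(words), tags=[ids[0]])
          for words, ids in unique.items()]

# Shuffle once so similar-topic docs aren't clumped together in training order.
random.shuffle(tagged)

# Smaller, simpler model for a tiny corpus: fewer dimensions, PV-DBOW mode.
model = Doc2Vec(documents=tagged, dm=0, vector_size=50, min_count=2, epochs=40)

# Externally give every duplicate the vector of its canonical twin.
vectors = {dup_id: model.dv[ids[0]]
           for ids in unique.values() for dup_id in ids}
```

Whether those exact parameter values help is something only experiments on your own data can settle.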
- Gordon