Note that when you do `model.most_similar('ischemia')`, you are not 'inferring' a vector; you are just looking up the word-vector that was learned for 'ischemia' during bulk training. That you're getting back meaningful words from `most_similar()` indicates effective word-training of some sort is happening.
I suspect something is wrong in some other step of your model handling, or in the choice/preparation of your `example` document.
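To make the contrast concrete, here's a rough untested sketch, assuming `model` is your trained Doc2Vec (the token list is just a made-up placeholder):
word_neighbors = model.most_similar('ischemia')  # looks up & compares word-vectors learned during training
doc_vec = model.infer_vector(['acute', 'ischemia', 'noted'])  # computes a brand-new doc-vector from a token list
doc_neighbors = model.docvecs.most_similar([doc_vec])  # compares that new vector against trained *doc* vectors
Only the last two lines involve inference; the first only consults what was already learned.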
Let's assume that `taggeddoc_corpus` is an iterable of your TaggedDocuments, each with a single unique ID tag per document. Also, that `doc_text[id_tag]` will return the text associated with one of the trained tags. I suggest you run & review/share the output of the following (which uses different options than you've been using, to run a quick minimal trial):
model = Doc2Vec(taggeddoc_corpus, size=100, dm=0, min_count=50, sample=1e-06, iter=10, workers=cores)
probe_doc = next(iter(taggeddoc_corpus))  # 1st document
probe_tag = probe_doc.tags[0]
probe_tokens = probe_doc.words
vec = model.infer_vector(probe_tokens, alpha=0.025, steps=100)
similars = model.docvecs.most_similar([vec])
print("original text: %s" % doc_text[probe_tag])
print("as tokens: %s" % probe_tokens)
print("most similars:")
for i, sim in enumerate(similars):
print("___ #%i %s\n%s" % (i, sim[0], doc_text[sim[0]]))
(I haven't run this but it should be roughly correct with the above assumptions.)
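For completeness, here's one hypothetical way the assumed `taggeddoc_corpus` and `doc_text` might be set up; this is just an illustrative sketch (the `raw_texts` placeholder and the string-number tags are my invention, not taken from your code):
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

raw_texts = ["patient shows signs of ischemia", "no abnormalities detected"]  # placeholder texts
doc_text = {}  # maps each unique ID tag back to its original text
taggeddoc_corpus = []
for i, text in enumerate(raw_texts):
    tag = str(i)  # one unique ID tag per document
    doc_text[tag] = text
    taggeddoc_corpus.append(TaggedDocument(words=simple_preprocess(text), tags=[tag]))
The key point is only that every document gets exactly one unique tag, and that the same tag can be used to look the original text back up when eyeballing `most_similar()` results.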
If this gives more sensible results than what you've been seeing, change things back incrementally toward your other goals, staying ready to reverse/debug any single change that seems to hurt.
- Gordon