gensim.models.doc2vec.Doc2Vec.infer_vector produces different vectors for the same input


Andre Lima

Feb 22, 2018, 7:24:22 PM
to gensim
Hey guys,

I was going through the Doc2Vec tutorial that uses the IMDB Sentiment Dataset (link below) and, at some point, I realised that given the same list of words, gensim.models.doc2vec.Doc2Vec.infer_vector would produce different vector representations on each call. Does anyone know why that is?


The code I ran was something like this (the context was the "Examining Results" section of the notebook):

doc_id = 0
wordList = alldocs[doc_id].words
vector1 = simple_models[0].infer_vector(wordList)
vector2 = simple_models[0].infer_vector(wordList)

# this comparison evaluates to False for all elements,
# and the same conclusion can be drawn by visual inspection

vector1 == vector2

Cheers,

Andre


Gordon Mohr

Feb 22, 2018, 9:36:25 PM
to gensim
The Doc2Vec algorithm uses randomization during initialization & training. Inference is just a form of constrained training – and the `sample` and `negative` parameters, especially, control operations that are driven by a random-number generator. 

So subsequent calls are not guaranteed to be deterministic, unless you take extra steps to try to force determinism. Some possible steps are discussed in the open issue, <https://github.com/RaRe-Technologies/gensim/issues/447>.

Even without determinism, if the model is sufficiently trained and enough inference passes are applied, the resulting vectors should wind up "near" each other, and thus have similar (but not identical) lists of other most-similar vectors.

There's an optional parameter for `infer_vector()`, `steps`, which controls how many inference passes are made over your supplied list-of-tokens. The default is just 5, while many users report better results with a far larger number (especially for short texts). Using a starting inference `alpha` that matches the usual training starting alpha, 0.025 (rather than the inference default of 0.1), may also help. And models that are undertrained or overfit – with too little data, too few training `iter` epochs, or too many free parameters (large `size`, small data) – can give more varied responses to similar inference inputs.

- Gordon

Andre Lima

Feb 23, 2018, 10:00:04 AM
to gensim
Thanks Gordon!

Just to let you (and other readers) know: the difference is not substantial. I just ran a test that goes like this:
1. compute an inferred vector representation for each document
2. check whether each inferred vector's most similar trained vector refers to the same document.

Results were:
Model Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t2)
      98382 matches, 1618 misses (accuracy 0.98382)

Model Doc2Vec(dbow,d100,n5,mc2,s0.001,t2)
      98475 matches, 1525 misses (accuracy 0.98475)

Model Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t2)
      94347 matches, 5653 misses (accuracy 0.94347)

That is good enough for the task I have in mind.

Thanks again for your prompt help!

Regards,

Andre

Gordon Mohr

Feb 23, 2018, 3:21:40 PM
to gensim
Glad it's working well! Your process – "check that inference for a training text gives a vector closest to the bulk-trained vector for that same text" – is a good sanity check of model training/inference. (And if inferred vectors aren't close to the bulk vectors, or to each other across repeated runs on the same text, it usually means some combination of more training passes, more data, more inference steps, or smaller dimensionality would be a good idea – because the model isn't yet good at forcing the same texts to the same meaningful locations.)

One other advanced technique is to tune or replace the bulk-trained vectors with intensely-inferred vectors. This has the potential advantage of using vectors that were all created, completely, with only the final, frozen model. (The bulk-trained vectors left over at the end of initial training received much of their tuning early in the model's training, with coarser updates and non-final values for the word/internal weights.)

This can be expensive-in-time, as `infer_vector()` isn't multithreaded for bulk calculations, but I have seen the vectors from such a process do slightly better as features for downstream classifiers. (There's no built-in support to help, but some related functionality is a wishlist item: <https://github.com/RaRe-Technologies/gensim/issues/515>.)

One other note: I see you're trying the `dm_concat` mode as well. Note that this results in much slower, larger models and (despite the claims in the original 'Paragraph Vectors' paper) there aren't well-reproduced examples of this mode helping much over the other simpler modes. So on the one hand: don't be surprised if despite its giant memory/time requirements, it still just offers middling performance on your evaluations. But on the other hand: if it *does* ever turn out to be best for your dataset, it'd be great to hear about the size-of-data, and type-of-evaluation, where it turns out to be worth the cost. 

- Gordon