doc2vec.infer_vector produces a non-normalized vector


Gregory Larchev

Feb 5, 2016, 5:18:48 PM
to gensim
I noticed that the doc2vec.infer_vector method produces a vector that's not normalized. Is this by design? I suppose it doesn't matter when computing the cosine similarity, but wouldn't it affect the training of the vector when the number of steps is greater than 1?

Gordon Mohr

Feb 5, 2016, 7:38:50 PM
to gensim
It is by design: during training, there's no unit-magnitude-normalization, and inference is just a constrained form of training. 

Any normalization is a final optional step, and in some cases I've seen the non-normalized vectors work better as features fed to downstream tasks. 

(Which also implies: if you'll be doing inference in one of the `dm=1` modes, you probably *don't* want to ever call the `init_sims(replace=True)` variant, because it will throw away the 'true' un-normalized `syn0` word-vector values.)

- Gordon
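
To illustrate why replacing the raw values with normalized ones is lossy, here is a minimal sketch in plain Python. It uses hypothetical two-dimensional vectors rather than a real gensim model, and `unit_normalize` is an illustrative helper, not a gensim function. Two vectors that point the same way but differ in magnitude become indistinguishable once normalized, so the original magnitudes cannot be recovered afterwards.

```python
import math


def unit_normalize(vec):
    """Return a new list: vec scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    # A zero vector has no direction, so return it unchanged.
    return [x / norm for x in vec] if norm else list(vec)


# Hypothetical raw vectors: same direction, different magnitudes.
short_vec = [0.3, 0.4]
long_vec = [3.0, 4.0]

# Normalize into copies so the raw values stay available.
short_unit = unit_normalize(short_vec)
long_unit = unit_normalize(long_vec)

# After normalization the two vectors match, so the magnitude is gone.
same_direction = all(
    math.isclose(a, b) for a, b in zip(short_unit, long_unit)
)
print(same_direction)
```

Keeping the normalized values in a separate copy, as above, leaves the raw vectors intact for any later inference.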

Gregory Larchev

Feb 8, 2016, 11:31:58 AM
to gensim
I see, thanks!

Gregory Larchev

Feb 8, 2016, 2:30:57 PM
to gensim
Actually, it looks like the model vectors (returned by the call to mymodel.docvecs[tag]) are not normalized either. I guess the only way to normalize the vectors (aside from doing it myself, of course) is to call init_sims()?



Gordon Mohr

Feb 9, 2016, 4:32:58 PM
to gensim
Yes, `init_sims()` will bulk-normalize all the vectors in `model.docvecs.doctag_syn0`, saving them in `model.docvecs.doctag_syn0norm`. You'd then have to fetch them from there by index. 

(It may make sense to add a norming option to `infer_vector()`.)

- Gordon
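
As a rough picture of what bulk normalization and fetching by index look like, here is a sketch in plain Python. The nested list `doctag_vectors` is a hypothetical stand-in for `model.docvecs.doctag_syn0` (in gensim these are NumPy arrays), and `normalize_rows` is an illustrative helper, not gensim's implementation.

```python
import math


def normalize_rows(matrix):
    """Return a new matrix whose rows each have unit Euclidean length."""
    normed = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        # Leave all-zero rows unchanged to avoid dividing by zero.
        normed.append([x / norm for x in row] if norm else list(row))
    return normed


# Hypothetical stand-in for doctag_syn0: one row per document tag.
doctag_vectors = [
    [3.0, 4.0],
    [0.5, 0.0],
    [-1.0, 1.0],
]

# Stand-in for the normalized copy (the role doctag_syn0norm plays).
doctag_vectors_norm = normalize_rows(doctag_vectors)

# Fetch a normalized vector by its integer row index.
first = doctag_vectors_norm[0]
print(first)
```

The raw rows are left untouched, so both the original and the normalized values remain available under the same index.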

Dhiraj Patnaik

Oct 19, 2018, 6:35:24 AM
to Gensim
How will you fetch it from `model.docvecs.doctag_syn0norm`, and after using these normalised vectors, how will I compare it with the original ones to produce that particular text? Please let me know.
-Dhiraj

Gordon Mohr

Oct 19, 2018, 11:38:58 AM
to Gensim
It's unclear what you mean by "it" on this 2.5-year-old question. There's no "it" that needs to be retrieved if you're using `infer_vector()`, as the subject line mentions. And if you pass a non-normalized vector into `most_similar()`, it will work just as well, calculating the same cosine similarities.

It's better to ask a clear standalone question in a new thread, explaining what you want to do, why, what you've tried, and how the results you've seen aren't sufficient. 

- Gordon
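
The claim that a non-normalized query gives the same cosine similarities can be checked with a small sketch. This is plain Python with hypothetical vectors and tags, not a call into gensim's `most_similar()`; `cosine_similarity` and `rank` are illustrative helpers. Cosine similarity divides by both vector lengths, so rescaling the query changes neither the scores nor the ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical non-normalized query, e.g. an inferred vector.
query = [0.3, -1.2, 0.4]
query_norm = math.sqrt(sum(x * x for x in query))
unit_query = [x / query_norm for x in query]

# Hypothetical document vectors keyed by made-up tags.
candidates = {
    "doc_a": [1.0, -2.0, 0.5],
    "doc_b": [-0.5, 0.2, 1.0],
    "doc_c": [0.1, -0.9, 0.2],
}


def rank(q):
    """Tags ordered from most to least similar to q."""
    return sorted(
        candidates,
        key=lambda tag: cosine_similarity(q, candidates[tag]),
        reverse=True,
    )


# Both forms of the query produce the same ordering.
print(rank(query) == rank(unit_query))
```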