Understanding most_similar method aka cosine similarity between documents


Rathish Mohan

Jan 17, 2017, 4:43:18 PM
to gensim
Hello,

I have been using doc2vec for quite some time now and am very pleased with the results that I get.

Recently I wanted to compare the similarity scores I get from the model's vectors when I do something like:

scores1 = spatial.distance.cdist(vector1, vector2, 'cosine')

where vector1 and vector2 are obtained as: vector1 = model.infer_vector(document1, steps=15, alpha=.09), against how doc2vec's most_similar method is implemented.

The source of the most_similar method says this:

************************

def most_similar(self, positive=[], negative=[], topn=10, clip_start=0, clip_end=None, indexer=None):
    """
    Find the top-N most similar docvecs known from training. Positive docs contribute
    positively towards the similarity, negative docs negatively.

    This method computes cosine similarity between a simple mean of the projection
    weight vectors of the given docs.

*************************

What I am trying to understand is this: say I have projected vectors from the model

v1 = [.4, .5, .6, .7, .8, .9]

and

v2 = [.2, .3, .4, .5, .6, .7]

What does "simple mean of the projection weight vectors of the given docs" mean?

And if I had to implement the most_similar method myself, how does the cosine similarity get calculated in doc2vec?

Any insight will be much appreciated.

Regards,
Rathish

Gordon Mohr

Jan 17, 2017, 5:06:50 PM
to gensim
The `positive` and `negative` arguments of `most_similar()` can take lists of vectors. But what this method returns is a list of the doc-vectors most similar to a single target vector. The comment about the `simple mean` just means that when more than one vector is supplied in these parameters, the single target vector used is the mean of all the provided vectors (with any `negative` vectors negated first).

You can see how the cosine-similarity is calculated a few lines down in the method's source, via the dot-product of every candidate-vector against the target vector (`mean`).


Note that for cosine-similarity between similarly unit-normed vectors, higher values (closer to 1.0) indicate more-similar documents. If you're using cosine-distance elsewhere, lower values (closer to 0.0) indicate more-similar documents.
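
To make the steps above concrete, here is a minimal pure-Python sketch of the idea, not gensim's actual code. The helper names (`unit`, `dot`, `most_similar_sketch`) and the extra vector `v3` are made up for illustration, and it assumes every input vector is unit-normed before averaging. It builds the target as the mean of the positive vectors and the negated negative vectors, then ranks candidates by their dot product with that target.

```python
import math

def unit(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def most_similar_sketch(candidates, positive, negative=(), topn=10):
    """Rank candidates by cosine similarity to the mean target vector.

    Hypothetical helper for illustration only; returns (index, similarity) pairs.
    """
    # Positive vectors count as-is, negative vectors are negated.
    signed = [unit(p) for p in positive]
    signed += [[-x for x in unit(n)] for n in negative]
    # The "simple mean": element-wise average of the signed vectors.
    mean = [sum(col) / len(signed) for col in zip(*signed)]
    target = unit(mean)
    # For unit-length vectors, the dot product is the cosine similarity.
    sims = [(i, dot(unit(c), target)) for i, c in enumerate(candidates)]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:topn]

v1 = [.4, .5, .6, .7, .8, .9]
v2 = [.2, .3, .4, .5, .6, .7]
v3 = [.9, -.8, .7, -.6, .5, -.4]  # hypothetical extra vector, not from the question

ranked = most_similar_sketch([v1, v2, v3], positive=[v1])

# Cosine similarity (higher = more similar) vs. cosine distance (lower = more similar).
cos_sim = dot(unit(v1), unit(v2))
cos_dist = 1.0 - cos_sim
```

With a single positive vector the mean is just that vector, so `ranked` puts `v1` itself first. The `'cosine'` metric in scipy is a distance, i.e. one minus the similarity computed here, which is why the two scales run in opposite directions.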

- Gordon

Rathish Mohan

Jan 18, 2017, 1:44:00 PM
to gensim
Thanks for the reply Gordon!

Regards,
Rathish

