Gensim's approach to cosine similarity for sets of vectors (i.e., n_similarity)

Scott Klarenbach

Feb 12, 2015, 3:21:01 PM
to gen...@googlegroups.com
I'm wondering if someone can provide some insight into the underlying theory and justification behind how gensim computes the similarity between two sets of vectors.

For example, n_similarity(['restaurant', 'japanese'], ['sushi', 'shop']) => 0.6154

Currently, gensim takes the mean vector of each set of vectors and then computes the cosine similarity of the resulting means (roughly like the sketch below). But there are other approaches.
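
Here is what I mean, as a sketch. It assumes `model` is an already-trained or loaded word2vec model; in newer gensim versions the same calls live on `model.wv`.

import numpy as np

a = ['restaurant', 'japanese']
b = ['sushi', 'shop']

# Average each set of word vectors, then take the cosine of the two means.
mean_a = np.mean([model[w] for w in a], axis=0)
mean_b = np.mean([model[w] for w in b], axis=0)
manual = np.dot(mean_a, mean_b) / (np.linalg.norm(mean_a) * np.linalg.norm(mean_b))

print(manual)                    # should match (or very nearly match) the built-in below
print(model.n_similarity(a, b))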

One could create an M x N similarity matrix, where each element is the cosine similarity of one pair of vectors, and then normalize the matrix down to a score between 0 and 1 using an L2 norm or something.  Another approach would simply be to treat each similarity matrix as an M*N vector and then compare those vectors, or normalize them somehow.
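
Concretely, something like this (again just a sketch on top of the same `model` as above):

import numpy as np

a = ['restaurant', 'japanese']
b = ['sushi', 'shop']

# M x N matrix of pairwise cosine similarities between the two word sets.
sim_matrix = np.array([[model.similarity(w1, w2) for w2 in b] for w1 in a])

# Different ways of collapsing the matrix to a single score:
score_mean = sim_matrix.mean()
score_max = sim_matrix.max()

# Or treat the whole matrix as one M*N vector and L2-normalize it.
flat = sim_matrix.ravel()
flat_unit = flat / np.linalg.norm(flat)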

I'd just like some insight into the tradeoffs of each approach.  For example, taking the mean of a similarity matrix gives a very different score than taking the cosine similarity of the two mean vectors.  I'm wondering if there is a theory behind which to use when, or if it's a trial-and-error sort of thing for each application.

Thanks. 

Radim Řehůřek

Feb 18, 2015, 2:04:41 PM
to gen...@googlegroups.com
Hello Scott,

that will depend very much on what "similarity" means for your application.

The "average vectors for all items in A, average vectors for B, take cosine between the two averages" in word2vec simulates a common scenario, where it is assumed words within A share some common theme (brought out by averaging), as do words in B. This crude method works surprisingly well in many cases, in the same way bag-of-words does.

Another approach is to model the vector of A jointly, based on all its words in tandem, for example using LDA or doc2vec.
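
With doc2vec, that might look something along these lines (just a sketch; `my_tokenized_docs` and the parameters are placeholders you would choose for your own data, and the parameter names follow recent gensim releases):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim import matutils
import numpy as np

# Placeholder training data: a list of tokenized documents.
corpus = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(my_tokenized_docs)]
d2v = Doc2Vec(corpus, vector_size=100, epochs=40, min_count=2)

# Infer one vector per word set, jointly over all of its words.
vec_a = d2v.infer_vector(['restaurant', 'japanese'])
vec_b = d2v.infer_vector(['sushi', 'shop'])
similarity = np.dot(matutils.unitvec(vec_a), matutils.unitvec(vec_b))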

I don't have much insight into the other suggestions you mention -- taking max / min / mean / median of a similarity matrix are all reasonable options. If you describe your scenario in more detail, it should be easier to see which one of these options makes more sense.

Hope that helps,
Radim

David Haas

Jul 8, 2020, 12:25:21 PM
to Gensim
Radim, do you think it'd be more useful to normalize each embedding in the sequence before taking the average? My understanding is that a word vector's length is a function of the word's frequency in the corpus, which we wouldn't care about too much if we're simply trying to get a representation of the word sequence's meaning.
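
Concretely, I mean something like this (just a sketch, with `wv` standing in for a loaded KeyedVectors instance):

import numpy as np

def normalized_mean(words, wv):
    # Unit-normalize each word vector before averaging, so frequent words
    # with larger magnitudes don't dominate the mean.
    vecs = [wv[w] / np.linalg.norm(wv[w]) for w in words]
    return np.mean(vecs, axis=0)

query = normalized_mean(['restaurant', 'japanese'], wv)
doc = normalized_mean(['sushi', 'shop'], wv)
score = np.dot(query, doc) / (np.linalg.norm(query) * np.linalg.norm(doc))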

Ultimately, I'm comparing a short query to a set of documents, and I'm trying to decide whether to use the in-in function described in this brief paper, or to just average the word vectors of the query and of each document and take the cosine similarity.

I'd appreciate your advice if you have the time.

Thanks,
David

Radim Řehůřek

Jul 9, 2020, 4:31:18 PM
to Gensim
Definitely – I've seen the individual vectors not only normalized, but also weighted by something like IDF (or some other application-dependent weighting scheme).
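
Roughly along these lines (just a sketch; `wv` is a loaded set of word vectors, and the `idf` table is whatever weighting your application calls for, computed from your own corpus):

import numpy as np

def weighted_mean(words, wv, idf):
    # Unit-normalize each vector, then weight it (here by IDF) before averaging.
    vecs = [idf.get(w, 1.0) * wv[w] / np.linalg.norm(wv[w]) for w in words]
    return np.mean(vecs, axis=0)

query = weighted_mean(['restaurant', 'japanese'], wv, idf)
doc = weighted_mean(['sushi', 'shop'], wv, idf)
score = np.dot(query, doc) / (np.linalg.norm(query) * np.linalg.norm(doc))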

Radim

Gordon Mohr

Jul 9, 2020, 11:44:42 PM
to Gensim
I'd suggest trying it various ways & seeing what works well for your needs. 

In some cases the *non-normalized* vectors have, in their non-unit magnitude, some hint of how pure/strong that word's meaning is. (Polysemous and filler words tend to have weaker magnitudes.) Weighting words by other measures, as Radim suggests, may also be relevant. 

And measures like "Word Mover's Distance" or "Soft Cosine Similarity" avoid collapsing all the words into a single vector before comparing sets-of-words (but are correspondingly more expensive to calculate). 
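
Both are available in gensim, roughly like this (a sketch; `wv` is a loaded KeyedVectors instance, and WMD needs an extra earth-mover's-distance package installed):

# Word Mover's Distance is a *distance*, so lower means more similar.
dist = wv.wmdistance(['restaurant', 'japanese'], ['sushi', 'shop'])

# Soft cosine is available via gensim.similarities (WordEmbeddingSimilarityIndex
# plus SparseTermSimilarityMatrix), but needs a Dictionary and bag-of-words
# representations to set up, so it's a bit more involved.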

- Gordon