which vectors are used for cosine similarity in Word2Vec


amjass

May 25, 2022, 5:26:02 AM
to Gensim
Hi!

I am doing an orthogonal alignment to compare the same word across models, using a histwords gist implementation (`smart_procrustes_align_gensim`):

after alignment, I compare the cosine similarity between the same word across models:

from scipy.spatial.distance import cosine
1-cosine(model_one.wv['word1'], model_two.wv['word1'])

all steps above work fine: the vectors in the second model change and the cosine value reflects this change. I notice, however, that in their implementation they carry out the SVD on the normed vectors but insert the aligned embedding back into the `.wv.vectors` attribute of the second model.

for clarity, may I ask why `.wv.most_similar` (within a model) uses the unnormalised vectors, and in which instances normed vectors would be required? presumably the cosine calculation above is legitimate, as it is acting on the aligned vector, which would correspond to `model_two.wv['word1']`?

thank you!

Gordon Mohr

May 29, 2022, 1:51:30 PM
to Gensim
When I search for code that meets your description – unsure if I'm looking at the same procrustes code that you're using! – it looks like the intent of that process is to perform a calculation using intentionally unit-normed vectors as input. Then, a resulting set of modified vectors gets assigned back into the 'target' vectors. And, in a variant of the code I find for an older version of Gensim, that result is *also* placed into a variable which usually holds a cached unit-normed set of vectors (`syn0norm`) – though I'm not sure the results of the prior calcs are guaranteed to already be unit-normed. So I can't comment on whether that code is doing an appropriate thing – you'd have to ask its authors.

With regard to the `.most_similar()` within a usual Gensim `KeyedVectors` model, however, it's *always* applying a unit-norming to supplied keys (words) and vectors. It does this in the common case where a single lookup key or vector is supplied, but also as part of averaging any multiple `positive` (or `negative`) vectors together. It's also applying a de-facto unit-norming to the full range of candidate vectors in the model, as part of the cosine-similarity calculation. It does these unit-normalizations to match the calculations done by the original released `word2vec.c` source code on which it was based.

So with regard to the question, "why, .wv.most_similar (within a model) uses the unnormalised vectors", the simple answer is: it always takes the (possibly-not-normalized) vectors as its input, but then does in fact unit-normalize them in the course of calculating bulk similarities. 

In any case where `.vectors` might already be unit-normalized by prior choices (as perhaps with an earlier procrustes step?) this redundant division-by-magnitudes (all 1.0) has no numeric effect.
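For instance, a small numpy check – with toy vectors as a stand-in for real `.vectors`, not anything from an actual model – showing that dividing already unit-normed rows by their magnitudes (all 1.0) is a no-op:

```python
import numpy as np

# Toy vectors standing in for rows of a `.vectors` array (hypothetical data).
vecs = np.array([[3.0, 4.0], [1.0, 0.0], [0.6, 0.8]])

# Unit-normalize once, as happens in the course of similarity calculations.
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
unit = vecs / norms

# Re-normalizing already-unit vectors divides by magnitudes of 1.0: a no-op.
renorms = np.linalg.norm(unit, axis=1, keepdims=True)
assert np.allclose(renorms, 1.0)
assert np.allclose(unit / renorms, unit)
```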

- Gordon

amjass

May 29, 2022, 2:59:21 PM
to Gensim
Hi Gordon, 

thank you very much for the detailed reply and for the information - the question was missing a crucial link, which is the implementation of the SVD I was using!
(sorry) - https://gist.github.com/zhicongchen/9e23d5c3f1e5b1293b16133485cd17d8

thank you very much for the explanation of the use of unit-normed vectors in `.most_similar`.

for procrustes - yes, the normalised vectors are taken prior to SVD - so I understand the comment about redundant division - this would be when comparing within a model after alignment, correct? still the `.most_similar` method - what about when comparing across models? why is it legitimate to do a dot product with a normalised vector in the second model, whose embedding matrix has been aligned to the first model, with an 'unnormalised vector' from the first model?

If I use a canonical example (which I have tested) and do the alignment between two Gensim Word2Vec models from English text across two time periods: when not aligned, examples of the cosine between (for example) dog or king between model1 and model2 are random. After alignment, the cosine between the word dog in model1 and the word dog in model2 is extremely high (>0.9) (example below) - so the alignment 'works'. But based on your explanation, my understanding is that the dot product occurs between one unit-normalised vector and another non-unit-normalised vector - is my understanding correct, or is there something I have missed? Clearly the alignment works and the cosine distance (or similarity) reflects this!

sorry for the verbose question - I am learning matrices and vectors, so I also want to get a deeper understanding of how vectors are utilised in w2v.

temporal english text test - 

unaligned - 1-cosine(model_one.wv['dog'], model_two.wv['dog'])
>>> 0.2 -- vectors in different coordinate space!

aligned - 
smart_procrustes_align_gensim(model_one, model_two)
1-cosine(model_one.wv['dog'], model_two.wv['dog'])
>>> 0.92 -- vectors aligned

thank you!

Gordon Mohr

May 29, 2022, 6:23:12 PM
to Gensim
On Sunday, May 29, 2022 at 11:59:21 AM UTC-7 amjass wrote:
Hi Gordon, 

thank you very much for the detailed reply and for the information - the question was missing a crucial link, which is the implementation of the SVD I was using!
(sorry) - https://gist.github.com/zhicongchen/9e23d5c3f1e5b1293b16133485cd17d8

thank you very much for the explanation of the use of unit-normed vectors in `.most_similar`.

for procrustes - yes, the normalised vectors are taken prior to SVD - so I understand the comment about redundant division - this would be when comparing within a model after alignment, correct? still the `.most_similar` method - what about when comparing across models?

In general, without special extra steps, you *can't* compare between models. Each vector only has meaning in its relative distances/orientations against other vectors in the same model, trained together. 

Now, you've taken some extra steps via this alignment code that purports to translate both models to be comparable. I've not used the method/code you're referring to, so have no opinion on its fitness for that purpose. But, you report that it seems to create the desired compatibility. 

At the end of that process, the changed `other_embed` model had its original vectors replaced with the output of the process. Maybe that process had the side-effect of also unit-normalizing those vectors. (I'm not sure if that's guaranteed mathematically; an earlier version of the code you've linked seemed to imply, by explicitly overwriting the `.syn0norm` value, that such unit-normalization will have occurred. But maybe that code was doing something wrong! I don't know.)

But whether they're unit-normalized or not, the code you're using has replaced the prior `.vectors` with a new array-of-vectors.

Whether those vectors are good for any purpose or not, whether they're nonsense or not, `.most_similar()` just does its calculations: it takes some indication of a reference point (either by a single vector, or multiple `positive`/`negative` vectors normed & averaged). It then correctly calculates the cosine-similarity of that reference point against *every* vector in the set-of-vectors, then reports the top-N most-similar results. It performs this calculation properly, mechanistically, no matter the quality of the new vectors you've poked into the model. You can consider it a black-box. 
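As a rough sketch of that black-box logic – a simplified re-implementation for illustration only, not gensim's actual source, and covering just the single-positive-vector case – it amounts to:

```python
import numpy as np

def most_similar_sketch(vectors, query, topn=3):
    """Illustrative top-N cosine-similarity lookup over an (n, d) array."""
    # Unit-normalize the query reference point (gensim also norms & averages
    # multiple positive/negative inputs before this step).
    q = query / np.linalg.norm(query)
    # Cosine similarity against *every* row: dot product over row magnitudes.
    norms = np.linalg.norm(vectors, axis=1)
    sims = vectors @ q / norms
    # Report the top-N indices with their similarities, most similar first.
    order = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in order]
```

The real method works over a model's keyed vocabulary and returns words rather than row indices, but the arithmetic is the same regardless of whether the stored vectors are sensible or nonsense.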

Whether the end results of that process work for your goals is an empirical question: try them, test them. 
 
why is it legitimate to do a dot product with a normalised vector in the second model, whose embedding matrix has been aligned to the first model, with an 'unnormalised vector' from the first model? If I use a canonical example (which I have tested) and do the alignment between two Gensim Word2Vec models from English text across two time periods: when not aligned, examples of the cosine between (for example) dog or king between model1 and model2 are random. After alignment, the cosine between the word dog in model1 and the word dog in model2 is extremely high (>0.9) (example below) - so the alignment 'works'. But based on your explanation, my understanding is that the dot product occurs between one unit-normalised vector and another non-unit-normalised vector - is my understanding correct, or is there something I have missed?

You say, "the dot product occurs between one unit normalised vector and another non unit normalised". That's only superficially true in the source of `.most_similar()`. The proper formula for cosine-similarity is still followed in that source-code. In the line...


    dists = dot(self.vectors[clip_start:clip_end], mean) / self.norms[clip_start:clip_end]

...the variable `mean` is already unit-normalized. Whether the `self.vectors` are already unit-normed or not, the division by `self.norms` at the end ensures that, for the purpose of the cosine-similarity calculation, the results are exactly the same as if the `.vectors` were unit-normed before being a term of the dot-product. That is, the above line is mathematically equivalent to:

    dists = dot((self.vectors[clip_start:clip_end] / self.norms[clip_start:clip_end]), mean) 

It just happens to require less interim calculation, and fewer interim-result arrays, to do the division at the end, against the dot-product's scalar values.
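A quick numpy check of that equivalence, using toy stand-ins for `self.vectors` and the already-unit-normed `mean` (not values from any real model):

```python
import numpy as np

# Toy stand-ins: three candidate vectors and a unit-normed reference `mean`.
vectors = np.array([[3.0, 4.0], [1.0, 2.0], [5.0, 0.0]])
mean = np.array([0.6, 0.8])          # magnitude 1.0 by construction
norms = np.linalg.norm(vectors, axis=1)

# Divide the scalar dot-products at the end (as in the gensim line above)...
dists_after = (vectors @ mean) / norms
# ...versus unit-norming every row first, then taking the dot-product.
dists_before = (vectors / norms[:, None]) @ mean

assert np.allclose(dists_after, dists_before)
```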

If in fact a side effect of the alignment was to pre-unit-normalize all the changed vectors, that `.norms` array will be all `1.0`. But the correct cosine-similarities will be reported in any case.

(And if in fact the alignment unit-normed things, then some magnitude info from the original vectors has been lost. That won't change any of the `.most_similar()` cosine-similarity calculations, which are all magnitude-invariant angular comparisons, but could affect *other* things you may do with the raw vectors that could be sensitive to magnitude, for either better or worse.)
 
clearly the alignment works and the cosine distance (or similarity) reflects this! sorry for the verbose question - I am learning matrices and vectors, so I also want to get a deeper understanding of how vectors are utilised in w2v.

temporal english text test - 

unaligned - 1-cosine(model_one.wv['dog'], model_two.wv['dog'])
>>> 0.2 -- vectors in different coordinate space!

aligned - 
smart_procrustes_align_gensim(model_one, model_two)
1-cosine(model_one.wv['dog'], model_two.wv['dog'])
>>> 0.92 -- vectors aligned

These results are as expected: comparison between different models starts out meaningless. Forcing alignment creates some comparability. (Is that comparability good enough for any particular purpose? Only your own tests can say.)

- Gordon 

amjass

May 30, 2022, 7:34:23 AM
to Gensim
Hi Gordon, 

again, thank you so much for the time taken to answer my question in detail!

For the `.wv.most_similar` explanation - this is all clear. I am still somewhat unclear about why the vectors from one model compared to the vectors from another model produce 'valid' results, when one has been unit-normalised and the original model vector is unchanged.

In other words, using the example of dog from my previous comment: let us say, for argument's sake, that the forced alignment is adequately comparable (i.e. we are satisfied that the meaning of dog does not really change across both models, and the cosine of 0.9 is sufficient). I still cannot understand why the cosine of the unit-normalised vector for dog from the second model can be compared to the non-unit-normalised vector for the word dog in the first model - and still produce a cosine that 'makes sense' (remembering that I have done the cosine similarity using Scipy, with no internal unit-normalising of the vector from the first model; I just call `model_one.wv['dog']`). For me this is where the biggest confusion is, which I still don't quite get!

thank you!

Gordon Mohr

May 30, 2022, 5:57:46 PM
to Gensim
The cosine-similarity calculation strictly reflects the *angle* between vectors. Magnitude doesn't matter. Before attempting a cosine-similarity calculation between 2 vectors, you could unit-normalize one vector, or the other, or both, or neither – and the cosine-similarity will not change.

So it doesn't matter if your alignment process unit-normalized a bunch of things or not. It doesn't affect the cosine-similarity calculation either way. There's no question of what's 'valid' or not. The unit-normalization you're wondering about doesn't change the result, so you're ruminating about an irrelevant matter.

Go ahead, try taking the `model_one.wv['dog']` vector – not at all normalized, if I'm following your assumptions correctly. Try supplying it with its original raw magnitude to `model_two.most_similar()`, to see what results you get. Now try unit-normalizing it. Or dividing it by 2. Or multiplying it by 10. In each case, its magnitude will vary – but the list of most-similar vectors (and their calculated similarities) will be the same, because the actual terms of the cosine-similarity calculation formula ensure its result is magnitude-invariant.
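A tiny numpy check of that magnitude-invariance, using toy 2-d vectors rather than real word-vectors:

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product divided by both magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors (hypothetical values, not from any model).
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

base = cos_sim(a, b)
# Rescaling either vector leaves the angle, and hence the cosine, unchanged.
assert np.isclose(cos_sim(a / np.linalg.norm(a), b), base)  # unit-normed a
assert np.isclose(cos_sim(a / 2, b), base)                  # halved
assert np.isclose(cos_sim(a * 10, b), base)                 # scaled up
```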

(Now, a forced unit-normalization of all vectors might change other common calculations. If you have raw vectors A, B, and C of varied magnitudes, some below 1.0 and some above 1.0, and you calculate `mean(A, B, C)`, you'll get one result. If you then take unit-normalized versions A_normed, B_normed & C_normed, each with magnitude 1.0, and calculate `mean(A_normed, B_normed, C_normed)`, you'll get a different result. Which is better may depend on your specific vectors & goals. So there *may be* a loss if your alignment steps discard the original ragged magnitudes. But it has no effect on direct vector-to-vector cosine-similarity calculations.)
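A quick numpy sketch, with toy vectors standing in for A, B & C (arbitrary magnitudes chosen for illustration), showing the raw mean and the normed mean pointing in different directions:

```python
import numpy as np

# Toy vectors of varied magnitudes, some below 1.0 and some above.
A = np.array([0.3, 0.4])   # magnitude 0.5
B = np.array([6.0, 8.0])   # magnitude 10.0
C = np.array([0.0, 2.0])   # magnitude 2.0

raw_mean = np.mean([A, B, C], axis=0)
normed = [v / np.linalg.norm(v) for v in (A, B, C)]
normed_mean = np.mean(normed, axis=0)

# The raw mean is dominated by the large-magnitude B; the normed mean
# weights all three equally, so the two generally differ in direction.
assert not np.allclose(raw_mean / np.linalg.norm(raw_mean),
                       normed_mean / np.linalg.norm(normed_mean))
```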

- Gordon

amjass

May 31, 2022, 3:49:16 AM
to Gensim
Hi Gordon - 

ok, yes this clarifies it completely. thank you so much for clearing up my doubts and for taking the time to answer my questions! :)
