Normalizing individual indices of trained word2vec/doc2vec vectors

Gregory Larchev

Apr 20, 2017, 5:28:31 PM
to gensim
Let's say we train a word2vec (or doc2vec) model on a corpus. Now we have a bunch of word (or document) vectors, say of length 100, so each vector is [i1 i2 i3 ... i100]. If we look at all the produced vectors, we may find that the value i1 varies between -0.001 and 0.001, while the value i2 varies between -0.01 and 0.01 (the distribution means may also be non-zero). Thus, each index will make a different contribution to tasks such as word (or document) similarity.

Is there a capability in gensim to normalize each index value of a vector? It should be pretty easy to add, but I was wondering whether it already exists. Has there been any research regarding whether or not such normalization would be beneficial?
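For concreteness, the kind of per-index normalization being asked about could be sketched in a few lines of numpy (made-up vectors for illustration, not a gensim API):

```python
import numpy as np

# Hypothetical example: 5 vectors of length 4, where the values at
# different indices (columns) span very different ranges.
vectors = np.array([
    [0.001, 0.01, -0.5, 2.0],
    [-0.001, -0.01, 0.5, -2.0],
    [0.0005, 0.005, 0.25, 1.0],
    [-0.0005, -0.005, -0.25, -1.0],
    [0.0002, 0.002, 0.1, 0.4],
])

# Per-index ("column-wise") standardization: subtract each column's
# mean and divide by its standard deviation, so every index
# contributes on a comparable scale to similarity computations.
means = vectors.mean(axis=0)
stds = vectors.std(axis=0)
normalized = (vectors - means) / stds

# Each column now has (approximately) zero mean and unit variance.
print(normalized.mean(axis=0))
print(normalized.std(axis=0))
```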

Gordon Mohr

Apr 20, 2017, 6:23:55 PM
to gensim
For word-similarity comparisons, it is customary to normalize all vectors to be of unit length. This is done automatically before returning similarity results, and the normed vectors are cached in a property `syn0norm` (alongside the raw `syn0`). You can take a look at the `init_sims()` method to see what's done, including an option for replacing the non-normalized vectors entirely (for example, to save memory).
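The unit-length normalization that `init_sims()` performs is essentially the following (a standalone numpy sketch with toy vectors, not the gensim source itself):

```python
import numpy as np

# Toy stand-in for a syn0 array: 3 word-vectors of length 4.
syn0 = np.array([
    [3.0, 4.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 0.0, 0.0],
])

# Scale each row to unit (L2) length, producing the equivalent of
# syn0norm. With unit-length rows, cosine similarity between two
# words reduces to a simple dot product of their rows.
norms = np.linalg.norm(syn0, axis=1, keepdims=True)
syn0norm = syn0 / norms

print(np.linalg.norm(syn0norm, axis=1))  # each row now has length ~1.0
```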

It's not certain that such normalization is always a good or necessary step for other uses of word-vectors. For example, words with more generic meanings, or multiple competing senses, can tend to have smaller magnitudes, whereas words with stronger/singular meanings can tend to have larger magnitudes. This could be a useful signal for some purposes; in particular, when simply averaging word-vectors together as a simple vector for a longer text, this larger contribution from less-ambiguous words may be desirable – an alternative to other frequency-based ways of weighting words differently.
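The effect described above – raw-magnitude averaging letting stronger words dominate a text vector – can be seen in a tiny sketch (made-up vectors standing in for trained ones):

```python
import numpy as np

# Two made-up word-vectors: a "generic" word with a small magnitude
# and a "specific" word with a large magnitude.
generic = np.array([0.1, 0.0])
specific = np.array([0.0, 2.0])

def unit(v):
    """Scale a vector to unit (L2) length."""
    return v / np.linalg.norm(v)

# Averaging the raw vectors lets the larger-magnitude word dominate
# the direction of the resulting text vector...
raw_avg = (generic + specific) / 2

# ...while averaging unit-normalized vectors weights both equally.
norm_avg = (unit(generic) + unit(specific)) / 2

print(raw_avg)   # direction dominated by `specific`
print(norm_avg)  # both words contribute equally
```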

- Gordon