Word Mover's Distance in Gensim (semi-normalized results?)

Tedo Vrbanec

Feb 18, 2023, 2:23:03 PM

to Gensim

Are the results of wmdistance(doc1, doc2) somehow semi-normalized? The results look "nice", but they sometimes go up to 1, which is undesirable for conversion to similarity (S=D-1). Therefore, they are technically not normalized.

Gordon Mohr

Feb 19, 2023, 2:58:25 PM

to Gensim

As far as I know, Gensim's `wmdistance()` returns the calculation as described in the original papers defining Word Mover's Distance, without extra normalization/scaling.

Word Mover's Distance is not the same thing as cosine-distance, and isn't inherently limited to a maximum of 1.0 or 2.0, so it should not be considered convertible to a cosine-style similarity by a simple calculation like `S=D-1`.

(The `wmdistance()` method does offer an optional keyword argument `norm`, per the method docs, if you want to turn off the default unit-normalization of individual word-vectors before the word-mover's-distance calculation. But I don't think either setting of this parameter ensures that the distance result stays within any definite range. Arbitrarily-long texts of arbitrarily-different words could have quite large distances.)
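To see how the `norm` setting affects your own texts, a minimal sketch along these lines might help. It assumes `kv` is an already-loaded Gensim `KeyedVectors` object (for example `model.wv`) from a Gensim version whose `wmdistance()` accepts `norm`; the helper name `wmd_with_both_norm_settings` is made up for illustration.

```python
def wmd_with_both_norm_settings(kv, doc1_tokens, doc2_tokens):
    """Return (distance with norm=True, distance with norm=False).

    `kv` is assumed to be a gensim KeyedVectors instance; the documents
    are lists of string tokens. Neither result is bounded above.
    """
    normalized = kv.wmdistance(doc1_tokens, doc2_tokens, norm=True)
    unnormalized = kv.wmdistance(doc1_tokens, doc2_tokens, norm=False)
    return normalized, unnormalized
```

Called as `wmd_with_both_norm_settings(kv, "the cat sat".split(), "a dog ran".split())`, it returns the two distances side by side for comparison.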

If for your purposes you wanted to convert an unbounded non-negative distance value into a [-1.0, 1.0] range that's *like* that of cosine-similarity, you could use some other arbitrary but convenient conversion formula, like say:

quasi_similarity = (2.0 / (1.0 + distance)) - 1.0

…but keep in mind this still isn't real cosine-similarity, with a distribution-of-values that's necessarily-comparable with actual cosine-similarities you have from other models/coordinate-spaces.
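As a rough sketch, that conversion could be wrapped in a small Python function. The function name and the negative-input check are illustrative choices, not anything Gensim provides.

```python
def quasi_similarity(distance: float) -> float:
    """Map a non-negative distance to a cosine-like score.

    A distance of 0 maps to 1.0, a distance of 1 maps to 0.0, and
    ever-larger distances approach (but never reach) -1.0.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return (2.0 / (1.0 + distance)) - 1.0


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 3.0):
        print(d, quasi_similarity(d))
```

The mapping is monotonic, so it preserves the ranking of document pairs; only the scale changes.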

- Gordon

Tedo Vrbanec

Feb 21, 2023, 6:32:10 AM

to Gensim

Thank you very much, Gordon!
