Word Mover's Distance maths


Anna N.

Sep 21, 2020, 6:00:20 PM
to gen...@googlegroups.com
Hi everyone,

When I load the word2vec Google News vectors into gensim and run wmdistance between the two documents "cat guitar" and "dog piano", I get 1.9. However, the distance between "cat" and "dog" is 2.9, and between "guitar" and "piano" is 2.2, and I don't understand how the math works.
I was expecting (wmdistance("cat", "dog") + wmdistance("piano", "guitar"))/2 to be 1.9, but that is obviously not the case. And no, measuring cat to piano and dog to guitar does not add up either.
What am I missing here?

Thanks so much,
A.

Gordon Mohr

Sep 23, 2020, 1:14:05 PM
to Gensim
As the WMD calculation doesn't originate with Gensim, I'm not sure anyone here can explain it better than the originating paper (http://proceedings.mlr.press/v37/kusnerb15.pdf), your own experimental calculations on various inputs, and/or a review of the code & intermediate products.

I would say WMD for multiword texts isn't a simple/linear/average combination of word-to-word distances, but the result of a weighted optimization, so it isn't certain to fit simple geometric intuitions. I'd also not especially expect it to do well on comparisons of single words or tiny synthetic texts, as those aren't like the scenarios where the original paper, or later assessments, have suggested it may be useful.

Also, a few quick trials using the 'GoogleNews' word-vectors didn't give me results that were similar to yours, so perhaps there are other problems in your word-vectors or code setup?

For example:

    from gensim.models import KeyedVectors

    gkv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    gkv.wmdistance(['cat'], ['dog'])  # = 0.691
    gkv.wmdistance(['piano'], ['guitar'])  # = 0.740
    gkv.wmdistance(['cat', 'guitar'], ['dog', 'piano'])  # = 0.716

...which, in this case, isn't far from your geometric intuition.
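To see why a 2-words-vs-2-words comparison can land so close to the pairwise average: WMD is the minimum-cost "flow" of word mass from one text to the other. Here's a toy sketch of that optimization, using made-up 2-D vectors and SciPy's generic LP solver, not gensim's actual implementation:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up 2-D "word vectors" -- NOT the GoogleNews embeddings.
vecs = {
    'cat':    np.array([0.0, 0.0]),
    'dog':    np.array([1.0, 0.5]),
    'guitar': np.array([4.0, 0.0]),
    'piano':  np.array([4.5, 2.0]),
}

def wmd(doc1, doc2):
    """Minimize sum_ij T_ij * cost_ij, with T's row sums fixed to doc1's
    word weights and its column sums to doc2's (uniform weights here)."""
    n, m = len(doc1), len(doc2)
    cost = np.array([[np.linalg.norm(vecs[a] - vecs[b]) for b in doc2]
                     for a in doc1]).ravel()
    A_eq, b_eq = [], []
    for i in range(n):                      # row-marginal constraints
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(1.0 / n)
    for j in range(m):                      # column-marginal constraints
        col = np.zeros(n * m); col[j::m] = 1.0
        A_eq.append(col); b_eq.append(1.0 / m)
    return linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq)).fun

pair = wmd(['cat', 'guitar'], ['dog', 'piano'])
avg = (wmd(['cat'], ['dog']) + wmd(['guitar'], ['piano'])) / 2
print(pair, avg)

# With unequal-length texts, mass from both words must flow to 'dog',
# and no simple average of word-to-word distances applies:
print(wmd(['cat', 'guitar'], ['dog']))
```

With equal-length texts and uniform weights, the optimal flow reduces to a 1-to-1 matching, so `pair` happens to equal the best pairwise average; with unequal lengths or repeated words the mass splits, and the result stops fitting that intuition.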

- Gordon

Anna N.

Sep 24, 2020, 7:55:29 AM
to gen...@googlegroups.com
Thank you, Gordon. Your explanation is very helpful. One of the reasons I got the wrong values was that I hadn't normalized the vectors. Another is that instead of feeding each document in as a list of tokens, I mistakenly passed the entire document as one long string. For example, instead of doing:
gkv.wmdistance(['cat', 'guitar'], ['dog', 'piano'])

I did this:
gkv.wmdistance('cat guitar', 'dog piano')

I know this is problematic, since the results I got were inaccurate (and the calculation time was suspiciously fast), but I am unsure how gensim was returning any result at all. I assume the word2vec model does not contain a vector for "cat guitar", and definitely not for any of the longer documents I was trying (some containing over 100 different tokens, all in one string). How was it returning reasonable-looking output when asked to compare two strings that are not in the model?

Thanks again,
Anna
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gensim/7aebce7a-bde1-4a2b-930f-6d4e86d562feo%40googlegroups.com.

Gordon Mohr

Sep 25, 2020, 1:20:33 PM
to Gensim
With Python's duck typing, a string behaves like a list of characters, and each character is itself a 1-character string. So anywhere you supply the string:

    'cat guitar'

...it's essentially the same as...

    ['c', 'a', 't', ' ', 'g', 'u', 'i', 't', 'a', 'r']

...which gets processed as if it were ten one-character "words". Also, the WMD calculation (like training) simply ignores any words not in the model – potentially *all* the words of a text, though many models do have word-vectors for tokens like `i` and `a`.
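For example, with a tiny hypothetical vocabulary standing in for the model's real one:

```python
# Iterating a string yields 1-character strings, so a whole document
# passed as one string becomes ten one-character "words":
tokens = list('cat guitar')
print(tokens)  # ['c', 'a', 't', ' ', 'g', 'u', 'i', 't', 'a', 'r']

# Out-of-vocabulary "words" are silently dropped, so with a toy
# vocabulary only the characters that happen to be known survive:
vocab = {'a', 'i', 'cat', 'dog', 'guitar', 'piano'}  # hypothetical
kept = [t for t in 'cat guitar' if t in vocab]
print(kept)    # ['a', 'i', 'a']
```

So the comparison quietly proceeds on whatever 1-character tokens survive, which is why you got fast, plausible-looking, but meaningless results.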

(A warning is displayed if someone initializes a model's vocabulary with such strings rather than lists of tokens, but we haven't added a similar warning to `wmdistance()`.)

- Gordon



Anna N.

Sep 26, 2020, 1:51:47 PM
to gen...@googlegroups.com
Thanks once again, Gordon. This is very helpful information. Mystery solved, I guess!

Anna
