Word Mover's Distance maths


Anna N.

Sep 21, 2020, 6:00:20 PM
to gen...@googlegroups.com
Hi everyone,

When I load the word2vec Google News vectors into gensim and run wmdistance between the two documents "cat guitar" and "dog piano", I get 1.9. However, the distance between "cat" and "dog" is 2.9, and between "guitar" and "piano" is 2.2, and I don't understand how the math works.
I was expecting (wmdistance("cat", "dog") + wmdistance("piano", "guitar"))/2 to be 1.9, but that is obviously not the case. And no, measuring cat to piano and dog to guitar does not add up either.
What am I missing here?

Thanks so much,
A.

Gordon Mohr

Sep 23, 2020, 1:14:05 PM
to Gensim
As the WMD calculation doesn't originate with Gensim, I'm not sure anyone here can explain it better than the originating paper (http://proceedings.mlr.press/v37/kusnerb15.pdf), your own experimental calculations on various inputs, and/or a review of the code & intermediate products.

I would say WMD for multiword texts isn't a simple/linear/average combination of word-to-word distances, but the result of a weighted optimization, so it isn't certain to fit simple geometric intuitions. I'd also not especially expect it to do well on comparisons of single words or tiny synthetic texts, as those aren't like the scenarios where the original paper, or later assessments, have suggested it may be useful.

Also, a few quick trials using the 'GoogleNews' word-vectors didn't give me results that were similar to yours, so perhaps there are other problems in your word-vectors or code setup?

For example:

    from gensim.models import KeyedVectors

    gkv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    gkv.wmdistance(['cat'], ['dog'])  # = 0.691
    gkv.wmdistance(['piano'], ['guitar'])  # = 0.740
    gkv.wmdistance(['cat', 'guitar'], ['dog', 'piano'])  # = 0.716

...which, in this case, isn't far from your geometric intuition.
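To see why a 2-words-vs-2-words comparison can land so close to the pairwise average: WMD is the minimum-cost "flow" of word mass from one text to the other. Here's a toy sketch of that optimization, using made-up 2-D vectors and SciPy's generic LP solver, not gensim's actual implementation:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up 2-D "word vectors" -- NOT the GoogleNews embeddings.
vecs = {
    'cat':    np.array([0.0, 0.0]),
    'dog':    np.array([1.0, 0.5]),
    'guitar': np.array([4.0, 0.0]),
    'piano':  np.array([4.5, 2.0]),
}

def wmd(doc1, doc2):
    """Minimize sum_ij T_ij * cost_ij, with T's row sums fixed to doc1's
    word weights and its column sums to doc2's (uniform weights here)."""
    n, m = len(doc1), len(doc2)
    cost = np.array([[np.linalg.norm(vecs[a] - vecs[b]) for b in doc2]
                     for a in doc1]).ravel()
    A_eq, b_eq = [], []
    for i in range(n):                      # row-marginal constraints
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(1.0 / n)
    for j in range(m):                      # column-marginal constraints
        col = np.zeros(n * m); col[j::m] = 1.0
        A_eq.append(col); b_eq.append(1.0 / m)
    return linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq)).fun

pair = wmd(['cat', 'guitar'], ['dog', 'piano'])
avg = (wmd(['cat'], ['dog']) + wmd(['guitar'], ['piano'])) / 2
print(pair, avg)

# With unequal-length texts, mass from both words must flow to 'dog',
# and no simple average of word-to-word distances applies:
print(wmd(['cat', 'guitar'], ['dog']))
```

With equal-length texts and uniform weights, the optimal flow reduces to a 1-to-1 matching, so `pair` happens to equal the best pairwise average; with unequal lengths or repeated words the mass splits, and the result stops fitting that intuition.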

- Gordon

Anna N.

Sep 24, 2020, 7:55:29 AM
to gen...@googlegroups.com
Thank you, Gordon. Your explanation is very helpful. One of the reasons I got the wrong values was that I hadn't normalized the vectors. Another is that instead of feeding each document in as a list of tokens, I mistakenly passed the entire document as one long string. For example, instead of doing:
gkv.wmdistance(['cat', 'guitar'], ['dog', 'piano'])

I did this:
gkv.wmdistance('cat guitar', 'dog piano')

I know this is problematic, since the results I got were inaccurate (and the calculation time was suspiciously fast), but I am unsure how gensim was returning any result at all. I assume the word2vec model does not contain a vector for "cat guitar", and definitely not for any of the longer documents I was trying (some containing over 100 different tokens, all in one string). How was it returning reasonable-looking output when asked to compare two strings that are not in the model?

Thanks again,
Anna
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gensim/7aebce7a-bde1-4a2b-930f-6d4e86d562feo%40googlegroups.com.

Gordon Mohr

Sep 25, 2020, 1:20:33 PM
to Gensim
With Python's duck typing, a string behaves like a list of characters, and each character is itself a 1-character string. So anywhere you supply the string:

    'cat guitar'

...it's essentially the same as...

    ['c', 'a', 't', ' ', 'g', 'u', 'i', 't', 'a', 'r']

...which gets processed as if it were ten one-character "words". Also, the WMD calculation (like training) simply ignores any words not in the model – potentially *all* the words of a text, though many models do have word-vectors for tokens like `i` and `a`.
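For example, with a tiny hypothetical vocabulary standing in for the model's real one:

```python
# Iterating a string yields 1-character strings, so a whole document
# passed as one string becomes ten one-character "words":
tokens = list('cat guitar')
print(tokens)  # ['c', 'a', 't', ' ', 'g', 'u', 'i', 't', 'a', 'r']

# Out-of-vocabulary "words" are silently dropped, so with a toy
# vocabulary only the characters that happen to be known survive:
vocab = {'a', 'i', 'cat', 'dog', 'guitar', 'piano'}  # hypothetical
kept = [t for t in 'cat guitar' if t in vocab]
print(kept)    # ['a', 'i', 'a']
```

So the comparison quietly proceeds on whatever 1-character tokens survive, which is why you got fast, plausible-looking, but meaningless results.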

(A warning is displayed if someone initializes a model's vocabulary with such strings rather than lists of tokens, but we haven't added a similar warning to `wmdistance()`.)

- Gordon



Anna N.

Sep 26, 2020, 1:51:47 PM
to gen...@googlegroups.com
Thanks once again, Gordon. This is very helpful information. Mystery solved, I guess!

Anna
