--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gensim/7aebce7a-bde1-4a2b-930f-6d4e86d562feo%40googlegroups.com.
Thank you, Gordon. Your explanation is very helpful.

One of the reasons I got the wrong values was that I hadn't normalized the vectors. Another is that, instead of passing each document as a list of tokens, I mistakenly passed the entire document as one long string. For example, instead of doing:

    gkv.wmdistance(['cat', 'guitar'], ['dog', 'piano'])

I did this:

    gkv.wmdistance('cat guitar', 'dog piano')

I know this is likely problematic, since the results I got were inaccurate (and the calculation time was suspiciously fast), but I am unsure how gensim was returning any result at all. I assume the word2vec model does not contain a vector for "cat guitar", and definitely not for any of the longer documents I was trying (some containing over 100 different tokens, all in one string). How was it returning reasonable-looking output when it was asked to compare two strings that are not in the model?

Thanks again,
Anna
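A likely explanation for the string case, offered as a sketch rather than a statement about gensim's internals: a Python string is itself iterable, so code that loops over a "document" and keeps only in-vocabulary items would see individual characters. The example below uses a hypothetical `toy_vocab` and a hypothetical helper `tokens_seen`; it assumes (without checking the real GoogleNews vocabulary) that single letters have vectors.

```python
# Hypothetical stand-in for the model vocabulary. Assumption: the real
# model also has entries for single letters such as 'c', 'a', 't'.
toy_vocab = {"cat", "dog", "guitar", "piano",
             "c", "a", "t", "g", "u", "i", "r"}

def tokens_seen(document, vocab):
    """Iterate over the input and keep only items found in the vocabulary."""
    return [tok for tok in document if tok in vocab]

# A list of tokens is iterated word by word.
as_list = tokens_seen(["cat", "guitar"], toy_vocab)

# A plain string is iterated character by character; the space is dropped
# because it is not in the toy vocabulary.
as_string = tokens_seen("cat guitar", toy_vocab)

print(as_list)    # ['cat', 'guitar']
print(as_string)  # ['c', 'a', 't', 'g', 'u', 'i', 't', 'a', 'r']
```

Under that assumption, the string call would be comparing bags of letters rather than words, which would also be consistent with the fast runtime: there are at most a few dozen distinct characters no matter how long the document is.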
On Wed, Sep 23, 2020 at 1:14 PM Gordon Mohr wrote:
As the WMD calculation doesn't originate with Gensim, I'm not sure anyone here can explain it any better than the originating paper (http://proceedings.mlr.press/v37/kusnerb15.pdf) and your own experimental calculations on various inputs, and/or a review of the code & interim products.

I would say WMD for multiword texts isn't a simple/linear/average combination of word-to-word distances, but the result of a weighted optimization, so it isn't certain to fit simple geometric intuitions. I'd also not especially expect it to do well on comparisons of single words or tiny synthetic texts, as those aren't like the scenarios where the original introduction, or later assessments, have suggested it may be useful.

Also, a few quick trials using the 'GoogleNews' word-vectors didn't give me results that were similar to yours, so perhaps there are other problems in your word-vectors or code setup?

For example:

    from gensim.models import KeyedVectors
    gkv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    gkv.wmdistance(['cat'], ['dog'])  # = 0.691
    gkv.wmdistance(['piano'], ['guitar'])  # = 0.740
    gkv.wmdistance(['cat', 'guitar'], ['dog', 'piano'])  # = 0.716

...which in this case, isn't far from your geometric intuition.

- Gordon

On Monday, September 21, 2020 at 3:00:20 PM UTC-7, Anna N. wrote:

Hi everyone,

When I load the word2vec Google News vectors into gensim and run wmdistance between the two documents "cat guitar" and "dog piano", I get 1.9. However, when I run the distance between "cat" and "dog" (2.9), and "guitar" and "piano" (2.2), I just don't understand how the math works.

I was expecting (wmdistance("cat", "dog") + wmdistance("piano", "guitar"))/2 to be 1.9, but that is obviously not the case. And no, measuring cat to piano, and dog to guitar, does not add up either.

What am I missing here?

Thanks so much,
A.
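To make the "weighted optimization" point in Gordon's reply concrete, here is a minimal sketch for the special case of two documents with the same number of distinct words, each word carrying equal weight. In that case the optimal transport plan is a one-to-one pairing, so the distance is the cheapest pairing's total cost divided by the number of words. The function `wmd_uniform` is a hypothetical helper, not gensim's implementation. The cat/dog and guitar/piano distances are the ones quoted in the thread; the two cross distances (1.2 and 1.1) are made-up placeholders for illustration only.

```python
from itertools import permutations

def wmd_uniform(cost):
    """Transport cost between two equal-length, uniformly weighted documents.

    cost[i][j] is the distance from word i of document 1 to word j of
    document 2. With uniform weights and equal sizes, an optimal plan is a
    one-to-one pairing, so brute force over pairings is enough for tiny inputs.
    """
    n = len(cost)
    best = min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))
    return best / n

# Rows: 'cat', 'guitar'. Columns: 'dog', 'piano'.
cost = [
    [0.691, 1.2],    # cat-dog (from the thread), cat-piano (placeholder)
    [1.1,   0.740],  # guitar-dog (placeholder), guitar-piano (from the thread)
]

result = wmd_uniform(cost)
print(result)  # about 0.7155, i.e. (0.691 + 0.740) / 2
```

In this small case the result does coincide with the average of the two "matching" word distances, but only because that pairing happens to be the cheapest one. With documents of different lengths, repeated words, or cheaper cross pairings, the optimum no longer reduces to that average.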
To view this discussion on the web visit https://groups.google.com/d/msgid/gensim/d4e7702a-2e13-4cd8-81d6-b43a77e3645bo%40googlegroups.com.