WmdSimilarity effectiveness

Erik Řehulka

Sep 24, 2021, 9:37:29 AM
to Gensim
Hi,

I have a small dataset of 40 sentences, and I would like to compute a similarity score for each possible pair. I am using the gensim model 'word2vec-google-news-300', and the sentences are of ordinary length.

However, when I try to create an instance of WmdSimilarity with the sentences and the google-news-300 model, it takes more than 180 seconds.
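
Roughly, my setup looks like this (the tokenization and variable names are just illustrative):

    import gensim.downloader as api
    from gensim.similarities import WmdSimilarity

    model = api.load('word2vec-google-news-300')     # loading the vectors is itself slow
    corpus = [s.lower().split() for s in sentences]  # my 40 tokenized sentences

    index = WmdSimilarity(corpus, model)             # this is the step that takes >180 seconds
    sims = index[corpus[0]]                          # WMD-based similarities of sentence 0 vs. all 40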

Is there some way to make the computation more efficient? Or are there other ideas/tools/libraries for computing WMD similarity between the sentences efficiently (with results in the range [0, 1])? Thanks!

Best Regards
Erik Ř.

Gordon Mohr

Sep 26, 2021, 1:55:22 AM
to Gensim
The WMD calculation is inherently a lot more expensive, in CPU time, than typical single-vector to single-vector similarity calculations. It's essentially searching for an optimized transformation of one text to another, according to some rules, then reporting the total 'effort' of that best-transformation. And, that search becomes a lot harder with longer texts. 

I suspect you'd only want to use the brute-force approach of `WmdSimilarity`, comparing against all candidates, if your data is very small and/or you're very patient and/or you can devote the extra work/budget needed to fan out the cost over many parallel replicas. 

Some published work about WMD tries to figure ways to approximate WMD's benefits with less calculation - for example, by using a cheaper initial calculation to eliminate most pairwise comparisons from consideration, then using full WMD only to rank the best candidates found by a faster process. (One recent paper that reviews, & extends, some ideas to speed WMD: https://arxiv.org/pdf/1912.00509.pdf)
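
As a rough sketch of that prune-then-rerank idea (not an existing Gensim feature; here a plain mean-vector cosine similarity stands in for the cheaper first pass, & the helper names are just illustrative):

    import numpy as np

    def mean_vector(model, tokens):
        # Cheap document representation: average of the in-vocabulary word-vectors.
        return np.mean([model[w] for w in tokens if w in model], axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def wmd_rerank(model, query, candidates, top_n=10):
        # 1st pass: rank every candidate by mean-vector cosine similarity (cheap).
        q = mean_vector(model, query)
        cheap = [cosine(q, mean_vector(model, c)) for c in candidates]
        shortlist = np.argsort(cheap)[::-1][:top_n]
        # 2nd pass: full WMD only on the shortlist; 1/(1 + distance) maps the distance
        # into (0, 1], the same conversion WmdSimilarity uses for its similarity scores.
        scored = [(int(i), 1.0 / (1.0 + model.wmdistance(query, candidates[i])))
                  for i in shortlist]
        return sorted(scored, key=lambda x: x[1], reverse=True)

Here `query` is one tokenized text & `candidates` is a list of tokenized texts; with only 40 short texts this buys little, but with many candidates the expensive WMD step is confined to the shortlist.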

I mention other possible strategies that *might* help make Gensim WMD more practical in some usage scenarios, in a recent StackOverflow answer: https://stackoverflow.com/questions/69290768/how-to-speed-up-word-movers-distance-computation-on-text-in-dataframe/69323072#69323072

But, support for such shortcuts & optimizations isn't yet in, or planned for, Gensim - though adding anything proven to help (either from the literature or other novel ideas) would likely be a welcome contribution. 

- Gordon

Erik Řehulka

Sep 30, 2021, 5:22:44 AM
to Gensim
Thanks for the response.

The problem is in creating the instance of WmdSimilarity; that is the part I can't control. I can definitely try to parallelize computing the similarities (everyone against everyone), as described in the StackOverflow thread, but that is not the whole problem.
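
For the everyone-against-everyone part, I had something like this in mind (just a sketch; it assumes fork-style multiprocessing on Linux, so the workers inherit the already-loaded model instead of re-loading it):

    from itertools import combinations
    from multiprocessing import Pool

    # `model` (KeyedVectors) and `corpus` (the 40 tokenized sentences) are module-level,
    # so forked worker processes can reuse them without pickling the 1.6 GB model.
    def wmd_pair(pair):
        i, j = pair
        return i, j, model.wmdistance(corpus[i], corpus[j])

    pairs = list(combinations(range(len(corpus)), 2))  # 40 * 39 / 2 = 780 unique pairs
    with Pool(processes=4) as pool:
        distances = pool.map(wmd_pair, pairs)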

Thanks for the paper though; I will take a look at it and maybe try to implement it somehow in Gensim. Hopefully I will work something out. Or would it maybe help to use a smaller model, other than 'word2vec-google-news-300' (1.6 GB)?

- Erik
 
On Sunday, September 26, 2021 at 7:55:22 UTC+2, Gordon Mohr wrote:

Gordon Mohr

Sep 30, 2021, 1:33:17 PM
to Gensim
Because the WMD calculation uses only pairwise similarities between the words in your texts (it never searches through all of the model's words, as `most_similar()` does), I wouldn't expect a smaller word-vector model to help much. (Maybe it'd help if you made the model so small that many of your texts aren't fully represented, and thus effectively get shrunk, but it'd likely be better to slim your texts in some other deliberate way. Or maybe it'd help if you're currently hitting RAM swapping, but that should be fixed in other ways.)

But, if you wanted to try a smaller model, you can use the optional `limit` parameter of `load_word2vec_format()` to load only the leading (most-frequent-words) subset of a word2vec-format file on disk. For example...

    w2v_model = KeyedVectors.load_word2vec_format(GOOGNEWSPATH, binary=True, limit=500000)

...to load just 500K, rather than the full 3M, words. (But again: I wouldn't expect much help in the WMD case from using such a truncated model.)

- Gordon
