Error when parallelizing wmdistance with joblib


Joris

unread,
Dec 19, 2016, 10:42:01 AM
to gensim
I am trying to calculate the WMD between the first element of an array (which contains text in the form of lists of words) and the rest of that array. Since this is a very slow process I want to parallelize this process using joblib. This is what I am trying to do:

from joblib import Parallel, delayed
import multiprocessing
import gensim

def wmdistance(cands_descr, stop, parallel, origin=0):
    word2vec_model = gensim.models.Word2Vec.load(ROOT_DIR + '/word2vec_data/word2vec_model_yc_50_10')
    calc_wmdistance = word2vec_model.wmdistance
    cores = multiprocessing.cpu_count()
    wmd = Parallel(n_jobs=cores - 1, verbose=50)(delayed(calc_wmdistance)(cands_descr[0], descr) for descr in cands_descr)

I get a `TypeError: can't pickle instancemethod objects` and I am unable to solve it. Could anybody offer some advice? Thanks!

Gordon Mohr

unread,
Dec 20, 2016, 2:34:29 PM
to gensim
This is more of a joblib/Parallel matter than gensim-specific. 

`calc_wmdistance` (`word2vec_model.wmdistance`) is an instance-bound method, which can't be pickled (to be passed to other processes). It might work for you if you:

(1) Define a global method to do your operation, which expects the model as a parameter. EG:

    def calc_wmdistance(model, doc1, doc2): 
        return model.wmdistance(doc1, doc2)

(2) Use this global as the function passed to `delayed()`, and add the model as the 1st parameter. EG: `delayed(calc_wmdistance)(word2vec_model, cands_descr[0], descr) ...`
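Here's a minimal, runnable sketch of the pattern, using the stdlib multiprocessing pool and a toy stand-in model (the real Word2Vec model, path, and joblib call are specific to your setup; with joblib the same module-level function goes to `delayed()`):

```python
import multiprocessing

class ToyModel:
    """Hypothetical stand-in for the loaded Word2Vec model."""
    def wmdistance(self, doc1, doc2):
        # Toy distance: size of the symmetric difference of the word sets.
        return len(set(doc1) ^ set(doc2))

def calc_wmdistance(model, doc1, doc2):
    """Module-level function taking the model as a parameter.
    Picklable, unlike an instance-bound method on Python 2."""
    return model.wmdistance(doc1, doc2)

if __name__ == "__main__":
    model = ToyModel()
    docs = [["a", "b"], ["a", "b"], ["a", "c"]]
    with multiprocessing.Pool(2) as pool:
        wmd = pool.starmap(calc_wmdistance,
                           [(model, docs[0], d) for d in docs])
    print(wmd)
```

With joblib the equivalent call would be `Parallel(n_jobs=cores - 1)(delayed(calc_wmdistance)(word2vec_model, cands_descr[0], descr) for descr in cands_descr)`.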

While this may solve the error, you may not get the desired speedup, as the model is (likely) quite large and pickle-sending it to the child processes (which then each have a full copy of the model) for each calculation might dominate the runtime. 

It might be better to restructure the code so that each subprocess loads the model once itself (likely also using the `mmap` optional argument to `load()` so that the bulk of the model's memory is shared across processes), then each calculates distances for an equal-sized batch of the target words. That'd best minimize duplicate effort/memory.
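That load-once-per-worker-then-batch structure can be sketched with a pool initializer. This again uses a toy stand-in model so it runs anywhere; with gensim, `init_worker` would instead call something like `gensim.models.Word2Vec.load(path, mmap='r')`:

```python
import multiprocessing

class ToyModel:
    """Hypothetical stand-in for the loaded Word2Vec model."""
    def wmdistance(self, doc1, doc2):
        return len(set(doc1) ^ set(doc2))  # toy distance

_model = None  # per-process global, populated once per worker

def init_worker():
    """Runs once in each worker process. With gensim, load the real
    model here (e.g. with mmap='r' so the vectors are memory-mapped
    rather than fully copied into each process)."""
    global _model
    _model = ToyModel()

def wmd_batch(pairs):
    """Compute distances for one batch of (doc1, doc2) pairs."""
    return [_model.wmdistance(d1, d2) for d1, d2 in pairs]

def chunked(seq, size):
    """Split seq into batches of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

if __name__ == "__main__":
    docs = [["a", "b"], ["a", "b"], ["a", "c"], ["b", "c"]]
    pairs = [(docs[0], d) for d in docs[1:]]
    with multiprocessing.Pool(2, initializer=init_worker) as pool:
        batches = pool.map(wmd_batch, chunked(pairs, 2))
    print([d for batch in batches for d in batch])
```

Each worker pays the model-load cost once instead of once per distance, and batching keeps the per-task pickling overhead down to the document pairs themselves.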

- Gordon

Lev Konstantinovskiy

unread,
Dec 20, 2016, 4:48:27 PM
to gensim
Hi Joris,

BTW there is parallel WMD KNN code here:

http://vene.ro/blog/word-movers-distance-in-python.html

Regards
Lev
