Distributed mode without broadcast

Arvind Kalyan

May 20, 2012, 2:21:43 AM
to gen...@googlegroups.com
Does anyone have pointers on how to use gensim in distributed mode without relying on broadcast?

My current setup doesn't let me do broadcast and I can only get 2 workers on my machine. Starting more workers on other machines doesn't seem to help.

Thanks
 

Arvind Kalyan

May 20, 2012, 2:29:35 AM
to gen...@googlegroups.com
On Sat, May 19, 2012 at 11:21 PM, Arvind Kalyan <bas...@gmail.com> wrote:
> Does anyone have pointers on how to use gensim in distributed mode without relying on broadcast?
>
> My current setup doesn't let me do broadcast and I can only get 2 workers on my machine. Starting more workers on other machines doesn't seem to help.


As in, the workers on the other machines are not recognized when I launch my job. It only identifies (at most) the 2 workers on the local machine.


Radim Řehůřek

May 20, 2012, 5:14:53 PM
to gensim
Hello Arvind,

Hmm, that is not possible at the moment. The distributed code relies
on the Pyro nameserver, and the nameserver is located via broadcasting.

But changing that shouldn't be difficult -- in `gensim.utils.getNS()`,
replace `Pyro4.locateNS()` with a direct IP address lookup, i.e.
`Pyro4.locateNS(host, port)`. I think that should be enough (but I
haven't tested it).
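
A minimal sketch of that change, assuming the nameserver address is
passed in through two environment variables. The variable names
`GENSIM_NS_HOST` and `GENSIM_NS_PORT` and the helper functions are
hypothetical and not part of gensim; only `Pyro4.locateNS(host, port)`
is the real Pyro4 call. When neither variable is set, `locateNS` gets
`None` for both and falls back to its usual broadcast lookup.

```python
import os


def nameserver_address(env=None):
    """Return (host, port) of the Pyro nameserver from the environment.

    GENSIM_NS_HOST / GENSIM_NS_PORT are hypothetical variable names used
    only for this sketch. Missing values are returned as None.
    """
    if env is None:
        env = os.environ
    host = env.get("GENSIM_NS_HOST") or None
    port = env.get("GENSIM_NS_PORT")
    return host, (int(port) if port else None)


def get_ns():
    """Locate the nameserver directly instead of via broadcast."""
    import Pyro4  # imported here so the helper above works without Pyro4

    host, port = nameserver_address()
    # With host=None and port=None this behaves like the plain
    # Pyro4.locateNS() call, i.e. broadcast discovery.
    return Pyro4.locateNS(host=host, port=port)
```

The same two variables would have to be set for every worker, the
dispatcher and the script that launches the job, and the nameserver
has to listen on an address the other machines can actually reach.
This only covers locating the nameserver and is untested against a
real multi-machine setup.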

Let me know how that went,
Radim

Arvind Kalyan

May 20, 2012, 5:53:07 PM
to gen...@googlegroups.com
Hi Radim,

Thanks for your response. I did exactly that earlier today but I didn't see any difference in behavior even after hardcoding my host/port params.

I was happy with the results I obtained on the smaller sample datasets we have, running on 2 workers; great work there! But to benchmark against our current implementations, we need this to scale to a really large dataset and to run on servers that are not necessarily on the same broadcast network. I will probably revisit this later; hopefully gensim and/or Pyro will have evolved a bit more by then and be more deterministic.

Thanks and best regards,
Arvind

Radim Řehůřek

May 23, 2012, 1:12:05 PM
to gensim
Hello Arvind,

> I was happy with the results I obtained on the smaller sample datasets we
> have, running on 2 workers; great work there! But to benchmark against our
> current implementations, we need this to scale to a really large dataset
> and to run on servers that are not necessarily on the same broadcast
> network. I will probably revisit this later; hopefully gensim and/or Pyro
> will have evolved a bit more by then and be more deterministic.

Both gensim and Pyro are pretty stable and deterministic. You can open
a feature request on the GitHub issues page: https://github.com/piskvorky/gensim/issues

Extending the cluster discovery beyond a broadcast domain should be
straightforward (though apparently not as straightforward as I thought
above!), so I might get to it soon.

Best,
Radim