Nesting functions called by futures.map; printing the worker id, like printing comm.rank in mpi4py


Russell Jarvis

Jul 22, 2016, 11:21:26 PM
to scoop-users
Hi everyone,

Firstly, I just want to say that I think the concept of SCOOP is great.

Secondly, as the subject heading suggests, I wanted to know whether you can nest calls to futures.map.

My use case is doing a parameter sweep of a DEAP genetic algorithm on a cluster.

In an outer loop I want to iterate through different GA population sizes and random-number seeds. However, one part of the GA, specifically the evaluate function, can also be distributed with futures.map.

If I make a nested call to futures.map, can these sub-tasks be launched on an entirely different pool of hosts/CPUs?

Can I somehow specify how many CPUs each call to futures.map uses, and also get their rank/id/worker number?

Thanks a lot for any advice :). Also, in the code below, I was playing around trying to find a function that returns the worker id:

from scoop import futures
from scoop import utils  # this import was missing

def helloWorld(value):
    # me desperately trying to find a function that will return the worker id
    print(utils.getCPUcount())
    print(utils.getHosts())
    return "Hello World from Future #{}".format(value)

def helloWorld_nested(value):
    returnValues = futures.map(helloWorld, range(5))
    print("\n".join(returnValues))
    return "Hello World from Future nested #{}".format(value)


if __name__ == "__main__":
    returnValues = futures.map(helloWorld_nested, range(5))
    print("\n".join(returnValues))
 

Yannick Hold-Geoffroy

Jul 23, 2016, 12:09:52 AM
to scoop-users
Hello,

SCOOP can indeed perform nested futures.map() calls; it's one of its intended use cases. There are examples of this among the provided examples (look at tree.py, for instance).

However, you cannot specify on which pool of hosts the inner loop should be executed, nor "attribute" computing resources to a futures.map() call. Balancing the workload is part of SCOOP's job; it does that automatically.

You can use the value of scoop.worker to get a unique identifier for the current worker in the pool. It is a 2-tuple that is unique within one given computation. It is described in the documentation: http://scoop.readthedocs.io/en/latest/api.html#scoop-constants-and-objects .
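For illustration, a minimal sketch of how that can look (assuming the script is launched with python -m scoop; the function name is just for the example):

from scoop import futures
import scoop

def whoami(value):
    # scoop.worker identifies the worker executing this future
    return "Future #{} ran on worker {}".format(value, scoop.worker)

if __name__ == "__main__":
    for line in futures.map(whoami, range(5)):
        print(line)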

Have a nice day,
Yannick

Russell Jarvis

Jul 23, 2016, 12:21:46 AM
to scoop...@googlegroups.com
Hi Yannick,

I looked at the full_tree.py example and I can see that you are right. SCOOP is great. When I print scoop.worker, I think it outputs the IP address corresponding to each worker CPU.

Do you know if anyone has written functions to give workers a simpler second set of names, like 0, 1, 2, 3 for example?

Thanks again for your help.
Russell.



Yannick Hold-Geoffroy

Jul 23, 2016, 12:44:59 AM
to scoop-users
Hello Russell,

It is indeed the IP address followed by the ephemeral port. This is what is used to uniquely identify workers within a SCOOP pool. If you prefer a number, you could hash() that string...
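For example, a minimal sketch of that hashing idea (worker_number is a hypothetical helper, not part of SCOOP; note that Python 3 salts the built-in hash() per process, so a deterministic digest is safer for a value that should agree across workers):

import hashlib

import scoop

def worker_number(modulus=10000):
    # Derive a small, stable number from the worker identifier.
    ident = str(scoop.worker).encode("utf-8")
    return int(hashlib.md5(ident).hexdigest(), 16) % modulus

Such numbers are arbitrary and can collide; as discussed below, they carry no ordering.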

We didn't give them 'simpler' names on purpose, because we wanted the cluster to be dynamic. What if workers "0" and "1" left after a while, but the computation continued? You would then have a pool starting at "2", which doesn't make any more sense than a bunch of numbers representing an IP followed by a port. Furthermore, addressing workers directly by name is discouraged, as it could encourage users to perform actions that fight against the load balancing, leading to slower execution times that leave users puzzled. Also, out of curiosity, why should a given computing resource be "the first" (or why should workers be ordered at all)? "The first" at what, or ordered on what basis?

By the way, workers leaving the pool mid-computation is a planned feature that has been started but not quite finished in the development version of SCOOP, so I don't encourage you to try it at home at the current time.


Have a nice day,
Yannick

Russell Jarvis

Jul 23, 2016, 7:14:53 PM
to scoop...@googlegroups.com
Hi Yannick,

I agree that, given the cluster is dynamic, it is better to have the IP address for each node.

I don't need the names for manipulating the data on each host; I just want a simpler name for print statements. I just don't find the IP addresses very readable, or easy to tell apart, as they often contain some of the same digits. Having simple small numbers will help me quickly read off the number of CPUs involved and see which CPU is invoked in which function. I agree I can easily use the IP address as a dictionary key and give it an arbitrary small-number name, so I might do that. I don't think I will make this usage a long-term habit; it's just to satisfy my curiosity.
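For what it's worth, a sketch of that dictionary idea (all names are mine; the labels are assigned on the root process in order of first appearance, purely for readable printing):

from scoop import futures
import scoop

def report(value):
    # Return the raw worker identifier alongside the task id.
    return scoop.worker, value

if __name__ == "__main__":
    results = list(futures.map(report, range(20)))
    labels = {}  # worker id -> small readable number
    for worker, _ in results:
        labels.setdefault(worker, len(labels))
    for worker, value in results:
        print("task {} ran on worker {}".format(value, labels[worker]))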

My understanding is that SCOOP and futures are for embarrassingly parallel computations. I also have some non-embarrassingly-parallel code written with mpi4py that I would like to rework using ZeroMQ. Is it okay to use ZeroMQ inside SCOOP to do this?

Thanks again for your help.
Russell.

