Hi there,
1. How did you measure bandwidth for mpi4py? Did you treat MPI_Send and MPI_Recv as a single communication, i.e. time only MPI_Send or only MPI_Recv (since MPI_Send may block until the message is received), or did you measure the combined time of MPI_Send + MPI_Recv? (See the ping-pong sketch after this list for the kind of measurement I have in mind.)
2. I'm not sure I fully understand why you didn't present any measurements of multi-node bandwidth. Didn't mpi4py show any speed-up over Dask there? Did you try enabling InfiniBand, and does Dask support it?
3. The article contains the following statement: "typically, users would limit the number of processes per node below the CPU core count only in specific scenarios, such as memory-bound applications". My question is: don't users limit the number of processes per node below the CPU core count only in specific scenarios such as "compute-bound" applications, to gain performance? If not, why does the limit apply to memory-bound applications?
4. Why is communication between the parent and the spawned child processes achieved through network modules, and which network modules do you mean? Conversely, why do processes started as part of the world group (MPI_COMM_WORLD) communicate through a faster shared-memory channel? (See the Spawn() sketch below for the scenario I'm asking about.)
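
To make question 1 concrete, this is the kind of ping-pong measurement I have in mind. It is only a minimal sketch: the 64 MiB message size, the repetition count, and the GB/s convention are my own placeholders, not taken from the article.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 64 * 1024 * 1024           # 64 MiB message (placeholder size)
reps = 10
buf = np.empty(nbytes, dtype='u1')

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)    # ping
        comm.Recv(buf, source=1, tag=1)  # pong
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=1)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # One convention counts both directions of the round trip; another halves
    # the round-trip time and reports a single direction. Which did you use?
    bw_gbs = (2 * reps * nbytes) / elapsed / 1e9
    print(f"round-trip bandwidth: {bw_gbs:.2f} GB/s")
```

Run with, for example, `mpiexec -n 2 python pingpong.py`.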
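And regarding question 4, this is the spawn scenario I'm asking about (again only a sketch; the file name child.py and the two-process count are hypothetical): the parent reaches the children through the intercommunicator returned by Spawn(), while the spawned children share their own MPI_COMM_WORLD.

```python
# parent.py -- launches two children and talks to them over the
# intercommunicator returned by Spawn()
import sys
from mpi4py import MPI

children = MPI.COMM_SELF.Spawn(sys.executable, args=['child.py'], maxprocs=2)
children.bcast({'msg': 'hello'}, root=MPI.ROOT)  # parent is the broadcasting side
children.Disconnect()
```

```python
# child.py (hypothetical) -- each child belongs to its own MPI_COMM_WORLD and
# reaches the parent through the intercommunicator from Get_parent()
from mpi4py import MPI

parent = MPI.Comm.Get_parent()
world = MPI.COMM_WORLD              # communicator shared by the spawned children
data = parent.bcast(None, root=0)   # receive from rank 0 of the remote (parent) group
print(f"child {world.Get_rank()} of {world.Get_size()} got {data}")
parent.Disconnect()
```

Is it the parent-to-child path over the intercommunicator that goes through the network modules, while traffic inside the children's MPI_COMM_WORLD uses shared memory?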
Kind regards,
Iaroslav