Hi there,
1. How did you measure bandwidth for mpi4py? Did you treat MPI_Send and MPI_Recv as a single communication, i.e. time only MPI_Send or only MPI_Recv (since MPI_Send may block until the message is received), or did you measure the combined time of MPI_Send + MPI_Recv? (See the ping-pong sketch after this list for the kind of measurement I have in mind.)
2. I'm not sure I fully understand why you didn't present any measurements of multi-node bandwidth. Didn't mpi4py show any speed-up over Dask there? Did you try enabling InfiniBand, and does Dask support it?
3. The article contains the following statement: "typically, users would limit the number of processes per node below the CPU core count only in specific scenarios, such as memory-bound applications". My question is: don't users limit the number of processes per node below the CPU core count only in specific scenarios such as "compute-bound" applications, to gain performance? If not, why does the limit apply to memory-bound applications?
4. Why is communication between the parent and the spawned child processes achieved through network modules, and which network modules do you mean? Conversely, why do processes started as part of the world group (MPI_COMM_WORLD) communicate through a faster shared-memory channel? (See the Spawn() sketch below for the scenario I'm asking about.)
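
To make question 1 concrete, this is the kind of ping-pong measurement I have in mind. It is only a minimal sketch: the 64 MiB message size, the repetition count, and the GB/s convention are my own placeholders, not taken from the article.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 64 * 1024 * 1024           # 64 MiB message (placeholder size)
reps = 10
buf = np.empty(nbytes, dtype='u1')

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)    # ping
        comm.Recv(buf, source=1, tag=1)  # pong
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=1)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # One convention counts both directions of the round trip; another halves
    # the round-trip time and reports a single direction. Which did you use?
    bw_gbs = (2 * reps * nbytes) / elapsed / 1e9
    print(f"round-trip bandwidth: {bw_gbs:.2f} GB/s")
```

Run with, for example, `mpiexec -n 2 python pingpong.py`.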
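And regarding question 4, this is the spawn scenario I'm asking about (again only a sketch; the file name child.py and the two-process count are hypothetical): the parent reaches the children through the intercommunicator returned by Spawn(), while the spawned children share their own MPI_COMM_WORLD.

```python
# parent.py -- launches two children and talks to them over the
# intercommunicator returned by Spawn()
import sys
from mpi4py import MPI

children = MPI.COMM_SELF.Spawn(sys.executable, args=['child.py'], maxprocs=2)
children.bcast({'msg': 'hello'}, root=MPI.ROOT)  # parent is the broadcasting side
children.Disconnect()
```

```python
# child.py (hypothetical) -- each child belongs to its own MPI_COMM_WORLD and
# reaches the parent through the intercommunicator from Get_parent()
from mpi4py import MPI

parent = MPI.Comm.Get_parent()
world = MPI.COMM_WORLD              # communicator shared by the spawned children
data = parent.bcast(None, root=0)   # receive from rank 0 of the remote (parent) group
print(f"child {world.Get_rank()} of {world.Get_size()} got {data}")
parent.Disconnect()
```

Is it the parent-to-child path over the intercommunicator that goes through the network modules, while traffic inside the children's MPI_COMM_WORLD uses shared memory?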
Kind regards,
Iaroslav