Optimizing collective operation involving allgather using mpi4py

Liu Cong

Jun 17, 2020, 3:42:46 AM
to mpi4py
Dear all,

Sorry if I am posting a similar topic twice; I didn't see my previous post show up in the list.

I am trying to optimize the collective communication involving allgather in a script. I have written two versions, one using comm.allgather and the other using comm.Allgather. Preliminary benchmarking shows that the comm.Allgather version is about 4X faster when launching 5 parallel processes. However, as the number of parallel processes goes up (to 30, which is what I will launch in the production run), the performance difference isn't that large. The Allgather version is only slightly faster than the other one. Since comm.Allgather promises near-native C speed, am I doing something wrong? Is there a way to improve it?
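
For context, a minimal sketch of what the two variants look like; the array size, dtype, and names below are illustrative, not taken from the actual script:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    n = 1_000_000                      # illustrative per-rank message size
    local = np.full(n, comm.Get_rank(), dtype=np.float64)

    # lowercase allgather: pickles the array on every rank, returns a list
    gathered_list = comm.allgather(local)

    # uppercase Allgather: raw buffer exchange into a preallocated array
    gathered = np.empty(n * comm.Get_size(), dtype=np.float64)
    comm.Allgather(local, gathered)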

Thanks,
Cong

Lisandro Dalcin

Jun 17, 2020, 10:55:53 AM
to mpi...@googlegroups.com
On Wed, 17 Jun 2020 at 10:42, Liu Cong <liuco...@gmail.com> wrote:
Dear all,

Sorry if I am posting a similar topic twice; I didn't see my previous post show up in the list.


First posts are moderated to prevent spam, sorry.

I am trying to optimize the collective communication involving allgather in a script. I have written two versions, one using comm.allgather and the other using comm.Allgather. Preliminary benchmarking shows that the comm.Allgather version is about 4X faster when launching 5 parallel processes. However, as the number of parallel processes goes up (to 30, which is what I will launch in the production run), the performance difference isn't that large.

Are you launching all these processes on a single compute node or workstation?
 
The Allgather version is only slightly faster than the other one. Since comm.Allgather promises near-native C speed, am I doing something wrong? Is there a way to improve it?

Well, you should determine your baseline C speed before attempting what may be premature optimizations. You can use IMB (the Intel MPI Benchmarks) or the OSU micro-benchmarks to test Allgather in C with 5 and 30 processes, and look at the timings for message sizes similar to the ones you are using in your Python code. My bet is that you will confirm that you cannot optimize further. I have no idea what kind of network you are running on, or how many cores per compute node you are using, but it could be that the latency of a 30-process run is large enough to make the extra pickling cost of the lowercase allgather() irrelevant.
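
To compare like with like, the mpi4py side can be timed with MPI.Wtime over the same message size you feed to the C benchmark. A rough sketch (the repetition count and message size are placeholders):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    n = 1_000_000                      # match the per-rank size of your real data
    send = np.zeros(n, dtype=np.float64)
    recv = np.empty(n * comm.Get_size(), dtype=np.float64)
    reps = 10

    comm.Barrier()                     # synchronize before timing
    t0 = MPI.Wtime()
    for _ in range(reps):
        comm.Allgather(send, recv)
    elapsed = (MPI.Wtime() - t0) / reps
    if comm.Get_rank() == 0:
        print("mean Allgather time: %.6f s" % elapsed)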


--
Lisandro Dalcin
============
Research Scientist
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/

Liu Cong

Jun 17, 2020, 1:18:02 PM
to mpi4py
Thanks for your quick reply, Lisandro! The 5-process run is launched on a single compute node, while the 30-process run is launched across 3 different nodes.

It's a good idea to determine the C MPI speed. I will compile the OSU benchmark suite and keep you posted about the results.

Best,
Cong

Liu Cong

Jun 18, 2020, 6:02:13 PM
to mpi4py
Hi Lisandro,

I compiled the OSU benchmarks and tested the latency for a 30-process run with the same node configuration as my previous test. The results confirmed your guess: the network latency (0.77 s) is comparable to the mpi4py Allgather time (0.80 s). So the mystery is resolved! Given the similar performance of allgather and Allgather in mpi4py in my previous test, could you elaborate on which situations are best suited to the Allgather function?

Thanks,
Cong 

Lisandro Dalcin

Jun 19, 2020, 6:44:15 AM
to mpi...@googlegroups.com
On Fri, 19 Jun 2020 at 01:02, Liu Cong <liuco...@gmail.com> wrote:
 
Given the similar performance of allgather and Allgather in mpi4py in my previous test, could you elaborate on which situations are best suited to the Allgather function?


Allgather should never be slower. Allgather does not require extra resources (CPU and memory to perform pickling). Allgather is most appropriate when your messages involve communication of very large arrays. Allgather can be used with MPI-2 dynamic process management, so that an MPI application written in Python can interoperate with another app written in any other language that supports MPI. And the list goes on... It is ultimately your responsibility to decide what to use, and whether trading the convenience of allgather for the ultimate efficiency of Allgather is good business. mpi4py gives you choices; so be free, choose, and live with the consequences.
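
For completeness, a sketch of the case where the lowercase variant is the natural choice: arbitrary picklable Python objects, which the buffer-based Allgather cannot send without you serializing them yourself (the payload here is purely illustrative):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    payload = {"rank": comm.Get_rank(), "status": "ok"}   # any picklable object
    results = comm.allgather(payload)                     # list of dicts, one per rank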

Liu Cong

Jun 19, 2020, 12:50:21 PM
to mpi4py
Awesome! Thanks for your insight. 

Best,
Cong