Sorry if this is a duplicate post; I didn't see my previous post show up in the list.
I am trying to optimize the collective communication involving allgather in a script. I have written two versions, one using comm.allgather and one using comm.Allgather. Preliminary benchmarking shows that the comm.Allgather version is about 4x faster when launching 5 parallel processes. However, as the number of parallel processes goes up (to 30, which is what I am going to launch in the production run), the performance difference shrinks: the Allgather version is only slightly faster than the other one. Since comm.Allgather promises near-native C speed, am I doing something wrong? Is there a way to improve it?
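
A minimal sketch of the kind of comparison described above, assuming mpi4py with NumPy buffers. This is not the actual script: the array size n, the repeat count, and the helper names best_time and main are assumptions made for illustration.

```python
import time

try:
    import numpy as np
    from mpi4py import MPI
except ImportError:  # lets the timing helper be used without MPI installed
    np = MPI = None


def best_time(fn, repeats=20):
    """Return the smallest wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def main(n=100_000, repeats=20):
    # n and repeats are illustrative values, not taken from the original script.
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    send = np.full(n, rank, dtype=np.float64)     # each rank contributes n doubles
    recv = np.empty((size, n), dtype=np.float64)  # preallocated once, reused per call

    def pickled():
        comm.Barrier()        # line the ranks up before each call (barrier is timed too)
        comm.allgather(send)  # lowercase: pickles the object and returns a list

    def buffered():
        comm.Barrier()
        comm.Allgather(send, recv)  # uppercase: sends the raw buffer into recv

    t_obj = best_time(pickled, repeats)
    t_buf = best_time(buffered, repeats)

    # Report the slowest rank, since a collective finishes only when every rank does.
    t_obj = comm.allreduce(t_obj, op=MPI.MAX)
    t_buf = comm.allreduce(t_buf, op=MPI.MAX)
    if rank == 0:
        print(f"ranks={size} n={n} allgather={t_obj:.6f}s Allgather={t_buf:.6f}s")


if __name__ == "__main__" and MPI is not None:
    main()
```

Saved under a hypothetical name such as bench.py, this could be run once with mpiexec -n 5 python bench.py and once with -n 30 to see how the gap between the two calls changes with the process count.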