Hi all,
While comparing our code with the other team's, I came across something extremely strange. Both teams call MPI_Reduce in exactly the same way and the same number of times (verified with TAU). The only difference is where the call sits: we called MPI_Reduce inside ddot, while the other team called ddot and then called MPI_Reduce on the return value. Our code was about 25% slower until we did the same thing and moved MPI_Reduce out of ddot, at which point it became slightly faster. To give a more concrete example:
Faster:
    ddot()
    MPI_Reduce()
Slower:
    ddot()  // MPI_Reduce call inside ddot
Does anyone have a clue what could cause this?
best,
Kyle
--
Kyle Klein
Ph.D. Student
Department of Computer Science UC Santa Barbara