Hello,
I am a PhD student at the University of Santiago de Compostela, in Spain. I am trying out DataMPI and I have a question.
I am launching a Java program on a cluster with 4 computing nodes, each of which has 8 cores. The command is this one:
mpidrun -mode COMMON -O 31 -A 1 -jar DataMPIProlnat.jar DataMPIProlnat /Datos/Wikipedias/WikipediaTest.txt /Datos/Wikipedias/saidaDataMPI
Actually, I don't need an aggregator task, but if I set -A to 0, it doesn't work. According to the output produced by the program, the distribution of processes is something like this:
node1: Rank 0 to 7
node2: Rank 8 to 15
node3: Rank 16 to 23
node4: Rank 24 to 31
This distribution can change, but processes 0 to 7 always run together on the same node, as do 8 to 15, and so on.
The thing is that I've noticed that the node where processes 0 to 7 are running is slower than the others. I have run several executions to check whether it is a problem with one particular node, but it is not: it always happens on whichever node is running processes 0 to 7. If I run top on this node, I can see that process 0 is using about 300% of the CPU.
As a consequence, the other processes on this node run slower than the processes on the other nodes, and the execution time of the whole program is determined by the processes running on this node.
So, my question is: is this because the process with rank 0 is coordinating the rest of the processes and uses CPU time for this?
Or is it maybe because I am not actually using the aggregator, even though I specify it in the mpidrun command?
As an example, some execution times:
Rank: 26. Time: 291.087 seconds
Rank: 28. Time: 346.387 seconds
Rank: 10. Time: 374.219 seconds
Rank: 27. Time: 373.876 seconds
Rank: 14. Time: 384.833 seconds
Rank: 8. Time: 388.848 seconds
Rank: 11. Time: 389.071 seconds
Rank: 20. Time: 407.365 seconds
Rank: 12. Time: 411.784 seconds
Rank: 23. Time: 409.859 seconds
Rank: 25. Time: 409.053 seconds
Rank: 17. Time: 419.079 seconds
Rank: 13. Time: 423.223 seconds
Rank: 30. Time: 422.54 seconds
Rank: 24. Time: 425.876 seconds
Rank: 9. Time: 443.026 seconds
Rank: 15. Time: 440.794 seconds
Rank: 29. Time: 442.323 seconds
Rank: 22. Time: 442.808 seconds
Rank: 16. Time: 453.815 seconds
Rank: 19. Time: 463.651 seconds
Rank: 21. Time: 463.868 seconds
Rank: 18. Time: 470.588 seconds
Rank: 4. Time: 496.915 seconds
Rank: 5. Time: 508.04 seconds
Rank: 6. Time: 515.287 seconds
Rank: 1. Time: 518.023 seconds
Rank: 2. Time: 533.121 seconds
Rank: 3. Time: 542.59 seconds
Rank: 7. Time: 583.004 seconds
Rank: 0. Time: 815.559 seconds
As you can see, the execution times of ranks 0 to 7 are considerably higher than the others, and that of rank 0 is higher still. For a program that takes a few hours to run, this can be a problem.
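To make the gap easier to see, here is a small Java sketch that averages the times listed above per block of 8 ranks. The class and method names are only illustrative, and it assumes that ranks are placed on nodes in consecutive blocks of 8, as in the distribution shown earlier. No time is listed for rank 31, so the last block is averaged over 7 values.

```java
import java.util.Arrays;

class NodeTimes {
    // Per-rank times in seconds, indexed by rank (0 to 30), copied from the list above.
    static final double[] TIMES_BY_RANK = {
        815.559, 518.023, 533.121, 542.59, 496.915, 508.04, 515.287, 583.004,   // ranks 0-7
        388.848, 443.026, 374.219, 389.071, 411.784, 423.223, 384.833, 440.794, // ranks 8-15
        453.815, 419.079, 470.588, 463.651, 407.365, 463.868, 442.808, 409.859, // ranks 16-23
        425.876, 409.053, 291.087, 373.876, 346.387, 442.323, 422.54            // ranks 24-30
    };

    // Mean time per block of ranksPerNode consecutive ranks.
    // Assumption: each block of consecutive ranks runs on one node.
    static double[] meanPerNode(double[] timesByRank, int ranksPerNode) {
        int nodes = (timesByRank.length + ranksPerNode - 1) / ranksPerNode;
        double[] means = new double[nodes];
        for (int node = 0; node < nodes; node++) {
            int from = node * ranksPerNode;
            // The last block may be shorter if some ranks reported no time.
            int to = Math.min(from + ranksPerNode, timesByRank.length);
            means[node] = Arrays.stream(timesByRank, from, to).average().orElse(Double.NaN);
        }
        return means;
    }

    public static void main(String[] args) {
        double[] means = meanPerNode(TIMES_BY_RANK, 8);
        for (int node = 0; node < means.length; node++) {
            int first = node * 8;
            int last = Math.min(first + 7, TIMES_BY_RANK.length - 1);
            System.out.printf("ranks %d-%d: mean %.1f s%n", first, last, means[node]);
        }
    }
}
```

With these numbers, the block with ranks 0 to 7 averages roughly 564 seconds, while the other three blocks average roughly 407, 441 and 387 seconds.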
Thank you very much!!