Lukas
Jan 20, 2020, 9:27:38 AM
to ul...@googlegroups.com
Hey,
I sometimes observe that a communicator seems to be split somehow
after a rank failure. More precisely, multiple ranks seem to
think that they have rank 0. (See the attachment for the source I used.)
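
In case it helps, the test does roughly the following. This is only a
simplified sketch of the attached code, built around the ULFM extensions
(MPIX_Comm_revoke / MPIX_Comm_shrink), with all other error handling
trimmed, so names and details may differ from the attachment:

#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_shrink, MPIX_ERR_PROC_FAILED, ... */
#include <signal.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0)
        printf("[rank %d] Starting off with %d ranks.\n", rank, size);

    while (size > 1) {
        /* One rank kills itself to simulate a node failure. */
        if (rank == 1) {
            printf("[rank %d] I will fail now.\n", rank);
            raise(SIGKILL);
        }

        /* A collective on the communicator notices the failure. */
        int rc = MPI_Barrier(comm);
        int ec;
        MPI_Error_class(rc, &ec);
        if (ec == MPIX_ERR_PROC_FAILED || ec == MPIX_ERR_REVOKED) {
            /* Make sure every survivor leaves the barrier, then rebuild. */
            MPIX_Comm_revoke(comm);

            MPI_Comm shrunk;
            MPIX_Comm_shrink(comm, &shrunk);
            MPI_Comm_free(&comm);
            comm = shrunk;
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &size);
            if (rank == 0)
                printf("[rank %d] We've lost one rank, we now have %d ranks left.\n",
                       rank, size);
        }
    }

    printf("[rank %d] Test finished, only one node left, exiting.\n", rank);
    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}

The output of one such run looks like this: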
[rank 0] Starting off with 80 ranks.
[rank 1] I will fail now.
[rank 0] We've lost one rank, we now have 79 ranks left.
[... and so on. Sometimes a few ranks are lost at once.]
[rank 0] We've lost one rank, we now have 68 ranks left.
[rank 1] I will fail now.
[rank 0] We've lost one rank, we now have 55 ranks left.
[...]
[rank 0] We've lost one rank, we now have 7 ranks left.
[rank 1] I will fail now.
[rank 1] I will fail now.
[rank 0] We've lost one rank, we now have 5 ranks left.
[rank 1] I will fail now.
[rank 0] We've lost one rank, we now have 2 ranks left.
[rank 0] We've lost one rank, we now have 1 ranks left.
[rank 0] Test finished, only one node left, exiting.
[rank 0] We've lost one rank, we now have 1 ranks left.
[rank 0] Test finished, only one node left, exiting.
[rank 0] We've lost one rank, we now have 1 ranks left.
[rank 0] Test finished, only one node left, exiting.
[The last two messages appear 7 more times]
[MPI_Finalize does not seem to complete]
I'm using 4.1.0_ulfm-2.1a1, running under a Slurm scheduler with PMI2,
started via mpiexec from an sbatch script. The source code I used is
attached.
I really hope you can help me again!
Regards,
Lukas