Hi all,
I have implemented ULFM calls in a finite difference fluid solver (Fortran).
To test the functionality, I use raise(SIGKILL) to kill some of the MPI processes.
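For context, the fault injection looks roughly like the sketch below (simplified; the subroutine name and the C-binding interface are placeholders, not my actual code, and I assume SIGKILL has its usual Linux value of 9):

subroutine inject_failure(comm)
  use mpi
  use iso_c_binding, only: c_int
  implicit none
  integer, intent(in) :: comm
  integer :: rank, ierr
  integer(c_int), parameter :: SIGKILL = 9_c_int  ! assumed Linux signal number

  ! bind to the C library's raise() so the rank can kill itself
  interface
    function c_raise(sig) bind(c, name="raise") result(ret)
      import :: c_int
      integer(c_int), value :: sig
      integer(c_int) :: ret
    end function c_raise
  end interface

  call MPI_Comm_rank(comm, rank, ierr)
  if (rank == 2 .or. rank == 3) then
    ! the process is killed immediately and never returns from here
    if (c_raise(SIGKILL) /= 0) stop 'raise failed'
  end if
end subroutine inject_failure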
I am observing strange behavior depending on the number of MPI processes that I use.
I start the run with spares and always kill processes 2 and 3.
During the recovery, the communicator is shrunk to the surviving processes and the killed processes are replaced with the spares.
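To make the flow concrete, the recovery step is roughly the sketch below (simplified; variable and subroutine names are placeholders, I assume the MPIX calls follow the usual Fortran convention of a trailing ierror argument, and the spare-promotion details are omitted):

subroutine recover(comm, newcomm, ndead)
  use mpi
  ! depending on the MPI library, the MPIX_* extensions may need an extra
  ! module or include file (e.g. mpi_ext / mpif-ext.h in Open MPI)
  implicit none
  integer, intent(in)  :: comm      ! communicator containing the failed ranks
  integer, intent(out) :: newcomm   ! shrunk communicator
  integer, intent(out) :: ndead     ! dead count = old size minus new size
  integer :: size_old, size_new, ierr

  call MPI_Comm_size(comm, size_old, ierr)

  ! the shrink is collective over the surviving processes; a rank still
  ! blocked in another operation on comm would never enter this call
  call MPIX_Comm_shrink(comm, newcomm, ierr)

  call MPI_Comm_size(newcomm, size_new, ierr)
  ndead = size_old - size_new

  ! ... renumber the survivors and promote spares into the vacated ranks ...
end subroutine recover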
I am running in a cluster environment; each MPI process runs on a separate compute node.
The problem that I am encountering is that some of the processes get stuck in the mpix_comm_shrink call.
The number of dead processes after the shrink (size of the original communicator minus size of the shrunk communicator) is larger than 2 (the number of processes I kill).
If I run on 32 MPI nodes I end up with 4 dead nodes; processes 30 and 32 don't exit the shrink call.
If I run on 36 MPI nodes I end up with 2 dead nodes, just as expected.
If I run on 64 MPI nodes I end up with 5 dead nodes; processes 50, 62 and 64 don't exit the shrink call.
If I run on 110 MPI nodes I end up with 2 dead nodes, just as expected.
The additional nodes that 'fail' (get stuck in the shrink) are always the same if I repeat the run with the same number of MPI processes.
Has anyone experienced similar behavior?
Does anyone have any suggestions regarding what could be going on?
Is there something in the mpix_comm_shrink call that could account for such behavior?
Merry Christmas Eve!
Atis