Dear all,
I have been testing a fault-tolerant MPI implementation (ULFM) with a finite difference solver.
When running on a cluster, I tried two ways of inducing faults:
- ssh to one of the running nodes and kill the processes one by one using 'kill -9 PID';
- reboot one of the nodes (a more realistic node failure scenario).
With the first option, the run recovers as expected and proceeds.
If I reboot a node, however, the run just hangs.
Is this behavior expected?
Or could it be caused by the way ULFM is installed?
I am using the icldistcomp-ulfm2-2e75c73cc620.tar.bz2 version
of ULFM.
Thank you and best regards,
Atis