Dear ULFM Team,
I’m currently working on my master’s thesis, for which I have to use ULFM to detect and resolve process failures. Unfortunately, I have run into a problem.
I tried to modify example 15.5, “Fault-tolerant iterative refinement with shrink and agreement”, of the Unofficial ULFM Draft document by replacing the MPI_Allreduce with an MPI_Alltoall operation. To test my implementation, random processes are killed with raise(SIGKILL). Example code that reproduces the failure is attached. I use the latest version of ulfm2 and compile with “mpicxx -O3 ./main.cpp”.
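For reference, the recovery logic of that example can be modeled without MPI, which makes the expected behavior easy to state. The following is only a sketch under assumptions, not the attached main.cpp: the names Outcome, Action, agree and recovery_step are hypothetical, the agreement is modeled as a logical AND over the survivors' success flags, and every survivor is assumed to reach the agreement in the same iteration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-ins for what a rank can observe from the collective
// (success, a process-failure error, or a revoked-communicator error).
enum class Outcome { Success, ProcFailed, Revoked };

// What one surviving rank does after the collective returns.
struct Action {
    bool revoke;  // the rank revokes the communicator
    bool shrink;  // the rank takes part in the shrink
};

// Models the agreement as a logical AND over the flags of all survivors,
// so every survivor obtains the same value.
bool agree(const std::vector<Outcome>& outcomes) {
    return std::all_of(outcomes.begin(), outcomes.end(),
                       [](Outcome o) { return o == Outcome::Success; });
}

// Per-rank decisions for one pass through the recovery pattern.
// Assumption: all survivors reach the agreement in the same iteration.
std::vector<Action> recovery_step(const std::vector<Outcome>& outcomes) {
    const bool all_succeeded = agree(outcomes);
    std::vector<Action> actions;
    for (Outcome o : outcomes) {
        // Only ranks that saw the failure directly revoke; the shrink
        // decision depends solely on the agreed value.
        actions.push_back({o == Outcome::ProcFailed, !all_succeeded});
    }
    return actions;
}
```

In this model the shrink flag is identical for all survivors, because it depends only on the agreed value. A run in which only some survivors enter the shrink would therefore contradict the pattern as modeled here.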
The error messages differ, and I get one of the following errors:
1) All surviving processes call MPI_Comm_shrink, but the call results in multiple communicators. For instance, after the call to MPI_Comm_shrink there should be a single communicator with 79 ranks, but instead there is one communicator with 75 ranks and four separate communicators with 1 rank each.
2) Not all processes call MPI_Comm_shrink. The others produce a segmentation fault in MPI_Comm_revoke or MPI_Comm_agree. For instance, 75 processes call MPI_Comm_shrink and successfully produce a new communicator, while at least one of the remaining processes crashes with a segmentation fault.
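The first symptom can be checked mechanically from per-process output. The sketch below assumes each surviving process logs the size it sees for the communicator returned by the shrink, and that these values have been collected into a vector. The helpers tally_sizes and shrink_is_consistent are hypothetical names for illustration, not part of ULFM.

```cpp
#include <map>
#include <vector>

// Input: for each surviving process, the communicator size it reports after
// the shrink (for example, parsed from per-rank log lines).
// Output: reported size -> number of processes that reported it.
std::map<int, int> tally_sizes(const std::vector<int>& reported_sizes) {
    std::map<int, int> tally;
    for (int size : reported_sizes) {
        ++tally[size];
    }
    return tally;
}

// The shrink looks consistent when all survivors report the same size and
// that size equals the number of survivors.
bool shrink_is_consistent(const std::vector<int>& reported_sizes) {
    const std::map<int, int> tally = tally_sizes(reported_sizes);
    return tally.size() == 1 &&
           tally.begin()->first == static_cast<int>(reported_sizes.size());
}
```

Applied to the example above, 75 processes reporting a size of 75 and four processes reporting a size of 1 give two distinct sizes, so the check fails, whereas 79 processes all reporting 79 would pass.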
Note that most of the time no error occurs; I have to run the program multiple times to reproduce the failure. The failure can only be reproduced on an HPC system. This implementation seems to have a higher failure rate when I start the program on 3 nodes with 20 tasks per node.
I run the program with
mpirun -n $SLURM_NTASKS --mca mpi_ft_enable true --mca mpi_ft_detector_thread true --mca mpi_ft_detector_period 0.3 --mca ft_detector_timeout 1 $EXECUTABLE
I have tried out other run configurations by modifying the mca parameters, but the program keeps failing.
Best regards,
Charel Mercatoris
--
You received this message because you are subscribed to the Google Groups "User Level Fault Mitigation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ulfm+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/600385E5-8A54-46B8-B79D-B445510B8A76%40gmail.com.
<main.cpp>