MPI_Comm_shrink failure in combination with MPI_Alltoall


Charel Mercatoris
Feb 15, 2021, 4:58:37 AM
to ul...@googlegroups.com

Dear ULFM Team,

 

I’m currently working on my master’s thesis, for which I have to use ULFM to detect and resolve process failures. Unfortunately, I ran into a problem.

 

I tried to modify the “Fault-tolerant iterative refinement with shrink and agreement” example 15.5 of the Unofficial ULFM Draft document, swapping the MPI_Allreduce out for an MPI_Alltoall operation. To test my implementation, random processes are killed with raise(SIGKILL). An example code that reproduces the failure is attached. I use the latest version of ulfm2 and compile with “mpicxx -O3 ./main.cpp”.
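To make the structure clear, here is a minimal sketch of the shrink-and-agree retry loop around MPI_Alltoall that I have in mind. It is not the attached main.cpp: it uses the MPIX_ names from the Open MPI ULFM implementation, and the fault injection is simplified to a single deterministic rank instead of random ones.

#include <mpi.h>
#include <mpi-ext.h>   // MPIX_Comm_shrink, MPIX_Comm_revoke, MPIX_Comm_agree
#include <csignal>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    // Return error codes instead of aborting, so process failures can be handled.
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // Simplified fault injection: one rank kills itself before the collective.
    if (size > 1 && rank == size / 2) raise(SIGKILL);

    int ok = 0;
    while (!ok) {
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        std::vector<int> sendbuf(size, rank), recvbuf(size);

        int rc = MPI_Alltoall(sendbuf.data(), 1, MPI_INT,
                              recvbuf.data(), 1, MPI_INT, comm);

        ok = (rc == MPI_SUCCESS) ? 1 : 0;
        // Interrupt ranks that might still be blocked in operations on comm.
        if (!ok) MPIX_Comm_revoke(comm);
        // Uniform AND of ok over the surviving ranks.
        MPIX_Comm_agree(comm, &ok);

        if (!ok) {
            // Exclude the failed ranks and retry the collective.
            MPI_Comm newcomm;
            MPIX_Comm_shrink(comm, &newcomm);
            MPI_Comm_free(&comm);
            comm = newcomm;
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
        }
    }

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}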

 

The error messages differ, and I get one of the following errors:

 

1)    All surviving processes call MPI_Comm_shrink, but the call results in multiple communicators. For instance, after the call to MPI_Comm_shrink there should be a single communicator with 79 ranks, but instead I get one communicator with 75 ranks and four separate communicators with 1 rank each.

2)    Not all processes call MPI_Comm_shrink. The remaining processes produce a segmentation fault in MPI_Comm_revoke or MPI_Comm_agree. For instance, 75 processes call MPI_Comm_shrink and successfully produce a new communicator, while at least one of the remaining processes crashes with a segmentation fault.

 

Note that most of the time no error occurs; I have to run the program multiple times to reproduce the failure, and it can only be reproduced on an HPC system. The failure rate seems to be higher when I start the program on 3 nodes with 20 tasks per node.

 

I run the program with

 

mpirun -n $SLURM_NTASKS --mca mpi_ft_enable true --mca mpi_ft_detector_thread true --mca mpi_ft_detector_period 0.3 --mca ft_detector_timeout 1 $EXECUTABLE

 

I have tried other run configurations by modifying the MCA parameters, but the program keeps failing.

Best regards,

Charel Mercatoris

main.cpp

Aurelien Bouteiller
Feb 15, 2021, 11:59:19 AM
to User Level Fault Mitigation
Charel,

This is a known bug in the version you are using. The good news is that we have just merged ULFM into mainline Open MPI, and that bug has been resolved there.

I now recommend you switch over to the Open MPI master branch. 

The procedure for installing what is essentially a preview of the next release should be straightforward: git clone Open MPI; ULFM is compiled in by default, so just run the normal autogen/configure/make install sequence; no extra configure flags are needed for FT.
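Something along these lines should do it (the install prefix below is just a placeholder):

git clone https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
./configure --prefix=$HOME/opt/ompi-master
make -j 8 all install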

I recommend you start from a fresh clone and install directories. Leftovers from prior compilations are known to cause issues.

To run your FT code, you will need to call `mpiexec --with-ft=mpi` and then pass the rest of your options as usual.
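With the launch line from your first message, that would look something like

mpiexec --with-ft=mpi -n $SLURM_NTASKS $EXECUTABLE

keeping whatever other options you still need.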



Best,
Aurelien


 



