Hi.
I am running some tests with the latest OMPI version (with FT enabled), and killing a random process with "raise(SIGKILL)" produces failures that I cannot handle.
For example:
--------------------------------------------------------------------------
[gros-75.nancy.grid5000.fr:28869] [[24089,1],33] ompi: Process [[24089,1],1] failed (state = -57).
[gros-75.nancy.grid5000.fr:28869] [[24089,1],33] ompi: Error event reported through PMIx from [[24089,1],33] (state = -57). This error type is not handled by the fault tolerant layer and the application will now presumably abort.
[gros-90.nancy.grid5000.fr:16766] [[24089,1],49] ompi: Process [[24089,1],1] failed (state = -57).
[gros-75:00000] *** An error occurred in PMIx Event Notification
[gros-75:00000] *** reported by process [1578696705,33]
[gros-75:00000] *** on a NULL communicator
[gros-75:00000] *** Unknown error (this should not happen!)
[gros-75:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gros-75:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
A quick explanation of what I'm trying to do:
I have a process grid with a "global" communicator (a dup of MPI_COMM_WORLD) and many row and column communicators, created with MPI_Cart_create and MPI_Cart_sub.
When I kill a process, I use MPIX_Comm_agree over the "global" communicator so that all processes catch the error, even those that do not share a row or column with the dead process, and then restore the three communicators the dead process belonged to: global_comm, row_comm, and col_comm.
When the process dies, the error shown above appears and the recovery procedure never starts. I have attached the error output file.
I know "topo" modules are still not supported for fault tolerance, but I'm trying to catch the error with a duplicate of MPI_COMM_WORLD.
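To make the layout concrete, here is a minimal sketch (plain C, no MPI calls) of which ranks would share a row or column communicator with a failed rank. It assumes an 8 x 8 grid for the 64 processes, row-major rank ordering as MPI_Cart_create uses, and no rank reordering; the names grid_pos, rank_to_pos and shares_subcomm are hypothetical helpers for illustration, not part of my test or of any MPI API. Under those assumptions, ranks 33 and 49 (the ones reporting the failure in the log above) would sit in the same column as the failed rank 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper type: position of a rank in a 2-D process grid. */
typedef struct { int row; int col; } grid_pos;

/* Map a rank to grid coordinates, ASSUMING row-major ordering
 * (the ordering MPI_Cart_create uses) and no rank reordering. */
static grid_pos rank_to_pos(int rank, int ncols)
{
    assert(rank >= 0 && ncols > 0);
    grid_pos p = { rank / ncols, rank % ncols };
    return p;
}

/* True if `rank` is in the same row or the same column as `failed`,
 * i.e. it shares one of the sub-communicators that need repair. */
static bool shares_subcomm(int rank, int failed, int ncols)
{
    grid_pos a = rank_to_pos(rank, ncols);
    grid_pos b = rank_to_pos(failed, ncols);
    return a.row == b.row || a.col == b.col;
}

#ifdef DEMO_MAIN
int main(void)
{
    const int nprocs = 64;  /* matches --np 64 in my launch command */
    const int ncols  = 8;   /* ASSUMPTION: 8 x 8 grid */
    const int failed = 1;   /* rank reported as failed in the log */

    for (int r = 0; r < nprocs; ++r) {
        if (r == failed)
            continue;
        grid_pos p = rank_to_pos(r, ncols);
        printf("rank %2d at (%d,%d): %s\n", r, p.row, p.col,
               shares_subcomm(r, failed, ncols)
                   ? "shares row_comm or col_comm with the failed rank"
                   : "sees the failure through global_comm only");
    }
    return 0;
}
#endif
```

Compiling with -DDEMO_MAIN prints the classification for every surviving rank. Most ranks fall into the "global_comm only" group, which is why I run the agreement over the global communicator rather than over the row or column communicators.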
Could you please help me with this issue?
This is how I cloned and configured my OMPI/ULFM build:
--------------------------------------------------------------------------
git clone --recursive https://github.com/open-mpi/ompi.git --branch=master --single-branch
./configure --with-ft=mpi --prefix=$HOME/ULFM --disable-man-pages
--------------------------------------------------------------------------
The way I launch my tests is:
--------------------------------------------------------------------------
mpiexec --np 64 --machinefile hostfile --map-by node:oversubscribe --mca btl tcp,vader,self --mca btl_base_verbose 100 --enable-recovery --mca mpi_ft_enable true --mca mpi_ft_detector_thread true --mca mpi_ft_verbose 1 ./myTest myArgs
--------------------------------------------------------------------------
The installed OMPI/PMIx versions are:
--------------------------------------------------------------------------
$ mpiexec --version
prterun (Open MPI) 5.1.0a1
$ grep PMIX_VERSION include/pmix_version.h
#define PMIX_VERSION_MAJOR 5L
#define PMIX_VERSION_MINOR 0L
#define PMIX_VERSION_RELEASE 0L
--------------------------------------------------------------------------
P.S. The following error:
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
module:untested:failundef
But I couldn't open the help file:
/home/dtorres/ULFM/share/openmpi/help-ft-mpi.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
is also shown, but changing the file name "help-ft-mpi.txt" to "help-mpi-ft.txt" solves it.
Thanks for your attention.