Hi everyone.
Recently I have been testing my Open MPI/ULFM codes on a cluster.
My tests include pseudo-randomly killing some of the processes involved in the computation, then re-spawning them and continuing.
When I use only 1 node of the cluster, everything goes fine: no errors, and all processes finish without problems even when I kill some of them. The survivors use MPI_Comm_spawn to restore the killed processes, and the communicator is repaired perfectly.
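For context, my recovery path follows the usual ULFM shrink-and-respawn pattern. Roughly (a simplified sketch, not my exact code: error checking is omitted, the rank reordering after the merge is elided, and "nfailed", "cmd" and "argv" stand for whatever the application tracked):

#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink */

/* Run collectively by the survivors after a failure is detected. */
static MPI_Comm respawn_failed(MPI_Comm comm, int nfailed,
                               char *cmd, char **argv)
{
    MPI_Comm shrunk, intercomm, repaired;

    MPIX_Comm_revoke(comm);           /* interrupt pending operations everywhere */
    MPIX_Comm_shrink(comm, &shrunk);  /* build a survivors-only communicator */

    /* Spawn the replacements over the shrunken communicator... */
    MPI_Comm_spawn(cmd, argv, nfailed, MPI_INFO_NULL, 0,
                   shrunk, &intercomm, MPI_ERRCODES_IGNORE);

    /* ...and merge them back into a single intracommunicator. */
    MPI_Intercomm_merge(intercomm, 0 /* survivors get low ranks */, &repaired);

    MPI_Comm_free(&shrunk);
    MPI_Comm_free(&intercomm);
    return repaired;
}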
When I extend the tests to 2 nodes, executions without killing processes work fine: all nodes can communicate with each other, the processes are launched across both nodes, and their communication succeeds. The computation finishes.
But when I kill some processes on one node (or on several), things go wrong.
All processes detect the failure (of process 9, in this case):
--------------------------------------------------------------------------
Process [9] in [dahu-3.grenoble.grid5000.fr]: [74][MPI_ERR_PROC_FAILED: Process Failure]
--------------------------------------------------------------------------
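(For reference, the failure is detected through an MPI error handler installed on the communicator; a minimal sketch of that side, where the recovery call itself is a placeholder:)

#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ERR_PROC_FAILED, MPIX_ERR_REVOKED */

/* Sketch of the handler the survivors run; the actual recovery
 * call is application-specific and elided here. */
static void failure_handler(MPI_Comm *comm, int *err, ...)
{
    int eclass;
    MPI_Error_class(*err, &eclass);
    if (MPIX_ERR_PROC_FAILED == eclass || MPIX_ERR_REVOKED == eclass) {
        /* enter the shrink-and-respawn path sketched above */
    }
}

/* Installed once, right after MPI_Init: */
static void install_failure_handler(void)
{
    MPI_Errhandler errh;
    MPI_Comm_create_errhandler(failure_handler, &errh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);
}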
Then the re-spawning process begins, but an error is shown:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[21821,1],25]) is on host: dahu-3
Process 2 ([[21821,2],0]) is on host: unknown!
BTLs attempted: self openib vader
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
Then more errors follow:
--------------------------------------------------------------------------
[dahu-3.grenoble.grid5000.fr:13193] [[21821,1],25] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[dahu-29.grenoble.grid5000.fr:43824] 31 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[dahu-29.grenoble.grid5000.fr:43824] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
At this point I must stop the execution manually, and the surviving processes show the message:
--------------------------------------------------------------------------
Process [29] in [dahu-3.grenoble.grid5000.fr]: MPI_Comm_spawn [17][MPI_ERR_INTERN: internal error]
--------------------------------------------------------------------------
Do you know why it works fine with only 1 node but starts failing when I use 2 or more nodes?
Thanks for your attention and help.
EXTRA INFO
--------------------------------------------------------------------------
The command line I use is:
mpiexec -np 64 --machinefile myNodes --map-by node --mca btl openib,vader,self -oversubscribe --mca mpi_ft_detector_thread true ./my_test myArgs
The mpiexec version is:
mpiexec (OpenRTE) 4.0.2u1
The OS information is:
Linux fgrenoble 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1+deb10u1 (2020-04-27) x86_64 GNU/Linux
The configuration I used to install ULFM is:
./configure --with-ft=mpi --prefix=$HOME/ULFM2 --enable-mpi-cxx --enable-mpi-cxx-seek --enable-cxx-exceptions --enable-mpi-ext=ftmpi
--------------------------------------------------------------------------