Error re-spawning processes in 2 or more nodes

Daniel Torres

unread,

Sep 25, 2020, 7:37:27 AM9/25/20

to User Level Fault Mitigation

Hi everyone.

Recently I have been testing my OpenMPI/ULFM codes on a cluster.
My different tests include killing pseudo-randomly some of the processes involved in the computation to re-spawn them and continue.

When I use only 1 node of the cluster, everything goes fine; no errors, all processes finish without problems, even when I kill some processes; the others use MPI_Comm_spawn to restore them and it works great! The communicator is restored perfectly.

When I extend the tests and use 2 nodes, executions without killing processes work fine. All nodes can communicate with each other, all processes are launched on different nodes and their communication is successful. The computation finish.

But when I kill some processes in a node (or nodes), something happens.
All processes detect the error (process 9 in this case):
--------------------------------------------------------------------------
Process [9] in [dahu-3.grenoble.grid5000.fr]: [74][MPI_ERR_PROC_FAILED: Process Failure]
--------------------------------------------------------------------------

Then the re-spawning process begins, but an error is shown:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

Process 1 ([[21821,1],25]) is on host: dahu-3
Process 2 ([[21821,2],0]) is on host: unknown!
BTLs attempted: self openib vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------

Then, more errors continue:
--------------------------------------------------------------------------
[dahu-3.grenoble.grid5000.fr:13193] [[21821,1],25] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[dahu-29.grenoble.grid5000.fr:43824] 31 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[dahu-29.grenoble.grid5000.fr:43824] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------

At this moment, I must stop manually the execution and the surviving processes show the message:
--------------------------------------------------------------------------
Process [29] in [dahu-3.grenoble.grid5000.fr]: MPI_Comm_spawn [17][MPI_ERR_INTERN: internal error]
--------------------------------------------------------------------------

Do you know why it works fine with only 1 node but it starts failing when I use 2 or more nodes?
Thanks for your attention and help.

EXTRA INFO
--------------------------------------------------------------------------
The command line I use to execute is:
mpiexec -np 64 --machinefile myNodes --map-by node --mca btl openib,vader,self -oversubscribe --mca mpi_ft_detector_thread true ./my_test myArgs

The mpiexec version is:
mpiexec (OpenRTE) 4.0.2u1

The OS information is:
Linux fgrenoble 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1+deb10u1 (2020-04-27) x86_64 GNU/Linux

The configuration I used to install ULFM is:
./configure --with-ft=mpi --prefix=$HOME/ULFM2 --enable-mpi-cxx --enable-mpi-cxx-seek --enable-cxx-exceptions --enable-mpi-ext=ftmpi
--------------------------------------------------------------------------

Aurelien Bouteiller

unread,

Sep 30, 2020, 2:11:12 AM9/30/20

to User Level Fault Mitigation

Daniel,

The error you are seeing indicates that your existing processes (comm_world) could not connect to the spawned processes.

One way to collect more information about what happens here is to run with the extra verbose flag `--mca btl_base_verbose 100`

The usual reason for that error is that vader cannot be used between spawnees and the existing comm_world (it’s an upstream Open MPI limitation), but your test results are not consistent with that explanation, given that you can run on single node w/faults. So something must be amiss with the openib BTL loading, and hopefully the above flag will reveal what.

Best,

Aurelien

--
You received this message because you are subscribed to the Google Groups "User Level Fault Mitigation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ulfm+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/e773eae5-d59d-4961-86ba-196dea2829een%40googlegroups.com.

Daniel Torres

unread,

Sep 30, 2020, 7:21:55 AM9/30/20

to User Level Fault Mitigation

Hi Aurelien.

Thanks a lot for your reply.

I have launched the tests on 2 nodes with the extra flag you have indicated, with the command line:

mpiexec -np 64 --machinefile myNodes --map-by node --mca btl openib,vader,self --mca btl_base_verbose 100 -oversubscribe --mca mpi_ft_detector_thread true ./my_test myArgs

I saved the output on the attached file.

Thanks for your attention and help.

error2nodes.txt

Aurelien Bouteiller

unread,

Oct 1, 2020, 4:36:31 PM10/1/20

to User Level Fault Mitigation

Daniel,

from your outputs, it appears that 2 processes have crashed during MPI_INIT. I think these are the 2 processes that have been freshly spawned.

They crash during the initialization of the detector. It almost certainly is a bug on our side. Could you try with the `detector_thread false`?

Best,

Aurelien

To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/0699c2b7-198c-4835-8bc7-2f17cbebfa9dn%40googlegroups.com.
<error2nodes.txt>

Daniel Torres

unread,

Oct 2, 2020, 6:31:42 PM10/2/20

to User Level Fault Mitigation

Hi Aurelien.

I made the test with the change on the flag you have mentioned, but the final result is the same.

mpiexec -np 64 --machinefile myNodes --map-by node --mca btl openib,vader,self --mca btl_base_verbose 100 -oversubscribe --mca mpi_ft_detector_thread false ./my_test myArgs

I attach the corresponding error output file.I also launched the same test only on one node and saved the success output file. Hope it can help.

Both executions were tested with one failure.

Thanks a lot for your help.

success1node.txt

error2nodes.txt

Daniel Torres

unread,

Oct 12, 2020, 12:14:22 PM10/12/20

to User Level Fault Mitigation

Hi Aurelien.

Do you have any good news on the issue of process failure on respawn?

Again, thanks a lot for your help and time.

Aurelien Bouteiller

unread,

Oct 22, 2020, 12:00:53 AM10/22/20

to User Level Fault Mitigation

Daniel, I have finally had a bit of time to look into this issue.

There is a current bug in PMIx 3.1.4 (the one that ships internal in Open MPI 4.0.2, on which this version of ULFM is based).

Using external PMIx 3.2.0rc1 should fix your problem.

1. You will need to compile an external libevent (or use the system installed, any version beyond 2.0.22 should work fine).

2. Compile OpenPMIx v3.0.2rc1 using the appropriate libevent (—with-libevent=path)

3. Compile ULFM Open MPI, adding to your configure flags `—with-pmix=path —with-libevent=path`; you can use —with-libevent=external if system installed.

Best,

Aurelien

To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/a75a8c73-973f-437d-a979-30d2c8283e59n%40googlegroups.com.

Daniel Torres

unread,

Oct 28, 2020, 11:46:09 AM10/28/20

to User Level Fault Mitigation

Hi Aurelien.

Thanks for your answer. I have installed both "libevent-2.1.12" and the latest "PMIx-3.2" external libraries as follows:

TO INSTALL LIBEVENT

************************

tar xzf libevent-2.1.12-stable.tar.gz

cd libevent-2.1.12-stable/

./autogen.sh

./configure --prefix=$HOME/LIBEVENT --disable-static --disable-debug-mode --disable-libevent-regress

make

make verify

make -j install

************************

TO INSTALL PMIX

************************

unzip openpmix-3.2.zip

cd openpmix-3.2/

./autogen.pl

./configure --disable-debug --prefix=$HOME/PMIX --enable-man-pages=false --with-libevent=$HOME/LIBEVENT/

make -j install

make check

************************

The installation seems to be successful, as the "make verify" and "make check" commands output shows all tests were successful.

Then I export my new libraries path:

************************

export LIBE=$HOME/LIBEVENT/
export PATH=$LIBE/bin:$PATH
export LD_LIBRARY_PATH=$LIBE/lib:$LD_LIBRARY_PATH

export PMIX=$HOME/PMIX/
export PATH=$PMIX/bin:$PATH
export LD_LIBRARY_PATH=$PMIX/lib:$LD_LIBRARY_PATH

************************

source ~/.bashrc

************************

But when I try to configure ULFM's installation providing the "--with-pmix=$HOME/PMIX/" and "--with-libevent=$HOME/LIBEVENT/" flags, the next error is shown:

--- MCA component pmix:ext3x (m4 configuration macro)
checking for MCA component pmix:ext3x compile mode... dso
checking if external component is version 3.x... yes
configure: WARNING: EXTERNAL PMIX SUPPORT REQUIRES USE OF EXTERNAL LIBEVENT
configure: WARNING: LIBRARY. THIS LIBRARY MUST POINT TO THE SAME ONE USED
configure: WARNING: TO BUILD PMIX OR ELSE UNPREDICTABLE BEHAVIOR MAY RESULT
configure: error: PLEASE CORRECT THE CONFIGURE COMMAND LINE AND REBUILD

Reading the warning message, I verified both paths were right written, and yes, they are. The steps I followed to install ULFM are:

************************

tar -xvf icldistcomp-ulfm2-117942b0e44e.tar.gz

cd icldistcomp-ulfm2-117942b0e44e/

./autogen.pl

/configure --with-ft=mpi --prefix=$HOME/ULFM2 --enable-mpi-cxx --enable-mpi-cxx-seek --enable-cxx-exceptions --enable-mpi-ext=ftmpi --with-pmix=$HOME/PMIX/ --with-libevent=$HOME/LIBEVENT/

************************

Am I missing a flag to allow the configuration/installation process to continue? Or did I installed incorrectly libevent or PMIx?

Thanks for your help.

Aurelien Bouteiller

unread,

Oct 28, 2020, 4:02:10 PM10/28/20

to User Level Fault Mitigation

Daniel,

In the file contrib/platform/ft_mpi_ulfm; remove the lines

with_libevent=internal

and

with_pmix=internal.

Command line parameters should take precedence but lets make sure.

Best,

Aurelien

To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/c635791e-74c3-4b29-91ac-7f3707f7b9aen%40googlegroups.com.

George Bosilca

unread,

Oct 28, 2020, 6:39:22 PM10/28/20

to ul...@googlegroups.com

Daniel,

Following the steps you mentioned in your email I was able to install everything, with the correct dependencies. I usually pull directly from git, and things go smooth. Anyway, it worked while using the official stable tarballs for libevent and pmix, and this without doing any modification in the contrib/platform/ft_mpi_ulfm file.

However, I noticed an issue. You have a full set of options related to CXX (--enable-mpi-cxx --enable-mpi-cxx-seek --enable-cxx-exceptions) and those have been removed from the OMPI master, so from the ULFM master version. What version of ULFM are your trying to install ?

George.

To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/62426323-3EEC-4EB9-B92F-5CEBA534C409%40icl.utk.edu.

Daniel Torres

unread,

Oct 29, 2020, 9:51:56 AM10/29/20

to User Level Fault Mitigation

Hi Aurelien and George.

Thanks both of you for your answers.

The version I was trying to install previously was "v4.0.2u1", that I found on the download section of the ULFM repository. Now, to avoid older versions, as George suggested, I just cloned the repo (git clone --recursive https://bitbucket.org/icldistcomp/ulfm2.git).

When I try to configure with the command:

************************
./configure --with-ft=mpi --prefix=$HOME/ULFM2 --enable-mpi-ext=ftmpi --with-pmix=$HOME/PMIX/ --with-libevent=$HOME/LIBEVENT/ (I removed the CXX flags)
************************

Again it shows the same error message related to external PMIX and LIBEVENT libraries. So I tried removing (or commenting) the lines in "contrib/platform/ft_mpi_ulfm" as Aurelien said and it worked!

In the "Miscellaneous" configuration resume it is shown:

************************
Miscellaneous
-----------------------
CUDA support: no
Fault Tolerance support: mpi (disabling components mtl pml-cm pml-crcpw pml-yalla pml-ucx coll-cuda coll-fca coll-hcoll coll-portals4 btl-usnic btl-portals4 btl-scif btl-smcuda pml-monitoring pml-v vprotocol crcp)
HWLOC support: internal
Libevent support: external
PMIx support: External (3x)
************************

So i guess both external libraries were installed correctly. Then, compilation and installation process can finish without problem.

About the re-spawning error that started the discussion, it was solved too!

Now, using 2 nodes, when a process has failed and the error was detected by all the other processes, a spawny process can be launched without trouble. =D

Thanks a lot for your time, attention and help to solve this issue.

George Bosilca

unread,

Oct 29, 2020, 3:30:02 PM10/29/20

to ul...@googlegroups.com

Daniel,

It's awesome you have a working version. You can move forward with what you have or you can move one step closer to the OMPI 5.0 by pulling the export/ulfm-to-ompi5-expanded branch. I forgot to mention in my previous message that this is the branch I was using, a branch that apparently does not have the same issue as the default ulfm branch because the platform file does not exist. Anyway, your choice if you stay with the old or transition to the new branch.

George.

To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/a149008b-2d0f-449e-9fad-bcba8c4d25bcn%40googlegroups.com.

Reply all

Reply to author

Forward