Process failure not caught on OMPI 5.1.0a1


Daniel Torres

Mar 26, 2021, 12:58:20 PM
to User Level Fault Mitigation
Hi everyone.

I recently installed OMPI version 5.1.0a1 by cloning the master branch.
--------------------------------------------------------------------------------------------------------------------
git clone --recursive -b master https://github.com/open-mpi/ompi.git
--------------------------------------------------------------------------------------------------------------------

I used an all-internal configuration (hwloc, libevent, PMIx) to test my previously working codes on a clean installation.
--------------------------------------------------------------------------------------------------------------------
./autogen.pl
./configure --prefix=$HOME/OMPI --disable-man-pages
make all install
make check
make clean
--------------------------------------------------------------------------------------------------------------------

Up to this point everything is fine.
--------------------------------------------------------------------------------------------------------------------
Open MPI configuration:
-----------------------
Version: 5.1.0a1
Build MPI C bindings: yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
CUDA support: no
CUDA support: no
Fault Tolerance support: mpi
hwloc: internal
libevent: internal
pmix: internal
prrte: internal
Threading Package: pthreads
 
Atomics
-----------------------
OMPI: BUILTIN_C11
 
Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Open UCX: no
OpenFabrics OFI Libfabric: no
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
 
OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no
Lustre: no
PVFS2/OrangeFS: no
--------------------------------------------------------------------------------------------------------------------

If I run my tests without killing any process, all processes finish normally. But if I kill one (with SIGKILL), all processes stop and this message is shown:
--------------------------------------------------------------------------------------------------------------------
[daniel-lap:29408] [grpcomm_bmg_module.c:259] PMIx Error: UNPACK-INADEQUATE-SPACE
--------------------------------------------------------------------------------------------------------------------

So despite having selected fault tolerance, when the process stops, the error is not caught.
Do you know what this error means?
Did I miss something with my installation process?
Should I post this error on the OMPI mailing list?

Thanks a lot for your help.

EXTRA INFO
--------------------------------------------------------------------------
The command line I use to compile is:
mpicc -g -O3 test.c -o test -lm

The command line I use to execute is:
mpiexec --np 4 --machinefile hostfile --mca btl_base_verbose 100 --map-by node:oversubscribe --with-ft mpi --enable-recovery --mca mpi_ft_detector_thread true  ./test myArgs

My machine is:
Linux daniel-lap 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux
--------------------------------------------------------------------------

George Bosilca

Mar 26, 2021, 1:17:52 PM
to ul...@googlegroups.com
Daniel,

Please take a look at the README.FT.ULFM.md file; it explains how to get the FT part enabled. I don't want to spoil it, but you are missing a configure flag.

 Thanks,
  George.



Amina Guermouche

Mar 29, 2022, 8:12:06 AM
to User Level Fault Mitigation
Hello,
I am having the same issue as Daniel.
I cloned Open MPI master and configured it using:
./configure --prefix=/home/LIB/ompi-install-ulfm --with-ft=ulfm --disable-oshmem
make && make install

Everything went smoothly, and config.log reports:
configure:5174: *** Fault tolerance
configure:78131: checking if want fault tolerance
configure:78209: result: Enabled mpi (Specified ulfm)
configure:78213: WARNING: **************************************************
configure:78215: WARNING: *** Fault Tolerance Integration into Open MPI is *
configure:78217: WARNING: *** compiled-in, but off by default. Use mpiexec *
configure:78219: WARNING: *** and MCA parameters to turn it on.            *
configure:78221: WARNING: *** Not all components support fault tolerance.  *
configure:78223: WARNING: **************************************************
configure:78263: checking if want checkpoint/restart enabled debugging option
configure:78273: result: Disabled
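
The line that matters in the log above is the `result:` that follows `checking if want fault tolerance`. As a minimal sketch, assuming the log layout quoted above (the function name `ft_result_from_config_log` is made up for illustration and is not part of Open MPI), it can be pulled out of a config.log like this:

```shell
#!/bin/sh
# Hypothetical helper, not part of Open MPI: print the "result:" text that
# follows the "checking if want fault tolerance" line of a config.log.
# Assumes the log layout quoted in this thread.
ft_result_from_config_log() {
    awk '
        /checking if want fault tolerance/ { seen = 1; next }
        seen && /result: / { sub(/^.*result: */, ""); print; exit }
    ' "$1"
}

# Example use, only if a config.log exists in the current directory.
if [ -f config.log ]; then
    ft_result_from_config_log config.log
fi
```

For the log quoted above this would print the text after `result:` on the fault tolerance line. It only confirms what configure decided; as the warning in the log says, fault tolerance still has to be turned on at run time.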

I tried running the SC21 tutorial examples:
mpiexec -n 2 ./02.err_handler => returns the "normal" error for a dead process (prterun noticed that process rank 1 with PID 0 on node ... exited on signal 9 (Killed)). I observe similar behavior with: mpirun --tune ft-mpi ./02.err_handler
If I use the options mentioned by Daniel, I get a core dump.

For now I am running on my own computer (so a single node). I guess I am missing something on the command line, but I can't figure out what. Could you point me in the right direction? :)
Kind regards,
Amina
PS: the README.FT.ULFM.md link mentioned above seems to be broken.

George Bosilca

Mar 29, 2022, 11:39:29 AM
to ul...@googlegroups.com
Hi Amina,


I think you both missed the --with-ft on the configure line. Here is how I build the code:
./configure --enable-debug --prefix=*** --with-pmix=internal --enable-picky --enable-visibility --enable-mpirun-prefix-by-default --disable-oshmem --without-memkind --with-ft

Let me know if this helps,
  George.


Amina Guermouche

Mar 31, 2022, 10:38:54 AM
to User Level Fault Mitigation
Thank you George :)
It's better (the processes no longer return with the exit signal), but the execution never stops, and it seems that no process enters the handler. I am using 02.err_handler from the SC21 tutorial.
To run I use: mpiexec --with-ft ulfm -n 2 ./02.err_handler (still on my own computer).
I think it's using prte, and I don't know if that's the correct framework to use.
What's the correct command line to use?
Amina

Amina Guermouche

Apr 4, 2022, 3:12:37 AM
to User Level Fault Mitigation
Hello,
Using --tune with the provided ft-mpi, it works as expected.
Thank you again for your help,
Amina

Aurelien Bouteiller

Apr 4, 2022, 2:21:54 PM
to User Level Fault Mitigation
Hey Amina,

It is odd that you have to pass the tune file by hand.

I am double checking your configure flags and they look alright. Still there are common traps:

During the configure stage I normally also use ‘prefix by default’ as a way to make sure I’m not picking up a random `prted` from a system install.

`../master/configure --prefix=some/place --enable-prte-prefix-by-default --with-pmix=internal --with-libevent=internal --with-prrte=internal --with-ft=ulfm`

Then during run, you need to enable fault tolerance (it is compiled-in/runtime off by default):

`some/place/bin/mpiexec --with-ft ulfm example2`

This should remove the need to pass-in tune files or other controls to enable fault tolerance.

Let me know how that goes.
Aurelien
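
The trap mentioned above, picking up a `prted` from a system install, can be checked by comparing where `mpiexec` and `prted` resolve in `PATH`. A minimal sketch under stated assumptions: the helper name `same_bin_dir` is hypothetical, and the comparison is on directory names only, so symlinked installs may need their paths resolved first.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when both paths live in the same directory.
same_bin_dir() {
    [ "$(dirname "$1")" = "$(dirname "$2")" ]
}

# Compare whichever mpiexec and prted come first in PATH, if both are present.
mpiexec_path=$(command -v mpiexec || true)
prted_path=$(command -v prted || true)
if [ -n "$mpiexec_path" ] && [ -n "$prted_path" ]; then
    if same_bin_dir "$mpiexec_path" "$prted_path"; then
        echo "mpiexec and prted share a prefix: $(dirname "$mpiexec_path")"
    else
        echo "mismatch: mpiexec=$mpiexec_path prted=$prted_path"
    fi
else
    echo "mpiexec or prted not found in PATH"
fi
```

A mismatch does not prove anything is broken, but it is a hint that the launcher and the daemon may come from different installations.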

Amina Guermouche

Apr 5, 2022, 7:32:07 AM
to User Level Fault Mitigation
Hey Aurélien,
Thank you for your answer. I tried the exact configure line you mentioned (./configure --prefix=somewhere --enable-prte-prefix-by-default --with-pmix=internal --with-libevent=internal --with-prrte=internal --with-ft=ulfm), then ran mpiexec (after make and make install) with FT enabled. If I do not specify --tune, the execution never ends (I am testing with the 02.err_handler.c provided in the tutorial, and the execution does not seem to enter the handler). If I add --tune, it works as expected. I tried on another machine (a cluster where we load modules, so the environment is cleaner than on my own machine) and I have the same issue.
Amina
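
When comparing a run that never ends against one that works, it can help to put a time limit on the launch so that a hang is reported instead of blocking the terminal. A minimal sketch, assuming GNU coreutils `timeout` is available (it exits with status 124 when the limit is reached); the helper name `run_with_deadline` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a time limit and classify the result.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the limit
# is reached.
run_with_deadline() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang: no exit within ${limit}s"
    else
        echo "exit status: $status"
    fi
}

# Example use with the command line quoted earlier in the thread:
# run_with_deadline 60 mpiexec --with-ft ulfm -n 2 ./02.err_handler
```

Running the same wrapper once with and once without the extra options gives two comparable one-line results to paste into a report.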

Aurelien Bouteiller

Apr 7, 2022, 8:37:55 PM
to User Level Fault Mitigation
Amina,

Just double checking: you did use the `mpiexec --with-ft mpi myprogram` command to launch?

Aurelien

Amina Guermouche

Apr 12, 2022, 4:38:54 AM
to User Level Fault Mitigation
Hello,
Sorry for the delay.
Yes, I run with mpiexec --with-ft mpi.
Amina

Aurelien Bouteiller

Apr 12, 2022, 9:38:56 AM
to User Level Fault Mitigation
Amina, I cannot reproduce this with the latest ‘main’. Can you provide the exact git hash for both Open MPI and the submodules (git show, git submodule)?

Best,
Aurelien

Amina Guermouche

Apr 12, 2022, 11:40:42 AM
to User Level Fault Mitigation
Here's the output:
$ git show
Merge: dfb9ca0bb5 4f3257d7b6
Author: Tommy Janjusic <jan...@users.noreply.github.com>
Date:   Thu Mar 31 20:02:48 2022 -0500

    Merge pull request #10208 from vspetrov/coll_ucc_build_fix
   
    coll/ucc: build and warn fixes
$ git submodule
1b86a35db2816ee9c0f3a41988005a2ba7d29adb 3rd-party/openpmix (v1.1.3-3481-g1b86a35d)
 91f791e209ccbdfb4b8647900d292ef51d52f37d 3rd-party/prrte (psrvr-v2.0.0rc1-4319-g91f791e209)
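
For a report like this one, the full commit hash of the checkout and of each submodule can be printed directly. A minimal sketch: `report_hashes` is a hypothetical helper name, and the repository path in the usage comment is an assumption. It relies only on `git rev-parse HEAD` and `git submodule status --recursive`.

```shell
#!/bin/sh
# Hypothetical helper: print the checked-out commit of a repository, followed
# by the commit recorded for each of its submodules (if any).
report_hashes() {
    git -C "$1" rev-parse HEAD
    git -C "$1" submodule status --recursive
}

# Example use from an Open MPI clone (the path is an assumption):
# report_hashes "$HOME/ompi"
```

The first output line is the full hash of the top-level checkout; any following lines use the same format as the `git submodule` output above.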