Error produced by grid topology?


Daniel Torres

Aug 20, 2021, 6:27:21 AM
to User Level Fault Mitigation
Hi.

I am doing some tests with the latest OMPI version (with FT enabled), and killing a random process with "raise(SIGKILL)" produces some failures I cannot handle.

For example:
--------------------------------------------------------------------------
[gros-75.nancy.grid5000.fr:28869] [[24089,1],33] ompi: Process [[24089,1],1] failed (state = -57).
[gros-75.nancy.grid5000.fr:28869] [[24089,1],33] ompi: Error event reported through PMIx from [[24089,1],33] (state = -57). This error type is not handled by the fault tolerant layer and the application will now presumably abort.
[gros-90.nancy.grid5000.fr:16766] [[24089,1],49] ompi: Process [[24089,1],1] failed (state = -57).
[gros-75:00000] *** An error occurred in PMIx Event Notification
[gros-75:00000] *** reported by process [1578696705,33]
[gros-75:00000] *** on a NULL communicator
[gros-75:00000] *** Unknown error (this should not happen!)
[gros-75:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gros-75:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------

A quick explanation of what I'm trying to do:
I have a process grid with a "global" communicator (a dup of MPI_COMM_WORLD) and many row and column communicators, created with MPI_Cart_create and MPI_Cart_sub.
When I kill a process, I call MPIX_Comm_agree over the "global" communicator so that all processes catch the error, even those whose row/column did not contain the dead process, and then restore the three communicators the dead process belonged to: global_comm, row_comm, and col_comm.
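
To make this concrete, the setup looks roughly like this (a simplified sketch, not my exact code; the 8 x 8 dimensions are just an example matching the 64 ranks I launch with):
--------------------------------------------------------------------------
#include <mpi.h>

/* Simplified sketch of the grid setup. The real code derives the
   dimensions from its arguments. */
int main(int argc, char *argv[])
{
    MPI_Comm global_comm, grid_comm, row_comm, col_comm;
    int dims[2]     = {8, 8};
    int periods[2]  = {0, 0};
    int keep_row[2] = {0, 1};   /* vary the column index -> row communicator    */
    int keep_col[2] = {1, 0};   /* vary the row index    -> column communicator */

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &global_comm);
    MPI_Cart_create(global_comm, 2, dims, periods, 0, &grid_comm);
    MPI_Cart_sub(grid_comm, keep_row, &row_comm);
    MPI_Cart_sub(grid_comm, keep_col, &col_comm);

    /* ... computation, fault injection with raise(SIGKILL), recovery ... */

    MPI_Finalize();
    return 0;
}
--------------------------------------------------------------------------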

When the process dies, the error above appears and the recovery procedure never starts. I attach the output error file.

I know "topo" modules are still not supported for fault tolerance, but I'm trying to catch the error with a duplicate of MPI_COMM_WORLD.
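
The error-handling side follows roughly this pattern (again a simplified sketch; the helper names grid_errhandler, install_handlers and restore_grid are illustrative, not my exact code):
--------------------------------------------------------------------------
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_agree, ... */

/* Communicators created as sketched above. */
static MPI_Comm global_comm, row_comm, col_comm;

/* Error handler installed on all three communicators: any rank that
   notices a failure revokes everything, so ranks whose row/column did
   not contain the dead process are also forced out of pending calls. */
static void grid_errhandler(MPI_Comm *comm, int *errcode, ...)
{
    int eclass;
    MPI_Error_class(*errcode, &eclass);
    if (MPIX_ERR_PROC_FAILED != eclass && MPIX_ERR_REVOKED != eclass)
        MPI_Abort(*comm, *errcode);

    MPIX_Comm_revoke(global_comm);
    MPIX_Comm_revoke(row_comm);
    MPIX_Comm_revoke(col_comm);
}

static void install_handlers(void)
{
    MPI_Errhandler errh;
    MPI_Comm_create_errhandler(grid_errhandler, &errh);
    MPI_Comm_set_errhandler(global_comm, errh);
    MPI_Comm_set_errhandler(row_comm, errh);
    MPI_Comm_set_errhandler(col_comm, errh);
}

/* Entered by every rank after the revoke: first agree over global_comm
   so all ranks learn about the failure, then rebuild the three
   communicators (shrink, respawn, MPI_Cart_create / MPI_Cart_sub). */
static void restore_grid(void)
{
    int flag = 1;
    MPIX_Comm_agree(global_comm, &flag);
    /* ... shrink global_comm, respawn the dead rank, re-create
       row_comm and col_comm ... */
}
--------------------------------------------------------------------------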

Could you please help me with this issue?

The way I cloned and configured my OMPI/ULFM version is:
--------------------------------------------------------------------------
git clone --recursive https://github.com/open-mpi/ompi.git --branch=master --single-branch
./configure --with-ft=mpi --prefix=$HOME/ULFM --disable-man-pages
--------------------------------------------------------------------------

The way I launch my tests is:
--------------------------------------------------------------------------
mpiexec --np 64 --machinefile hostfile  --map-by node:oversubscribe --mca btl tcp,vader,self --mca btl_base_verbose 100 --enable-recovery --mca mpi_ft_enable true --mca mpi_ft_detector_thread true --mca mpi_ft_verbose 1 ./myTest myArgs
--------------------------------------------------------------------------

The OMPI/PMIx versions installed are:
--------------------------------------------------------------------------
$ mpiexec --version
prterun (Open MPI) 5.1.0a1

$ grep PMIX_VERSION include/pmix_version.h
#define PMIX_VERSION_MAJOR 5L
#define PMIX_VERSION_MINOR 0L
#define PMIX_VERSION_RELEASE 0L
--------------------------------------------------------------------------

P.S. The following error:
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    module:untested:failundef
But I couldn't open the help file:
    /home/dtorres/ULFM/share/openmpi/help-ft-mpi.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
is also shown, but renaming the file "help-ft-mpi.txt" to "help-mpi-ft.txt" fixes it.

Thanks for your attention.

Attachment: out.txt

George Bosilca

Aug 20, 2021, 11:12:44 AM
to Aurelien Bouteiller, ul...@googlegroups.com
Daniel,

Aurelien discovered a few days ago that an update to PRRTE/PMIx broke the fault handling in OMPI. We are investigating right now. In the meantime you might have to roll back to an older version.

@Aurelien, do you know what the last OMPI version that worked correctly was?

Thanks,
  George.


Daniel Alberto Torres Gonzalez

Aug 20, 2021, 6:54:06 PM
to 'George Bosilca' via User Level Fault Mitigation

Hi.

Thanks for your answer.

I'm going to roll back PMIx to the last stable version, which is v4.1.0 if I remember correctly.

I will keep an eye on the changes made to version 5.x.

Best regards.

On 20/08/21 at 17:12, 'George Bosilca' via User Level Fault Mitigation wrote:

Aurelien Bouteiller

Aug 31, 2021, 9:30:19 PM
to George Bosilca, ul...@googlegroups.com
Sorry I missed this email,

Open MPI version b541521c should work as intended.

Alternatively, you may `git revert 0afa3487`.

Best,
Aurelien

Daniel Alberto Torres Gonzalez

Sep 29, 2021, 5:46:26 AM
to ul...@googlegroups.com

Hi again.

After reverting commit 0afa3487 as suggested in the previous answer, I have tested again (and again), but the recovery procedure still hangs in the MPI_Comm_spawn call.

I have also tested with version b541521c, but got the same result.
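
For reference, the respawn step where the hang occurs looks roughly like this (a simplified sketch; respawn_failed and its parameters are illustrative names, not my exact code):
--------------------------------------------------------------------------
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_shrink, ... */

/* Simplified respawn step; global_comm is the (revoked) duplicate of
   MPI_COMM_WORLD described earlier, nfailed the number of dead ranks. */
static void respawn_failed(MPI_Comm global_comm, int nfailed, char *binary)
{
    MPI_Comm shrunk, intercomm, merged;

    MPIX_Comm_shrink(global_comm, &shrunk);   /* survivors only */

    /* This is the call that never returns in my runs. */
    MPI_Comm_spawn(binary, MPI_ARGV_NULL, nfailed, MPI_INFO_NULL,
                   0, shrunk, &intercomm, MPI_ERRCODES_IGNORE);

    /* Merge survivors and replacements back into one full-size
       communicator, then MPI_Cart_create / MPI_Cart_sub rebuild the
       row and column communicators (omitted here). */
    MPI_Intercomm_merge(intercomm, 0, &merged);
}
--------------------------------------------------------------------------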

Is there another version I could try for testing?

Thanks in advance for your help.

Best regards.

On 01/09/21 at 03:30, 'Aurelien Bouteiller' via User Level Fault Mitigation wrote:

George Bosilca

Sep 29, 2021, 10:53:22 AM
to ul...@googlegroups.com
Daniel,

I tried with the current master and some of our examples using comm_spawn, and things seem to work as expected. What batch scheduler are you using? Do you have a reproducer we can play with (if you don't want your code on the mailing list, send it to us privately)?

Thanks,
  George.

