Open MPI crashes, hangs, or does not finalize despite the MPI error handler


Vahid Jafari

Apr 22, 2022, 5:43:32 AM
to User Level Fault Mitigation

Dear ULFM team,

I am currently working with ULFM together with my colleagues in a parallel computing course. However, we have run into some technical problems: every time we run one of the classic tutorial examples that uses SIGKILL, e.g. noft2.c (attached), the application either hangs or terminates with the following error despite the installed MPI error handler:

 

mpirun noticed that process rank 1 with PID 0 on node master exited on signal 9 (Killed).

 

We have tried two different versions of Open MPI, 4.0.2rc3 and 5.0.0rc2:

 

  1. Here is what we tried; the program crashes:

 

> mpirun -np 2 ./noft

Rank 0 / 2: Before sigkill . !

Rank 1 / 2: Before sigkill . !

--------------------------------------------------------------------------

Primary job  terminated normally, but 1 process returned

a non-zero exit code. Per user-direction, the job has been aborted.

--------------------------------------------------------------------------

--------------------------------------------------------------------------

mpirun noticed that process rank 1 with PID 0 on node master exited on signal 9 (Killed).

--------------------------------------------------------------------------


  2. Here the program hangs:

 

> mpirun -np 2 --enable-recovery ./noft

Rank 1 / 2: Before sigkill . !

Rank 0 / 2: Before sigkill . !

[master.ipoib:128008] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923


  3. And if we comment out lines 26, 27, and 30 (MPI_Barrier, MPI_Error_string, and MPI_Finalize), the program finishes (the pattern of the example is sketched after the output below):


> mpirun -np 2 --enable-recovery ./noft

Rank 1 / 2: Before sigkill . !

Rank 0 / 2: Before sigkill . !

[master.ipoib:130374] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923

Rank 0 / 2: Notified of error . Stayin' alive!
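For reference, the pattern in the example is roughly the following (a simplified sketch only; the attached noft2.c has the full code and its line numbering differs):

#include <mpi.h>
#include <signal.h>
#include <stdio.h>

/* Error handler: report the failure and keep the surviving ranks alive. */
static void verbose_errhandler(MPI_Comm *comm, int *err, ...) {
    int rank, size, len;
    char errstr[MPI_MAX_ERROR_STRING];
    MPI_Comm_rank(*comm, &rank);
    MPI_Comm_size(*comm, &size);
    MPI_Error_string(*err, errstr, &len);
    printf("Rank %d / %d: Notified of error %s. Stayin' alive!\n", rank, size, errstr);
}

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Errhandler errh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler on MPI_COMM_WORLD. */
    MPI_Comm_create_errhandler(verbose_errhandler, &errh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);

    printf("Rank %d / %d: Before sigkill . !\n", rank, size);
    if (rank == 1)
        raise(SIGKILL);              /* simulate a hard failure of rank 1 */

    MPI_Barrier(MPI_COMM_WORLD);     /* the survivor should be notified here */
    MPI_Finalize();
    return 0;
}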

 

I wanted to ask whether you know what else we can try here in order to make all of this work together. Do we have to pass some parameters at runtime (i.e., when calling mpirun) to Open MPI, etc.?

 

Thanks a lot in advance

 

Best,

Vahid

Attachment: noft2.c

George Bosilca

Apr 25, 2022, 10:21:54 AM
to ul...@googlegroups.com
Vahid,

OMPI 4.x did not include ULFM, so there was no support for resilience. OMPI 5.0 does include it, but it must be specifically enabled at configure time in order to be available to users. You need to add `--with-ft=mpi` when configuring Open MPI to have it available.
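For example, a typical sequence would be (the install prefix here is just a placeholder, adjust to your environment):

> ./configure --with-ft=mpi --prefix=$HOME/opt/ompi-5.0.0rc2
> make -j 8 && make install
> $HOME/opt/ompi-5.0.0rc2/bin/mpirun -np 2 ./noft

With a build configured this way, the error handler in your example should be invoked on the surviving rank instead of the whole job being killed.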

George.


--
You received this message because you are subscribed to the Google Groups "User Level Fault Mitigation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ulfm+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/4faa4d22-61b8-4005-8b00-22dc7c45bca3n%40googlegroups.com.