Question on ULFM OpenMPI 5.0.0 for node failure

Kawin Nimsaila

Nov 27, 2023, 5:43:37 PM
to User Level Fault Mitigation
Dear ULFM team,

I am quite new to ULFM. I would like to check whether ULFM in Open MPI 5.0.0 supports fault tolerance for node failures.
I ran a small program on AWS that kills a parent process with kill(getppid(), SIGKILL).
The program fails with the error message: PRTE has lost communication with a remote daemon.

If I change it to kill(getpid(), SIGKILL), the program works fine.
I am not sure if there is a limitation or any configuration that I am missing.
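
For context, the test is roughly along these lines (a simplified sketch, not my exact code; the rank choice and signal handling are only illustrative):

/* Simplified sketch: one rank kills either itself or its parent
 * (the PRRTE daemon on that node) to simulate a failure. */
#include <mpi.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, rc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == size - 1) {
        kill(getppid(), SIGKILL);    /* killing the parent: whole job aborts */
        /* kill(getpid(), SIGKILL);     killing only this rank: job survives */
    }

    /* With ULFM the survivors should get an error here instead of aborting. */
    rc = MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d: barrier returned %d\n", rank, rc);
    MPI_Finalize();
    return 0;
}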

Here is a configure command from ompi_info.

 Configure command line: '--prefix=/home/ec2-user/openmpi/openmpi-5.0.0/build' '--enable-prte-ft' '--with-ft=ulfm' '--with-libevent=internal' '--with-hwloc=internal' '--with-pmix=internal' '--with-prrte=internal'
 
Please let me know if you need more information.
Thank you very much.

Regards,
Kawin

George Bosilca

Nov 27, 2023, 6:08:12 PM
to ul...@googlegroups.com
The parent of one of the MPI processes is the PRRTE daemon. By killing the daemon and leaving the processes around, you should get a message from the mpirun process stating that it lost a daemon, and then your surviving MPI processes should start seeing notifications about dead processes. The application should continue to run, and you should be able to start new processes (and new daemons) via spawn.
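
A rough sketch of the usual recovery pattern on the survivors (this uses the MPIX_ ULFM extensions from <mpi-ext.h>; adapt it to your application, the structure here is only illustrative):

#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_Comm_revoke, MPIX_Comm_shrink, ...           */
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm world;
    int rc, eclass;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &world);
    MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);

    rc = MPI_Barrier(world);             /* any communication may report it */
    if (rc != MPI_SUCCESS) {
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
            MPI_Comm survivors;
            MPIX_Comm_revoke(world);             /* interrupt everyone else */
            MPIX_Comm_shrink(world, &survivors); /* live processes only     */
            MPI_Comm_free(&world);
            world = survivors;
            /* optionally MPI_Comm_spawn() on the shrunken communicator to
             * replace the lost ranks; new daemons can be started as needed */
        }
    }

    MPI_Finalize();
    return 0;
}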

From your question it is not clear what outcome you are observing.

  George.


Kawin Nimsaila

Nov 28, 2023, 10:51:10 AM
to User Level Fault Mitigation
Dear George,
Thank you for the reply. I also tried running the kill_node program from https://github.com/ICLDisco/ulfm-testing/blob/master/stress/kill_node.c

The program does not seem to continue running to the end. Here is the command I ran:

$ mpirun --with-ft=ulfm  -np 4  --host queue1-dy-t3nano-1,queue1-dy-t3nano-1,queue1-dy-t3nano-2,queue1-dy-t3nano-2  ./kill_node

and here is the result.

Warning: Permanently added 'queue1-dy-t3nano-2,172.31.68.234' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue1-dy-t3nano-1,172.31.65.214' (ECDSA) to the list of known hosts.
--------------------------------------------------------------------------

PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-ip-172-31-48-121-5100@0,0] on node ip-172-31-48-121
  Remote daemon: [prterun-ip-172-31-48-121-5100@0,2] on node queue1-dy-t3nano-2

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

I am not sure if I am missing any parameters for mpirun. Could you please advise?

Regards,
Kawin