simulating node failure

Atis Degro

Aug 5, 2019, 6:31:25 AM
to User Level Fault Mitigation
Dear all,

I have been testing a fault-tolerant MPI implementation (ULFM) with a finite difference solver.
When running on a cluster, I tried two ways of inducing faults:
- ssh to one of the running nodes and kill the processes one by one using 'kill -9 PID';
- reboot one of the nodes (a more realistic node failure scenario).

With the first option the run recovers as expected and proceeds.
If I reboot a node, however, the run just hangs.

Is this behavior expected?
Or could it be caused by the way ULFM is installed?
I am using the icldistcomp-ulfm2-2e75c73cc620.tar.bz2 version
of ULFM.

Thank you and best regards,
Atis

George Bosilca

Aug 5, 2019, 10:46:33 AM
to ul...@googlegroups.com
Atis,

Your version is now almost 2 years old, so I am not entirely sure of my answer, but I think that at that time failure detection relied on the TCP socket timeout. Unfortunately this is a system-level parameter, which you can change (if you have the necessary rights) via net.ipv4.tcp_keepalive_time. The new failure detector went in in early 2018, so updating might be necessary.
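
To get a feeling for why a rebooted node can look like a hang, here is a rough Python sketch that estimates how long Linux TCP keepalive needs before it gives up on a silent peer. It assumes Linux, assumes keepalive is actually enabled on the sockets in question (not confirmed here for that ULFM version), and falls back to the stock kernel defaults when /proc is not readable. The function names are made up for illustration.

```python
from pathlib import Path

# Stock Linux defaults, used as a fallback when /proc cannot be read.
LINUX_DEFAULTS = {
    "tcp_keepalive_time": 7200,   # idle seconds before the first probe
    "tcp_keepalive_intvl": 75,    # seconds between unanswered probes
    "tcp_keepalive_probes": 9,    # unanswered probes before giving up
}


def read_keepalive_settings(proc_dir="/proc/sys/net/ipv4"):
    """Read the three keepalive sysctls, falling back to the defaults."""
    settings = {}
    for name, default in LINUX_DEFAULTS.items():
        try:
            settings[name] = int((Path(proc_dir) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = default
    return settings


def worst_case_detection_seconds(settings):
    """Idle time plus the full run of unanswered probes."""
    return (settings["tcp_keepalive_time"]
            + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"])


if __name__ == "__main__":
    current = read_keepalive_settings()
    seconds = worst_case_detection_seconds(current)
    print(f"keepalive settings: {current}")
    print(f"a silent peer is declared dead after about {seconds} s")
```

With the stock defaults this gives 7200 + 75 * 9 = 7875 seconds, a bit over two hours, which from the outside is hard to tell apart from a hang. A process killed with 'kill -9' is different: the kernel on that node is still alive and closes the socket, so the peer is notified right away.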

If you update to a more recent version (today's HEAD is 1a657f7), you can configure the failure detector by adding the following to the ${HOME}/.openmpi/mca-params.conf file:
mpi_ft_detector_period = 10
mpi_ft_detector_timeout = 30
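
If you would rather script this than edit the file by hand, a minimal Python sketch could look like the following. It only assumes the simple "name = value" line format shown above; the helper name set_mca_params is hypothetical and not part of Open MPI or ULFM, and the two values are just the ones quoted above (presumably seconds).

```python
from pathlib import Path

# The two detector parameters and values quoted above.
FT_DETECTOR_PARAMS = {
    "mpi_ft_detector_period": 10,
    "mpi_ft_detector_timeout": 30,
}


def set_mca_params(text, params):
    """Return `text` with every parameter in `params` defined exactly once.

    Existing definitions are rewritten in place, repeated definitions are
    dropped, and missing parameters are appended at the end. Comments and
    unrelated lines are left untouched.
    """
    pending = dict(params)
    out = []
    for line in text.splitlines():
        name = line.split("=", 1)[0].strip() if "=" in line else None
        if name in params:
            if name in pending:
                out.append(f"{name} = {pending.pop(name)}")
            # a repeated definition is dropped so the new value wins
            continue
        out.append(line)
    for name, value in pending.items():
        out.append(f"{name} = {value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    conf = Path.home() / ".openmpi" / "mca-params.conf"
    old = conf.read_text() if conf.exists() else ""
    # Print the proposed content instead of writing it, so nothing is
    # changed until you have looked at it.
    print(set_mca_params(old, FT_DETECTOR_PARAMS), end="")
```

Run as a script, it only prints the proposed file content; redirect it into the file yourself once it looks right.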

  George.

