Failure detection mechanism -- infinite loop detected as failure

Lukas

Jan 15, 2020, 10:58:21 AM
to User Level Fault Mitigation
Hey all,

when using ULFM2.1rc1, I've sent one rank into an infinite loop (while (42) ;). The other ranks seem to think that it failed in under a second, and they succeed in building a new communicator without it. I've tested this on a single-processor machine using mpiexec -n 4 program-name. I guess this is a very unrealistic scenario, but I was wondering whether something like this could happen in a real application -- for example, if one iteration between two MPI calls takes multiple minutes to complete?
What is happening here? I keep reading about a "heartbeat" signal -- who is sending and receiving this signal? Does this only work if some MPI function is called before the heartbeat timeout is reached, or is a separate thread used for this? How are the mpi_ft_detector_thread and mpi_ft_detector parameters involved?
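
For reference, my test looks roughly like the sketch below. The recovery part (MPIX_Comm_revoke/MPIX_Comm_shrink) is just my reading of the ULFM examples, so take it as an approximation rather than my exact code:

    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1)
            while (42) ;   /* rank 1 spins forever without entering MPI */

        /* The other ranks enter a collective; if the detector declares
         * rank 1 dead, this returns an error instead of MPI_SUCCESS. */
        if (MPI_Barrier(MPI_COMM_WORLD) != MPI_SUCCESS) {
            MPI_Comm newcomm;
            MPIX_Comm_revoke(MPI_COMM_WORLD);           /* propagate the failure */
            MPIX_Comm_shrink(MPI_COMM_WORLD, &newcomm); /* rebuild without rank 1 */
            int n;
            MPI_Comm_size(newcomm, &n);
            printf("rank %d: new communicator has %d ranks\n", rank, n);
        }

        /* Note: rank 1 never exits, so the job may need to be killed by hand. */
        MPI_Finalize();
        return 0;
    }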

Regards,
Lukas

George Bosilca

Jan 15, 2020, 11:15:07 AM
to ul...@googlegroups.com
This is indeed a known problem for systems without asynchronous communications. Keep in mind that the MPI standard does not mandate communication progress outside MPI calls, which means that when you block a process outside any MPI call, there is no guarantee that any communication will be answered, including the fault-detection traffic. That being said, there are a few solutions to this issue:

1. Increase the fault-detector timeout to a small multiple of the longest time during which your application will be unresponsive (see the example command line after this list).
2. Use a communication thread, or ask OMPI/ULFM to provide asynchronous progress (you will need to dedicate some resources and accept the potential performance impact of this).
3. Use a more recent version of ULFM, where the fault detector is located outside the MPI library (in the PRRTE/ORTE daemon).
4. Don't do "while (42) ;" ;)
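
For option 1, the invocation would look something like this (I am assuming the mpi_ft_detector_timeout parameter name here; ompi_info --all will show the exact names and defaults in your build):

    # tolerate up to 5 minutes without MPI activity before suspecting a process
    mpiexec -n 4 --mca mpi_ft_detector_timeout 300.0 ./program-name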

Hope this helps,
  George.



Lukas

Jan 15, 2020, 11:52:29 AM
to ul...@googlegroups.com
Hey George,

first of all, thank you very much for your fast response! I'll certainly try 1. and 2.
Regarding 3.: maybe I misunderstood the ULFM version naming convention. I used the ULFM version I found a few days ago (dated 12 Nov 2019) at https://bitbucket.org/icldistcomp/ulfm2/downloads/?tab=downloads
Is there a more recent version?
Thanks also for your hint about the ORTE daemon. The error occurs only when --mca mpi_ft_detector true is set. After handing fault detection over to ORTE with --mca mpi_ft_detector false, the problem no longer exists. So now I know what this switch is for :-)
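
For the record, the two runs I compared were essentially:

    mpiexec -n 4 --mca mpi_ft_detector true  ./program-name   # rank declared dead in under a second
    mpiexec -n 4 --mca mpi_ft_detector false ./program-name   # detection left to the runtime daemon; no false positive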

- Lukas

Aurelien Bouteiller

Jan 16, 2020, 11:19:19 AM
to ul...@googlegroups.com
Lukas, 

The most recent release is tagged 4.0.2u1 (meaning it’s the first release of ULFM based on Open MPI 4.0.2).

Setting the detector to false altogether works with some network transports (e.g., TCP), but not in all cases. If you observe cases where failures are not detected at all, you can also run the detector in a separate thread: --mca mpi_ft_detector_thread true. This will raise the latency to the same level as if you had initialized MPI with MPI_THREAD_MULTIPLE.
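
For example (assuming you keep the in-MPI detector enabled, as in your earlier run):

    mpiexec -n 4 --mca mpi_ft_detector true --mca mpi_ft_detector_thread true ./program-name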


Best,
Aurelien

Aurelien Bouteiller

Jan 16, 2020, 11:30:03 AM
to ul...@googlegroups.com

As an addendum, we have documented the most important ULFM runtime flags in this blog post:

https://fault-tolerance.org/2019/11/18/ulfm-4-0-2u1/#Run-time_tuning_knobs

Best,
Aurelien