Detection of failed nodes

Martin Küttler

Jan 27, 2020, 9:21:23 AM
to User Level Fault Mitigation
Hi,

I have a question: how is the detection of failed nodes done?
Specifically, how are failed nodes distinguished from ones that have not
reached the given operation yet? Are there any references supporting the
design decision? I know the ULFM standard doesn't say anything about
this, but an actual implementation must do something.

I'm trying to implement something similar to part of ULFM, where the
same issue comes up.

Martin

George Bosilca

Jan 27, 2020, 11:25:12 AM
to ul...@googlegroups.com
Martin,

I am not sure I understand what you mean by "not yet at the right operation" with regard to fault discovery. The fault discovery is not a collective call in the sense of the MPI standard; it is done independently of whatever the application itself is doing. There are many ways to detect hard process faults, either between processes themselves or relying on external entities (such as the runtime daemons).
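
To give a feeling for how this surfaces at the application level, here is a minimal sketch using the MPIX_ interfaces from the ULFM proposal (the function name and the recovery policy are only illustrative; a real application would decide for itself how to react):

    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM extensions in Open MPI: MPIX_Comm_shrink, ... */

    /* Sketch: the detector runs underneath MPI; the application only sees
     * its verdict as an error class on the operations it posts. */
    int resilient_bcast(void *buf, int count, MPI_Datatype type, int root,
                        MPI_Comm *comm)
    {
        /* Have failures reported as return codes instead of aborting. */
        MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);

        int rc = MPI_Bcast(buf, count, type, root, *comm);
        int eclass = MPI_SUCCESS;
        if (rc != MPI_SUCCESS)
            MPI_Error_class(rc, &eclass);

        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
            /* The detector (in-library or runtime-based) declared a peer
             * dead; exclude the failed processes and let the caller retry. */
            MPI_Comm newcomm;
            MPIX_Comm_revoke(*comm);            /* interrupt stragglers   */
            MPIX_Comm_shrink(*comm, &newcomm);  /* agree on the survivors */
            MPI_Comm_free(comm);
            *comm = newcomm;
        }
        return rc;
    }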

Talking specifically about our ULFM implementation, I attached below two links to recent papers on this topic: one with the detection in the processes themselves [2] and one with the failure detector externalized in the runtime system (PMIx in this particular instance).

  George.




Martin Küttler

Jan 28, 2020, 6:31:32 AM
to User Level Fault Mitigation
Hi George,

thank you for the quick reply and the references! Unfortunately I
haven't had the chance to read them completely yet.

> I am not sure I understand what you mean by "not yet at the right
> operation" with regard to fault discovery.

Given the API of MPI, you cannot tell from the outside (as far as I can
see) whether a node is faulty or just slow. It might simply not have
reached the broadcast (say) yet, and since the application is not
required to call into MPI regularly, its progress cannot be observed.

> The fault discovery is not a collective call in the sense of the MPI
> standard; it is done independently of whatever the application itself
> is doing. There are many ways to detect hard process faults, either
> between processes themselves or relying on external entities (such as
> the runtime daemons).

I was only thinking about detecting failures when needed, i.e., during
an operation. Relying on an external fault detector is of course an
option (that is allowed by ULFM).

> Talking specifically about our ULFM implementation, I attached below two
> links to recent papers on this topic: one with the detection in the
> processes themselves [2] and one with the failure detector externalized
> in the runtime system (PMIx in this particular instance).

Can you tell me what is done in OpenMPI?

The reason I am asking is that I have implemented some fault-tolerant
collective operations (along the lines of [1]) that need a separate
thread to distinguish slow processes from faulty ones. This is obviously
not ideal for fault tolerance (the detector threads can themselves fail
independently), and it would help me to be able to argue that others
cannot obtain this information any other way either.

Martin

[1] http://htor.inf.ethz.ch/publications/img/corrected_trees.pdf

George Bosilca

Jan 28, 2020, 11:18:02 AM
to ul...@googlegroups.com
Martin,

Indeed, without external help (either a thread or a daemon at the node level) it is difficult to prevent false-positive failure reports for processes that stay away from MPI functions for too long (mainly because MPI makes no progress outside MPI calls). Mitigation techniques exist, however; one is proposed and implemented in [2].
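
To illustrate the basic idea behind such a thread-based detector (a deliberately simplified sketch, not the algorithm from [2]; the ring topology matches the papers, but the tag and timing constants here are made up):

    /* Simplified heartbeat-detector thread: each rank pings its ring
     * successor every PERIOD seconds and suspects its predecessor after
     * TIMEOUT seconds of silence. Run it on a communicator obtained via
     * MPI_Comm_dup so heartbeats never match application traffic, and
     * initialize MPI with MPI_THREAD_MULTIPLE. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    #define HB_TAG  4242   /* tag reserved for heartbeats (arbitrary)   */
    #define PERIOD  1.0    /* heartbeat emission period, seconds        */
    #define TIMEOUT 3.0    /* silence before suspecting the predecessor */

    static volatile int stop = 0;   /* set by the application at shutdown */

    void *detector_thread(void *arg)
    {
        MPI_Comm comm = *(MPI_Comm *)arg;  /* private, duplicated comm */
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        const int succ = (rank + 1) % size;
        const int pred = (rank - 1 + size) % size;

        double last_recv = MPI_Wtime(), last_sent = 0.0;
        while (!stop) {
            double now = MPI_Wtime();
            if (now - last_sent >= PERIOD) {        /* emit a heartbeat */
                MPI_Send(NULL, 0, MPI_BYTE, succ, HB_TAG, comm);
                last_sent = now;
            }
            int flag = 0;                           /* drain heartbeats */
            MPI_Iprobe(pred, HB_TAG, comm, &flag, MPI_STATUS_IGNORE);
            if (flag) {
                MPI_Recv(NULL, 0, MPI_BYTE, pred, HB_TAG, comm,
                         MPI_STATUS_IGNORE);
                last_recv = now;
            } else if (now - last_recv > TIMEOUT) {
                /* Predecessor silent for too long: report it as suspect.
                 * The detector and the observed application thread only
                 * fail together if the whole node dies -- which is exactly
                 * why a node-local thread (or daemon) suffices here. */
                fprintf(stderr, "rank %d suspects rank %d\n", rank, pred);
                last_recv = now;   /* real code would propagate and act */
            }
            usleep(10000);         /* 10 ms nap; don't burn the core    */
        }
        return NULL;
    }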

In OMPI we have what the two papers describe, i.e., a version at the MPI library level ([2]) and a version in the PMIx runtime ([1]), plus a full set of MCA parameters to select the type of detection, the timeout, the number of retries, and so on. We published a blog entry on this topic yesterday [3].

The solutions implemented in ULFM are deterministic, unlike the gossip-based algorithm proposed in your paper. A more recent, extended version of [2], which I referenced earlier, compares against one of the versions of your gossip algorithm (I don't remember which of the corrections over the initial algorithm we compared with) and details the distinctions between the classes of algorithms and between the specific implementations.

George



Martin Küttler

Jan 29, 2020, 5:30:17 AM
to User Level Fault Mitigation
George,

I have now read your references. Thanks again!

> Indeed, without external help (either a thread or a daemon at the node
> level) it is difficult to prevent false-positive failure reports for
> processes that stay away from MPI functions for too long (mainly
> because MPI makes no progress outside MPI calls). Mitigation
> techniques exist, however; one is proposed and implemented in [2].

The paper doesn't seem to talk about integration into an existing
application at all, or I missed that. Instead it describes a fault
detector in its own right. While that is a worthwhile task, it doesn't
cover what I was thinking about.

If there is a failure detector (FD) implemented at the library level,
there must be a non-application thread that (among other things) sends
and receives heartbeats, right?

> The solutions implemented in ULFM are deterministic, unlike the
> gossip-based algorithm proposed in your paper. A more recent, extended
> version of [2], which I referenced earlier, compares against one of
> the versions of your gossip algorithm (I don't remember which of the
> corrections over the initial algorithm we compared with) and details
> the distinctions between the classes of algorithms and between the
> specific implementations.

The paper is not about a gossip-based algorithm. It builds on the
algorithm you might remember [2], but combines the corrections with a
tree phase so that the result is deterministic (for simple correction;
there are variants that keep sending to successive nodes until one
message reaches a node that has already sent a correction). I am
currently working on extending the idea to all-to-one and all-to-all
communications. The goal is different from what ULFM requires:
communication should complete successfully as long as only a limited
number of nodes have died (within a time window, if information about
failures is spread).
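
To make the correction idea concrete, roughly (a simplified sketch of
the pattern, not the exact algorithm from the paper; the tag and the
handling of failed peers are glossed over):

    /* After an (unreliable) tree broadcast, every rank forwards the
     * message to its k immediate successors in rank order, so the data
     * still reaches everyone if at most k consecutive ranks die. In the
     * fault-tolerant version, operations involving dead peers would
     * return MPIX_ERR_PROC_FAILED and be skipped; omitted here. */
    #include <mpi.h>
    #include <stdlib.h>

    #define CORR_TAG 7   /* tag for correction messages (arbitrary) */

    void corrected_bcast(void *buf, int count, MPI_Datatype type,
                         int root, int k, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Phase 1: ordinary tree broadcast. With failures, the whole
         * subtree below a dead process would miss the message. */
        MPI_Bcast(buf, count, type, root, comm);

        /* Phase 2: correction along the ring. A rank that missed
         * phase 1 would take its payload from the first correction
         * message that arrives. */
        MPI_Aint lb, extent;
        MPI_Type_get_extent(type, &lb, &extent);
        char *scratch = malloc((size_t)k * count * extent);
        MPI_Request *reqs = malloc(2 * (size_t)k * sizeof *reqs);

        int n = 0;
        for (int d = 1; d <= k; d++)   /* match the k incoming corrections */
            MPI_Irecv(scratch + (size_t)(d - 1) * count * extent, count,
                      type, (rank - d + size) % size, CORR_TAG, comm,
                      &reqs[n++]);
        for (int d = 1; d <= k; d++)   /* correct the next k ranks */
            MPI_Isend(buf, count, type, (rank + d) % size, CORR_TAG,
                      comm, &reqs[n++]);
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);

        free(reqs);
        free(scratch);
    }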

Martin

[2] https://ieeexplore.ieee.org/document/7967125

George Bosilca

Jan 29, 2020, 10:05:51 AM
to ul...@googlegroups.com
On Wed, Jan 29, 2020 at 5:30 AM Martin Küttler <martin....@gmail.com> wrote:
> George,
>
> I have now read your references. Thanks again!
>
> > Indeed, without external help (either a thread or a daemon at the node
> > level) it is difficult to prevent false-positive failure reports for
> > processes that stay away from MPI functions for too long (mainly
> > because MPI makes no progress outside MPI calls). Mitigation
> > techniques exist, however; one is proposed and implemented in [2].
>
> The paper doesn't seem to talk about integration into an existing
> application at all, or I missed that. Instead it describes a fault
> detector in its own right. While that is a worthwhile task, it doesn't
> cover what I was thinking about.

Both papers I referenced have an extensive experimental section, analysing the different aspects of the failure detector, including the network impact and the latency of detection. Both are based on ULFM implementations, and both make clear references to the git hash of ULFM used for experiments, where the fault detectors described in the papers are available for open access.

We did not show results for a particular application integration; instead we took advantage of the integration in the ULFM stack to analyse the portability of the detector and to expose the performance benefits that all applications running over ULFM would seamlessly inherit.

In addition, the second paper includes, in Section 5.1, a discussion of the need for thread support to obtain an accurate detector.
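
For reference, an in-library detector thread that makes MPI calls concurrently with the application requires full thread support, along these lines:

    /* The detector thread calls MPI concurrently with the application,
     * so the library must be initialized with MPI_THREAD_MULTIPLE. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* No full thread support: an in-library heartbeat thread cannot
         * safely make MPI calls; fall back to a runtime-level detector. */
    }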

> If there is a failure detector (FD) implemented at the library level,
> there must be a non-application thread that (among other things) sends
> and receives heartbeats, right?

> > The solutions implemented in ULFM are deterministic, unlike the
> > gossip-based algorithm proposed in your paper. A more recent, extended
> > version of [2], which I referenced earlier, compares against one of
> > the versions of your gossip algorithm (I don't remember which of the
> > corrections over the initial algorithm we compared with) and details
> > the distinctions between the classes of algorithms and between the
> > specific implementations.
>
> The paper is not about a gossip-based algorithm. It builds on the
> algorithm you might remember [2], but combines the corrections with a
> tree phase so that the result is deterministic (for simple correction;
> there are variants that keep sending to successive nodes until one
> message reaches a node that has already sent a correction). I am
> currently working on extending the idea to all-to-one and all-to-all
> communications. The goal is different from what ULFM requires:
> communication should complete successfully as long as only a limited
> number of nodes have died (within a time window, if information about
> failures is spread).

Thanks for the reference,
  George.

> Martin
>
> [2] https://ieeexplore.ieee.org/document/7967125

