--
You received this message because you are subscribed to the Google Groups "User Level Fault Mitigation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ulfm+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/28b69c04-5b22-4e14-aaf9-787d22fd8085%40googlegroups.com.
Thank you for the quick reply and the references! Unfortunately, I haven't
managed to read them completely yet.
> I am not sure I understand what you mean by "not yet at the right
> operation" with regard to fault discovery.
Given the MPI API, you cannot tell from the outside (as far as I can
see) whether a node is faulty or just slow. It might simply not have
reached, e.g., the broadcast yet, but since no regular calls into MPI
are required, its progress cannot be observed.
> The fault discovery is not a collective call in the sense of the MPI
> standard, it is done independently of whatever the application itself
> is doing. There are many ways to detect hard process faults, either
> between processes themselves or relying on external entities (such as
> the runtime daemons).
I was only thinking about detecting failures when needed, i.e., during
an operation. Relying on an external fault detector is of course an
option (one that ULFM allows).
> Talking specifically about our ULFM implementation, I attached below 2
> links to recent papers about this topic, one having the detection in the
> processes themselves [2] and one with the failure detector externalized in
> the runtime system (PMIx in this particular instance).
Can you tell me what is done in OpenMPI?
The reason I am asking is that I implemented some fault-tolerant
collective operations (along the lines of [1]) that need a separate
thread to distinguish slow processes from faulty ones. This
is obviously not perfect for fault tolerance (as faults could occur
independently in those threads), and it would help me to be able to say that
others can't conjure up this information either.
Martin
[1] http://htor.inf.ethz.ch/publications/img/corrected_trees.pdf
I have now read your references. Thanks again!
> Indeed, without external help (either a thread or a daemon at the node
> level) it is difficult to prevent false positive reports of failures for
> processes that are away from MPI functions for too long (this is mainly due
> to the lack of progress in MPI outside MPI calls). Mitigation techniques
> exist, however, one being proposed and implemented in [2].
The paper doesn't seem to talk about integration into an existing application at all,
or I missed that. Instead, it describes a fault detector by itself. While that
is a worthwhile task, it doesn't cover what I was thinking about.
If a fault detector is implemented at the library level, there must be a non-application
thread that (among other things) sends and receives heartbeats, right?
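To make the question concrete, here is a minimal sketch of what I mean by a non-application heartbeat thread. It is a plain Python simulation, not MPI or ULFM code, and the names `HeartbeatMonitor` and `start_heartbeat` are made up for illustration. The assumption is that a peer counts as suspected once its last heartbeat is older than a fixed timeout, so a process that is merely busy outside the communication library stays unsuspected as long as its helper thread keeps beating.

```python
import threading
import time


class HeartbeatMonitor:
    """Remembers the last heartbeat per peer; a peer silent for longer than `timeout` is suspected."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_seen = {}
        self._lock = threading.Lock()

    def record(self, peer, now=None):
        # `now` can be injected for deterministic experiments; otherwise use the monotonic clock.
        with self._lock:
            self._last_seen[peer] = time.monotonic() if now is None else now

    def suspected(self, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            return sorted(p for p, t in self._last_seen.items() if now - t > self.timeout)


def start_heartbeat(monitor, peer, interval, stop):
    """Start a helper thread that reports `peer` as alive every `interval` seconds until `stop` is set."""
    monitor.record(peer)

    def beat():
        # Event.wait returns False on timeout and True once `stop` has been set.
        while not stop.wait(interval):
            monitor.record(peer)

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    return thread
```

With injected timestamps, a peer whose helper thread keeps calling `record` is never listed by `suspected`, while a peer that stopped beating shows up once the timeout has passed. The weak point is the one mentioned earlier: the helper thread can itself fail or stall independently of the application thread.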
> The solutions implemented in ULFM are deterministic, unlike the
> gossip-based algorithm proposed in your paper. A more recent, extended
> version of [2] I referenced earlier, does a comparison with one of the
> versions of your gossip algorithm (I don't remember which one of the
> corrections over the initial algorithm we compared with), and details the
> distinctions between the classes of algorithms and about the specific
> implementations.
The paper is not about the gossip-based algorithm. It is based on the algorithm you
might remember [2], but it combines the corrections with a tree phase so that it is
deterministic in the end (for simple correction; there are options that keep sending to
successive nodes until reaching one that has already sent a correction). I am currently
working on extending the idea to all-to-one and all-to-all communication. The goal
is different from what ULFM requires: communication should complete successfully if
only a limited number of nodes have died (within a time window, if information about failures
is spread).
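As a rough illustration of the idea (tree phase followed by simple correction), here is a small Python simulation. It is only a sketch under my own simplifying assumptions, not the algorithm from the paper: the tree is a binary tree in heap numbering rooted at rank 0, the root is alive, only ranks reached in the tree phase send corrections, and each of them sends to a fixed number of ring successors. The function names are hypothetical.

```python
def tree_phase(n, failed):
    """Broadcast from rank 0 down a binary tree; failed ranks neither receive nor forward."""
    reached = set()
    frontier = [0]
    while frontier:
        rank = frontier.pop()
        if rank in failed:
            continue  # everything below a failed rank is cut off from the tree
        reached.add(rank)
        frontier.extend(c for c in (2 * rank + 1, 2 * rank + 2) if c < n)
    return reached


def correction_phase(n, failed, reached, distance):
    """Every rank reached by the tree sends the message to its next `distance` ring successors."""
    corrected = set(reached)
    for rank in reached:
        for step in range(1, distance + 1):
            target = (rank + step) % n
            if target not in failed:
                corrected.add(target)
    return corrected
```

For example, with 8 ranks and rank 1 failed, the tree phase reaches only ranks 0, 2, 5 and 6. A correction distance of 1 then still misses rank 4, while a distance of 2 reaches every live rank. How far the corrections have to go depends on how the tree maps onto the ring, which this sketch does not try to optimize.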
Martin