Simulate failed ranks

Martin Küttler

Aug 12, 2021, 6:04:11 AM

to User Level Fault Mitigation

Hi,

for benchmarking I would like to simulate failed ranks. My idea is to have rank 0 decide on a bitmap after MPI_Init (that is simple), distribute it with MPI_Scatter, and then have every rank that is supposed to simulate a failure return early from all MPI_* calls. That also seems easy. The issue I see is that, for ULFM to work correctly, the fault detector has to report information in accordance with this flag. In particular, for a rank that is alive but has to simulate being dead, the respective fault detector should report that the rank has failed. Is that (easily) possible?
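A minimal sketch of the distribution step described above, under stated assumptions: the helper `choose_failed`, the seed, the choice to fail a quarter of the ranks, and the `USE_MPI` build switch are hypothetical and do not come from the thread. One `int` per rank is scattered instead of a packed bitmap to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: mark `nfail` distinct ranks out of `nranks` as
   "failed" (flags[r] == 1). Rank 0 is never chosen, so the root survives. */
static void choose_failed(int *flags, int nranks, int nfail, unsigned seed)
{
    assert(flags != NULL && nranks > 0);
    for (int r = 0; r < nranks; r++)
        flags[r] = 0;
    if (nfail > nranks - 1)
        nfail = nranks - 1;          /* at most every rank except the root */
    srand(seed);
    int chosen = 0;
    while (chosen < nfail) {
        int r = 1 + rand() % (nranks - 1);
        if (!flags[r]) {
            flags[r] = 1;
            chosen++;
        }
    }
}

#ifdef USE_MPI                       /* assumption: build with -DUSE_MPI */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *flags = NULL;               /* send buffer only matters on the root */
    if (rank == 0) {
        flags = malloc((size_t)size * sizeof *flags);
        choose_failed(flags, size, size / 4, 42u);
    }

    int simulate_dead = 0;           /* this rank's entry of the bitmap */
    MPI_Scatter(flags, 1, MPI_INT, &simulate_dead, 1, MPI_INT,
                0, MPI_COMM_WORLD);
    free(flags);

    /* ... a rank with simulate_dead != 0 would now skip communication ... */

    MPI_Finalize();
    return 0;
}
#endif
```

This covers only the "decide and scatter" part; it does nothing to make the fault detector agree with the flag, which is the open question of the post.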

Regards,

Martin Küttler

George Bosilca

Aug 17, 2021, 2:42:03 PM

to ul...@googlegroups.com

Martin,

It sounds like an interesting approach, but from the software perspective it will be extremely complex to achieve, requiring a deep rewrite of many components. We also have the issue that confirming a process is dead is a global decision, one that cannot be undone easily, at least from the point of view of the MPI library. There are at least two aspects to this:

1. For the other processes: the same process (and here I am referring to the unique name given to that process, the guid provided by the runtime) will be able to reappear with different MPI rank information. This basically means that the guid cannot be used as a key in hash tables, which is something many of the internal components in OMPI are doing.

2. From the process's point of view: the name change operation will basically require the reinitialization of the entire set of components, because many of them take the guid into account. As an example, the shared memory BTL uses the jobid (which is part of the guid) to create and attach to named shared memory regions.

Hope this helps,

George.

--
You received this message because you are subscribed to the Google Groups "User Level Fault Mitigation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ulfm+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/3d3203f4-c9d3-4e22-99a4-c0c780149a4dn%40googlegroups.com.

Martin Küttler

Aug 18, 2021, 6:53:05 AM

to ul...@googlegroups.com

Hi,

if what I intended to do doesn't work (easily), I will do something else, so there's no problem there. But I'd like to clarify one point that was apparently unclear. I do not intend to simulate processes becoming alive again. In fact, even having processes simulate a failure during the execution is unnecessary. All I imagine is for the root to decide who is alive and who has failed for this execution. Distributing this information and having the ranks follow this flag (e.g., by preloading a library that returns from all communication calls if the current rank is "dead") is easy. What I don't know is whether the failure detector can be made to respect the flag. Maybe the easiest way would be to have the MPI implementation not send heartbeats or not answer the corresponding messages. I assume there must be such a mechanism. If it could be turned off, the failure detector would eventually detect that the rank has failed.

Thinking about this, maybe it's easier to pass the decision of who is alive at startup of a rank. If that is so, having the root process decide the alive-bitmap at runtime is not necessary. It would be sufficient to compute this decision offline, so that it can be read at startup of the ranks.
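A rough sketch of the offline variant combined with the preloaded library mentioned above, under stated assumptions: the environment variable `SIM_DEAD_RANKS`, its format (a string of '0'/'1' characters indexed by rank), the helper `rank_is_dead`, and the `USE_MPI` build switch are all hypothetical. The wrappers rely on the standard PMPI profiling interface; only `MPI_Init` and `MPI_Send` are intercepted here, while a real library would have to cover every communication call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `rank` is marked dead in `bitmap`, a string of '0'/'1'
   characters indexed by rank (e.g. "0100" marks rank 1 as dead).
   A NULL bitmap or a rank past the end of the string counts as alive. */
static int rank_is_dead(const char *bitmap, int rank)
{
    assert(rank >= 0);
    if (bitmap == NULL || (size_t)rank >= strlen(bitmap))
        return 0;
    return bitmap[rank] == '1';
}

#ifdef USE_MPI                       /* assumption: build with -DUSE_MPI */
#include <mpi.h>
#include <stdlib.h>

static int sim_dead = 0;             /* set once, right after initialization */

/* Intercept MPI_Init to look up this rank in the precomputed bitmap. */
int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    if (rc == MPI_SUCCESS) {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        sim_dead = rank_is_dead(getenv("SIM_DEAD_RANKS"), rank);
    }
    return rc;
}

/* A "dead" rank returns immediately instead of communicating. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    if (sim_dead)
        return MPI_SUCCESS;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}
#endif
```

Such a wrapper only silences the rank. By itself it does not make the failure detector report the rank as failed, which is the part the post asks about.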

George Bosilca

Aug 18, 2021, 6:22:43 PM

to ul...@googlegroups.com

OK, so let's make sure we are all on the same page. If what you want is for one process (root in your case) to define a schedule describing when processes will simulate their death, then this should be easy. Just let all processes receive the schedule, and then each one of them can quit according to it. If they quit using raise (as in the ULFM examples), then there is no additional management to do: the fault detector will do its job and the fault information will be correctly propagated.
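The schedule-based approach might be sketched as follows. The schedule layout (one entry per rank holding the iteration at which that rank dies, or -1 for "never"), the helper `should_die_now`, the example policy, the iteration count, and the `USE_MPI` build switch are assumptions made for illustration. The thread only says "raise"; `SIGKILL` is assumed here as the signal.

```c
#include <assert.h>
#include <stddef.h>

/* schedule[r] is the iteration at which rank r simulates its death,
   or -1 if rank r is meant to survive the whole run. */
static int should_die_now(const int *schedule, int rank, int iter)
{
    assert(schedule != NULL && rank >= 0);
    return schedule[rank] >= 0 && schedule[rank] == iter;
}

#ifdef USE_MPI                       /* assumption: build with -DUSE_MPI */
#include <mpi.h>
#include <signal.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Survivors need errors returned rather than aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *schedule = malloc((size_t)size * sizeof *schedule);
    if (rank == 0) {
        /* Example policy (hypothetical): only the last rank dies,
           at iteration 2; the root itself always survives. */
        for (int r = 0; r < size; r++)
            schedule[r] = -1;
        if (size > 1)
            schedule[size - 1] = 2;
    }
    MPI_Bcast(schedule, size, MPI_INT, 0, MPI_COMM_WORLD);

    for (int iter = 0; iter < 10; iter++) {
        if (should_die_now(schedule, rank, iter))
            raise(SIGKILL);          /* the process really dies here */
        /* ... benchmarked communication and ULFM recovery go here ... */
    }

    free(schedule);
    MPI_Finalize();
    return 0;
}
#endif
```

Because the process actually terminates, nothing has to be faked towards the fault detector. The recovery logic on the surviving ranks is omitted from this sketch.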

But this looks too easy, I might have misunderstood something.

George.

Martin Küttler

Aug 19, 2021, 5:04:58 AM

to ul...@googlegroups.com

This sure looks easy, but I think it matches my intentions. I'm contemplating details, but so far I haven't found any that don't seem solvable. So I thank you very much, and I guess I will come back with further questions, if any arise.