Question about failure detection

Lucas Baptista De Moraes

Mar 5, 2020, 4:17:44 PM
to User Level Fault Mitigation
Hello, I would like to know whether failure detection occurs on a specific rank, whether the rank is chosen at random, or whether there is a way to set which rank does that job.

In the example "bag4.c", if I raise an error in rank == size-1, the failure notification always comes from rank 0, and the output is:

Rank 0 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 7 }
Pi from Rank 0: 3.1419216000


But if I raise an error in rank == 0, the output is:

Rank 2 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 2 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 5 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 6 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 6 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 1 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 7 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 1 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 3 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 3 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 4 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 4 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 7 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }
Rank 5 / 8: Notified of error MPI_ERR_PROC_FAILED: Process Failure. 1 found dead: { 0 }

Why do I get this output in the second case? Why do all the ranks raise a failure notification when the failure happens in rank 0, while only rank 0 reports it when the failure isn't in rank 0?

bag4.c

George Bosilca

Mar 5, 2020, 4:27:16 PM
to ul...@googlegroups.com
For scalability reasons, a detected failure is reported only on processes that have ongoing communications that could be matched with communications from the dead process. In your example, the communication scheme is a star: process 0 is connected to every other process, while the other processes never communicate among themselves. Thus, rank 0 will always detect the failure of any other process, whereas the other processes can only detect the failure of rank 0.
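
The reporting rule for a star pattern can be illustrated with a small toy model in plain C. This is only a sketch of the reasoning, not ULFM's detector and not code from bag4.c: it uses no MPI calls, and the helper names star_peers, is_notified and report_failure are hypothetical. It assumes that a surviving rank reports a failure only if it exchanges messages with the dead rank, and that rank 0 is the center of the star.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model (assumption): ranks a and b exchange messages only when
 * one of them is rank 0, i.e. a star centered on rank 0. */
static bool star_peers(int a, int b)
{
    return a != b && (a == 0 || b == 0);
}

/* Toy model (assumption): a surviving rank is notified of the failure
 * of `dead` only if it communicates with `dead`. */
static bool is_notified(int rank, int dead)
{
    return rank != dead && star_peers(rank, dead);
}

/* Print every rank that would report the failure of `dead` in a job of
 * `size` ranks, and return how many ranks that is. */
static int report_failure(int size, int dead)
{
    assert(size > 0 && dead >= 0 && dead < size);

    int count = 0;
    for (int rank = 0; rank < size; rank++) {
        if (is_notified(rank, dead)) {
            printf("Rank %d / %d: would report dead rank %d\n",
                   rank, size, dead);
            count++;
        }
    }
    return count;
}
```

In this model, report_failure(8, 7) lists only rank 0, whereas report_failure(8, 0) lists ranks 1 through 7, which is the same set of reporting ranks as in the two outputs quoted in the question.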

  George.

