Stuck in MPI_Comm_shrink


Kean Loon

May 4, 2024, 5:52:18 AM
to User Level Fault Mitigation
Hello,

I have encountered several situations where an MPI program gets stuck in either MPI_Comm_shrink or MPI_Finalize. The details are below.

1. Version used: 4.0.2u1 (I am not using Open MPI 5 for now, because the program crashes when the prte process is gone).
2. The program calls MPI_Init_thread with MPI_THREAD_MULTIPLE.

(1) Run with 1 master + 4 slaves on the same machine, and kill all slaves except one before the shrink. The program gets stuck in MPI_Comm_shrink (see simple.c), inside ompi_sync_wait_mt.

To work around (1), I enabled mpi_ft_detector_thread.
This in turn leads to two more cases where the program randomly gets stuck.

(2) The program gets stuck in MPI_Finalize -> ompi_comm_failure_detector_finalize.
This appears to be fixed after I added the volatile keyword to comm_detector_t->hb_observing.

(3) All processes get stuck in MPI_Comm_shrink (same as problem 1), even when no process has failed. The chance of the program getting stuck appears to drop after I added volatile to two more variables. However, the program still eventually gets stuck at the same line when run on more machines and for longer.

The two variables that appear to reduce the chance of getting stuck when declared volatile are:
1) ompi_status_public_t -> MPI_ERROR
2) ompi_wait_sync_t -> count
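
For context on why volatile may only partly help: in C, volatile stops the compiler from caching a read in a register, but it gives no ordering guarantee between threads. The sketch below is not Open MPI code. It is a minimal standalone illustration, with hypothetical names (payload, ready, run_handoff), of a flag handed from one thread to another using C11 release/acquire atomics, which is the kind of ordering a plain volatile flag does not provide. Whether a missing barrier is what actually causes the hangs above is only an assumption.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for illustration only; not Open MPI structures. */
static int payload;        /* plain data published by the writer thread */
static atomic_int ready;   /* completion flag polled by the waiting thread */

static void *writer(void *arg)
{
    (void)arg;
    payload = 42;  /* 1. write the data */
    /* 2. publish: release makes the payload write visible to any thread
     *    that later observes ready == 1 with an acquire load. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Starts the writer, spins until the flag is set, and returns the payload
 * value the waiting thread observed (or -1 if the thread cannot start). */
int run_handoff(void)
{
    pthread_t t;

    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);

    if (pthread_create(&t, NULL, writer, NULL) != 0)
        return -1;

    /* Busy-wait; the acquire load pairs with the writer's release store. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;

    int seen = payload;  /* ordered after the flag was observed */
    pthread_join(t, NULL);
    return seen;
}
```

Calling run_handoff() from a main() built with -pthread should return 42. With a volatile int flag instead of the atomic, the loop would still re-read the flag, but nothing would order the payload write against it on weakly ordered hardware.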


simple.c