Hello,
I have encountered several situations where an MPI program gets stuck in either MPI_Comm_shrink or MPI_Finalize. The details are below.
1. Version used: 4.0.2u1
(not using Open MPI 5 for now, as the program crashes when the prte process is gone).
2. Using MPI_Init_thread with MPI_THREAD_MULTIPLE.
(1) Run with 1 master + 4 slaves on the same machine, and kill all slaves except one before the shrink: the survivors get stuck in MPI_Comm_shrink (see simple.c), inside ompi_sync_wait_mt.
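For reference, here is a minimal sketch of such a reproducer (this is not the attached simple.c, just an illustration; it assumes the ULFM extensions from <mpi-ext.h>, where the shrink call is spelled MPIX_Comm_shrink):

    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM extensions, e.g. MPIX_Comm_shrink */
    #include <signal.h>
    #include <unistd.h>

    /* Launch with 5 ranks: rank 0 = master, ranks 1-4 = slaves. */
    int main(int argc, char *argv[])
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Process failures must raise errors instead of aborting the job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        if (rank > 1)
            raise(SIGKILL);  /* kill all slaves except one */
        sleep(2);            /* give the failures time to happen */

        MPI_Comm newcomm;
        MPIX_Comm_shrink(MPI_COMM_WORLD, &newcomm);  /* survivors hang here */

        MPI_Comm_free(&newcomm);
        MPI_Finalize();
        return 0;
    }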
To work around (1), I enabled mpi_ft_detector_thread.
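(e.g. via the usual MCA mechanism:

    mpirun -np 5 --mca mpi_ft_detector_thread true ./simple

or equivalently by exporting OMPI_MCA_mpi_ft_detector_thread=true.)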
This in turn leads to two new kinds of random hangs.
(2) The program gets stuck at MPI_Finalize -> ompi_comm_failure_detector_finalize.
This appears to be fixed after I added the volatile keyword to comm_detector_t->hb_observing.
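In patch form the change is roughly the following (field type and location quoted from memory, so they may not match the tree exactly):

    /* ompi/communicator/ft/comm_ft_detector.c, struct comm_detector_t */
    -    int hb_observing;
    +    volatile int hb_observing;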
(3) All processes get stuck in MPI_Comm_shrink (same as problem (1)) even when no process has failed. The chance of hanging appears to drop after I added volatile to two more variables; however, with a larger number of machines and longer running times, the program still eventually gets stuck at the same line.
The two variables whose volatile qualification appears to partially reduce the chance of hanging are:
1) ompi_status_public_t->MPI_ERROR
2) ompi_wait_sync_t->count
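In patch form (again with declarations quoted from memory; exact types and locations may differ):

    /* ompi/include/mpi.h, struct ompi_status_public_t */
    -    int MPI_ERROR;
    +    volatile int MPI_ERROR;

    /* opal/threads/wait_sync.h, the object ompi_sync_wait_mt() waits on */
    -    int32_t count;
    +    volatile int32_t count;

If these hangs are caused by unsynchronized cross-thread accesses, that would also explain why volatile only reduces them rather than eliminating them: volatile forces the compiler to reload the value but provides no atomicity or memory ordering, so real atomics (e.g. the opal_atomic_* wrappers) would presumably be the proper fix.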