Hi,
I am seeing abrupt crashes in a very simple program. Please see the attached file.
I am experimenting with a worst-case scenario of concurrent failures repeated one after another:
In every iteration, I kill the odd ranks and the even ranks recover from the failure. The process repeats until there is only one rank left.
The first iteration always works, but after that I see crashes at random. Occasionally all 3 iterations work. The crash does not happen in any particular function. valgrind shows some invalid reads/writes in MPI routines.
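To make the failure pattern concrete, here is a minimal sketch without any MPI calls. It is not the attached code; the function name is hypothetical, and it assumes survivors are renumbered 0..n-1 after each round.

```cpp
#include <cassert>

// Toy model (no MPI): in each round every odd-ranked process is killed and
// the surviving even ranks are renumbered 0..n-1.
// Returns the number of rounds needed to get down to a single rank.
int rounds_until_one_rank(int nranks) {
    assert(nranks >= 1);
    int rounds = 0;
    while (nranks > 1) {
        // Even ranks 0, 2, 4, ... survive: ceil(nranks / 2) of them.
        nranks = (nranks + 1) / 2;
        ++rounds;
    }
    return rounds;
}
```

For the 8 ranks used in the run below, `rounds_until_one_rank(8)` gives the 3 iterations mentioned above (8, then 4, then 2, then 1 rank).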
compile: mpicxx 128_recovery.cc -g
run: mpirun -np 8 -am ft-enable-mpi ./a.out
Logic:
1. Revoke the communicator with MPIX_Comm_revoke (this does not make any difference).
2. Loop on MPIX_Comm_failure_ack and MPIX_Comm_agree until all ranks have a consistent picture of the failures. (Is this wrong? I saw similar code in the ULFM guide.)
3. Call MPIX_Comm_failure_get_acked to get the group of failed ranks.
4. Translate the ranks from that group to ranks in the communicator.
5. Shrink the communicator.
6. Assign new ranks using MPI_Comm_split (just to ensure that the shrink does not change the rank order).
7. Call the "simulate" function again to iterate.
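The intent of steps 5 and 6 can be sketched without MPI as follows. This is a toy model, not the attached code: `renumber_survivors` is a hypothetical name, and the model assumes every survivor passes the same color and uses its pre-failure rank as the key to MPI_Comm_split, which orders new ranks by ascending key.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Toy model (no MPI): drop the failed ranks, then number the survivors in
// ascending key order, as MPI_Comm_split does within one color.
// Here key = old rank, so the original relative order is preserved.
// Returns, for each new rank, the old rank that now holds it.
std::vector<int> renumber_survivors(int nranks, const std::vector<int>& failed) {
    assert(nranks >= 1);
    std::vector<std::pair<int, int>> keyed;  // (key, old rank)
    for (int r = 0; r < nranks; ++r) {
        bool dead = std::find(failed.begin(), failed.end(), r) != failed.end();
        if (!dead) keyed.emplace_back(r, r);  // key is the old rank
    }
    // Keys are unique here, so sorting by key fully determines the order.
    std::sort(keyed.begin(), keyed.end());
    std::vector<int> old_rank_of;
    for (const auto& kr : keyed) old_rank_of.push_back(kr.second);
    return old_rank_of;
}
```

With 8 ranks and failed ranks {1, 3, 5, 7}, the survivors 0, 2, 4, 6 become new ranks 0, 1, 2, 3 in that order.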
The examples I have seen do the recovery in the error handler and then return to the code that was interrupted. I am not doing that: in the real application I want to restart my "timestep" after recovery, not continue execution from where it was interrupted. Hence the error handler calls the simulate routine again after every failure.
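A toy model of that control flow, again without MPI, is sketched below. It is not the attached code: `Trace`, `on_failure`, and this stand-in `simulate` are hypothetical, and the only point is that each restart from the handler nests one more simulate() call on top of the interrupted one.

```cpp
#include <cassert>

// Toy model (no MPI): simulate() hits a failure, the handler recovers and
// calls simulate() again instead of returning to the interrupted code.
struct Trace {
    int depth = 0;      // simulate() calls currently on the stack
    int max_depth = 0;  // deepest nesting observed
    int restarts = 0;   // times the handler restarted simulate()
};

void simulate(int nranks, Trace& t);

// Stand-in for the error handler: keep the even ranks, then restart.
void on_failure(int nranks, Trace& t) {
    int survivors = (nranks + 1) / 2;
    ++t.restarts;
    simulate(survivors, t);  // nested call: the interrupted frame is still live
}

void simulate(int nranks, Trace& t) {
    assert(nranks >= 1);
    ++t.depth;
    if (t.depth > t.max_depth) t.max_depth = t.depth;
    if (nranks > 1) {
        // Odd ranks die during this timestep; the handler takes over.
        on_failure(nranks, t);
        // Control comes back here only after the nested simulate() returns.
    }
    --t.depth;
}
```

Starting from `simulate(8, t)` with a fresh `Trace t`, the model records 3 restarts and a maximum nesting depth of 4.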
Can someone please help me spot any mistakes in my code?
Thanks,
Damodar.
P.S. The code is really simple and small, but it looks a little ugly because of the printf statements I added for debugging.