crashes during repeated concurrent failures


damodar sahasrabudhe

Mar 20, 2019, 2:09:49 PM
to User Level Fault Mitigation
Hi,
I am seeing abrupt crashes in a very simple program; please see the attached code.

I am experimenting with the worst case of concurrent failures, repeated one after another:
in every iteration I kill the odd ranks, and the even ranks recover from the failure. The process repeats until only one rank is left.

The first iteration always works, but after that I see crashes at random; occasionally all 3 iterations complete. The crashes do not occur in any particular function. Valgrind reports some invalid reads and writes inside MPI routines.
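
For reference, the iteration count in this scenario can be worked out with a small MPI-free model. The helper name `sizes_per_iteration` is hypothetical and is not taken from 128_recovery.cc; it only assumes that the survivors are renumbered 0..n-1 before the next round of failures.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the attached code): returns the communicator
// size at the start of each iteration when every odd rank is killed and the
// surviving even ranks are renumbered 0..n-1 before the next iteration.
std::vector<int> sizes_per_iteration(int initial_size) {
    assert(initial_size >= 1);
    std::vector<int> sizes;
    // Ranks 0, 2, 4, ... survive, so ceil(n / 2) ranks remain after each round.
    for (int n = initial_size; n > 1; n = (n + 1) / 2) {
        sizes.push_back(n);  // an iteration runs only while more than one rank is left
    }
    return sizes;
}
```

With `-np 8` this gives sizes 8, 4 and 2, which matches the 3 iterations mentioned above.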

compile: mpicxx 128_recovery.cc -g
run: mpirun -np 8 -am ft-enable-mpi ./a.out


Logic:
1. Revoke the communicator with MPIX_Comm_revoke (this does not make any difference).
2. Loop on MPIX_Comm_failure_ack and MPIX_Comm_agree until all ranks have a consistent picture. (Is this wrong? I saw similar code in the ULFM guide.)
3. Call MPIX_Comm_failure_get_acked to obtain the group of failed ranks.
4. Translate the ranks from that group to the communicator.
5. Shrink the communicator.
6. Assign new ranks using MPI_Comm_split (just to ensure that the shrink does not change the rank order).
7. Call the "simulate" function again to iterate.
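
The rank numbering that steps 3 to 6 are meant to produce can be modeled without MPI. The sketch below is only an illustration: `renumber_survivors` is a hypothetical name, and the failed ranks are assumed to be already expressed in the old communicator's numbering (the result of step 4). It mirrors the documented behavior of MPI_Comm_split, which orders ranks in the new communicator by the `key` argument, so passing the old rank as the key preserves the relative order of the survivors.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of the intended outcome of steps 3-6 (not the attached code).
// old_size: size of the communicator before the failure.
// failed:   failed ranks, in the old communicator's numbering.
// Returns, for each old rank, its new rank, or -1 if that rank failed.
std::vector<int> renumber_survivors(int old_size, const std::vector<int>& failed) {
    assert(old_size >= 0);
    std::vector<int> new_rank(old_size, -1);
    int next = 0;
    for (int r = 0; r < old_size; ++r) {
        const bool dead = std::find(failed.begin(), failed.end(), r) != failed.end();
        if (!dead) {
            new_rank[r] = next++;  // survivors keep their relative order
        }
    }
    return new_rank;
}
```

For 8 ranks with ranks 1, 3, 5 and 7 failed, old ranks 0, 2, 4 and 6 become new ranks 0, 1, 2 and 3.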

The examples show error handlers that perform the recovery and then return to the code that was interrupted. I am not doing that: in the real application I want to restart my "timestep" after recovery rather than continue from the point of interruption. Hence the error handler calls the simulate routine again each time.
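
One alternative way to express "restart the timestep after recovery" is a retry loop in the caller, so that simulate is not re-entered from inside the handler. The sketch below is MPI-free and is not what the attached code does; `run_timestep_with_restart`, `timestep` and `recover` are hypothetical names. In an MPI program, `timestep` would presumably report a process failure through return codes (for example with MPI_ERRORS_RETURN set on the communicator), and `recover` would perform the steps listed above. Whether this structure has any bearing on the crashes is not established.

```cpp
#include <cassert>
#include <functional>

// Hypothetical driver (an assumption for illustration, not the attached code).
// timestep: returns true if the timestep completed, false if a failure was detected.
// recover:  repairs the communicator after a failure.
// Returns the number of attempts needed, or -1 if max_attempts was exhausted.
int run_timestep_with_restart(const std::function<bool()>& timestep,
                              const std::function<void()>& recover,
                              int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (timestep()) {
            return attempt;  // timestep ran to completion
        }
        recover();           // repair, then restart the timestep from its beginning
    }
    return -1;               // gave up after max_attempts
}
```

Because each retry starts from the top of the loop, the call stack does not grow with the number of failures.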

Can someone please help me spot any mistake in my code?

thanks,
Damodar.

P.S. The code is really simple and small, but it looks a little ugly because of the printf statements I added for debugging.
128_recovery.cc

damodar sahasrabudhe

Mar 20, 2019, 2:16:25 PM
to User Level Fault Mitigation
Sorry, I forgot to mention my platform information: Intel Xeon CPU E5-2680, CentOS release 6.10, gcc version 6.3.1, ULFM 2.0.
I experienced similar problems on LLNL's Quartz cluster.

George Bosilca

Mar 30, 2019, 7:12:40 PM
to ul...@googlegroups.com
Damodar,

I can't replicate your issue with the current master on an IB cluster. Can you update to g6f00293?

  George.

