George,
Thank you for the quick response.
A few things I forgot to mention in the previous post that might be of importance:
- I am using the mpich-3.3.1 implementation of mpix_comm_agree, so it may not correspond exactly to the latest ULFM version.
(The reason is that I am still struggling to get ULFM to work; at the moment I am trying to install it on the GMU university cluster.)
- The code is written in Fortran 90.
The code I am working on is rather long and complex, therefore I cannot provide a reproducer at the moment. I could try to make one, but that would take time.
I am not sure if this helps, but for now I can provide a fragment of the code where the failure occurs:
call mpi_iallreduce(...,ireq,ierrmpp)
write(6,*)' calling mpi_barrier'
call flush(6)
call mpi_barrier(MPI_COMM,ierrmpp)
write(6,*)' checking for error and getting a consensus'
call flush(6)
iflag=(MPI_SUCCESS .eq. ierrmpp)
write(6,*)' calling mpix_comm_failure_ack'
call flush(6)
call mpix_comm_failure_ack(MPI_COMM, ierrmpp)
write(6,*)' calling mpix_comm_agree'
call flush(6)
call mpix_comm_agree(MPI_COMM, iflag, ierrmpp)
write(6,*)' after agree'
call flush(6)
if (.not. iflag .or. ierrmpp .ne. 0) then
   write(6,*)' error occurred'
   call flush(6)
   goto 9999
endif
write(6,*)' calling mpi_wait'
call flush(6)
call mpi_wait(ireq,istat,ierrmpp)
All ranks write out the highlighted line (' calling mpix_comm_agree') and after that the code hangs,
which makes me think that none of the ranks has exited mpix_comm_agree.
Thank you,
Atis
Atis,

The agreement is expected to be resilient by itself: it detects dead processes and only reaches consensus once the knowledge about faulty processes is consistent across all nodes. We have test cases to validate this behavior, and none of our current test cases fail with the current implementation, but maybe there are some corner cases we do not cover. Can you provide a reproducer?

Thanks,
George.
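A minimal sketch of what such a reproducer might look like (hypothetical, not taken from the solver: it assumes an ULFM-enabled MPI and the MPIX_ Fortran bindings that MPICH/ULFM builds expose through mpif.h, and it injects the failure with a GNU abort() in place of the external 'kill -9'):

      program agree_repro
         implicit none
         include 'mpif.h'
         integer :: rank, nprocs, ierr, iflag

         call mpi_init(ierr)
         call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
         call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

         ! Errors must be returned to the caller instead of aborting the job,
         ! otherwise the survivors never get a chance to run the agreement.
         call mpi_comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)

         ! Emulate the external 'kill -9' on the last rank (abort is a GNU
         ! extension; killing the process from outside works just as well).
         if (rank == nprocs - 1) call abort()

         ! Acknowledge the failures seen so far, then run the agreement.
         ! Per the ULFM proposal the flag is an integer that is combined
         ! with a bitwise AND across the surviving ranks.
         call mpix_comm_failure_ack(MPI_COMM_WORLD, ierr)
         iflag = 1
         call mpix_comm_agree(MPI_COMM_WORLD, iflag, ierr)
         write(6,*)' rank', rank, ': agree returned ierr=', ierr, ' flag=', iflag
         call flush(6)

         call mpi_finalize(ierr)
      end program agree_repro

Running this with, e.g., mpiexec -n 4 should show the three surviving ranks printing the line after mpix_comm_agree; if they hang instead, that would reproduce the behavior described above.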
On Sat, Jun 8, 2019 at 2:46 AM Atis Degro <atis...@gmail.com> wrote:
Dear ULFM team,

I am working on implementing fault tolerance in a fluid solver. To test whether it works, I am killing processes one at a time from outside of the code at random times using 'kill -9 PID'. I have noticed that the execution fails (hangs) if a process is killed while executing mpix_comm_agree. Is mpix_comm_agree fault tolerant by itself, or is this behavior expected?

Thank you in advance!

Regards,
Atis Degro