process killed while in mpix_comm_agree


Atis Degro

Jun 8, 2019, 2:46:11 AM
to User Level Fault Mitigation
Dear ULFM team,

I am working on implementing fault tolerance in a fluid solver.

To test whether it works, I am killing processes one at a time, at random times, from outside the code using 'kill -9 PID'.

I have noticed that the execution fails (hangs) if a process is killed while it is executing mpix_comm_agree.

Is mpix_comm_agree fault tolerant by itself, or is this behavior expected?

Thank you in advance!

Regards,
Atis Degro

George Bosilca

Jun 8, 2019, 3:35:09 PM
to ul...@googlegroups.com
Atis,

The agreement is expected to be resilient by itself: it detects dead processes and only reaches consensus once the knowledge about faulty processes is consistent on all nodes. We have test cases to validate this behavior, and none of them fail with the current implementation, but maybe there are some corner cases we do not cover. Can you provide a reproducer?

Thanks,
  George.



Atis Degro

Jun 10, 2019, 3:30:55 PM
to User Level Fault Mitigation

George,


Thank you for the quick response.


A few things I forgot to mention in the previous post that might be of importance:

- I am using the mpich-3.3.1 implementation of mpix_comm_agree, so it possibly does not correspond exactly to the latest ULFM version. (The reason is that I am still struggling to get ULFM to work; I am currently trying to install it on the GMU university cluster.)
- The code is written in Fortran 90.

The code I am working on is rather long and complex, therefore I cannot provide a reproducer at the moment. I could try to make one, but that would take time.

I am not sure if this helps, but for now I can provide the fragment of the code where the failure occurs:


call mpi_iallreduce(..., ireq, ierrmpp)

write(6,*) ' calling mpi_barrier'
call flush(6)
call mpi_barrier(MPI_COMM, ierrmpp)

write(6,*) ' checking for error and getting a consensus'
call flush(6)
iflag = (MPI_SUCCESS .eq. ierrmpp)

write(6,*) ' calling mpix_comm_failure_ack'
call flush(6)
call mpix_comm_failure_ack(MPI_COMM, ierrmpp)

write(6,*) ' calling mpix_comm_agree'
call flush(6)
call mpix_comm_agree(MPI_COMM, iflag, ierrmpp)

write(6,*) ' after agree'
call flush(6)
if (.not. iflag .or. ierrmpp .ne. 0) then
   write(6,*) ' error occurred'
   call flush(6)
   goto 9999
endif

write(6,*) ' calling mpi_wait'
call flush(6)
call mpi_wait(ireq, istat, ierrmpp)


All ranks write out the ' calling mpix_comm_agree' line, and after that the code hangs, which makes me think that none of the ranks has exited mpix_comm_agree.
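
For illustration only, a stripped-down standalone program exercising the same failure_ack/agree path might look roughly like the sketch below. The program name, the use of MPI_COMM_WORLD, and the sleep/exit failure injection are assumptions made for the sketch, not part of the solver; alternatively, a rank can be killed externally with 'kill -9' while the others sit in the agreement.

      program agree_hang_sketch
        implicit none
        include 'mpif.h'
        integer :: ierr, rank, nprocs
        logical :: iflag

        call mpi_init(ierr)
        call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
        call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

        ! Let MPI calls return error codes instead of aborting the whole job.
        call mpi_comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)

        ! Stand-in for 'kill -9': one rank dies while the others are
        ! already inside (or about to enter) the agreement.
        if (rank .eq. 1) then
           call sleep(1)    ! non-standard but widely supported intrinsic
           call exit(1)     ! non-standard but widely supported intrinsic
        endif

        iflag = .true.
        call mpix_comm_failure_ack(MPI_COMM_WORLD, ierr)
        call mpix_comm_agree(MPI_COMM_WORLD, iflag, ierr)

        ! If the agreement is resilient, every surviving rank should reach
        ! this line (typically with a process-failure error code in ierr).
        write(6,*) ' rank', rank, ': agree returned ierr =', ierr, ' iflag =', iflag
        call flush(6)

        call mpi_finalize(ierr)
      end program agree_hang_sketch

A hang of the surviving ranks inside mpix_comm_agree here would reproduce what I see in the solver.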


Thank you,

Atis



George Bosilca

Jun 10, 2019, 7:27:30 PM
to ul...@googlegroups.com
Atis,

Your F90 code looks OK, so I assume it should work with any correct implementation of the ULFM extension. However, I know little about the MPICH implementation of the ULFM API, so I would not be able to help you with your MPICH-based installation. I suggest you direct your questions to the MPICH mailing list.

You mention that installing the OMPI ULFM is troublesome. Can you share your issues with us? Maybe we have some simple fixes. As an example, on most Linux clusters I use the following configure command:
./configure --prefix=*** --enable-mpi-fortran --disable-mpi-cxx --disable-io-romio --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default --enable-debug --with-ft=ulfm CC=gcc CXX=gcc FC=gfortran

  George.
 


Atis Degro

Jun 14, 2019, 7:25:31 AM
to User Level Fault Mitigation
George,

Thank you!
The suggested configure command worked.
We just had to slightly modify it to:

./configure --prefix=$HOME/ulfm/install --enable-mpi-fortran --disable-mpi-cxx --disable-io-romio --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default --enable-debug --with-ft=ulfm CC=gcc CXX=g++ FC=gfortran --with-slurm

but now we have a working ULFM installation.

Regards,
Atis