process killed while in mpix_comm_agree


Atis Degro

Jun 8, 2019, 2:46:11 AM
to User Level Fault Mitigation
Dear ULFM team,

I am working on implementing fault tolerance in a fluid solver.

To test whether it works, I am killing processes one at a time, at random times, from outside the code using 'kill -9 PID'.

I have noticed that the execution fails (hangs) if a process is killed while it is executing mpix_comm_agree.

Is mpix_comm_agree fault tolerant by itself, or is this behavior expected?

Thank you in advance!

Regards,
Atis Degro

George Bosilca

Jun 8, 2019, 3:35:09 PM
to ul...@googlegroups.com
Atis,

The agreement is expected to be resilient by itself: it detects dead processes and only reaches consensus once the knowledge about faulty processes is consistent on all nodes. We have test cases to validate this behavior, and none of them fail with the current implementation, but maybe there are some corner cases we do not cover. Can you provide a reproducer?

Thanks,
  George.



Atis Degro

Jun 10, 2019, 3:30:55 PM
to User Level Fault Mitigation

George,


Thank you for the quick response.


A few things I forgot to mention in the previous post that might be of importance:

- I am using the mpich-3.3.1 implementation of mpix_comm_agree, so it possibly does not correspond exactly to the latest ULFM version. (The reason is that I am still struggling to get ULFM to work; I am currently trying to install it on the GMU university cluster.)
- The code is written in Fortran 90.

The code I am working on is rather long and complex, therefore I cannot provide a reproducer at the moment. I could try to make one, but that would take time.

I am not sure if this helps, but for now I can provide the fragment of the code where the failure occurs:


call mpi_iallreduce(..., ireq, ierrmpp)

write(6,*) ' calling mpi_barrier'
call flush(6)
call mpi_barrier(MPI_COMM, ierrmpp)

write(6,*) ' checking for error and getting a consensus'
call flush(6)
iflag = (MPI_SUCCESS .eq. ierrmpp)

write(6,*) ' calling mpix_comm_failure_ack'
call flush(6)
call mpix_comm_failure_ack(MPI_COMM, ierrmpp)

write(6,*) ' calling mpix_comm_agree'
call flush(6)
call mpix_comm_agree(MPI_COMM, iflag, ierrmpp)

write(6,*) ' after agree'
call flush(6)
if (.not. iflag .or. ierrmpp .ne. 0) then
   write(6,*) ' error occurred'
   call flush(6)
   goto 9999
endif

write(6,*) ' calling mpi_wait'
call flush(6)
call mpi_wait(ireq, istat, ierrmpp)


All ranks write out the ' calling mpix_comm_agree' line, and after that the code hangs, which makes me think that none of the ranks has exited mpix_comm_agree.
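
For illustration only, a stripped-down standalone program exercising the same failure_ack/agree path might look roughly like the sketch below. The program name, the use of MPI_COMM_WORLD, and the sleep/exit failure injection are assumptions made for the sketch, not part of the solver; alternatively, a rank can be killed externally with 'kill -9' while the others sit in the agreement.

      program agree_hang_sketch
        implicit none
        include 'mpif.h'
        integer :: ierr, rank, nprocs
        logical :: iflag

        call mpi_init(ierr)
        call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
        call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

        ! Let MPI calls return error codes instead of aborting the whole job.
        call mpi_comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)

        ! Stand-in for 'kill -9': one rank dies while the others are
        ! already inside (or about to enter) the agreement.
        if (rank .eq. 1) then
           call sleep(1)    ! non-standard but widely supported intrinsic
           call exit(1)     ! non-standard but widely supported intrinsic
        endif

        iflag = .true.
        call mpix_comm_failure_ack(MPI_COMM_WORLD, ierr)
        call mpix_comm_agree(MPI_COMM_WORLD, iflag, ierr)

        ! If the agreement is resilient, every surviving rank should reach
        ! this line (typically with a process-failure error code in ierr).
        write(6,*) ' rank', rank, ': agree returned ierr =', ierr, ' iflag =', iflag
        call flush(6)

        call mpi_finalize(ierr)
      end program agree_hang_sketch

A hang of the surviving ranks inside mpix_comm_agree here would reproduce what I see in the solver.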


Thank you,

Atis



George Bosilca

Jun 10, 2019, 7:27:30 PM
to ul...@googlegroups.com
Atis,

Your F90 code looks OK, so I assume it should work with any correct implementation of the ULFM extension. However, I know little about the MPICH implementation of the ULFM API, so I would not be able to help you with your MPICH-based installation. I suggest you direct your questions to the MPICH mailing list.

You mention that installing the OMPI ULFM is troublesome. Can you share your issues with us? Maybe we have some simple fixes. As an example, on most Linux clusters I use the following configure command:
./configure --prefix=*** --enable-mpi-fortran --disable-mpi-cxx --disable-io-romio --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default --enable-debug --with-ft=ulfm CC=gcc CXX=gcc FC=gfortran

  George.
 


Atis Degro

Jun 14, 2019, 7:25:31 AM
to User Level Fault Mitigation
George,

Thank you!
The suggested configure command worked.
We just had to slightly modify it to:

./configure --prefix=$HOME/ulfm/install --enable-mpi-fortran --disable-mpi-cxx --disable-io-romio --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default --enable-debug --with-ft=ulfm CC=gcc CXX=g++ FC=gfortran --with-slurm

but now we have a working ULFM installation.

Regards,
Atis