Communicator being split after rank failure

Lukas

Jan 20, 2020, 9:27:38 AM
to ul...@googlegroups.com
Hey,

I sometimes observe that a communicator appears to be split after a
rank failure. More precisely, multiple processes seem to believe that
they have rank 0. (See the attachment for the source I used.)

[rank 0] Starting off with 80 ranks.
[rank 1] I will fail now.
[rank 0] We've lost one rank, we now have 79 ranks left.
[... and so on. Sometimes, a few ranks will be lost at once]
[rank 0] We've lost one rank, we now have 68 ranks left.
[rank 1] I will fail now.
[rank 0] We've lost one rank, we now have 55 ranks left.
[...]
[rank 0] We've lost one rank, we now have 7 ranks left.
[rank 1] I will fail now.
[rank 1] I will fail now.
[rank 0] We've lost one rank, we now have 5 ranks left.
[rank 1] I will fail now.
[rank 0] We've lost one rank, we now have 2 ranks left.
[rank 0] We've lost one rank, we now have 1 ranks left.
[rank 0] Test finished, only one node left, exiting.
[rank 0] We've lost one rank, we now have 1 ranks left.
[rank 0] Test finished, only one node left, exiting.
[rank 0] We've lost one rank, we now have 1 ranks left.
[rank 0] Test finished, only one node left, exiting.
[The last two messages appear 7 more times]
[MPI_Finalize does not seem to complete]

I'm using 4.1.0_ulfm-2.1a1 running under a Slurm scheduler using PMI2,
started via mpiexec in an sbatch file. The source code I used is
attached.

I really hope you can help me again!

Regards,
Lukas
minimal.cpp

Aurelien Bouteiller

Jan 22, 2020, 6:22:37 PM
to ul...@googlegroups.com
Lukas,

Thank you for providing a test case. While I couldn’t directly reproduce your problem, the following PR https://bitbucket.org/icldistcomp/ulfm2/pull-requests/19 resolves a very similar behavior and may therefore resolve your issue as well. Please let me know if it helps. 

Best,
Aurelien

Lukas

Jan 23, 2020, 3:46:38 AM
to ul...@googlegroups.com
Hello Aurelien,

I don't seem to be able to find a way of getting the patch. Bitbucket
won't allow me to download the modified file or clone the repo as I
don't have access rights. I tried copying from the side-by-side diff
view, but it didn't work as some lines were skipped. Is there anything
I'm doing wrong? Could you maybe send me a .patch file?

Greetings,
Lukas

George Bosilca

Jan 23, 2020, 10:28:03 AM
to ul...@googlegroups.com
Lukas,

The PRs are supposed to be public, and the patch associated with each one can be downloaded as a diff via the Bitbucket API 2.0. Here is the link for the patch in this PR: https://bitbucket.org/api/2.0/repositories/icldistcomp/ulfm2/pullrequests/19/diff. Fetch it with curl or wget.

George.


Lukas

Jan 28, 2020, 3:48:43 AM
to ul...@googlegroups.com
Hey George,

Thank you! Downloading, patching and compiling worked. Sadly, I still
see the same phenomenon as described above. I've attached the debug
output; maybe it helps.

Greetings,
Lukas


On Thu, Jan 23, 2020 at 4:28 PM 'George Bosilca' via User Level Fault
> To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/CAMJJpkWEh997yAk8eOW_uGLHuO35NaL-1AFP2Mg8RV643iii_w%40mail.gmail.com.
split-comm-debug.txt

George Bosilca

Jan 28, 2020, 10:48:39 AM
to ul...@googlegroups.com
Lukas,

I was able to reproduce your findings, but unfortunately I don't yet have a clear picture of what is going on. I see that at some point the remaining processes form a bipartite world, with two processes each believing they are rank 1 in the shrunk communicator and both quitting at the same time.

We're looking into it.
  George.
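
For anyone trying to confirm the same symptom in their own runs, here is a minimal sketch that scans captured output for it. It assumes the "[rank N] message" line format from the log at the top of this thread. It is not part of ULFM or of the attached minimal.cpp, and the names count_finish_reports and looks_split are hypothetical. The idea: if the same rank id prints "Test finished" more than once, several processes probably ended up alone in separate communicators.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Counts, per reported rank id, how many "Test finished" lines a log contains.
// Assumption: every relevant line starts with "[rank N] " as in the thread's log.
std::map<int, int> count_finish_reports(const std::string& log) {
    std::map<int, int> finishes;
    std::istringstream in(log);
    std::string line;
    const std::string prefix = "[rank ";
    while (std::getline(in, line)) {
        // Skip lines that do not carry the "[rank N]" tag.
        if (line.compare(0, prefix.size(), prefix) != 0) continue;
        const std::size_t close = line.find(']');
        if (close == std::string::npos) continue;
        int rank = 0;
        try {
            rank = std::stoi(line.substr(prefix.size(), close - prefix.size()));
        } catch (const std::exception&) {
            continue;  // tag did not contain a number
        }
        if (line.find("Test finished", close) != std::string::npos) {
            ++finishes[rank];
        }
    }
    return finishes;
}

// True if any single rank id reported finishing more than once, which is
// the duplicated-rank pattern described in this thread.
bool looks_split(const std::string& log) {
    const std::map<int, int> finishes = count_finish_reports(log);
    for (const auto& entry : finishes) {
        if (entry.second > 1) return true;
    }
    return false;
}
```

Feeding it the full job output as one string is enough: looks_split returns true as soon as one rank id has two or more "Test finished" lines. This only detects the symptom after the fact; it says nothing about where in the shrink the communicator diverged.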

