processes stuck in mpix_comm_shrink


Atis Degro

Dec 24, 2019, 4:43:47 PM
to User Level Fault Mitigation

Hi all,


I have implemented ULFM calls in a finite difference fluid solver (Fortran).

To test the functionality, I use raise(SIGKILL) to kill some of the MPI processes.

I am observing strange behavior depending on the number of MPI processes that I use.

I start the run with spares and always kill processes 2 and 3.

During the recovery, the communicator is shrunk to the remaining processes and the killed processes are replaced with the spares.

I am running in a cluster environment; each MPI process runs on a separate compute node.
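
For reference, the fault injection and the recovery step are structured roughly as in the sketch below. This is an illustrative sketch in C rather than the actual Fortran code, and the function and variable names are placeholders:

#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_shrink, ... */
#include <signal.h>
#include <stdio.h>

/* Fault injection used for testing: selected ranks kill themselves mid-run. */
void maybe_inject_failure(int rank, int step, int fail_step)
{
    if (step == fail_step && (rank == 2 || rank == 3))
        raise(SIGKILL);                      /* the process vanishes without cleanup */
}

/* Recovery step: shrink away the failed processes and report how many the
 * runtime considers dead; the caller then promotes spares from the shrunk
 * communicator to replace the missing workers. */
int recover(MPI_Comm comm, MPI_Comm *shrunk)
{
    int rc, old_size, new_size, me;

    rc = MPIX_Comm_shrink(comm, shrunk);     /* collective over the survivors */
    if (rc != MPI_SUCCESS)
        return rc;

    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &old_size);
    MPI_Comm_size(*shrunk, &new_size);
    if (me == 0)
        printf("processes considered dead: %d\n", old_size - new_size);
    return MPI_SUCCESS;
}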

The problem that I am encountering is that some of the processes get stuck in the mpix_comm_shrink call.

The number of dead processes after the shrink (size of the original communicator minus size of the shrunk communicator) is sometimes larger than 2 (the number of processes I kill).

If I run on 32 MPI nodes, I end up with 4 dead nodes; processes 30 and 32 don't exit the shrink call.

If I run on 36 MPI nodes, I end up with 2 dead nodes, just as expected.

If I run on 64 MPI nodes, I end up with 5 dead nodes; processes 50, 62 and 64 don't exit the shrink call.

If I run on 110 MPI nodes, I end up with 2 dead nodes, just as expected.

The additional nodes that 'fail' (get stuck in shrink) are always the same if I repeat the run with the same number of MPI processes.


Has anyone experienced a similar behavior?

Does anyone have any suggestions regarding what could be going on?

Is there something in the mpix_comm_shrink call that could account for such behavior?


Merry Christmas Eve!


Atis

Aurelien Bouteiller

Dec 27, 2019, 12:04:44 PM
to ul...@googlegroups.com
Hi Atis, 

Thanks for reporting your issue. I will need a bit more information to provide assistance if you don’t mind. 

1. What version of ULFM are you running (git hash)?

2. Please add the following flags to mpirun, and (if not sensitive) send me the resulting log files (as an archive) for consideration: `--mca mpi_ft_verbose 100 --output-filename somedir`.


As for a guess as to what's happening: I have seen something similar in the past, but I have hopes it is resolved in the latest release. There was a bug in the failure detector that would trigger the reporting of live processes as dead. These processes would then either abort when they discovered they had been (incorrectly) reported (adding additional 'failures' to the injected set), or linger around forever if they never got the information that they had been considered dead (deadlocked in shrink, while the rest of the application could proceed to Finalize). Let me know if your observations match this pattern.
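
To check whether that is what you are hitting, something along the lines of the following sketch (C, with placeholder names) can be called by each surviving process right after the error is raised; it prints the ranks that the local runtime has flagged as failed, which you can compare against the two ranks you actually kill:

#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_Comm_failure_ack / MPIX_Comm_failure_get_acked */
#include <stdio.h>

/* Print the ranks of `comm` that this process's runtime currently considers
 * failed.  Each surviving process may report a different set. */
void print_acked_failures(MPI_Comm comm)
{
    MPI_Group failed, cgrp;
    int nfailed, me, i;

    MPI_Comm_rank(comm, &me);
    MPIX_Comm_failure_ack(comm);                 /* acknowledge failures seen so far */
    MPIX_Comm_failure_get_acked(comm, &failed);  /* group of acknowledged failed procs */
    MPI_Group_size(failed, &nfailed);

    MPI_Comm_group(comm, &cgrp);
    for (i = 0; i < nfailed; i++) {
        int grank = i, crank;
        MPI_Group_translate_ranks(failed, 1, &grank, cgrp, &crank);
        printf("[rank %d] rank %d is reported as failed\n", me, crank);
    }
    MPI_Group_free(&cgrp);
    MPI_Group_free(&failed);
}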


Best,
Aurelien


Atis Degro

Jan 2, 2020, 4:38:55 PM
to User Level Fault Mitigation
Hi Aurelien,

Happy New Year!

Thank you for your response.
To answer your questions:
1. Regarding the ULFM version, I downloaded the repository from Bitbucket (icldistcomp-ulfm2-cf8dc43f9073.zip).
2. I have attached the resulting log files. I had a look, however, and they don't seem to contain any useful information.

The case I run is submitted on 64 nodes: 32 active MPI processes and 32 spares. At a certain point in the simulation, processes 2 and 3 (ranks 1 and 2) are killed.
During the recovery after the shrink (as mentioned before), rank 0 determines that 8 processes have died, since the size of the shrunk communicator is 56 and not 62 as it should be.
The execution of the code continues as expected, since there are enough spares to substitute for the fake dead processes as well.

The fake dead in this case are ranks 25, 31, 49, 57, 61 and 63.
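
In case it is useful, the removed ranks can be listed after the shrink with something like the following (again an illustrative C sketch, not the solver's actual Fortran code):

#include <mpi.h>
#include <stdio.h>

/* List the ranks of `oldcomm` that are no longer present in `shrunk`,
 * i.e. everything the runtime removed: real failures plus the fake dead. */
void list_missing_ranks(MPI_Comm oldcomm, MPI_Comm shrunk)
{
    MPI_Group gold, gnew, gmissing;
    int nmissing, i;

    MPI_Comm_group(oldcomm, &gold);
    MPI_Comm_group(shrunk, &gnew);
    MPI_Group_difference(gold, gnew, &gmissing);   /* in old but not in new */
    MPI_Group_size(gmissing, &nmissing);

    for (i = 0; i < nmissing; i++) {
        int grank = i, orank;
        MPI_Group_translate_ranks(gmissing, 1, &grank, gold, &orank);
        printf("removed by shrink: old rank %d\n", orank);
    }
    MPI_Group_free(&gmissing);
    MPI_Group_free(&gnew);
    MPI_Group_free(&gold);
}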

Regarding your initial guess, it is hard to say whether this falls under the suggested pattern, since there are no signs of any of the processes being reported as dead.

Please let me know if there is any additional information I could provide.

Thank you!

Regards,
Atis

Attachment: err_out.tar.gz

Zhong, Dong

Jan 10, 2020, 12:01:36 PM
to User Level Fault Mitigation
Hi Atis,

I tried to reproduce your problem, but I didn't see the same behavior you have. If you could share your test case, it would be very helpful.

Best,
Dong