mpirun not killing job when one python rank has an error

Sam

Jul 28, 2011, 12:14:33 PM
to mpi4py
Hello,

I have noticed a problem with mpi4py (or perhaps with my MPI setup)
and I would like to seek some advice. If one python rank has an
unhandled exception or even calls sys.exit the process doesn't
actually die so mpirun doesn't kill the other ranks. This is bad if a
collective communication occurs afterwards as the job will never
finish.

However, if I call os._exit it will actually quit. The os._exit
function doesn't call the python cleanup handlers, but I don't know
why sys.exit would get hung up. Below is the smallest example I could
come up with that reproduces this behavior. I am running openmpi 1.4.1
on Centos 5.5 and compiled mpi4py using gcc.

Thanks,
Sam Chill

from mpi4py import MPI
import sys
import os

world = MPI.COMM_WORLD

if world.rank == 0:
    raise Exception  # python doesn't exit and mpirun will never exit
    #sys.exit(1)     # python doesn't exit and mpirun will never exit
    #os._exit(1)     # python WILL exit and mpirun will exit

world.barrier()

Lisandro Dalcin

Jul 28, 2011, 9:09:09 PM
to mpi...@googlegroups.com
On 28 July 2011 11:14, Sam <samc...@gmail.com> wrote:
> Hello,
>
> I have noticed a problem with mpi4py (or perhaps with my MPI setup)
> and I would like to seek some advice. If one python rank has an
> unhandled exception or even calls sys.exit the process doesn't
> actually die so mpirun doesn't kill the other ranks. This is bad if a
> collective communication occurs afterwards as the job will never
> finish.
>
> However, if I call os._exit it will actually quit. The os._exit
> function doesn't call the python cleanup handlers, but I don't know
> why sys.exit would get hung up. Below is the smallest example I could
> come up with that reproduces this behavior. I am running openmpi 1.4.1
> on Centos 5.5 and compiled mpi4py using gcc.
>

This is a well-known issue, and mpi4py does not try to second-guess
the solution. An approach that should work well is to override
sys.excepthook to call MPI.COMM_WORLD.Abort(1). That should kill all
the processes; note, however, that no Python cleanup functions will run
(in other words, MPI will kill your processes immediately).
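A minimal, untested sketch of that approach (print the traceback as
usual, then tear the whole job down):

import sys
from mpi4py import MPI

def mpi_excepthook(exc_type, exc_value, tb):
    # Report the error the normal way, then abort every rank.
    sys.__excepthook__(exc_type, exc_value, tb)
    MPI.COMM_WORLD.Abort(1)

sys.excepthook = mpi_excepthook

One caveat: sys.excepthook is only invoked for uncaught exceptions; a
plain sys.exit() raises SystemExit, which the interpreter handles before
the hook runs, so that case still needs separate treatment.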

There are other alternatives, like spawning a new Python thread that
blocks in a recv() call on a DUPLICATE communicator (newcomm =
MPI.COMM_WORLD.Dup()). Next, you override sys.excepthook to send a
message to the other processes (using newcomm) flagging the error.
The other processes will then receive the message and "raise Something"
in response. Of course, at the end of your app you should send the
threads an "all went OK" message so they can shut down properly.

Note, however, that the previous option requires your MPI to be
thread-enabled (MPI.Query_thread() must return MPI.THREAD_MULTIPLE).
Just double-check that; IIRC, OpenMPI by default does not configure
itself to support threads.
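
Here is a rough, untested sketch of that second approach. The tag and
the message values are arbitrary choices for illustration, and instead
of raising inside the watchdog thread it simply calls os._exit(1), which
you already found does terminate the process:

import os
import sys
import threading
from mpi4py import MPI

comm = MPI.COMM_WORLD
ctrl = comm.Dup()      # dedicated communicator for control messages
CTRL_TAG = 99          # arbitrary tag, just for this sketch
OK, FAILED = 0, 1

def watchdog():
    # Block until some rank (possibly this one) reports success or failure.
    msg = ctrl.recv(source=MPI.ANY_SOURCE, tag=CTRL_TAG)
    if msg == FAILED:
        # Another rank died; any per-process cleanup could go here first.
        os._exit(1)

watcher = threading.Thread(target=watchdog)
watcher.daemon = True
watcher.start()

def mpi_excepthook(exc_type, exc_value, tb):
    sys.__excepthook__(exc_type, exc_value, tb)
    # Flag the failure to every rank's watchdog, then die.
    for dest in range(ctrl.size):
        ctrl.send(FAILED, dest=dest, tag=CTRL_TAG)
    os._exit(1)

sys.excepthook = mpi_excepthook

# ... application code using comm ...

# On success, release our own watchdog so it can return and the
# process can shut down normally.
ctrl.send(OK, dest=ctrl.rank, tag=CTRL_TAG)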


--
Lisandro Dalcin
---------------
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169

Sam

Jul 28, 2011, 10:49:06 PM
to mpi4py
Thank you very much for the information, Lisandro. Would you care to
comment on why this is an issue in the first place? I don't understand
where the Python interpreter is blocking/getting stuck after an
unhandled exception or sys.exit. What is it trying to do that prevents
the interpreter from exiting? What is special about an MPI program that
keeps these processes from dying?

Thanks,
Sam

On Jul 28, 8:09 pm, Lisandro Dalcin <dalc...@gmail.com> wrote:

Lisandro Dalcin

Jul 30, 2011, 12:30:17 PM
to mpi...@googlegroups.com
On 28 July 2011 22:49, Sam <samc...@gmail.com> wrote:
> Thank you very much for the information Lisandro. Would you care to
> comment on why this is an issue in the first place? I don't understand
> where the python interpreter is blocking/getting suck after an
> unhandled exception or sys.exit. What is it trying to do that prevents
> the interpreter from exiting? What is special to an MPI program that
> keeps these processes from dying?
>

mpi4py initializes/finalizes MPI for you. The initialization occurs at
import time, and the finalization when the Python process is about to
exit (I'm using the Py_AtExit() C-API call to do this). As
MPI_Finalize() is collective and likely blocking in most MPI
implementations, you get the deadlock.

You can disable automatic MPI finalization by adding the two lines below
at the VERY beginning of your main script:

import mpi4py.rc
mpi4py.rc.finalize = False

and call MPI.Finalize() yourself in Python code at the very end.
However, you still have to manage yourself how to notify the other
processes that a process raised an exception. So there is not too much
to gain from this approach.
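
For completeness, the pattern would look roughly like this (untested
sketch; run_application is just a hypothetical stand-in for your own
code):

import traceback

import mpi4py.rc
mpi4py.rc.finalize = False   # disable mpi4py's automatic MPI_Finalize()

from mpi4py import MPI

try:
    run_application()        # hypothetical entry point of your program
except BaseException:        # BaseException also catches SystemExit from sys.exit()
    traceback.print_exc()
    # A rank failed: kill the whole job rather than deadlock in Finalize().
    MPI.COMM_WORLD.Abort(1)
else:
    # Every rank got here, so the collective MPI.Finalize() is safe.
    MPI.Finalize()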

Unless your app really requires proper cleanup at finalization,
overriding sys.excepthook to call MPI.COMM_WORLD.Abort() is your best
chance of getting all processes to die when a single one fails.
