This is a well-known issue, and mpi4py does not try to second-guess
the solution. One thing that should probably work well is to override
sys.excepthook to call MPI.COMM_WORLD.Abort(1). That should kill all
the processes; note, however, that no Python cleanup functions will run
(in other words, MPI will kill your process immediately).
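A minimal sketch of that idea could look like the code below. The helper
name install_abort_hook is made up for illustration (it is not part of
mpi4py); the only MPI call it relies on is comm.Abort(). The traceback is
printed and flushed by hand because nothing else in Python gets a chance
to run once Abort() is called.

```python
import sys
import traceback


def install_abort_hook(comm, errorcode=1):
    """Make any uncaught exception abort every process in `comm`.

    `install_abort_hook` is a hypothetical helper, not an mpi4py API.
    """
    def hook(exc_type, exc_value, exc_tb):
        # Report the error ourselves: Abort() kills the process at once,
        # so no atexit handlers or finally blocks will run afterwards.
        traceback.print_exception(exc_type, exc_value, exc_tb)
        sys.stderr.flush()
        comm.Abort(errorcode)

    sys.excepthook = hook
    return hook


# Typical use, at the top of the main script:
#     from mpi4py import MPI
#     install_abort_hook(MPI.COMM_WORLD)
```

Passing the communicator in as an argument is just a convenience; calling
MPI.COMM_WORLD.Abort(1) directly inside the hook works the same way.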
There are other alternatives, like spawning a new Python thread that
waits in a recv() call on a DUPLICATE communicator (newcomm =
MPI.COMM_WORLD.Dup()). Next, you override sys.excepthook to send a
message to the other processes (using newcomm) flagging the error.
Then the other processes will receive the message and "raise Something" in
response. Of course, at the end of your app you should send the threads
a message "all went OK" for them to shut down properly.
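Roughly, the thread-based scheme could be sketched as follows. Treat it
as an illustration under assumptions, not a tested recipe: the class name
FailureWatcher and the "ok"/"failed" payloads are invented here, and
using _thread.interrupt_main() (which raises KeyboardInterrupt in the
main thread) is just one possible way to "raise Something". Only
recv(), send(), Get_rank() and Get_size() are used on the communicator.

```python
import sys
import threading
import _thread

OK, FAILED = "ok", "failed"  # hypothetical message payloads


class FailureWatcher:
    """Hypothetical helper: a thread waiting in recv() on a dup'ed comm."""

    def __init__(self, newcomm):
        self.comm = newcomm
        self.rank = newcomm.Get_rank()
        self.size = newcomm.Get_size()
        self.notified = threading.Event()
        self.thread = threading.Thread(target=self._watch, daemon=True)

    def _watch(self):
        # Blocks until any process (including this one) sends a message.
        if self.comm.recv() == FAILED:
            self.notified.set()
            # Assumption: surface the remote failure as KeyboardInterrupt.
            _thread.interrupt_main()

    def _hook(self, exc_type, exc_value, exc_tb):
        sys.__excepthook__(exc_type, exc_value, exc_tb)
        if self.notified.is_set():
            return  # the failure came from elsewhere; nothing to announce
        for dest in range(self.size):
            if dest != self.rank:
                self.comm.send(FAILED, dest=dest)
        self.stop()

    def start(self):
        self.thread.start()
        sys.excepthook = self._hook

    def stop(self):
        # "All went OK": release our own watcher thread and wait for it.
        self.comm.send(OK, dest=self.rank)
        self.thread.join()


# Typical use:
#     from mpi4py import MPI
#     watcher = FailureWatcher(MPI.COMM_WORLD.Dup())
#     watcher.start()
#     ...            # the application
#     watcher.stop()
```

One caveat with this particular choice: the KeyboardInterrupt is only
delivered once the main thread is back in Python code, so a process that
is blocked inside an MPI call will not notice until that call returns.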
Note, however, that the previous option requires your MPI to be
thread-enabled (MPI.Query_thread() must return MPI.THREAD_MULTIPLE).
Just double-check that; IIRC, OpenMPI by default does not configure
itself to support threads.
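A quick way to check this up front is something like the snippet below.
The function name require_thread_multiple is made up; it takes the MPI
module as an argument and only uses MPI.Query_thread() and
MPI.THREAD_MULTIPLE, relying on the fact that the MPI thread-level
constants are ordered from THREAD_SINGLE up to THREAD_MULTIPLE.

```python
def require_thread_multiple(mpi):
    """Raise if the MPI library was not initialized with THREAD_MULTIPLE.

    `mpi` is expected to be the mpi4py.MPI module; the function name is
    hypothetical, not an mpi4py API.
    """
    provided = mpi.Query_thread()
    if provided < mpi.THREAD_MULTIPLE:
        raise RuntimeError(
            "MPI thread level %d is below THREAD_MULTIPLE (%d)"
            % (provided, mpi.THREAD_MULTIPLE)
        )
    return provided


# Typical use, before starting any helper thread:
#     from mpi4py import MPI
#     require_thread_multiple(MPI)
```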
--
Lisandro Dalcin
---------------
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169
mpi4py initializes/finalizes MPI for you. The initialization occurs at
import time, and the finalization when the Python process is about to
finalize (I'm using Py_AtExit() C-API call to do this). As
MPI_Finalize() is collective and likely blocking in most MPI impls,
you get the deadlock.
You can disable automatic MPI finalization by adding the two lines below
at the VERY beginning of your main script (before MPI itself is imported):
    from mpi4py import rc
    rc.finalize = False
and call MPI.Finalize() yourself in Python code at the very end.
However, you still have to manage for yourself how to notify the
other processes that a process raised an exception. So there is not
too much to gain from this approach.
Unless your app really requires proper cleanup at finalization,
overriding sys.excepthook to call MPI.COMM_WORLD.Abort() is your best
chance of getting all processes to die when a single one fails.