Correct way of handling error in spawned child

Conn O'Rourke

Jul 16, 2020, 10:55:22 AM
to mpi4py
Hi, 

I'm curious as to the correct way of handling errors in a spawned child.

If I run the following code:

from mpi4py import MPI
from mpi_config import check_mpi,mpi_setup,my_rank,my_comm
import sys

mpi_setup()
mpi_info = MPI.Info.Create()
mpi_info.Set("host",  MPI.Get_processor_name())

my_comm.Set_errhandler(MPI.ERRORS_ARE_FATAL)

if my_rank > 0:
    try:
        commspawn = MPI.COMM_SELF.Spawn(sys.executable, args=['child_exception.py'], maxprocs=4, info=mpi_info)
        commspawn.Barrier()
        commspawn.Disconnect()
    except:
        print("exception raised")


my_comm.Barrier()


where child_exception.py is:

from mpi4py import MPI
my_rank = MPI.COMM_WORLD.Get_rank()
my_comm = MPI.COMM_WORLD


parent = MPI.Comm.Get_parent()
if my_rank == 2:
    1/0

parent.Barrier()
parent.Disconnect()



The code hangs, leaving some python processes lying around.

Traceback (most recent call last):
  File "child_exception.py", line 8, in <module>
    1/0
ZeroDivisionError: division by zero
Traceback (most recent call last):
  File "child_exception.py", line 8, in <module>
    1/0
ZeroDivisionError: division by zero
Traceback (most recent call last):
  File "child_exception.py", line 8, in <module>
    1/0
ZeroDivisionError: division by zero


mpi_config.py contains:

import sys, mpi4py
from  mpi4py import MPI
import os
import distutils.spawn


my_rank = MPI.COMM_WORLD.Get_rank()
my_comm = MPI.COMM_WORLD

def check_mpi():
    mpiexec_path, _ = os.path.split(distutils.spawn.find_executable("mpiexec"))
    for executable, path in mpi4py.get_config().items():
        if executable not in ['mpicc', 'mpicxx', 'mpif77', 'mpif90', 'mpifort']:
            continue
        if mpiexec_path not in path:
            raise ImportError(
                "mpi4py may not be configured against the same version of "
                "'mpiexec' that you are using. The 'mpiexec' path is "
                "{mpiexec_path} and mpi4py.get_config() returns:\n"
                "{mpi4py_config}\n".format(
                    mpiexec_path=mpiexec_path,
                    mpi4py_config=mpi4py.get_config()))



def mpi_setup():
    sys_excepthook = sys.excepthook
    def mpi_excepthook(type, value, traceback):
        sys_excepthook(type,value,traceback)
        mpi4py.MPI.COMM_WORLD.Abort(1)

    sys.excepthook = mpi_excepthook
    check_mpi()


I was under the impression that if any process failed, all should bail out - but apparently not.

What is the correct way to handle the error in this case, and prevent the hang?

If I run the same code but spawn a Fortran child that falls over, the code doesn't hang.

Thanks

Conn O'Rourke

Jul 16, 2020, 1:34:31 PM
to mpi4py

I see what the problem was - I didn't swap out sys.excepthook in the child by calling mpi_setup().

Fails correctly now. 
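
For readers following along, the fix amounts to installing the same kind of excepthook in the child script before any code that can raise. A minimal sketch follows. The helper name `install_abort_excepthook` and its `abort` parameter are hypothetical, added here only so the hook can be exercised without an MPI launch; they are not part of mpi4py or of the original mpi_config.py.

```python
import sys


def install_abort_excepthook(abort=None, errorcode=1):
    """Chain sys.excepthook so an uncaught exception also aborts the MPI job.

    `abort` is a hypothetical injection point for testing; when it is None
    the hook falls back to MPI.COMM_WORLD.Abort, as in mpi_setup() above.
    """
    previous_hook = sys.excepthook

    def mpi_excepthook(exc_type, exc_value, exc_traceback):
        # Print the traceback the usual way first, then flush so the
        # message is not lost when the job is torn down.
        previous_hook(exc_type, exc_value, exc_traceback)
        sys.stderr.flush()
        if abort is None:
            from mpi4py import MPI  # imported lazily, only when needed
            MPI.COMM_WORLD.Abort(errorcode)
        else:
            abort(errorcode)

    sys.excepthook = mpi_excepthook
    return previous_hook
```

In child_exception.py this would be called once near the top, before the line that raises, which is the step that was missing in the original child.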

Lisandro Dalcin

Jul 17, 2020, 4:03:04 AM
to mpi...@googlegroups.com
Look at the implementation of the mpi4py.run module. 
If you want to get really serious about the best way to do things, I would recommend the following practice as a start:

from mpi4py.run import set_abort_status

try:
    run_your_mpi_code()
except SystemExit as exc:
    set_abort_status(exc.code)
    raise
except:
    set_abort_status(1)
    raise

`SystemExit` is handled specially because `raise SystemExit` (or equivalently, `sys.exit()`) should cleanly exit the process with success.
`set_abort_status(1/nonzero)` will trigger the MPI_Abort() call you want, but that will happen after all the usual Python finalization occurs (which is a good thing, IMHO).
If you add dynamic process management to the mix, maybe you need a `finally` clause to handle the comm.Disconnect() calls.
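
A rough sketch of how the `finally` clause might fit around the pattern above. The wrapper `run_guarded` and its parameters are hypothetical names used for illustration; the status setter is passed in as an argument so the control flow can be tried without MPI. Note that Disconnect() is collective, so whether it is safe to call on the failure path depends on what the other side is doing; treat that part as an assumption to check in your own code.

```python
def run_guarded(work, set_abort_status, cleanup=None):
    """Run work(), record an abort status on failure, always run cleanup.

    In a real script `set_abort_status` would be the function imported
    from mpi4py.run, and `cleanup` might be something like
    MPI.Comm.Get_parent().Disconnect (an assumption, see the note above).
    """
    try:
        work()
    except SystemExit as exc:
        # sys.exit(code): pass the requested exit code through.
        set_abort_status(exc.code)
        raise
    except BaseException:
        # Any other uncaught exception: request a nonzero abort status.
        set_abort_status(1)
        raise
    finally:
        # Runs on both the success and the failure path.
        if cleanup is not None:
            cleanup()


# Possible usage in a spawned child (not executed here):
#   from mpi4py import MPI
#   from mpi4py.run import set_abort_status
#   parent = MPI.Comm.Get_parent()
#   run_guarded(main, set_abort_status, cleanup=parent.Disconnect)
```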



--
Lisandro Dalcin
============
Research Scientist
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/

Conn O'Rourke

Jul 17, 2020, 5:43:19 AM
to mpi4py
Hi Lisandro,

Thanks for the suggestion - I'll take a look at the mpi4py.run module. 

I want to test that when a spawned child fails, the parent run doesn't hang. I want to make a check on this as part of some unit tests, and then have the test runner continue to run the subsequent tests.

Is this possible? As far as I can tell, when the child fails, the unit test runner will fail too.

Thanks again.



Lisandro Dalcin

Jul 17, 2020, 7:24:45 AM
to mpi...@googlegroups.com
Oh, I see... well, it is quite doable, though maybe slightly involved for your taste.
Look at the attached example

$ mpiexec -n 1 python parent.py
<no output, all good>

$ mpiexec -n 1 python parent.py fail
Traceback (most recent call last):
  File "parent.py", line 11, in <module>
    assert ok, "failure in child process"
AssertionError: failure in child process


Is that enough for you?

PS: Exception handling is not only about errors, but also about control flow. Abuse it and profit, `python -m this | grep practicality`.
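
The attached parent.py and child.py are not reproduced in this archive. The sketch below is only a guess at the general pattern the output suggests, not Lisandro's actual code: each child catches its own exception and reports a boolean flag instead of aborting, and the parent collects the flags and asserts on them. The helper names `child_main` and `collect_flags` are hypothetical, and the communication is passed in as plain callables so the logic can be run without MPI.

```python
import traceback


def child_main(work, report):
    """Run work(); report True on success, False if it raised."""
    try:
        work()
    except Exception:
        # Show the error, but use it as control flow instead of dying.
        traceback.print_exc()
        report(False)
    else:
        report(True)


def collect_flags(receive, nchildren):
    """Receive one flag per child and return True only if all succeeded.

    A list (not a generator) is used so every message is received even
    when an early flag is already False.
    """
    flags = [receive(rank) for rank in range(nchildren)]
    return all(flags)


# Possible wiring with mpi4py (assumed, not executed here):
#
#   child.py
#     parent = MPI.Comm.Get_parent()
#     child_main(work, lambda ok: parent.send(ok, dest=0))
#     parent.Disconnect()
#
#   parent.py
#     comm = MPI.COMM_SELF.Spawn(sys.executable, args=['child.py'], maxprocs=4)
#     ok = collect_flags(lambda r: comm.recv(source=r), comm.Get_remote_size())
#     comm.Disconnect()
#     assert ok, "failure in child process"
```

Because the failure surfaces in the parent as an ordinary AssertionError rather than an MPI abort, a unit test runner can record it and move on to the next test.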

parent.py
child.py

Conn O'Rourke

Jul 17, 2020, 7:54:38 AM
to mpi4py
Perfect - that's exactly what I wanted.

Thanks again Lisandro.


