troubleshooting mpi4py.futures for infiniband


Johnnie Gray

Oct 27, 2017, 6:13:05 PM
to mpi4py
I've been using the mpi4py.futures module with great success, and want to scale to a multi-node infiniband setting.

'Normal' mpi4py seems to work fine, and I can solve large problems using slepc4py etc. However, everything hangs indefinitely when
using either MPICommExecutor or MPIPoolExecutor, and no errors appear.

Just wondering if there are any suggestions for troubleshooting what the problem could be?


Some extra details: 
 - using OpenMPI 1.10.1
 - most recent mpi4py from bitbucket
 - spawning processes using MPI does not seem to work on this system (with neither Open MPI nor Intel MPI),
        * I've thus either been using ``mpiexec python -m mpi4py.futures ...`` with the Pool executor
        * or ``mpiexec python ...`` with the Comm executor. Both hang (a minimal sketch of both launch modes is given below).
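
For reference, a minimal sketch of the kind of script involved (illustrative only, not my actual code, just the standard MPIPoolExecutor API):

```
from mpi4py.futures import MPIPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    # Launched as: mpiexec -n 4 python -m mpi4py.futures script.py
    # (rank 0 runs this main block, the remaining ranks act as workers)
    with MPIPoolExecutor() as executor:
        print(list(executor.map(square, range(8))))
```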


Thanks!

Lisandro Dalcin

Oct 29, 2017, 9:16:19 AM
to mpi4py
On 27 October 2017 at 15:16, Johnnie Gray <johnni...@gmail.com> wrote:
> I've been using the mpi4py.futures module with great success, and want to
> scale to a multi-node infiniband setting.
>

Good to know. Any complaints or feedback?


> 'Normal' mpi4py seems to work fine, and I can solve large problems using
> slepc4py etc, however, everything hangs indefinitely when
> using either MPICommExecutor or MPIPoolExecutor - no errors appear.
>

Frustrating...

> Just wondering if there are any suggestions for troubleshooting what the
> problem could be?
>

Please try first to use just MPICommExecutor. It requires the least
advanced MPI features; in fact, it should work even with ancient
MPI-1.x implementations. Now, some tips to try to make things work.

Maybe this is related to a lack of threading support in the backend MPI.
Could you please edit the file `src/mpi4py/__init__.py` and change
the `rc.thread_level = 'multiple'` line to set 'serialized' rather than
'multiple'? Or maybe even 'single' (you may get a warning later, but
things may still work).


diff --git a/src/mpi4py/__init__.py b/src/mpi4py/__init__.py
index 59f9c34..2ee6c3e 100644
--- a/src/mpi4py/__init__.py
+++ b/src/mpi4py/__init__.py
@@ -89,7 +89,7 @@ def rc(**kargs): # pylint: disable=invalid-name
 
 rc.initialize = True
 rc.threads = True
-rc.thread_level = 'multiple'
+rc.thread_level = 'serialized'
 rc.finalize = None
 rc.fast_reduce = True
 rc.recv_mprobe = True
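
(Equivalently, and just as a sketch using the documented mpi4py.rc interface, the same setting can be applied at runtime without patching the installed file, provided it is done before the first ``from mpi4py import MPI``:)

```
import mpi4py
mpi4py.rc.thread_level = 'serialized'  # or 'single'; the default is 'multiple'

from mpi4py import MPI  # MPI_Init_thread() now requests the lower thread level
```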

Another thing to try is the following patch:

diff --git a/src/mpi4py/futures/_lib.py b/src/mpi4py/futures/_lib.py
index db0e01a..bef5b2e 100644
--- a/src/mpi4py/futures/_lib.py
+++ b/src/mpi4py/futures/_lib.py
@@ -245,7 +245,7 @@ def comm_split(comm, root=0):
     assert 0 <= root < comm.Get_size()
     rank = comm.Get_rank()
 
-    if MPI.Get_version() >= (2, 2):
+    if 0: # MPI.Get_version() >= (2, 2):
         allgroup = comm.Get_group()
         if rank == root:
             group = allgroup.Incl([root])

Another source of problems may be a broken MPI_Ibarrier implementation;
you can apply this patch:

diff --git a/src/mpi4py/futures/_lib.py b/src/mpi4py/futures/_lib.py
index db0e01a..97fd8ca 100644
--- a/src/mpi4py/futures/_lib.py
+++ b/src/mpi4py/futures/_lib.py
@@ -373,6 +373,7 @@ class SharedPoolCtx(object):
 def barrier(comm):
     assert comm.Is_inter()
     try:
+        raise NotImplementedError
         request = comm.Ibarrier()
         backoff = Backoff()
         while not request.Test():


>
> Some extra details:
> - using OpenMPI 1.10.1
> - most recent mpi4py from bitbucket
> - spawning processes using mpi does not seem to work on this system
> (neither openmpi or intel),

Well, that's usually the situation in many systems. 2017 is almost
over and we cannot use MPI features that were added to the standard in
1998.

> * I've thus either been using ``mpiexec python -m mpi4py.futures
> ...`` with the Pool executor
> * or ``mpiexec python ...`` with the Comm executor. Both hang.
>

That's the reason I had to add this `mpiexec python -m
mpi4py.futures` mode. This way, at least you have a chance to execute your
neat script that runs just fine on a Raspberry Pi but fails to
execute on a multi-million dollar system.




--
Lisandro Dalcin
============
Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 0109
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459

Lisandro Dalcin

Oct 29, 2017, 10:19:04 AM
to mpi4py
On 27 October 2017 at 15:16, Johnnie Gray <johnni...@gmail.com> wrote:
>
> Some extra details:
> - using OpenMPI 1.10.1

FYI, I just tested with a debug build of OpenMPI 1.10.0 in my local
desktop box, and the mpi4py.futures testsuite ran just fine.

Johnnie Gray

Oct 31, 2017, 8:56:28 PM
to mpi4py
> Good to know. Any complaints or feedback?

The standard executor pool has been very useful, I suspect I abuse it slightly, keeping a cached pool of workers ticking over
(https://github.com/jcmgray/quimb/blob/develop/quimb/linalg/mpi_launcher.py), but it allows dynamically switching between
e.g. multi-threaded numpy routines, and multi-process mpi routines, all in an interactive setting etc.


> Now, some tips to try to make things work.

I have now tried those 4 patches you kindly suggested, but sadly no luck.
I have also now tried Intel MPI and a build of OpenMPI 3.0.0, but same story...
Seems it might be something quite low-level.

> FYI, I just tested with a debug build of OpenMPI 1.10.0 in my local
> desktop box, and the mpi4py.futures testsuite ran just fine.

Yes, everything works fine with all the builds I have, as long as the processes are not launched on more than
a single node. I have run into some hangs when using the pickle-based send/broadcast functions,
and am wondering if this could be related.

In case it is helpful, here is the last output from ``mpiexec --tag-output python -m trace -t <script>`` when trying to use
the MPICommExecutor (this snippet is repeated indefinitely):

```
[1,1]<stdout>: --- modulename: _lib, funcname: sleep
[1,1]<stdout>:_lib.py(77):         time.sleep(self.tval)
[1,1]<stdout>:_lib.py(78):         self.tval = min(self.tmax, max(self.tmin, self.tval * 2))
[1,1]<stdout>:_lib.py(555):         while not request_test(request):
[1,1]<stdout>:_lib.py(556):             backoff.sleep()
```

I'm guessing this is a problem specific to this hardware, so I understand if there is not much more you can suggest!

For what it's worth, my current workaround is to use a fake, synchronous pool (in the link above) that imitates the MPIPoolExecutor somewhat; a rough sketch is below.
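
Roughly, such a stand-in can be as simple as the following (a sketch only, with an illustrative class name, not the actual code in mpi_launcher.py): submit() just runs the call inline and returns an already-completed Future.

```
from concurrent.futures import Future

class SyncPoolExecutor:
    """Serial stand-in for the parts of the MPIPoolExecutor interface I use."""

    def submit(self, fn, *args, **kwargs):
        # Run the task immediately and wrap the outcome in a finished Future.
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future

    def map(self, fn, *iterables):
        return map(fn, *iterables)

    def shutdown(self, wait=True):
        pass
```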

Lisandro Dalcin

Nov 1, 2017, 5:26:53 AM
to mpi4py
On 1 November 2017 at 03:56, Johnnie Gray <johnni...@gmail.com> wrote:
>> Good to know. Any complaints or feedback?
>
>
> The standard executor pool has been very useful, I suspect I abuse it
> slightly, keeping a cached pool of workers ticking over
> (https://github.com/jcmgray/quimb/blob/develop/quimb/linalg/mpi_launcher.py),
> but it allows dynamically switching between
> e.g. multi-threaded numpy routines, and multi-process mpi routines, all in
> an interactive setting etc.
>

I'm wondering if the issue is related to your abuses. I still don't
quite get what you are trying to do. Also, note that using "python -m
mpi4py.futures" is delicate; things are not as isolated as when you
truly spawn new workers.

Does your code work on a single node under "python -m mpi4py.futures"?

>
> I have now tried those 4 patches you kindly suggested but sadly no luck.
> I have also now tried intel MPI and a build of openmpi-3.0.0 but same
> story...
> Seems it might be something quite low-level.
>
>> FYI, I just tested with a debug build of OpenMPI 1.10.0 in my local
>> desktop box, and the mpi4py.futures testsuite ran just fine.
>

Could you try to run the mpi4py.futures examples in demo/futures from
the git repo in the cluster, making sure you use more than one compute
node?

>
> Yes everything works fine with all the builds I have, as long as the
> processes are not launched on more than
> a single node. I have run into some hangs when using the pickling based
> send/broadcast functions,
> and am wondering if this could be related.
>
> In case it is helpful, here is the last output from ``mpiexec --tag-output
> python -m trace -t <script>`` when trying to use
> the MPICommExecutor, (this snippet is repeated indefinitely):
>
> ```
> [1,1]<stdout>: --- modulename: _lib, funcname: sleep
> [1,1]<stdout>:_lib.py(77): time.sleep(self.tval)
> [1,1]<stdout>:_lib.py(78): self.tval = min(self.tmax, max(self.tmin,
> self.tval * 2))
> [1,1]<stdout>:_lib.py(555): while not request_test(request):
> [1,1]<stdout>:_lib.py(556): backoff.sleep()
> ```
>

It seems that the workers are waiting for the master to accept the
message with the result.

This polling with exponential backoff is horrible, but if I don't do
that, idle workers (or the idle thread managing the master) will
consume 100% CPU. Blocking MPI calls are usually implemented with busy
polling: https://blogs.cisco.com/performance/polling-vs-blocking-message-passingprogress
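
For reference, the polling pattern is roughly the following (a simplified sketch matching the trace output above, not the exact code in mpi4py/futures/_lib.py; the tmin value here is illustrative):

```
import time

class Backoff:
    """Sleep for exponentially growing intervals, capped at tmax."""

    def __init__(self, tmin=1e-6, tmax=1e-3):
        self.tmin, self.tmax, self.tval = tmin, tmax, 0.0

    def sleep(self):
        time.sleep(self.tval)
        self.tval = min(self.tmax, max(self.tmin, self.tval * 2))

def wait_for(request):
    # Poll a nonblocking MPI request, backing off while it is incomplete.
    backoff = Backoff()
    while not request.Test():
        backoff.sleep()
```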

BTW, I see that you are passing "delay=1e-2" in mpi_launcher.py, but
please note that I renamed the keyword arg to "backoff". Now that
polling is implemented using exponential backoff with a max sleep time
of 1e-3, there should be no need to tweak this value (unless you want
to pass 0 to get minimum latency at the cost of consuming 100% when
idle).

I'm actually thinking about special-casing "backoff=0" and using
blocking MPI calls. Some MPI implementations have better ways of
letting users control their polling behavior, e.g.:
https://www.ibm.com/support/knowledgecenter/en/SSFK3V_1.3.0/com.ibm.cluster.pe.v1r3.pe400.doc/am106_pclpoll.htm

> I'm guessing this is a specific problem with this hardware so understand if
> there is not much more you can suggest!
>

I would still run the examples in demo/futures using MPICommExecutor
to rule out actual issues within your own code.

Johnnie Gray

Nov 8, 2017, 5:29:59 PM
to mpi4py
> I'm wondering if the issue is related to your abuses. I still don't
> quite get what you are trying to do. Also, note that using "python -m
> mpi4py.futures" is delicate; things are not as isolated as when you
> truly spawn new workers.
> Does your code work on a single node under "python -m mpi4py.futures"?

Yes, sorry, to clarify: I have only been troubleshooting with minimal mpi4py-only
examples (see below). I was just pointing out how the new mpi4py.futures
has been useful for me.

> Could you try to run the mpi4py.futures examples in demo/futures from
> the git repo in the cluster, making sure you use more than one compute
> node?

Yes - test_futures.py runs fine on a single compute node, with any of
``python test_futures.py``,
``mpiexec python test_futures.py``, or
``mpiexec python -m mpi4py.futures test_futures.py``
(not sure which is the intended test method?). The problem only arises for 2+ nodes.

> BTW, I see that you are passing "delay=1e-2" in mpi_launcher.py, but
> please note that I renamed the keyword arg to "backoff". Now that
> polling is implemented using exponential backoff with a max sleep time
> of 1e-3, there should be no need to tweak this value (unless you want
> to pass 0 to get minimum latency at the cost of consuming 100% when
> idle).

Thanks for pointing this out - I have updated my code, and for what it's worth,
exponential backoff seems a decent compromise for my needs.

> I would still run the examples in demo/futures using MPICommExecutor
> to rule out actual issues within your own code.

If I get some time I'll try to run a few more tests and trace exactly when the 
hang occurs. For reference, the basic MPICommExecutor snippet I was 
testing with is just this:

```
from mpi4py import MPI
from mpi4py.futures import MPICommExecutor
from operator import add

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

print("I am worker {} of {}".format(rank, size))

with MPICommExecutor(comm, root=0) as pool:
    if pool is not None:
        fut = pool.submit(add, 1, 2)
        print('1 + 2 = {}'.format(fut.result()))
```

Thanks again for your help.

Lisandro Dalcin

Nov 9, 2017, 2:57:59 AM
to mpi4py
On 9 November 2017 at 01:29, Johnnie Gray <johnni...@gmail.com> wrote:
>
> Thanks for pointing this out - I have updated my code, and for what it's
> worth,
> exponential backoff seems a decent compromise for my needs.

Just FYI, a new mpi4py release is out.

>
> For reference, the basic MPICommExecutor snippet I was
> testing with is just this:
>
> ```
> from mpi4py import MPI
> from mpi4py.futures import MPICommExecutor
> from operator import add
>
> comm = MPI.COMM_WORLD
> rank = comm.Get_rank()
> size = comm.Get_size()
>
> print("I am worker {} of {}".format(rank, size))
>
> with MPICommExecutor(comm, root=0) as pool:
> if pool is not None:
> fut = pool.submit(add, 1, 2)
> print('1 + 2 = {}'.format(fut.result()))
> ```
>

Keep using that example, and run it with just "mpiexec -n 2 python script.py".

Could you also check whether the following code works?

"""
with MPICommExecutor(comm, root=0) as pool:
pass

Johnnie Gray

Nov 16, 2017, 8:13:15 AM
to mpi4py
> Just FYI, a new mpi4py release is out.

Thanks, I have updated.
 
> Keep using that example, and run it with just "mpiexec -n 2 python script.py".
>
> Could you also check whether the following code works?
>
> """
> with MPICommExecutor(comm, root=0) as pool:
>     pass
> """

So I have run a few more tests (sorry for the delayed response):
"mpiexec -n 2 python script.py" still does not work,
nor does just passing once the pool is initialized.

What does work is disabling the infiniband interface,
e.g. setting "export OMPI_MCA_pml=ob1".

So it seems to be the exact combination of the infiniband interface
(on this set of clusters) and mpi4py.futures pools that is not working.
