Mixing shared and distributed memory


Brandt Belson

May 9, 2011, 4:34:10 PM
to mpi...@googlegroups.com
Hello all,
I'm working on a piece of software whose performance is limited by how much data is simultaneously in memory. As a result, it's slower to run MPI on all available processors within a node than to use only one processor per node, because the available memory gets divided among the processes.

To increase performance, I'd like to share memory within a node and use MPI between nodes. Before I get too deep into this, though, I wanted to know which shared memory modules, if any, work well with mpi4py. There is a list here:
http://wiki.python.org/moin/ParallelProcessing
Is POSH the right choice for this type of application? Any guidance or tips on this would be appreciated.

Thanks,
Brandt

Lisandro Dalcin

May 9, 2011, 5:37:56 PM
to mpi...@googlegroups.com

As long as a package does not try to create processes using fork(), I
think you should be able to use any shmem package. If you are using
NumPy arrays, you should take a look here:
https://bitbucket.org/cleemesser/numpy-sharedmem/ . Using one process
per node plus threads should also work, provided that most of the heavy
computing occurs with the GIL released. But I would still prefer a
shmem approach (I do not particularly like threads for parallelism,
perhaps just because I do not know too much about them :-) ).



--
Lisandro Dalcin
---------------
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169

Chad Brewbaker

May 9, 2011, 5:48:42 PM
to mpi...@googlegroups.com
I have never had an issue with this, Brandt, but then I tend to make my MPI code extremely distributed in nature and try to avoid master/slave type bottlenecks when possible.

One feature I would like to see in mpi4py is a built-in function to set up a communicator for all MPI processes on a compute node. Off the top of my head, the easiest way would be to hash your machine name to an integer, perform some global communication on MPI_COMM_WORLD, then resolve the full name strings among those that have the same hash value.

Honestly, since almost every machine these days is multicore I would enable it by default and call this COMM_LOCALHOST.



 

Lisandro Dalcin

May 9, 2011, 6:25:06 PM
to mpi...@googlegroups.com
On 10 May 2011 00:48, Chad Brewbaker <crb...@gmail.com> wrote:
> On Mon, May 9, 2011 at 3:34 PM, Brandt Belson <bbe...@princeton.edu> wrote:
>>
>> Hello all,
>> I'm working on a piece of software that is limited in performance by how
>> much data is simultaneously in memory. As a result, it's slower to use MPI
>> within a node on all available processors than to use only one processor on
>> a node because the available memory is divided.
>> To increase performance, I'd like to share memory within a node and use
>> MPI between nodes. Before I get too deep into this though I wanted to know
>> what shared memory modules work well with mpi4py, if any? There is a list
>> here:
>> http://wiki.python.org/moin/ParallelProcessing
>> Is POSH the right choice for this type of application? Any guidance or
>> tips on this would be appreciated
>> Thanks,
>> Brandt
>
> I have never had an issue with this Brandt, but then I tend to make my MPI
> code extremely distributed in nature and try to avoid master/slave type
> bottlenecks when possible.

I agree.

> One feature I would like to see in mpi4py is a built in function to setup a
> communicator for all MPI processes on a compute node. Off the top of my head
> the easiest way would be to hash your machine name to an integer, perform
> some global communication on MPI_COMM_WORLD, then resolve the full name
> strings between those that have the same hashed value.
> Honestly, since almost every machine these days is multicore I would enable
> it by default and call this COMM_LOCALHOST.
>

1) I would like the mpi4py.MPI namespace to be reserved for things
that are explicitly defined in the MPI spec. However, I'm not opposed
to having a helper module, mpi4py.utils, where we could add useful stuff.
I'm very open to any proposal along these lines. Suggestions about
names and organization of a helper module/package would be most
welcome. I have my own list of things that I would like to add, like
printing routines, debug helpers, etc.

2) COMM_LOCALHOST = MPI.COMM_WORLD.Split(color=hash(hostname),
key=MPI.COMM_WORLD.Get_rank()). However, this could break because
hashes are not unique, and on 64-bit architectures you can get overflow
errors (hash() returns a long, but color should be an int).

You can also gather the hostnames to rank 0, build a dict mapping
hostname to list of ranks, assign each hostname a color in
range(0, len(dict)), scatter the colors back, and then use Split()
(a sketch of this follows at the end of this message).

3) COMM_LOCALHOST can easily break in an environment with
checkpoint/restart support.
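
For concreteness, here is a minimal sketch of that gather/scatter approach, assuming nothing beyond mpi4py itself; the name node_comm is just illustrative:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
hostname = MPI.Get_processor_name()

# Rank 0 collects every hostname and assigns a small integer color per host
hostnames = comm.gather(hostname, root=0)
if rank == 0:
    unique = sorted(set(hostnames))
    host_color = dict((h, i) for i, h in enumerate(unique))
    colors = [host_color[h] for h in hostnames]
else:
    colors = None
color = comm.scatter(colors, root=0)

# All ranks that share a hostname end up in the same sub-communicator
node_comm = comm.Split(color=color, key=rank)

The colors are small consecutive ints, so the hash-collision and overflow issues from point 2 do not apply.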

Chad Brewbaker

May 9, 2011, 7:49:26 PM
to mpi...@googlegroups.com
On Mon, May 9, 2011 at 5:25 PM, Lisandro Dalcin <dal...@gmail.com> wrote:
On 10 May 2011 00:48, Chad Brewbaker <crb...@gmail.com> wrote:
> On Mon, May 9, 2011 at 3:34 PM, Brandt Belson <bbe...@princeton.edu> wrote:
>>
> I have never had an issue with this Brandt, but then I tend to make my MPI
> code extremely distributed in nature and try to avoid master/slave type
> bottlenecks when possible.

I agree.

> One feature I would like to see in mpi4py is a built in function to setup a
> communicator for all MPI processes on a compute node. Off the top of my head
> the easiest way would be to hash your machine name to an integer, perform
> some global communication on MPI_COMM_WORLD, then resolve the full name
> strings between those that have the same hashed value.
> Honestly, since almost every machine these days is multicore I would enable
> it by default and call this COMM_LOCALHOST.
>

1) I would like the mpi4py.MPI namespace to be reserved for things
that are explicitly defined in the MPI spec. However, I'm not opposed
to having a helper module, mpi4py.utils, where we could add useful stuff.
I'm very open to any proposal along these lines. Suggestions about
names and organization of a helper module/package would be most
welcome. I have my own list of things that I would like to add, like
printing routines, debug helpers, etc.

Great idea.
 
2) COMM_LOCALHOST = MPI.COMM_WORLD.Split(color=hash(hostname),
key=MPI.COMM_WORLD.Get_rank()). However, this could break because
hashes are not unique, and on 64-bit architectures you can get overflow
errors (hash() returns a long, but color should be an int).

You can also gather the hostnames to rank 0, build a dict mapping
hostname to list of ranks, assign each hostname a color in
range(0, len(dict)), scatter the colors back, and then use Split().

This was what I had in mind. Probably better hash functions than xor...


import os
name = os.uname()[1]
xor_hash = 0
for char in name:
    xor_hash ^= ord(char)
# some global communication to get everyone with the same xor_hash
# more localized communication to resolve full names
# create a communicator COMM_LOCALHOST for those that share 'name'
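
One hedged way to fill in those three commented steps with mpi4py collectives might look like the following (just a sketch; the hash_comm name is made up, and the second Split() is what guards against the hash collisions mentioned above):

from mpi4py import MPI
import os

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

name = os.uname()[1]
xor_hash = 0
for char in name:
    xor_hash ^= ord(char)

# global communication: group ranks by hash value (a small int, so safe as a color)
hash_comm = comm.Split(color=xor_hash, key=rank)

# localized communication: compare full hostnames inside each hash group,
# then split again so only ranks with an identical name stay together
names = hash_comm.allgather(name)
sub_color = sorted(set(names)).index(name)
COMM_LOCALHOST = hash_comm.Split(color=sub_color, key=rank)
hash_comm.Free()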
 
 
3) COMM_LOCALHOST can easily break on an environment with
checkpoint/restart support.

Good point. 

Brandt

May 17, 2011, 9:26:27 AM
to mpi4py
Thanks to both of you for your help. I have been using Python's built-in multiprocessing module http://docs.python.org/library/multiprocessing.html (available since version 2.6) with one MPI task (~processor) per node, and using multiprocessing.Pool.map() for the shared memory operations within each node. However, I see a problem when I use COMM_WORLD.bcast() with numpy arrays. It took a while to isolate, but the following simple example shows it:

import multiprocessing
import numpy as N
from mpi4py import MPI

comm = MPI.COMM_WORLD
numMPITasks = comm.Get_size()
rank = comm.Get_rank()

pool = multiprocessing.Pool()

def test_bcast():
    if rank == 0:
        a = N.random.random(4)
    else:
        a = None
    print 'about to broadcast, num node is', rank
    a = comm.bcast(a, root=0)
    # Breaks here when "a" is a numpy array
    print 'node num is', rank, 'and a is', a

if __name__ == '__main__':
    test_bcast()


When I submit this job on two nodes, one MPI task/node, the second
node never completes the comm.bcast() command and only the first node
(rank=0) prints the value of "a".

However, if I make "a" a list, then everything works as expected.
Also, if I simply comment out the pool=multiprocessing.Pool() line,
then numpy arrays are ok too. It is only the combination of creating
the instance of Pool and numpy arrays that appears to break bcast().
I'm not sure if this is a problem with numpy, multiprocessing, or
mpi4py, so let me know if I should be asking someone else.

I should mention that I only bcast numpy arrays in my unittests where
I don't care about performance. Generally my code bcasts large user-
supplied objects which I know very little about. As a work-around I
suppose I could bcast a 2D list instead of a numpy array, then convert
to a numpy array afterwards. I'd like to get this solved the right way
though.

Thanks,
Brandt



Brandt

May 17, 2011, 10:51:11 AM
to mpi4py
Hi again,
I have some more information. I tried to use lists and found that I
also can't bcast them when they have too many elements. I can use a 2D
list with a size up to 3x2, but not 4x2 or 3x3. Similarly, I can use
1D lists up to size 9. When the lists are too large the node with
rank=1 never completes the bcast. I'm copying the code below.


import multiprocessing
import numpy as N
from mpi4py import MPI

comm = MPI.COMM_WORLD
numMPITasks = comm.Get_size()
rank = comm.Get_rank()

pool = multiprocessing.Pool()

def test_bcast():
    if rank == 0:
        numRows = 9
        numCols = 2
        a = []
        # Create a 2D list
        #for r in range(numRows):
        #    b = []
        #    for c in range(numCols):
        #        b.append(N.random.random())
        #    a.append(b)

        # Create a 1D list
        for r in range(numRows):
            a.append(N.random.random())

        # Create a 2D list from a numpy array
        #a = N.random.random((3,3))
        # Convert the numpy array to a 2D list
        #a = list(a)
        #for i in range(len(a)):
        #    a[i] = list(a[i])

    else:
        a = None
    print 'about to broadcast, num node is', rank
    a = comm.bcast(a, root=0)
    #a = N.array(a)
    # Breaks here when "a" is a numpy array or a list with too many elements
    # (a 2D list seems to break with about 8 entries, a 1D list with 10)
    print 'node num is', rank, 'and a is', a

if __name__ == '__main__':
    test_bcast()




Lisandro Dalcin

May 17, 2011, 10:54:07 AM
to mpi...@googlegroups.com

I think the issue could come from using a library (multiprocessing)
that uses fork(). I do not expect MPI implementations to work well in
this situation. I think that the right solution should be to implement
a process pool using Spawn() and MPI communication, but that would
require a bit of work.
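
Just to make that concrete, a very rough sketch (not an existing mpi4py facility; the worker.py script name and do_work() are placeholders) could look like:

from mpi4py import MPI
import sys

# parent side: spawn the workers, broadcast the tasks, gather the results
tasks = list(range(8))
workers = MPI.COMM_SELF.Spawn(sys.executable, args=['worker.py'], maxprocs=4)
workers.bcast(tasks, root=MPI.ROOT)           # copies the task list to every worker
results = workers.gather(None, root=MPI.ROOT)
workers.Disconnect()

# worker.py, run in each spawned child:
#
#   from mpi4py import MPI
#   parent = MPI.Comm.Get_parent()
#   me, n = parent.Get_rank(), parent.Get_size()
#   tasks = parent.bcast(None, root=0)
#   mine = [do_work(t) for t in tasks[me::n]]   # do_work() stands in for the real job
#   parent.gather(mine, root=0)
#   parent.Disconnect()

Anything sent this way still gets pickled and copied, so it only sidesteps the fork() problem, not the memory footprint.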

Brandt

May 17, 2011, 12:42:16 PM
to mpi4py
Ok, thanks.
I'm concerned that MPI communication will make copies of objects when
they are passed between processors on the same node. Correct me if I'm
wrong, but I believe mpi4py does this. This makes some sense, in
general one wouldn't want different processors to change an object and
affect other processors' version of that object. However, for my case,
the performance is largely determined by how many large objects I have
in memory at once, so copies would hurt performance.

I looked into several shared memory modules and they seem to rely on
fork to some extent. Will the MPI communication and Spawn() idea you
mentioned make copies? If so, is what I'm describing possible with
mpi4py?

Thanks a lot for your help,
Brandt


Lisandro Dalcin

May 17, 2011, 1:48:46 PM
to mpi...@googlegroups.com
On 17 May 2011 19:42, Brandt <bels...@gmail.com> wrote:
> Ok, thanks.
> I'm concerned that MPI communication will make copies of objects when
> they are passed between processors on the same node. Correct me if I'm
> wrong, but I believe mpi4py does this.

Yes, mpi4py makes copies. Remember that MPI primarily targets distributed memory.

> This makes some sense, in
> general one wouldn't want different processors to change an object and
> affect other processors' version of that object. However, for my case,
> the performance is largely determined by how many large objects I have
> in memory at once, so copies would hurt performance.
>

So you need to mix MPI for internode communication and memory sharing
through shmem at the intranode level? Where are these large objects
being created? Are these objects numpy arrays?

>
> I looked into several shared memory modules and they seem to rely on
> fork to some extent. Will the MPI communication and Spawn() idea you
> mentioned make copies? If so, is what I'm describing possible with
> mpi4py?
>

Yes, Spawn() would make copies if you communicate back and forth with
the child process.

Brandt

May 17, 2011, 2:16:27 PM
to mpi4py

> So you need to mix MPI for internode communication and memory sharing
> through shmem at the intranode level? Where are these large objects
> being created? Are these objects numpy arrays?

Yes, correct, MPI for the internode communication and shared memory
for intranode.

The objects are created when they are loaded from file, and represent
snapshots from a simulation or spatial modes. The user supplies the
library with functions that do things like load and save, so as long
as the objects and functions work together, they can be anything. In
general, the objects aren't numpy arrays.

> Yes, Spawn() would make copies if you communicate back and forth with
> the child process.

Ok, makes sense. I think this leaves two choices: find a shared memory
package that doesn't use fork, or find "creative" ways to make
multiprocessing work with mpi4py. I had no luck with the first option,
so I think I'll go with the second. I don't need much internode
communication, so when mpi4py and multiprocessing break each other,
maybe pickling dump/load will suffice.

Again, thanks for your help!
Brandt



Yung-Yu Chen

May 17, 2011, 3:45:20 PM
to mpi...@googlegroups.com
Hello Brandt,

On Tue, May 17, 2011 at 14:16, Brandt <bels...@gmail.com> wrote:

> So you need to mix MPI for internode communication and memory sharing
> through shmem at the intranode level? Where are these large objects
> being created? Are these objects numpy arrays?

Yes, correct, MPI for the internode communication and shared memory
for intranode.

The objects are created when they are loaded from file, and represent
snapshots from a simulation or spatial modes. The user supplies the
library with functions that do things like load and save, so as long
as the objects and functions work together, they can be anything. In
general, the objects aren't numpy arrays.

> Yes, Spawn() would make copies if you communicate back and forth with
> the child process.

Ok, makes sense. I think this leaves two choices: find a shared memory
package that doesn't use fork, or find "creative" ways to make
multiprocessing work with mpi4py. I had no luck with the first option,
so I think I'll go with the second. I don't need much internode
communication, so when mpi4py and multiprocessing break each other,
maybe pickling dump/load will suffice.


I am also working on a project for hybrid parallelism with Python.  The distributed-memory part is implemented by calling MPI (mvapich2) through the ctypes module.  The shared-memory part is implemented with the thread module.  It works well.  The thread pool is created and managed in pure Python, while each thread runs C code interfaced with ctypes.  There could be some overhead in ctypes, but in my case the C code consumes almost 100% of the run time, so it doesn't really matter.  This approach also makes extension to a GPU cluster straightforward.
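
For what it's worth, a minimal sketch of that pattern (the shared library name and compute_block() function are made up) looks like this; the point is that ctypes drops the GIL for the duration of each C call, so the threads really do run in parallel:

import ctypes
import threading

lib = ctypes.CDLL('./libsolver.so')                  # hypothetical compiled C kernel
lib.compute_block.argtypes = [ctypes.c_int, ctypes.c_int]
lib.compute_block.restype = None

def worker(start, stop):
    # the GIL is released while compute_block() runs
    lib.compute_block(start, stop)

nthreads, chunk = 4, 250000
threads = [threading.Thread(target=worker, args=(i * chunk, (i + 1) * chunk))
           for i in range(nthreads)]
for t in threads:
    t.start()
for t in threads:
    t.join()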

If this approach doesn't work for you, perhaps you can consider just using multiprocessing for the internode communication; multiprocessing actually works over an IP network.

yyc
 




--
Yung-Yu Chen
http://solvcon.net/yyc/

Lisandro Dalcin

May 17, 2011, 4:13:38 PM
to mpi...@googlegroups.com

Yes, but this works because your computations are performed in C:
ctypes releases the GIL during those calls, so the threads can actually
run concurrently. This is not going to work for computations performed
in pure Python code.

> If this approach doesn't work for you, perhaps you can consider just using
> multiprocessing for internode communication. multiprocessing actually works
> over IP network.


Dag Sverre Seljebotn

May 25, 2011, 1:01:15 PM
to mpi...@googlegroups.com
On 05/17/2011 08:16 PM, Brandt wrote:
>
>> So you need to mix MPI for internode communication and memory sharing
>> through shmem at the intranode level? Where are these large objects
>> being created? Are these objects numpy arrays?
>
> Yes, correct, MPI for the internode communication and shared memory
> for intranode.
>
> The objects are created when they are loaded from file, and represent
> snapshots from a simulation or spatial modes. The user supplies the
> library with functions that do things like load and save, so as long
> as the objects and functions work together, they can be anything. In
> general, the objects aren't numpy arrays.

Are you saying you need a multi-threaded, shared memory model for
arbitrary Python objects? AFAICT that is simply fundamentally impossible
with Python no matter how you approach it, even leaving the question of
MPI completely aside. The GIL gets in the way. (Perhaps Jython or
IronPython would work.)

If you can make sure that "data" is allocated in large, shared chunks,
or through a C API (such as NumPy arrays, or some other object that can
be managed in a C library), and only have thin "proxy objects" in each
process, then you have a chance with Python.

>
>> Yes, Spawn() would make copies if you communicate back and forth with
>> the child process.
>
> Ok, makes sense. I think this leaves two choices: find a shared memory
> package that doesn't use fork, or find "creative" ways to make
> multiprocessing work with mpi4py. I had no luck with the first option,
> so I think I'll go with the second. I don't need much internode
> communication, so when mpi4py and multiprocessing break each other,
> maybe pickling dump/load will suffice.

Assuming that you can use shared "chunks" of memory within nodes (such
as NumPy arrays), what you can do is:

 - Launch n MPI processes on each of m nodes, for a total of n*m processes
 - On launch, each process checks its hostname and groups up by
hostname (for instance, all processes send their hostname to root, which
then creates a group for each node and tells each process which group it
is in)
 - Processes on each node then cooperate to allocate and share chunks
of memory through the usual cross-process shared memory facilities
(see the sketch below)
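
As a rough illustration of that last step, assuming the per-node communicator built as above and an arbitrarily chosen file in /dev/shm as the shared memory facility:

import numpy as np

def share_array(node_comm, nitems, path='/dev/shm/mpi4py_shared'):
    """Return a float64 array of length nitems shared by all ranks on a node."""
    if node_comm.Get_rank() == 0:
        # one rank per node creates (and zero-fills) the backing file
        with open(path, 'wb') as f:
            f.truncate(nitems * 8)
    node_comm.Barrier()      # make sure the file exists before anyone maps it
    # every rank on the node maps the same physical pages, so no copies
    return np.memmap(path, dtype='d', mode='r+', shape=(nitems,))

The thin "proxy objects" mentioned above would then just wrap views like this one; only inter-node traffic still goes through MPI.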

I don't see any advantage at all to mpi4py+multiprocessing... it is not
as if sending objects within a node would go any faster with
multiprocessing than with mpi4py?

Dag Sverre Seljebotn
