CUDA-aware MPI support in mpi4py?

minds...@gmail.com

Aug 19, 2014, 1:28:41 AM
to mpi...@googlegroups.com
Is there support for CUDA-aware MPI in mpi4py? For example, can I pass a GPU matrix pointer to COMM.Scatter(...) and have it be distributed across multiple GPUs across multiple machines (each machine has one or more GPUs)?

Thanks!

Lisandro Dalcin

Aug 19, 2014, 4:15:53 AM
to mpi4py
It all depends on how you handle CUDA memory buffers in Python. Are
you using PyCUDA? In that case, according to PyCUDA's docs, you can
create a Python buffer object like this:

ary = ... # GPUArray instance
buf = ary.gpudata.as_buffer(ary.nbytes)

Then you can pass "buf" to mpi4py to perform communications, eg:

MPI.COMM_WORLD.Send([buf, MPI.DOUBLE], dest, tag)
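
For completeness, an untested end-to-end sketch of what I have in mind
(my assumptions: PyCUDA, two MPI processes, a CUDA-aware MPI build, and
an arbitrary N=1024 just for illustration):

import numpy as np
import pycuda.autoinit            # note: with several ranks per node you would select the device by rank instead
import pycuda.gpuarray as gpuarray
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N = 1024

if rank == 0:
    ary = gpuarray.to_gpu(np.arange(N, dtype=np.float64))
    buf = ary.gpudata.as_buffer(ary.nbytes)   # device memory exposed as a Python buffer
    comm.Send([buf, MPI.DOUBLE], dest=1, tag=77)
elif rank == 1:
    ary = gpuarray.empty(N, dtype=np.float64)
    buf = ary.gpudata.as_buffer(ary.nbytes)
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=77)
    print("rank 1 received: %s ..." % ary.get()[:4])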

Please try and let us know how it goes.

--
Lisandro Dalcin
---------------
CIMEC (UNL/CONICET)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1016)
Tel/Fax: +54-342-4511169

ar...@nervanasys.com

Aug 19, 2014, 6:08:12 PM
to mpi...@googlegroups.com
Hi Lisandro,

Thanks for your reply. The 'as_buffer' approach seems to work, but I get a segfault when trying to scatter a matrix (code attached). Are scatter/gather supported?

Regards,
Arjun
test_mpi_pycuda.py

Lisandro Dalcin

Aug 20, 2014, 4:40:34 AM
to mpi4py
On 20 August 2014 01:08, <ar...@nervanasys.com> wrote:
> Hi Lisandro,
>
> Thanks for your reply. The 'as_buffer' seems to work, but I get a seg fault
> when trying to scatter a matrix (code attached). Are scatter/gather
> supported?
>

Yes, they are. However, your code is wrong: you have to use the
upper-case Scatter() and Gather() methods, which work quite similarly
to their C or Fortran counterparts. The lower-case versions are for
communicating generic Python objects using pickle serialization under
the hood. BTW, you do not really need the "x.reshape(...)" line for
mpi4py to work, but if you keep it you should do "x.reshape(N/nr_gpus,
nr_gpus)", to make clear that Scatter() scatters rows of "x".

Also, you have to pre-allocate "x_gpu_part" with size "N/nr_gpus", and
then write code like this:

if rank == 0:
    sbuf = x_gpu.gpudata.as_buffer(x_gpu.nbytes)
else:
    sbuf = None

x_gpu_part = pycuda.gpuarray.empty(N/nr_gpus, float32)
rbuf = x_gpu_part.gpudata.as_buffer(x_gpu_part.nbytes)

comm.Scatter([sbuf, MPI.FLOAT], [rbuf, MPI.FLOAT], root=0)


Right now I do not have a CUDA-ready box to send you tested code, but
I hope my previous comments are enough, otherwise ping back.
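
Putting the pieces together, the untested sketch I have in mind looks
like this (assuming PyCUDA, one MPI process per GPU, and N divisible by
nr_gpus):

import numpy as np
import pycuda.autoinit            # with several ranks per node, pick the device by rank instead
import pycuda.gpuarray as gpuarray
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nr_gpus = comm.Get_size()
N = 1024 * nr_gpus                # total length, divisible by nr_gpus

if rank == 0:
    x_gpu = gpuarray.to_gpu(np.random.randn(N).astype(np.float32))
    sbuf = x_gpu.gpudata.as_buffer(x_gpu.nbytes)
else:
    sbuf = None

# every rank pre-allocates its slice and exposes it as a buffer
x_gpu_part = gpuarray.empty(N // nr_gpus, np.float32)
rbuf = x_gpu_part.gpudata.as_buffer(x_gpu_part.nbytes)

comm.Scatter([sbuf, MPI.FLOAT], [rbuf, MPI.FLOAT], root=0)
print("rank %d got: %s ..." % (rank, x_gpu_part.get()[:3]))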

Lisandro Dalcin

Aug 20, 2014, 5:19:53 AM
to mpi4py
On 20 August 2014 11:40, Lisandro Dalcin <dal...@gmail.com> wrote:
> but anyway you should do "x.reshape(N/nr_gpus,
> nr_gpus)", this is to make clear that Scatter() scatters rows of "x".

Please ignore my comment above, I was coding too much Fortran yesterday :-)

ar...@nervanasys.com

Aug 20, 2014, 12:28:51 PM
to mpi...@googlegroups.com
Thanks, Lisandro. I added your edits (attached), but still get a segfault on the Scatter. 

Segfault error is below:
this is process  0
x: [ -7.79815257e+15   1.04008317e+16   4.21789251e+14 ...,   1.61251795e+16
   3.17252606e+15  -3.82599771e+15]
before scatter
[titan1:30142] *** Process received signal ***
[titan1:30142] Signal: Segmentation fault (11)
[titan1:30142] Signal code: Invalid permissions (2)
[titan1:30142] Failing at address: 0xb0021fff0
[titan1:30142] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfbb0) [0x7f93ffd2fbb0]
[titan1:30142] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x150ed6) [0x7f93ffaa8ed6]
[titan1:30142] [ 2] /usr/lib/libmpi.so.0(+0x433ed) [0x7f93fe3ef3ed]
[titan1:30142] [ 3] /usr/lib/libmpi.so.0(ompi_ddt_sndrcv+0x4f2) [0x7f93fe3ed732]
[titan1:30142] [ 4] /usr/lib/libmpi.so.0(PMPI_Scatter+0x19b) [0x7f93fe4088bb]
[titan1:30142] [ 5] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(+0x5fb76) [0x7f93fe6bdb76]
[titan1:30142] [ 6] python(PyEval_EvalFrameEx+0x6127) [0x566e17]
[titan1:30142] [ 7] python(PyEval_EvalCodeEx+0x2a4) [0x54b7d4]
[titan1:30142] [ 8] python() [0x55830c]
[titan1:30142] [ 9] python(PyRun_FileExFlags+0x92) [0x468567]
[titan1:30142] [10] python(PyRun_SimpleFileExFlags+0x2ee) [0x468aa0]
[titan1:30142] [11] python(Py_Main+0xb5e) [0x46a1e3]
[titan1:30142] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f93ff979de5]
[titan1:30142] [13] python() [0x5735fe]
[titan1:30142] *** End of error message ***
Segmentation fault (core dumped)
test_mpi_pycuda.py

Lisandro Dalcin

Aug 21, 2014, 5:02:35 AM
to mpi4py
On 20 August 2014 19:28, <ar...@nervanasys.com> wrote:
> Thanks, Lisandro. I added your edits (attached), but still get a segfault on
> the Scatter.
>

I don't know what's going on. Have you ever tested your MPI with some
C code to be really sure it is working with CUDA as expected? From
what I see in your stack trace, I don't think the failure is mpi4py's
fault.

BTW, in your script you have to make a similar modification for the
Gather() call; see the "Gathering NumPy arrays" example at
http://mpi4py.readthedocs.org/en/latest/tutorial.html#collective-communication

Could you try to run the code snippets for Gather/Scatter numpy arrays
http://mpi4py.readthedocs.org/en/latest/tutorial.html#collective-communication,
just to be sure everything's OK with your MPI when using plain CPU
memory buffers?
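
For reference, a minimal CPU-only check in the spirit of those tutorial
snippets would be something like this (untested sketch; N is arbitrary
as long as it is divisible by the number of processes):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
N = 8 * size                      # divisible by the number of processes

sendbuf = None
if rank == 0:
    sendbuf = np.arange(N, dtype=np.float32)

recvbuf = np.empty(N // size, dtype=np.float32)
comm.Scatter(sendbuf, recvbuf, root=0)

recvbuf *= 2                      # trivial per-rank work

gathered = None
if rank == 0:
    gathered = np.empty(N, dtype=np.float32)
comm.Gather(recvbuf, gathered, root=0)

if rank == 0:
    print("gathered: %s" % gathered)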

ashwi...@gmail.com

Aug 25, 2014, 10:35:53 PM
to mpi...@googlegroups.com
Hello!

First of all, thanks for your work! I'd like to report an identical issue. I've edited Arjun's script to use Send and Recv instead (see attached). My code segfaults -- there are no further error messages.

I've confirmed that the code works with CPU buffers. For instance, you can replace sendbuf and recvbuf with just x_gpu.get() and that will work OK. I'm using openmpi-1.8.1 with CUDA support.

Any further suggestions?



Thanks,
Ashwin

ashwi...@gmail.com

Aug 25, 2014, 10:38:07 PM
to mpi...@googlegroups.com
Oops, I may have forgotten to attach the file. Here it is!



On Thursday, August 21, 2014 5:02:35 AM UTC-4, Lisandro Dalcin wrote:
test_mpi_pycuda.py

Lisandro Dalcin

Aug 26, 2014, 4:50:51 AM
to mpi4py
On 26 August 2014 05:35, <ashwi...@gmail.com> wrote:
> First of all, thanks for your work! I'd like to report an identical issue.
> I've edited Arjun's script to use Send and Recv instead (see attached). My
> code segfaults -- there are no further error messages.


Could you check that mpi4py is linked against the right MPI shared
library? Can you run under valgrind and post any output? I received
the announcement of Open MPI 1.8.2, perhaps it is worth a try?

BTW, your attached script has two bugs: "elif rank == 0" and
"source=1"; you should use 1 and 0, respectively.

--
Lisandro Dalcin
============
Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Numerical Porous Media Center (NumPor)
King Abdullah University of Science and Technology (KAUST)
http://numpor.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 4332
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459

ar...@nervanasys.com

Aug 26, 2014, 12:52:32 PM
to mpi...@googlegroups.com
I was able to get it to work thanks to Lisandro's suggestions and some more digging around. There were a few steps involved:

1. I was using Ubuntu's OpenMPI (via apt-get). This wasn't configured with CUDA enabled. I downloaded the sources for OpenMPI 1.8.1, and installed with the following commands. It is better to install in a separate directory because it seemed difficult to cleanly uninstall the apt-get version of OpenMPI and all its dependencies.
Note that I was configuring for an ethernet cluster, so I didn't need InfiniBand for this install.

./configure --disable-mcast --prefix=/openmpi --enable-fast=none --with-cuda enable-cuda --with-device=ch3:sock
make all
sudo make install
Make sure that PATH and LD_LIBRARY_PATH point to /openmpi and /openmpi/lib respectively

2. When installing mpi4py from source, make this change in the mpi.cfg file:

mpi_dir              = /home/users/arjun/openmpi

Use the following command to install:

python setup.py build --configure install  

3. See attached file for a demo script that worked. 

In particular, there was an issue with commands of the form x = x*2; it is better to use x *= 2 (see the sketch below). Also, the script that I was using was based on an old version of pycuda, and many of the context commands are no longer needed.
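
Here is a tiny sketch of what I believe was going on with x = x*2 (my
interpretation, not something I dug into the pycuda sources to confirm):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

x_gpu = gpuarray.to_gpu(np.ones(4, dtype=np.float32))
buf = x_gpu.gpudata.as_buffer(x_gpu.nbytes)   # buffer tied to this device allocation

x_gpu = x_gpu * 2      # allocates a NEW GPUArray; 'buf' still points at the old memory

y_gpu = gpuarray.to_gpu(np.ones(4, dtype=np.float32))
buf2 = y_gpu.gpudata.as_buffer(y_gpu.nbytes)
y_gpu *= 2             # updates in place; 'buf2' still refers to y_gpu's data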

Hope that helps.


test_mpi_pycuda.py

Ashwin Srinath

Aug 26, 2014, 1:45:13 PM
to mpi...@googlegroups.com
Arjun and Lisandro,

Thanks for your comments! I realized that I was using openmpi without infiniband support. All is well now. I still get an error about cuMemHostUnregister on completion, but that may be another issue altogether. 

Thanks again!
Ashwin 

Lisandro Dalcin

Aug 26, 2014, 1:50:00 PM
to mpi4py
On 26 August 2014 19:52, <ar...@nervanasys.com> wrote:
> Make sure that PATH and LD_LIBRARY_PATH point to /openmpi and /openmpi/lib
> respectively

Please note that PATH should point to /openmpi/bin; after that, you
should not need to edit mpi.cfg, since mpi4py will look for the 'mpicc'
compiler wrapper and find it in /openmpi/bin.

Ashwin Srinath

Aug 27, 2014, 10:01:56 AM
to mpi...@googlegroups.com
Let me add that calling MPI.Finalize() at the end of the script took care of the cuMemHostUnregister error.
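
In case it helps, this is roughly what my script ends with now (the
reasoning is only my guess: finalizing MPI explicitly, while the PyCUDA
context is still alive, lets Open MPI unregister its CUDA resources
before interpreter shutdown):

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from mpi4py import MPI

comm = MPI.COMM_WORLD
# ... GPU allocations and Send/Recv/Scatter calls go here ...

comm.Barrier()
MPI.Finalize()         # finalize MPI before the CUDA context is torn down at exit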

Thanks!



ar...@nervanasys.com

Nov 13, 2014, 2:00:13 AM
to mpi...@googlegroups.com
Thanks for your help. I am running a CUDA-aware MPI example. It works fine on one node, but gives the following pytools.prefork.ExecError when run across a couple of nodes:

Traceback (most recent call last):
  File "test_mpi_pycuda.py", line 64, in <module>
    x_gpu_part.fill(1)
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/gpuarray.py", line 525, in fill
    func = elementwise.get_fill_kernel(self.dtype)
  File "<string>", line 2, in get_fill_kernel
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/tools.py", line 423, in context_dependent_memoize
    result = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/elementwise.py", line 488, in get_fill_kernel
    "fill")
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/elementwise.py", line 157, in get_elwise_kernel
    arguments, operation, name, keep, options, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/elementwise.py", line 143, in get_elwise_kernel_and_types
    keep, options, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/elementwise.py", line 71, in get_elwise_module
    options=options, keep=keep)
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/compiler.py", line 251, in __init__ 
    arch, code, cache_dir, include_dirs)
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/compiler.py", line 241, in compile
    return compile_plain(source, options, keep, nvcc, cache_dir)
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/compiler.py", line 73, in compile_plain
    checksum.update(preprocess_source(source, options, nvcc).encode("utf-8"))
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/compiler.py", line 47, in preprocess_source
    result, stdout, stderr = call_capture_output(cmdline, error_on_nonzero=False)
  File "/usr/lib/python2.7/dist-packages/pytools/prefork.py", line 196, in call_capture_output
    return forker[0].call_capture_output(cmdline, cwd, error_on_nonzero)
  File "/usr/lib/python2.7/dist-packages/pytools/prefork.py", line 53, in call_capture_output
    % ( " ".join(cmdline), e))
pytools.prefork.ExecError: error invoking 'nvcc --preprocess -arch sm_52 -I/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/cuda /tmp/tmpl0WyOY.cu --compiler-options -P': [Errno 2] No such file or directory

[max1:06760] *** Process received signal ***
[max1:06760] Signal: Segmentation fault (11)
[max1:06760] Signal code: Address not mapped (1)
[max1:06760] Failing at address: (nil)
[max1:06760] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fa4ba69a340]
[max1:06760] [ 1] /usr/lib/libcuda.so.1(+0x1fb0e5)[0x7fa4ae3f30e5]
[max1:06760] [ 2] /usr/lib/libcuda.so.1(+0x1727d6)[0x7fa4ae36a7d6]
[max1:06760] [ 3] /usr/lib/libcuda.so.1(cuEventDestroy_v2+0x52)[0x7fa4ae346f42]
[max1:06760] [ 4] /usr/local/lib/libmca_common_cuda.so.1(mca_common_cuda_fini+0xa3)[0x7fa4b60e6993]
[max1:06760] [ 5] /usr/local/lib/openmpi/mca_btl_tcp.so(+0x4f06)[0x7fa4b4e30f06]
[max1:06760] [ 6] /usr/local/lib/libopen-pal.so.6(mca_base_component_close+0x19)[0x7fa4b8bf9709]
[max1:06760] [ 7] /usr/local/lib/libopen-pal.so.6(mca_base_components_close+0x42)[0x7fa4b8bf9782]
[max1:06760] [ 8] /usr/local/lib/libmpi.so.1(+0x7d365)[0x7fa4b9186365]
[max1:06760] [ 9] /usr/local/lib/libopen-pal.so.6(mca_base_framework_close+0x63)[0x7fa4b8c02a23]
[max1:06760] [10] /usr/local/lib/libopen-pal.so.6(mca_base_framework_close+0x63)[0x7fa4b8c02a23]
[max1:06760] [11] /usr/local/lib/libmpi.so.1(ompi_mpi_finalize+0x56d)[0x7fa4b914c9cd]
[max1:06760] [12] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(+0x28e04)[0x7fa4b9407e04] 
[max1:06760] [13] python(Py_Finalize+0x1a6)[0x42fb0f]
[max1:06760] [14] python(Py_Main+0xbed)[0x46ac10]
[max1:06760] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fa4ba2e5ec5]
[max1:06760] [16] python[0x57497e]
[max1:06760] *** End of error message ***

Lisandro Dalcin

Nov 13, 2014, 3:26:39 AM
to mpi4py
On 13 November 2014 10:00, <ar...@nervanasys.com> wrote:
> Thanks for your help. I am running a cuda aware MPI example. It works fine
> on one node, but gives the following error when run across a couple of nodes
> in pytools.preform.Execerror:
>

I'm sorry, but such errors do not seem to be related to mpi4py. I
would say that the pycuda folks are in a better position to help you
with this issue.

ar...@nervanasys.com

Nov 13, 2014, 8:51:08 AM
to mpi...@googlegroups.com
Thanks, Lisandro. It seems to work fine on one node, so I wonder if it's something about the interaction between mpi4py and pycuda. Does the /tmp directory need to be shared amongst the worker nodes so they have access to the kernels that are compiled on the fly by pycuda?

Lisandro Dalcin

Nov 13, 2014, 9:29:17 AM
to mpi4py
On 13 November 2014 16:51, <ar...@nervanasys.com> wrote:
> Thanks, Lisandro. It seems to work fine on one node, so I wonder if it's
> something about the interaction between mpi4py and pycuda. Does the /tmp
> directory need to be shared amongst the worker nodes so they have access to
> the kernels that are compiled on the fly by pycuda?
>

I would say that the problem is related to the /tmp folder being
shared. Different processes try to write to the same locations and
erase equally-named temporary files, and then you get the failures. If
pycuda has a way to control the name of the temporary folder, you
should try to create uniquely-named ones (use the rank of COMM_WORLD
for this), along the lines of the sketch below.
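
Something like this might do it (untested; I'm assuming pycuda creates
its temporary files through Python's tempfile module, which honors the
TMPDIR environment variable):

import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
tmpdir = "/tmp/pycuda-rank-%d" % rank
if not os.path.isdir(tmpdir):
    os.makedirs(tmpdir)           # one private scratch directory per rank
os.environ["TMPDIR"] = tmpdir

import pycuda.autoinit            # import pycuda only after TMPDIR is set
import pycuda.gpuarray as gpuarray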


> On Thursday, November 13, 2014 12:26:39 AM UTC-8, Lisandro Dalcin wrote:
>>
>> On 13 November 2014 10:00, <ar...@nervanasys.com> wrote:
>> > Thanks for your help. I am running a cuda aware MPI example. It works
>> > fine
>> > on one node, but gives the following error when run across a couple of
>> > nodes
>> > in pytools.preform.Execerror:
>> >
>>
>> I'm sorry, but such errors do not seem to be related to mpi4py. I
>> would say that pycuda folks are in better position to help you with
>> this issue.
>>
>>



ar...@nervanasys.com

Nov 13, 2014, 9:35:21 PM
to mpi...@googlegroups.com
Actually, in my case the /tmp folder is not shared. I am wondering if pycuda requires it to be shared so that all the processes can access a common compiled kernel? This can be done by setting the TMPDIR environment variable. I tried doing that but still get an error that the tmp[*].cu file was not found. Not sure why 2 different processes need access to each other's on-the-fly compiled kernels for simple functions like fill().

Arjun Bansal

Nov 13, 2014, 10:50:30 PM
to Andreas Kloeckner, mpi...@googlegroups.com
Thanks, Andreas. That was it!
I had nvcc on the compute nodes, and when I ran the test on each node
it ran fine if I launched both MPI processes on the same node. I only
got the issue when the two MPI processes were on separate nodes.

For some reason PATH was not getting set properly on the workers.
Adding -x PATH to the mpirun command solved it!

-Arjun

Sent from my iPhone

> On Nov 13, 2014, at 6:55 PM, Andreas Kloeckner <li...@informa.tiker.net> wrote:
>
> ar...@nervanasys.com writes:
>> Actually, in my case the /tmp folder is not shared. I am wondering if
>> pycuda requires it to be shared so that all the processes can access a
>> common compiled kernel? This can be done by setting the TMPDIR function. I
>> tried doing that but still get an error that the tmp[*].cu file was not
>> found. Not sure why 2 different processes need access to each other's on
>> the fly compiled kernels for simple functions like fill().
>
> Not shared is better. (With a shared one you'll run into locking
> issues.) Do you have nvcc available on your compute nodes? If not, that
> might be your problem.
>
> Andreas