Non-blocking communication for cuda Aware MPI

alex...@gmail.com

unread,

Apr 21, 2015, 2:17:40 PM4/21/15

to mpi...@googlegroups.com

I've been trying to use the I* variants of the communication commands with gpubuffers when combining pycuda and mpi4py.

I am using the buffer interface with pycuda and using Send and Recv seem to work fine (both with openmpi and with mvapich2)

When i change the same commands to use Isend and Irecv, I get segmentation faults.

When I use Isend and Irecv with numpy buffers (on cpu) it works fine.

Are the non-blocking versions of Send/Recv supported when using gpu buffers?

Lev Givon

unread,

Apr 21, 2015, 2:22:59 PM4/21/15

to mpi...@googlegroups.com

Received from alex...@gmail.com on Tue, Apr 21, 2015 at 02:16:50PM EDT:

FWIW, I've been able to use Isend/Irecv with PyCUDA arrays successfully using
mpi4py built against OpenMPI 1.8.4.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/

alex...@gmail.com

unread,

Apr 21, 2015, 5:24:12 PM4/21/15

to mpi...@googlegroups.com

Hi Lev,

Thanks for that datapoint -- I ran off to install openMPI 1.8.4 but still getting the same problem. For reference, I'm attaching the test code that I'm trying to run.

I'm running with command

mpiexec -n 2 python example_mpi4py.py

As is, the program runs fine and gives the expected output, but if I change lines 40 and 46 from Send and Recv to Isend and Irecv, respectively, I get the following output

[ghosthost:02640] CUDA: cuCtxGetDevice failed: res=201

[ghosthost:02640] *** Process received signal ***

[ghosthost:02640] Signal: Aborted (6)

[ghosthost:02640] Signal code: (-6)

[ghosthost:02640] CUDA: Error in cuMemcpy: res=-1, dest=0x706d40800, src=0x7fc9fbd9f7a6, size=40

[ghosthost:02640] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fca10685340]

[ghosthost:02640] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39)[0x7fca102e6cc9]

[ghosthost:02640] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fca102ea0d8]

[ghosthost:02640] [ 3] /usr/local/lib/libopen-pal.so.6(+0x45ad9)[0x7fca0eb8ead9]

[ghosthost:02640] [ 4] /usr/local/lib/libopen-pal.so.6(opal_convertor_unpack+0x10a)[0x7fca0eb871aa]

[ghosthost:02640] [ 5] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x450)[0x7fca05040f40]

[ghosthost:02640] [ 6] /usr/local/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x4b5)[0x7fca0699c9c5]

[ghosthost:02640] [ 7] /usr/local/lib/libopen-pal.so.6(opal_progress+0x4a)[0x7fca0eb7221a]

[ghosthost:02640] [ 8] /usr/local/lib/libmpi.so.1(ompi_mpi_finalize+0x24d)[0x7fca0f0f1d5d]

[ghosthost:02640] [ 9] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(+0x2f694)[0x7fca0f3b7694]

[ghosthost:02640] [10] python(Py_Finalize+0x1a6)[0x42fb0f]

[ghosthost:02640] [11] python(Py_Main+0xbed)[0x46ac10]

[ghosthost:02640] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fca102d1ec5]

[ghosthost:02640] [13] python[0x57497e]

[ghosthost:02640] *** End of error message ***

--------------------------------------------------------------------------

The call to cuMemcpy failed. This is highly unusual and should

not happen. Please report this error to the Open MPI developers.

Hostname: ghosthost

cuMemcpy return value: 201

Check the cuda.h file for what the return value means.

--------------------------------------------------------------------------

mpiexec noticed that process rank 1 with PID 2640 on node ghosthost exited on signal 6 (Aborted).

--------------------------------------------------------------------------

example_mpi4py.py

Lev Givon

unread,

Apr 21, 2015, 5:37:48 PM4/21/15

to mpi...@googlegroups.com

Received from alex...@gmail.com on Tue, Apr 21, 2015 at 05:24:10PM EDT:

> Hi Lev,
>
> Thanks for that datapoint -- I ran off to install openMPI 1.8.4 but still
> getting the same problem. For reference, I'm attaching the test code that
> I'm trying to run.
>
> I'm running with command
>
> mpiexec -n 2 python example_mpi4py.py
>
> As is, the program runs fine and gives the expected output, but if I change
> lines 40 and 46 from Send and Recv to Isend and Irecv, respectively, I get
> the following output

Since the calls are asynchronous, you need to ensure that they complete before
trying to access the updated GPU memory. Try replacing said lines with

self.comm.Isend([bufint(x_gpu), self.mpi.FLOAT], dest=1).Wait()

and

self.comm.Irecv([bufint(x_gpu), self.mpi.FLOAT], source=0).Wait()

respectively.

alex...@gmail.com

unread,

Apr 21, 2015, 5:42:38 PM4/21/15

to mpi...@googlegroups.com

That did it!!

I wrote it like this to make the asynchronicity more explicit:

a = self.comm.Isend([bufint(x_gpu), self.mpi.FLOAT], dest=1)

print 'sent'

a.Wait()

print ('after (%i): ' % self.mpi_rank) + str(x_gpu.get())

Thanks!

Reply all

Reply to author

Forward