GPU-aware MPI

Palmer, Bruce J

May 27, 2026, 12:44:19 PM

to Open MPI Users, Panyala, Ajay

Hi,

I’m trying to modify the Global Arrays library so that it supports global arrays hosted in GPU memory. I have a version of the progress ranks runtime that works by copying data to a host buffer before sending it to a process on a different SMP node, but I’d like to eliminate the host memory copies by using GPU-aware MPI. I’ve implemented this in the code, but it seems to be failing because I can’t pass a pointer to GPU memory that was obtained from a cudaIpcOpenMemHandle call to an MPI send or receive call.

Are pointers to GPU memory created from IpcMemHandles supposed to work with GPU-aware MPI? This is critical for our progress ranks runtime, since it would effectively replace the POSIX shared-memory strategy that we use to handle messaging for global arrays hosted in regular host memory.

I’ve included a small test code using just CUDA and MPI that reproduces the strategy we want to use inside Global Arrays.

Bruce

mpigpu_test.c

Pritchard Jr., Howard

May 27, 2026, 5:00:52 PM

to us...@lists.open-mpi.org, Panyala, Ajay

Hello Bruce,

I think a little more info is needed. Could you post the output you get from running the following command?

ompi_info

Also, double-check that the ompi_info you are running is in the same directory as the mpicc you’re using.

thanks,

Howard

Edgar Gabriel

May 27, 2026, 5:10:37 PM

to us...@lists.open-mpi.org, Panyala, Ajay

If I understand correctly, the question being asked is this: PE 0 has, let's say, a buffer on GPU 0. That buffer has been imported into PE 1 using the GPU IPC mechanism and is now mapped at a different virtual address in the address space of PE 1. Can PE 1 use that new virtual address in a communication operation with Open MPI? Is my understanding correct?

I think the challenge might be that if PE 1 uses that virtual address in a Send/Recv operation, the internal protocols will (potentially) try to open an IPC handle for that buffer as well, and I am not sure that PE 1 can do that, since the owner of the buffer is PE 0.

@bosilca ?

Thanks

Edgar

Palmer, Bruce J

May 28, 2026, 11:50:45 AM

to us...@lists.open-mpi.org, Panyala, Ajay

Here is the output from ompi_info:

[d3g293@raven ~]$ ompi_info

Package: Open MPI root@raven Distribution

Open MPI: 5.0.10

Open MPI repo revision: v5.0.10

Open MPI release date: Feb 23, 2026

MPI API: 3.1.0

Ident string: 5.0.10

Prefix: /ravenfs/software/packages/linux-zen4/openmpi-5.0.10-wukilk7fbicbyktcg5e5ftnb5b7fusvd

Configured architecture: x86_64-pc-linux-gnu

Configured by: root

Configured on: Fri Apr 24 17:45:02 UTC 2026

Configure host: raven

Configure command line: '--prefix=/ravenfs/software/packages/linux-zen4/openmpi-5.0.10-wukilk7fbicbyktcg5e5ftnb5b7fusvd'

'--enable-shared' '--disable-silent-rules'

'--disable-sphinx' '--disable-dependency-tracking'

'--enable-builtin-atomics' '--disable-static'

'--enable-mpi1-compatibility' '--without-psm'

'--without-psm2' '--without-verbs' '--without-mxm'

'--with-ucx=/ravenfs/software/packages/linux-zen4/ucx-1.20.0-ohiiu3m5ak6q3vjed2vopkomuqpnrzfk'

'--without-ofi' '--without-fca' '--without-hcoll'

'--without-ucc' '--without-xpmem' '--with-cma'

'--without-knem' '--without-xpmem' '--without-alps'

'--without-lsf' '--without-tm' '--with-slurm'

'--without-sge' '--without-loadleveler'

'--disable-memchecker' '--with-libevent=/usr'

'--with-pmix=/usr'

'--with-prrte=/ravenfs/software/packages/linux-zen4/prrte-3.0.13-khbsrxkvs3uoo7fbpehd4s6apxeaa65y'

'--with-zlib=/ravenfs/software/packages/linux-zen4/zlib-ng-2.3.3-tiqgdh3hucq5h7cnt745h2kemo45fqt4'

'--with-hwloc=/ravenfs/software/packages/linux-zen4/hwloc-2.4.1-qbqq3vvfl7po7fgyk2gzk33ip7ltnukj'

'--disable-java' '--disable-mpi-java'

'--disable-io-romio' '--with-gpfs=no'

'--enable-dlopen'

'--with-cuda=/ravenfs/software/packages/linux-zen4/cuda-12.9.1-qsftfvqos2erluh5imrw5obbe3rihbu3'

'--with-cuda-libdir=/ravenfs/software/packages/linux-zen4/cuda-12.9.1-qsftfvqos2erluh5imrw5obbe3rihbu3/lib64/stubs'

'--without-rocm' '--enable-wrapper-rpath'

'--disable-wrapper-runpath' '--enable-mpi-fortran'

'CFLAGS=-DYY_BUF_SIZE=1048576' '--disable-debug'

Built by: root

Built on: Fri Apr 24 17:47:29 UTC 2026

Built host: raven

C bindings: yes

Fort mpif.h: yes (all)

Fort use mpi: yes (full: ignore TKR)

Fort use mpi size: deprecated-ompi-info-value

Fort use mpi_f08: yes

Fort mpi_f08 compliance: The mpi_f08 module is available, but due to

limitations in the

/ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gfortran

compiler and/or Open MPI, does not support the

following: array subsections, direct passthru

(where possible) to underlying Open MPI's C

functionality

Fort mpi_f08 subarrays: no

Java bindings: no

Wrapper compiler rpath: rpath

C compiler: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gcc

C compiler absolute: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gcc

C compiler family name: GNU

C compiler version: 14.3.0

C++ compiler: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/g++

C++ compiler absolute: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/g++

Fort compiler: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gfortran

Fort compiler abs: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gfortran

Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)

Fort 08 assumed shape: yes

Fort optional args: yes

Fort INTERFACE: yes

Fort ISO_FORTRAN_ENV: yes

Fort STORAGE_SIZE: yes

Fort BIND(C) (all): yes

Fort ISO_C_BINDING: yes

Fort SUBROUTINE BIND(C): yes

Fort TYPE,BIND(C): yes

Fort T,BIND(C,name="a"): yes

Fort PRIVATE: yes

Fort ABSTRACT: yes

Fort ASYNCHRONOUS: yes

Fort PROCEDURE: yes

Fort USE...ONLY: yes

Fort C_FUNLOC: yes

Fort f08 using wrappers: yes

Fort MPI_SIZEOF: yes

C profiling: yes

Fort mpif.h profiling: yes

Fort use mpi profiling: yes

Fort use mpi_f08 prof: yes

Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,

OMPI progress: no, Event lib: yes)

Sparse Groups: no

Internal debug support: no

MPI interface warnings: yes

MPI parameter check: runtime

Memory profiling support: no

Memory debugging support: no

dl support: yes

Heterogeneous support: no

MPI_WTIME support: native

Symbol vis. support: yes

Host topology support: yes

IPv6 support: no

MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat

Fault Tolerance support: yes

FT MPI support: yes

MPI_MAX_PROCESSOR_NAME: 256

MPI_MAX_ERROR_STRING: 256

MPI_MAX_OBJECT_NAME: 64

MPI_MAX_INFO_KEY: 36

MPI_MAX_INFO_VAL: 256

MPI_MAX_PORT_NAME: 1024

MPI_MAX_DATAREP_STRING: 128

MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.10)

MCA accelerator: cuda (MCA v2.1.0, API v1.0.0, Component v5.0.10)

MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component

v5.0.10)

MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.10)

MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.10)

MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.10)

MCA btl: uct (MCA v2.1.0, API v3.3.0, Component v5.0.10)

MCA btl: smcuda (MCA v2.1.0, API v3.3.0, Component v5.0.10)

MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.10)

MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component

v5.0.10)

MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component

v5.0.10)

MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component

v5.0.10)

MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component

v5.0.10)

MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.10)

MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v5.0.10)

MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v5.0.10)

MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component

v5.0.10)

MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.10)

MCA smsc: knem (MCA v2.1.0, API v1.0.0, Component v5.0.10)

MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component

v5.0.10)

MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.10)

MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: cuda (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component

v5.0.10)

MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.10)

MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component

v5.0.10)

MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component

v5.0.10)

MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component

v5.0.10)

MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.10)

MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.10)

MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component

v5.0.10)

MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.10)

MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.0.10)

MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.10)

MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.10)

MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component

v5.0.10)

MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.10)

MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.0.10)

MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.10)

MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component

v5.0.10)

MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component

v5.0.10)

MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.10)

MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.10)

MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component

v5.0.10)

MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component

v5.0.10)
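
For anyone scanning the listing above for GPU-related pieces: the components that mention CUDA are accelerator: cuda, btl: smcuda, coll: cuda and the gpusm/rgpusm rcache entries. Note that the MPI extensions line lists rocm even though the configure line says '--without-rocm', so that line on its own may not say much about what was actually compiled in. Below is a minimal sketch of a helper that checks saved ompi_info text for a given component. The function name and the matching rule (looking for "MCA <framework>: <component> (") are my own assumptions for illustration and are not part of Open MPI.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if the saved ompi_info text contains an
 * entry "MCA <framework>: <component> (", and 0 otherwise.
 * The trailing " (" keeps "sm" from matching "smcuda". */
int has_mca_component(const char *text, const char *framework,
                      const char *component)
{
    char needle[128];
    int n;

    assert(text != NULL && framework != NULL && component != NULL);

    n = snprintf(needle, sizeof needle, "MCA %s: %s (", framework, component);
    if (n < 0 || (size_t)n >= sizeof needle)
        return 0;  /* names too long for the buffer: treat as not found */

    return strstr(text, needle) != NULL;
}
```

Called on the text above, has_mca_component(text, "accelerator", "cuda") would be expected to return 1 and has_mca_component(text, "accelerator", "rocm") to return 0. This only shows which components were built, not which ones are selected at run time.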

Palmer, Bruce J

May 28, 2026, 12:00:02 PM

to us...@lists.open-mpi.org, Panyala, Ajay

Hi Edgar,

I should have posted more information about my reproducer. I’ve been running it with 4 processes on 2 SMP nodes, with each node having 1 GPU. Ranks 0 and 2 allocate memory on the GPU. Rank 0 sends a message to rank 3 and rank 2 sends a message to rank 1. Rank 3 opens the allocation created by rank 2 using a cudaIpcOpenMemHandle call and uses the returned pointer in an MPI_Irecv call that expects a message from rank 0. Similarly, rank 1 opens the GPU allocation created by rank 0 with cudaIpcOpenMemHandle and uses that pointer in an MPI_Irecv call that expects a message from rank 2. This reproduces the behavior of a one-sided put call in Global Arrays using the progress ranks runtime.

The essential feature is that a pointer to GPU memory that was allocated by a different rank is being used in an MPI_Send/Recv call and that pointer is obtained via a cudaIpcOpenMemHandle call.
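
To make the roles above easier to follow, here they are written down as a small table in C. This is only a sketch: the struct and function names are made up for illustration and are not taken from mpigpu_test.c, and the layout (ranks 0 and 1 on one node, ranks 2 and 3 on the other) is an assumption inferred from the fact that CUDA IPC handles can only be opened by a process on the same node as the owner.

```c
#include <assert.h>

#define NO_RANK (-1)

/* Role of one rank in the 4-rank reproducer (hypothetical names). */
typedef struct {
    int allocates;  /* 1 if the rank allocates GPU memory and exports an IPC handle */
    int ipc_owner;  /* rank whose allocation is opened with cudaIpcOpenMemHandle */
    int send_to;    /* destination of the MPI send, or NO_RANK */
    int recv_from;  /* source expected by MPI_Irecv, or NO_RANK */
} rank_role;

/* Assumed layout: two consecutive ranks per SMP node. */
int node_of(int rank)
{
    assert(rank >= 0 && rank < 4);
    return rank / 2;
}

rank_role role_of(int rank)
{
    static const rank_role roles[4] = {
        { 1, NO_RANK, 3, NO_RANK },  /* rank 0: allocates, sends to rank 3 */
        { 0, 0, NO_RANK, 2 },        /* rank 1: opens rank 0's memory, receives from rank 2 */
        { 1, NO_RANK, 1, NO_RANK },  /* rank 2: allocates, sends to rank 1 */
        { 0, 2, NO_RANK, 0 },        /* rank 3: opens rank 2's memory, receives from rank 0 */
    };
    assert(rank >= 0 && rank < 4);
    return roles[rank];
}
```

Under this layout, each receive buffer belongs to a rank on the receiver's own node, while the matching send comes from the other node.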

Bruce

Palmer, Bruce J

Jun 2, 2026, 11:51:08 AM

to us...@lists.open-mpi.org, Panyala, Ajay

Any additional thoughts on this?

Bruce

Palmer, Bruce J

Jun 16, 2026, 12:24:13 PM

to us...@lists.open-mpi.org, Panyala, Ajay

We just noticed that the MPI_Irecv and MPI_Send calls in our reproducer had typos. We fixed them, but we are still getting the same results.

mpigpu_test.c
