GPU-aware MPI


Palmer, Bruce J

May 27, 2026, 12:44:19 PM
to Open MPI Users, Panyala, Ajay
Hi,

I’m trying to modify the Global Arrays library so that it supports global arrays hosted in GPU memory. I have a version of the progress ranks runtime that works by copying data to a host buffer before sending it to a process on a different SMP node, but I’d like to eliminate the host memory copies by using GPU-aware MPI. I’ve implemented this in the code, but it seems to fail because I can’t pass a GPU memory pointer obtained from a cudaIpcOpenMemHandle call to an MPI send or receive call.

Are pointers to GPU memory created from IpcMemHandles supposed to work with GPU-aware MPI? This is critical for our progress ranks runtime, since it would effectively replace the POSIX shared-memory strategy that we use for messaging related to global arrays hosted in regular host memory.

I’ve included a small test code, using just CUDA and MPI, that reproduces the strategy we want to use inside Global Arrays.

Bruce
mpigpu_test.c

Pritchard Jr., Howard

May 27, 2026, 5:00:52 PM
to us...@lists.open-mpi.org, Panyala, Ajay

Hello Bruce,

I think a little more info is needed. Could you post the output you get from running

ompi_info

Also, double-check that the ompi_info you are running is in the same folder as the mpicc you’re using.

Thanks,

Howard


Edgar Gabriel

May 27, 2026, 5:10:37 PM
to us...@lists.open-mpi.org, Panyala, Ajay

If I understand correctly, the question is this: PE 0 has, let’s say, a buffer on GPU 0. That buffer has been imported into PE 1 using the GPU IPC mechanism and is now mapped at a different virtual address in the address space of PE 1. Can PE 1 use that new virtual address in a communication operation with Open MPI? Is my understanding correct?

I think the challenge might be that if PE 1 uses that virtual address in a Send/Recv operation, the internal protocols will (potentially) try to open an IPC handle for that buffer as well, and I am not sure that PE 1 can do that, since the owner of that buffer is PE 0.

@bosilca ?

Thanks

Edgar

Palmer, Bruce J

May 28, 2026, 11:50:45 AM
to us...@lists.open-mpi.org, Panyala, Ajay
Here is the output from ompi_info:

[d3g293@raven ~]$ ompi_info
                 Package: Open MPI root@raven Distribution
                Open MPI: 5.0.10
  Open MPI repo revision: v5.0.10
   Open MPI release date: Feb 23, 2026
                 MPI API: 3.1.0
            Ident string: 5.0.10
                  Prefix: /ravenfs/software/packages/linux-zen4/openmpi-5.0.10-wukilk7fbicbyktcg5e5ftnb5b7fusvd
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: root
           Configured on: Fri Apr 24 17:45:02 UTC 2026
          Configure host: raven
  Configure command line: '--prefix=/ravenfs/software/packages/linux-zen4/openmpi-5.0.10-wukilk7fbicbyktcg5e5ftnb5b7fusvd'
                          '--enable-shared' '--disable-silent-rules'
                          '--disable-sphinx' '--disable-dependency-tracking'
                          '--enable-builtin-atomics' '--disable-static'
                          '--enable-mpi1-compatibility' '--without-psm'
                          '--without-psm2' '--without-verbs' '--without-mxm'
                          '--with-ucx=/ravenfs/software/packages/linux-zen4/ucx-1.20.0-ohiiu3m5ak6q3vjed2vopkomuqpnrzfk'
                          '--without-ofi' '--without-fca' '--without-hcoll'
                          '--without-ucc' '--without-xpmem' '--with-cma'
                          '--without-knem' '--without-xpmem' '--without-alps'
                          '--without-lsf' '--without-tm' '--with-slurm'
                          '--without-sge' '--without-loadleveler'
                          '--disable-memchecker' '--with-libevent=/usr'
                          '--with-pmix=/usr'
                          '--with-prrte=/ravenfs/software/packages/linux-zen4/prrte-3.0.13-khbsrxkvs3uoo7fbpehd4s6apxeaa65y'
                          '--with-zlib=/ravenfs/software/packages/linux-zen4/zlib-ng-2.3.3-tiqgdh3hucq5h7cnt745h2kemo45fqt4'
                          '--with-hwloc=/ravenfs/software/packages/linux-zen4/hwloc-2.4.1-qbqq3vvfl7po7fgyk2gzk33ip7ltnukj'
                          '--disable-java' '--disable-mpi-java'
                          '--disable-io-romio' '--with-gpfs=no'
                          '--enable-dlopen'
                          '--with-cuda=/ravenfs/software/packages/linux-zen4/cuda-12.9.1-qsftfvqos2erluh5imrw5obbe3rihbu3'
                          '--with-cuda-libdir=/ravenfs/software/packages/linux-zen4/cuda-12.9.1-qsftfvqos2erluh5imrw5obbe3rihbu3/lib64/stubs'
                          '--without-rocm' '--enable-wrapper-rpath'
                          '--disable-wrapper-runpath' '--enable-mpi-fortran'
                          'CFLAGS=-DYY_BUF_SIZE=1048576' '--disable-debug'
                Built by: root
                Built on: Fri Apr 24 17:47:29 UTC 2026
              Built host: raven
              C bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the
                          /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gfortran
                          compiler and/or Open MPI, does not support the
                          following: array subsections, direct passthru
                          (where possible) to underlying Open MPI's C
                          functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: rpath
              C compiler: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gcc
     C compiler absolute: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gcc
  C compiler family name: GNU
      C compiler version: 14.3.0
            C++ compiler: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/g++
   C++ compiler absolute: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/g++
           Fort compiler: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gfortran
       Fort compiler abs: /ravenfs/software/packages/linux-zen4/compiler-wrapper-1.0-jieqhi3owtyhuuhwn7megq7cgywsri4x/libexec/spack/gcc/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.10)
         MCA accelerator: cuda (MCA v2.1.0, API v1.0.0, Component v5.0.10)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.10)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.10)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
                          v5.0.10)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.10)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.10)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.10)
                 MCA btl: uct (MCA v2.1.0, API v3.3.0, Component v5.0.10)
                 MCA btl: smcuda (MCA v2.1.0, API v3.3.0, Component v5.0.10)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.10)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.10)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.10)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.10)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.10)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.10)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component
                          v5.0.10)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v5.0.10)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.10)
              MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v5.0.10)
              MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v5.0.10)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component
                          v5.0.10)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.10)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.10)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.10)
                MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.10)
                MCA smsc: knem (MCA v2.1.0, API v1.0.0, Component v5.0.10)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component
                          v5.0.10)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.10)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.10)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: cuda (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
                          v5.0.10)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.10)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.10)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.10)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.10)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.10)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.10)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.10)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
                          v5.0.10)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.10)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.10)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.10)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v5.0.10)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.10)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.0.10)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.10)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.10)
                 MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
                          v5.0.10)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.10)
                 MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.0.10)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.10)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.10)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v5.0.10)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.10)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.10)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v5.0.10)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v5.0.10)





Palmer, Bruce J

May 28, 2026, 12:00:02 PM
to us...@lists.open-mpi.org, Panyala, Ajay
Hi Edgar,

I should have posted more information about my reproducer. I’ve been running it with 4 processes on 2 SMP nodes, with each node having 1 GPU. Ranks 0 and 2 allocate memory on the GPU. Rank 0 sends a message to rank 3 and rank 2 sends a message to rank 1. Rank 3 opens the allocation created by rank 2 using a cudaIpcOpenMemHandle call and uses the pointer returned by cudaIpcOpenMemHandle in an MPI_Irecv call that expects a message from rank 0. Similarly, rank 1 uses a pointer from cudaIpcOpenMemHandle to a GPU allocation created by rank 0 in an MPI_Irecv call that expects a message from rank 2. This reproduces the behavior of a one-sided put call in Global Arrays using the progress ranks runtime.

The essential feature is that a pointer to GPU memory that was allocated by a different rank is being used in an MPI_Send/Recv call and that pointer is obtained via a cudaIpcOpenMemHandle call.
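To make the rank roles above easier to follow, here is a small C sketch that tabulates them. It is only an illustration, not code from mpigpu_test.c: the helper `role_of`, the `rank_role` struct, and the assumption that ranks are packed by node (ranks 0-1 on one node, ranks 2-3 on the other) are inferred from the description, since a cudaIpcOpenMemHandle mapping only makes sense between processes on the same node.

```c
#include <assert.h>

/* Assumed layout (inferred from the description, not taken from
   mpigpu_test.c): 4 ranks, 2 per SMP node, ranks packed by node. */
enum { NUM_RANKS = 4, RANKS_PER_NODE = 2 };

typedef struct {
    int node;       /* SMP node that hosts this rank */
    int owner;      /* rank on the same node that allocates the GPU buffer */
    int allocates;  /* 1 if this rank allocates GPU memory itself */
    int opens_ipc;  /* 1 if it maps the owner's buffer via cudaIpcOpenMemHandle */
    int sends_to;   /* destination of its send, or -1 if it only receives */
    int recvs_from; /* source of its MPI_Irecv, or -1 if it only sends */
} rank_role;

/* Hypothetical helper: role of `rank` in the 4-rank reproducer. */
rank_role role_of(int rank)
{
    rank_role r;
    assert(rank >= 0 && rank < NUM_RANKS);
    r.node = rank / RANKS_PER_NODE;
    r.owner = r.node * RANKS_PER_NODE;   /* first rank on the node */
    r.allocates = (rank == r.owner);
    r.opens_ipc = !r.allocates;
    /* Owners send to the non-owner rank on the other node: 0 -> 3, 2 -> 1. */
    r.sends_to = r.allocates ? (rank + 3) % NUM_RANKS : -1;
    /* Non-owners receive from the owner on the other node: 3 <- 0, 1 <- 2. */
    r.recvs_from = r.allocates ? -1 : (rank + 1) % NUM_RANKS;
    return r;
}
```

For example, `role_of(3)` gives `owner == 2` and `recvs_from == 0`: rank 3 posts an MPI_Irecv for a message from rank 0, and the receive buffer is the GPU allocation owned by rank 2 that rank 3 mapped through IPC.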

Bruce


Palmer, Bruce J

Jun 2, 2026, 11:51:08 AM
to us...@lists.open-mpi.org, Panyala, Ajay
Any additional thoughts on this?

Bruce

Palmer, Bruce J

Jun 16, 2026, 12:24:13 PM
to us...@lists.open-mpi.org, Panyala, Ajay
We just noticed that the MPI_Irecv and MPI_Send calls in our reproducer have typos. We fixed them, but we are still getting the same results.
mpigpu_test.c