Regarding the usage of MPI-One sided communications in HPC applications


Chandran, Arun

Jan 17, 2025, 7:24:51 AM
to us...@lists.open-mpi.org



Hi Experts,

 

I am trying to understand the usage of MPI’s one-sided communication in HPC applications.

 

This research paper (https://icl.utk.edu/publications/international-survey-mpi-users) reports that its popularity trails collectives and point-to-point APIs.

Given its advantages over two-sided communication, one-sided communication should have gained popularity, right?

 

I tried to search the codebase of:

 

  1. NWCHEM (https://github.com/nwchemgit/nwchem)
  2. WRF(https://github.com/wrf-model/WRF)
  3. Quantum Espresso(https://github.com/QEF/q-e)
  4. GROMACS (https://github.com/gromacs/gromacs)

 

But I did not get any hits for the search string ‘MPI_Win’. Do these applications use one-sided MPI_Get() and MPI_Put() via some other mechanism?
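For concreteness, this is the kind of pattern I was expecting the grep to find (a minimal, hypothetical sketch of MPI-3 one-sided usage, not taken from any of the codebases above):

```c
#include <mpi.h>

/* Hypothetical sketch of MPI-3 one-sided usage (what a grep for MPI_Win
 * would match); not taken from any of the codebases listed above. */
void put_to_neighbor(const double *src, int n, int target_rank)
{
    double *base;
    MPI_Win  win;

    /* Collective: every rank exposes a window of n doubles. */
    MPI_Win_allocate((MPI_Aint)(n * sizeof(double)), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    /* Passive-target epoch: write into the target's window without the
     * target posting a matching receive. */
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
    MPI_Put(src, n, MPI_DOUBLE, target_rank, 0 /* target_disp */,
            n, MPI_DOUBLE, win);
    MPI_Win_unlock(target_rank, win);   /* data is visible at the target now */

    MPI_Win_free(&win);
}
```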

 

Can someone please comment about the usage of one-sided communication in HPC applications? Are there any real-world applications using it?

--Arun

Gilles Gouaillardet

Jan 17, 2025, 7:41:52 AM
to Open MPI Users

It seems you understand the advantages of one-sided communications over the other flavors, but did you carefully consider their drawbacks before concluding they "should have gained popularity"?


Cheers,


Gilles



Alfio Lazzaro

Jan 17, 2025, 7:53:25 AM
to us...@lists.open-mpi.org
One-sided communications are used in CP2K via the DBCSR library. See https://github.com/cp2k/dbcsr.
The algorithm itself was described in https://arxiv.org/abs/1705.10218.


In some cases, one-sided communications are quite handy in terms of implementation (DBCSR does matrix multiplications). However, the performance really depends on the MPI implementation's support.
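To give an idea of why it is handy: with one-sided communication a rank can simply fetch whichever remote matrix block it needs next, without the owner having to post a matching send. A minimal, hypothetical sketch (not actual DBCSR code):

```c
#include <mpi.h>

/* Hypothetical sketch (not DBCSR code): fetch a block of a distributed
 * matrix from the rank that owns it, without that rank's involvement.
 * 'win' is assumed to expose each rank's local matrix storage. */
void fetch_block(double *dst, int nelem, int owner, MPI_Aint disp, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
    MPI_Get(dst, nelem, MPI_DOUBLE, owner, disp, nelem, MPI_DOUBLE, win);
    MPI_Win_unlock(owner, win);   /* dst is valid once the unlock returns */
}
```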



--
Alfio Lazzaro

Joseph Schuchart

Jan 17, 2025, 8:29:07 AM
to us...@lists.open-mpi.org
Hi Arun,

The strength of RMA (low synchronization overhead) is also its main
weakness (lack of synchronization). It's easy to move data between
processes but hard to get the synchronization right so that processes
read the right data. RMA has yet to find a good solution to the
synchronization problem (and in my book bulk synchronization is not
"good"). P2P does a pretty fine job at that, supported by hardware, and
most classes of applications simply don't need the flexibility in
communication pattern RMA would provide.

Collective operations can be fairly well optimized and almost certainly
perform better at scale than an equivalent written out in RMA (e.g.,
a broadcast can be implemented in log(N) steps vs. every process issuing a
GET from the root).
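To make that concrete, here is a rough sketch (hypothetical code, not from
any real application) of the naive RMA "broadcast" next to the collective:

```c
#include <mpi.h>

/* Naive RMA "broadcast": every rank reads the buffer directly from rank 0,
 * so rank 0 services N-1 transfers. MPI_Bcast can instead use a tree and
 * finish in O(log N) steps. Hypothetical sketch for illustration only. */
void rma_bcast_naive(double *buf, int count, MPI_Comm comm)
{
    int     rank;
    MPI_Win win;

    MPI_Comm_rank(comm, &rank);
    MPI_Win_create(buf, (MPI_Aint)(count * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);                 /* open the access epoch */
    if (rank != 0)
        MPI_Get(buf, count, MPI_DOUBLE, 0, 0, count, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                 /* close it; buf is now valid */

    MPI_Win_free(&win);
}

/* The collective equivalent: one call, and the library picks the algorithm. */
void bcast_root0(double *buf, int count, MPI_Comm comm)
{
    MPI_Bcast(buf, count, MPI_DOUBLE, 0, comm);
}
```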

Add to that the fact that MPI RMA was for the longest time not well
supported by implementations (buggy & bad performance) so those who
tried probably threw in the towel at some point. The UCX backend in OMPI
has seen quite some improvements in the last year or two but people are
hesitant to invest resources porting applications.

The MPI RMA-WG started gathering some application examples a while back
and we are definitely interested in more. Here are two examples that we
know of:

- NWCHEM via ARMCI (https://github.com/pmodels/armci-mpi/)
- MURPHY (https://www.murphy-code.dev/)

UPC++ can run over MPI RMA but certainly favors GASNet. You can also
look at use cases for OpenSHMEM and NVSHMEM, which have somewhat
different synchronization models but are similar at heart.

Hope that helps,
Joseph



Chandran, Arun

Jan 17, 2025, 9:25:31 AM
to gilles.go...@gmail.com, us...@lists.open-mpi.org



I am not an expert on MPI applications. My statement about one-sided communication gaining popularity stems from reading some research papers found on the web (mostly from academia), so I am unaware of the practical implications of one-sided communication.

 

I am interested in optimizing one-sided communication for the intra-node case, given that there are a good number of users for it.

 

--Arun

 



Arun chandran

Jan 18, 2025, 3:10:54 AM
to us...@lists.open-mpi.org, alfio....@gmail.com
I have some more questions related to the use of one-sided
communication in CP2K via DBCSR (Could be stupid, but anyway I am
going to shoot it!):

a) Can this be used in CPU-only clusters? Or is this more relevant to
GPU-based clusters?

b) The [DBCSR page](https://www.cp2k.org/dbcsr) says "DBCSR is used in
CP2K, where it provides
core functionality for linear scaling electronic structure theory."
I am trying to understand the scale of this use case. Let's say I am going to
run 'perf' (https://perfwiki.github.io/main/) on a big sample of
clusters running CP2K, will I see a significant amount of time being
spent in one-sided communications?

--Arun

Alfio Lazzaro

Jan 19, 2025, 3:58:42 AM
to Arun chandran, us...@lists.open-mpi.org
On Sat, Jan 18, 2025 at 09:10 Arun chandran <arun.e...@gmail.com> wrote:
> I have some more questions related to the use of one-sided
> communication in CP2K via DBCSR (Could be stupid, but anyway I am
> going to shoot it!):
>
> a) Can this be used in CPU-only clusters? Or is this more relevant to
> GPU-based clusters?


Yes, we don't do any GPU-aware MPI. Recent tests on HPE Cray systems gave a nice speed-up already at a low number of nodes. I have to admit I haven't tested on other systems in a while.

 
> b) The [DBCSR page](https://www.cp2k.org/dbcsr) says "DBCSR is used in
> CP2K, where it provides core functionality for linear scaling electronic
> structure theory." I am trying to understand the scale of this use case.
> Let's say I am going to run 'perf' (https://perfwiki.github.io/main/) on a
> big sample of clusters running CP2K, will I see a significant amount of
> time being spent in one-sided communications?


It depends on the CP2K input. Some of the benchmarks (for example https://github.com/cp2k/cp2k/blob/master/benchmarks/QS_DM_LS/H2O-dft-ls.inp) are quite bound by MPI communication, and you can compare P2P with one-sided there.
Note that the one-sided algorithm is not the default, so you need to enable it. Please reach out to me privately for any other details.

Best regards,

Alfio

 


Arun chandran

Jan 19, 2025, 6:01:12 AM
to us...@lists.open-mpi.org
Hi Joseph,

Thank you so much for the information.

Could you please comment on the scale of one-sided communication usage
in NWCHEM via ARMCI?

a) If I probe the clusters running NWCHEM, will I see a significant
amount of time spent in one-sided APIs (Put, Get),
or is it there to support only a niche use case in NWCHEM?

b) Can NWCHEM via ARMCI be used in CPU-only clusters, or is this more
relevant to GPU-based clusters?

--Arun

Palmer, Bruce J

Jan 22, 2025, 9:40:36 AM
to us...@lists.open-mpi.org

Arun,

 

The NWChem code can use MPI RMA via the MPI RMA runtime in Global Arrays. If you build GA with the autotools build system and use the --with-mpi3 option, you should get this runtime. (For CMake, set the GA_RUNTIME parameter to MPI_RMA.) You can find the MPI_Win calls, etc. used by GA in the GA_HOME/comex/src-mpi3 directory. This is not the recommended option for a high-performance GA runtime, but the recent releases of both Open MPI (5.0.x) and MPICH (4.2.x) seem to be vastly improved over previous releases and not nearly as buggy as they have been in the past.

 

To expand on Joseph’s comments, the choice of one-sided vs two-sided communication is going to depend a great deal on the application. Applications that require synchronized data transfers are going to be better off using two-sided communication (this would include many common HPC calculations) but applications that have a lot of work that can be done independently and in any order will benefit from one-sided communication. Quantum chemistry calculations tend to fall in the latter category. Things like the Fock matrix build can be decomposed into many independent tasks that can be performed in any order and synchronization to guarantee data consistency is relatively infrequent. Other types of calculations, like parallel linear solvers, may require more tightly coupled data movement and hence, two-sided communication is more appropriate.
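A rough sketch of that "many independent tasks, infrequent synchronization" access pattern (hypothetical code, not the actual GA/NWChem implementation) might look like this:

```c
#include <mpi.h>

/* Hypothetical sketch, not the actual GA/NWChem code: each rank works
 * through its own task list and accumulates contributions into remote
 * portions of a distributed matrix exposed through 'win'. The callback
 * 'compute' (also hypothetical) fills a block of at most 64 doubles and
 * reports which rank owns it, at what displacement, and its length. */
void accumulate_contributions(MPI_Win win, int ntasks,
                              void (*compute)(int task, double *block,
                                              int *owner, MPI_Aint *disp,
                                              int *count))
{
    /* One passive-target epoch covering all targets. */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    for (int t = 0; t < ntasks; ++t) {
        double   block[64];
        int      owner, count;
        MPI_Aint disp;

        compute(t, block, &owner, &disp, &count);   /* independent work */

        /* Element-wise atomic update, so concurrent contributions from
         * different ranks to the same location are still summed correctly. */
        MPI_Accumulate(block, count, MPI_DOUBLE, owner, disp,
                       count, MPI_DOUBLE, MPI_SUM, win);
    }

    MPI_Win_flush_all(win);       /* the infrequent synchronization point */
    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);  /* all contributions are now in place */
}
```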

 

Hope this helps.

 

Bruce

 


Arun chandran

Jan 22, 2025, 9:40:44 AM
to us...@lists.open-mpi.org
Replying from my personal email because for some reason I am not able
to reply to 'us...@lists.open-mpi.org' from my office email. (I tried
but was not able to see those replies at
https://mail-archive.com/us...@lists.open-mpi.org/)

I am not an expert on MPI applications. My statement about one-sided
communication gaining popularity stems from reading some research
papers found on the web (mostly academia), so I am unaware of the
practical implications of one-sided communication.

I am interested in optimizing one-sided communication for the intra-node
case if there are a good number of users for it.

--Arun

Jeff Hammond

Jan 24, 2025, 11:14:38 AM
to Open MPI Users
NWChem uses MPI RMA via ARMCI-MPI, which is a separate repo (https://github.com/pmodels/armci-mpi).  ARMCI-MPI is not used by default because it’s not maintained by PNNL (I maintain it).  It is not always the fastest way to run NWChem (which is a very complicated multi-dimensional problem) but ARMCI-MPI has a bunch of really nice performance and debugging options from an RMA perspective.

Here are two talks with some discussion of this topic:

Both are old, but ARMCI-MPI hasn't changed that much since 2019, although I wrote a custom profiler and redesigned the entire request-based RMA system, neither of which is documented beyond the git repo.

NWChem isn’t the only code that uses ARMCI-MPI for GA, but it’s the most popular one by a lot.  Other uses include GTFock (https://faculty.cc.gatech.edu/~echow/pubs/ijhpca-1094342015592960.pdf) and Takeshi Yanai’s DMRG code.  Molpro uses GA as well, and MPI RMA directly, but I haven’t analyzed it much.

The Casper project papers discuss performance issues with MPI RMA as NWChem uses it.

Open MPI has made a lot of progress with RMA over UCX, which was presented in slides at the HPC Advisory Council. I can't find a link via Google, so I'll email you directly later. The total wall-time impact was 3.0-4.5x from OMPI 4.1 to 5.0.

In addition to chemistry codes, both GCC's and Intel's implementations of coarray Fortran use MPI RMA. The GCC library is https://github.com/sourceryinstitute/OpenCoarrays. Obviously, you cannot read the Intel Fortran source code, but I have (in a past job), and its use is qualitatively similar to OpenCoarrays (in the sense of using passive target synchronization).

Thanks, Joseph, for making sure I saw this one. MPI RMA and NWChem is my bat-signal, but I filter MPI user list traffic away from my inbox, so I didn't see this. If you reply, put my email in BCC so it goes to my inbox.

Jeff