Inquiry on GASNet-EX put Bandwidth for CUDA UVA on Perlmutter

Baodi Shan

Oct 31, 2024, 11:49:51 AM
to gasnet...@lbl.gov
Hello,

I am observing significantly lower bandwidth with GASNet-EX put operations compared to get operations on Perlmutter when using CUDA Unified Virtual Addressing (local GPU to remote GPU). 
My configuration details are provided below, and the bandwidth results (./testlarge -cuda-uva -local-gpu -remote-gpu) can be found here: https://pastebin.com/cBQjxDcK.

Could you let me know if this bandwidth difference is expected or if there are any adjustments I should make to my configuration?

Thank you for your assistance.

Best regards,
Baodi Shan

Configuration:
./configure --prefix=[PREFIX] --enable-ofi --with-ofi-provider=cxi --enable-pshm --enable-par --enable-pthreads --with-ibv-spawner=mpi --enable-segment-fast --disable-mpi --disable-smp --disable-portals --disable-mxm --enable-pthreads --with-max-segsize=16GB --enable-par --disable-seq --disable-parsync --disable-ibv-rcv-thread --disable-aligned-segments --enable-pshm --disable-fca --enable-memory-kinds --with-mpi-cflags=-fPIC --with-cflags=-fPIC --enable-kind-cuda-uva

--
Baodi SHAN
Ph.D. Candidate
Exasca||ab
Institute for Advanced Computational Science
Stony Brook University 

Paul H. Hargrove

Nov 4, 2024, 4:15:27 PM
to Baodi Shan, gasnet...@lbl.gov
Baodi,

I am pretty sure you are encountering a known issue with the performance of RMA Puts targeting a remote Nvidia GPU on Perlmutter.
In our own testing on Perlmutter, we see roughly equal flood bandwidth from Puts to host memory, Gets from host memory, and Gets from GPU memory.
However, Puts to remote GPU memory achieve lower bandwidth than those three cases.
This was reported to NERSC, and they have reported it to HPE.

Our tests on AMD GPU systems with the same network (i.e. OLCF's Frontier) do not have this issue.

It is known that source code changes in GASNet can make the performance comparable to the other three cases.  However, with this change RMA Put operations targeting the GPU can appear complete at the initiator before the data is certain to be visible in memory at the target.

TL;DR:  This is "expected" in the sense of being a known behavior, but it is not a desired or intentional behavior.  There is no known work-around that maintains correctness in general.

-Paul

--
Paul H. Hargrove <PHHar...@lbl.gov>
Pronouns: he, him, his
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department
Lawrence Berkeley National Laboratory
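
Editorial illustration (not part of either message): the correctness concern Paul describes, where a Put can appear complete at the initiator before the data is visible at the target, can be sketched with a toy model. Everything below is hypothetical and for illustration only: `ToyTarget`, `put`, and `wait_for_visibility` are invented names, nothing here calls GASNet-EX, and the model does not describe how GASNet or the network actually implements completion.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTarget:
    """Hypothetical stand-in for remote GPU memory with delayed write visibility."""
    visible: dict = field(default_factory=dict)     # what a reader at the target sees
    in_flight: list = field(default_factory=list)   # writes delivered but not yet visible

    def deliver(self, addr, value):
        # The write has arrived at the target but is not yet observable.
        self.in_flight.append((addr, value))

    def flush(self):
        # Make every delivered write observable to readers at the target.
        for addr, value in self.in_flight:
            self.visible[addr] = value
        self.in_flight.clear()

    def read(self, addr):
        return self.visible.get(addr)


def put(target, addr, value, wait_for_visibility):
    """Return once the toy Put is reported complete at the initiator."""
    target.deliver(addr, value)
    if wait_for_visibility:
        # Slower path: completion implies the data is visible at the target.
        target.flush()
    # Otherwise: faster, but completion says nothing about visibility.


def target_sees_after_completion(wait_for_visibility):
    """What the target reads right after the initiator sees completion."""
    target = ToyTarget()
    put(target, 0, "payload", wait_for_visibility)
    seen = target.read(0)   # the target reads as soon as it is told "done"
    target.flush()          # the data becomes visible eventually either way
    return seen


print(target_sees_after_completion(True))
print(target_sees_after_completion(False))
```

In this toy model the early-completion variant does less work per Put, but a reader that trusts the completion signal can observe stale memory, which is the trade-off described in the reply above.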