Inquiry on GASNet-EX put Bandwidth for CUDA UVA on Perlmutter

Baodi Shan

Oct 31, 2024, 11:49:51 AM

to gasnet...@lbl.gov

Hello,

I am observing significantly lower bandwidth with GASNet-EX put operations compared to get operations on Perlmutter when using CUDA Unified Virtual Addressing (local GPU to remote GPU).

My configuration details are provided below, and the bandwidth results (./testlarge -cuda-uva -local-gpu -remote-gpu) can be found here: https://pastebin.com/cBQjxDcK.

Could you let me know if this bandwidth difference is expected or if there are any adjustments I should make to my configuration?

Thank you for your assistance.

Best regards,
Baodi Shan

Configuration:
./configure --prefix=[PREFIX] --enable-ofi --with-ofi-provider=cxi --enable-pshm --enable-par --enable-pthreads --with-ibv-spawner=mpi --enable-segment-fast --disable-mpi --disable-smp --disable-portals --disable-mxm --enable-pthreads --with-max-segsize=16GB --enable-par --disable-seq --disable-parsync --disable-ibv-rcv-thread --disable-aligned-segments --enable-pshm --disable-fca --enable-memory-kinds --with-mpi-cflags=-fPIC --with-cflags=-fPIC --enable-kind-cuda-uva

Baodi SHAN

Ph.D. Candidate

Exasca||ab

Institute for Advanced Computational Science

Stony Brook University

Paul H. Hargrove

Nov 4, 2024, 4:15:27 PM

to Baodi Shan, gasnet...@lbl.gov

Baodi,

I am pretty sure you are encountering a known issue with the performance of RMA Puts targeting a remote Nvidia GPU on Perlmutter.

In our own testing on Perlmutter, we see roughly equal flood bandwidth from Puts to host memory, Gets from host memory, and Gets from GPU memory.

However, the bandwidth of Puts to remote GPU memory is lower than in those three cases.
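For readers comparing their own numbers, the following is a minimal sketch of how a flood bandwidth figure and a Put/Get ratio can be computed from a transfer size, an iteration count, and an elapsed time. The function names, the MiB/s unit, and the sample values are assumptions made for illustration; they are not taken from testlarge or from any measurement on Perlmutter.

```python
def flood_bandwidth_mib_s(nbytes: int, iters: int, elapsed_s: float) -> float:
    """Aggregate bandwidth in MiB/s for `iters` back-to-back transfers
    of `nbytes` each, completed in `elapsed_s` seconds (hypothetical helper)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    total_bytes = nbytes * iters
    return total_bytes / elapsed_s / (1024 * 1024)


def put_get_ratio(put_mib_s: float, get_mib_s: float) -> float:
    """Ratio of Put bandwidth to Get bandwidth; values well below 1.0
    indicate that Puts are slower than Gets (hypothetical helper)."""
    if get_mib_s <= 0:
        raise ValueError("get_mib_s must be positive")
    return put_mib_s / get_mib_s


if __name__ == "__main__":
    # Illustrative inputs only, not measured results.
    get_bw = flood_bandwidth_mib_s(nbytes=1 << 20, iters=100, elapsed_s=0.5)
    put_bw = flood_bandwidth_mib_s(nbytes=1 << 20, iters=100, elapsed_s=2.0)
    print(f"get: {get_bw:.1f} MiB/s, put: {put_bw:.1f} MiB/s, "
          f"ratio: {put_get_ratio(put_bw, get_bw):.2f}")
```

Computing the ratio at each transfer size for the four cases (Put or Get, host or GPU memory) makes it easy to see whether only the Put-to-remote-GPU case falls behind.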

This was reported to NERSC, and they have reported it to HPE.

Our tests on AMD GPU systems with the same network (i.e. OLCF's Frontier) do not have this issue.

It is known that source code changes in GASNet can bring the performance in line with the other three cases. However, with such a change, RMA Put operations targeting the GPU can appear complete at the initiator before the data is certain to be visible in memory at the target.

TL;DR: This is "expected" in the sense of being a known behavior, but it is not a desired or intentional behavior. There is no known work-around that maintains correctness in general.

-Paul

Paul H. Hargrove <PHHar...@lbl.gov>
Pronouns: he, him, his

Computer Languages & Systems Software (CLaSS) Group

Computer Science Department

Lawrence Berkeley National Laboratory