Baodi,
I am pretty sure you are encountering a known issue with the performance of RMA Puts targeting a remote Nvidia GPU on Perlmutter.
In our own testing on Perlmutter, we see roughly equal flood bandwidth from Puts to host memory, Gets from host memory, and Gets from GPU memory.
However, the bandwidth of Puts to remote GPU memory is lower than in those three cases.
This was reported to NERSC, and they have reported it to HPE.
Our tests on AMD GPU systems with the same network (i.e. OLCF's Frontier) do not have this issue.
Source code changes in GASNet are known to bring the performance of this case in line with the other three. However, with those changes, RMA Put operations targeting the GPU can appear complete at the initiator before the data is certain to be visible in memory at the target.
TL;DR: This is "expected" in the sense of being a known behavior, but it is not a desired or intentional behavior. There is no known work-around that maintains correctness in general.
-Paul