Hi Kirtus -
Thanks for the great question!
The short answer is, there is no built-in support to do exactly what you are requesting YET.
Currently the most efficient approach is probably to write a GPU kernel on the sending side that gathers/packs the data into a GPU-resident contiguous buffer (allocated from upcxx::device_allocator memory), then invoke upcxx::copy to move the packed data across the network to target memory (which can be in GPU or host memory, depending on the consumer). upcxx::copy supports remote completion events, so if you additionally need to scatter/unpack the data at the destination, you can specify remote_cx::as_rpc(callable) with a callable that unpacks the data at the target (probably invoking another GPU kernel for unpacking if the destination payload is GPU-resident). This assumes a put-like copy protocol to push the data (as suggested in your question), but a similar approach can also be used with a get-like copy that pulls the data.
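To make that concrete, here's a minimal sketch of the put-style version of this pattern, assuming UPC++ was built with CUDA memory kinds enabled. The domain shape, buffer sizes, neighbor choice, and the use of cudaMemcpy2D as a stand-in for a real packing kernel are all illustrative assumptions on my part (not code from the library or our examples); a real application would launch its own gather kernel on the send side and an unpack/scatter kernel inside the remote_cx::as_rpc callback:

#include <upcxx/upcxx.hpp>
#include <cuda_runtime.h>
#include <atomic>

using gp_gpu = upcxx::global_ptr<double, upcxx::memory_kind::cuda_device>;

std::atomic<int> halos_received{0};  // bumped by the remote completion callback at the target

int main() {
  upcxx::init();
  constexpr std::size_t nx = 256, ny = 256;            // illustrative local domain extents

  // Open a GPU and carve out a device segment for the packed communication buffers.
  upcxx::cuda_device gpu_device(0);
  upcxx::device_allocator<upcxx::cuda_device> gpu_alloc(gpu_device, 4 * ny * sizeof(double));

  gp_gpu send_gp = gpu_alloc.allocate<double>(ny);     // GPU-resident packed send buffer
  gp_gpu recv_gp = gpu_alloc.allocate<double>(ny);     // GPU-resident landing zone

  // Publish the landing zone so a neighbor can push into it.
  upcxx::dist_object<gp_gpu> recv_dobj(recv_gp);
  int nbr = (upcxx::rank_me() + 1) % upcxx::rank_n();
  gp_gpu nbr_recv = recv_dobj.fetch(nbr).wait();

  // The computational domain can live in ordinary device memory; only the packed
  // communication buffers must come from the device_allocator segment.
  double *domain;
  cudaMalloc(&domain, nx * ny * sizeof(double));

  // (1) Pack: gather ny strided boundary elements (stride nx doubles) into the
  //     contiguous send buffer. A strided cudaMemcpy2D stands in for a real
  //     packing kernel here; synchronize before communicating.
  cudaMemcpy2D(gpu_alloc.local(send_gp), sizeof(double),
               domain, nx * sizeof(double),
               sizeof(double), ny, cudaMemcpyDeviceToDevice);
  cudaDeviceSynchronize();

  // (2) Push the packed data into the neighbor's GPU buffer. The as_rpc callback
  //     fires at the target after the payload has landed; a real code would launch
  //     its unpack/scatter kernel there.
  auto done = upcxx::copy(send_gp, nbr_recv, ny,
      upcxx::remote_cx::as_rpc([]() { halos_received.fetch_add(1); })
      | upcxx::operation_cx::as_future());
  done.wait();                                         // source-side operation completion

  while (halos_received.load() < 1) upcxx::progress(); // wait for our own incoming halo

  upcxx::barrier();
  gpu_alloc.deallocate(send_gp);
  gpu_alloc.deallocate(recv_gp);
  cudaFree(domain);
  gpu_device.destroy();                                // collective; must precede finalize
  upcxx::finalize();
}

Note that only the packed communication buffers need to live in the device segment created by device_allocator; the rest of the GPU-resident domain can stay in ordinary cudaMalloc'd memory.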
The most concise demonstration I can think of in our example codes for this general use case is in the kokkos_3dhalo example in upcxx-extras. Unfortunately, that example relies heavily upon Kokkos for all the GPU kernels, so it might be hard to follow if you're unfamiliar with Kokkos. The linked function, pack_T_halo(), packs a GPU-resident 3-D halo boundary into a contiguous GPU buffer via Kokkos::deep_copy on a subview, using a dedicated asynchronous CUDA stream. We later synchronize the packing kernel and send the packed data using upcxx::copy() in exchange_T_halo(). A rough sketch of that pack/exchange split appears below.
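Here is a simplified, non-authoritative sketch of that split, assuming Kokkos with the CUDA backend; the view names, extents, and the pack_halo()/exchange_halo() helpers are illustrative stand-ins for the actual pack_T_halo()/exchange_T_halo() code in upcxx-extras, and the device_allocator / dist_object setup mirrors the sketch above:

#include <Kokkos_Core.hpp>
#include <upcxx/upcxx.hpp>
#include <cuda_runtime.h>

using gp_gpu = upcxx::global_ptr<double, upcxx::memory_kind::cuda_device>;
using View3d = Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::CudaSpace>;
using Buf2d  = Kokkos::View<double**,  Kokkos::LayoutLeft, Kokkos::CudaSpace>;

// Analogue of pack_T_halo(): asynchronously deep_copy one boundary plane of the
// GPU-resident domain into a contiguous GPU buffer, on a dedicated CUDA stream.
void pack_halo(View3d T, Buf2d send_buf, Kokkos::Cuda pack_space) {
  auto face = Kokkos::subview(T, 1, Kokkos::ALL, Kokkos::ALL);  // low-x interior plane
  Kokkos::deep_copy(pack_space, send_buf, face);                // async packing "kernel"
}

// Analogue of exchange_T_halo(): synchronize the packing stream, then push the
// packed plane into the neighbor's GPU-resident buffer with upcxx::copy().
upcxx::future<> exchange_halo(Kokkos::Cuda pack_space, gp_gpu send_gp,
                              gp_gpu nbr_recv_gp, std::size_t count) {
  pack_space.fence();   // packing must complete before the copy may read send_gp
  return upcxx::copy(send_gp, nbr_recv_gp, count);
}

int main(int argc, char **argv) {
  upcxx::init();
  Kokkos::initialize(argc, argv);
  {
    constexpr std::size_t nx = 64, ny = 64, nz = 64;
    View3d T("T", nx, ny, nz);                          // GPU-resident computational domain

    upcxx::cuda_device gpu_device(0);
    upcxx::device_allocator<upcxx::cuda_device>
        gpu_alloc(gpu_device, 4 * ny * nz * sizeof(double));

    gp_gpu send_gp = gpu_alloc.allocate<double>(ny * nz);
    gp_gpu recv_gp = gpu_alloc.allocate<double>(ny * nz);
    Buf2d send_buf(gpu_alloc.local(send_gp), ny, nz);   // unmanaged view over the packed buffer

    upcxx::dist_object<gp_gpu> recv_dobj(recv_gp);
    gp_gpu nbr_recv = recv_dobj.fetch((upcxx::rank_me() + 1) % upcxx::rank_n()).wait();

    cudaStream_t pack_stream;
    cudaStreamCreate(&pack_stream);
    Kokkos::Cuda pack_space(pack_stream);               // exec-space instance on a dedicated stream

    pack_halo(T, send_buf, pack_space);
    exchange_halo(pack_space, send_gp, nbr_recv, ny * nz).wait();

    upcxx::barrier();
    gpu_alloc.deallocate(send_gp);
    gpu_alloc.deallocate(recv_gp);
    gpu_device.destroy();
  }
  Kokkos::finalize();
  upcxx::finalize();
}
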
In this example we don't bother to explicitly unpack data at the target process; instead, we trigger a GPU kernel that integrates the communicated boundary elements from the GPU-resident contiguous communication buffer directly into the surface elements of the GPU-resident computational domain. This code notably performs quite well, as documented in this paper: