Non-contiguous copy with memory kinds

Kirtus Leyba

Nov 3, 2022, 5:05:21 PM
to UPC++
I'd like to use strided or irregular data movement for the halo cells on the boundary of a 3D simulation, but I am currently using upcxx::copy because I am using GPU kernels for computation. Is it possible to do something like rput_strided or rput_irregular with the upcxx::copy communication call? I cannot find anything like that in the docs or in upcxx-extras.

Alternatively, can I use memory kinds with rput/rget instead of with copy?

Any suggestions for alternative approaches if this kind of operation is not supported would also be helpful.

Thanks as always for the help!
-Kirtus Leyba

Dan Bonachea

Nov 4, 2022, 12:48:32 AM
to Kirtus Leyba, UPC++, Steven Hofmeyr
Hi Kirtus -

Thanks for the great question!
The short answer is, there is no built-in support to do exactly what you are requesting YET.

Currently the most efficient approach is probably to write a GPU kernel on the sending side that gathers/packs the data into a GPU-resident contiguous buffer (allocated from upcxx::device_allocator memory), then invoke upcxx::copy to move the packed data across the network to target memory (which can be in GPU or host memory, depending on the consumer).

upcxx::copy supports remote completion events, so if you additionally need to scatter/unpack the data at the destination, you can specify remote_cx::as_rpc(callable) with a callable that unpacks the data at the target (probably invoking another GPU kernel for unpacking if the destination payload is GPU-resident). This assumes a put-like copy protocol to push the data (as suggested in your question), but a similar approach can also be used with a get-like copy that pulls the data.
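
To sketch what I mean (untested, and the kernel and function names here are placeholders rather than anything from SIMCoV or our examples; upcxx::copy, device_allocator and remote_cx::as_rpc are the real UPC++ API):

    // Untested sketch of the pack-then-copy pattern; names are invented.
    #include <upcxx/upcxx.hpp>
    #include <cuda_runtime.h>

    using dev_ptr = upcxx::global_ptr<double, upcxx::memory_kind::cuda_device>;

    // Hypothetical gather kernel: pull one strided face into a contiguous buffer
    __global__ void pack_halo_kernel(double *dst, const double *domain,
                                     size_t n, size_t stride) {
      size_t i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) dst[i] = domain[i * stride];
    }

    void push_halo(upcxx::device_allocator<upcxx::cuda_device> &alloc,
                   dev_ptr send_buf,  // contiguous staging buffer on my GPU
                   dev_ptr dest_buf,  // landing buffer in the neighbor's GPU segment
                   const double *domain, size_t n, size_t stride,
                   cudaStream_t stream) {
      // 1. gather/pack the strided halo into the staging buffer, on the GPU
      double *raw = alloc.local(send_buf); // raw device pointer for the kernel
      pack_halo_kernel<<<(n + 255) / 256, 256, 0, stream>>>(raw, domain, n, stride);
      cudaStreamSynchronize(stream);       // packing must finish before the copy

      // 2. one contiguous GPU-to-GPU transfer across the network; the as_rpc
      //    callback runs at the target after the payload has landed there
      upcxx::copy(send_buf, dest_buf, n,
                  upcxx::remote_cx::as_rpc([](dev_ptr landed, size_t cnt) {
                    /* e.g. launch/enqueue a GPU unpack kernel on 'landed' */
                  }, dest_buf, n));
    }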

The most concise demonstration I can think of in our example codes for this general use case is the kokkos_3dhalo example in upcxx-extras. Unfortunately, that example relies heavily on Kokkos for all the GPU kernels, so it might be hard to follow if you're unfamiliar with Kokkos. The pack_T_halo() function packs a GPU-resident 3D halo boundary into a contiguous GPU buffer via Kokkos::deep_copy on a subview, using a dedicated asynchronous CUDA stream. We later synchronize the packing kernel and send the packed data using upcxx::copy() in exchange_T_halo(). In this example we don't bother to explicitly unpack the data at the target process; we just trigger a GPU kernel that integrates the communicated boundary elements from the GPU-resident contiguous communication buffer directly into the surface elements of the GPU-resident computational domain. This code notably performs quite well, as documented in this paper:


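For reference, the shape of that pack step is roughly the following (paraphrased from memory, not verbatim from kokkos_3dhalo; the view and function names are invented, and in the real example the send buffer is an unmanaged Kokkos view wrapping upcxx::device_allocator memory so that upcxx::copy can transfer it):

    // Paraphrased sketch of the Kokkos pack step; names are invented.
    #include <Kokkos_Core.hpp>

    using Domain = Kokkos::View<double***, Kokkos::CudaSpace>;
    using Face   = Kokkos::View<double**,  Kokkos::CudaSpace>;

    void pack_x_face(const Domain &T, const Face &send_buf,
                     int face_index, cudaStream_t stream) {
      Kokkos::Cuda exec(stream);  // dedicated asynchronous execution instance
      // deep_copy of a subview gathers one (strided) face of the 3D domain
      // into the contiguous send buffer, entirely on the GPU:
      auto face = Kokkos::subview(T, face_index, Kokkos::ALL, Kokkos::ALL);
      Kokkos::deep_copy(exec, send_buf, face);
      // exec.fence() (or an equivalent stream sync) must complete before
      // send_buf is handed to upcxx::copy in the exchange step.
    }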
The other example that comes to mind is the upcxx-extras 3-d Jacobi example. This example is written in UPC++ and bare CUDA, so the GPU pack/unpack kernels are explicitly visible in the source files. Its CUDA kernels perform the explicit packing and unpacking of boundary data on the GPU. These are invoked from the ghostRegion class to pack the elements into a contiguous buffer in preparation for transfer; later, a get-like upcxx::copy() call moves the contiguous data, and the completion continuation invokes GPU kernels to unpack it.
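
The get-like version of the idiom looks roughly like this (again an untested sketch with invented names; it assumes the neighbor has already packed its face into remote_packed, and upcxx::copy with operation_cx::as_future is the real API):

    // Untested sketch of the get-like pull; names are invented.
    #include <upcxx/upcxx.hpp>
    #include <cuda_runtime.h>

    using dev_ptr = upcxx::global_ptr<double, upcxx::memory_kind::cuda_device>;

    // Hypothetical scatter kernel: spread the contiguous buffer back out
    __global__ void unpack_ghost_kernel(double *domain, const double *buf,
                                        size_t n, size_t stride) {
      size_t i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) domain[i * stride] = buf[i];
    }

    upcxx::future<> pull_ghost(dev_ptr remote_packed, // neighbor's packed face
                               dev_ptr local_buf,     // landing buffer on my GPU
                               upcxx::device_allocator<upcxx::cuda_device> &alloc,
                               double *domain_raw, size_t n, size_t stride,
                               cudaStream_t stream) {
      // pull the neighbor's contiguous packed data into my GPU segment, then
      // unpack it in the operation-completion continuation on the initiator
      return upcxx::copy(remote_packed, local_buf, n,
                         upcxx::operation_cx::as_future())
        .then([=, &alloc]() {
          double *buf = alloc.local(local_buf);
          unpack_ghost_kernel<<<(n + 255) / 256, 256, 0, stream>>>(
              domain_raw, buf, n, stride);
        });
    }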

Deploying something more concise/automated for this use case (i.e., a strided upcxx::copy) is something I'm personally interested in, but it would require non-trivial engineering effort, and honestly I believe you're the first user to express interest in such a feature. I'm curious to learn more about your application use case: is this for SIMCoV or something else? I'd be happy to set up a videoconference to discuss further if that makes sense.

Hope this helps.
-D

Kirtus Leyba

Nov 15, 2022, 4:00:31 PM
to UPC++
Wow, that is a very helpful response.

Yes, this is for the SIMCoV project. I'll look through your code examples and try to come up with a solution that works for me. I'll respond to this thread if I have any hang-ups, and we could schedule a video call to discuss further if I need more guidance.

-Kirtus