MAGMA with CUDA stream events and asynchronous memory transfer

Georg P.

Mar 4, 2022, 12:01:13 PM
to MAGMA User
Hi there,
I am trying to understand how to set the streams for MAGMA computations.
To be precise, I have a loop inside which I am calling a couple of MAGMA routines.
In other words, I simply want to call

       cudaStreamWaitEvent(magma_stream, computation_event, 0)

in order to overlap the MAGMA computations with asynchronous CUDA memory transfers, something like the dummy code I appended.
I tried to use "magma_queue_get_cuda_stream(magma_queue)" to query the MAGMA stream, but no luck.
What confused me even more is that the streams used by the MAGMA computation change with each loop iteration (as seen in nvvp).
What am I missing?
Any help is greatly appreciated.
Best Georg
[Attachment: dummy_code.png]

Mark Gates

Mar 4, 2022, 2:37:33 PM
to Georg P., MAGMA User
Which routines are you using? For most factorizations (getrf, potrf, geqrf, geev, syev, gesvd, etc.), the MAGMA routine internally creates queues (streams), uses them, then destroys them. Because those routines internally do a lot of data transfers, it's unclear that you would benefit from trying to overlap your own data transfers with them. These routines are also synchronous with respect to the CPU: they don't return until the computation is finished. MAGMA also has native GPU-only factorizations, where this overlapping may make sense. MAGMA BLAS routines explicitly take a queue and most are asynchronous, so overlapping with those should be much simpler (we do it ourselves inside the factorization routines).
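
For example, here is a rough (untested) sketch of the MAGMA BLAS pattern, assuming magma_v2.h is included, magma_init() has been called, and dA, dB, dC are n-by-n device matrices already allocated with magma_dmalloc:

    // Untested sketch: MAGMA BLAS takes an explicit queue and is asynchronous
    // with respect to the CPU. Assumes dA, dB, dC are n-by-n device matrices.
    magma_device_t device;
    magma_getdevice( &device );

    magma_queue_t queue;
    magma_queue_create( device, &queue );

    // Enqueued on 'queue'; returns to the CPU before the GPU work is done.
    magma_dgemm( MagmaNoTrans, MagmaNoTrans, n, n, n,
                 1.0, dA, n, dB, n,
                 0.0, dC, n, queue );

    // Block the CPU until everything enqueued on 'queue' has finished.
    magma_queue_sync( queue );

    magma_queue_destroy( queue );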

From your code snippet, it's unclear where magma_stream is coming from. A minimal working example would make it much easier to talk concretely about the code.

Mark

--
Innovative Computing Laboratory
University of Tennessee, Knoxville

Georg P.

Mar 7, 2022, 9:54:25 AM
to MAGMA User, mga...@icl.utk.edu, Georg P.

Thank you so much, that explains a lot.

I have data that does not fit on the GPU as a whole.
I thus have a loop where, in each round (see the sketch after this list):

  1. I transfer some data to the GPU
  2. perform the following MAGMA subroutines:
     • magma_dgemm (takes a queue)
     • magma_dpotrf_gpu
     • magma_dtrsm (takes a queue)
  3. send the data back to the CPU
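
Roughly, one iteration looks like this (untested sketch; all names, sizes, and leading dimensions are placeholders, the specific operations are just examples, and I left out the allocations):

    // Simplified sketch of one round; 'queue' created with magma_queue_create.
    magma_int_t info = 0;

    // 1. Transfer this round's data to the GPU (asynchronous on 'queue').
    magma_dsetmatrix_async( n, k, hB, ldb, dB, lddb, queue );

    // 2. MAGMA computations.
    magma_dgemm( MagmaNoTrans, MagmaTrans, n, n, k,
                 1.0, dB, lddb, dB, lddb,
                 1.0, dA, ldda, queue );                 // takes a queue
    magma_queue_sync( queue );                           // dpotrf_gpu uses its own internal streams
    magma_dpotrf_gpu( MagmaLower, n, dA, ldda, &info );  // synchronous w.r.t. the CPU
    magma_dtrsm( MagmaLeft, MagmaLower, MagmaNoTrans, MagmaNonUnit,
                 n, nrhs, 1.0, dA, ldda, dX, lddx, queue );  // takes a queue

    // 3. Send the result back to the CPU.
    magma_dgetmatrix_async( n, nrhs, dX, lddx, hX, ldx, queue );
    magma_queue_sync( queue );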

As the memory transfers are quite slow, I want to do the transfer of the next/previous round of data during those MAGMA calculations.
After some more reading, I found that I can use the non-blocking magma_dpotf2_gpu (takes a queue) instead of magma_dpotrf_gpu.
To wait for those operations to finish, do I then simply query the stream with

cudaStream_t m_stream = magma_queue_get_cuda_stream(magma_queue);

and pass it to

cudaStreamWaitEvent(m_stream, computation_event, 0)?
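
To make the question concrete, this is roughly the pattern I have in mind (untested; d_buf, h_data, num_rounds, and bytes are placeholders, and the data for round 0 is assumed to be on the GPU already; h_data should be pinned memory for the copies to be truly asynchronous):

    // Untested sketch: compute on the MAGMA queue's stream, prefetch on a
    // separate transfer stream, and synchronize the two with events.
    cudaStream_t m_stream = magma_queue_get_cuda_stream( magma_queue );

    cudaStream_t transfer_stream;
    cudaStreamCreate( &transfer_stream );

    cudaEvent_t transfer_done[2], compute_done[2];
    for (int i = 0; i < 2; ++i) {
        cudaEventCreateWithFlags( &transfer_done[i], cudaEventDisableTiming );
        cudaEventCreateWithFlags( &compute_done[i],  cudaEventDisableTiming );
    }

    for (int round = 0; round < num_rounds; ++round) {
        int cur  = round % 2;
        int next = (round + 1) % 2;

        // Don't start computing on d_buf[cur] before its copy has finished.
        cudaStreamWaitEvent( m_stream, transfer_done[cur], 0 );

        // ... MAGMA calls on 'magma_queue' operating on d_buf[cur] ...
        cudaEventRecord( compute_done[cur], m_stream );

        // Meanwhile, prefetch the next round into d_buf[next], but only after
        // the compute that last used that buffer (round - 1) has finished.
        if (round + 1 < num_rounds) {
            cudaStreamWaitEvent( transfer_stream, compute_done[next], 0 );
            cudaMemcpyAsync( d_buf[next], h_data[round + 1], bytes,
                             cudaMemcpyHostToDevice, transfer_stream );
            cudaEventRecord( transfer_done[next], transfer_stream );
        }
    }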

Mark Gates

Mar 7, 2022, 3:05:04 PM
to Georg P., MAGMA User
That should work, although I didn't test it. The potf2 version uses Level 2 BLAS, at least according to its docs, so it would be substantially slower than the usual Level 3 BLAS version (potrf). potf2 is intended for factoring a sub-matrix as part of a larger blocked potrf operation.

There's also magma_event_{create, destroy, record, query, sync}, and magma_queue_wait_event, if you want to use MAGMA structures rather than referring to CUDA structures. They are just simple wrappers around the CUDA versions, for portability.
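For example, something along these lines (just a sketch, not tested), where compute_queue and transfer_queue are two separate MAGMA queues and hA_next/dA_next are placeholder buffers:

    // Sketch only: overlap a host-to-device prefetch on 'transfer_queue' with
    // compute on 'compute_queue', using MAGMA's event wrappers.
    magma_event_t transfer_done;
    magma_event_create( &transfer_done );

    // Prefetch the next block on the transfer queue and mark its completion.
    magma_dsetmatrix_async( n, n, hA_next, lda, dA_next, ldda, transfer_queue );
    magma_event_record( transfer_done, transfer_queue );

    // ... compute on 'compute_queue' using the current buffer ...

    // Before the compute queue touches dA_next, make it wait for the copy.
    magma_queue_wait_event( compute_queue, transfer_done );

    // If the CPU itself needs to wait for the copy:
    magma_event_sync( transfer_done );

    magma_event_destroy( transfer_done );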

Mark

Georg P.

Mar 9, 2022, 10:38:39 AM
to MAGMA User, mga...@icl.utk.edu, Georg P.
That was exactly what I was looking for, and it is working great.
I still have one problem: I saw that potf2 only works for matrices of order N ≤ 512. What is the reason for this?
My sub-matrices are much larger, so I guess that makes the function unusable for me again.

Georg

Mark Gates

Mar 10, 2022, 10:55:44 AM
to Georg P., MAGMA User
The limit of N ≤ 512 reflects CUDA limits on the number of threads in a thread block. It might be possible to raise the limit to, say, 1024 on more recent CUDA GPUs; apparently we haven't had a need for larger blocks, so we haven't investigated.
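
If you want to check the relevant limit on your particular GPU, something like this (untested, device 0 assumed) prints the property:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // maxThreadsPerBlock is the CUDA per-block thread limit the N <= 512 bound is tied to.
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, 0 );   // device 0
        std::printf( "maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock );
        return 0;
    }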

Mark
