Thank you so much, that explains a lot.
I have data that does not fit on the GPU as a whole.
I thus have a loop where each round I:
1. Send the next chunk of data to the GPU
2. Run the MAGMA calculations on it
3. Send the data back to the CPU
Since the memory transfers are quite slow, I want to overlap the transfer of the next/previous round's data with those MAGMA calculations.
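Concretely, I'm picturing something like this double-buffered loop. This is only a sketch of what I have in mind, not working code: the names (h_chunk, d_buf, num_chunks, n) are placeholders, the actual computation is left as a comment, and I understand the host buffers would need to be pinned (e.g. allocated with magma_dmalloc_pinned) for the async copies to really overlap.

#include <magma_v2.h>

// Sketch: process num_chunks n-by-n host matrices h_chunk[0..num_chunks-1],
// double-buffering them through two device buffers so that the upload of
// chunk i+1 overlaps with the computation on chunk i.
// (magma_init() is assumed to have been called already.)
void pipeline(double *h_chunk[], magma_int_t n, magma_int_t num_chunks)
{
    magma_device_t dev;
    magma_getdevice(&dev);

    // One queue for the computation, one for the transfers.
    magma_queue_t compute_queue, transfer_queue;
    magma_queue_create(dev, &compute_queue);
    magma_queue_create(dev, &transfer_queue);

    magmaDouble_ptr d_buf[2];
    magma_dmalloc(&d_buf[0], (size_t)n * n);
    magma_dmalloc(&d_buf[1], (size_t)n * n);

    // Prime the pipeline: upload chunk 0 and wait for it.
    magma_dsetmatrix_async(n, n, h_chunk[0], n, d_buf[0], n, transfer_queue);
    magma_queue_sync(transfer_queue);

    for (magma_int_t i = 0; i < num_chunks; ++i) {
        int cur = i % 2, nxt = (i + 1) % 2;

        // Start uploading the next chunk while the current one is being computed.
        if (i + 1 < num_chunks)
            magma_dsetmatrix_async(n, n, h_chunk[i+1], n, d_buf[nxt], n, transfer_queue);

        // ... enqueue the MAGMA computation on d_buf[cur] via compute_queue ...

        // Download the current result. Crude host-side sync for now;
        // the event question below is about refining exactly this step.
        magma_queue_sync(compute_queue);
        magma_dgetmatrix_async(n, n, d_buf[cur], n, h_chunk[i], n, transfer_queue);
        magma_queue_sync(transfer_queue);
    }

    magma_free(d_buf[0]);
    magma_free(d_buf[1]);
    magma_queue_destroy(compute_queue);
    magma_queue_destroy(transfer_queue);
}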
After some more reading, I found that I can use the non-blocking magma_dpotf2_gpu (takes a queue) instead of magma_dpotrf_gpu.
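For the computation step itself I'm assuming the MAGMA 2.x argument list magma_dpotf2_gpu(uplo, n, dA, ldda, queue, &info); please correct me if it differs in the current release. Using the placeholder names from the sketch above, I'd call it like this:

// Enqueue the unblocked Cholesky panel factorization on the compute queue.
// (Signature assumed from the MAGMA 2.x headers; info follows LAPACK
// conventions, so nonzero means the factorization failed.)
magma_int_t info = 0;
magma_dpotf2_gpu(MagmaLower, n, d_buf[cur], n, compute_queue, &info);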
To wait for those events to finish, do I then simply retrieve the queue's underlying CUDA stream,

cudaStream_t m_stream = magma_queue_get_cuda_stream(magma_queue);

and then call cudaStreamWaitEvent(m_stream, computation_event, 0)?
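In other words, is the pattern below the right way to do it? This is just my understanding, continuing with the placeholder names from the sketches above (computation_event is an event I create myself, compute_queue/transfer_queue are the two MAGMA queues):

// Record an event on the compute queue's stream once the factorization has
// been enqueued, then make the transfer stream wait on that event so the
// download only starts after the computation has finished.
cudaStream_t compute_stream  = magma_queue_get_cuda_stream(compute_queue);
cudaStream_t transfer_stream = magma_queue_get_cuda_stream(transfer_queue);

cudaEvent_t computation_event;
cudaEventCreate(&computation_event);

// ... enqueue magma_dpotf2_gpu(..., compute_queue, &info) as above ...

cudaEventRecord(computation_event, compute_stream);         // mark "computation done"
cudaStreamWaitEvent(transfer_stream, computation_event, 0); // download waits for it
magma_dgetmatrix_async(n, n, d_buf[cur], n, h_chunk[i], n, transfer_queue);
// (and cudaEventDestroy(computation_event) once everything is done)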