MAGMA dgesv double memory copy


Dhruva Kulkarni

Feb 2, 2021, 6:05:27 AM
to MAGMA User
Hello,
I am new to MAGMA and look forward to learning to use it to speed up applications on GPUs. This is my first attempt at using MAGMA (and GPUs!).

Currently, I am using the dgesv_gpu call in MAGMA. It seems as if the "A" array (~8 MB) is being transferred to the device twice in one call. Is this expected behavior?

Thanks for your help and suggestions!
Regards,
Dhruva

[attachment: unnamed.png (profiler trace)]


Ahmad Abdelfattah

Feb 2, 2021, 10:52:13 AM
to Dhruva Kulkarni, MAGMA User
Hi Dhruva, 

Can you please share the code snippet for this trace? If you are using a default MAGMA tester, can you please share the command you used to run it?

MAGMA uses hybrid (CPU+GPU) algorithms for dgesv_gpu, which performs a hybrid LU factorization followed by a triangular solve. The LU factorization (magma_dgetrf_gpu) uses the CPU for the panel factorization and the GPU for the rest of the algorithmic steps. It is possible that the mem-copies you refer to are the copies of the panel as the factorization progresses. You should also see corresponding mem-copies in the other direction, from the GPU back to the CPU.
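For reference, a bare-bones call sequence looks roughly like this (an illustrative sketch, not your wrapper; names and sizes are made up and error checks are omitted). The explicit set/get-matrix calls are the full-matrix host/device copies made by the caller; the per-panel traffic happens inside the routine.

#include "magma_v2.h"

// Minimal sketch: solve A*X = B on the GPU with magma_dgesv_gpu.
// Assumes magma_init() has already been called; error checks omitted.
void solve_on_gpu( magma_int_t n, magma_int_t nrhs,
                   double *hA, magma_int_t lda,
                   double *hB, magma_int_t ldb )
{
    magma_device_t dev;
    magma_queue_t  queue;
    magma_getdevice( &dev );
    magma_queue_create( dev, &queue );

    magma_int_t ldda = magma_roundup( n, 32 );   // padded leading dimension
    double *dA, *dB;
    magma_dmalloc( &dA, (size_t)ldda * n );
    magma_dmalloc( &dB, (size_t)ldda * nrhs );

    magma_int_t *ipiv, info;
    magma_imalloc_cpu( &ipiv, n );

    // One full host-to-device copy of A (and of B) ...
    magma_dsetmatrix( n, n,    hA, lda, dA, ldda, queue );
    magma_dsetmatrix( n, nrhs, hB, ldb, dB, ldda, queue );

    // Hybrid LU factorization plus triangular solves; the small
    // panel copies between CPU and GPU happen inside this call.
    magma_dgesv_gpu( n, nrhs, dA, ldda, ipiv, dB, ldda, &info );

    // ... and one full device-to-host copy of the solution back.
    magma_dgetmatrix( n, nrhs, dB, ldda, hB, ldb, queue );

    magma_free( dA );
    magma_free( dB );
    magma_free_cpu( ipiv );
    magma_queue_destroy( queue );
}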

Thanks,
Ahmad




Dhruva Kulkarni

Feb 2, 2021, 8:34:59 PM
to Ahmad Abdelfattah, MAGMA User
Hello Ahmad,
Thank you for your reply!
I am attaching the code here; it has three different wrappers. The wrapper functions magma_dgesv_wrapper_nomemalloc and magma_dgesv_wrapper are the relevant ones. I did not see equivalent transfers (in terms of transfer size) from device to CPU. There were plenty of small transfers (<<1 MB) back and forth coinciding with the calculation. However, there was a big (~8 MB) HtoD transfer, twice. The A matrix in this case is also ~8 MB.
Please let me know if you would like to see the surrounding code as well, or need any other information.
Thanks!
Regards,
Dhruva
[attachment: cpp_magma_wrapper.cpp]

Ahmad Abdelfattah

Feb 2, 2021, 9:46:55 PM
to Dhruva Kulkarni, MAGMA User
The magma_dgesv_wrapper has the set/get matrix function calls, which copy the entire matrix to/from the GPU. I believe you will see similar copies for the other wrappers.

GPUs have their own memory space. In order to take advantage of their compute power, the input matrix has to be copied from the CPU to the GPU. There is a way to allow GPUs to access CPU memory (i.e. unified memory), but it is not as efficient. The small copies should account for the panel factorization.

If the matrix is big enough, the overhead of the CPU-GPU communication is dominated by the amount of computation. Otherwise, it might be better to factorize the matrix using the CPU. 
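To make that size trade-off concrete, here is a rough sketch (the crossover value below is made up and machine dependent; solve_on_gpu refers to the illustrative routine sketched earlier in this thread):

#include <vector>
#include "magma_v2.h"
#include "magma_lapack.h"

// From the earlier sketch: GPU solve with explicit host<->device copies.
void solve_on_gpu( magma_int_t n, magma_int_t nrhs,
                   double *hA, magma_int_t lda,
                   double *hB, magma_int_t ldb );

// Illustrative only: below some machine-dependent size, skip the GPU
// and call LAPACK on the CPU (no host<->device traffic at all).
void solve_adaptive( magma_int_t n, magma_int_t nrhs,
                     double *hA, magma_int_t lda,
                     double *hB, magma_int_t ldb )
{
    const magma_int_t crossover = 2048;   // assumption: tune for your system

    if ( n < crossover ) {
        // Small system: stay on the CPU.
        std::vector<magma_int_t> ipiv( n );
        magma_int_t info;
        lapackf77_dgesv( &n, &nrhs, hA, &lda, ipiv.data(), hB, &ldb, &info );
    }
    else {
        // Large system: the O(n^3) factorization on the GPU dominates
        // the O(n^2) cost of copying A in and the solution out.
        solve_on_gpu( n, nrhs, hA, lda, hB, ldb );
    }
}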

Ahmad




Dhruva Kulkarni

Feb 2, 2021, 10:04:16 PM
to Ahmad Abdelfattah, MAGMA User
Thanks for your reply! 

Dhruva Kulkarni

Feb 4, 2021, 1:47:03 PM
to Ahmad Abdelfattah, MAGMA User
Hi Ahmad,
Another question about the profile: I am seeing some latency between the data transfers.
Label 1: There are two data transfers (one big, and one small right after it) that are initiated by my code.
Label 2: There is a 112-byte transfer taking place at this point; I am not sure what that is.
Label 3: This is where MAGMA's own computation and data transfers begin.
Label 4: These are mallocs taking place, but not initiated by my code (the nomemalloc wrapper mentioned in the earlier email).

We are not sure what causes the latency between Label 1 and Label 3. Would this latency be observed for the batched call as well?
I am now trying to use the batched version to increase the workload, and also to minimize data transfers by keeping data on the GPU for longer. Could you please help me understand these calls, so we can plan accordingly for our code?
Thanks!
Regards,
Dhruva
  


[attachment: image.png (annotated profiler trace)]

Dhruva Kulkarni

Feb 4, 2021, 1:51:02 PM
to Ahmad Abdelfattah, MAGMA User
Sorry, I meant help understanding the observed latencies, not "calls". 

Ahmad Abdelfattah

Feb 4, 2021, 2:13:33 PM
to Dhruva Kulkarni, MAGMA User
The latencies are probably due to some initializations on the GPU side. The GESV routine internally creates CUDA streams and allocates workspaces. I'm not sure about the 112-byte transfer, but it could be part of this initialization (what is its direction, by the way? Is it host-to-device or vice versa?). When we benchmark MAGMA, we usually exclude the first run (a "warm-up" run) from the timing, so you may want to do that as well.
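Roughly, the pattern we use in the testers looks like this (an illustrative sketch only; it assumes the device matrices, host copies, pivot array, and queue already exist):

#include "magma_v2.h"

// Illustrative sketch: time magma_dgesv_gpu while excluding the first
// ("warm-up") run, which pays for stream creation, workspace allocation,
// and other one-time setup inside MAGMA.
double time_dgesv_gpu( magma_int_t n, magma_int_t nrhs,
                       double *hA, magma_int_t lda, double *dA, magma_int_t ldda,
                       double *hB, magma_int_t ldb, double *dB, magma_int_t lddb,
                       magma_int_t *ipiv, magma_queue_t queue )
{
    magma_int_t info;

    // Warm-up run: not timed.
    magma_dgesv_gpu( n, nrhs, dA, ldda, ipiv, dB, lddb, &info );

    // dgesv overwrites A (with its LU factors) and B (with the solution),
    // so restore both on the device before the timed run.
    magma_dsetmatrix( n, n,    hA, lda, dA, ldda, queue );
    magma_dsetmatrix( n, nrhs, hB, ldb, dB, lddb, queue );

    double t = magma_sync_wtime( queue );
    magma_dgesv_gpu( n, nrhs, dA, ldda, ipiv, dB, lddb, &info );
    return magma_sync_wtime( queue ) - t;
}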

An alternative to the batched routines would be to use a native factorization. The routine magma_dgetrf_native performs the factorization of one matrix without using the CPU for any compute tasks. Instead of calling magma_dgesv_gpu directly, you will have to call magma_dgetrf_native followed by magma_dgetrs_gpu (see src/dgesv_gpu.cpp and replace magma_dgetrf_gpu with magma_dgetrf_native).
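A rough sketch of that replacement, modeled on src/dgesv_gpu.cpp (dA and dB are assumed to already be on the device; ipiv is a host array of length n):

#include "magma_v2.h"

// Sketch of a GPU-only solve: native LU factorization (no CPU panel
// factorization) followed by the triangular solves.
void solve_native( magma_int_t n, magma_int_t nrhs,
                   double *dA, magma_int_t ldda,
                   double *dB, magma_int_t lddb,
                   magma_int_t *ipiv, magma_int_t *info )
{
    // LU factorization entirely on the GPU.
    magma_dgetrf_native( n, n, dA, ldda, ipiv, info );
    if ( *info == 0 ) {
        // Forward/backward solves with the LU factors and pivots.
        magma_dgetrs_gpu( MagmaNoTrans, n, nrhs, dA, ldda, ipiv, dB, lddb, info );
    }
}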

Ahmad


Dhruva Kulkarni

Feb 5, 2021, 9:13:07 PM
to Ahmad Abdelfattah, MAGMA User
Thanks. The transfer was from host to device. 
Yes, I now see the initializations. I will try out your suggestions regarding the alternative functions. Thanks!