GETRF in Hybrid mode & Native mode


aran nokan

Nov 24, 2020, 9:04:44 AM
to MAGMA User

Hi,

I have some questions about GETRF. It seems that GETRF has two modes for performing the algorithm: Hybrid mode and Native mode, the latter running completely on the GPU.

magma_int_t magma_dgetrf(
    magma_int_t m,
    magma_int_t n,
    double *A,
    magma_int_t lda,
    magma_int_t *ipiv,
    magma_int_t *info )

On the other hand, we can provide the memory for the original matrix (double *A) in two different types: device memory and pinned memory.

Does it make sense to use pinned memory for Native mode?
Which type of memory should be used in Native mode, and which type for Hybrid?
When should we use Native mode, and when Hybrid?

What about the no_pivoting version? Do we only have hybrid mode there?

Best regards,
Aran

Ahmad Abdelfattah

Nov 24, 2020, 9:29:14 AM
to aran nokan, MAGMA User
Hi Aran, 

  • magma_dgetrf is a hybrid routine. It assumes that the input matrix is in the CPU memory. Pinned memory is recommended for this routine, because it allows faster data copies across the CPU-GPU interconnect. 
  • magma_dgetrf_gpu is a hybrid routine. It assumes that the input matrix is in the GPU memory. Internally, it allocates pinned memory workspaces on the CPU. As a user, you just need to allocate the matrix on the GPU memory. 
  • magma_dgetrf_native uses the GPU only for performing the factorization. It assumes that the input matrix is in the GPU memory. You cannot pass pinned memory pointers to this routine. 
  • All three routines assume that the pivot vector is in the CPU memory. (A minimal calling sketch follows this list.)
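For illustration, a minimal C sketch of how the three variants are called and which memory each expects. This is not taken verbatim from the MAGMA testers; it assumes MAGMA 2.5 or later (for the _native variant), omits error checking and matrix initialization, and in practice you would of course call only one of the three factorizations:

    #include "magma_v2.h"

    void lu_variants( magma_int_t m, magma_int_t n )
    {
        magma_int_t lda  = m;
        magma_int_t ldda = magma_roundup( m, 32 );    /* padded leading dimension on the GPU */
        magma_int_t info, *ipiv;
        double *hA;            /* CPU copy (pinned) */
        magmaDouble_ptr dA;    /* GPU copy          */
        magma_queue_t queue;

        magma_init();
        magma_queue_create( 0, &queue );

        magma_imalloc_cpu( &ipiv, (m < n ? m : n) );  /* pivots always live on the CPU */
        magma_dmalloc_pinned( &hA, lda*n );           /* pinned CPU memory             */
        magma_dmalloc( &dA, ldda*n );                 /* device (GPU) memory           */

        /* ... fill hA, and copy it to the GPU if needed:
               magma_dsetmatrix( m, n, hA, lda, dA, ldda, queue );  ... */

        magma_dgetrf( m, n, hA, lda, ipiv, &info );          /* hybrid, matrix on the CPU */
        magma_dgetrf_gpu( m, n, dA, ldda, ipiv, &info );     /* hybrid, matrix on the GPU */
        magma_dgetrf_native( m, n, dA, ldda, ipiv, &info );  /* native, matrix on the GPU */

        magma_free_pinned( hA );
        magma_free( dA );
        magma_free_cpu( ipiv );
        magma_queue_destroy( queue );
        magma_finalize();
    }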

So, coming to your questions. 

Does it make sense to use pinned memory for Native mode? 

No. Native routines accept GPU pointers only. 

Which type of memory should be used in Native mode, and which type for Hybrid?

Native —> GPU memory
Hybrid without the “_gpu” suffix —> CPU memory, preferably pinned
Hybrid with the “_gpu” suffix —> GPU memory

When should we use Native mode, and when Hybrid?


That really depends on your system configuration and the size of your matrix. Hybrid routines perform best when you have a high-end CPU with optimized LAPACK software (e.g. a recent Intel CPU with MKL). Native routines are independent of the CPU; they don't use it for any computational workload. Native routines can perform better than hybrid routines on small matrices, regardless of the system configuration. 

What about the no_pivoting version? Do we only have hybrid mode there? 

There is currently no native mode for non-pivoting LU. 

Ahmad


aran nokan

Nov 24, 2020, 10:34:57 AM
to Ahmad Abdelfattah, MAGMA User
Awesome, thanks!

magma_dgetrf is a hybrid routine. It assumes that the input matrix is in the CPU memory. Pinned memory is recommended for this routine, because it allows faster data copies across the CPU-GPU interconnect.

So what is the difference between magma_dgetrf and magma_dgetrf_gpu? As I understand it, for magma_dgetrf the matrix is initially allocated in CPU memory and the routine then allocates another copy on the GPU, while magma_dgetrf_gpu starts with the matrix in GPU memory and then also allocates a CPU copy. Am I wrong?

No. Native routines accept GPU pointers only. 

Actually, I allocated pinned memory and the input was accepted, but the execution time was higher. Also, if I use pinned memory for hybrid mode, the time is the same. What is going on here (why is the pinned memory accepted)?

That really depends on your system configuration and the size of your matrix. Hybrid routines perform best when you have a high-end CPU with optimized LAPACK software (e.g. a recent Intel CPU with MKL). Native routines are independent of the CPU; they don't use it for any computational workload. Native routines can perform better than hybrid routines on small matrices, regardless of the system configuration.

Can I ask what counts as small here? Is it related to the GPU? For P100 and Volta, which dimensions count as small? At around 20k, Native seems to be better.

Last question. Is working with MAGMA hard or am I a dummy?!

Ahmad Abdelfattah

Nov 24, 2020, 10:56:33 AM
to aran nokan, MAGMA User
On Nov 24, 2020, at 10:34 AM, aran nokan <noka...@gmail.com> wrote:

Awesome, thanks!

magma_dgetrf is a hybrid routine. It assumes that the input matrix is in the CPU memory. Pinned memory is recommended for this routine, because it allows faster data copies across the CPU-GPU interconnect.

So what is the difference between magma_dgetrf and magma_dgetrf_gpu? As I understand it, for magma_dgetrf the matrix is initially allocated in CPU memory and the routine then allocates another copy on the GPU, while magma_dgetrf_gpu starts with the matrix in GPU memory and then also allocates a CPU copy. Am I wrong?


You are not wrong. The main difference is where the matrix sits initially. Notice that for the “_gpu” variant, the routine allocates a small portion of pinned memory to perform the panel factorization. The matrix is never entirely copied to the CPU. 

No. Native routines accept GPU pointers only. 

Actually, I allocated pinned memory and the input was accepted, but the execution time was higher. Also, if I use pinned memory for hybrid mode, the time is the same. What is going on here (why is the pinned memory accepted)?


I think right now pinned memory on the CPU can be accepted in GPU kernels. It is not the way the routines are assumed to be used, since the data will move anyway across the interconnect, leading to slower performance. 
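If you want to double-check what kind of pointer you are passing, the CUDA runtime (not MAGMA itself) can tell you. A rough sketch, assuming CUDA 10 or newer for the attr.type field (older toolkits expose attr.memoryType instead):

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* Print where a pointer lives: device, pinned host, managed, or pageable host. */
    void describe_pointer( const void *ptr )
    {
        cudaPointerAttributes attr;
        if (cudaPointerGetAttributes( &attr, ptr ) != cudaSuccess) {
            cudaGetLastError();   /* clear the sticky error (older CUDA returns an error here) */
            printf( "pageable (unregistered) host memory\n" );
            return;
        }
        switch (attr.type) {
            case cudaMemoryTypeDevice:  printf( "device memory\n" );       break;
            case cudaMemoryTypeHost:    printf( "pinned host memory\n" );  break;
            case cudaMemoryTypeManaged: printf( "managed memory\n" );      break;
            default:                    printf( "pageable (unregistered) host memory\n" ); break;
        }
    }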

That really depends on your system configuration and the size of your matrix. Hybrid routines perform best when you have a high-end CPU with optimized LAPACK software (e.g. a recent Intel CPU with MKL). Native routines are independent of the CPU; they don't use it for any computational workload. Native routines can perform better than hybrid routines on small matrices, regardless of the system configuration.

Can I ask what counts as small here? Is it related to the GPU? For P100 and Volta, which dimensions count as small? At around 20k, Native seems to be better.


There are a lot of parameters that play into specifying what small is. For a hybrid “_gpu” routine:
  • On the CPU, there is a data copy (D2H) + panel factorization + data copy back (H2D)
  • Meanwhile, on the GPU, a matrix multiply (GEMM) is invoked to perform a rank-k update
  • If the GEMM time is equal to (or larger than) the time it takes to perform the panel and the two data copies —> the hybrid routine will be faster than the native routine. 

So I don’t have a definitive answer to how “small” translates into numbers. Here is an old graph on the V100 using CUDA 9.0. The CPU is a 20-core Haswell processor. MAGMA is configured with MKL. As you can see, hybrid is faster for matrices larger than 15k. If it were the same GPU with a faster CPU (e.g. Skylake), the intersection point would be at a smaller size. 
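To make the overlap argument above concrete, here is a purely illustrative back-of-the-envelope check for one step of the blocked factorization. The flop counts are rough and the rates (CPU GFLOP/s, GPU GFLOP/s, link bandwidth) are placeholders you would have to measure on your own system, not numbers from MAGMA:

    /* Illustrative only: does the GPU rank-nb update take long enough to hide
       the CPU panel factorization plus the two panel copies?                   */
    int hybrid_likely_faster( double n, double nb,
                              double cpu_gflops,   /* CPU panel factorization rate */
                              double gpu_gflops,   /* GPU GEMM rate                */
                              double link_GBps )   /* CPU<->GPU bandwidth          */
    {
        double panel_flops = n * nb * nb;           /* ~ order of the panel work         */
        double gemm_flops  = 2.0 * n * n * nb;      /* rank-nb update of trailing matrix */
        double copy_bytes  = 2.0 * n * nb * 8.0;    /* panel down + back, in doubles     */

        double t_cpu  = panel_flops / (cpu_gflops * 1e9)
                      + copy_bytes  / (link_GBps  * 1e9);
        double t_gemm = gemm_flops  / (gpu_gflops * 1e9);

        return t_gemm >= t_cpu;   /* GEMM long enough to hide the CPU work -> hybrid wins */
    }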

dgetrf_v100.pdf
dgetrf_summitdev.pdf

aran nokan

Nov 24, 2020, 11:45:36 AM
to Ahmad Abdelfattah, MAGMA User
Cool, thanks! 


No. Native routines accept GPU pointers only. 

Actually, I allocated pinned memory and the input was accepted, but the execution time was higher. Also, if I use pinned memory for hybrid mode, the time is the same. What is going on here (why is the pinned memory accepted)?


I think right now pinned memory on the CPU can be accepted in GPU kernels. It is not the way the routines are assumed to be used, since the data will move anyway across the interconnect, leading to slower performance.

Do we have any dedicated documentation about pinned memory in MAGMA? Is it the same as what we have in CUDA? (I think it is better for me to read more about memory in MAGMA & CUDA)

How can I measure the performance here? I saw a flops header file in the testing folder. Should I use this header for testing? Do we have any documentation about performance measurement in MAGMA? I did not find anything in the Doxygen documentation.




Another graph below shows that Native seems to be always better. This is because the CPU is different (IBM POWER8), so we cannot use MKL. Due to the lack of an optimized factorization on the CPU, the hybrid routine suffers a big performance drop. The intersection point is beyond 40k. 



Last question. Is working with MAGMA hard or am I a dummy?!


If you are new to MAGMA, give it more time and you will like it :-)

Ahmad




Ahmad Abdelfattah

Nov 24, 2020, 1:56:57 PM
to aran nokan, MAGMA User

The pinned memory behavior should be identical to CUDA. 

For measuring the performance, see the example under testing/testing_dgetrf_gpu.cpp.
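As a rough illustration of what that tester does (a simplified sketch, assuming the FLOPS_DGETRF macro from testing/flops.h and an already-created queue; see testing_dgetrf_gpu.cpp for the full version):

    #include <stdio.h>
    #include "magma_v2.h"
    #include "flops.h"    /* FLOPS_DGETRF, from the MAGMA testing directory */

    double time_dgetrf_gpu( magma_int_t m, magma_int_t n,
                            magmaDouble_ptr dA, magma_int_t ldda,
                            magma_int_t *ipiv, magma_queue_t queue )
    {
        magma_int_t info;
        double t = magma_sync_wtime( queue );      /* sync the queue, then read the clock */
        magma_dgetrf_gpu( m, n, dA, ldda, ipiv, &info );
        t = magma_sync_wtime( queue ) - t;         /* elapsed seconds */

        double gflops = FLOPS_DGETRF( m, n ) / 1e9 / t;
        printf( "dgetrf_gpu: %.4f sec, %.2f GFLOP/s (info = %lld)\n",
                t, gflops, (long long) info );
        return t;
    }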

Ahmad