Memory access on GPU (device) to avoid host-device transfer


Danesh Daroui

Oct 11, 2024, 1:53:12 AM
to MAGMA User
Hi all,
I am running MAGMA on an Ada RTX 4000, and for single-precision complex systems I get around a 2.5x speedup when GPU offload is used, compared to the CPU-only solution.

Now I am wondering whether the process can be optimised by allocating the memory for the coefficient matrix once on the GPU device and filling and updating it directly there, instead of performing all operations in host memory and then transferring the matrix to the GPU right before calling MAGMA to solve the equations.

In the testing directory, the _gpu examples allocate memory on the device, but then transfer data from host to device, which might degrade performance. Is there any way to avoid this transfer and perform all operations on the coefficient matrix directly on the device?
Regards,
Danesh

Mark Gates

Oct 11, 2024, 1:59:43 AM
to Danesh Daroui, MAGMA User
Hi Danesh,

Yes, it's possible to achieve better performance by keeping the matrix on the GPU. Some of MAGMA's routines are hybrid (CPU + GPU) while others are GPU-native. Which routines are you using?

The _gpu routine testers probably generate the matrix on the CPU, then transfer it to the GPU. But they time only the operation after transferring the matrix to the GPU, so the fact that it was generated on the CPU is largely irrelevant (other than consuming some CPU memory). If your application can generate the matrix on the GPU, that is probably better than generating it on the CPU and transferring.
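For illustration only, a minimal sketch of generating the matrix directly on the device with a custom CUDA kernel; the column-major layout with leading dimension ldda matches MAGMA's convention, but the identity values written here are just a placeholder for the application's own formula:

#include <cuComplex.h>

// Fill an n x n column-major matrix in place on the device,
// so no host-to-device copy is needed.
__global__ void fill_matrix( int n, cuFloatComplex* dA, int ldda )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column
    if (i < n && j < n) {
        dA[ i + (size_t) j * ldda ]
            = make_cuFloatComplex( i == j ? 1.0f : 0.0f, 0.0f );
    }
}

// launch:
//   dim3 threads( 16, 16 );
//   dim3 blocks( (n + 15)/16, (n + 15)/16 );
//   fill_matrix<<< blocks, threads >>>( n, dA, ldda );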

Mark

Interim Director, Innovative Computing Laboratory (ICL)
Research Assistant Professor, University of Tennessee, Knoxville

Danesh Daroui

Oct 13, 2024, 3:07:04 PM
to Mark Gates, MAGMA User
Hi Mark,

Thanks for your response. No, I didn't mean that. As I measured, allocating data on the device and transferring to it is quite fast, so I don't see much improvement from keeping the data on the device rather than transferring it from the host at each iteration when the MAGMA LAPACK routines are called. What I meant was to allocate data on the device and then access and modify individual cells directly there. We are working on a frequency-domain EM solver where only parts of the coefficient matrix need to be updated as the frequency changes in each iteration, so it is unnecessary to transfer the whole coefficient matrix from the host to the device each time the equation is solved. But like I said, the transfer time is negligible, so we can live with that.
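One way to transfer only the changed entries, without a custom kernel: magma_csetmatrix takes leading dimensions and therefore also works on sub-blocks. A sketch, where the offsets i0, j0 and the block size mb x nb are hypothetical placeholders:

// Update only an mb x nb sub-block starting at (i0, j0), instead of
// re-sending the whole coefficient matrix at each frequency step.
// h_A has leading dimension lda on the host; d_A has ldda on the device.
magma_csetmatrix( mb, nb,
                  h_A + i0 + (size_t) j0 * lda,  lda,
                  d_A + i0 + (size_t) j0 * ldda, ldda,
                  queue );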
 
There are some more issues, so I will write them here if that is OK.

1. magma_xsetmatrix always crashes in my case, even though I believe I have set all parameters correctly and allocated the needed memory in advance. This is how I call it:

magma_queue_t q;
magma_int_t ldda = magma_roundup(nCells, 32);            // pad leading dimension to a multiple of 32
cuFloatComplex* h_A;
h_A = reinterpret_cast<cuFloatComplex*>(MKL_RL);         // MKL_RL: existing host-side buffer
magma_csetmatrix(nCells, nCells, h_A, nCells, d_A, ldda, q);  // host lda = nCells

2. MAGMA's xgetri for matrix inversion exists only for the case where the matrix is in device memory, not when it is in host memory. For this, I need to call magma_xgetmatrix, which crashes all the time. Is there any specific reason for that? It would be more convenient to call MAGMA routines while the matrix is in host memory and let MAGMA do the job.

3. Whenever MAGMA routines are running, I see that most of the CPU cores are idle most of the time. I am running on a machine with 24 cores and 32 threads, and when MAGMA is running, only 1-2 cores are busy. I am not sure whether batched routines would split the load over host and device, because as far as I remember, batched routines are optimized for many small matrices. I think MAGMA already uses blocked LU factorization to achieve parallelism, but is it limited to device memory only, and is it implemented for all xgetrf routines? Or are there routines that use both the GPU and CPU at the same time (tiling, I guess)?

4. The last issue is that, according to my tests, MAGMA is ~4 times slower than CUDA LAPACK. Is this already known to the MAGMA team?

Regards,

Dan



Mark Gates

Oct 14, 2024, 2:42:10 PM
to Danesh Daroui, MAGMA User
[Sending again; forgot to Reply all to the MAGMA list.]

On Sun, Oct 13, 2024 at 11:19 AM Danesh Daroui <danesh...@gmail.com> wrote:
Hi Mark,

Thanks for your response. No, I didn't mean that. As I measured, allocating data on the device and transferring to it is quite fast, so I don't see much improvement from keeping the data on the device rather than transferring it from the host at each iteration when the MAGMA LAPACK routines are called. What I meant was to allocate data on the device and then access and modify individual cells directly there. We are working on a frequency-domain EM solver where only parts of the coefficient matrix need to be updated as the frequency changes in each iteration, so it is unnecessary to transfer the whole coefficient matrix from the host to the device each time the equation is solved. But like I said, the transfer time is negligible, so we can live with that.
 
There are some more issues, so I will write them here if that is OK.

1. magma_xsetmatrix always crashes in my case, even though I believe I have set all parameters correctly and allocated the needed memory in advance. This is how I call it:

magma_queue_t q;

Did you create (and later destroy) the queue?
magma_queue_create( device, &q );
MAGMA's API is C-based; there's no C++ constructor invoked automatically.
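For reference, a minimal sketch of the usual setup and teardown around a queue:

magma_init();                          // once per application
magma_device_t device;
magma_getdevice( &device );            // the currently active device
magma_queue_t queue;
magma_queue_create( device, &queue );  // note the address-of
// ... magma_csetmatrix( ..., queue ); and other calls ...
magma_queue_destroy( queue );
magma_finalize();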
 
magma_int_t ldda = magma_roundup(nCells, 32);            // pad leading dimension to a multiple of 32
cuFloatComplex* h_A;
h_A = reinterpret_cast<cuFloatComplex*>(MKL_RL);         // MKL_RL: existing host-side buffer
magma_csetmatrix(nCells, nCells, h_A, nCells, d_A, ldda, q);  // host lda = nCells

You don't show where you allocate h_A and d_A. From the call, it looks like h_A is allocated as an nCells*nCells array, and d_A is allocated on the GPU device as an nCells*ldda array. Can you confirm that?
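For comparison, a sketch of allocations that would be consistent with that call, using MAGMA's allocators (sizes are in elements, not bytes):

magmaFloatComplex* h_A;
magmaFloatComplex_ptr d_A;
magma_cmalloc_cpu( &h_A, (size_t) nCells * nCells );  // host, lda = nCells
                                                      // (magma_cmalloc_pinned enables faster transfers)
magma_cmalloc( &d_A, (size_t) ldda * nCells );        // device, padded leading dimension ldda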

 
2. MAGMA's xgetri for matrix inversion is only for the case when the matrix is in the device's memory and not when the matrix is in the host's memory. For this, I need to call magma_xgetmatrix which crashes all the time. Is there any specific reason for that? It is more convenient to call MAGMA routines when the matrix is in the host's memory and let MAGMA do the job.

There's no particular reason that we have only the getri_gpu() GPU interface and not the getri() CPU interface; we just never got around to implementing the getri() interface.

However, while there are legitimate uses of inverses, it's generally encouraged to solve a system (Ax = b) using gesv, rather than inverting and multiplying (x = A^{-1} b) using getri and gemm. gesv is both faster and more accurate.
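For concreteness, a sketch of the single-call solve on device data; d_B (n x nrhs, leading dimension lddb) holding the right-hand sides is a hypothetical name here:

magma_int_t *ipiv, info;
magma_imalloc_cpu( &ipiv, n );  // pivot indices live on the host
magma_cgesv_gpu( n, nrhs, d_A, ldda, ipiv, d_B, lddb, &info );
// on return, d_B holds the solution X; info != 0 signals an error
magma_free_cpu( ipiv );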

 
3. Whenever MAGMA routines are running, I see that most of the CPU cores are idle most of the time. I am running on a machine with 24 cores and 32 threads, and when MAGMA is running, only 1-2 cores are busy. I am not sure whether batched routines would split the load over host and device, because as far as I remember, batched routines are optimized for many small matrices. I think MAGMA already uses blocked LU factorization to achieve parallelism, but is it limited to device memory only, and is it implemented for all xgetrf routines? Or are there routines that use both the GPU and CPU at the same time (tiling, I guess)?

The batch routines run completely on the GPU. The CPU only queues tasks.

Some MAGMA routines are hybrid, CPU and GPU. For instance, magma_xgetrf() and magma_xgetrf_gpu() are both hybrid. However, the trend has been that hybrid routines are not as efficient as GPU-only routines for high-performance GPUs. Compare the peak performance (flop/s) of your CPU and GPU; often the CPU is a very small percent of the GPU's performance, in which case it doesn't make sense to use the CPU.
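Recent MAGMA releases also provide GPU-only "native" variants of some factorizations. A sketch of the two LU variants on device data (check that your MAGMA version includes the native one):

// Hybrid: the CPU factors panels while the GPU applies the updates.
magma_cgetrf_gpu( n, n, d_A, ldda, ipiv, &info );

// GPU-only variant; often preferable when the CPU is much slower than the GPU.
magma_cgetrf_native( n, n, d_A, ldda, ipiv, &info );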

Also, on the CPU side you may need to adjust the number of CPU threads, e.g., setting OMP_NUM_THREADS or MKL_NUM_THREADS.

 
4. The last issue is that, according to my tests, MAGMA is ~4 times slower than CUDA LAPACK. Is this already known to the MAGMA team?

Can you be more specific about what routines and sizes you are testing?
Also, what is CUDA LAPACK? Do you mean cuSolver?

Mark

Mark Gates

Oct 14, 2024, 3:02:55 PM
to Danesh Daroui, MAGMA User
Hi Danesh,

On Mon, Oct 14, 2024 at 3:53 AM Danesh Daroui <danesh...@gmail.com> wrote:
Hi Mark,

Yes, you are right about the memory allocation. I had apparently missed creating the queue. Now the MAGMA code seems to work fine, but since later in the code I call cuSolver routines to solve the equation, there is a conflict and cuSolver fails to allocate memory on the device. As far as I can tell, MAGMA, once initialized, cannot be used together with cuSolver, but I might be wrong.
I agree that matrix inversion should generally be avoided, but what I am doing is preparing the coefficient matrix, not inverting it and then multiplying with the right-hand side to solve the equation.

I don't think there should be any problem with using both MAGMA and cuSolver in the same app. Is the GPU memory being exhausted?
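One quick way to check is to query the device right before the failing cuSolver allocation; a sketch (needs <cuda_runtime.h> and <cstdio>):

size_t free_bytes, total_bytes;
cudaMemGetInfo( &free_bytes, &total_bytes );
printf( "GPU memory: %.2f GiB free of %.2f GiB\n",
        free_bytes  / 1073741824.0,
        total_bytes / 1073741824.0 );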

 
As far as I can see on my system with 24 cores and an Ada RTX 4000 NVIDIA graphics card, the flop rate is comparable between CPU and GPU, but I think when you use several GPUs it doesn't make sense to use the CPUs, and the synchronization, communication, data transfer, and coordination might even degrade performance. One question I have, then: why would one consider using MAGMA over cuSolver?

MAGMA has broader coverage than cuSolver, with many routines that cuSolver doesn't provide.
MAGMA is portable across CUDA and HIP/ROCm, with Intel SYCL support coming.
For some routines, MAGMA is faster than cuSolver. However, this is difficult to maintain since improvements in MAGMA will sometimes be incorporated into cuSolver and other vendor libraries, and NVIDIA has more resources than we do.

Another option for portability is the BLAS++ and LAPACK++ libraries, which wrap CPU and GPU BLAS (cuBLAS, rocBLAS, oneMKL on GPU) and LAPACK (cuSolver, rocSolver, oneMKL on GPU).
Their GPU coverage is a bit limited, but let us know if specific routines are desired.
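For a flavor of the API, a sketch of a single-precision complex solve through LAPACK++ (signature as documented in lapackpp; verify against your installed version):

#include <lapack.hh>
#include <complex>
#include <vector>

std::vector< std::complex<float> > A( n * n ), B( n * nrhs );
std::vector<int64_t> ipiv( n );
// ... fill A (column-major, lda = n) and B ...
int64_t info = lapack::gesv( n, nrhs, A.data(), n, ipiv.data(),
                             B.data(), n );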

 
My main motivation was (as I had seen in MAGMA's description) that MAGMA shows better performance because it uses the power of both CPUs and GPUs. If this is not the case, then the only benefit MAGMA would have over cuSolver is that it is very easy to use, since many operations like memory allocation and data transfer are abstracted away in MAGMA.

This is one of the simulations I have done when I use MAGMA in my code:

Coefficient matrix: 26074 x 26074
Memory: 19.2419 GB
MNA
Freq. domain
Freq. steps: 10 - 1200 MHz, using 120 steps

CPU: 01:19:56
GPU: 00:32:26

Speedup factor: ~2.5x

and results when I use cuSolver:

CPU: 01:19:00
GPU: 00:09:11

Speedup factor: ~8.7x

Which specific routines are being called here? There may be different variants or options to optimize MAGMA's performance. But it's not surprising that cuSolver is faster than MAGMA for some common operations.

Mark
