MAGMA is slow - CPU utilization is low


Danesh Daroui

Jul 8, 2024, 2:33:52 PM
to MAGMA User
Hi all,

I am building MAGMA using the make.inc file (not the CMake script) for MKL, the Intel compilers, and ILP64. I have updated make.inc to use the correct compilers (icc is deprecated; icx and icpx should be used now) and MAGMA builds correctly. I have tested MAGMA with fairly large matrices, i.e., a 16k x 16k coefficient matrix, on a rather old machine with 16 GB RAM, an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (8 cores), and a Maxwell NVIDIA graphics card. The problem is that MAGMA is considerably slower than my MKL code. When I use MAGMA to solve a problem, I see that the GPU is 100% utilized while CPU utilization is very low; most of the time only one core is used. In my MKL code, almost all 8 cores are busy most of the time, and the problem is solved ~8x faster than with MAGMA.

I have updated make.inc to compile the code with OpenMP directives enabled, as below:

CC        = icx
CXX       = icpx
NVCC      = nvcc
FORT      = ifx

ARCH      = ar
ARCHFLAGS = cr
RANLIB    = ranlib


# --------------------
# flags

# Use -fPIC to make shared (.so) and static (.a) library;
# can be commented out if making only static library.
FPIC      = -fPIC

CFLAGS    = -O3 $(FPIC) -qopenmp -DNDEBUG -DADD_ -Wall -Wshadow -DMAGMA_WITH_MKL
FFLAGS    = -O3 $(FPIC) -qopenmp -DNDEBUG -DADD_ -warn all -warn nounused -nogen-interfaces
F90FLAGS  = -O3 $(FPIC) -qopenmp -DNDEBUG -DADD_ -warn all -warn nounused
NVCCFLAGS = -O3                  -DNDEBUG -DADD_ -Xcompiler "$(FPIC) -Wall -Wno-unused-function -fopenmp" -std=c++11
LDFLAGS   =     $(FPIC) -qopenmp

I think the problem I use to benchmark MAGMA is large enough to stress the GPU and hide the latency of data transfers to and from the GPU.

Is there any reason MAGMA would be this slow? Does anybody know how I can improve the performance when MAGMA is used?

In my MKL code, I use zgetrf and zgetrs to factorize and solve the system, while in MAGMA I use magma_zgesv. The results are exactly the same, so accuracy is preserved when MAGMA is used.

Regards,

Dan

Ahmad Abdelfattah

Jul 8, 2024, 2:43:42 PM
to Danesh Daroui, MAGMA User
Hi Dan, 

As far as I remember, Maxwell GPUs provide only minimal FP64 hardware; double precision runs at a small fraction of the FP32 rate and is considerably slower on these GPUs. Maybe this is why you observe such slow performance.

Here is a sample from a more recent machine (Skylake CPU and an Ampere GPU):

./testing_zgesv -c -l -N 16000 --niter 5

% MAGMA 2.8.0 svn 32-bit magma_int_t, 64-bit pointer.
% Compiled with CUDA support for 7.0
% CUDA runtime 12010, driver 12030. OpenMP threads 72. MKL 2023.0.2, MKL threads 36.
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40338.3 MiB memory, capability 8.0
% Mon Jul  8 18:37:03 2024
% Usage: ./testing_zgesv [options] [-h|--help]
% ngpu 1
%   N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - AX|| / N*||A||*||X||  ||B - AX|| / N*||A||*||X||_CPU
%================================================================================================================
16000     1   1245.34 (   8.77)   3180.01 (   3.44)   5.87e-24   ok               1.21e-23   ok
16000     1   1553.29 (   7.03)   3281.26 (   3.33)   5.11e-19   ok               4.37e-19   ok
16000     1   1415.47 (   7.72)   3470.82 (   3.15)   5.13e-19   ok               4.51e-19   ok
16000     1   1411.46 (   7.74)   3489.04 (   3.13)   4.38e-19   ok               3.90e-19   ok
16000     1   1660.47 (   6.58)   3484.29 (   3.14)   4.99e-19   ok               4.35e-19   ok


Note that magma_zgesv_gpu is generally faster than magma_zgesv because it assumes that the matrix is entirely in GPU memory.

./testing_zgesv_gpu -c -l -N 16000 --niter 5

% MAGMA 2.8.0 svn 32-bit magma_int_t, 64-bit pointer.
% Compiled with CUDA support for 7.0
% CUDA runtime 12010, driver 12030. OpenMP threads 72. MKL 2023.0.2, MKL threads 36.
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40338.3 MiB memory, capability 8.0
% Mon Jul  8 18:39:00 2024
% Usage: ./testing_zgesv_gpu [options] [-h|--help]
%   N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - AX|| / N*||A||*||X||
%===============================================================================
16000     1   1305.78 (   8.37)   6431.12 (   1.70)   5.87e-24   ok
16000     1   1602.54 (   6.82)   6482.08 (   1.69)   5.11e-19   ok
16000     1   1602.09 (   6.82)   6490.74 (   1.68)   5.13e-19   ok
16000     1   1599.98 (   6.83)   6495.13 (   1.68)   4.38e-19   ok
16000     1   1600.33 (   6.83)   6492.85 (   1.68)   4.99e-19   ok

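For illustration, a minimal sketch of that resident-data pattern (standard magma_v2 API; matrix fill, pivot checks, and most error handling abbreviated; the size matches the runs above). This needs a CUDA-capable GPU and a MAGMA build to run:

```c
#include <stdio.h>
#include "magma_v2.h"

/* Sketch: upload once, then factor and solve entirely in GPU memory. */
int main(void) {
    magma_init();
    magma_int_t n = 16000, nrhs = 1, lda = n, info = 0;
    magma_queue_t queue;
    magma_queue_create(0, &queue);

    magmaDoubleComplex *hA, *hB, *dA, *dB;
    magma_int_t *ipiv;
    magma_zmalloc_cpu(&hA, (size_t)lda * n);
    magma_zmalloc_cpu(&hB, (size_t)lda * nrhs);
    magma_imalloc_cpu(&ipiv, n);

    /* magma_zmalloc fails (returns != MAGMA_SUCCESS) if the GPU is out of
     * memory -- it does not spill over into host RAM. */
    if (magma_zmalloc(&dA, (size_t)lda * n) != MAGMA_SUCCESS ||
        magma_zmalloc(&dB, (size_t)lda * nrhs) != MAGMA_SUCCESS) {
        fprintf(stderr, "GPU allocation failed; consider magma_zgetrf_m\n");
        return 1;
    }

    /* ... fill hA and hB here ... */

    magma_zsetmatrix(n, n,    hA, lda, dA, lda, queue);  /* single upload */
    magma_zsetmatrix(n, nrhs, hB, lda, dB, lda, queue);
    magma_zgesv_gpu(n, nrhs, dA, lda, ipiv, dB, lda, &info);
    magma_zgetmatrix(n, nrhs, dB, lda, hB, lda, queue);  /* solution only */

    magma_free(dA); magma_free(dB);
    magma_free_cpu(hA); magma_free_cpu(hB); magma_free_cpu(ipiv);
    magma_queue_destroy(queue);
    magma_finalize();
    return 0;
}
```

The key point is that only the solution vector comes back over PCIe; the O(n^3) factorization never touches the host.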


Thanks,
Ahmad



Ahmad Abdelfattah

Jul 8, 2024, 3:32:33 PM
to Danesh Daroui, MAGMA User
magma_zmalloc will fail if you try to allocate more memory than the GPU has; it does not fall back to host RAM. Most Tesla GPUs, however, are now equipped with quite large memories.

There are some routines in MAGMA that can work with matrices that fit in RAM but not in GPU memory, such as zgetrf_m, but we do not have comprehensive support for such a mode of operation.

As for the low CPU utilization: MAGMA uses the CPU for tasks such as the panel factorization while the GPU performs the trailing matrix updates. Ideally, the CPU activity should overlap with the GPU so that the GPU is never idle, though the CPU may be idle during parts of the computation. In your case, however, my guess is that the GPU is so slow that it becomes the bottleneck, and the CPU spends most of its time waiting for the next panel.

Aside from that, you should make sure you are linking against the multithreaded MKL libraries, and that the number of MKL threads matches the number of cores (MAGMA usually takes care of that).
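One way to sanity-check both points (the environment variables are Intel MKL's documented controls; the tester path is just an example and assumes a dynamically linked build):

```shell
# Pin MKL/OpenMP to the physical core count before running a MAGMA tester.
export MKL_NUM_THREADS=8
export OMP_NUM_THREADS=8

# If MKL is linked dynamically, the dependency list should show the
# threaded layer, not the sequential one:
ldd testing/testing_zgesv | grep -i mkl   # expect mkl_intel_thread, not mkl_sequential
```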

Thanks,
Ahmad  

On Jul 8, 2024, at 3:02 PM, Danesh Daroui <danesh...@gmail.com> wrote:

Hi Ahmad,

Thanks for your response. I will test my code on a machine equipped with a Tesla GPU as well to see if I get better results. I can use magma_zgesv_gpu, but the problem is that in many cases the matrices are so large that they probably won't fit into the GPU's memory. If I use magma_zmalloc to allocate more memory than the GPU has, will the routine allocate the rest in RAM instead, or will the allocation simply fail? May I also ask how MAGMA's memory management works when the data in RAM is larger than the GPU's internal memory? I thought MAGMA would internally transfer the parts that fit into GPU memory and then perform the factorization and Gaussian elimination in GPU memory and RAM in parallel to achieve the best performance.

Also, do you know why MAGMA's CPU utilization is so low? If MAGMA uses MKL for its internal operations, then it should automatically use all available cores. MKL is very good at that: I see it use all cores when I solve equations with MKL alone. Am I building MAGMA correctly to use MKL with the make.inc file that I pasted?

Regards,
Dan

Andrew Cunningham

Jul 8, 2024, 11:23:44 PM
to Danesh Daroui, MAGMA User
Hi Danesh,
As noted, double precision is painfully slow on that GPU. Try switching to single precision.
Saying "MAGMA is slow" (or fast) is sort of meaningless. It's a generalization, but MAGMA is a convenient and relatively thin interface over the underlying (in this case) CUDA routines.
I strongly suggest building MAGMA in debug mode; it is very instructive to step through the source code.


Andrew


Danesh Daroui

Jul 15, 2024, 12:28:19 PM
to MAGMA User, Andrew Cunningham, MAGMA User, Danesh Daroui
Hi Andrew,
Yes, I agree that the speed and efficiency of a library depend on many parameters. I tested my code on a cluster with 2 x NVIDIA Tesla V100 SXM2 GPUs (32 GB each), 32 CPU cores, and ~100 GB of RAM, and the results are not promising at all. The version that uses *only* MKL is around 16x faster than when I use MAGMA. I tried increasing the size of the problem and the results were much the same. I am wondering (as you also pointed out) about switching from complex<double> to complex<float>. Would it be because GPUs are generally *not* optimized to carry out double-precision operations? I know they can, but this is about efficiency. I also have another version of my solver which uses CUDA's native LAPACK and BLAS routines, and that was also slower than the CPU.

Because a job submission system is used on the cluster, I cannot continuously monitor CPU and GPU usage, but on my own system, which uses a Maxwell GPU, I could see that the GPU was 100% in use (almost 100% most of the time) while the solver was running, but CPU usage was very low. Ahmad said that this can be due to the low double-precision performance of Maxwell GPUs, but I get the same results on the Tesla as well.

Another possibility is that I am not using MAGMA in the best way. Since I need to update the coefficient matrix at each iteration of the solution, the whole matrix is updated in RAM and transferred to the GPU every time. This might be the reason. Can I allocate the memory on the GPU only once, and update the coefficient matrix there, to avoid the RAM-to-GPU transfer at each iteration?

Another approach I am considering is an OpenMP-based manager-worker strategy: a scheduler dispatches two workers, one on the CPU using MKL and another on the GPU using MAGMA, and each worker picks up a frequency to solve independently and in parallel. But I am not sure this helps if MAGMA itself will use the majority of the CPU cores. Any advice is highly appreciated.
Regards,
Dan

Andrew Cunningham

Jul 15, 2024, 12:28:36 PM
to Danesh Daroui, MAGMA User
Hi Dan,

I suggest you go back and run the MAGMA benchmarks on your machine -
for example, run the sgemm and dgemm testers (and cgemm and zgemm).
The benchmarks will show performance vs MKL: you should see that
s/cgemm running on the GPU is faster than MKL for "large" N, while
z/dgemm is slower. That gives you a baseline of what to expect in the
best-case scenario.

You want to avoid small transfers to the GPU. For example, multiplying
single 4x4 matrices on the GPU vs MKL is probably pointless (unless
it is a 'batch' operation). Multiplying 4000 x 4000 matrices on the
GPU (given enough memory) makes sense, as the transfer cost grows only
with the data size (n^2) while computation time grows as n^3. I don't
know for sure, but it seems like your code is dominated by CPU<->GPU
communication.

I had great success using MAGMA on a Quadro GPU: I did a single-precision
LDL decomposition, left the factors on the GPU, and then solved with
multiple right-hand sides.

Essentially your question is about efficiently using GPUs with CUDA
and cuBLAS. Interleaving GPU computation with CPU computation is
certainly going to bring gains.

Andrew

Danesh Daroui

Jul 16, 2024, 7:07:37 AM
to MAGMA User, Andrew Cunningham, MAGMA User, Danesh Daroui
Hi Andrew,
Thanks for your response. I executed testing_sgemm and here are the results:

% MAGMA 2.8.0  64-bit magma_int_t, 64-bit pointer.
% Compiled with CUDA support for 5.0
% CUDA runtime 12040, driver 12050. OpenMP threads 8. MKL 2024.0.2, MKL threads 4.
% device 0: NVIDIA GeForce GTX 750 Ti, 1084.5 MHz clock, 1993.3 MiB memory, capability 5.0
% Tue Jul 16 12:06:51 2024
% Usage: ./testing_sgemm [options] [-h|--help]

% If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
% relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.

% transA = No transpose, transB = No transpose
%   M     N     K   MAGMA Gflop/s (ms)  cuBLAS Gflop/s (ms)   CPU Gflop/s (ms)  MAGMA error  cuBLAS error
%========================================================================================================
 1088  1088  1088    574.58 (   4.48)     797.92 (   3.23)     ---   (  ---  )    1.10e-09        ---    ok
 2112  2112  2112    729.83 (  25.82)    1012.22 (  18.61)     ---   (  ---  )    8.13e-10        ---    ok
 3136  3136  3136    725.52 (  85.02)    1098.68 (  56.14)     ---   (  ---  )    4.49e-10        ---    ok
 4160  4160  4160    784.29 ( 183.58)    1161.61 ( 123.95)     ---   (  ---  )    5.88e-10        ---    ok
 5184  5184  5184    812.80 ( 342.80)    1135.90 ( 245.29)     ---   (  ---  )    4.23e-10        ---    ok
 6208  6208  6208    813.13 ( 588.47)    1160.52 ( 412.32)     ---   (  ---  )    3.23e-10        ---    ok
 7232  7232  7232    809.96 ( 933.99)    1123.54 ( 673.31)     ---   (  ---  )    5.11e-10        ---    ok
 8256  8256  8256    813.82 (1382.96)    1156.61 ( 973.09)     ---   (  ---  )    4.21e-10        ---    ok
 9280  9280  9280    793.75 (2013.67)    1228.60 (1300.96)     ---   (  ---  )    3.53e-10        ---    ok
10304 10304 10304    777.53 (2814.03)    1245.07 (1757.33)     ---   (  ---  )    3.02e-10        ---    ok

As you can see, the results for the CPU are all blank, but I can see that MAGMA is faster than cuBLAS. Comparing only with cuBLAS makes sense because ?gemm is a BLAS function, but I don't know why there are no results for the CPU. I think the CPU column should be for the case when only MKL is used to compute the matrix-matrix product, to compare against CPU+GPU (MAGMA) and GPU-only (cuBLAS). Could it be because I have not built MAGMA correctly to utilize the CPU as well? Can you please advise?
Regards,
Dan

Ahmad Abdelfattah

Jul 16, 2024, 7:16:29 AM
to Danesh Daroui, MAGMA User, Andrew Cunningham
Hi Dan, 

You can run the CPU benchmark using the (-l) or (--lapack) option. By default, this tester does not run the CPU benchmark.
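For example, rerunning the same testers with the comparison enabled (flags per the tester's usage message):

```shell
# --lapack (or -l) adds the CPU/MKL column to the tester output.
./testing_sgemm --lapack
./testing_dgemm --lapack   # double precision will also expose the FP64 gap on Maxwell
```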

Actually, cuBLAS is faster than MAGMA here. Note that the tester reports both performance (Gflop/s, higher is better) and execution time for each library: MAGMA reaches about ~800 Gflop/s, while cuBLAS peaks at ~1250 Gflop/s.

Ahmad

Andrew Cunningham

Jul 16, 2024, 11:40:55 AM
to Danesh Daroui, MAGMA User
Hi Dan,
As Ahmad pointed out, you need to run with the --lapack switch to get
the MKL performance.
Running the dgemm benchmark will be informative as well.

Andrew