dgetrf_mgpu scalability issue

Simplice

Mar 11, 2021, 3:50:58 PM3/11/21
to MAGMA User

Hi MAGMA team,

I gave a MAGMA training course where each learner was using their own GPU-equipped computer. Some learners reported that magma_dgetrf_mgpu was faster with 1 GPU than with 4 GPUs. I was finally able to reproduce the problem on a Tesla V100-SXM2-16GB, with MAGMA 2.5.4 and CUDA 10.2.

For a matrix of size 20 000, magma_dgetrf_mgpu with 1 GPU reaches 4068.29 GFlop/s, while with 4 GPUs it reaches only 3302.63 GFlop/s. The peak performance of one GPU on that machine is 7834 GFlop/s.

Is there an explanation for this scalability issue? Could I send you the small example that reproduces it?

Please see the output below:

$ ./test_magma_dgetrf_mgpu 20000 1 1
% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 11000. OpenMP threads 10. MKL 2019.0.4, MKL threads 10.
% device 0: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% Thu Mar 11 21:11:03 2021
Matrix m: 20000, n: 20000, nrhs: 1, ngpus:1, max API GPUs: 8
nb: 512
MAGMA_DGETRF_MGPU time: 1.31, GFlops/s: 4068.29
Machine precision: 1.110223e-16, Residual1:2.096565e-08, Residual2: 5.781984e-03

$ ./test_magma_dgetrf_mgpu 20000 1 4
% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 11000. OpenMP threads 10. MKL 2019.0.4, MKL threads 10.
% device 0: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% Thu Mar 11 21:11:25 2021
Matrix m: 20000, n: 20000, nrhs: 1, ngpus:4, max API GPUs: 8
nb: 512
MAGMA_DGETRF_MGPU time: 1.61, GFlops/s: 3302.63
Machine precision: 1.110223e-16, Residual1:1.953109e-08, Residual2: 5.386331e-03
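For reference, the core of what my test does looks roughly like the sketch below. This is not the exact test file: it assumes MAGMA's usual 1D block-cyclic column distribution with block width nb and the magma_dgetrf_mgpu prototype from magma_v2.h, hard-codes nb = 512 (the value reported above; the real code would take it from magma_get_dgetrf_nb), and omits error checking and the solve/residual steps.

/* Rough sketch (not the exact test program): factor an n x n matrix with
 * magma_dgetrf_mgpu on ngpu devices, using a 1D block-cyclic column
 * distribution with block width nb.                                      */
#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"

int main( void )
{
    magma_init();

    magma_int_t m = 20000, n = 20000, ngpu = 4;
    magma_int_t nb   = 512;                     /* nb reported above; normally magma_get_dgetrf_nb */
    magma_int_t lda  = m;
    magma_int_t ldda = magma_roundup( m, 32 );  /* padded leading dimension on the GPU  */
    magma_int_t nblk = magma_ceildiv( n, nb );  /* number of nb-wide column blocks      */
    magma_int_t info, *ipiv;

    double *hA;
    magma_dmalloc_cpu( &hA, lda * n );
    magma_imalloc_cpu( &ipiv, m );

    /* simple well-conditioned test matrix */
    for (size_t i = 0; i < (size_t)lda * n; ++i)
        hA[i] = rand() / (double)RAND_MAX;
    for (magma_int_t i = 0; i < m; ++i)
        hA[i + (size_t)i*lda] += n;

    magmaDouble_ptr d_lA[ MagmaMaxGPUs ];
    magma_queue_t   queues[ MagmaMaxGPUs ];
    for (magma_int_t d = 0; d < ngpu; ++d) {
        magma_int_t nloc = nb * magma_ceildiv( nblk - d, ngpu );  /* local width on GPU d */
        magma_setdevice( d );
        magma_queue_create( d, &queues[d] );
        magma_dmalloc( &d_lA[d], (size_t)ldda * nloc );
    }

    /* 1D block-cyclic: column block j goes to GPU j % ngpu, local block j / ngpu */
    for (magma_int_t j = 0; j < nblk; ++j) {
        magma_int_t jb = ( (j+1)*nb <= n ? nb : n - j*nb );
        magma_int_t d  = j % ngpu;
        magma_setdevice( d );
        magma_dsetmatrix( m, jb, hA + (size_t)j*nb*lda, lda,
                          d_lA[d] + (size_t)(j/ngpu)*nb*ldda, ldda, queues[d] );
    }

    double t = magma_wtime();
    magma_dgetrf_mgpu( ngpu, m, n, d_lA, ldda, ipiv, &info );
    t = magma_wtime() - t;
    printf( "info = %lld, time = %.2f s\n", (long long) info, t );

    for (magma_int_t d = 0; d < ngpu; ++d) {
        magma_setdevice( d );
        magma_queue_destroy( queues[d] );
        magma_free( d_lA[d] );
    }
    magma_free_cpu( hA );
    magma_free_cpu( ipiv );
    magma_finalize();
    return 0;
}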

Stanimire Tomov

Mar 11, 2021, 4:09:15 PM3/11/21
to MAGMA User, Simplice
Hi Simplice,

Good to hear from you and that you are using MAGMA for this training!

We have not recently tuned MAGMA for multiple GPUs. The most common cause of slow performance is the CPU. For cases where the CPU is slow, we have added an LU version that is GPU-only, i.e., not the typical hybrid version that uses the CPU for some critical parts (like the panels) and the GPUs for the trailing-matrix updates (GEMMs).
We can revisit the tuning and extend this GPU-only capability to multiple GPUs.
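For completeness, the two single-GPU entry points differ only in the call; here is a minimal sketch (assuming the MAGMA 2.5.x prototypes in magma_v2.h, with dA already allocated and filled on the device and ipiv a host array of length min(m,n)):

#include "magma_v2.h"

/* Sketch: factor a matrix that is already resident on the GPU, with either
 * the hybrid or the GPU-only ("native") code path.                         */
magma_int_t lu_factor_on_gpu( magma_int_t m, magma_int_t n,
                              magmaDouble_ptr dA, magma_int_t ldda,
                              magma_int_t *ipiv, int use_native )
{
    magma_int_t info = 0;
    if ( use_native ) {
        /* GPU-only: panels, pivoting, and updates all run on the GPU;
         * helpful when the host CPU is comparatively slow.            */
        magma_dgetrf_native( m, n, dA, ldda, ipiv, &info );
    } else {
        /* hybrid: panels on the CPU (LAPACK/MKL), trailing GEMM updates on the GPU */
        magma_dgetrf_gpu( m, n, dA, ldda, ipiv, &info );
    }
    return info;
}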

For a single GPU we have this benchmark,
so at 20K on a V100 the performance should be around 5000 GFlop/s with good tuning.
I will post some tuned multi-GPU code when we do the update enabling GPU-only execution.

Stan

Stanimire Tomov

Mar 11, 2021, 5:41:50 PM3/11/21
to Simplice, MAGMA User
Hi Simplice,

I am looking at the codes now.

Regarding the hang for m > 20K, can you please give more details? I just tried a few cases above 20K
and everything was fine for those I tested. I am trying it with CUDA 11.2:
% CUDA runtime 11020, driver 11020. OpenMP threads 80. MKL 2018.0.1, MKL threads 40. 

Thanks,
Stan

On Mar 11, 2021, at 4:56 PM, Simplice <sido...@gmail.com> wrote:


Hi Stan,

Thanks for your prompt reply and for the news.

It is a very interesting result. I will do these tests on the A100.

I can reproduce the V100 performance for 1-GPU with the hybrid version dgetrf_gpu. The native version dgetrf_native seems to give better results but hangs for m>20000.

I'm looking forward to the multi-GPU tuned version of MAGMA.

Thanks,

Stanimire Tomov

Mar 11, 2021, 6:09:35 PM3/11/21
to MAGMA User, Stanimire Tomov, MAGMA User, Simplice
Until we prepare more versions for the panel, we should point out that the matrix sizes have to be larger in order to see the benefit of using multiple GPUs. Here is what I get using 4 GPUs as I grow the sizes:

-bash-4.2$ ./testing_dgetrf_mgpu --ngpu 4 -n 5000:60000:5000 -n 70000 -n 80000
% MAGMA 2.5.4 svn 64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 7.0
% CUDA runtime 11020, driver 11020. OpenMP threads 40. MKL 2018.0.1, MKL threads 40.
% device 0: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 4: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 5: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 6: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 7: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% Thu Mar 11 17:55:39 2021
% Usage: ./testing_dgetrf_mgpu [options] [-h|--help]

% ngpu 4
%   M     N   CPU Gflop/s (sec)   GPU Gflop/s (sec)   |PA-LU|/(N*|A|)
%========================================================================
 5000  5000    ---   (  ---  )    630.55 (   0.13)     ---
10000 10000    ---   (  ---  )   1531.53 (   0.44)     ---
15000 15000    ---   (  ---  )   2901.38 (   0.78)     ---
20000 20000    ---   (  ---  )   4361.66 (   1.22)     ---
25000 25000    ---   (  ---  )   6062.69 (   1.72)     ---
30000 30000    ---   (  ---  )   7739.25 (   2.33)     ---
35000 35000    ---   (  ---  )   9697.42 (   2.95)     ---
40000 40000    ---   (  ---  )   11824.16 (   3.61)     ---
45000 45000    ---   (  ---  )   13391.99 (   4.54)     ---
50000 50000    ---   (  ---  )   15119.20 (   5.51)     ---
55000 55000    ---   (  ---  )   16807.54 (   6.60)     ---
60000 60000    ---   (  ---  )   17593.08 (   8.18)     ---
70000 70000    ---   (  ---  )   20098.44 (  11.38)     ---
80000 80000    ---   (  ---  )   21820.60 (  15.64)     ---

so it scales quite well, but the matrix sizes have to get larger for the GEMMs to asymptotically
become dominant and hide the lower performance of the panel factorizations, pivoting, and TRSMs.
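A rough back-of-the-envelope way to see this (my own simplification, not anything MAGMA computes): of the 2/3 n^3 flops in LU, roughly n^2*nb/2 sit in the nb-wide panel factorizations, which do not scale across the GPUs the way the GEMMs do. A few lines of C make the fractions explicit:

/* Back-of-the-envelope sketch: estimate the fraction of the 2/3 n^3 LU flops
 * that sits in the nb-wide panel factorizations.                            */
#include <stdio.h>

int main( void )
{
    const double nb = 512.0;                    /* block size from the runs above  */
    const double sizes[] = { 20000.0, 40000.0, 80000.0 };

    for (int i = 0; i < 3; ++i) {
        double n     = sizes[i];
        double total = 2.0 / 3.0 * n * n * n;   /* total LU flops                  */
        double panel = 0.5 * n * n * nb;        /* ~ sum of (n-k)*nb^2 over panels */
        printf( "n = %6.0f: panel fraction ~ %.2f%%\n", n, 100.0 * panel / total );
    }
    /* prints ~1.9% at n = 20000 and ~0.5% at n = 80000: only for larger n do
     * the GEMMs dwarf the memory-bound panel, pivoting, and TRSM work.       */
    return 0;
}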

Stan

Simplice

Mar 11, 2021, 8:35:19 PM3/11/21
to MAGMA User, to...@icl.utk.edu, Simplice

Hi Stan,

Thanks for your prompt reply and for the news.

It is a very interesting result. I will do these tests on the A100.

I can reproduce the V100 performance for 1-GPU with the hybrid version dgetrf_gpu. The native version dgetrf_native seems to give better results but hangs for m>20000.

I'm looking forward to the multi-GPU tuned version of MAGMA.

Thanks,


Simplice

Mar 11, 2021, 8:35:28 PM3/11/21
to MAGMA User, to...@icl.utk.edu, MAGMA User, Simplice
Hi Stan,

When it does not hang, it fails. I used the same test as for dgetrf_gpu; I just replaced dgetrf_gpu with dgetrf_native.

I switched to a Tesla V100-SXM2-32GB and still get the same result.

Here is the output:

$ ./test_magma_dgetrf_native 10000

% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 11000. OpenMP threads 10. MKL 2019.0.4, MKL threads 10.
% device 0: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% Thu Mar 11 22:53:29 2021
Matrix m: 10000, n: 10000, nrhs: 1
Generating A ........
Computing B ........
Copying the matrix to the GPU ........
Factorising using DGETRF ........
MAGMA_DGETRF_NATIVE time: 0.22, GFlops/s: 3044.14
Copying the matrix from the GPU ........
Solving ........
Computing the residual ........
Machine precision: 1.110223e-16, Residual1:4.396109e-09, Residual2: 6.858558e-03
Free memory ........

$ ./test_magma_dgetrf_native 20000

% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 11000. OpenMP threads 10. MKL 2019.0.4, MKL threads 10.
% device 0: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% Thu Mar 11 22:52:22 2021

Matrix m: 20000, n: 20000, nrhs: 1
Generating A ........
Computing B ........
Copying the matrix to the GPU ........
Factorising using DGETRF ........
MAGMA_DGETRF_NATIVE time: 0.12, GFlops/s: 44049.49
Copying the matrix from the GPU ........
Solving ........

Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Computing the residual ........
Machine precision: 1.110223e-16, Residual1:-nan, Residual2: -nan
Free memory ........

Simplice

Mar 11, 2021, 8:35:39 PM3/11/21
to MAGMA User, to...@icl.utk.edu, MAGMA User, Simplice

Here is what I get by running the MAGMA tester:

$ magma-2.5.4/testing/testing_dgetrf_mgpu --ngpu 4 -n 5000:60000:5000 -n 70000 -n 80000

% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 11000. OpenMP threads 10. MKL 2019.0.4, MKL threads 10.
% device 0: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% Fri Mar 12 00:35:24 2021
% Usage: magma-2.5.4/testing//testing_dgetrf_mgpu [options] [-h|--help]


% ngpu 4
%   M     N   CPU Gflop/s (sec)   GPU Gflop/s (sec)   |PA-LU|/(N*|A|)
%========================================================================
 5000  5000    ---   (  ---  )    722.10 (   0.12)     ---
10000 10000    ---   (  ---  )   1523.23 (   0.44)     ---
15000 15000    ---   (  ---  )   2425.74 (   0.93)     ---
20000 20000    ---   (  ---  )   3305.04 (   1.61)     ---
25000 25000    ---   (  ---  )   4130.72 (   2.52)     ---
30000 30000    ---   (  ---  )   4941.47 (   3.64)     ---
35000 35000    ---   (  ---  )   5683.89 (   5.03)     ---
40000 40000    ---   (  ---  )   6410.37 (   6.66)     ---
45000 45000    ---   (  ---  )   7139.04 (   8.51)     ---
Error: magma_dmalloc_cpu( &h_A, n2 )
failed at testing/testing_dgetrf_mgpu.cpp:217: error -112: cannot allocate memory on CPU host
----

Our node configuration does not allow me to allocate matrices larger than 45 000.

I will also check whether CUDA 11.2 can be installed on our system.

Thanks,

Stanimire Tomov

Mar 11, 2021, 9:10:05 PM3/11/21
to MAGMA User, Simplice, Stanimire Tomov, MAGMA User
Simplice,
The failure in the latter case is due to the use of 32-bit integers in MAGMA. I can see this from your test output:
% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
The limit there is indeed around 46K. To solve larger problems you have to build MAGMA with 64-bit integers, e.g., as shown in the make.inc-examples/make.inc.mkl-gcc-ilp64 example.
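To illustrate where the ~46K limit comes from (plain 32-bit arithmetic, nothing MAGMA-specific beyond the element count n2 = n*n that the tester allocates):

/* With 32-bit magma_int_t, the element count n*n overflows a signed
 * 32-bit integer once n > 46340.                                     */
#include <stdio.h>
#include <stdint.h>

int main( void )
{
    int32_t   n_ok  = 46340;
    int32_t   n_bad = 50000;
    long long full  = (long long)n_bad * n_bad;   /* 2,500,000,000          */
    int32_t   wrap  = (int32_t)(uint32_t)full;    /* what 32-bit code sees  */

    printf( "46340^2 = %lld  (<= INT32_MAX = %d)\n", (long long)n_ok * n_ok, INT32_MAX );
    printf( "50000^2 = %lld, but in 32 bits it wraps to %d\n", full, wrap );
    /* A negative n2 = n*n is consistent with the tester's
     * magma_dmalloc_cpu( &h_A, n2 ) failing at n = 50000.  Building with the
     * ILP64 make.inc makes magma_int_t 64-bit and removes this limit.       */
    return 0;
}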

For the native version we have to think more about what the problem is, because I cannot reproduce it on our systems.
Regarding the test, did you modify the testers in MAGMA? For example, what do you get with the MAGMA tester:

-bash-4.2$ ./testing_dgetrf_gpu -n 10000 -l -c2 --version 3 --niter 2
% MAGMA 2.5.4 svn compiled for CUDA capability >= 7.0, 64-bit magma_int_t, 64-bit pointer.

% CUDA runtime 11020, driver 11020. OpenMP threads 40. MKL 2018.0.1, MKL threads 40.
% device 0: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 4: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 5: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 6: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 7: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% Thu Mar 11 21:09:04 2021
% Usage: ./testing_dgetrf_gpu [options] [-h|--help]

% version 3
%   M     N   CPU Gflop/s (sec)   GPU Gflop/s (sec)   |Ax-b|/(N*|A|*|x|)
%========================================================================
10000 10000    587.72 (   1.13)   3277.34 (   0.20)   1.69e-21   ok
10000 10000    591.00 (   1.13)   3450.68 (   0.19)   1.69e-21   ok

Stan

Simplice

Mar 12, 2021, 10:19:27 AM3/12/21
to MAGMA User, to...@icl.utk.edu, Simplice, MAGMA User

Hi Stan,

Please see attached the simple example file I use for magma_dgetrf_native; it works for 10 000 but not for 20 000.
magma_dgetrf_gpu works using the same file.

Here is the output:

$ ./test/test_magma_dgetrf_native 10000

% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 11000. OpenMP threads 10. MKL 2019.0.4, MKL threads 10.
% device 0: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% Fri Mar 12 15:04:00 2021

Matrix m: 10000, n: 10000, nrhs: 1
sizeof(magma_int_t): 4, sizeof(int):4

Generating A ........
Computing B ........
Copying the matrix to the GPU ........
Factorising using DGETRF ........
MAGMA_DGETRF_NATIVE time: 0.22, GFlops/s: 3012.35

Copying the matrix from the GPU ........
Solving ........
Computing the residual ........
Machine precision: 1.110223e-16, Residual1:4.365588e-09, Residual2: 6.810863e-03
Free memory ........


$ ./test/test_magma_dgetrf_native 20000

% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 11000. OpenMP threads 10. MKL 2019.0.4, MKL threads 10.
% device 0: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% Fri Mar 12 15:03:41 2021

Matrix m: 20000, n: 20000, nrhs: 1
sizeof(magma_int_t): 4, sizeof(int):4

Generating A ........
Computing B ........
Copying the matrix to the GPU ........
Factorising using DGETRF ........
MAGMA_DGETRF_NATIVE time: 0.10, GFlops/s: 55363.98

Copying the matrix from the GPU ........
Solving ........

Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Computing the residual ........
Machine precision: 1.110223e-16, Residual1:-nan, Residual2: -nan
Free memory ........

test_magma_dgetrf_native.c

Ahmad Abdelfattah

Mar 18, 2021, 2:16:53 PM3/18/21
to Simplice, MAGMA User, to...@icl.utk.edu
Hi Simplice, 

Sorry for the delayed response. 

I have tried your code. It is working fine for me for sizes up to 40K on a V100 GPU. I’m using the same version for the driver and the runtime (11.01). 

However, when I tried downgrading to an older runtime (CUDA 10.1) while keeping the same driver, the native LU fails for sizes larger than ~23K (no hang on my side). It looks like this mismatch between the driver and the runtime causes an issue. I should point out that around this size we internally switch between different codes for performing the panel on the GPU; however, I'm not sure whether this is indeed the source of the failure.
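If it helps to confirm the configuration on a given machine, the versions in MAGMA's banner (e.g. "CUDA runtime 10020, driver 11000") can be queried directly with the CUDA runtime API; a minimal check looks like this:

/* Report the CUDA driver and runtime versions to spot a mismatch. */
#include <stdio.h>
#include <cuda_runtime.h>

int main( void )
{
    int driver_ver = 0, runtime_ver = 0;
    cudaDriverGetVersion( &driver_ver );    /* highest version the installed driver supports */
    cudaRuntimeGetVersion( &runtime_ver );  /* version of the linked CUDA runtime            */

    printf( "CUDA runtime %d, driver %d\n", runtime_ver, driver_ver );
    if ( runtime_ver < driver_ver )
        printf( "runtime is older than the driver -- the combination where the "
                "failures above were observed (e.g. runtime 10020, driver 11000)\n" );
    return 0;
}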

Thanks,
Ahmad Abdelfattah
Research Scientist
Innovative Computing Laboratory
University of Tennessee, USA
ah...@icl.utk.edu



