Strange Behavior of GEMM Peak Performance & GA102 GPU


Nima Sahraneshin

Sep 22, 2022, 6:24:12 PM
to MAGMA User
Hi,

I am testing the peak performance of an RTX 3090, but some of the results look strange to me.

First, DGEMM:

NVIDIA states that the double-precision peak for GA102 is 556.0 GFLOPS, but with the MAGMA tester I am seeing larger numbers:

%   M     N     K   MAGMA Gflop/s (ms)  cuBLAS Gflop/s (ms)   CPU Gflop/s (ms)  MAGMA error  cuBLAS error
%========================================================================================================
 1024  1024  1024    420.58 (   5.11)     406.02 (   5.29)     ---   (  ---  )    0.00e+00        ---    ok
 2048  2048  2048    527.70 (  32.56)     473.61 (  36.27)     ---   (  ---  )    1.96e-17        ---    ok
 3072  3072  3072    599.68 (  96.69)     529.70 ( 109.46)     ---   (  ---  )    0.00e+00        ---    ok
 4096  4096  4096    623.33 ( 220.49)     539.50 ( 254.75)     ---   (  ---  )    0.00e+00        ---    ok
 5120  5120  5120    617.11 ( 434.99)     537.57 ( 499.35)     ---   (  ---  )    0.00e+00        ---    ok
 6144  6144  6144    621.39 ( 746.49)     539.89 ( 859.17)     ---   (  ---  )    0.00e+00        ---    ok
 7168  7168  7168    621.32 (1185.53)     536.08 (1374.04)     ---   (  ---  )    0.00e+00        ---    ok
 8192  8192  8192    619.22 (1775.63)     535.94 (2051.54)     ---   (  ---  )    0.00e+00        ---    ok
 9216  9216  9216    619.59 (2526.68)     536.21 (2919.61)     ---   (  ---  )    0.00e+00        ---    ok
10240 10240 10240    624.20 (3440.38)     541.41 (3966.48)     ---   (  ---  )    0.00e+00        ---    ok
11264 11264 11264    625.40 (4570.39)     539.11 (5301.91)     ---   (  ---  )    0.00e+00        ---    ok

12288 12288 12288    540.56 (6864.81)     539.22 (6881.84)     ---   (  ---  )    0.00e+00        ---    ok
13312 13312 13312    538.82 (8756.28)     538.86 (8755.55)     ---   (  ---  )    0.00e+00        ---    ok
14336 14336 14336    539.05 (10931.55)     536.36 (10986.52)     ---   (  ---  )    0.00e+00        ---    ok
15360 15360 15360    536.20 (13517.01)     536.16 (13517.81)     ---   (  ---  )    0.00e+00        ---    ok
16384 16384 16384    536.12 (16406.98)     536.05 (16409.07)     ---   (  ---  )    0.00e+00        ---    ok

How can this be explained?
Does MAGMA perform GEMM without calling cuBLAS?

Next, the FP16 GEMM:

./testing_hgemm -n 1024:45000:1024 --matrix rand



%   M     N     K   GPU Gflop/s (ms)      GPU error
%========================================================================================================
 1024  1024  1024   2635.23 (   0.81)       ---
 2048  2048  2048   58631.08 (   0.29)       ---
 3072  3072  3072   92188.92 (   0.63)       ---
 4096  4096  4096   113209.10 (   1.21)       ---
 5120  5120  5120   123589.45 (   2.17)       ---
 6144  6144  6144   128241.71 (   3.62)       ---
 7168  7168  7168   115558.98 (   6.37)       ---
 8192  8192  8192   104684.95 (  10.50)       ---
 9216  9216  9216   93158.09 (  16.80)       ---
10240 10240 10240   87832.27 (  24.45)       ---
11264 11264 11264   84352.97 (  33.89)       ---
12288 12288 12288   107240.38 (  34.60)       ---
13312 13312 13312   96272.52 (  49.01)       ---
14336 14336 14336   103458.65 (  56.96)       ---
15360 15360 15360   99511.91 (  72.83)       ---
16384 16384 16384   72890.14 ( 120.68)       ---
17408 17408 17408   87728.52 ( 120.26)       ---
18432 18432 18432   69442.68 ( 180.35)       ---
19456 19456 19456   69949.03 ( 210.58)       ---
20480 20480 20480   68355.85 ( 251.33)       ---
21504 21504 21504   67744.31 ( 293.57)       ---
22528 22528 22528   67491.17 ( 338.81)       ---
23552 23552 23552   66234.61 ( 394.48)       ---
24576 24576 24576   70176.79 ( 423.03)       ---
25600 25600 25600   72157.42 ( 465.02)       ---
26624 26624 26624   73832.88 ( 511.21)       ---
27648 27648 27648   78171.55 ( 540.72)       ---
28672 28672 28672   71223.29 ( 661.88)       ---
29696 29696 29696   70045.31 ( 747.73)       ---
30720 30720 30720   69575.75 ( 833.37)       ---
31744 31744 31744   69425.44 ( 921.50)       ---
32768 32768 32768   69352.25 (1014.66)       ---
33792 33792 33792   103900.57 ( 742.77)       ---
34816 34816 34816   103183.34 ( 818.01)       ---
35840 35840 35840   82927.16 (1110.29)       ---
36864 36864 36864   91454.85 (1095.55)       ---
37888 37888 37888   80549.89 (1350.42)       ---
38912 38912 38912   92057.92 (1280.03)       ---
39936 39936 39936   98316.55 (1295.68)       ---
40960 40960 40960   97357.48 (1411.69)       ---

Why does performance drop by around 50% and then start to increase again? I do not see this on other GPUs. As far as I know, since CUDA 11 it is no longer necessary to request Tensor Cores explicitly, so magma_hgemm should run on Tensor Cores by default.
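
To put a number on the drop, the tester rows can be parsed and compared against the best row. The following is only a rough sketch: the regular expression and the function names are my own, and the sample rows are a few lines copied from the output above rather than the full table.

```python
import re

# A few rows copied from the testing_hgemm output above (not the full table).
HGEMM_ROWS = """
 6144  6144  6144   128241.71 (   3.62)       ---
11264 11264 11264   84352.97 (  33.89)       ---
23552 23552 23552   66234.61 ( 394.48)       ---
33792 33792 33792   103900.57 ( 742.77)       ---
"""

# Assumed row layout: M N K, then "Gflop/s (ms)".
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+\(\s*([\d.]+)\)")

def parse_rows(text):
    """Return a list of (n, gflops, ms) tuples, one per data row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(4)), float(m.group(5))))
    return rows

def relative_to_peak(rows):
    """Map each size n to its Gflop/s as a fraction of the best row."""
    peak = max(g for _, g, _ in rows)
    return {n: g / peak for n, g, _ in rows}

if __name__ == "__main__":
    for n, frac in relative_to_peak(parse_rows(HGEMM_ROWS)).items():
        print(f"n={n:6d}  {100 * frac:5.1f}% of peak")
```

With these rows, n = 23552 comes out at roughly half of the n = 6144 rate, which is the drop I am asking about.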

Best regards.
Nima


Ahmad Abdelfattah

Sep 22, 2022, 7:42:17 PM
to Nima Sahraneshin, MAGMA User
Hi Nima, 

MAGMA has its own DGEMM kernel, which is called for relatively small sizes; for larger sizes it switches to cuBLAS. You can profile the suspicious runs to confirm which kernels are being called.
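
Before profiling, a quick way to see where the switch probably happens is to compare the MAGMA and cuBLAS columns row by row. The sketch below uses a few rows from the DGEMM table in your first message; the 1% tolerance and the function name are arbitrary choices for illustration, not anything MAGMA defines.

```python
# (n, MAGMA Gflop/s, cuBLAS Gflop/s), copied from the testing_dgemm output
# in the first message of this thread.
DGEMM_ROWS = [
    (10240, 624.20, 541.41),
    (11264, 625.40, 539.11),
    (12288, 540.56, 539.22),
    (13312, 538.82, 538.86),
    (14336, 539.05, 536.36),
]

def first_matching_size(rows, tol=0.01):
    """Return the smallest n from which every row has MAGMA within tol of cuBLAS.

    tol is a relative tolerance (0.01 = 1%), an arbitrary choice for this sketch.
    Returns None if the last row still differs by more than tol.
    """
    candidate = None
    for n, magma, cublas in sorted(rows):
        if abs(magma - cublas) / cublas <= tol:
            if candidate is None:
                candidate = n
        else:
            candidate = None  # a later mismatch resets the search
    return candidate

if __name__ == "__main__":
    print(first_matching_size(DGEMM_ROWS))
```

Matching rates only suggest that the same kernel is running; the profiler is what actually confirms it.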

As for your question about the theoretical peak: the GA102 chip is used in several GPU models, and their theoretical peaks can differ (e.g., different GPU clocks lead to different FP64 peaks).

Regarding HGEMM: MAGMA does not have an HGEMM kernel (it has batched HGEMM only). The magma_hgemm routine is actually a simple wrapper around cuBLAS. The discrepancy in cuBLAS performance could be due to tuning issues for this particular GPU. You can run the same test multiple times using --niter <N> to check whether the slowdown is consistent.

Thanks,
Ahmad






Mark Gates

Oct 5, 2022, 2:04:07 PM
to Nima Sahraneshin, MAGMA User
Nima,

If you provide the complete input & output of the tester, that would help to understand the gemm issue a little better. The header includes valuable information about what specific card and frequency you are running.

leconte testing> ./testing_dgemm
% MAGMA 2.5.0  compiled for CUDA capability >= 3.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 11040, driver 11040. OpenMP threads 80. MKL 2022.0.0, MKL threads 40.
% device 0: Tesla V100-SXM2-32GB, 1530.0 MHz clock, 32510.5 MiB memory, capability 7.0
% Wed Oct  5 13:49:00 2022
% Usage: ./testing_dgemm [options] [-h|--help]

% If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
% relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.

% transA = No transpose, transB = No transpose

%   M     N     K   MAGMA Gflop/s (ms)  cuBLAS Gflop/s (ms)   CPU Gflop/s (ms)  MAGMA error  cuBLAS error
%========================================================================================================
 1088  1088  1088   3782.84 (   0.68)    3030.52 (   0.85)     ---   (  ---  )    2.12e-17        ---    ok
 2112  2112  2112   5384.73 (   3.50)    5702.58 (   3.30)     ---   (  ---  )    2.04e-17        ---    ok
 3136  3136  3136   5436.97 (  11.34)    6088.36 (  10.13)     ---   (  ---  )    2.21e-17        ---    ok


Your results are surprising. Normally the older MAGMA gemm kernel, on which cuBLAS gemm was later based, is slower than cuBLAS. MAGMA falls back to calling cuBLAS when A and B exceed the texture memory size.

Mark

Nima Sahraneshin

Oct 5, 2022, 2:41:50 PM
to Mark Gates, MAGMA User
Mark,

Here are the repeated tests, this time with the full header:

./testing_hgemm -n 1024:45000:1024 --matrix rand  --dev 1
% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 11030, driver 11030. OpenMP threads 20. MKL 2021.0.1, MKL threads 20.
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA GeForce RTX 3090, 1695.0 MHz clock, 24268.3 MiB memory, capability 8.6
% Wed Oct  5 20:33:47 2022
% Usage: ./testing_hgemm [options] [-h|--help]

% If running with option --lapack (-l) or with checking (-c), GPU error is computed
% relative to CPU BLAS result in single precision.


% transA = No transpose, transB = No transpose
%   M     N     K   GPU Gflop/s (ms)      GPU error
%========================================================================================================
 1024  1024  1024     18.93 ( 113.47)       ---
 2048  2048  2048   62932.40 (   0.27)       ---
 3072  3072  3072   96620.73 (   0.60)       ---
 4096  4096  4096   115384.46 (   1.19)       ---
 5120  5120  5120   125434.48 (   2.14)       ---
 6144  6144  6144   118357.16 (   3.92)       ---
 7168  7168  7168   108784.13 (   6.77)       ---
 8192  8192  8192   94777.55 (  11.60)       ---
 9216  9216  9216   85142.16 (  18.39)       ---
10240 10240 10240   81743.18 (  26.27)       ---
11264 11264 11264   78296.87 (  36.51)       ---
12288 12288 12288   110255.52 (  33.66)       ---
13312 13312 13312   98575.41 (  47.86)       ---
14336 14336 14336   96202.82 (  61.25)       ---
15360 15360 15360   87993.52 (  82.37)       ---
16384 16384 16384   67140.11 ( 131.01)       ---
17408 17408 17408   82863.40 ( 127.33)       ---
18432 18432 18432   64138.47 ( 195.27)       ---
19456 19456 19456   62124.99 ( 237.10)       ---
20480 20480 20480   61435.88 ( 279.64)       ---



./testing_dgemm -n 1024:45000:1024 --matrix rand  --dev 1
% MAGMA 2.5.4  compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 11030, driver 11030. OpenMP threads 20. MKL 2021.0.1, MKL threads 20.
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA GeForce RTX 3090, 1695.0 MHz clock, 24268.3 MiB memory, capability 8.6
% Wed Oct  5 20:36:33 2022

% Usage: ./testing_dgemm [options] [-h|--help]

% If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
% relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.

% transA = No transpose, transB = No transpose
%   M     N     K   MAGMA Gflop/s (ms)  cuBLAS Gflop/s (ms)   CPU Gflop/s (ms)  MAGMA error  cuBLAS error
%========================================================================================================
 1024  1024  1024    420.17 (   5.11)      42.66 (  50.34)     ---   (  ---  )    0.00e+00        ---    ok
 2048  2048  2048    527.69 (  32.56)     468.52 (  36.67)     ---   (  ---  )    1.96e-17        ---    ok
 3072  3072  3072    544.31 ( 106.52)     530.20 ( 109.36)     ---   (  ---  )    0.00e+00        ---    ok
 4096  4096  4096    616.50 ( 222.93)     531.88 ( 258.40)     ---   (  ---  )    0.00e+00        ---    ok
 5120  5120  5120    612.28 ( 438.42)     533.36 ( 503.29)     ---   (  ---  )    0.00e+00        ---    ok
 6144  6144  6144    616.20 ( 752.77)     534.26 ( 868.23)     ---   (  ---  )    0.00e+00        ---    ok
 7168  7168  7168    616.98 (1193.87)     532.05 (1384.44)     ---   (  ---  )    0.00e+00        ---    ok
 8192  8192  8192    615.36 (1786.77)     531.81 (2067.51)     ---   (  ---  )    0.00e+00        ---    ok
 9216  9216  9216    614.82 (2546.32)     531.83 (2943.62)     ---   (  ---  )    0.00e+00        ---    ok
10240 10240 10240    613.63 (3499.63)     531.90 (4037.42)     ---   (  ---  )    0.00e+00        ---    ok
11264 11264 11264    614.80 (4649.13)     532.09 (5371.86)     ---   (  ---  )    0.00e+00        ---    ok
12288 12288 12288    531.92 (6976.28)     531.93 (6976.22)     ---   (  ---  )    0.00e+00        ---    ok
13312 13312 13312    531.98 (8868.86)     531.98 (8868.88)     ---   (  ---  )    0.00e+00        ---    ok
14336 14336 14336    532.24 (11071.57)     532.25 (11071.29)     ---   (  ---  )    0.00e+00        ---    ok

Nima

Mark Gates

Oct 5, 2022, 3:49:40 PM
to Nima Sahraneshin, MAGMA User
I can't explain the higher MAGMA dgemm performance.
The 3090 has 82 SMs and a clock from 1395 to 1695 MHz, as you show.
There are 2 FP64 cores per SM. The total is listed as 168, which implies 84 SMs, so that must be a variant.

1395 MHz * 82 SMs * 2 cores * 2 for FMA = 457.6 Gflop/s
1695 MHz * 82 SMs * 2 cores * 2 for FMA = 556.0 Gflop/s, as you said.
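
The same arithmetic as a small script, in case you want to plug in other clocks or SM counts. The function names are made up for this sketch. The second function simply inverts the formula to give the clock that a measured rate would require, assuming the 82 SM and 2 FP64 cores per SM figures are right; treat it as a consistency check, not an explanation.

```python
def fp64_peak_gflops(clock_mhz, num_sms, fp64_cores_per_sm=2, flops_per_fma=2):
    """Theoretical FP64 peak in Gflop/s: clock * SMs * cores per SM * 2 (FMA)."""
    return clock_mhz * 1e6 * num_sms * fp64_cores_per_sm * flops_per_fma / 1e9

def implied_clock_mhz(gflops, num_sms, fp64_cores_per_sm=2, flops_per_fma=2):
    """Clock in MHz needed to reach gflops, i.e. the peak formula inverted."""
    return gflops * 1e9 / (num_sms * fp64_cores_per_sm * flops_per_fma) / 1e6

if __name__ == "__main__":
    print(f"{fp64_peak_gflops(1395, 82):.1f} Gflop/s at 1395 MHz")
    print(f"{fp64_peak_gflops(1695, 82):.1f} Gflop/s at 1695 MHz")
    # 625.40 Gflop/s is the best MAGMA row in the first table of this thread.
    print(f"{implied_clock_mhz(625.40, 82):.0f} MHz needed for 625.40 Gflop/s")
```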

For n ≥ 12288, it appears that the MAGMA dgemm is calling cuBLAS dgemm, since the performance is identical.

You could check the device properties to verify the multiProcessorCount (see attached code).
You could run NVIDIA Nsight to trace the code, which would allow you to verify the timings.
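
Once you have kernel times from a trace, they can be converted back to Gflop/s and compared with what the tester printed. A minimal sketch, using the usual 2*m*n*k flop count for GEMM (it reproduces the tester's numbers from its own timings); the function name is hypothetical and the sample times are taken from your dgemm output, not from a trace.

```python
def gemm_gflops(m, n, k, time_ms):
    """Gflop/s for one GEMM, counting 2*m*n*k flops (one multiply and one add per term)."""
    flops = 2.0 * m * n * k
    return flops / (time_ms * 1e-3) / 1e9

if __name__ == "__main__":
    # (n, MAGMA ms, cuBLAS ms) from the testing_dgemm run on the RTX 3090.
    samples = [(4096, 222.93, 258.40), (12288, 6976.28, 6976.22)]
    for n, magma_ms, cublas_ms in samples:
        magma = gemm_gflops(n, n, n, magma_ms)
        cublas = gemm_gflops(n, n, n, cublas_ms)
        print(f"n={n:6d}  MAGMA {magma:8.2f}  cuBLAS {cublas:8.2f} Gflop/s")
```

If the traced kernel time gives a noticeably different rate than the tester reports, the timing is the first thing to look at.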

Mark

cuda-properties.cc