Hi MAGMA team,
I gave a MAGMA training session in which each learner used their own GPU-equipped computer. Some learners reported that
magma_dgetrf_mgpu performed better with 1 GPU than with 4 GPUs. I was eventually able to reproduce the problem on a machine with four Tesla V100-SXM2-16GB GPUs, using MAGMA 2.5.4 and CUDA 10.2.
For a 20000 x 20000 matrix, magma_dgetrf_mgpu reaches 4068.29 GFlops/s with 1 GPU, but only 3302.63 GFlops/s with 4 GPUs. The peak performance of one GPU on that computer is 7834 GFlops/s.
Is there an explanation for this scalability issue? I can send you a small example that reproduces it if that would help.
Please see the output below:
$ ./test_magma_dgetrf_mgpu 20000 1 1
% MAGMA 2.5.4 compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 11000. OpenMP threads 10. MKL 2019.0.4, MKL threads 10.
% device 0: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% Thu Mar 11 21:11:03 2021
Matrix m: 20000, n: 20000, nrhs: 1, ngpus:1, max API GPUs: 8
nb: 512
MAGMA_DGETRF_MGPU time: 1.31, GFlops/s: 4068.29
Machine precision: 1.110223e-16, Residual1:2.096565e-08, Residual2: 5.781984e-03
$ ./test_magma_dgetrf_mgpu 20000 1 4
% MAGMA 2.5.4 compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 11000. OpenMP threads 10. MKL 2019.0.4, MKL threads 10.
% device 0: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 1: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 2: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% device 3: Tesla V100-SXM2-16GB, 1530.0 MHz clock, 16160.5 MiB memory, capability 7.0
% Thu Mar 11 21:11:25 2021
Matrix m: 20000, n: 20000, nrhs: 1, ngpus:4, max API GPUs: 8
nb: 512
MAGMA_DGETRF_MGPU time: 1.61, GFlops/s: 3302.63
Machine precision: 1.110223e-16, Residual1:1.953109e-08, Residual2: 5.386331e-03
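
As a sanity check on the numbers above, here is a minimal sketch of the arithmetic. It assumes the tester counts LU flops with the usual leading term 2n^3/3 (an assumption, not something confirmed from the test source), and the helper names lu_flops and gflops are hypothetical, chosen only for this illustration.

```python
# Cross-check of the reported rates, assuming a 2*n^3/3 flop count for LU.
# lu_flops and gflops are hypothetical helper names, not part of MAGMA.

def lu_flops(n):
    """Leading-order flop count for LU factorization of an n x n matrix."""
    return 2.0 * n**3 / 3.0

def gflops(n, seconds):
    """Rate in GFlops/s for an n x n LU that took `seconds`."""
    return lu_flops(n) / seconds / 1e9

n = 20000
peak_per_gpu = 7834.0   # GFlops/s, from the post
rate_1gpu = 4068.29     # GFlops/s, from the 1-GPU run
rate_4gpu = 3302.63     # GFlops/s, from the 4-GPU run

# The printed times are rounded to two decimals, so these only agree
# approximately with the rates printed by the tester.
est_1gpu = gflops(n, 1.31)
est_4gpu = gflops(n, 1.61)

speedup = rate_4gpu / rate_1gpu         # below 1 means a slowdown
efficiency = speedup / 4                # parallel efficiency on 4 GPUs
frac_peak_1gpu = rate_1gpu / peak_per_gpu

print(f"estimated 1-GPU rate: {est_1gpu:.1f} GFlops/s")
print(f"estimated 4-GPU rate: {est_4gpu:.1f} GFlops/s")
print(f"speedup 4 GPUs vs 1:  {speedup:.2f}")
print(f"parallel efficiency:  {efficiency:.2f}")
print(f"1-GPU fraction of peak: {frac_peak_1gpu:.2f}")
```

Under that assumption the estimated rates land within about 1% of the printed ones, and the 4-GPU run comes out at roughly 0.81 times the 1-GPU rate, i.e. a slowdown rather than a speedup.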