Testing SLATE performance on AMD MI250 GPUs


Ilkhom Abdurakhmanov
May 31, 2023, 10:02:24 AM
to SLATE User
Hello,

We are trying to test the performance of the SLATE routines gesv and gemm on Setonix. The GPU compute node where we are running the tests has 8 GCDs across its 4 AMD MI250X GPU cards. We suspect we are not getting the expected performance. Would you kindly have a look and give us your assessment?

Here are our results for gesv and gemm:

ilkhom@nid002240:/software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test> salloc -p gpu -A pawsey0001-gpu --exclusive -t 01:00:00 -N 1 srun -n 8 --gpu-bind=closest ./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gesv

salloc: Pending job allocation 2110883

salloc: job 2110883 queued and waiting for resources

salloc: job 2110883 has been allocated resources

salloc: Granted job allocation 2110883

salloc: Waiting for resource configuration

salloc: Nodes nid002242 are ready for job

SLATE version 2022.07.00, id 67aa47aa

input: /software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test/./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gesv

2023-05-31 21:48:41, MPI size 8, OpenMP threads 1

                                                                                                                                                                                          

type  origin  target     lu   go       n    nrhs    nb  ib    p    q  la  pt  thresh      error   time (s)       gflop/s  trs time (s)   trs gflop/s  ref time (s)   ref gflop/s  status  

   d    host    task   PPLU  col   10240      10   512  32    2    4   1   1    1.00         NA      3.959       181.314            NA            NA            NA            NA  no check  

   d    host    task   PPLU  col   10240      10   256  32    2    4   1   1    1.00         NA      3.245       221.199            NA            NA            NA            NA  no check  

   d    host    task   PPLU  col   10240      10   128  32    2    4   1   1    1.00         NA      2.936       244.542            NA            NA            NA            NA  no check  

   d    host    task   PPLU  col   10240      10    64  32    2    4   1   1    1.00         NA      3.478       206.401            NA            NA            NA            NA  no check  

   d    host    task   PPLU  col   10240      10    32  32    2    4   1   1    1.00         NA      6.251       114.835            NA            NA            NA            NA  no check  

All tests passed: gesv

salloc: Relinquishing job allocation 2110883


ilkhom@nid002240:/software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test> salloc -p gpu -A pawsey0001-gpu --exclusive -t 01:00:00 -N 1 srun -n 8 --gpu-bind=closest ./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gemm

salloc: Pending job allocation 2110895

salloc: job 2110895 queued and waiting for resources

salloc: job 2110895 has been allocated resources

salloc: Granted job allocation 2110895

salloc: Waiting for resource configuration

salloc: Nodes nid002242 are ready for job

SLATE version 2022.07.00, id 67aa47aa

input: /software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test/./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gemm

2023-05-31 21:50:42, MPI size 8, OpenMP threads 1

                                                                                                                                                                                             

type  origin  target  gemm   go   transA   transB       m       n       k      alpha       beta    nb    p    q  la      error   time (s)       gflop/s  ref time (s)   ref gflop/s  status  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   512    2    4   1         NA      6.095       352.322            NA            NA  no check  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   256    2    4   1         NA      6.450       332.948            NA            NA  no check  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   128    2    4   1         NA      6.408       335.121            NA            NA  no check  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i    64    2    4   1         NA      7.541       284.765            NA            NA  no check  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i    32    2    4   1         NA     13.589       158.030            NA            NA  no check  

All tests passed: gemm

salloc: Relinquishing job allocation 2110895

Mark Gates
May 31, 2023, 10:51:45 AM
to Ilkhom Abdurakhmanov, SLATE User
Hi Ilkhom,

Thanks for including the job input and output. That makes diagnosing issues easy.

The way you ran it, SLATE is running on the CPU, single threaded. Note the "OpenMP threads 1" in the output, and that the target is task (a.k.a. HostTask). You definitely want to set the number of OpenMP threads to the number of CPU cores to use. For instance, on Frontier we would use 8 MPI ranks per node (1 per GCD) with 7 OpenMP threads per MPI rank, since there are 8*7 = 56 CPU cores available per node, excluding the 8 CPU cores reserved for the OS.
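As a rough sketch, the corresponding srun resource flags might look like this (site-dependent; these mirror the flags used later in this thread):

# 8 MPI ranks per node, 7 cores per rank, the closest GCD bound to each rank
srun -N 1 -n 8 -c 7 --threads-per-core=1 --gpu-bind=closest ./tester [options]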

To target the GPUs, set the --origin and --target flags to d for device. Typical block sizes for the GPU are 256 to 1024; small sizes like 64 or 32 will perform very poorly. For instance, this would test a range of matrix sizes and block sizes:

export OMP_NUM_THREADS=7

# for Device (GPU)
srun [options] ./tester --origin d --target d --dim 10240:102400:10240 --nb 256:1024:64 --check n --ref n gemm

# for Host (CPU), same as --target t for HostTask
srun [options] ./tester --origin h --target h --dim 10240:102400:10240 --nb 256:1024:64 --check n --ref n gemm

You may also want to enable GPU-aware MPI. In the latest SLATE master, there is a `gpu_aware_mpi` flag to enable that at compile time. We expect a new release soon with this feature. There may be runtime flags that need to be set as well. E.g., on Frontier:
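A likely candidate is Cray MPICH's GPU support variable (given here as an assumption, not quoted from the original message; confirm with the site documentation):

# enable GPU-aware MPI in Cray MPICH (assumed setting; check your site's docs)
export MPICH_GPU_SUPPORT_ENABLED=1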

Mark

--
Innovative Computing Laboratory
University of Tennessee, Knoxville

Ilkhom Abdurakhmanov
Jun 8, 2023, 1:18:21 AM
to SLATE User, mga...@icl.utk.edu, SLATE User, Ilkhom Abdurakhmanov
Hi Mark,
Thank you very much for your suggestion. The --origin d --target d options really helped with the performance of gemm on Crusher (Setonix is currently undergoing maintenance).

ilkhom@crusher128:/lustre/orion/csc519/scratch/ilkhom/slate/test> srun -N 1 -n 8 -c 7 --threads-per-core=1 --gpus-per-node=8 ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm

% SLATE version 2022.07.00, id 8651441a

% input: /lustre/orion/csc519/scratch/ilkhom/slate/test/./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm

% 2023-06-08 01:00:57, MPI size 8, OpenMP threads 7, GPU devices available 1

                                                                                                                                                                                                         

type  origin  target  gemm   go   A   B   C   transA   transB       m       n       k      alpha       beta    nb    p    q  la      error   time (s)       gflop/s  ref time (s)   ref gflop/s  status  

   d     dev     dev  auto  col   1   1   1  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   960    2    4   1   2.96e-16     0.0410     52315.785            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   20480   20480   20480   3.1+1.4i   2.7+1.7i   960    2    4   1   2.31e-16      0.215     79837.298            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   30720   30720   30720   3.1+1.4i   2.7+1.7i   960    2    4   1   2.16e-16      0.571    101623.376            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   40960   40960   40960   3.1+1.4i   2.7+1.7i   960    2    4   1   2.05e-16      1.236    111151.957            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   51200   51200   51200   3.1+1.4i   2.7+1.7i   960    2    4   1   1.82e-16      2.284    117526.851            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   61440   61440   61440   3.1+1.4i   2.7+1.7i   960    2    4   1   2.11e-16      3.751    123645.805            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   71680   71680   71680   3.1+1.4i   2.7+1.7i   960    2    4   1   1.77e-16      5.746    128197.301            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   81920   81920   81920   3.1+1.4i   2.7+1.7i   960    2    4   1   2.21e-16      8.390    131052.569            NA            NA  pass    


% Matrix kinds:

%  1: rand, cond unknown


% All tests passed: gemm 

However, it didn't really help with gesv. First of all, if I set OMP_NUM_THREADS=7, as I did in the case of gemm, I get oversubscription warnings and the code hangs indefinitely.

WARNING: Requested total thread count and/or thread affinity may result in

oversubscription of available CPU resources!  Performance may be degraded.

Explicitly set OMP_WAIT_POLICY=PASSIVE or ACTIVE to suppress this message.

Set CRAY_OMP_CHECK_AFFINITY=TRUE to print detailed thread-affinity messages.

WARNING: Requested total thread count and/or thread affinity may result in

oversubscription of available CPU resources!  Performance may be degraded.

Explicitly set OMP_WAIT_POLICY=PASSIVE or ACTIVE to suppress this message.

Set CRAY_OMP_CHECK_AFFINITY=TRUE to print detailed thread-affinity messages.

With OMP_NUM_THREADS=1 the code runs but significantly slower than gemm:

ilkhom@crusher128:/lustre/orion/csc519/scratch/ilkhom/slate/test> OMP_NUM_THREADS=1 srun -N 1 -n 8 -c 1 --threads-per-core=1 --gpus-per-node=8 ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gesv

% SLATE version 2022.07.00, id 8651441a

% input: /lustre/orion/csc519/scratch/ilkhom/slate/test/./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gesv

% 2023-06-08 01:06:47, MPI size 8, OpenMP threads 1, GPU devices available 1                                 

type  origin  target  gemm     lu  trsm   go   A   B       n    nrhs    nb  ib    p    q  la  pt  thresh      error   time (s)       gflop/s  trs time (s)   trs gflop/s  ref time (s)   ref gflop/s  status  

   d     dev     dev  auto   PPLU  auto  col   1   1   10240      10   960  32    2    4   1   1    1.00   1.25e-19     17.049        42.107            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   20480      10   960  32    2    4   1   1    1.00   9.00e-20      6.563       873.831            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   30720      10   960  32    2    4   1   1    1.00   7.19e-20     12.849      1505.648            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   40960      10   960  32    2    4   1   1    1.00   5.92e-20     22.151      2069.654            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   51200      10   960  32    2    4   1   1    1.00   5.25e-20     34.081      2626.976            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   61440      10   960  32    2    4   1   1    1.00   4.73e-20     48.388      3196.902            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   71680      10   960  32    2    4   1   1    1.00   4.34e-20     65.426      3754.301            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   81920      10   960  32    2    4   1   1    1.00   3.99e-20     85.266      4299.918            NA            NA            NA            NA  pass    

% Matrix kinds:

%  1: rand, cond unknown

% All tests passed: gesv

Should one expect the same level of performance for gesv as in gemm?

Kind regards,

Ilkhom

Mark Gates
Jun 8, 2023, 10:56:51 AM
to Ilkhom Abdurakhmanov, SLATE User
On Thu, Jun 8, 2023 at 1:18 AM Ilkhom Abdurakhmanov <i.abdur...@gmail.com> wrote:
However, it didn't really help with gesv. First of all, if I set OMP_NUM_THREADS=7, as I did in the case of gemm, I get oversubscription warnings and the code hangs indefinitely.

With OMP_NUM_THREADS=1 the code runs but significantly slower than gemm:

Thanks for the feedback. We continue to investigate the best way to run on Cray / AMD systems, and will report back.

 

Should one expect the same level of performance for gesv as in gemm?


No, gemm is always the fastest routine because it's relatively simple. Among factorizations, Cholesky (posv / potrf) should get a good fraction of the gemm performance, especially for large matrices. LU factorization (gesv / getrf) has a lot more to do, including pivot search and swapping rows, which involves extra communication.

I would also check getrf performance instead of gesv performance. The tester output for getrf (factor) also includes getrs (solve) when doing check; their sum is gesv (factor & solve). That gives us a better breakdown of where time is spent. We have recent updates for trsm that improve the solve (getrs) time.
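For instance, a run along these lines (a sketch reusing the flags from your Crusher run, with the check enabled so the solve time is reported as well):

    # assumed example: time the factorization (getrf); with --check y the solve (getrs) time shows up too
    srun [options] ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check y getrf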

You may try using other gesv variants. These probably require the latest master, not the rather outdated 2022 release. (We expect a new release this month.) CALU pushes the panel to the GPU, so it should avoid issues with oversubscribing threads. The NoPiv variant isn't useful for computation (unless your matrix is special, like diagonally dominant), but it gives a ceiling on performance.

    # 1234 is a throw-away warm-up run to initialize GPUs and libraries;
    # it needs to be large enough that all MPI ranks have some computation to do.
    ./tester --dim 1234,10000:20000:1000 --origin d --target d --method-lu PPLU,CALU,NoPiv getrf

Another option you can try is threshold pivoting. We showed that for many cases, using a threshold of 0.5 yields a good performance improvement with little to no effect on accuracy. See:
Threshold Pivoting for Dense LU Factorization, Neil Lindquist et al.

    ./tester --dim 1234,10000 --origin d --target d --method-lu PPLU --thresh 1,0.5,0.1,0.01 getrf

Mark

Mark Gates
Jun 8, 2023, 2:44:59 PM
to Ilkhom Abdurakhmanov, SLATE User
A further comment from Tom Papatheodore at Oak Ridge:

By the way, on Crusher/Frontier, the user should be using --gpus-per-task=1 and --gpu-bind=closest instead of just --gpus-per-node=8. The former will bind each process to a single, ideal GPU, whereas the latter will give all processes access to all GPUs, which I suspect could be causing the poor performance.

However, from what I can see in your results, each MPI rank sees just 1 GPU (GCD) ("GPU devices available 1"), so at least the gpus-per-task doesn't seem to be an issue for your runs:

% 2023-06-08 01:00:57, MPI size 8, OpenMP threads 7, GPU devices available 1
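For example, something along these lines (a sketch; adjust the tester arguments as needed):

    srun -N 1 -n 8 -c 7 --threads-per-core=1 --gpus-per-task=1 --gpu-bind=closest ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm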


Mark


Ilkhom Abdurakhmanov
Jun 13, 2023, 12:54:31 AM
to SLATE User, mga...@icl.utk.edu, SLATE User, Ilkhom Abdurakhmanov
Thank you very much for explaining this and suggesting other methods for solving. I can confirm that the CALU method is the fastest. Now I have two questions.
  1. Can the CALU method be used for any general linear system without limitations?
  2. How can the CALU method be selected through the ScaLAPACK API? (I see there is an environment variable, SLATE_SCALAPACK_TARGET, to make the computations run on the device, but I could not find anything that allows choosing the method.)
Kind regards,
Ilkhom

Mark Gates
Jun 13, 2023, 9:46:51 AM
to Ilkhom Abdurakhmanov, SLATE User
On Tue, Jun 13, 2023 at 12:54 AM Ilkhom Abdurakhmanov <i.abdur...@gmail.com> wrote:
Thank you very much for explaining this and suggesting other methods for solving. I can confirm that the CALU method is the fastest. Now I have two questions.
  1. Can the CALU method be used for any general linear system without limitations?
No, but regular Partial Pivoting LU (PPLU) can't either! There are examples where PPLU has exponential growth, and examples where CALU has exponential growth. The error bound for PPLU is 2^n, which is quite horrible, but in practice the error is much smaller, O( n^{2/3} ). The CALU error bound that has been proved so far is worse, but again in practice it works well.

It's probably a good idea to check your backward error (relative residual),
    || b – Ax || / ( || A || * || x || ),
to see if the method is stable with your matrices. In SLATE's tester we use the 1-norm.
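For example, a run with the check enabled reports that residual in the error column (a sketch; use whatever sizes and methods you are comparing):

    ./tester --dim 1234,10000:20000:1000 --origin d --target d --method-lu PPLU,CALU --check y gesv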

  2. How can the CALU method be selected through the ScaLAPACK API? (I see there is an environment variable, SLATE_SCALAPACK_TARGET, to make the computations run on the device, but I could not find anything that allows choosing the method.)
We will have to update the ScaLAPACK API to accommodate CALU. Thanks for pointing this out.

Mark