Testing SLATE performance on AMD MI250 GPUs


Ilkhom Abdurakhmanov
May 31, 2023, 10:02:24 AM
to SLATE User
Hello,

We are trying to test the performance of the SLATE routines gesv and gemm on Setonix. The GPU compute node where we are running the tests has 8 GCDs across its 4 AMD MI250X GPU cards. We suspect we are not getting the expected performance. Would you kindly have a look and give us your assessment?

Here are our results for gesv and gemm:

ilkhom@nid002240:/software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test> salloc -p gpu -A pawsey0001-gpu --exclusive -t 01:00:00 -N 1 srun -n 8 --gpu-bind=closest ./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gesv

salloc: Pending job allocation 2110883

salloc: job 2110883 queued and waiting for resources

salloc: job 2110883 has been allocated resources

salloc: Granted job allocation 2110883

salloc: Waiting for resource configuration

salloc: Nodes nid002242 are ready for job

SLATE version 2022.07.00, id 67aa47aa

input: /software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test/./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gesv

2023-05-31 21:48:41, MPI size 8, OpenMP threads 1

                                                                                                                                                                                          

type  origin  target     lu   go       n    nrhs    nb  ib    p    q  la  pt  thresh      error   time (s)       gflop/s  trs time (s)   trs gflop/s  ref time (s)   ref gflop/s  status  

   d    host    task   PPLU  col   10240      10   512  32    2    4   1   1    1.00         NA      3.959       181.314            NA            NA            NA            NA  no check  

   d    host    task   PPLU  col   10240      10   256  32    2    4   1   1    1.00         NA      3.245       221.199            NA            NA            NA            NA  no check  

   d    host    task   PPLU  col   10240      10   128  32    2    4   1   1    1.00         NA      2.936       244.542            NA            NA            NA            NA  no check  

   d    host    task   PPLU  col   10240      10    64  32    2    4   1   1    1.00         NA      3.478       206.401            NA            NA            NA            NA  no check  

   d    host    task   PPLU  col   10240      10    32  32    2    4   1   1    1.00         NA      6.251       114.835            NA            NA            NA            NA  no check  

All tests passed: gesv

salloc: Relinquishing job allocation 2110883


ilkhom@nid002240:/software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test> salloc -p gpu -A pawsey0001-gpu --exclusive -t 01:00:00 -N 1 srun -n 8 --gpu-bind=closest ./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gemm

salloc: Pending job allocation 2110895

salloc: job 2110895 queued and waiting for resources

salloc: job 2110895 has been allocated resources

salloc: Granted job allocation 2110895

salloc: Waiting for resource configuration

salloc: Nodes nid002242 are ready for job

SLATE version 2022.07.00, id 67aa47aa

input: /software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test/./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gemm

2023-05-31 21:50:42, MPI size 8, OpenMP threads 1

                                                                                                                                                                                             

type  origin  target  gemm   go   transA   transB       m       n       k      alpha       beta    nb    p    q  la      error   time (s)       gflop/s  ref time (s)   ref gflop/s  status  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   512    2    4   1         NA      6.095       352.322            NA            NA  no check  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   256    2    4   1         NA      6.450       332.948            NA            NA  no check  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   128    2    4   1         NA      6.408       335.121            NA            NA  no check  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i    64    2    4   1         NA      7.541       284.765            NA            NA  no check  

   d    host    task  auto  col  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i    32    2    4   1         NA     13.589       158.030            NA            NA  no check  

All tests passed: gemm

salloc: Relinquishing job allocation 2110895

Mark Gates
May 31, 2023, 10:51:45 AM
to Ilkhom Abdurakhmanov, SLATE User
Hi Ilkhom,

Thanks for including the job input and output. That makes diagnosing issues easy.

The way you ran it, SLATE is running on the CPU, single threaded. Note the "OpenMP threads 1" in the output, and that the target is task (a.k.a. HostTask). You definitely want to set the number of OpenMP threads to the number of CPU cores to use. For instance, on Frontier we would use 8 MPI ranks per node (1 per GCD) with 7 OpenMP threads per MPI rank, since there are 8*7 = 56 CPU cores available per node, excluding the 8 CPU cores reserved for the OS.
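As a rough sketch, the corresponding srun resource flags might look like this (site-dependent; these mirror the flags used later in this thread):

# 8 MPI ranks per node, 7 cores per rank, the closest GCD bound to each rank
srun -N 1 -n 8 -c 7 --threads-per-core=1 --gpu-bind=closest ./tester [options]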

To target the GPUs, set the --origin and --target flags to d for device. Typical block sizes for the GPU are 256 to 1024; small sizes like 64 or 32 will perform very poorly. For instance, this would test a range of matrix sizes and block sizes:

export OMP_NUM_THREADS=7

# for Device (GPU)
srun [options] ./tester --origin d --target d --dim 10240:102400:10240 --nb 256:1024:64 --check n --ref n gemm

# for Host (CPU), same as --target t for HostTask
srun [options] ./tester --origin h --target h --dim 10240:102400:10240 --nb 256:1024:64 --check n --ref n gemm

You may also want to enable GPU-aware MPI. In the latest SLATE master, there is a `gpu_aware_mpi` flag to enable that at compile time. We expect a new release soon with this feature. There may be runtime flags that need to be set as well. E.g., on Frontier:
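A likely candidate is Cray MPICH's GPU support variable (given here as an assumption, not quoted from the original message; confirm with the site documentation):

# enable GPU-aware MPI in Cray MPICH (assumed setting; check your site's docs)
export MPICH_GPU_SUPPORT_ENABLED=1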

Mark

--
Innovative Computing Laboratory
University of Tennessee, Knoxville

Ilkhom Abdurakhmanov
Jun 8, 2023, 1:18:21 AM
to SLATE User, mga...@icl.utk.edu, SLATE User, Ilkhom Abdurakhmanov
Hi Mark,
Thank you very much for your suggestion. The --origin d --target d options really helped with the performance of gemm on Crusher (Setonix is currently undergoing maintenance).

ilkhom@crusher128:/lustre/orion/csc519/scratch/ilkhom/slate/test> srun -N 1 -n 8 -c 7 --threads-per-core=1 --gpus-per-node=8 ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm

% SLATE version 2022.07.00, id 8651441a

% input: /lustre/orion/csc519/scratch/ilkhom/slate/test/./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm

% 2023-06-08 01:00:57, MPI size 8, OpenMP threads 7, GPU devices available 1

                                                                                                                                                                                                         

type  origin  target  gemm   go   A   B   C   transA   transB       m       n       k      alpha       beta    nb    p    q  la      error   time (s)       gflop/s  ref time (s)   ref gflop/s  status  

   d     dev     dev  auto  col   1   1   1  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   960    2    4   1   2.96e-16     0.0410     52315.785            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   20480   20480   20480   3.1+1.4i   2.7+1.7i   960    2    4   1   2.31e-16      0.215     79837.298            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   30720   30720   30720   3.1+1.4i   2.7+1.7i   960    2    4   1   2.16e-16      0.571    101623.376            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   40960   40960   40960   3.1+1.4i   2.7+1.7i   960    2    4   1   2.05e-16      1.236    111151.957            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   51200   51200   51200   3.1+1.4i   2.7+1.7i   960    2    4   1   1.82e-16      2.284    117526.851            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   61440   61440   61440   3.1+1.4i   2.7+1.7i   960    2    4   1   2.11e-16      3.751    123645.805            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   71680   71680   71680   3.1+1.4i   2.7+1.7i   960    2    4   1   1.77e-16      5.746    128197.301            NA            NA  pass    

   d     dev     dev  auto  col   1   1   1  notrans  notrans   81920   81920   81920   3.1+1.4i   2.7+1.7i   960    2    4   1   2.21e-16      8.390    131052.569            NA            NA  pass    


% Matrix kinds:

%  1: rand, cond unknown


% All tests passed: gemm 

However, it didn't really help with gesv. First of all, if I set OMP_NUM_THREADS=7, as I did in the case of gemm, I get oversubscription warnings and the code hangs indefinitely.

WARNING: Requested total thread count and/or thread affinity may result in

oversubscription of available CPU resources!  Performance may be degraded.

Explicitly set OMP_WAIT_POLICY=PASSIVE or ACTIVE to suppress this message.

Set CRAY_OMP_CHECK_AFFINITY=TRUE to print detailed thread-affinity messages.

WARNING: Requested total thread count and/or thread affinity may result in

oversubscription of available CPU resources!  Performance may be degraded.

Explicitly set OMP_WAIT_POLICY=PASSIVE or ACTIVE to suppress this message.

Set CRAY_OMP_CHECK_AFFINITY=TRUE to print detailed thread-affinity messages.

With OMP_NUM_THREADS=1 the code runs but significantly slower than gemm:

ilkhom@crusher128:/lustre/orion/csc519/scratch/ilkhom/slate/test> OMP_NUM_THREADS=1 srun -N 1 -n 8 -c 1 --threads-per-core=1 --gpus-per-node=8 ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gesv

% SLATE version 2022.07.00, id 8651441a

% input: /lustre/orion/csc519/scratch/ilkhom/slate/test/./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gesv

% 2023-06-08 01:06:47, MPI size 8, OpenMP threads 1, GPU devices available 1                                 

type  origin  target  gemm     lu  trsm   go   A   B       n    nrhs    nb  ib    p    q  la  pt  thresh      error   time (s)       gflop/s  trs time (s)   trs gflop/s  ref time (s)   ref gflop/s  status  

   d     dev     dev  auto   PPLU  auto  col   1   1   10240      10   960  32    2    4   1   1    1.00   1.25e-19     17.049        42.107            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   20480      10   960  32    2    4   1   1    1.00   9.00e-20      6.563       873.831            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   30720      10   960  32    2    4   1   1    1.00   7.19e-20     12.849      1505.648            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   40960      10   960  32    2    4   1   1    1.00   5.92e-20     22.151      2069.654            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   51200      10   960  32    2    4   1   1    1.00   5.25e-20     34.081      2626.976            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   61440      10   960  32    2    4   1   1    1.00   4.73e-20     48.388      3196.902            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   71680      10   960  32    2    4   1   1    1.00   4.34e-20     65.426      3754.301            NA            NA            NA            NA  pass    

   d     dev     dev  auto   PPLU  auto  col   1   1   81920      10   960  32    2    4   1   1    1.00   3.99e-20     85.266      4299.918            NA            NA            NA            NA  pass    

% Matrix kinds:

%  1: rand, cond unknown

% All tests passed: gesv

Should one expect the same level of performance for gesv as in gemm?

Kind regards,

Ilkhom

Mark Gates
Jun 8, 2023, 10:56:51 AM
to Ilkhom Abdurakhmanov, SLATE User
On Thu, Jun 8, 2023 at 1:18 AM Ilkhom Abdurakhmanov <i.abdur...@gmail.com> wrote:
However, it didn't really help with gesv. First of all, if I set OMP_NUM_THREADS=7, as I did in the case of gemm, I get oversubscription warnings and the code hangs indefinitely.

With OMP_NUM_THREADS=1 the code runs but significantly slower than gemm:

Thanks for the feedback. We continue to investigate the best way to run on Cray / AMD systems, and will report back.

 

Should one expect the same level of performance for gesv as in gemm?


No, gemm is always the fastest routine because it's relatively simple. Among factorizations, Cholesky (posv / potrf) should get a good fraction of the gemm performance, especially for large matrices. LU factorization (gesv / getrf) has a lot more to do, including pivot search and swapping rows, which involves extra communication.

I would also check getrf performance instead of gesv performance. The tester output for getrf (factor) also includes getrs (solve) when doing check; their sum is gesv (factor & solve). That gives us a better breakdown of where time is spent. We have recent updates for trsm that improve the solve (getrs) time.
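For instance, a run along these lines (a sketch reusing the flags from your Crusher run, with the check enabled so the solve time is reported as well):

    # assumed example: time the factorization (getrf); with --check y the solve (getrs) time shows up too
    srun [options] ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check y getrf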

You may try using other gesv variants. These probably require the latest master, not the rather outdated 2022 release. (We expect a new release this month.) CALU pushes the panel to the GPU, so it should avoid issues with oversubscribing threads. The NoPiv variant isn't useful for computation (unless your matrix is special, like diagonally dominant), but it gives a ceiling on performance.

    # 1234 is a throw-away warm-up run to initialize GPUs and libraries;
    # it needs to be large enough that all MPI ranks have some computation to do.
    ./tester --dim 1234,10000:20000:1000 --origin d --target d --method-lu PPLU,CALU,NoPiv getrf

Another option you can try is threshold pivoting. We showed that for many cases, using a threshold of 0.5 yields a good performance improvement with little to no effect on accuracy. See:
Threshold Pivoting for Dense LU Factorization, Neil Lindquist et al.

    ./tester --dim 1234,10000 --origin d --target d --method-lu PPLU --thresh 1,0.5,0.1,0.01 getrf

Mark

Mark Gates
Jun 8, 2023, 2:44:59 PM
to Ilkhom Abdurakhmanov, SLATE User
A further comment from Tom Papatheodore at Oak Ridge:

By the way, on Crusher/Frontier, the user should be using --gpus-per-task=1 and --gpu-bind=closest instead of just --gpus-per-node=8. The former will bind each process to a single, ideal GPU, whereas the latter will give all processes access to all GPUs, which I suspect could be causing the poor performance.

However, from what I can see in your results, each MPI rank sees just 1 GPU (GCD) ("GPU devices available 1"), so at least the gpus-per-task doesn't seem to be an issue for your runs:

% 2023-06-08 01:00:57, MPI size 8, OpenMP threads 7, GPU devices available 1
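For example, something along these lines (a sketch; adjust the tester arguments as needed):

    srun -N 1 -n 8 -c 7 --threads-per-core=1 --gpus-per-task=1 --gpu-bind=closest ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm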


Mark


Ilkhom Abdurakhmanov
Jun 13, 2023, 12:54:31 AM
to SLATE User, mga...@icl.utk.edu, SLATE User, Ilkhom Abdurakhmanov
Thank you very much for explaining this and suggesting other methods for solving. I can confirm that the CALU method is the fastest. Now I have two questions.
  1. Can the CALU method be used for any general linear system without limitations?
  2. How can the CALU method be selected through the ScaLAPACK API? (I see there is an environment variable, SLATE_SCALAPACK_TARGET, to make the computations run on the device, but I could not find anything that allows choosing the method.)
Kind regards,
Ilkhom

Mark Gates
Jun 13, 2023, 9:46:51 AM
to Ilkhom Abdurakhmanov, SLATE User
On Tue, Jun 13, 2023 at 12:54 AM Ilkhom Abdurakhmanov <i.abdur...@gmail.com> wrote:
Thank you very much for explaining this and suggesting other methods for solving. I can confirm that the CALU method is the fastest. Now I have two questions.
  1. Can the CALU method be used for any general linear system without limitations?
No, but regular Partial Pivoting LU (PPLU) can't either! There are examples where PPLU has exponential growth, and examples where CALU has exponential growth. The error bound for PPLU is 2^n, which is quite horrible, but in practice the error is much smaller, O( n^{2/3} ). The CALU error bound that has been proved so far is worse, but again in practice it works well.

It's probably a good idea to check your backward error (relative residual),
    || b – Ax || / ( || A || * || x || ),
to see if the method is stable with your matrices. In SLATE's tester we use the 1-norm.
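For example, a run with the check enabled reports that residual in the error column (a sketch; use whatever sizes and methods you are comparing):

    ./tester --dim 1234,10000:20000:1000 --origin d --target d --method-lu PPLU,CALU --check y gesv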

  2. How can the CALU method be selected through the ScaLAPACK API? (I see there is an environment variable, SLATE_SCALAPACK_TARGET, to make the computations run on the device, but I could not find anything that allows choosing the method.)
We will have to update the ScaLAPACK API to accommodate CALU. Thanks for pointing this out.

Mark