ilkhom@nid002240:/software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test> salloc -p gpu -A pawsey0001-gpu --exclusive -t 01:00:00 -N 1 srun -n 8 --gpu-bind=closest ./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gesv
salloc: Pending job allocation 2110883
salloc: job 2110883 queued and waiting for resources
salloc: job 2110883 has been allocated resources
salloc: Granted job allocation 2110883
salloc: Waiting for resource configuration
salloc: Nodes nid002242 are ready for job
SLATE version 2022.07.00, id 67aa47aa
input: /software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test/./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gesv
2023-05-31 21:48:41, MPI size 8, OpenMP threads 1
type origin target lu go n nrhs nb ib p q la pt thresh error time (s) gflop/s trs time (s) trs gflop/s ref time (s) ref gflop/s status
d host task PPLU col 10240 10 512 32 2 4 1 1 1.00 NA 3.959 181.314 NA NA NA NA no check
d host task PPLU col 10240 10 256 32 2 4 1 1 1.00 NA 3.245 221.199 NA NA NA NA no check
d host task PPLU col 10240 10 128 32 2 4 1 1 1.00 NA 2.936 244.542 NA NA NA NA no check
d host task PPLU col 10240 10 64 32 2 4 1 1 1.00 NA 3.478 206.401 NA NA NA NA no check
d host task PPLU col 10240 10 32 32 2 4 1 1 1.00 NA 6.251 114.835 NA NA NA NA no check
All tests passed: gesv
salloc: Relinquishing job allocation 2110883
ilkhom@nid002240:/software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test> salloc -p gpu -A pawsey0001-gpu --exclusive -t 01:00:00 -N 1 srun -n 8 --gpu-bind=closest ./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gemm
salloc: Pending job allocation 2110895
salloc: job 2110895 queued and waiting for resources
salloc: job 2110895 has been allocated resources
salloc: Granted job allocation 2110895
salloc: Waiting for resource configuration
salloc: Nodes nid002242 are ready for job
SLATE version 2022.07.00, id 67aa47aa
input: /software/projects/pawsey0012/ilkhom/attempt10/slate-dev/test/./tester --dim 10240 --nb 512,256,128,64,32 --ref n --check n gemm
2023-05-31 21:50:42, MPI size 8, OpenMP threads 1
type origin target gemm go transA transB m n k alpha beta nb p q la error time (s) gflop/s ref time (s) ref gflop/s status
d host task auto col notrans notrans 10240 10240 10240 3.1+1.4i 2.7+1.7i 512 2 4 1 NA 6.095 352.322 NA NA no check
d host task auto col notrans notrans 10240 10240 10240 3.1+1.4i 2.7+1.7i 256 2 4 1 NA 6.450 332.948 NA NA no check
d host task auto col notrans notrans 10240 10240 10240 3.1+1.4i 2.7+1.7i 128 2 4 1 NA 6.408 335.121 NA NA no check
d host task auto col notrans notrans 10240 10240 10240 3.1+1.4i 2.7+1.7i 64 2 4 1 NA 7.541 284.765 NA NA no check
d host task auto col notrans notrans 10240 10240 10240 3.1+1.4i 2.7+1.7i 32 2 4 1 NA 13.589 158.030 NA NA no check
All tests passed: gemm
salloc: Relinquishing job allocation 2110895
ilkhom@crusher128:/lustre/orion/csc519/scratch/ilkhom/slate/test> srun -N 1 -n 8 -c 7 --threads-per-core=1 --gpus-per-node=8 ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm
% SLATE version 2022.07.00, id 8651441a
% input: /lustre/orion/csc519/scratch/ilkhom/slate/test/./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm
% 2023-06-08 01:00:57, MPI size 8, OpenMP threads 7, GPU devices available 1
type origin target gemm go A B C transA transB m n k alpha beta nb p q la error time (s) gflop/s ref time (s) ref gflop/s status
d dev dev auto col 1 1 1 notrans notrans 10240 10240 10240 3.1+1.4i 2.7+1.7i 960 2 4 1 2.96e-16 0.0410 52315.785 NA NA pass
d dev dev auto col 1 1 1 notrans notrans 20480 20480 20480 3.1+1.4i 2.7+1.7i 960 2 4 1 2.31e-16 0.215 79837.298 NA NA pass
d dev dev auto col 1 1 1 notrans notrans 30720 30720 30720 3.1+1.4i 2.7+1.7i 960 2 4 1 2.16e-16 0.571 101623.376 NA NA pass
d dev dev auto col 1 1 1 notrans notrans 40960 40960 40960 3.1+1.4i 2.7+1.7i 960 2 4 1 2.05e-16 1.236 111151.957 NA NA pass
d dev dev auto col 1 1 1 notrans notrans 51200 51200 51200 3.1+1.4i 2.7+1.7i 960 2 4 1 1.82e-16 2.284 117526.851 NA NA pass
d dev dev auto col 1 1 1 notrans notrans 61440 61440 61440 3.1+1.4i 2.7+1.7i 960 2 4 1 2.11e-16 3.751 123645.805 NA NA pass
d dev dev auto col 1 1 1 notrans notrans 71680 71680 71680 3.1+1.4i 2.7+1.7i 960 2 4 1 1.77e-16 5.746 128197.301 NA NA pass
d dev dev auto col 1 1 1 notrans notrans 81920 81920 81920 3.1+1.4i 2.7+1.7i 960 2 4 1 2.21e-16 8.390 131052.569 NA NA pass
% Matrix kinds:
% 1: rand, cond unknown
WARNING: Requested total thread count and/or thread affinity may result in
oversubscription of available CPU resources! Performance may be degraded.
Explicitly set OMP_WAIT_POLICY=PASSIVE or ACTIVE to suppress this message.
Set CRAY_OMP_CHECK_AFFINITY=TRUE to print detailed thread-affinity messages.
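(The warning itself names the knob for silencing the message; presumably setting the wait policy in the environment of the launch is enough, e.g. something like the following untested sketch of the same gemm run. Note this only suppresses the message, it does not change thread placement.)
OMP_WAIT_POLICY=PASSIVE srun -N 1 -n 8 -c 7 --threads-per-core=1 --gpus-per-node=8 ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm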
However, it didn't really help with gesv. First of all, if I set OMP_NUM_THREADS=7, as I did in the case of gemm, I get the oversubscription warnings and the code hangs indefinitely. With OMP_NUM_THREADS=1 the code runs, but significantly more slowly than gemm:
ilkhom@crusher128:/lustre/orion/csc519/scratch/ilkhom/slate/test> OMP_NUM_THREADS=1 srun -N 1 -n 8 -c 1 --threads-per-core=1 --gpus-per-node=8 ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gesv
% SLATE version 2022.07.00, id 8651441a
% input: /lustre/orion/csc519/scratch/ilkhom/slate/test/./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gesv
% 2023-06-08 01:06:47, MPI size 8, OpenMP threads 1, GPU devices available 1
type origin target gemm lu trsm go A B n nrhs nb ib p q la pt thresh error time (s) gflop/s trs time (s) trs gflop/s ref time (s) ref gflop/s status
d dev dev auto PPLU auto col 1 1 10240 10 960 32 2 4 1 1 1.00 1.25e-19 17.049 42.107 NA NA NA NA pass
d dev dev auto PPLU auto col 1 1 20480 10 960 32 2 4 1 1 1.00 9.00e-20 6.563 873.831 NA NA NA NA pass
d dev dev auto PPLU auto col 1 1 30720 10 960 32 2 4 1 1 1.00 7.19e-20 12.849 1505.648 NA NA NA NA pass
d dev dev auto PPLU auto col 1 1 40960 10 960 32 2 4 1 1 1.00 5.92e-20 22.151 2069.654 NA NA NA NA pass
d dev dev auto PPLU auto col 1 1 51200 10 960 32 2 4 1 1 1.00 5.25e-20 34.081 2626.976 NA NA NA NA pass
d dev dev auto PPLU auto col 1 1 61440 10 960 32 2 4 1 1 1.00 4.73e-20 48.388 3196.902 NA NA NA NA pass
d dev dev auto PPLU auto col 1 1 71680 10 960 32 2 4 1 1 1.00 4.34e-20 65.426 3754.301 NA NA NA NA pass
d dev dev auto PPLU auto col 1 1 81920 10 960 32 2 4 1 1 1.00 3.99e-20 85.266 4299.918 NA NA NA NA pass
% Matrix kinds:
% 1: rand, cond unknown
% All tests passed: gesv
Should one expect the same level of performance for gesv as for gemm?
Kind regards,
Ilkhom
By the way, on Crusher/Frontier, the user should be using --gpus-per-task=1 and --gpu-bind=closest instead of just --gpus-per-node=8. The former will bind each process to a single, ideal GPU, whereas the latter will give all processes access to all GPUs, which I suspect could be causing the poor performance.
> % 2023-06-08 01:00:57, MPI size 8, OpenMP threads 7, GPU devices available 1
Mark
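(For concreteness, with the binding Mark suggests, the gesv sweep above would be launched roughly like this; a sketch only, with the tester arguments taken from the earlier run and just the GPU flags swapped:)
OMP_NUM_THREADS=1 srun -N 1 -n 8 -c 1 --threads-per-core=1 --gpus-per-task=1 --gpu-bind=closest ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gesv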
Thank you very much for explaining this and suggesting other methods for solving. I can confirm that the CALU method is the fastest. Now I have two questions.
- Can the CALU method be used for any general linear system without limitations?
- How can the CALU method be selected through the ScaLAPACK API? (I see there is an environment variable, SLATE_SCALAPACK_TARGET, to make the computations run on the device; however, I could not find anything that allows choosing the method.)
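(For reference, the LU variant is a per-run tester parameter; assuming the command-line option name matches the "lu" column in the gesv output above, which is an assumption to be checked against the tester's --help output, a CALU run would look something like the line below. How to request the same method through the ScaLAPACK compatibility API is exactly the open question above.)
OMP_NUM_THREADS=1 srun -N 1 -n 8 -c 1 --threads-per-core=1 --gpus-per-task=1 --gpu-bind=closest ./tester --origin d --target d --lu CALU --dim 10240:81920:10240 --nb 960 gesv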