Out of memory error in call to tester

bcsj

Jun 8, 2023, 9:18:50 AM6/8/23
to SLATE User
Hi again,

I have continued trying to get SLATE running on LUMI. I've successfully compiled it with CCE and have been trying to test the installation with the tester, in a similar fashion to what was done in the latest conversation here:

However, I get an out-of-memory error from Slurm when I do this.

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3653611.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid007567: task 3: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=3653611.0
slurmstepd: error: *** STEP 3653611.0 ON nid007567 CANCELLED AT 2023-06-08T16:05:25 ***


It manages to complete one test, but the performance on that one is pretty atrocious.

% SLATE version 2022.07.00, id unknown
% input: /pfs/lustrep1/projappl/project_462000224/software/slate/src/test/./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm
% 2023-06-08 16:05:01, MPI size 8, OpenMP threads 7, GPU devices available 8
                                                                                                                                                                                                         
type  origin  target  gemm   go   A   B   C   transA   transB       m       n       k      alpha       beta    nb    p    q  la      error   time (s)       gflop/s  ref time (s)   ref gflop/s  status  
   d     dev     dev  auto  col   1   1   1  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   960    2    4   1         NA     14.886       144.266            NA            NA  no check
  


I compiled SLATE with a make.inc file containing

CXX      = CC
FC       = ftn
blas          = libsci
gpu_backend   = hip
gpu_aware_mpi = 1
hip_arch      = gfx90a
mpi           = cray


and ran the tester using

#!/bin/bash -l
#SBATCH --job-name=slate-test   # Job name
#SBATCH --output=slate-test.o%j # Name of stdout output file
#SBATCH --error=slate-test.e%j  # Name of stderr error file
#SBATCH --partition=dev-g       # Partition (queue) name
#SBATCH --account=project_462000224 # Project for billing
#SBATCH --exclusive
#SBATCH --nodes=1               # Total number of nodes
#SBATCH --ntasks-per-node=8     # 8 MPI ranks per node
#SBATCH --cpus-per-task=7
#SBATCH --threads-per-core=1
#SBATCH --gpus-per-node=8       # Allocate one gpu per MPI rank
#SBATCH --time=01:00:00         # Run time (d-hh:mm:ss)

module load slate/2022.07.00-cce

export LD_LIBRARY_PATH=../testsweeper/:$LD_LIBRARY_PATH

export OMP_NUM_THREADS=7
export MPICH_GPU_SUPPORT_ENABLED=1

srun ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm

Any suggestions for what might be going wrong with this?


Mark Gates

Jun 8, 2023, 11:23:26 AM6/8/23
to bcsj, SLATE User
I assume you are running 8 MPI ranks on one node that has 8 GPUs (here GPU == GCD), intending each MPI rank to use 1 GPU. That's how we run on Frontier. In your run, it says "MPI size 8, OpenMP threads 7, GPU devices available 8", which means each MPI rank is seeing ALL 8 GPUs. That is, every GPU is being shared by all 8 MPI ranks, which will lead to MPI ranks thrashing the GPUs. You need to change the SBATCH or srun options so each MPI rank sees only 1 GPU. I think that will greatly improve your results. Something like:

    srun --ntasks-per-gpu=1 ...

Consult with the LUMI help desk to figure out how to do that on LUMI.
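
For reference, a minimal sketch of what that might look like with plain Slurm options; whether Slurm sets ROCR_VISIBLE_DEVICES per task this way is an assumption to verify on LUMI:

    # Sketch: request 1 GPU per task so Slurm restricts each rank's view to a single GCD,
    # e.g. by setting ROCR_VISIBLE_DEVICES per task (verify this behavior on LUMI).
    srun --nodes=1 --ntasks-per-node=8 --gpus-per-task=1 \
        ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm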

We got good performance on Frontier earlier this year (March), but have recently noticed a significant drop in performance for the same SLATE code. It's unclear what may have changed to affect the performance. We will report back if we resolve the issue.

Which version of SLATE are you using, or how are you getting it? It says "id unknown", which is unusual. Usually the id is either from the release or the current Git ID.

Mark

bcsj

Jun 8, 2023, 12:45:23 PM6/8/23
to SLATE User, mga...@icl.utk.edu, SLATE User, bcsj
TL;DR: I managed to get it running, but for some reason the "status" for the tests isn't "pass" but instead "no check". 
How do I make it perform the check?

Yes, there are 8 GPUs; well, 4 physical GPUs, but each behaves like 2 GPUs (GCDs), so effectively 8.
As I understand it, LUMI and Frontier have exactly the same node hardware (CPUs and GPUs); Frontier just has more nodes.

My version should be the latest, I assume? I followed the instructions on GitHub: https://github.com/icl-utk-edu/slate/blob/master/INSTALL.md
git clone --recursive https://github.com/icl-utk-edu/slate.git

I tried setting --ntasks-per-gpu=1, but apparently that conflicts with --ntasks-per-node=8. If I swapped them I got another error instead: https://pastebin.com/Xyyc99PB

I changed the job script based on some things in the LUMI documentation, and I managed to get it running with the following job script:

#!/bin/bash -l
#SBATCH --job-name=slate-test   # Job name
#SBATCH --output=slate-test.o%j # Name of stdout output file
#SBATCH --error=slate-test.e%j  # Name of stderr error file
#SBATCH --partition=standard-g       # Partition (queue) name

#SBATCH --account=project_462000224 # Project for billing
#SBATCH --nodes=1               # Total number of nodes
#SBATCH --ntasks-per-node=8     # 8 MPI ranks per node
#SBATCH --gpus-per-node=8       # Allocate one gpu per MPI rank
#SBATCH --time=01:00:00         # Run time (d-hh:mm:ss)

module load slate/2022.07.00-cce
export LD_LIBRARY_PATH=../testsweeper/:$LD_LIBRARY_PATH

cat << EOF > select_gpu
#!/bin/bash

export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF

chmod +x ./select_gpu

CPU_BIND="mask_cpu:ff000000000000,ff00000000000000"
CPU_BIND="${CPU_BIND},ff0000,ff000000"
CPU_BIND="${CPU_BIND},fe,ff00"
CPU_BIND="${CPU_BIND},ff00000000,ff0000000000"

export OMP_NUM_THREADS=7
export MPICH_GPU_SUPPORT_ENABLED=1

srun --cpu-bind=${CPU_BIND} ./select_gpu ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm
rm -rf ./select_gpu


The output I get is

% SLATE version 2022.07.00, id unknown
% input: ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm
% 2023-06-08 19:30:21, MPI size 8, OpenMP threads 7, GPU devices available 1

type  origin target  gemm   go  A B  C   transA   transB       m       n       k      alpha       beta    nb   p  q  la   error  time (s)      gflop/s  ref time (s)  ref gflop/s  status
   d     dev    dev  auto  col  1 1  1  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     6.154      348.936            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   20480   20480   20480   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     0.228    75391.268            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   30720   30720   30720   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     0.597    97196.729            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   40960   40960   40960   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     1.286   106887.975            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   51200   51200   51200   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     2.358   113831.075            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   61440   61440   61440   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     3.815   121579.543            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   71680   71680   71680   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     5.847   125984.152            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   81920   81920   81920   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     8.570   128296.549            NA           NA  no check
% Matrix kinds:
%  1: rand, cond unknown

% All tests passed: gemm


Compared to the other conversation (https://groups.google.com/a/icl.utk.edu/g/slate-user/c/k9xUGIHUIH4), the gflop/s are similar at the larger sizes, but at the small end this performs pretty badly compared to that one.

I also don't understand why I get NA for the error here and why the status is "no check".

Mark Gates

Jun 8, 2023, 3:06:54 PM6/8/23
to bcsj, SLATE User
On Thu, Jun 8, 2023 at 12:45 PM bcsj <bjornje...@gmail.com> wrote:
TL;DR: I managed to get it running, but for some reason the "status" for the tests isn't "pass" but instead "no check". 
How do I make it perform the check?

You have `--check n`, which disables the check. Just omit that, or set `--check y`.
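
For example, the same command from your job script with the check enabled:

    srun --cpu-bind=${CPU_BIND} ./select_gpu ./tester --origin d --target d \
        --dim 10240:81920:10240 --nb 960 --check y --ref n gemm

That should fill in the error column and report pass/FAILED in the status column.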


My version should be the latest, I assume? I followed the instructions on GitHub: https://github.com/icl-utk-edu/slate/blob/master/INSTALL.md

Odd that it isn't picking up the Git ID. Are you using CMake or just make (i.e., GNUmakefile)? Not a big deal, but it would be nice to fix if it isn't working for users. Normally with CMake, during the configure step it should say something like:

    -- slate_id = 2db778e6

which then after compiling the tester shows up in the output:

    slate/build> ./test/tester gemm
    SLATE version 2022.07.00, id 2db778e6

It should match the Git commit ID:

    slate/build> git log --oneline -n 1
    2db778e6 (HEAD -> lq) LQ and QR unit test

Likewise, the GNUmakefile sets SLATE_ID during compilation, as long as the .git directory is there.


Compared to the other conversation (https://groups.google.com/a/icl.utk.edu/g/slate-user/c/k9xUGIHUIH4), the gflop/s are similar at the larger sizes, but at the small end this performs pretty badly compared to that one.

As suggested in the other thread, use a dummy dimension to throw away the first test. There's a lot of one-time overhead initializing the GPUs, loading rocBLAS, etc. Achieving 16 Tflop/s per GPU (GCD) is about what we observe, too. Small dimensions won't have quite as good performance because their compute intensity (flop count / memory used) isn't as high.
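
Something along these lines; the comma-separated mix of a single value and a range is the usual testsweeper option syntax, so adjust if your tester version parses it differently:

    # The leading 1000 is a throwaway size that absorbs GPU init and rocBLAS load overhead.
    srun ... ./tester --origin d --target d --dim 1000,10240:81920:10240 --nb 960 --check y gemm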

Mark

bcsj

Jun 8, 2023, 3:24:25 PM6/8/23
to SLATE User, mga...@icl.utk.edu, SLATE User, bcsj
On Thursday, June 8, 2023 at 10:06:54 PM UTC+3 mga...@icl.utk.edu wrote:
On Thu, Jun 8, 2023 at 12:45 PM bcsj <bjornje...@gmail.com> wrote:
TL;DR: I managed to get it running, but for some reason the "status" for the tests isn't "pass" but instead "no check". 
How do I make it perform the check?

You have `--check n`, which disables the check. Just omit that, or set `--check y`.


Ah, of course! I feel the fool now, haha!
 

My version should be the latest, I assume? I followed the instructions on GitHub: https://github.com/icl-utk-edu/slate/blob/master/INSTALL.md

Odd that it isn't picking up the Git ID. Are you using CMake or just make (i.e., GNUmakefile)? Not a big deal, but it would be nice to fix if it isn't working for users. Normally with CMake, during the configure step it should say something like:


I used:
> make
> make install prefix=path/to/dir
when I built; I didn't do any CMake calls.

    -- slate_id = 2db778e6

which then after compiling the tester shows up in the output:

    slate/build> ./test/tester gemm
    SLATE version 2022.07.00, id 2db778e6

It should match the Git commit ID:

    slate/build> git log --oneline -n 1
    2db778e6 (HEAD -> lq) LQ and QR unit test

Likewise, with the GNUmakefile, it sets the SLATE_ID during compilation, unless the .git directory isn't there.


Compared to the other conversation (https://groups.google.com/a/icl.utk.edu/g/slate-user/c/k9xUGIHUIH4), the gflop/s are similar at the larger sizes, but at the small end this performs pretty badly compared to that one.

As suggested in the other thread, use a dummy dimension to throw away the first test. There's a lot of one-time overhead initializing the GPUs, loading rocBLAS, etc. Achieving 16 Tflop/s per GPU (GCD) is about what we observe, too. Small dimensions won't have quite as good performance because their compute intensity (flop count / memory used) isn't as high.


I had an inkling it might be overhead related. Thanks for the advice!
 
Mark
