Out of memory error in call to tester

bcsj

Jun 8, 2023, 9:18:50 AM6/8/23
to SLATE User
Hi again,

I have continued trying to get SLATE running on LUMI. I've successfully compiled it with CCE and have been trying to test the installation with the tester, in a similar fashion to what was done in the latest conversation here:

However, I get an out-of-memory error from Slurm when I do this.

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3653611.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid007567: task 3: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=3653611.0
slurmstepd: error: *** STEP 3653611.0 ON nid007567 CANCELLED AT 2023-06-08T16:05:25 ***


It manages to complete one test, but the performance on that one is pretty atrocious.

% SLATE version 2022.07.00, id unknown
% input: /pfs/lustrep1/projappl/project_462000224/software/slate/src/test/./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm
% 2023-06-08 16:05:01, MPI size 8, OpenMP threads 7, GPU devices available 8
                                                                                                                                                                                                         
type  origin  target  gemm   go   A   B   C   transA   transB       m       n       k      alpha       beta    nb    p    q  la      error   time (s)       gflop/s  ref time (s)   ref gflop/s  status  
   d     dev     dev  auto  col   1   1   1  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   960    2    4   1         NA     14.886       144.266            NA            NA  no check
  


I compiled SLATE with a make.inc file containing

CXX      = CC
FC       = ftn
blas          = libsci
gpu_backend   = hip
gpu_aware_mpi = 1
hip_arch      = gfx90a
mpi           = cray


and ran the tester using

#!/bin/bash -l
#SBATCH --job-name=slate-test   # Job name
#SBATCH --output=slate-test.o%j # Name of stdout output file
#SBATCH --error=slate-test.e%j  # Name of stderr error file
#SBATCH --partition=dev-g       # Partition (queue) name
#SBATCH --account=project_462000224 # Project for billing
#SBATCH --exclusive
#SBATCH --nodes=1               # Total number of nodes
#SBATCH --ntasks-per-node=8     # 8 MPI ranks per node
#SBATCH --cpus-per-task=7
#SBATCH --threads-per-core=1
#SBATCH --gpus-per-node=8       # Allocate one gpu per MPI rank
#SBATCH --time=01:00:00         # Run time (d-hh:mm:ss)

module load slate/2022.07.00-cce

export LD_LIBRARY_PATH=../testsweeper/:$LD_LIBRARY_PATH

export OMP_NUM_THREADS=7
export MPICH_GPU_SUPPORT_ENABLED=1

srun ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm

Any suggestions for what might be going wrong with this?


Mark Gates

Jun 8, 2023, 11:23:26 AM6/8/23
to bcsj, SLATE User
I assume you are running 8 MPI ranks on one node that has 8 GPUs (here GPU == GCD), intending each MPI rank to use 1 GPU. That's how we run on Frontier. In your run, it says "MPI size 8, OpenMP threads 7, GPU devices available 8", which means each MPI rank is seeing ALL 8 GPUs. That is, every GPU is being shared by all 8 MPI ranks, which will lead to MPI ranks thrashing the GPUs. You need to change the SBATCH or srun options so each MPI rank sees only 1 GPU. I think that will greatly improve your results. Something like:

    srun --ntasks-per-gpu=1 ...

Consult with the LUMI help desk to figure out how to do that on LUMI.
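
For reference, a minimal sketch of what that might look like with plain Slurm options; whether Slurm sets ROCR_VISIBLE_DEVICES per task this way is an assumption to verify on LUMI:

    # Sketch: request 1 GPU per task so Slurm restricts each rank's view to a single GCD,
    # e.g. by setting ROCR_VISIBLE_DEVICES per task (verify this behavior on LUMI).
    srun --nodes=1 --ntasks-per-node=8 --gpus-per-task=1 \
        ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 gemm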

We got good performance on Frontier earlier this year (March), but have recently noticed a significant drop in performance for the same SLATE code. It's unclear what may have changed to affect the performance. We will report back if we resolve the issue.

Which version of SLATE are you using, or how are you getting it? It says "id unknown", which is unusual. Usually the id is either from the release or the current Git ID.

Mark

bcsj

Jun 8, 2023, 12:45:23 PM6/8/23
to SLATE User, mga...@icl.utk.edu, SLATE User, bcsj
TL;DR: I managed to get it running, but for some reason the "status" for the tests isn't "pass" but instead "no check". 
How do I make it perform the check?

Yes, there are 8 GPUs; well, 4 physical GPUs, but each behaves like 2 GPUs (GCDs), so effectively 8.
As I understand it, LUMI and Frontier have exactly the same node hardware (CPUs and GPUs); Frontier just has more nodes.

My version should be the latest, I assume? I followed the instructions on GitHub: https://github.com/icl-utk-edu/slate/blob/master/INSTALL.md
git clone --recursive https://github.com/icl-utk-edu/slate.git

I tried setting --ntasks-per-gpu=1, but apparently that conflicts with --ntasks-per-node=8. If I swapped them I got another error instead: https://pastebin.com/Xyyc99PB

I changed the job script based on some things in the LUMI documentation, and I managed to get it running with the following job script:

#!/bin/bash -l
#SBATCH --job-name=slate-test   # Job name
#SBATCH --output=slate-test.o%j # Name of stdout output file
#SBATCH --error=slate-test.e%j  # Name of stderr error file
#SBATCH --partition=standard-g       # Partition (queue) name

#SBATCH --account=project_462000224 # Project for billing
#SBATCH --nodes=1               # Total number of nodes
#SBATCH --ntasks-per-node=8     # 8 MPI ranks per node
#SBATCH --gpus-per-node=8       # Allocate one gpu per MPI rank
#SBATCH --time=01:00:00         # Run time (d-hh:mm:ss)

module load slate/2022.07.00-cce
export LD_LIBRARY_PATH=../testsweeper/:$LD_LIBRARY_PATH

cat << EOF > select_gpu
#!/bin/bash

export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF

chmod +x ./select_gpu

CPU_BIND="mask_cpu:ff000000000000,ff00000000000000"
CPU_BIND="${CPU_BIND},ff0000,ff000000"
CPU_BIND="${CPU_BIND},fe,ff00"
CPU_BIND="${CPU_BIND},ff00000000,ff0000000000"

export OMP_NUM_THREADS=7
export MPICH_GPU_SUPPORT_ENABLED=1

srun --cpu-bind=${CPU_BIND} ./select_gpu ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm
rm -rf ./select_gpu


The output I get is

% SLATE version 2022.07.00, id unknown
% input: ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm
% 2023-06-08 19:30:21, MPI size 8, OpenMP threads 7, GPU devices available 1

type  origin target  gemm   go  A B  C   transA   transB       m       n       k      alpha       beta    nb   p  q  la   error  time (s)      gflop/s  ref time (s)  ref gflop/s  status
   d     dev    dev  auto  col  1 1  1  notrans  notrans   10240   10240   10240   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     6.154      348.936            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   20480   20480   20480   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     0.228    75391.268            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   30720   30720   30720   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     0.597    97196.729            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   40960   40960   40960   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     1.286   106887.975            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   51200   51200   51200   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     2.358   113831.075            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   61440   61440   61440   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     3.815   121579.543            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   71680   71680   71680   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     5.847   125984.152            NA           NA  no check
   d     dev    dev  auto  col  1 1  1  notrans  notrans   81920   81920   81920   3.1+1.4i   2.7+1.7i   960   2  4   1      NA     8.570   128296.549            NA           NA  no check
% Matrix kinds:
%  1: rand, cond unknown

% All tests passed: gemm


Compared to the other conversation (https://groups.google.com/a/icl.utk.edu/g/slate-user/c/k9xUGIHUIH4), the gflop/s are similar at the larger sizes, but at the small end this performs pretty badly compared to that one.

I also don't understand why I get NA for the error here and why the status is "no check".

Mark Gates

Jun 8, 2023, 3:06:54 PM6/8/23
to bcsj, SLATE User
On Thu, Jun 8, 2023 at 12:45 PM bcsj <bjornje...@gmail.com> wrote:
TL;DR: I managed to get it running, but for some reason the "status" for the tests isn't "pass" but instead "no check". 
How do I make it perform the check?

You have `--check n`, which disables the check. Just omit that, or set `--check y`.
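
For example, the same command from your job script with the check enabled:

    srun --cpu-bind=${CPU_BIND} ./select_gpu ./tester --origin d --target d \
        --dim 10240:81920:10240 --nb 960 --check y --ref n gemm

That should fill in the error column and report pass/FAILED in the status column.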


My version should be the latest, I assume? I followed the instructions on GitHub: https://github.com/icl-utk-edu/slate/blob/master/INSTALL.md

Odd that it isn't picking up the Git ID. Are you using CMake or just make (i.e., GNUmakefile)? Not a big deal, but it would be nice to fix if it isn't working for users. Normally with CMake, during the configure step it should say something like:

    -- slate_id = 2db778e6

which then after compiling the tester shows up in the output:

    slate/build> ./test/tester gemm
    SLATE version 2022.07.00, id 2db778e6

It should match the Git commit ID:

    slate/build> git log --oneline -n 1
    2db778e6 (HEAD -> lq) LQ and QR unit test

Likewise, the GNUmakefile sets SLATE_ID during compilation, as long as the .git directory is there.


Compared to the other conversation (https://groups.google.com/a/icl.utk.edu/g/slate-user/c/k9xUGIHUIH4), the gflop/s are similar at the larger sizes, but at the small end this performs pretty badly compared to that one.

As suggested in the other thread, use a dummy dimension to throw away the first test. There's a lot of one-time overhead initializing the GPUs, loading rocBLAS, etc. Achieving 16 Tflop/s per GPU (GCD) is about what we observe, too. Small dimensions won't have quite as good performance because their compute intensity (flop count / memory used) isn't as high.
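
Something along these lines; the comma-separated mix of a single value and a range is the usual testsweeper option syntax, so adjust if your tester version parses it differently:

    # The leading 1000 is a throwaway size that absorbs GPU init and rocBLAS load overhead.
    srun ... ./tester --origin d --target d --dim 1000,10240:81920:10240 --nb 960 --check y gemm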

Mark

bcsj

Jun 8, 2023, 3:24:25 PM6/8/23
to SLATE User, mga...@icl.utk.edu, SLATE User, bcsj
On Thursday, June 8, 2023 at 10:06:54 PM UTC+3 mga...@icl.utk.edu wrote:
On Thu, Jun 8, 2023 at 12:45 PM bcsj <bjornje...@gmail.com> wrote:
TL;DR: I managed to get it running, but for some reason the "status" for the tests isn't "pass" but instead "no check". 
How do I make it perform the check?

You have `--check n`, which disables the check. Just omit that, or set `--check y`.


Ah, of course! I feel the fool now, haha!
 

My version should be the latest, I assume? I followed the instructions on GitHub: https://github.com/icl-utk-edu/slate/blob/master/INSTALL.md

Odd that it isn't picking up the Git ID. Are you using CMake or just make (i.e., GNUmakefile)? Not a big deal, but it would be nice to fix if it isn't working for users. Normally with CMake, during the configure step it should say something like:


I used:
> make
> make install prefix=path/to/dir
when I built; I didn't do any CMake calls.

    -- slate_id = 2db778e6

which then after compiling the tester shows up in the output:

    slate/build> ./test/tester gemm
    SLATE version 2022.07.00, id 2db778e6

It should match the Git commit ID:

    slate/build> git log --oneline -n 1
    2db778e6 (HEAD -> lq) LQ and QR unit test

Likewise, with the GNUmakefile, it sets the SLATE_ID during compilation, unless the .git directory isn't there.


Compared to the other conversation (https://groups.google.com/a/icl.utk.edu/g/slate-user/c/k9xUGIHUIH4), the gflop/s are similar at the larger sizes, but at the small end this performs pretty badly compared to that one.

As suggested in the other thread, use a dummy dimension to throw away the first test. There's a lot of one-time overhead initializing the GPUs, loading rocBLAS, etc. Achieving 16 Tflop/s per GPU (GCD) is about what we observe, too. Small dimensions won't have quite as good performance because their compute intensity (flop count / memory used) isn't as high.


I had an inkling it might be overhead related. Thanks for the advice!
 
Mark
