TL;DR: I managed to get it running, but the "status" column for the tests shows "no check" instead of "pass".
How do I make it perform the check?
Yes, there are 8 GPUs per node. Physically there are 4, but each one behaves like 2 GPUs, so effectively 8.
I changed the job script based on some things in the LUMI documentation, and I managed to get it running with the following job script:
#!/bin/bash -l
#SBATCH --job-name=slate-test # Job name
#SBATCH --output=slate-test.o%j # Name of stdout output file
#SBATCH --error=slate-test.e%j # Name of stderr error file
#SBATCH --partition=standard-g # Partition (queue) name
#SBATCH --account=project_462000224 # Project for billing
#SBATCH --nodes=1 # Total number of nodes
#SBATCH --ntasks-per-node=8 # 8 MPI ranks per node
#SBATCH --gpus-per-node=8 # Allocate one gpu per MPI rank
#SBATCH --time=01:00:00 # Run time (hh:mm:ss)
module load slate/2022.07.00-cce
export LD_LIBRARY_PATH=../testsweeper/:$LD_LIBRARY_PATH
cat << EOF > select_gpu
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./select_gpu
CPU_BIND="mask_cpu:ff000000000000,ff00000000000000"
CPU_BIND="${CPU_BIND},ff0000,ff000000"
CPU_BIND="${CPU_BIND},fe,ff00"
CPU_BIND="${CPU_BIND},ff00000000,ff0000000000"
export OMP_NUM_THREADS=7
export MPICH_GPU_SUPPORT_ENABLED=1
srun --cpu-bind=${CPU_BIND} ./select_gpu ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm
rm -rf ./select_gpu
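To double-check which cores each rank is pinned to, the CPU_BIND masks above can be decoded by hand. Below is a minimal sketch that does this in plain bash. Note that `decode_mask` is a hypothetical helper I wrote for illustration; it is not part of Slurm or the LUMI documentation. It walks the hex string digit by digit, so the 64-bit mask does not overflow bash's signed integers.

```shell
#!/bin/bash
# decode_mask (hypothetical helper): print the core IDs selected by a
# hexadecimal CPU mask such as the ones passed to srun --cpu-bind=mask_cpu:...
decode_mask() {
    local hex=$1 cores=() pos=0 i digit bit
    # Walk the hex digits from the least significant (rightmost) one.
    for (( i = ${#hex} - 1; i >= 0; i-- )); do
        digit=$(( 16#${hex:i:1} ))
        # Each hex digit covers 4 consecutive core IDs.
        for (( bit = 0; bit < 4; bit++ )); do
            if (( (digit >> bit) & 1 )); then
                cores+=( $(( pos * 4 + bit )) )
            fi
        done
        pos=$(( pos + 1 ))
    done
    echo "${cores[*]}"
}

# The masks from the job script, in rank order (local rank 0 first).
masks="ff000000000000,ff00000000000000,ff0000,ff000000,fe,ff00,ff00000000,ff0000000000"
IFS=',' read -r -a mask_list <<< "$masks"
rank=0
for m in "${mask_list[@]}"; do
    echo "rank ${rank}: cores $(decode_mask "$m")"
    rank=$(( rank + 1 ))
done
```

With the masks as written, local rank 4 (mask fe) gets the 7 cores 1-7, while every other rank gets 8 cores, even though OMP_NUM_THREADS is 7 for all of them.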
The output I get is:
% SLATE version 2022.07.00, id unknown
% input: ./tester --origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm
% 2023-06-08 19:30:21, MPI size 8, OpenMP threads 7, GPU devices available 1
type  origin  target  gemm   go  A  B  C   transA   transB      m      n      k     alpha      beta   nb  p  q  la  error  time (s)     gflop/s  ref time (s)  ref gflop/s  status
d        dev     dev  auto  col  1  1  1  notrans  notrans  10240  10240  10240  3.1+1.4i  2.7+1.7i  960  2  4   1     NA     6.154     348.936            NA           NA  no check
d        dev     dev  auto  col  1  1  1  notrans  notrans  20480  20480  20480  3.1+1.4i  2.7+1.7i  960  2  4   1     NA     0.228   75391.268            NA           NA  no check
d        dev     dev  auto  col  1  1  1  notrans  notrans  30720  30720  30720  3.1+1.4i  2.7+1.7i  960  2  4   1     NA     0.597   97196.729            NA           NA  no check
d        dev     dev  auto  col  1  1  1  notrans  notrans  40960  40960  40960  3.1+1.4i  2.7+1.7i  960  2  4   1     NA     1.286  106887.975            NA           NA  no check
d        dev     dev  auto  col  1  1  1  notrans  notrans  51200  51200  51200  3.1+1.4i  2.7+1.7i  960  2  4   1     NA     2.358  113831.075            NA           NA  no check
d        dev     dev  auto  col  1  1  1  notrans  notrans  61440  61440  61440  3.1+1.4i  2.7+1.7i  960  2  4   1     NA     3.815  121579.543            NA           NA  no check
d        dev     dev  auto  col  1  1  1  notrans  notrans  71680  71680  71680  3.1+1.4i  2.7+1.7i  960  2  4   1     NA     5.847  125984.152            NA           NA  no check
d        dev     dev  auto  col  1  1  1  notrans  notrans  81920  81920  81920  3.1+1.4i  2.7+1.7i  960  2  4   1     NA     8.570  128296.549            NA           NA  no check
% Matrix kinds:
% 1: rand, cond unknown
% All tests passed: gemm
I also don't understand why the error column shows NA and the status is "no check" for every run.
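One thing I notice in the srun line is that the tester is called with `--check n --ref n`, which might be what turns the check off. Here is a sketch of what I would try next, assuming (I have not confirmed this) that `--check` also accepts `y` and that this is what produces a numeric error and a pass/fail status. It is only a dry run: it rewrites the argument string and prints the resulting command rather than launching anything.

```shell
#!/bin/bash
# Arguments exactly as used in the job script above.
orig_args="--origin d --target d --dim 10240:81920:10240 --nb 960 --check n --ref n gemm"

# Assumption: "--check y" is the counterpart of "--check n" and enables the check.
new_args=${orig_args/--check n/--check y}

# Dry run: print the command that would go into the job script.
echo "srun --cpu-bind=\${CPU_BIND} ./select_gpu ./tester ${new_args}"
```

If that is indeed the cause, only the `--check` value in the job script needs to change; everything else stays as it is.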