I was hoping I could get some advice on how to get the scalapack_api to work properly. I've run into the same issue on both Frontier at OLCF and Perlmutter at NERSC: SLATE compiles and the tester passes with GPU usage, but when I preload libslate_scalapack_api.so as described in the documentation for a program that calls PZHEEVX, SLATE intercepts the ScaLAPACK call yet does not appear to offload to the GPU. Below are my jobscript, the relevant section of my output file from Frontier, and my compilation flags in case they are helpful.
Jobscript:
```bash
#!/bin/bash
#SBATCH --account=cph169
#SBATCH -q debug
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
#SBATCH --time=00:05:00
export OMP_NUM_THREADS=7
export SLURM_CPU_BIND='cores'
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export HDF5_USE_FILE_LOCKING=FALSE
export BGW_HDF5_WRITE_REDIST=1
ulimit -s unlimited
export SLATE_GPU_AWARE_MPI=1
export SLATE_SCALAPACK_TARGET=Devices
export SLATE_DIR=/ccs/home/aaronalt/CODES/slate
export SLATE_SCALAPACK_VERBOSE=1
export LD_LIBRARY_PATH=$SLATE_DIR/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$SLATE_DIR/blaspp/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$SLATE_DIR/lapackpp/lib/:$LD_LIBRARY_PATH
./monitor_gpu.sh &
MONITOR_PID=$!
module load craype-accel-amd-gfx90a
export MPICH_GPU_SUPPORT_ENABLED=1
env LD_PRELOAD=$SLATE_DIR/lib/libslate_scalapack_api.so srun --cpu-bind=cores --gpu-bind=closest /ccs/home/aaronalt/CODES/BerkeleyGW_test/bin/parabands.cplx.x < parabands.inp &> parabands.out
kill $MONITOR_PID
```
In the output I see the interception (for context, the matrix is complex Hermitian of size 15229, and the block size is 512x512):
```
Beginning ScaLAPACK diagonalization. Size: 15229
scalapack_api/scalapack_lanhe.cc:83 slate_planhe(): lanhe
Done ScaLAPACK diagonalization
```
However, the monitor_gpu.sh script above, which simply runs rocm-smi once per second, reports no GPU utilization at any point during the run (the diagonalization took about 80 seconds in this case):
```
======================= ROCm System Management Interface =======================
============================== % time GPU is busy ==============================
GPU[0] : GPU use (%): 0
GPU[0] : GFX Activity: 788775627
GPU[1] : GPU use (%): 0
GPU[1] : GFX Activity: 716303949
GPU[2] : GPU use (%): 0
GPU[2] : GFX Activity: 808625778
GPU[3] : GPU use (%): 0
GPU[3] : GFX Activity: 685484219
GPU[4] : GPU use (%): 0
GPU[4] : GFX Activity: 721907599
GPU[5] : GPU use (%): 0
GPU[5] : GFX Activity: 812915178
GPU[6] : GPU use (%): 0
GPU[6] : GFX Activity: 686269342
GPU[7] : GPU use (%): 0
GPU[7] : GFX Activity: 533777042
================================================================================
============================= End of ROCm SMI Log ==============================
```
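For reference, monitor_gpu.sh is just a one-second polling loop around rocm-smi, roughly the following (approximate reconstruction, not the exact script):

```bash
#!/bin/bash
# Approximate sketch of monitor_gpu.sh: poll rocm-smi once per second
# until the background process is killed (the actual script may differ slightly).
while true; do
    rocm-smi
    sleep 1
done
```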
With the tester I am able to force GPU offloading with the `--origin d --target d` flags (see the sketch below), but I can't seem to find an equivalent way to do that through the scalapack_api.
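The kind of tester run I mean is roughly the following; the routine name, problem size, and launcher flags are approximate, and only the `--origin d --target d` part is the point:

```bash
# Approximate tester invocation that does show GPU activity for me;
# routine name, dimensions, and srun options are illustrative only.
srun -n 16 ./tester --origin d --target d --dim 15229 heev
```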
I compiled with these modules on top of the Frontier defaults:

```bash
module load cray-fftw cray-hdf5-parallel craype-accel-amd-gfx90a rocm cray-python
module swap cce cce/15.0.0
module swap rocm rocm/5.3.0
```
and this make.inc:
```make
CXX = CC
FC = ftn
mpi = 1
blas = libsci
gpu_backend = hip
gpu_aware_mpi = 1
hip_arch = gfx90a
CXXFLAGS = -O3 -std=c++17
FCFLAGS = -O3
prefix = $PWD/install
```
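I then built and installed with SLATE's plain Makefile workflow, roughly as follows (approximate commands; the -j value is arbitrary):

```bash
# Approximate build/install commands using the make.inc above
# (SLATE's Makefile-based build; the parallel job count is arbitrary).
make -j 16
make install
```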
I'd appreciate any advice on this, and am happy to supply more information if needed! Thank you!
Aaron