scalapack_api not using GPUs

Aaron Altman

Dec 4, 2024, 7:39:41 PM
to SLATE User
Hello,

I was hoping I could get some advice on how to get the scalapack_api to work properly. I've run into the same issue on both Frontier at OLCF and Perlmutter at NERSC: SLATE compiles and its tester passes with GPU usage, but when I use libslate_scalapack_api.so as described in the documentation with a program that calls PZHEEVX, SLATE intercepts the ScaLAPACK call but does not seem to offload to the GPU. Below are my jobscript and a section of my output file from Frontier, as well as the compilation flags in case those are helpful. I'd appreciate any advice!

Jobscript:
#!/bin/bash
#SBATCH --account=cph169
#SBATCH -q debug
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
#SBATCH --time=00:05:00

export OMP_NUM_THREADS=7
export SLURM_CPU_BIND='cores'
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export HDF5_USE_FILE_LOCKING=FALSE
export BGW_HDF5_WRITE_REDIST=1
ulimit -s unlimited
export SLATE_GPU_AWARE_MPI=1
export SLATE_SCALAPACK_TARGET=Devices
export SLATE_DIR=/ccs/home/aaronalt/CODES/slate
export SLATE_SCALAPACK_VERBOSE=1
export LD_LIBRARY_PATH=$SLATE_DIR/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$SLATE_DIR/blaspp/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$SLATE_DIR/lapackpp/lib/:$LD_LIBRARY_PATH

./monitor_gpu.sh &
MONITOR_PID=$!

module load craype-accel-amd-gfx90a
export MPICH_GPU_SUPPORT_ENABLED=1

env LD_PRELOAD=$SLATE_DIR/lib/libslate_scalapack_api.so srun --cpu-bind=cores --gpu-bind=closest /ccs/home/aaronalt/CODES/BerkeleyGW_test/bin/parabands.cplx.x < parabands.inp &> parabands.out

kill $MONITOR_PID

In the output I see the interception (for context, the matrix is complex hermitian, rank 15229, and the block size is 512x512):

 Beginning ScaLAPACK diagonalization. Size: 15229
scalapack_api/scalapack_lanhe.cc:83 slate_planhe(): lanhe
 Done ScaLAPACK diagonalization

However, the monitor_gpu.sh script above, which simply runs rocm-smi once per second, reports no GPU utilization at any point during the run (diagonalization took about 80 sec in this case):

======================= ROCm System Management Interface =======================
============================== % time GPU is busy ==============================
GPU[0]          : GPU use (%): 0
GPU[0]          : GFX Activity: 788775627
GPU[1]          : GPU use (%): 0
GPU[1]          : GFX Activity: 716303949
GPU[2]          : GPU use (%): 0
GPU[2]          : GFX Activity: 808625778
GPU[3]          : GPU use (%): 0
GPU[3]          : GFX Activity: 685484219
GPU[4]          : GPU use (%): 0
GPU[4]          : GFX Activity: 721907599
GPU[5]          : GPU use (%): 0
GPU[5]          : GFX Activity: 812915178
GPU[6]          : GPU use (%): 0
GPU[6]          : GFX Activity: 686269342
GPU[7]          : GPU use (%): 0
GPU[7]          : GFX Activity: 533777042
================================================================================
============================= End of ROCm SMI Log ==============================

With the tester I am able to force GPU offloading with the `--origin d --target d` flags, but I can't seem to do that in general. I compiled with these modules on top of the Frontier defaults:

module load cray-fftw cray-hdf5-parallel craype-accel-amd-gfx90a rocm cray-python ; module swap cce cce/15.0.0 ; module swap rocm rocm/5.3.0

and this make.inc:

CXX  = CC
FC   = ftn
mpi = 1
blas = libsci
gpu_backend = hip
gpu_aware_mpi=1
hip_arch = gfx90a
CXXFLAGS = -O3 -std=c++17
FCFLAGS  = -O3
prefix   = $PWD/install

I'd appreciate any advice on this, and am happy to supply more information if needed! Thank you!

Aaron

Mark Gates

Dec 5, 2024, 11:02:18 AM
to Aaron Altman, SLATE User
Hi Aaron,

What ScaLAPACK functions are you calling?

It looks like it intercepted the Hermitian norm (lanhe), which is a fast, O( n^2 ) operation. Since you're saying diagonalization, I expect you're calling heev or heevd, but I don't see that in the output.

Mark

Aaron Altman

Dec 5, 2024, 1:34:10 PM
to SLATE User, mga...@icl.utk.edu, SLATE User, Aaron Altman
Hi Mark,

Thanks for pointing that out. The function being called was PZHEEVX, which I now realize isn't part of the SLATE API; the lanhe call was being intercepted from inside ScaLAPACK's own PZHEEVX code. I switched to PZHEEVD and am now seeing the correct interception:

 Beginning ScaLAPACK diagonalization.  Size: 15229
scalapack_api/scalapack_heevd.cc:118 slate_pheevd(): heevd

I'm now getting segmentation faults, though, and the failures are not completely reproducible. Sometimes I get a segfault with no backtrace, and other times it shows an MPICH error. Is this related to the MPICH_GPU_SUPPORT_ENABLED=1 environment variable that Frontier recommends for GPU-aware MPI? For context, here are the tests I've done so far with their outputs (everything else -- jobscript, inputs, etc. -- is the same as in my original message):

Block size = 128x128:
scalapack_api/scalapack_heevd.cc:118 slate_pheevd(): heevd

 Beginning ScaLAPACK diagonalization. Size: 15229
scalapack_api/scalapack_heevd.cc:118 slate_pheevd(): heevd
Assertion failed in file ../src/include/mpir_request.h at line 459: ((req))->ref_count >= 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x155552e1e13b]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1b44314) [0x15555287d314]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0xcd5f28) [0x155551a0ef28]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0xd210a3) [0x155551a5a0a3]
/opt/cray/pe/lib64/libmpi_cray.so.12(PMPI_Wait+0x54d) [0x155551a5aa9d]
/ccs/home/aaronalt/CODES/slate/lib/libslate.so.1(+0xf002d5) [0x15554ac7e2d5]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x41a2a) [0x15554c38aa2a]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x9e921) [0x15554c3e7921]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x226fe) [0x15554c36b6fe]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x81e05) [0x15554c3cae05]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x84ac5) [0x15554c3cdac5]
/lib64/libpthread.so.0(+0xa6ea) [0x15555364a6ea]
/lib64/libc.so.6(clone+0x41) [0x15554b7e058f]
MPICH ERROR [Rank 5] [job id 2829636.0] [Thu Dec  5 13:21:27 2024] [frontier10198] - Abort(1): Internal error

srun: error: frontier10198: task 0: Segmentation fault
srun: Terminating StepId=2829636.0
slurmstepd: error: *** STEP 2829636.0 ON frontier10198 CANCELLED AT 2024-12-05T13:21:27 ***
srun: error: frontier10198: tasks 1,6: Segmentation fault
srun: error: frontier10198: tasks 2,5: Terminated
srun: error: frontier10198: tasks 3-4,7: Terminated
srun: Force Terminated StepId=2829636.0



Block size = 256x256:
scalapack_api/scalapack_heevd.cc:118 slate_pheevd(): heevd

 Beginning ScaLAPACK diagonalization. Size: 15229
scalapack_api/scalapack_heevd.cc:118 slate_pheevd(): heevd
Assertion failed in file ../src/include/mpir_request.h at line 459: ((req))->ref_count >= 0
Assertion failed in file ../src/mpid/ch4/src/ch4_request.h at line 88: *(&incomplete) >= 0
Assertion failed in file ../src/mpid/ch4/src/ch4_request.h at line 88: *(&incomplete) >= 0
Assertion failed in file ../src/mpid/ch4/src/ch4_request.h at line 88: *(&incomplete) >= 0
Assertion failed in file ../src/mpid/ch4/src/ch4_request.h at line 88: *(&incomplete) >= 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x155552e1e13b]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1b44314) [0x15555287d314]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0xcd5f28) [0x155551a0ef28]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0xd210a3) [0x155551a5a0a3]
/opt/cray/pe/lib64/libmpi_cray.so.12(PMPI_Wait+0x54d) [0x155551a5aa9d]
/ccs/home/aaronalt/CODES/slate/lib/libslate.so.1(+0xf002f2) [0x15554ac7e2f2]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x41a2a) [0x15554c38aa2a]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x9e921) [0x15554c3e7921]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x226fe) [0x15554c36b6fe]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x81e05) [0x15554c3cae05]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x84ac5) [0x15554c3cdac5]
/lib64/libpthread.so.0(+0xa6ea) [0x15555364a6ea]
/lib64/libc.so.6(clone+0x41) [0x15554b7e058f]
MPICH ERROR [Rank 2] [job id 2829648.0] [Thu Dec  5 13:25:10 2024] [frontier10249] - Abort(1): Internal error



Thanks!!

Mark Gates

Dec 5, 2024, 3:02:34 PM
to Aaron Altman, SLATE User
Hi Aaron,

Try without GPU-aware MPI. We've had difficulties with that. I'm not convinced that the MPI + ROCm stack is thread safe.
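Concretely, the switches involved would be something like the following (variable names taken from the jobscript earlier in this thread; whether the runtime variables alone suffice, or a rebuild without gpu_aware_mpi=1 in make.inc is also needed, is worth testing):

```shell
# Disable GPU-aware MPI for a test run; SLATE should then fall back to
# staging data through host memory for communication.
unset SLATE_GPU_AWARE_MPI
export MPICH_GPU_SUPPORT_ENABLED=0
```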

Though, to be honest, the eigenvalue routines don't scale well; that's something we want to work on.

What is your target matrix size and # nodes? Are you computing the whole spectrum or just a portion (like the largest 5% of eigvals & eigvecs)?

Mark

Aaron Altman

Dec 5, 2024, 3:36:03 PM
to SLATE User, mga...@icl.utk.edu, SLATE User, Aaron Altman
Hi Mark, 

It still doesn't work without GPU-aware MPI. Here is the new error message:

scalapack_api/scalapack_heevd.cc:118 slate_pheevd(): heevd
srun: error: frontier00497: task 0: Bus error
srun: Terminating StepId=2830042.0
slurmstepd: error: *** STEP 2830042.0 ON frontier00497 CANCELLED AT 2024-12-05T15:29:05 ***
srun: error: frontier00497: tasks 1-7: Terminated
srun: Force Terminated StepId=2830042.0

I'm OK with it not scaling that well as long as it works, since SLATE is one of the only options for AMD GPUs. Ideally I would routinely perform diagonalizations around 200k x 200k, computing the entire spectrum, and occasionally go up to 400k x 400k. I'm happy to use as many nodes as needed; I would imagine on the order of 50-100 nodes for a 200k x 200k matrix.
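For scale, here is my own back-of-envelope storage estimate for those sizes (assuming complex double at 16 bytes per element and ignoring workspace; `matrix_gib` is just an illustrative name):

```python
# Storage for one n x n complex-double matrix (16 bytes per element).
# Eigenvectors roughly double this, and workspace adds more on top.
def matrix_gib(n, bytes_per_elem=16):
    return n * n * bytes_per_elem / 2**30

for n in (15_229, 200_000, 400_000):
    print(f"n = {n:>7}: {matrix_gib(n):8.1f} GiB")
# n = 15,229 is ~3.5 GiB; n = 200,000 is ~596 GiB; n = 400,000 is ~2,384 GiB.
```

So even 400k x 400k fits easily in the aggregate memory of 50-100 Frontier nodes (each node has roughly 512 GB of HBM); the node count would be driven by compute time rather than capacity.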

Thanks,
Aaron

Aaron Altman

Dec 20, 2024, 12:28:16 PM
to SLATE User, Aaron Altman, mga...@icl.utk.edu, SLATE User
Hi Mark,

A brief update: we managed to get the library to work with 1 OMP thread and a square processor grid. It doesn't seem to be significantly faster than the CPU-only ScaLAPACK diagonalization on Frontier for a 15k matrix (maybe 10% faster), but hopefully this will improve with problem size.

Thanks for your help!

Aaron

Mark Gates

Dec 22, 2024, 4:18:25 PM
to Aaron Altman, SLATE User
Hi Aaron,

With only 1 OpenMP thread, I would not expect good performance for SLATE on Hermitian eigenvalues. The 2nd stage is CPU-only and depends on a multi-threaded algorithm.
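For the CPU-bound 2nd stage you want the thread settings from your jobscript back in effect, e.g. (values copied from the jobscript earlier in this thread, 7 cores per rank on Frontier):

```shell
# OpenMP settings for SLATE's multi-threaded CPU stages; with
# OMP_NUM_THREADS=1 the 2nd-stage reduction runs serially.
export OMP_NUM_THREADS=7
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
```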

The implementation currently requires a square process grid. Sorry if that wasn't clear; I'll clarify that in the documentation. That's a restriction that we would like to remove, but isn't easily removed in the current setup. The eigenvalue algorithm is definitely something that I am looking to improve in the next year.
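For example, a quick way to pick the largest square grid for a given rank count (a sketch; `square_grid` is just an illustrative name):

```python
import math

def square_grid(nprocs):
    """Largest p such that a p-by-p process grid fits in nprocs MPI ranks."""
    p = math.isqrt(nprocs)
    return p, p

print(square_grid(16))  # 2 Frontier nodes x 8 ranks -> (4, 4), all ranks used
print(square_grid(14))  # -> (3, 3), leaving 5 ranks idle
```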

Mark

Aaron Altman

Dec 26, 2024, 12:33:53 PM
to SLATE User, mga...@icl.utk.edu, SLATE User, Aaron Altman
Hi Mark, 

I see, thanks for that information. It's possible I missed it in the docs too, but I'm glad we got the square grid worked out now, and it's easy to account for. Is there any special compilation flag required to get threading to work? I am using `CXXFLAGS += -DSLATE_HAVE_MT_BCAST` in my make.inc but keep getting MPI errors when I use more than 1 thread (I have tested 1, 2, and 7 threads on Frontier; only 1 works). For context, the errors look like the one below, in case it's helpful (this case is 2 threads):

scalapack_api/scalapack_heevd.cc:118 slate_pheevd(): heevd
Assertion failed in file ../src/mpid/ch4/src/ch4_impl.h at line 128: *(&incomplete) >= 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x155552e1e13b]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1b44314) [0x15555287d314]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0xcfbd31) [0x155551a34d31]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0xd2107b) [0x155551a5a07b]
/opt/cray/pe/lib64/libmpi_cray.so.12(PMPI_Wait+0x54d) [0x155551a5aa9d]
/ccs/home/aaronalt/CODES/slate/lib/libslate.so.1(_ZN5slate10BaseMatrixISt7complexIdEE15tileIbcastToSetEllRKSt3setIiSt4lessIiESaIiEEiiN4blas6LayoutERSt6vectorIiS7_ENS_6TargetE+0x66f) [0x15554a5cb86f]
/ccs/home/aaronalt/CODES/slate/lib/libslate.so.1(+0xf4b4d4) [0x15554acc84d4]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x41a2a) [0x15554c38aa2a]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x9e921) [0x15554c3e7921]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0xa0108) [0x15554c3e9108]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0xa0ded) [0x15554c3e9ded]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(__kmpc_end_taskgroup+0x2d) [0x15554c3906cd]
/ccs/home/aaronalt/CODES/slate/lib/libslate.so.1(+0xf43225) [0x15554acc0225]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x41a2a) [0x15554c38aa2a]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x9e921) [0x15554c3e7921]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x9fcd5) [0x15554c3e8cd5]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(__kmpc_omp_taskwait+0x2d) [0x15554c3903ed]
/ccs/home/aaronalt/CODES/slate/lib/libslate.so.1(+0xf5b32e) [0x15554acd832e]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x4ae67) [0x15554c393e67]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(+0x86f77) [0x15554c3cff77]
/opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymp.so.1(_cray$mt_kmpc_fork_call_with_flags+0xb4) [0x15554c38d884]
/ccs/home/aaronalt/CODES/slate/lib/libslate.so.1(_ZN5slate4impl5he2hbILNS_6TargetE68ESt7complexIdEEEvRNS_15HermitianMatrixIT0_EERSt6vectorINS_6MatrixIS6_EESaISB_EERKSt3mapINS_6OptionENS_11OptionValueESt4lessISG_ESaISt4pairIKSG_SH_EEE+0x1c6d) [0x15554acec3ed]
/ccs/home/aaronalt/CODES/slate/lib/libslate.so.1(_ZN5slate5he2hbISt7complexIdEEEvRNS_15HermitianMatrixIT_EERSt6vectorINS_6MatrixIS4_EESaIS9_EERKSt3mapINS_6OptionENS_11OptionValueESt4lessISE_ESaISt4pairIKSE_SF_EEE+0x170) [0x15554ace84e0]
/ccs/home/aaronalt/CODES/slate/lib/libslate.so.1(_ZN5slate4heevISt7complexIdEEEvRNS_15HermitianMatrixIT_EERSt6vectorIN4blas16real_type_traitsIJS4_EE6real_tESaISB_EERNS_6MatrixIS4_EERKSt3mapINS_6OptionENS_11OptionValueESt4lessISJ_ESaISt4pairIKSJ_SK_EEE+0x700) [0x15554acfc720]
/ccs/home/aaronalt/CODES/slate/lib/libslate_scalapack_api.so(_ZN5slate13scalapack_api12slate_pheevdISt7complexIdEEEvPKcS5_iPT_iiPiPN4blas16real_type_traitsIJS6_EE6real_tES7_iiS8_S7_iSD_iS8_iS8_+0xd78) [0x155555519928]
/ccs/home/aaronalt/CODES/slate/lib/libslate_scalapack_api.so(pzheevd_+0xb9) [0x155555511209]
/ccs/home/aaronalt/CODES/BerkeleyGW_slate/bin/parabands.cplx.x() [0x4a8e06]
/ccs/home/aaronalt/CODES/BerkeleyGW_slate/bin/parabands.cplx.x() [0x4a47c4]
/ccs/home/aaronalt/CODES/BerkeleyGW_slate/bin/parabands.cplx.x() [0x4ac597]
/ccs/home/aaronalt/CODES/BerkeleyGW_slate/bin/parabands.cplx.x() [0x4c0858]
/lib64/libc.so.6(__libc_start_main+0xef) [0x15554b79024d]
/ccs/home/aaronalt/CODES/BerkeleyGW_slate/bin/parabands.cplx.x() [0x40705a]
MPICH ERROR [Rank 8] [job id 2883069.0] [Thu Dec 26 12:24:50 2024] [frontier10175] - Abort(1): Internal error

Thanks again and happy holidays!

Best,
Aaron

Mark Gates

Dec 26, 2024, 2:38:36 PM
to Aaron Altman, SLATE User
Don't use SLATE_HAVE_MT_BCAST on Frontier. From INSTALL.md:

    * SLATE_HAVE_MT_BCAST uses multiple OMP threads for MPI broadcast
      communication. Using this flag to enable multi-threaded broadcast
      communication achieves better performance but causes hangs on certain
      systems, particularly Frontier.

We haven't resolved all the issues with multi-threading on Frontier, which is frustrating for you and me alike. I'll take a look at it in January.

Mark

Mark Gates

Jan 16, 2025, 2:58:43 PM
to Aaron Altman, SLATE User
On Sun, Dec 22, 2024 at 4:18 PM Mark Gates <mga...@icl.utk.edu> wrote:

The implementation currently requires a square process grid. Sorry if that wasn't clear; I'll clarify that in the documentation. That's a restriction that we would like to remove, but isn't easily removed in the current setup. The eigenvalue algorithm is definitely something that I am looking to improve in the next year.

FYI, the documentation and error checks were updated in this PR.

Mark
