GEMM communication and computation not overlapped


天天天

Feb 2, 2026, 4:55:09 AM
to SLATE User
When I run the gemm tester on a 4-GPU node with

export SLATE_GPU_AWARE_MPI=1
export OMP_NUM_THREADS=14
mpirun --bind-to none -n 4 bash -c 'export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK; ./tester --grid-order 'r' --nb 2048  --repeat 2 --check n --origin d --target d --ref n --type s --dim 32768x32768x32768 --grid 2x2 gemm'

the program exits normally and prints:

% SLATE version 2025.05.28, id f8348a7c
% input: ./tester --grid-order r --nb 2048 --repeat 2 --check n --origin d --target d --ref n --type s --dim 32768x32768x32768 --grid 2x2 gemm
% 2026-02-02 15:56:01, 4 MPI ranks, GPU-aware MPI, 14 OpenMP threads, 1 GPU devices per MPI rank

type  origin  target  gemm   go   do   A   B   C   transA   transB       m       n       k    nb      alpha       beta    p    q  la     error   time (s)       gflop/s  ref time (s)   ref gflop/s  status
   s     dev     dev  auto  row  row   1   1   1  notrans  notrans   32768   32768   32768  2048   3.1+1.4i   2.7+1.7i    2    2   1        NA      1.320     53312.823            NA            NA  no check  skipping reference: ScaLAPACK not available
   s     dev     dev  auto  row  row   1   1   1  notrans  notrans   32768   32768   32768  2048   3.1+1.4i   2.7+1.7i    2    2   1        NA      1.241     56716.360            NA            NA  no check
time (s)         min     1.241, max     1.320, avg     1.280, stddev   0.05601
gflop/s          min 5.331e+04, max 5.672e+04, avg 5.501e+04, stddev     2407.


% Matrix kinds:
%  1: rand, cond unknown

% All tests passed: gemm
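
As a sanity check on the reported rate, here is a minimal sketch that recomputes gflop/s from the printed dimensions and times. It assumes the tester counts 2*m*n*k floating-point operations for a real-valued gemm; the function name `gemm_gflops` is hypothetical and not part of SLATE. The printed times are rounded to three decimals, so the results only match the tester's figures approximately.

```python
def gemm_gflops(m, n, k, seconds):
    """Return the gemm rate in gflop/s, assuming 2*m*n*k flops (real arithmetic)."""
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Dimensions and times taken from the two tester runs above.
    for t in (1.320, 1.241):
        print(f"time {t:.3f} s -> {gemm_gflops(32768, 32768, 32768, t):.1f} gflop/s")
```

Both values land within a fraction of a percent of the tester's output, so the measured times include whatever communication is not hidden behind computation.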

Then I profiled the tester with NVIDIA Nsight Systems:

mpirun --bind-to none -n 4 bash -c 'export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK; nsys profile --force-overwrite true --trace=mpi,cuda,nvtx,cublas -o /tmp/nsys_out_slate/slate_tester_gemm.%q{OMPI_COMM_WORLD_RANK} ./tester --grid-order 'r' --nb 2048 --repeat 2 --check y --origin d --target d --ref n --type s --dim 32768x32768x32768 --grid 2x2 gemm'

However, in the Nsight Systems timeline for GPU 0, communication and computation do not overlap at all: each computation phase (blue) starts only after the previous communication phase (orange) has ended.

[Attachment: 屏幕截图 2026-02-02 174928.png (screenshot)]
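
To put a number on the overlap rather than reading it off the timeline by eye, something like the following sketch could be used. It assumes the compute (kernel) and communication (MPI) intervals have already been extracted from the profile as (start, end) pairs in seconds; the function names and the sample intervals are hypothetical and are not part of SLATE or Nsight Systems.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_time(compute, comm):
    """Total time during which a compute interval and a comm interval are both active."""
    a, b = merge_intervals(compute), merge_intervals(comm)
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


if __name__ == "__main__":
    # Hypothetical, strictly alternating phases like the ones in the screenshot.
    compute = [(1.0, 2.0), (3.0, 4.0)]
    comm = [(0.0, 1.0), (2.0, 3.0)]
    print(overlap_time(compute, comm))
```

A result of zero for a whole gemm call would confirm the fully serialized pattern described above; any positive value would indicate partial overlap.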

Is this serialized behavior expected for SLATE's gemm routine?
