When I run the Gemm tester on a 4-GPU node with
export SLATE_GPU_AWARE_MPI=1
export OMP_NUM_THREADS=14
mpirun --bind-to none -n 4 bash -c 'export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK; ./tester --grid-order 'r' --nb 2048 --repeat 2 --check n --origin d --target d --ref n --type s --dim 32768x32768x32768 --grid 2x2 gemm'
the program exits correctly and outputs:
% SLATE version 2025.05.28, id f8348a7c
% input: ./tester --grid-order r --nb 2048 --repeat 2 --check n --origin d --target d --ref n --type s --dim 32768x32768x32768 --grid 2x2 gemm
% 2026-02-02 15:56:01, 4 MPI ranks, GPU-aware MPI, 14 OpenMP threads, 1 GPU devices per MPI rank
type origin target gemm go do A B C transA transB m n k nb alpha beta p q la error time (s) gflop/s ref time (s) ref gflop/s status
s dev dev auto row row 1 1 1 notrans notrans 32768 32768 32768 2048 3.1+1.4i 2.7+1.7i 2 2 1 NA 1.320 53312.823 NA NA no check skipping reference: ScaLAPACK not available
s dev dev auto row row 1 1 1 notrans notrans 32768 32768 32768 2048 3.1+1.4i 2.7+1.7i 2 2 1 NA 1.241 56716.360 NA NA no check
time (s) min 1.241, max 1.320, avg 1.280, stddev 0.05601
gflop/s min 5.331e+04, max 5.672e+04, avg 5.501e+04, stddev 2407.
% Matrix kinds:
% 1: rand, cond unknown
% All tests passed: gemm
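As an aside, the per-rank `export CUDA_VISIBLE_DEVICES=...` in the command above can be factored into a small wrapper script to avoid the nested quoting inside `bash -c`. This is a minimal sketch assuming OpenMPI's `OMPI_COMM_WORLD_LOCAL_RANK` environment variable (other launchers expose a different name, e.g. `SLURM_LOCALID`); the script name `run_rank.sh` is hypothetical:

```shell
# Hypothetical wrapper (run_rank.sh), launched as:
#   mpirun --bind-to none -n 4 ./run_rank.sh
# Assumes OpenMPI; defaults to rank 0 when run outside mpirun.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
echo "local rank $LOCAL_RANK -> GPU $CUDA_VISIBLE_DEVICES"
# exec ./tester --grid-order r --nb 2048 --repeat 2 --check n \
#      --origin d --target d --ref n --type s \
#      --dim 32768x32768x32768 --grid 2x2 gemm
```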
Then I profile the tester using NVIDIA Nsight Systems:
mpirun --bind-to none -n 4 bash -c 'export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK; nsys profile --force-overwrite true --trace=mpi,cuda,nvtx,cublas -o /tmp/nsys_out_slate/slate_tester_gemm.%q{OMPI_COMM_WORLD_RANK} ./tester --grid-order 'r' --nb 2048 --repeat 2 --check y --origin d --target d --ref n --type s --dim 32768x32768x32768 --grid 2x2 gemm'
But in the timeline shown for GPU 0 in Nsight Systems, communication and computation do not overlap at all: each computation phase (blue) begins only after the previous communication phase (orange) ends.
I wonder whether this sequential behavior is expected for SLATE's gemm routine?
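For quantifying the serialization seen in the timeline, the per-rank reports can also be summarized from the CLI. A sketch, assuming a recent Nsight Systems whose `nsys stats` provides the built-in `cuda_gpu_kern_sum` and `cuda_gpu_mem_time_sum` reports (report names vary across nsys versions, and the report path below is just the one from my command):

```shell
# Sketch: print kernel-time and memcpy-time summaries for rank 0's report,
# to compare total compute vs. transfer time on GPU 0.
# The path and report names are assumptions; adjust for your nsys version.
REP=/tmp/nsys_out_slate/slate_tester_gemm.0.nsys-rep
if command -v nsys >/dev/null 2>&1 && [ -f "$REP" ]; then
    nsys stats --report cuda_gpu_kern_sum --report cuda_gpu_mem_time_sum "$REP"
else
    echo "skipping: nsys or report $REP not available"
fi
```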