Project builds and runs, but not with CUDA-aware MPI


Nakul Iyer

Feb 3, 2025, 9:49:54 PM
to SLATE User
Hello,

I am working on a project on the NERSC (Perlmutter) architecture with Cray MPICH. I've set 
export mpi := cray
export blas := libsci
export CXX := CC
export DVS_MAXNODES := 1
export SLATE_GPU_AWARE_MPI := 1
export MPICH_GPU_SUPPORT_ENABLED := 1
...
srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=32 --cpu-bind=cores --gpus=4 --gpu-bind=single:1 build/main

In my code, I need SLATE to do a distributed GEMM routine on a large matrix across Perlmutter nodes. This has all been set up and works fine when I add slate::gpu_aware_mpi(false); before the routine in question. 
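(For what it's worth, since I already export SLATE_GPU_AWARE_MPI above, I believe SLATE also reads this setting from the environment, so the same toggle can apparently be flipped per run without recompiling, e.g.:)

```shell
# Disable SLATE's GPU-aware MPI path for one run (tiles staged through
# host memory instead of passing device pointers to MPI):
SLATE_GPU_AWARE_MPI=0 srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=32 \
    --cpu-bind=cores --gpus=4 --gpu-bind=single:1 build/main
```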

But if I instead try to enable CUDA-aware MPI by setting that call's argument to true, I get the following error:

Number of GPUs per node: 1
Waking the GPUs... DONE!
(GTL DEBUG: 1) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 360
MPICH ERROR [Rank 1] [job id 35390417.29] [Mon Feb  3 18:24:51 2025] [nid200449] - Abort(1011496450) (rank 1 in comm 0): Fatal error in PMPI_Wait: Invalid count, error stack:
PMPI_Wait(221)............................: MPI_Wait(request=0x7fffd0c2cb1c, status=0x1) failed
MPIR_Wait(93).............................:
MPIR_Wait_impl(41)........................:
MPID_Progress_wait(201)...................:
MPIDI_Progress_test(105)..................:
MPIDI_SHMI_progress(118)..................:
MPIDI_POSIX_progress(412).................:
MPIDI_CRAY_Common_lmt_ctrl_send_rts_cb(64):
MPIDI_CRAY_Common_lmt_handle_recv(44).....:
MPIDI_CRAY_Common_lmt_import_mem(218).....:
(unknown)(): Invalid count

aborting job:
Fatal error in PMPI_Wait: Invalid count, error stack:
PMPI_Wait(221)............................: MPI_Wait(request=0x7fffd0c2cb1c, status=0x1) failed
MPIR_Wait(93).............................:
MPIR_Wait_impl(41)........................:
MPID_Progress_wait(201)...................:
MPIDI_Progress_test(105)..................:
MPIDI_SHMI_progress(118)..................:
MPIDI_POSIX_progress(412).................:
MPIDI_CRAY_Common_lmt_ctrl_send_rts_cb(64):
MPIDI_CRAY_Common_lmt_handle_recv(44).....:
MPIDI_CRAY_Common_lmt_import_mem(218).....:
(unknown)(): Invalid count
[CRAYBLAS_WARNING] Application linked against multiple cray-libsci libraries
srun: error: nid200449: task 1: Exited with exit code 255
srun: Terminating StepId=35390417.29
slurmstepd: error: *** STEP 35390417.29 ON nid200449 CANCELLED AT 2025-02-04T02:24:52 ***
srun: error: nid200449: tasks 0,3: Terminated
srun: error: nid200449: task 2: Terminated
srun: Force Terminated StepId=35390417.29

To give more context, I can confirm that the error is triggered by the GEMM routine given below:

slate::Matrix<float> PT = load_matrix(path, m, n, comm);  // inserts tiles, initializes values
auto P = slate::transpose(PT);
auto K = slate::Matrix<float>(P.m(), P.m(), P.tileMb(0), P.tileMb(0),
                              1, size, PT.mpiComm());
K.insertLocalTiles(slate::Target::Devices);
// error caused by the following line:
slate::gemm<float>(1.0f, P, PT, 0.0f, K,
                   {{slate::Option::Target, slate::Target::Devices}});

Does anyone who has worked with SLATE, NERSC Perlmutter, Cray MPI, etc. know what could be going on here? I'd appreciate any advice, and am happy to supply more information as needed. Thanks!

Nakul

Paul Lin

Feb 4, 2025, 12:41:26 AM
to Nakul Iyer, SLATE User
Hi Nakul,

Please provide the list of #SBATCH directives at the top of your batch script.


Nakul Iyer

Feb 4, 2025, 12:45:05 AM
to SLATE User, pau...@lbl.gov, SLATE User, Nakul Iyer
I've been testing on the Perlmutter interactive session, but the directives correspond to:
#SBATCH --nodes=4
#SBATCH --gpus=16
#SBATCH --constraint=gpu
#SBATCH --qos=debug

Paul Lin

Feb 4, 2025, 1:09:46 AM
to Nakul Iyer, SLATE User

Please delete --gpu-bind=single:1 from your srun line and try again.


Nakul Iyer

Feb 4, 2025, 1:21:42 AM
to SLATE User, pau...@lbl.gov, SLATE User, Nakul Iyer
I get a similar error when I remove that (pasted below). I did notice, though, that if I set --ntasks=1 (still with --gpus=4), then it works fine (with or without --gpu-bind=single:1). I can also increase the number of nodes from 1 to 2 and the GEMM still works.

(GTL DEBUG: 0) cuMemGetAddressRange: named symbol not found, CUDA_ERROR_NOT_FOUND, line no 142
MPICH ERROR [Rank 0] [job id 35402138.10] [Mon Feb  3 22:18:12 2025] [nid200352] - Abort(942257666) (rank 0 in comm 0): Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(161)......................: MPI_Isend(buf=0x7fd8f5440800, count=2408, MPI_FLOAT, dest=1, tag=0, MPI_COMM_WORLD, request=0x7ffc96ca2560) failed
MPID_Isend(584)......................:
MPIDI_isend_unsafe(136)..............:
MPIDI_SHM_mpi_isend(323).............:
MPIDI_CRAY_Common_lmt_isend(84)......:
MPIDI_CRAY_Common_lmt_export_mem(103):
(unknown)(): Invalid count

aborting job:
Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(161)......................: MPI_Isend(buf=0x7fd8f5440800, count=2408, MPI_FLOAT, dest=1, tag=0, MPI_COMM_WORLD, request=0x7ffc96ca2560) failed
MPID_Isend(584)......................:
MPIDI_isend_unsafe(136)..............:
MPIDI_SHM_mpi_isend(323).............:
MPIDI_CRAY_Common_lmt_isend(84)......:
MPIDI_CRAY_Common_lmt_export_mem(103):
(unknown)(): Invalid count

Paul Lin

Feb 13, 2025, 12:39:23 AM
to Nakul Iyer, SLATE User
Actually, the error has changed. Previously it was a "cuIpcOpenMemHandle" error; now it is a "cuMemGetAddressRange" error.

Please create a script, e.g. "device_wrapper" with the following three lines (with no blank lines after the third line):

#!/bin/bash
export CUDA_VISIBLE_DEVICES=$((3-$SLURM_LOCALID))
exec $*


Make this script executable (i.e., chmod u+x device_wrapper).

Then try the following srun line:

srun --nodes=1 -n 4 --cpus-per-task=32 --cpu-bind=cores --ntasks-per-node=4 --gpus-per-node=4 ./device_wrapper build/main
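To make explicit what the wrapper does (a sketch, assuming 4 GPUs per node, as on Perlmutter's GPU nodes): it pins each local rank to exactly one GPU, assigning device IDs in reverse order of SLURM_LOCALID. The mapping arithmetic can be checked directly in the shell:

```shell
# Simulate the wrapper's mapping for the four local ranks on one node.
# Each rank sees a single device; IDs are assigned in reverse order
# (local rank 0 -> GPU 3, ..., local rank 3 -> GPU 0).
for SLURM_LOCALID in 0 1 2 3; do
    echo "local rank $SLURM_LOCALID -> CUDA_VISIBLE_DEVICES=$((3-SLURM_LOCALID))"
done
```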

Nakul Iyer

Feb 28, 2025, 4:20:26 PM
to SLATE User, pau...@lbl.gov, SLATE User, Nakul Iyer
This has been working for us. Thank you very much for your help!