Number of GPUs per node: 1
Waking the GPUs... DONE!
(GTL DEBUG: 1) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 360
MPICH ERROR [Rank 1] [job id 35390417.29] [Mon Feb 3 18:24:51 2025] [nid200449] - Abort(1011496450) (rank 1 in comm 0): Fatal error in PMPI_Wait: Invalid count, error stack:
PMPI_Wait(221)............................: MPI_Wait(request=0x7fffd0c2cb1c, status=0x1) failed
MPIR_Wait(93).............................:
MPIR_Wait_impl(41)........................:
MPID_Progress_wait(201)...................:
MPIDI_Progress_test(105)..................:
MPIDI_SHMI_progress(118)..................:
MPIDI_POSIX_progress(412).................:
MPIDI_CRAY_Common_lmt_ctrl_send_rts_cb(64):
MPIDI_CRAY_Common_lmt_handle_recv(44).....:
MPIDI_CRAY_Common_lmt_import_mem(218).....:
(unknown)(): Invalid count
aborting job:
Fatal error in PMPI_Wait: Invalid count, error stack:
PMPI_Wait(221)............................: MPI_Wait(request=0x7fffd0c2cb1c, status=0x1) failed
MPIR_Wait(93).............................:
MPIR_Wait_impl(41)........................:
MPID_Progress_wait(201)...................:
MPIDI_Progress_test(105)..................:
MPIDI_SHMI_progress(118)..................:
MPIDI_POSIX_progress(412).................:
MPIDI_CRAY_Common_lmt_ctrl_send_rts_cb(64):
MPIDI_CRAY_Common_lmt_handle_recv(44).....:
MPIDI_CRAY_Common_lmt_import_mem(218).....:
(unknown)(): Invalid count
[CRAYBLAS_WARNING] Application linked against multiple cray-libsci libraries
srun: error: nid200449: task 1: Exited with exit code 255
srun: Terminating StepId=35390417.29
slurmstepd: error: *** STEP 35390417.29 ON nid200449 CANCELLED AT 2025-02-04T02:24:52 ***
srun: error: nid200449: tasks 0,3: Terminated
srun: error: nid200449: task 2: Terminated
srun: Force Terminated StepId=35390417.29