[slurm-users] slurmctld: slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer


Rike-Benjamin Schuppner

Jan 25, 2024, 6:17:10 AM
to slurm...@lists.schedmd.com
Hi,

I am getting the following error in the slurmctld logs whenever I run a few srun steps in a batch job.

Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug: _send_timeout: Socket POLLERR: Connection reset by peer
Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: error: slurm_send_node_msg: [socket:[921897]] slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer
Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug: laying out the 1 tasks on 1 hosts compute2 dist 1

The Slurm version is 23.11.3 and an example sbatch file is:

#!/bin/bash
#SBATCH --job-name=slurm_test
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --output=slurm_test_%j.log
pwd; hostname; date

srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &
srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &

wait

The more sruns I have in the script (&-backgrounded or not), the more the error shows up. Is there anything I could do to fix this error?

Best
/rike


Jerome Verleyen via slurm-users

Feb 15, 2024, 8:56:11 PM
to slurm...@lists.schedmd.com
Dear Rike

I'm facing the same error on my own cluster, also Slurm version 23.11.3. I
also notice that my tasks run in sequence, not in parallel. I'm
using the example from the srun manual:

#!/bin/bash

srun -n1 sleep 30 &
srun -n1 sleep 45 &
srun -n1 sleep 20 &
srun -n1 sleep 25 &
wait


$ sbatch -n4 test.sh

I expected this to finish in about 45 seconds, as my server has 64 cores.
But no, each task runs sequentially.
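
A way to confirm whether the steps really overlap is to time them from inside the batch script. The following is only a sketch, not a confirmed fix: `time_steps`, `STEP_LAUNCHER`, and `ELAPSED` are names made up for this example, and the default launcher assumes that asking srun for exactly one task and one CPU with `--exact` lets the steps share the allocation. Depending on the configuration, a per-step memory request such as `--mem-per-cpu` may also be needed.

```shell
#!/bin/bash
# Sketch: launch several steps in the background and report the wall time.
# If the steps overlap, the elapsed time is close to the longest step;
# if they run one after another, it is close to the sum of all steps.

time_steps() {
    # STEP_LAUNCHER is a hypothetical override; the default is an assumption,
    # not a verified fix for the behavior described in this thread.
    local launcher="${STEP_LAUNCHER-srun -n1 -c1 --exact}"
    local start end d
    start=$(date +%s)
    for d in "$@"; do
        # $launcher is left unquoted on purpose so it splits into words.
        $launcher sleep "$d" &
    done
    wait
    end=$(date +%s)
    ELAPSED=$((end - start))
    echo "elapsed: ${ELAPSED}s for steps: $*"
}

# Only launch real job steps when running inside a Slurm allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    time_steps 30 45 20 25
fi
```

Submitted with `sbatch -n4 test.sh`, an elapsed time near 45 seconds would mean the steps ran in parallel, while a value near 120 seconds would mean they ran sequentially.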

I hope someone can help us.

Regards

--
-- Jérôme
Inventer, c'est penser à côté.
(Albert Einstein)

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com
