[slurm-users] Pending job will be cancelled since transport endpoint is not connected

Chenyang Yan

Mar 31, 2023, 3:11:59 AM
to slurm...@schedmd.com
Hi all,

We have a cluster running Slurm 20.11.8 on CentOS 8.2. It has suddenly started showing a weird problem: periodically, pending jobs (and only pending jobs) are cancelled with a "Transport endpoint is not connected" error (see image: https://user-images.githubusercontent.com/19144683/229037078-ca704ba8-23a4-4948-9d1a-bacab82acd1f.png). All jobs are submitted with the srun command.
... ...
srun: job 6367724 queued and waiting for resources
srun: error: Unable to allocate resources: Transport endpoint is not connected
srun: job 6367725 queued and waiting for resources
srun: error: Unable to allocate resources: Transport endpoint is not connected
srun: job 6367726 queued and waiting for resources
srun: job 6367727 queued and waiting for resources
srun: job 6367728 queued and waiting for resources
srun: error: Unable to allocate resources: Transport endpoint is not connected
srun: Force Terminated job 6366908

[root@slurm-master01 bin]# journalctl --since today -p err _COMM=slurmctld
Mar 31 02:50:46 slurm-master01 slurmctld[220654]: error: slurm_receive_msgs: Transport endpoint is not connected
Mar 31 02:50:47 slurm-master01 slurmctld[220654]: error: slurm_receive_msgs: Transport endpoint is not connected

According to https://github.com/SchedMD/slurm/blob/slurm-20-11-8-1/src/srun/libsrun/allocate.c#L182-L227 , this looks like an OS issue? I googled "transport endpoint is not connected", and many of the references attribute it to a filesystem I/O issue. So:
* How can we prevent Slurm from cancelling pending jobs?
* What caused the error reported by slurmctld?
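
One note on the error text itself: as far as I can tell, it is just the generic OS message for errno ENOTCONN, which socket calls also return, so it does not have to mean a filesystem problem. Below is a minimal sketch to check that, assuming a Linux host with glibc and Python 3. The function names are made up for illustration and have nothing to do with Slurm's own code.

```python
import errno
import os
import socket


def enotconn_message():
    """Return the OS message text for ENOTCONN (platform dependent)."""
    return os.strerror(errno.ENOTCONN)


def provoke_enotconn():
    """Ask an unconnected TCP socket for its peer and return the errno raised."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.getpeername()  # the socket was never connected, so there is no peer
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()
    return None


if __name__ == "__main__":
    # No filesystem is involved here, only a socket.
    print("ENOTCONN message:", enotconn_message())
    print("errno matches ENOTCONN:", provoke_enotconn() == errno.ENOTCONN)
```

If that reading is right, the slurmctld error could equally come from a network connection that was closed or never fully established, but this is only a guess on my side.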

Thanks!