Hi, all
We have a cluster running Slurm 20.11.8 on CentOS 8.2. It has suddenly started showing a weird problem:
only pending jobs are cancelled, with a "Transport endpoint is not connected" error (see image
https://user-images.githubusercontent.com/19144683/229037078-ca704ba8-23a4-4948-9d1a-bacab82acd1f.png). All jobs are submitted with the srun command.
... ...
srun: job 6367724 queued and waiting for resources
srun: error: Unable to allocate resources: Transport endpoint is not connected
srun: job 6367725 queued and waiting for resources
srun: error: Unable to allocate resources: Transport endpoint is not connected
srun: job 6367726 queued and waiting for resources
srun: job 6367727 queued and waiting for resources
srun: job 6367728 queued and waiting for resources
srun: error: Unable to allocate resources: Transport endpoint is not connected
srun: Force Terminated job 6366908
[root@slurm-master01 bin]# journalctl --since today -p err _COMM=slurmctld
Mar 31 02:50:46 slurm-master01 slurmctld[220654]: error: slurm_receive_msgs: Transport endpoint is not connected
Mar 31 02:50:47 slurm-master01 slurmctld[220654]: error: slurm_receive_msgs: Transport endpoint is not connected
* How can we prevent Slurm from cancelling pending jobs?
* What causes the error reported by slurmctld?
Thanks!