[slurm-users] Issues with orphaned jobs after update

Jeffrey McDonald

Dec 6, 2023, 9:27:39 AM
to Slurm User Community List
Hi, 
Yesterday an upgrade of Slurm from 22.05.4 to 23.11.0 went sideways and I ended up losing a number of jobs on the compute nodes. Ultimately the installation seems to have been successful, but it appears I now have some issues with job remnants. About once per minute (per job), the slurmctld daemon logs:

[2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39104]: Zero Bytes were transmitted or received
[2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39106]: Zero Bytes were transmitted or received
[2023-12-06T08:16:32.792] error: slurm_receive_msg [146.57.133.38:54722]: Zero Bytes were transmitted or received
[2023-12-06T08:16:34.189] error: slurm_receive_msg [146.57.133.49:59058]: Zero Bytes were transmitted or received
[2023-12-06T08:16:34.197] error: slurm_receive_msg [146.57.133.49:58232]: Zero Bytes were transmitted or received
[2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48856]: Zero Bytes were transmitted or received
[2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48860]: Zero Bytes were transmitted or received
[2023-12-06T08:16:36.329] error: slurm_receive_msg [146.57.133.46:50848]: Zero Bytes were transmitted or received
[2023-12-06T08:16:59.827] error: slurm_receive_msg [146.57.133.14:60328]: Zero Bytes were transmitted or received
[2023-12-06T08:16:59.828] error: slurm_receive_msg [146.57.133.37:37734]: Zero Bytes were transmitted or received
[2023-12-06T08:17:03.285] error: slurm_receive_msg [146.57.133.35:41426]: Zero Bytes were transmitted or received
[2023-12-06T08:17:13.244] error: slurm_receive_msg [146.57.133.105:34416]: Zero Bytes were transmitted or received
[2023-12-06T08:17:13.726] error: slurm_receive_msg [146.57.133.15:60164]: Zero Bytes were transmitted or received
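
One thing that may be worth checking (only a guess from these log lines, not a confirmed cause) is whether the nodes behind those addresses are still registering with the old 22.05 slurmd. scontrol reports the version each node last registered with, e.g. for one of the nodes named in the orphan messages below:

# scontrol show node amd03 | grep -iE 'Version|State'

If a node still shows Version=22.05.4 while the controller is on 23.11.0, restarting slurmd on that node would be the obvious first step.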

The controller also reports orphaned job steps:

[2023-12-06T07:47:42.010] error: Orphan StepId=9050.extern reported on node amd03
[2023-12-06T07:47:42.010] error: Orphan StepId=9055.extern reported on node amd03
[2023-12-06T07:47:42.011] error: Orphan StepId=8862.extern reported on node amd12
[2023-12-06T07:47:42.011] error: Orphan StepId=9065.extern reported on node amd07
[2023-12-06T07:47:42.011] error: Orphan StepId=9066.extern reported on node amd07
[2023-12-06T07:47:42.011] error: Orphan StepId=8987.extern reported on node amd09
[2023-12-06T07:47:42.012] error: Orphan StepId=9068.extern reported on node amd08
[2023-12-06T07:47:42.012] error: Orphan StepId=8862.extern reported on node amd13
[2023-12-06T07:47:42.012] error: Orphan StepId=8774.extern reported on node amd10
[2023-12-06T07:47:42.012] error: Orphan StepId=9051.extern reported on node amd10
[2023-12-06T07:49:22.009] error: Orphan StepId=9071.extern reported on node aslab01
[2023-12-06T07:49:22.010] error: Orphan StepId=8699.extern reported on node gpu05
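
To see what is actually still alive on one of those nodes, a quick check is to list any leftover slurmstepd processes and the step PIDs slurmd knows about. Both commands have to be run on the compute node itself, not the controller; amd03 and friends are just the names from the log above:

# pgrep -af slurmstepd
# scontrol listpids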


On the compute nodes, I see a corresponding error message like this one:

[2023-12-06T08:18:03.292] [9052.extern] error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
[2023-12-06T08:18:03.292] [9052.extern] error: slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error
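
That hash plugin error is consistent with a slurmstepd that was started under 22.05 and is now talking to 23.11 daemons after the libraries were replaced underneath it; that is an educated guess, not something the log proves. One way to check, on the node, is whether the slurmstepd for that step is still running a binary the upgrade has since replaced (<PID> being whatever pgrep prints):

# pgrep -f 'slurmstepd: \[9052'
# ls -l /proc/<PID>/exe

If the exe link shows "(deleted)", that step is still running the pre-upgrade binary and will probably keep failing like this until it is killed off.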



The error always seems to reference a job that was canceled, e.g. 9052:

# sacct -j 9052
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode  
------------ ---------- ---------- ---------- ---------- ---------- --------  
9052         sys/dashb+     a40gpu                    24  CANCELLED      0:0  
9052.batch        batch                               24  CANCELLED      0:0  
9052.extern      extern                               24  CANCELLED      0:0
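
sacct is reading the accounting database here; to check whether slurmctld itself still tracks the job, as opposed to just the node re-reporting the step, something along these lines would show it (either printing the job record, or reporting an invalid job id once the controller has purged it):

# scontrol show job 9052
# squeue -j 9052 --states=all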

These jobs were running at the start of the update but were subsequently canceled because of slurmd or slurmctld timeouts during the update. How can I clean this up? I've tried canceling the jobs, but nothing seems to remove them.

Thanks in advance,
Jeff

Jeffrey McDonald

Dec 7, 2023, 10:38:08 AM
to Slurm User Community List
Hi, 

As an update, I was able to clear out the orphaned/cancelled jobs by rebooting the compute nodes that had cancelled jobs. The error messages have ceased.
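
For anyone who hits this and would rather not reboot, a lighter-weight sequence that might achieve the same thing (untested here; the step id is just the example from above) is to kill the stuck slurmstepd for each orphaned step on the node and then restart slurmd:

# pkill -f 'slurmstepd: \[9052\.'     # one stuck step at a time, on the compute node
# systemctl restart slurmd

Whether that is enough without a full reboot is an open question; the reboot is the sure thing.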

Regards,
Jeff