We regularly have jobs that fail for some reason
(hardware, software, phase of the moon...)
but can't be removed from the system because SLURM won't let go of them.
For example, I just had a user complain that rebooting [nodes] hangs when there's a failed job in the CG state from a previous boot. scancel as the user or as root cannot clear the job; the scancel hangs for a while and then fails with:

~ $ scancel 5141
scancel: error: Kill job error on job id 5141: Job can not be altered now, try again later
The user had root access, so they tried restarting slurmctld:
Executing /etc/init.d/slurmctld restart seems to clear the state enough to allow the [node boot to] complete. squeue still shows:
# squeue
 JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
  5141  scx-comp hpcc.hpl  joe PD  0:00   957 (BeginTime)
The job in fact then appeared to start up (though it would have failed because the run file system was not mounted).
I haven't seen this *particular* scancel error before,
but we see problems *like* this all the time.
The design principle in SLURM seems to be to hold onto the job state at all costs, presumably on the grounds that the user has expended lots of valuable cluster time creating that state.
The problem with this is that there is an equally valid situation: the user knows that the job has failed and that its state is worthless, and what they want now is to regain access to their valuable cluster as quickly as possible and move on. But they can't, because SLURM is jammed up with this worthless failed job.
We need a SLURM command that will
- remove a job
- right now
- no questions asked
Unix has
kill -9
SLURM needs
sterminate-with-extreme-prejudice
In cases where you know there's something wrong with the node (node rebooting, or node hung), the best way to clear a CG job is to 'DOWN' the offending node with 'scontrol'. This tells SLURM "there's a problem with this node, so don't bother waiting for information from this node".
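The usual incantation is something like this (the node name here is just a placeholder, and the Reason text is whatever you want recorded):

# scontrol update NodeName=scx0123 State=DOWN Reason="stuck CG job 5141"

Once the node has been rebooted or otherwise sorted out, bring it back with:

# scontrol update NodeName=scx0123 State=RESUME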
Note that the 'squeue' output for CG-state jobs will list only the offending nodes (the nodes that SLURM is waiting on). Other nodes from the same job that have cleaned up properly will be set to 'idle'.
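If you want to list just the completing jobs and the nodes they are still waiting on, something along these lines works (the format string is only one way to slice it):

# squeue -t COMPLETING -o "%.8i %.9P %.8T %N"

-t CG is accepted as a shorthand for -t COMPLETING, and %N prints the node list, which for a CG-state job is just the nodes that have not yet reported in.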
HTH,
--Chris