[slurm-dev] Killing SLURM jobs


Steven McDougall

Jul 1, 2008, 12:13:55 PM
to slur...@lists.llnl.gov
We need a simple, quick, reliable way to kill SLURM jobs.

We regularly have jobs that fail for some reason
(hardware, software, phase of the moon...)
but can't be removed from the system because SLURM won't let go of them.

For example, I just had a user complain:

Rebooting [nodes] hangs when there's a failed job in the CG state from
the previous boot. scancel, run as the user or as root, cannot clear the
job; the scancel hangs for a while and then fails with:

~ $ scancel 5141
scancel: error: Kill job error on job id 5141: Job can not be altered now, try again later

The user had root access, so they tried restarting slurmctld:

Executing /etc/init.d/slurmctld restart seems to clear the state enough
to allow the [node boot to] complete. squeue still shows:

# squeue
  JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
   5141  scx-comp hpcc.hpl  joe PD  0:00   957 (BeginTime)

The job in fact then appeared to start up (though it would have failed
because the run file system was not mounted).


I haven't seen this *particular* scancel error before,
but we see problems *like* this all the time.

The design principle in SLURM seems to be to hold onto the job state at
all costs, presumably on the grounds that the user has expended lots of
valuable cluster time creating that state.

The problem with this is the equally valid situation where the user knows
the job has failed and its state is worthless. What the user wants now is
to regain access to their valuable cluster as quickly as possible and
move on, but they can't, because SLURM is jammed up with this worthless
failed job.


We need a SLURM command that will
- remove a job
- right now
- no questions asked


Unix has

kill -9

SLURM needs

sterminate-with-extreme-prejudice

Holmes, Christopher (ZKO)

Jul 1, 2008, 12:34:20 PM
to slur...@lists.llnl.gov
SLURM is doing its best to make sure that rogue processes that could compromise future jobs are not left scattered across the cluster. This is why you may see a CG state hang around while a node reboots: SLURM last knew there was something running on that node, and it's trying to contact the slurmd and get an updated status.

In cases where you know there's something wrong with the node (node rebooting, or node hung), the best way to clear a CG job is to 'DOWN' the offending node with 'scontrol'. This tells SLURM "there's a problem with this node, so don't bother waiting for information from this node".
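For example, assuming the stuck node were named scx-c001 (substitute whatever node the CG job is actually waiting on), something like:

# node name and reason string below are only examples
scontrol update NodeName=scx-c001 State=DOWN Reason="unkillable processes"

Once the node has been cleaned up or rebooted, you can return it to service with 'scontrol update NodeName=scx-c001 State=RESUME'.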

Note that the 'squeue' output for CG-state jobs will list only the offending nodes (the nodes that SLURM is waiting on). Other nodes from the same job that have cleaned up properly will be set to 'idle'.

HTH,
--Chris

jet...@llnl.gov

Jul 1, 2008, 12:52:49 PM
to slur...@lists.llnl.gov
The reason Slurm retains job records tenaciously is that
1. job records contain step records
2. step records contain network switch information
3. on some networks, switch resources can't be re-used until all
   processes using them have been purged. In the case of a
   Quadrics switch, those switch resources are global.
With most networks (i.e. not Quadrics Elan or IBM Federation
switches), job records could be purged without causing problems.

The other problem is that a job hung in CG (completing) state
has non-killable processes. If you remove the job record,
Slurm will no longer be able to keep the node in CG state
and it will get reassigned to some other job. If there are
non-killable processes on a node, starting another job
there is probably not such a good idea.

The better solution would probably be to use Slurm's configuration
parameters UnkillableStepProgram and UnkillableStepTimeout (see
"man slurm.conf") to set a node with unkillable processes DOWN.
This will purge the job record and make the node unusable until
someone has a chance to investigate the problem (the program can
send email as an alert). Let me know if this isn't a viable
solution for you.
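
As a rough sketch, the relevant slurm.conf lines might look something like
this (the script path and timeout value are only examples; the script
itself is site-specific and could, for instance, mail the admins):

# example only: site-specific program run when a step's processes cannot be killed
UnkillableStepProgram=/usr/local/sbin/unkillable_alert.sh
# example value: seconds to wait after SIGKILL before treating the processes as unkillable
UnkillableStepTimeout=120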

I'll update our FAQ with information about these relatively new
slurm.conf options.