We regularly have jobs that fail for some reason
(hardware, software, phase of the moon...)
but can't be removed from the system because SLURM won't let go of them.
For example, I just had a user complain that rebooting [nodes] hangs when there's a failed job in the CG state from a previous boot. scancel as the user or as root cannot clear the job; the scancel hangs for a while and then fails with:

~ $ scancel 5141
scancel: error: Kill job error on job id 5141: Job can not be altered now, try again later
The user had root access, so they tried restarting slurmctld:
Executing /etc/init.d/slurmctld restart seems to clear the state enough to allow the [node boot to] complete. squeue still shows:
# squeue
 JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
  5141  scx-comp hpcc.hpl  joe PD  0:00   957 (BeginTime)
The job in fact then appeared to start up (though it would have failed because the run file system was not mounted).
I haven't seen this *particular* scancel error before,
but we see problems *like* this all the time.
The design principle in SLURM seems to be to hold onto the job state at all costs, presumably on the grounds that the user has expended lots of valuable cluster time creating that state.
The problem with this is that there is an equally valid situation: the user knows that the job has failed and that its state is worthless, and what they want now is to regain access to their valuable cluster as quickly as possible and move on. But they can't, because SLURM is jammed up with this worthless failed job.
We need a SLURM command that will
- remove a job
- right now
- no questions asked
Unix has
kill -9
SLURM needs
sterminate-with-extreme-prejudice
In cases where you know there's something wrong with the node (node rebooting, or node hung), the best way to clear a CG job is to 'DOWN' the offending node with 'scontrol'. This tells SLURM "there's a problem with this node, so don't bother waiting for information from this node".
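The usual incantation is something like this (the node name here is just a placeholder, and the Reason text is whatever you want recorded):

# scontrol update NodeName=scx0123 State=DOWN Reason="stuck CG job 5141"

Once the node has been rebooted or otherwise sorted out, bring it back with:

# scontrol update NodeName=scx0123 State=RESUME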
Note that the 'squeue' output for CG-state jobs will list only the offending nodes (the nodes that SLURM is waiting on). Other nodes from the same job that have cleaned up properly will be set to 'idle'.
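If you want to list just the completing jobs and the nodes they are still waiting on, something along these lines works (the format string is only one way to slice it):

# squeue -t COMPLETING -o "%.8i %.9P %.8T %N"

-t CG is accepted as a shorthand for -t COMPLETING, and %N prints the node list, which for a CG-state job is just the nodes that have not yet reported in.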
HTH,
--Chris