[slurm-users] help with canceling or deleting a job


Felix

Sep 19, 2023, 8:00:21 AM
to Slurm User Community List
Hello

I have a job on my system which has been running longer than its time
limit, more than 4 days.

1808851     debug  gridjob  atlas01 CG 4-00:00:19      1 awn-047

I'm trying to cancel it

[@arc7-node ~]# scancel 1808851

I get no message, as if the job was canceled, but when I check the job's
status, it is still there

[@arc7-node ~]# squeue | grep awn-047
           1808851     debug  gridjob  atlas01 CG 4-00:00:19 1 awn-047

Is there anything else I can do to kill/end the job?

Thank you

Felix


--
Dr. Eng. Farcas Felix
National Institute of Research and Development of Isotopic and Molecular Technology,
IT - Department - Cluj-Napoca, Romania
Mobile: +40742195323

Ole Holm Nielsen

Sep 19, 2023, 8:28:17 AM
to slurm...@lists.schedmd.com


On 9/19/23 13:59, Felix wrote:
> Hello
>
> I have a job on my system which has been running longer than its time
> limit, more than 4 days.
>
> 1808851     debug  gridjob  atlas01 CG 4-00:00:19      1 awn-047

The job has state "CG" which means "Completing". The Completing status is
explained in "man sinfo".

This means that Slurm is trying to cancel the job, but it hangs for some
reason.
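
To see where it is stuck, you could look at the job and the node (the job
ID and node name below are taken from your squeue line above), for example:

$ scontrol show job 1808851
$ scontrol show node awn-047

and check the slurmd log on awn-047 (e.g. "journalctl -u slurmd", if slurmd
runs under systemd on your nodes).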

> I'm trying to cancel it
>
> [@arc7-node ~]# scancel 1808851
>
> I get no message, as if the job was canceled, but when I check the job's
> status, it is still there
>
> [@arc7-node ~]# squeue | grep awn-047
>            1808851     debug  gridjob  atlas01 CG 4-00:00:19 1 awn-047

What is your UnkillableStepTimeout parameter? The default of 60 seconds
can be changed in slurm.conf. My cluster:

$ scontrol show config | grep UnkillableStepTimeout
UnkillableStepTimeout = 126 sec
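
If you need a longer grace period, something like the following in
slurm.conf should work (180 seconds is only an example value), followed by
"scontrol reconfigure" or a restart of the slurmd daemons:

UnkillableStepTimeout=180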

> Is there anything else I can do to kill/end the job?

It may be impossible to kill the job's processes, for example, if a
filesystem is hanging.

You may log in to the node and give the job's processes a "kill -9". Or
just reboot the node.
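
A rough sketch, using the node and user from your squeue output (note that
"pkill -9 -u" kills all of that user's processes on the node, so be careful
on shared nodes):

$ ssh awn-047
$ ps -u atlas01 -o pid,stat,etime,cmd
$ kill -9 <pid>        # or: pkill -9 -u atlas01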

/Ole

Feng Zhang

Sep 19, 2023, 7:40:06 PM
to Slurm User Community List
Restarting the slurmd daemon on the compute node should work, if the
node is still online and normal.
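
For example, on the compute node (assuming slurmd is managed by systemd):

systemctl restart slurmd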

Best,

Feng

Ole Holm Nielsen

Sep 20, 2023, 3:12:22 AM
to slurm...@lists.schedmd.com
On 9/20/23 01:39, Feng Zhang wrote:
> Restarting the slurmd daemon on the compute node should work, if the
> node is still online and normal.

Probably not. If the filesystem used by the job is hung, the node
probably has to be rebooted, and the filesystem checked.
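
A quick way to check for a hung filesystem is to look for processes stuck
in uninterruptible sleep ("D" state) on the node, which usually points at
blocked I/O, e.g.:

$ ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'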

/Ole

Wagner, Marcus

Sep 20, 2023, 7:27:17 AM
to slurm...@lists.schedmd.com
Even after rebooting, sometimes nodes are stuck because of "completing
jobs".

What helps then is to set the node down and resume it afterwards:

scontrol update nodename=<nodename> state=drain reason=stuck
scontrol update nodename=<nodename> state=resume
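
Afterwards you can check that the node is back and that the completing job
is gone, e.g.:

scontrol show node <nodename>
squeue -w <nodename>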


Best
Marcus

Feng Zhang

Sep 20, 2023, 11:32:27 AM
to Slurm User Community List
👍

Best,

Feng
