to Slurm User Community List
Hello!
We have a huge number of jobs stuck in the CG (completing) state from a user who probably wrote code with bad I/O. "scancel" does not make them go away. Is there a way for admins to get rid of these jobs without draining and rebooting the nodes? I read somewhere that killing the respective slurmstepd process will do the job. Is this possible? Are there any other solutions? Also, are there any parameters in slurm.conf one can set to manage such situations better?
Best,
Durai
MPI Tübingen
Florian Zillner
Aug 20, 2021, 10:55:48 AM
to Slurm User Community List
Hi,
scancel the job, then set the affected nodes to the "down" state with "scontrol update nodename=<nodename> state=down reason=cg" and resume them afterwards.
However, if tasks are actually stuck on the node, then in most cases a reboot is needed to bring the node back in a clean state.
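
For illustration, the down/resume sequence above could be wrapped in a small helper like the one below. This is only a sketch: the function names run and clear_cg_node and the RUN variable are made up for this example, and the script prints the commands by default rather than executing them. It assumes the job has already been cancelled with scancel and that "state=resume" is how the node is brought back.

```shell
#!/bin/sh
# Sketch only: run(), clear_cg_node() and RUN are hypothetical names,
# not part of Slurm. By default the commands are printed, not executed.
# Set RUN=1 to really call scontrol (needs Slurm admin rights).

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"            # execute the command as given
    else
        echo "$@"       # dry run: just show what would be executed
    fi
}

clear_cg_node() {
    node="$1"
    # Mark the node down so Slurm gives up on the completing job ...
    run scontrol update nodename="$node" state=down reason=cg
    # ... then put it back into service.
    run scontrol update nodename="$node" state=resume
}

# Example (dry run):
#   clear_cg_node node01
```

If processes from the job are still hanging on the node, this will not remove them; in that case the reboot mentioned above is still needed.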