[slurm-users] jobs getting stuck in CG


Ricardo Román-Brenes via slurm-users

Feb 10, 2025, 3:32:09 AM
to slurm...@lists.schedmd.com
Hello everyone.

I have a cluster of 16 nodes, four of which have GPUs, with no particular configuration to manage them.
The filesystem is Gluster; authentication is via slapd/munge.

My problem is that very frequently, at least one job a day, a job gets stuck in the CG (completing) state. I have no idea why this happens. Manually killing the slurmstepd process releases the node, but this is in no way a manageable solution. Has anyone experienced this (and fixed it)?
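
For reference, the manual workaround looks roughly like this (the slurmstepd PID is a placeholder):

    # Find jobs stuck in the completing state:
    squeue --states=COMPLETING -o "%i %T %N"
    # On the affected node, find the lingering step daemon and kill it:
    pgrep -a slurmstepd
    kill -9 <slurmstepd-pid>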

Thank you.

-Ricardo

John Hearns via slurm-users

Feb 10, 2025, 4:27:07 AM
to Ricardo Román-Brenes, Slurm User Community List
I have had something similar.
The fix was to run

    scontrol reconfig

which causes a re-read of the slurmd config. Give that a try.

It might be scontrol reread. Check the manual.
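
For what it's worth, the full spelling is scontrol reconfigure ("reconfig" is an accepted abbreviation); a minimal example, run from the controller:

    # Ask slurmctld and the slurmd daemons to re-read slurm.conf:
    scontrol reconfigure
    # Confirm the running configuration afterwards:
    scontrol show config | head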


John Hearns via slurm-users

Feb 10, 2025, 4:32:15 AM
to Ricardo Román-Brenes, Slurm User Community List
Belay that reply; that was a different issue.
In that case salloc works OK but srun says the user has no job on the node.

Michał Kadlof via slurm-users

Feb 10, 2025, 7:08:21 AM
to slurm...@lists.schedmd.com

I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An additional symptom was that the blocking process was stuck in the D state.
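
A quick way to spot those, assuming standard procps tools:

    # List processes in uninterruptible sleep (D state), typically blocked on I/O:
    ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'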

--
best regards
Michał Kadlof
Head of the High Performance Computing Center
EdenN cluster administrator
Structural and Functional Genomics Laboratory
Faculty of Mathematics and Information Science
Warsaw University of Technology

John Hearns via slurm-users

Feb 10, 2025, 7:16:06 AM
to Michał Kadlof, Slurm User Community List
ps -eaf --forest is your friend with Slurm
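
For example (the grep pattern is only an illustration):

    # Show the full process tree; stuck steps show up as slurmstepd
    # subtrees whose children are in the D state:
    ps -eaf --forest
    ps -eaf --forest | grep -B2 -A5 slurmstepd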

Christopher Samuel via slurm-users

Feb 10, 2025, 9:46:59 AM
to slurm...@lists.schedmd.com
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:

> I observed similar symptoms when we had issues with the shared Lustre
> file system. When the file system couldn't complete an I/O operation,
> the process in Slurm remained in the CG state until the file system
> became responsive again. An additional symptom was that the blocking
> process was stuck in the D state.

We've seen the same behaviour, though in our case we use an
"UnkillableStepProgram" to deal with compute nodes where user processes
(as opposed to Slurm daemons, which sounds like the issue for the
original poster here) get stuck and are unkillable.
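
The relevant slurm.conf settings look like this (the path and timeout
here are illustrative, not our actual values):

    # Script slurmd runs when job step processes cannot be killed:
    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh
    # Seconds slurmd waits for processes to die before declaring them
    # unkillable and invoking the program:
    UnkillableStepTimeout=180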

Our script does things like "echo w > /proc/sysrq-trigger" to get the
kernel to dump its view of all stuck processes, and then it goes through
the stuck job's cgroup to find all the processes and dumps
/proc/$PID/stack for each process and thread it finds there.
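
A stripped-down sketch of that logic (the cgroup path assumes a cgroup
v1 freezer hierarchy, and it assumes slurmd exports SLURM_JOB_ID to the
program's environment; adjust both for your site):

    #!/bin/bash
    # Ask the kernel to log all blocked (D state) tasks to dmesg:
    echo w > /proc/sysrq-trigger

    # Walk the stuck job's cgroups and dump the kernel stack of every
    # process and thread found there (needs root, which slurmd has):
    for procs in $(find /sys/fs/cgroup/freezer/slurm \
                        -path "*job_${SLURM_JOB_ID}*" -name cgroup.procs); do
        for pid in $(cat "$procs"); do
            for task in /proc/"$pid"/task/*; do
                echo "== pid $pid tid ${task##*/} =="
                cat "$task"/stack 2>/dev/null
            done
        done
    done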

In the end it either marks the node down (if it's the only job on the
node, which will mark the job as complete in Slurm, though it will not
free up those stuck processes) or drains the node if it's running
multiple jobs. In both cases we'll come back and check the issue out
(and our SREs will wake us up if they think there's an unusual number
of these).

That final step is important because a node stuck completing can really
confuse backfill scheduling for us, as slurmctld assumes it will become
free any second now and tries to use the node when planning jobs,
despite it being stuck. So marking it down/drained gets it out of
slurmctld's view as a potential future node.
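
The manual equivalent, with a placeholder node name:

    # Only job on the node: take it down so slurmctld stops planning on it.
    scontrol update NodeName=node01 State=DOWN Reason="stuck completing"
    # Node running multiple jobs: drain it so the others can finish.
    scontrol update NodeName=node01 State=DRAIN Reason="stuck completing"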

For nodes where a Slurm daemon on the node is stuck, that script will
not fire, so our SREs have alarms that trip after a node has been
completing for longer than a certain amount of time. They go and look at
what's going on and get the node out of the system before utilisation
collapses (and wake us up if that number seems to be increasing).
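
Such an alarm can key off output like this being non-empty for longer
than a site-chosen threshold:

    # Nodes currently in the completing state, with state and reason:
    sinfo --states=completing --noheader -o "%N %T %E"
    # Jobs in CG and how long they have been running:
    squeue --states=COMPLETING --noheader -o "%i %M %N"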

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA