On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:
> I observed similar symptoms when we had issues with the shared Lustre
> file system. When the file system couldn't complete an I/O operation,
> the process in Slurm remained in the CG state until the file system
> became responsive again. An additional symptom was that the blocking
> process was stuck in the D state.
We've seen the same behaviour, though in our case we use an
"UnkillableStepProgram" to deal with compute nodes where user processes
(as opposed to Slurm daemons, which sounds like the issue for the
original poster here) get stuck and are unkillable.
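For anyone who hasn't used it, that's just hooked in via slurm.conf on
the compute nodes, something like this (the path and timeout value here
are placeholders rather than our actual settings):

    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh
    UnkillableStepTimeout=120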
Our script does things like "echo w > /proc/sysrq-trigger" to get the
kernel to dump its view of all stuck processes, and then it goes through
the stuck job's cgroup to find all its processes and dumps
/proc/$PID/stack for each process and thread it finds there.
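Roughly speaking that part looks like the sketch below (just a sketch:
the cgroup path and the SLURM_JOB_ID environment variable are
assumptions you'd want to check against your own slurmd version and
cgroup layout):

    #!/bin/bash
    # Sketch only: cgroup layout and the env vars passed to the script
    # are assumptions, check what your site actually provides.

    # Ask the kernel to log its view of all blocked (D state) tasks.
    echo w > /proc/sysrq-trigger

    # Find the stuck job's cgroup and dump the kernel stack of every
    # process and thread in it.
    job_cg=$(find /sys/fs/cgroup -type d -name "job_${SLURM_JOB_ID}" | head -n 1)
    for procs in $(find "${job_cg}" -name cgroup.procs); do
        for pid in $(cat "${procs}"); do
            echo "=== job ${SLURM_JOB_ID} pid ${pid} ==="
            cat /proc/${pid}/task/*/stack 2>/dev/null
        done
    done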
In the end it either marks the node down (if it's the only job on the
node, which will mark the job as complete in Slurm, though it will not
free up those stuck processes) or drains the node if it's running
multiple jobs. In both cases we'll come back and look into the issue
(and our SREs will wake us up if they think there's an unusual number of
these).
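That down-or-drain decision boils down to something like this (again
only a sketch, the reason string is made up):

    node=$(hostname -s)
    # If the stuck job is the only one on the node, down it outright;
    # otherwise drain so the other jobs can finish first.
    njobs=$(squeue --noheader --nodelist="${node}" | wc -l)
    if [ "${njobs}" -le 1 ]; then
        scontrol update NodeName="${node}" State=DOWN Reason="unkillable step in job ${SLURM_JOB_ID}"
    else
        scontrol update NodeName="${node}" State=DRAIN Reason="unkillable step in job ${SLURM_JOB_ID}"
    fi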
That final step is important because a node stuck completing can really
confuse backfill scheduling for us, as slurmctld assumes it will become
free any second now and tries to use the node when planning jobs,
despite it being stuck. So marking it down/drained gets it out of
slurmctld's view as a potential future node.
For nodes where a Slurm daemon itself is stuck, that script will not
fire, so our SREs have alarms that trip after a node has been completing
for longer than a certain amount of time. They go and look at what's
going on and get the node out of the system before utilisation collapses
(and wake us up if that number seems to be increasing).
All the best,
Chris
--
Chris Samuel :
http://www.csamuel.org/ : Berkeley, CA, USA