On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:
> I observed similar symptoms when we had issues with the shared Lustre
> file system. When the file system couldn't complete an I/O operation,
> the process in Slurm remained in the CG state until the file system
> became responsive again. An additional symptom was that the blocking
> process was stuck in the D state.
We've seen the same behaviour, though in our case we use an
"UnkillableStepProgram" to deal with compute nodes where user processes
(as opposed to Slurm daemons, which sounds like the issue for the
original poster here) get stuck and are unkillable.
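For anyone who hasn't used it, that's just hooked in via slurm.conf on
the compute nodes, something like this (the path and timeout value here
are placeholders rather than our actual settings):

    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh
    UnkillableStepTimeout=120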
Our script does things like "echo w > /proc/sysrq-trigger" to get the
kernel to dump its view of all stuck processes, and then it goes through
the stuck job's cgroup to find all its processes and dumps
/proc/$PID/stack for each process and thread it finds there.
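Roughly speaking that part looks like the sketch below (just a sketch:
the cgroup path and the SLURM_JOB_ID environment variable are
assumptions you'd want to check against your own slurmd version and
cgroup layout):

    #!/bin/bash
    # Sketch only: cgroup layout and the env vars passed to the script
    # are assumptions, check what your site actually provides.

    # Ask the kernel to log its view of all blocked (D state) tasks.
    echo w > /proc/sysrq-trigger

    # Find the stuck job's cgroup and dump the kernel stack of every
    # process and thread in it.
    job_cg=$(find /sys/fs/cgroup -type d -name "job_${SLURM_JOB_ID}" | head -n 1)
    for procs in $(find "${job_cg}" -name cgroup.procs); do
        for pid in $(cat "${procs}"); do
            echo "=== job ${SLURM_JOB_ID} pid ${pid} ==="
            cat /proc/${pid}/task/*/stack 2>/dev/null
        done
    done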
In the end it either marks the node down (if it's the only job on the
node, which will mark the job as complete in Slurm, though it will not
free up those stuck processes) or drains the node if it's running
multiple jobs. In both cases we'll come back and look into the issue
(and our SREs will wake us up if they think there's an unusual number of
these).
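That down-or-drain decision boils down to something like this (again
only a sketch, the reason string is made up):

    node=$(hostname -s)
    # If the stuck job is the only one on the node, down it outright;
    # otherwise drain so the other jobs can finish first.
    njobs=$(squeue --noheader --nodelist="${node}" | wc -l)
    if [ "${njobs}" -le 1 ]; then
        scontrol update NodeName="${node}" State=DOWN Reason="unkillable step in job ${SLURM_JOB_ID}"
    else
        scontrol update NodeName="${node}" State=DRAIN Reason="unkillable step in job ${SLURM_JOB_ID}"
    fi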
That final step is important because a node stuck completing can really
confuse backfill scheduling for us, as slurmctld assumes it will become
free any second now and tries to use the node when planning jobs,
despite it being stuck. So marking it down/drained gets it out of
slurmctld's view as a potential future node.
For nodes where a Slurm daemon itself is stuck, that script will not
fire, so our SREs have alarms that trip after a node has been completing
for longer than a certain amount of time. They go and look at what's
going on and get the node out of the system before utilisation collapses
(and wake us up if that number seems to be increasing).
All the best,
Chris
--
Chris Samuel :
http://www.csamuel.org/ : Berkeley, CA, USA