[slurm-users] Drain node from TaskProlog / TaskEpilog

Mark Dixon

May 24, 2021, 6:02:50 AM

to slurm...@lists.schedmd.com

Hi all,

Sometimes our compute nodes get into a failed state which we can only
detect from inside the job environment.

I can see that TaskProlog / TaskEpilog allows us to run our detection
test; however, unlike Epilog and Prolog, they do not drain a node if they
exit with a non-zero exit code.

Does anyone have advice on automatically draining a node in this
situation, please?

Best wishes,

Mark

Brian Andrus

May 24, 2021, 10:05:51 AM

to slurm...@lists.schedmd.com

I'm not sure I understand how a failed node can only be detected from
inside the job environment.

That description sounds more like "our application is behaving badly,
but not so badly that the node stops responding." In that situation, your
app or job should do something to catch the problem and report it to
Slurm in some fashion (up to and including killing the process).

Slurm polls the nodes and, if slurmd does not respond, marks the node as
down. So in your case slurmd must still be responding.

If you can describe the symptoms that make you think the node has
failed, we can help a little more.

Mark Dixon

May 24, 2021, 11:57:09 AM

to Slurm User Community List

Hi Brian,

Thanks for replying. On our hardware, GPUs allocated to a job by cgroup
sometimes get themselves into a state requiring a reboot.

Outside the job, a simple CUDA program calling the API function
cudaGetDeviceCount works happily. Inside the job, it returns an error code
of 3 (cudaErrorInitializationError).
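
A minimal sketch of that kind of probe, written here in Python with ctypes rather than as a compiled CUDA program. The library name libcudart.so and the helper names are assumptions for illustration, not taken from the original post. The "print ..." line follows the TaskProlog convention that such lines on its standard output are written to the task's stdout.

```python
import ctypes

CUDA_SUCCESS = 0
CUDA_ERROR_INITIALIZATION = 3  # cudaErrorInitializationError, as reported above


def probe_device_count(runtime):
    """Call cudaGetDeviceCount and return (error_code, device_count)."""
    count = ctypes.c_int(0)
    err = runtime.cudaGetDeviceCount(ctypes.pointer(count))
    return int(err), count.value


def task_prolog_lines(err):
    """Lines to emit on TaskProlog stdout; empty when the probe succeeded."""
    if err == CUDA_SUCCESS:
        return []
    return [f"print GPU check failed: cudaGetDeviceCount returned error {err}"]


def main(library="libcudart.so"):  # library name is an assumption
    """Load the CUDA runtime, run the probe, and report any failure."""
    runtime = ctypes.CDLL(library)
    err, _count = probe_device_count(runtime)
    for line in task_prolog_lines(err):
        print(line)
    return err
```

Calling main() from a TaskProlog script would run the probe inside the job's cgroup, which is where the failure is visible.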

At present, I have a TaskProlog that prods this API function and emails me
when there is a failure. It'd be nice if the nodes could drain themselves
without administrator intervention, rather than continuing to run waiting
jobs and so causing them to fail.

I can see a couple of ways to do it (e.g. sudo script in TaskProlog, or
playing with the cgroup hierarchy outside of slurm), but was wondering if
I had misunderstood the slurm docs and there was a simpler way.
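
As a rough illustration of the sudo route, the sketch below builds an "scontrol update ... State=DRAIN" command for the local node. It assumes a sudoers rule that lets job users run exactly this command without a password, which is a site decision and not something the thread confirms. The function names are hypothetical, and the hostname fallback is an assumption that only holds where node names match short hostnames.

```python
import os
import socket
import subprocess


def local_node_name(env=None):
    """Node name from SLURMD_NODENAME, else the short hostname (assumption)."""
    env = os.environ if env is None else env
    return env.get("SLURMD_NODENAME") or socket.gethostname().split(".")[0]


def drain_command(node, reason, use_sudo=True):
    """Build the scontrol command that drains one node with a reason."""
    cmd = ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]
    # "sudo -n" fails instead of prompting when no passwordless rule exists.
    return ["sudo", "-n"] + cmd if use_sudo else cmd


def drain_local_node(reason, run=subprocess.run):
    """Drain the node this task is running on; returns the runner's result."""
    return run(drain_command(local_node_name(), reason), check=False)
```

In practice it is probably safer to point the sudoers rule at a small root-owned wrapper that only ever drains the node it runs on, rather than at scontrol itself.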

Best,

Mark

Brian Andrus

May 24, 2021, 1:00:08 PM

to slurm...@lists.schedmd.com

Ah. I'll proceed under the scenario that there is a piece of hardware
that is being tested and may lock up (the GPU, in this case).

If you can identify from within the job that the issue is occurring,
you should exit the job with an error or leave some signal for Slurm
(e.g. a semaphore file). You can then use something like EpilogSlurmctld
to recognize that and reboot the node accordingly.
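
One way the semaphore-file idea could look, as a sketch only: the job side drops a flag file named after the job and node, and the EpilogSlurmctld side lists the nodes flagged for that job. It assumes a directory shared between compute nodes and the controller, and the "<jobid>.<node>" naming scheme and function names are made up for this example.

```python
from pathlib import Path


def raise_flag(flag_dir, job_id, node, detail=""):
    """Job side: leave a flag file saying this node failed during this job."""
    path = Path(flag_dir) / f"{job_id}.{node}"
    path.write_text(detail)
    return path


def flagged_nodes(flag_dir, job_id):
    """Controller side: names of nodes that raised a flag for this job."""
    prefix = f"{job_id}."
    return sorted(
        p.name[len(prefix):]
        for p in Path(flag_dir).iterdir()
        if p.name.startswith(prefix)
    )
```

An EpilogSlurmctld script could call flagged_nodes() with the value of SLURM_JOB_ID and then drain or reboot whatever it returns. Because the job user can write these files, the directory permissions deserve some thought.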

This presumes the node needs a full reboot, which I am guessing
affects the entire job. If you are able to do something like
unload/reload the CUDA drivers between tasks, that may be a way to
continue the job while still 'fixing' the issue. That could be done in
the TaskEpilog script (assuming the user it runs as has permission to
do so).

Christopher Samuel

May 24, 2021, 4:32:05 PM

to slurm...@lists.schedmd.com

On 5/24/21 3:02 am, Mark Dixon wrote:

> Does anyone have advice on automatically draining a node in this
> situation, please?

We do some health checks via a node epilog set with the "Epilog"
setting, including queueing node reboots with "scontrol reboot".
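
A sketch of what such an Epilog step might look like, under stated assumptions: the health check itself is left to the caller, the function names are hypothetical, and the node is assumed to have a RebootProgram configured so that "scontrol reboot" can act. This is not Chris's actual script.

```python
import subprocess


def reboot_command(node, reason):
    """Build an scontrol command that reboots the node once it is idle."""
    return ["scontrol", "reboot", "ASAP", "nextstate=RESUME", f"reason={reason}", node]


def queue_reboot_if_unhealthy(healthy, node, run=subprocess.run):
    """Queue a reboot when the health check failed; return True if queued."""
    if healthy:
        return False
    run(reboot_command(node, "epilog health check failed"), check=False)
    return True
```

The alternative mentioned at the start of the thread is for the Epilog to exit non-zero, which drains the node and leaves the reboot to an administrator.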

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Mark Dixon

May 25, 2021, 8:09:52 AM

to Slurm User Community List

Thanks to everyone for their help, much appreciated.

This seems to confirm that things would be much easier if I could just
find a way to detect the issue from the Prolog/Epilog, rather than the
TaskProlog/TaskEpilog!

All the best,