[slurm-users] what is the elegant way to drain node from epilog with self-defined reason?


taleint...@sjtu.edu.cn

May 3, 2022, 3:47:30 AM
to slurm...@lists.schedmd.com

Hi, all:

We need to detect certain problems at job-end time, so we run a detection script in the Slurm epilog, which should drain the node if the check fails.

I know that exiting the epilog with a non-zero code makes Slurm drain the node automatically, but then the drain reason is always marked as “Epilog error”, and our auto-repair program has trouble determining how to repair the node.

Another way is to call scontrol directly from the epilog to drain the node, but the official documentation at https://slurm.schedmd.com/prolog_epilog.html says:

Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). … Slurm commands in these scripts can potentially lead to performance issues and should not be used.

So what is the best way to drain a node from the epilog with a self-defined reason, or to tell Slurm to attach a more verbose message than the generic “Epilog error” reason?

Paul Edmon

May 3, 2022, 9:25:06 AM
to slurm...@lists.schedmd.com

We've invoked scontrol in our epilog script for years to close off nodes without any issue.  What the docs are really warning against is gratuitous use of those commands.  If those calls are well circumscribed (i.e. invoked only when you actually have to close a node) and used only when you have no other workaround, then you should be fine.

-Paul Edmon-
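Paul's approach can be sketched as follows. This is a minimal, illustrative epilog fragment, not official SchedMD code: `check_node_health` and the reason string are hypothetical placeholders for a site's own logic, `SLURMD_NODENAME` is set by slurmd in the epilog environment, and the sketch falls back to `echo` when `scontrol` is not installed so it can be exercised outside a cluster.

```shell
#!/bin/bash
# Epilog sketch: drain the node with a self-defined, machine-parsable
# reason instead of relying on the generic "Epilog error".

# Fall back to `echo` when scontrol is absent, so the sketch can be
# run outside a Slurm cluster for testing.
SCONTROL=$(command -v scontrol || echo echo)
NODE=${SLURMD_NODENAME:-$(hostname -s)}
REASON="epilog_check:scratch_fs_full"   # hypothetical reason tag

check_node_health() {
    # Placeholder for the site's real detection logic;
    # return non-zero when the node should be drained.
    return 1
}

if ! check_node_health; then
    # Only call scontrol on the failure path, keeping Slurm-command
    # use well circumscribed as described above.
    "$SCONTROL" update NodeName="$NODE" State=DRAIN Reason="$REASON"
fi
# A real epilog would end with `exit 0` here, so that slurmd does not
# additionally drain the node with reason "Epilog error".
```

The key point is that the script still exits 0 after draining; exiting non-zero would layer the generic "Epilog error" reason on top of the self-defined one.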

John Hanks

May 3, 2022, 9:35:58 AM
to Slurm User Community List
I've done something similar by having the epilog touch a file, then having the node health check (LBNL NHC) act on that file's presence/contents later to do the heavy lifting. There's a window of time during which the reason is "Epilog error" before the health check corrects it, but if that delay is tolerable, this makes for a fast epilog script.

griznog
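The file-based hand-off can be sketched like this. Both halves are shown in one script purely for illustration; in practice the first half lives in the epilog and the second in the health check. The flag path and reason string are made up (a real setup would use a fixed path such as `/var/run/epilog_drain_reason`), and the actual drain call is left as a comment since it belongs to the health-check side.

```shell
#!/bin/bash
# Sketch of the two-stage approach: the epilog only records the
# failure, and a later health-check run (e.g. LBNL NHC) does the
# drain. mktemp is used here only so the sketch is self-contained.
FLAG=$(mktemp)

# --- epilog side: fast, no Slurm commands in the critical path ---
echo "epilog_check:ib_link_down" > "$FLAG"

# --- health-check side: runs minutes later, outside the epilog ---
if [ -s "$FLAG" ]; then
    reason=$(cat "$FLAG")
    # A real health check would now drain the node, e.g.:
    #   scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="$reason"
    rm -f "$FLAG"   # consume the flag so the drain happens once
fi
```

The trade-off matches what's described above: the epilog stays fast and free of Slurm commands, at the cost of a short window where the node shows "Epilog error" until the health check rewrites the reason.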

Michael Jennings

May 3, 2022, 5:22:57 PM
to slurm...@lists.schedmd.com
On Tuesday, 03 May 2022, at 15:46:38 (+0800),
taleint...@sjtu.edu.cn wrote:

> We need to detect some problem at job end timepoint, so we write some
> detection script in slurm epilog, which should drain the node if check is
> not passed.
>
> I know exit epilog with non-zero code will make slurm automatically drain
> the node. But in such way, drain reason will all be marked as "Epilog
> error". Then our auto-repair program will have trouble to determine how to
> repair the node.
>
> Another way is call scontrol directly from epilog to drain the node, but
> from official doc https://slurm.schedmd.com/prolog_epilog.html it wrote:
>
> Prolog and Epilog scripts should be designed to be as short as possible and
> should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). …
> Slurm commands in these scripts can potentially lead to performance issues
> and should not be used.
>
> So what is the best way to drain node from epilog with a self-defined
> reason, or tell slurm to add more verbose message besides "Epilog error"
> reason?

Invoking `scontrol` from a prolog/epilog script to simply alter nodes'
state and/or reason fields is totally fine. Many sites (including
ours) use LBNL NHC for all or part of their epilogs' post-job "sanity
checking" of nodes, and -- knock on renewable bamboo -- there have
been no concurrency issues (loops, deadlocks, etc.) reported to either
project to date. :-)

If it helps, I had similar concerns about invoking the `squeue`
command from an NHC run in order to gather job data. The Man Himself
(Moe Jette, original creator of Slurm and co-founder of SchedMD) was
kind enough to weigh in on the issue (literally, the Issue:
https://github.com/mej/nhc/issues/15), saying in part,

"I do not believe that you could create a deadlock situation from
NHC (if you did, I would consider that a Slurm bug)."
-- https://github.com/mej/nhc/issues/15#issuecomment-217174363

That's not to say you should go hog-wild and fill your epilog script
with all the `s`-commands you can think of.... ;-) But you can at
least be reasonably confident that draining/offlining a node from an
epilog script will not cause your cluster to implode!

Michael

--
Michael E. Jennings <m...@lanl.gov> - [PGPH: he/him/his/Mr] -- hpc.lanl.gov
HPC Systems Engineer -- Platforms Team -- HPC Systems Group (HPC-SYS)
Strategic Computing Complex, Bldg. 03-2327, Rm. 2341 W: +1 (505) 606-0605
Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545-0001
