SLURM Integration

Jason Simms

Sep 5, 2019, 10:08:05 AM
to n...@lbl.gov
Dear all,

I have what I'm certain is a rather simple question, so I apologize in advance for its basic nature.

I'm attempting to integrate Node Health Check (NHC) with SLURM, so that SLURM runs it periodically and can offline a node with an issue, etc. Pretty typical stuff.

But, while I think I have everything configured correctly - there's not much to it, really - I'm having a challenging time determining whether it is running as it should. On a given node, if I manually run nhc and then check its return code with echo $?, I see it returns 0. But then in nhc.log, every time it runs, I see these two lines (and this is the same for each node):

20190905 09:12:43 [pbs] /usr/libexec/nhc/node-mark-online node03.cluster
/usr/libexec/nhc/node-mark-online:  Skipping  node node03.cluster ( )

Is this expected? Is this just telling me that everything is fine, or is something failing? I've searched in vain for a couple of hours online with no success. Any insight would be appreciated, and again, apologies for my ignorance.

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632

Dockendorf, Trey

Sep 5, 2019, 10:19:14 AM
to n...@lbl.gov

Yes, those log lines are expected. When a node is healthy, NHC attempts to mark it online; if the node was not previously marked offline by NHC, you’ll see the “Skipping” message.

- Trey


Michael Jennings

Sep 9, 2019, 8:23:22 PM
to Jason Simms, n...@lbl.gov
Hi Jason!

Trey's reply was absolutely correct, but I did want to point out one
additional item I noticed.

> I'm attempting to integrate Node Health Check (NHC) with SLURM, such that
> it will run it periodically, and be able to offline a node with an issue,
> etc. Pretty typical stuff.
>
> ...
>
> 20190905 09:12:43 [pbs] /usr/libexec/nhc/node-mark-online node03.cluster
> /usr/libexec/nhc/node-mark-online: Skipping node node03.cluster ( )

Two things worth pointing out here: One, you can turn off the default
behavior of NHC (i.e., to mark nodes online/offline via the resource
manager based on the outcome of the checks) by setting the
MARK_OFFLINE variable to 0 as documented here:
https://github.com/mej/nhc#supported-variables

Two, and more importantly, notice the "[pbs]" at the beginning of the
line there. That means that NHC's built-in autodetection for which RM
you're using found PBS (e.g., OpenPBS, PBS Pro, or TORQUE) rather than
Slurm. Since you said you were intending to integrate with Slurm,
this represents a potential issue for you.
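
One quick way to see which resource manager NHC has been detecting is to tally the bracketed tag on the timestamped log lines. This is only a sketch: `count_rm_tags` is a hypothetical helper name, the default path /var/log/nhc.log is an assumption (pass your own log path if it differs), and the pattern assumes lines formatted like the one quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count the "[rm]" tags found in an NHC log.
# Assumes lines shaped like: "20190905 09:12:43 [pbs] /usr/libexec/nhc/..."
count_rm_tags() {
    log="${1:-/var/log/nhc.log}"   # assumed default location
    # Print only the tag from timestamped lines, then count each distinct tag.
    sed -n 's/^[0-9]\{8\} [0-9:]\{8\} \[\([a-z]*\)] .*/\1/p' "$log" \
        | sort | uniq -c
}

# Usage: count_rm_tags /path/to/nhc.log
```

On a Slurm cluster you would hope to see only "slurm" in the tally; any "pbs" entries point to the autodetection issue described here.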

Possible causes include: having the slurm-torque RPM package (or
equivalent) installed on your system, having TORQUE itself installed,
or having a /var/spool/torque directory or pbsnodes command present on
the system when NHC runs.
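
The causes listed above can be checked by hand on a node. A rough sketch follows; `find_pbs_indicators` is a hypothetical name, the package query only applies to RPM-based systems, and NHC's own detection logic may look at more (or different) things than these three checks.

```shell
#!/bin/sh
# Hypothetical helper: report items that might lead NHC to autodetect PBS.
# The optional argument overrides the spool directory to look for.
find_pbs_indicators() {
    spool="${1:-/var/spool/torque}"
    if command -v pbsnodes >/dev/null 2>&1; then
        echo "pbsnodes command found: $(command -v pbsnodes)"
    fi
    if [ -d "$spool" ]; then
        echo "directory exists: $spool"
    fi
    # Only query the package database where rpm is available.
    if command -v rpm >/dev/null 2>&1 && rpm -q slurm-torque >/dev/null 2>&1; then
        echo "package installed: slurm-torque"
    fi
}

find_pbs_indicators
```

No output means none of the three indicators was found on that node.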

To resolve the issue, either uninstall/remove the offending item(s) or
set NHC_RM=slurm either on the nhc command line when you run it (e.g.,
"/usr/sbin/nhc -a -l - NHC_RM=slurm") or in /etc/sysconfig/nhc. See
Ole's report on this issue on GitHub: https://github.com/mej/nhc/issues/20
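
For the /etc/sysconfig/nhc route, the file would contain something like the fragment below. This is a sketch based only on the two variables mentioned in this message; the MARK_OFFLINE line is left commented out because most Slurm integrations want NHC to keep marking nodes.

```shell
# /etc/sysconfig/nhc (sketch)
# Force the resource manager instead of relying on autodetection.
NHC_RM=slurm
# Optional: uncomment to stop NHC from marking nodes online/offline itself.
# MARK_OFFLINE=0
```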

HTH!
Michael

--
Michael E. Jennings <m...@lanl.gov>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341 W: +1 (505) 606-0605

Erik Ellestad

Jul 21, 2021, 3:30:52 PM
to LBNL Node Health Check, Michael Jennings, n...@lbl.gov, sim...@lafayette.edu
For what it is worth, I had the best luck adding "NHC_RM=slurm" to /etc/sysconfig/nhc and then integrating that as part of my nhc install and node update scripts.

But, yes, NHC will not correctly mark nodes down or up if it detects [pbs] as the Resource Manager on a Slurm node.
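
For anyone scripting this, a minimal sketch of an install-script step is shown below. `ensure_nhc_rm_slurm` is a hypothetical helper, not part of NHC; it assumes the setting appears as a plain `NHC_RM=` line and that the script runs with permission to write the file.

```shell
#!/bin/sh
# Hypothetical helper: make sure the NHC sysconfig file sets NHC_RM=slurm.
# Running it repeatedly does not add duplicate lines.
ensure_nhc_rm_slurm() {
    conf="${1:-/etc/sysconfig/nhc}"
    touch "$conf" || return 1
    if grep -q '^NHC_RM=' "$conf"; then
        # Rewrite an existing setting (e.g. NHC_RM=pbs) in place.
        tmp=$(mktemp) || return 1
        sed 's/^NHC_RM=.*/NHC_RM=slurm/' "$conf" > "$tmp" && cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        # Add a newline first if the file does not end with one.
        if [ -s "$conf" ] && [ -n "$(tail -c 1 "$conf")" ]; then
            echo >> "$conf"
        fi
        echo 'NHC_RM=slurm' >> "$conf"
    fi
}

# Usage (as root): ensure_nhc_rm_slurm /etc/sysconfig/nhc
```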

Erik
