Hi Jason!
Trey's reply was absolutely correct, but I did want to point out one
additional item I noticed.
> I'm attempting to integrate Node Health Check (NHC) with SLURM, such that
> it will run it periodically, and be able to offline a node with an issue,
> etc. Pretty typical stuff.
>
> ...
>
> 20190905 09:12:43 [pbs] /usr/libexec/nhc/node-mark-online node03.cluster
> /usr/libexec/nhc/node-mark-online: Skipping node node03.cluster ( )
Two things worth pointing out here: One, you can turn off the default
behavior of NHC (i.e., to mark nodes online/offline via the resource
manager based on the outcome of the checks) by setting the
MARK_OFFLINE variable to 0 as documented here:
https://github.com/mej/nhc#supported-variables
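As a minimal sketch of that first point, assuming you keep NHC's settings in
/etc/sysconfig/nhc (adjust the path if your installation differs), the
setting would look like this:

```shell
# /etc/sysconfig/nhc (assumed location)
# Run the checks and report failures, but do not mark nodes
# online/offline through the resource manager.
MARK_OFFLINE=0
```

The same assignment should also work as an argument on the nhc command
line, in the same way as the NHC_RM example further down.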
Two, and more importantly, notice the "[pbs]" at the beginning of the
line there. That means that NHC's built-in autodetection for which RM
you're using found PBS (e.g., OpenPBS, PBS Pro, or TORQUE) rather than
Slurm. Since you said you intend to integrate with Slurm, this means
NHC is trying to mark nodes online/offline through the wrong resource
manager.
Possible causes include: having the slurm-torque RPM package (or
equivalent) installed on your system, having TORQUE itself installed,
or having a /var/spool/torque directory or pbsnodes command present on
the system when NHC runs.
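If it helps, here is a rough way to look for those items on a node. This is
only a sketch that tests for the indicators listed above; it is not NHC's
actual detection code, and the function name check_pbs_traces is made up for
illustration. The rpm query is skipped on systems without rpm.

```shell
#!/bin/sh
# Hypothetical helper: report which PBS/TORQUE indicators are present.
# Arguments default to the items named above; override them to test
# other paths, commands, or package names.
check_pbs_traces() {
    spool_dir="${1:-/var/spool/torque}"
    pbs_cmd="${2:-pbsnodes}"
    pkg_name="${3:-slurm-torque}"
    found=0

    # A leftover spool directory
    if [ -d "$spool_dir" ]; then
        echo "directory present: $spool_dir"
        found=1
    fi

    # A pbsnodes command somewhere in PATH
    if command -v "$pbs_cmd" >/dev/null 2>&1; then
        echo "command present: $pbs_cmd"
        found=1
    fi

    # The wrapper package (only checked where rpm exists)
    if command -v rpm >/dev/null 2>&1 && rpm -q "$pkg_name" >/dev/null 2>&1; then
        echo "package installed: $pkg_name"
        found=1
    fi

    [ "$found" -eq 1 ] && return 0
    echo "no PBS/TORQUE traces found"
    return 1
}
```

Calling check_pbs_traces with no arguments on the affected node prints one
line per item found and returns nonzero if none of them are present.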
To resolve the issue, either uninstall/remove the offending item(s) or
set NHC_RM=slurm either on the nhc command line when you run it (e.g.,
"/usr/sbin/nhc -a -l - NHC_RM=slurm") or in /etc/sysconfig/nhc. See
Ole's report on this issue on GitHub:
https://github.com/mej/nhc/issues/20
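For the configuration-file route, the entry would be a single line along
these lines (a sketch; the comments are mine):

```shell
# /etc/sysconfig/nhc
# Force NHC to use Slurm instead of relying on autodetection.
NHC_RM=slurm
```

After that change, the bracketed tag at the start of the log lines should
read "[slurm]" rather than "[pbs]" if the override took effect.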
HTH!
Michael
--
Michael E. Jennings <m...@lanl.gov>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605