Hi Lachlan,
I noticed "joining list?" appeared in the subject of your e-mail, but
the body didn't mention anything about it. Just FYI, to subscribe to
the Group, you can either send an e-mail to nhc+su...@lbl.gov or
visit https://groups.google.com/a/lbl.gov/group/nhc/.
Hope that helps!
On Friday, 23 September 2016, at 03:39:03 (+0000),
Simpson Lachlan wrote:
> We have Centos 7.2, SLURM 16.05 and lbnl-nhc 1.4.2
>
> I was wondering about how to deploy the nhc.conf file.
>
> Currently we have nhc installed on all nodes (large, hardware) and the login and head nodes (small, virtual).
>
> The login and head node are a lot smaller wrt RAM and CPUs etc.
>
> Should I have a different conf on each flavour?
No, definitely not! As a general rule, you only want to use 1 NHC
config file for your entire system/cluster. In some cases, it's
possible to use a single config across many clusters; it just depends
on your setup. But definitely don't use 1 per node flavor! :-)
As mentioned in https://github.com/mej/nhc#configuration-file-syntax,
the beginning of each line of the config file (i.e., the part before
the double-pipe "||" delimiter) is matched against the hostname of
each node as it runs NHC, and lines that don't match are ignored. So
if you had a cluster of nodes named "n01" through "n50" attached to a
master node named "master", you could use "n*" for compute node checks
and "master*" for login/head node checks, like this:
n* || check_hw_cpuinfo 2 24 24
n* || check_hw_physmem 64G 64G 5%
master* || check_hw_cpuinfo 2 8 16
master* || check_hw_physmem 24G 24G 5%
Since "n*" doesn't match "master", the master node won't run the first
two checks; likewise, compute nodes won't run the last two.
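To make the matching rule concrete, here is a rough Python sketch of the idea. It is only an illustration: NHC itself is a bash script and does its own matching, and the function name checks_for and the use of Python's fnmatch module are my assumptions, not anything from NHC.

```python
from fnmatch import fnmatchcase

# The four example lines from above, as they might appear in nhc.conf.
CONFIG = """\
n*      || check_hw_cpuinfo 2 24 24
n*      || check_hw_physmem 64G 64G 5%
master* || check_hw_cpuinfo 2 8 16
master* || check_hw_physmem 24G 24G 5%
"""


def checks_for(hostname, config_text=CONFIG):
    """Return the checks whose match string (a glob here) matches hostname.

    Hypothetical helper for illustration; it handles glob targets only.
    """
    selected = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        target, sep, check = line.partition("||")
        if sep and fnmatchcase(hostname, target.strip()):
            selected.append(check.strip())
    return selected


print(checks_for("n07"))     # compute node: only the "n*" lines
print(checks_for("master"))  # master node: only the "master*" lines
```

One thing the sketch makes obvious: a glob like "n*" would also match a host named, say, "nfs01", so pick match strings that cannot collide with other hostnames on your system.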
Match strings can be globs, as shown above, or they can be regular
expressions, node ranges, or even external match expressions. Details
are here:
https://github.com/mej/nhc#match-strings
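As a rough illustration of how the different match-string forms behave, here is a simplified Python sketch. It assumes the forms described in the README: a regular expression written between slashes and a node range written between braces, with anything else treated as a glob. The helper names matches and expand_range are hypothetical, the range expansion handles only a single "[lo-hi]" group, and external match expressions are not covered; NHC's real implementation is in bash and is more capable.

```python
import re
from fnmatch import fnmatchcase


def expand_range(expr):
    """Expand a single bracketed range, e.g. "n[01-50]" -> n01 ... n50.

    Simplified assumption: one numeric "[lo-hi]" group, zero-padded to the
    width of "lo". Anything else is returned unchanged as a one-item list.
    """
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if not m:
        return [expr]
    prefix, lo, hi, suffix = m.groups()
    width = len(lo)
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(lo), int(hi) + 1)]


def matches(target, hostname):
    """Decide whether a match string selects hostname (illustrative only)."""
    if len(target) >= 2 and target.startswith("/") and target.endswith("/"):
        # /regex/ form: search the hostname with the enclosed pattern
        return re.search(target[1:-1], hostname) is not None
    if target.startswith("{") and target.endswith("}"):
        # {noderange} form: hostname must be one of the expanded names
        return hostname in expand_range(target[1:-1])
    # default: glob form
    return fnmatchcase(hostname, target)


for target in ("n*", "/^n[0-9]+$/", "{n[01-50]}"):
    print(target, [h for h in ("n07", "n51", "nfs01", "master") if matches(target, h)])
```

The regex and node-range forms are handy when a glob would be too loose, for example when "n*" would also pick up a host like "nfs01".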
If you're not sure what checks to run, I would recommend running the
"nhc-genconf" command on one node of each type you have, then combining
the auto-generated configurations into a single file. Additional info
here:
https://github.com/mej/nhc#config-file-auto-generation
> Does the login/head node not need nhc?
NHC is great for login nodes, master nodes, data transfer nodes, and
all sorts of systems, not just compute nodes! However, since only
compute nodes (typically) have a node health check script being run by
a scheduler or resource manager (like SLURM or TORQUE), for those
other types of nodes, you'll need to run it via cron or something
similar.
We use the nhc-wrapper script which comes with NHC to run it on
non-compute-node systems. Here's what our crontab entry looks like:
*/5 * * * * root /usr/sbin/nhc-wrapper -M m...@lbl.gov -X 6h -a -v
Details on nhc-wrapper are here:
https://github.com/mej/nhc#periodic-execution
Hope that helps!
Michael
--
Michael Jennings <m...@lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615