joining list? Also: hardware distinction


Simpson Lachlan

Sep 23, 2016, 12:31:46 PM
to n...@lbl.gov
Hi,

We have Centos 7.2, SLURM 16.05 and lbnl-nhc 1.4.2

I was wondering about how to deploy the nhc.conf file.

Currently we have nhc installed on all nodes (large, hardware) and the login and head nodes (small, virtual).

The login and head nodes are much smaller with respect to RAM, CPUs, etc.

Should I have a different conf on each flavour? Does the login/head node not need nhc?

Cheers
L.

Michael Jennings

Sep 23, 2016, 5:56:51 PM
to Simpson Lachlan, n...@lbl.gov
Hi Lachlan,

I noticed "joining list?" appeared in the subject of your e-mail, but
the body didn't mention anything about it. Just FYI, to subscribe to
the Group, you can either send an e-mail to nhc+su...@lbl.gov, or
visit https://groups.google.com/a/lbl.gov/group/nhc/ instead.

Hope that helps!


On Friday, 23 September 2016, at 03:39:03 (+0000),
Simpson Lachlan wrote:

> We have Centos 7.2, SLURM 16.05 and lbnl-nhc 1.4.2
>
> I was wondering about how to deploy the nhc.conf file.
>
> Currently we have nhc installed on all nodes (large, hardware) and the login and head nodes (small, virtual).
>
> The login and head node are a lot smaller wrt RAM and CPUs etc.
>
> Should I have a different conf on each flavour?

No, definitely not! As a general rule, you only want to use one NHC
config file for your entire system/cluster. In some cases, it's
possible to use a single config across many clusters; it just depends
on your setup. But definitely don't use one per node flavor! :-)

As mentioned in https://github.com/mej/nhc#configuration-file-syntax,
the beginning of each line of the config file (i.e., the part before
the double-pipe "||" delimiter) is matched against the hostname of
each node as it runs NHC, and lines that don't match are ignored. So
if you had a cluster of nodes named "n01" through "n50" attached to a
master node named "master," you could use "n*" for compute node checks
and "master*" for login/head node checks, like this:

n* || check_hw_cpuinfo 2 24 24
n* || check_hw_physmem 64G 64G 5%
master* || check_hw_cpuinfo 2 8 16
master* || check_hw_physmem 24G 24G 5%

Since "n*" doesn't match "master," the master node won't run the first
two checks; likewise, compute nodes won't run the last two.
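
The glob-vs-hostname logic above can be sketched with a plain shell
`case` statement (a simplified illustration, not NHC's actual
implementation; the hostnames are just examples):

```shell
#!/bin/sh
# Simplified sketch of NHC's match logic: a check line applies only
# when its match string globs the node's hostname.
matches() {
    # $1 = match string (glob), $2 = hostname
    case "$2" in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

matches 'n*' 'n01'         && echo "n01: runs compute checks"
matches 'n*' 'master'      || echo "master: skips compute checks"
matches 'master*' 'master' && echo "master: runs head-node checks"
```

Note that "master*" also matches a hostname like "master2", which is
usually what you want if you later add a second head node.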

Match strings can be globs, as shown above, or they can be regular
expressions, node ranges, or even external match expressions. Details
are here: https://github.com/mej/nhc#match-strings
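
For instance (the hostnames and check choices here are purely
illustrative), a regex match and a node-range match might look like:

```
/^n[0-9]+$/ || check_fs_mount_rw -f /tmp
{n[01-50]}  || check_ps_service -u root -S sshd
```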

If you're not sure what checks to run, I would recommend running the
"nhc-genconf" command on one of each type of node you have, then
combining the auto-generated configurations into a single file.
Additional info here: https://github.com/mej/nhc#config-file-auto-generation
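
As a sketch of that merge step (the filenames and hostnames below are
assumptions, and as I understand it nhc-genconf prefixes each
generated line with the node's own hostname), you'd generalize those
per-host match strings into globs before concatenating:

```shell
# Stand-ins for the files nhc-genconf would produce on each node type:
printf 'n01 || check_hw_physmem 64G 64G 5%%\n'    > compute.conf.auto
printf 'master || check_hw_physmem 24G 24G 5%%\n' > head.conf.auto

# Rewrite the per-host prefixes into globs, then combine:
sed 's/^n01 /n* /'         compute.conf.auto  >  nhc.conf
sed 's/^master /master* /' head.conf.auto     >> nhc.conf
cat nhc.conf
```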

> Does the login/head node not need nhc?

NHC is great for login nodes, master nodes, data transfer nodes, and
all sorts of systems, not just compute nodes! However, since only
compute nodes (typically) have a node health check script being run by
a scheduler or resource manager (like SLURM or TORQUE), for those
other types of nodes, you'll need to run it via cron or something
similar.

We use the nhc-wrapper script which comes with NHC to run it on
non-compute-node systems. Here's what our crontab entry looks like:

*/5 * * * * root /usr/sbin/nhc-wrapper -M m...@lbl.gov -X 6h -a -v

Details on nhc-wrapper are here: https://github.com/mej/nhc#periodic-execution

Hope that helps!
Michael

--
Michael Jennings <m...@lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615

Simpson Lachlan

Sep 25, 2016, 9:09:17 PM
to Michael Jennings, n...@lbl.gov
Thanks Michael. I couldn't see the "join group" button for some reason, but I have joined now.

Appreciate your reply; it all makes sense now, although I saw this error:

/usr/sbin/nhc: line 295: /etc/nhc/nhc.conf.auto: Permission denied

On one of my compute nodes when I tried to run it locally with

nhc -d /etc/nhc/nhc.conf.auto

I was running as root...

Cheers
L.

Michael Jennings

Sep 26, 2016, 1:38:08 PM
to n...@lbl.gov
On Sun, Sep 25, 2016 at 6:08 PM, Simpson Lachlan
<Lachlan...@petermac.org> wrote:

> Appreciate your reply, all makes sense now, although I saw this error:
>
> /usr/sbin/nhc: line 295: /etc/nhc/nhc.conf.auto: Permission denied
>
> On one of my compute nodes when I tried to run it locally with
>
> nhc -d /etc/nhc/nhc.conf.auto

The -d option turns on debugging. I think you meant to run "nhc -c
/etc/nhc/nhc.conf.auto" instead. :-)
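
In other words, to get debugging output while also pointing NHC at the
auto-generated file, the two options can be combined:

```
nhc -d -c /etc/nhc/nhc.conf.auto
```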