ANNOUNCE: Warewulf Node Health Check 1.2 Released

Michael Jennings

Oct 3, 2012, 7:00:00 PM
to ware...@lbl.gov
As no problems with the beta were reported, we have now released
version 1.2 as stable. The documentation on the web site has been
updated with all the new features and checks.

In case you missed the beta announcement, here are the release notes:

If you're not familiar with NHC, it's an effort to create a framework and
implementation for the node health check scripts often used by
resource managers and schedulers as well as for periodic independent
node sanity checks. More information and complete documentation may
be found at:

http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

A number of significant changes have been made for the 1.2 release:

1. UNIT TESTS -- NHC now has a complete unit test framework and test
suite. All checks are tested for correct functionality based on
sample data as part of the package build. Even the driver script
itself was split into parts and subjected to unit testing. To run
the unit tests by hand, simply run "make test" in the top-level
source directory.

2. VARIABLES IN CONFIG FILES -- You now have the ability to set shell
variables in the NHC configuration file to alter settings for
individual checks or for NHC as a whole. Things like timeouts,
marking nodes offline, debugging, etc. may now be altered within
the configuration file. In addition, the syntax is the same as
that used for checks, so each variable assignment can be turned
on/off based on the match expression just like any other check.
This allows you to, for example, activate debugging only on a
specific node or nodegroup, or specify the path to "pbsnodes"
differently for different clusters.

NOTE: A side effect of this is that shell metacharacters in
config files (particularly those in regexps) now need to be
quoted.
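
As a rough illustration of the idea in item 2, here is a minimal sketch of how "target || command" lines could be applied on a per-node basis. It is not NHC's actual parser: the apply_config function, the glob-only matching, the hostnames, and the variable names DEBUG, TIMEOUT, and NAME_RE are all assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical sketch only; NHC's real config parser is more capable.
# Reads "target || command" lines on stdin and evaluates the command
# part when the target (a shell glob here) matches the given hostname.
apply_config() {
    local host="$1" line target cmd
    while IFS= read -r line; do
        # Skip blank lines and comments.
        if [[ -z "$line" || "$line" == \#* ]]; then
            continue
        fi
        target="${line%%||*}"            # text before the first "||"
        cmd="${line#*||}"                # text after the first "||"
        target="${target//[[:space:]]/}" # strip whitespace around the target
        # Unquoted right-hand side: $target is treated as a glob pattern.
        if [[ "$host" == $target ]]; then
            eval "$cmd"
        fi
    done
}

# Example run for a node named n0042 (all names are illustrative).
apply_config n0042 <<'EOF'
# Defaults for every node
*      || export DEBUG=0 TIMEOUT=20
# Debugging on one node only
n0042  || export DEBUG=1
# Not applied here: n0042 does not match gpu*
gpu*   || export TIMEOUT=60
# The regexp is quoted so the shell does not interpret "|", "(" or "$"
*      || export NAME_RE='^(foo|bar)$'
EOF

echo "DEBUG=$DEBUG TIMEOUT=$TIMEOUT NAME_RE=$NAME_RE"
```

Because the command part of each line is handed to the shell, an unquoted regexp such as ^(foo|bar)$ would be parsed as shell syntax rather than passed through as text, which is why the quoting mentioned in the note above matters.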

3. MORE SETTINGS -- Several items which were previously hard-coded,
such as the pbsnodes path (and the options to it), the node daemon
name, etc. are now alterable via variables (in the config file as
mentioned in item #2, or in /etc/sysconfig/nhc). This helps to
enhance the flexibility and portability of NHC.

4. NVIDIA HEALTHMON SUPPORT -- The folks at NVIDIA have just released
version 3.304.3 of their Tesla Deployment Kit (TDK) as a
development/release candidate. It can be downloaded from:
http://developer.nvidia.com/cuda/tesla-deployment-kit

It contains their new nvidia-healthmon script which runs a slew of
checks on supported GPU devices. We've already got built-in
support for this tool (via a new check called check_nv_healthmon)
thanks to a collaboration with their developers, and we'll be
continuing to work together with them on enhancements going
forward.

5. DETACHED MODE SUPPORT -- Due to the single-threadedness of the
health check in current versions of TORQUE (and possibly other RMs
as well), many sites prefer to run checks independently and store
the results in a file which a TORQUE-spawned health check script
can act upon later. NHC now has native support for this mode of
operation. It will detach and run all checks in a background
process while the foreground process handles the status recorded
from the previously-run checks.
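
The detached pattern described in item 5 can be sketched roughly as follows. This is an illustration of the general technique, not NHC's implementation: the function names, the RESULTS_FILE path, and the one-line "status message" file format are all assumptions.

```shell
#!/bin/bash
# Hypothetical sketch of the detached pattern; not NHC's implementation.
RESULTS_FILE="${RESULTS_FILE:-/tmp/nhc-sketch.status}"  # assumed location

# Stand-in for the real checks: print a message and return non-zero on failure.
run_all_checks() {
    if [[ ! -d /tmp ]]; then
        echo "/tmp is missing"
        return 1
    fi
    return 0
}

detached_health_check() {
    local status=0 msg="" out rc
    # Foreground: pick up the status recorded by the previous run, if any.
    if [[ -s "$RESULTS_FILE" ]]; then
        read -r status msg < "$RESULTS_FILE"
    fi
    # Background: start a fresh run; write to a temp file and rename it so
    # a reader never sees a half-written result.
    (
        out=$(run_all_checks 2>&1) && rc=0 || rc=$?
        printf '%s %s\n' "$rc" "$out" > "$RESULTS_FILE.tmp"
        mv -f "$RESULTS_FILE.tmp" "$RESULTS_FILE"
    ) </dev/null >/dev/null 2>&1 &
    # Report the previous result without waiting for the new run.
    if [[ -n "$msg" ]]; then
        echo "ERROR: $msg"
    fi
    return "$status"
}
```

In this sketch the caller never blocks on the checks themselves, so a hung check cannot stall the resource manager. The trade-off is that each call reports the result of the previous run, and the very first call reports success because nothing has been recorded yet.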

6. MORE NEW CHECKS -- A new set of checks which use the output of
"df" has been added. This allows NHC to look at filesystem size
(check_fs_size <minsize> <maxsize>), amount of space used
(check_fs_used <size>), and/or amount of free space (check_fs_free
<size>). Size values can be expressed numerically with an
optional suffix (e.g., 100G, 10TB, etc.) or as a percentage (e.g.,
97%). Note that the "df" command is known to hang in certain
circumstances, so use with caution!

Support for check_ps_blacklist was also added to look for
processes that *shouldn't* be running on the nodes (opposite of
check_ps_daemon).
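
To make the size syntax used by the df-based checks in item 6 concrete, here is a minimal sketch of how values such as 100G, 10TB, or 97% might be interpreted. The helper names size_to_kb and free_at_least are hypothetical, and the choices of 1024-based multipliers and of kB as the unit for a bare number are assumptions, not a statement of what NHC does internally.

```shell
#!/bin/bash
# Hypothetical helpers; names, units, and multipliers are assumptions.

# Convert a size such as "512", "100G", or "10TB" to kB (bare number = kB).
size_to_kb() {
    local size="$1" num suffix
    if [[ "$size" =~ ^([0-9]+)(([kKmMgGtTpP])[bB]?)?$ ]]; then
        num="${BASH_REMATCH[1]}"
        suffix="${BASH_REMATCH[3]}"
    else
        echo "invalid size: $size" >&2
        return 1
    fi
    # 10# forces base 10 so a leading zero is not read as octal.
    case "$suffix" in
        ""|k|K) echo $(( 10#$num )) ;;
        m|M)    echo $(( 10#$num * 1024 )) ;;
        g|G)    echo $(( 10#$num * 1024 ** 2 )) ;;
        t|T)    echo $(( 10#$num * 1024 ** 3 )) ;;
        p|P)    echo $(( 10#$num * 1024 ** 4 )) ;;
    esac
}

# Succeed if free space meets the minimum, given as a size or a percentage.
# Arguments: free space in kB, total size in kB, minimum (e.g. 100G or 5%).
free_at_least() {
    local free_kb="$1" total_kb="$2" min="$3" min_kb
    if [[ "$min" =~ ^([0-9]+)%$ ]]; then
        # Percentage threshold: compare free space as a share of the total.
        (( free_kb * 100 >= 10#${BASH_REMATCH[1]} * total_kb ))
    else
        min_kb=$(size_to_kb "$min") || return 2
        (( free_kb >= min_kb ))
    fi
}
```

With these definitions, "size_to_kb 100G" prints 104857600 and "size_to_kb 10TB" prints 10737418240, while "free_at_least 50 1000 5%" succeeds and "free_at_least 49 1000 5%" fails. The free and total figures would come from "df" output in practice; since "df" can hang, a real check would want to guard that call.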

Please send feedback to the list or to me directly (m...@lbl.gov).

Michael

--
Michael Jennings <m...@lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615