ANNOUNCE: Warewulf Node Health Check 1.2 Released

Michael Jennings, 10/3/12 4:00 PM

As no problems with the beta were reported, we have now released
version 1.2 as stable.  The documentation on the web site has been
updated with all the new features and checks.

In case you missed the beta announcement, here are the release notes:

If you're not familiar with NHC, it's an effort to create a framework and
implementation for the node health check scripts often used by
resource managers and schedulers as well as for periodic independent
node sanity checks.  More information and complete documentation may
be found at:

http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

A number of significant changes have been made for the 1.2 release:

1.  UNIT TESTS -- NHC now has a complete unit test framework and test
    suite.  All checks are tested for correct functionality based on
    sample data as part of the package build.  Even the driver script
    itself was split into parts and subjected to unit testing.  To run
    the unit tests by hand, simply run "make test" in the top-level
    source directory.

2.  VARIABLES IN CONFIG FILES -- You now have the ability to set shell
    variables in the NHC configuration file to alter settings for
    individual checks or for NHC as a whole.  Things like timeouts,
    marking nodes offline, debugging, etc. may now be altered within
    the configuration file.  In addition, the syntax is the same as
    that used for checks, so each variable assignment can be turned
    on/off based on the match expression just like any other check.
    This allows you to, for example, activate debugging only on a
    specific node or nodegroup, or specify the path to "pbsnodes"
    differently for different clusters.

    NOTE:  A side effect of this is that shell metacharacters in
    config files (particularly those in regexps) now need to be
    quoted.
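
To make item 2 concrete, here is a minimal sketch of how a
"target || command" line can double as a conditional variable
assignment.  This is not NHC's actual parser; the function name
apply_config, the DEMO_* variables, the host names, and the pbsnodes path
are all made up for illustration.  The real variable names are listed in
the NHC documentation.

```shell
# Hypothetical mini-parser, for illustration only (not NHC's real code).
# Each config line has the form:   <target glob> || <shell command>
apply_config() {
    host=$1
    config=$2
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        target=${line%%||*}               # text before the first "||"
        cmd=${line#*||}                   # text after the first "||"
        target=$(printf '%s' "$target" | tr -d '[:space:]')
        case $host in
            $target) eval "$cmd" ;;       # glob matched: run the command
        esac
    done <<EOF
$config
EOF
}

# Sample configuration; every name in it is an assumption.
sample_config="
# debugging only on one node
n0001    || DEMO_DEBUG=1
# a different pbsnodes path on nodes whose names start with gpu
gpu*     || DEMO_PBSNODES=/opt/torque/bin/pbsnodes
# the regexp is quoted so the shell does not interpret ( | ) itself
*        || DEMO_PATTERN='^(sda|sdb)[0-9]+'
"

DEMO_DEBUG=0
DEMO_PBSNODES=pbsnodes
apply_config gpu07 "$sample_config"
echo "debug=$DEMO_DEBUG pbsnodes=$DEMO_PBSNODES pattern=$DEMO_PATTERN"
```

Run for the host "gpu07", this prints
"debug=0 pbsnodes=/opt/torque/bin/pbsnodes pattern=^(sda|sdb)[0-9]+".
Because the command part of each line is handed to the shell, an
unquoted "(", "|" or "*" in a regexp would be interpreted by the shell
instead of reaching the check, which is the reason for the quoting note
above.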

3.  MORE SETTINGS -- Several items which were previously hard-coded,
    such as the pbsnodes path (and the options to it), the node daemon
    name, etc. are now alterable via variables (in the config file as
    mentioned in item #2, or in /etc/sysconfig/nhc).  This helps to
    enhance the flexibility and portability of NHC.

4.  NVIDIA HEALTHMON SUPPORT -- The folks at nVidia have just released
    version 3.304.3 of their Tesla Deployment Kit (TDK) as a
    development/release candidate.  It can be downloaded from:
    http://developer.nvidia.com/cuda/tesla-deployment-kit

    It contains their new nvidia-healthmon script which runs a slew of
    checks on supported GPU devices.  We've already got built-in
    support for this tool (via a new check called check_nv_healthmon)
    thanks to a collaboration with their developers, and we'll be
    continuing to work together with them on enhancements going
    forward.

5.  DETACHED MODE SUPPORT -- Due to the single-threadedness of the
    health check in current versions of TORQUE (and possibly other RMs
    as well), many sites prefer to run checks independently and store
    the results in a file which a TORQUE-spawned health check script
    can act upon later.  NHC now has native support for this mode of
    operation.  It will detach and run all checks in a background
    process while the foreground process handles the status recorded
    from the previously-run checks.
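
The detached pattern described in item 5 can be sketched roughly as
follows.  This is only an illustration of the idea, not NHC's
implementation; the status-file location, the one-line "<rc> <message>"
file format, and the function names run_all_checks and detached_check
are assumptions.

```shell
# Hypothetical sketch of a detached health check (not NHC's real code).
RESULT_FILE=${RESULT_FILE:-/tmp/demo_nhc.status}   # assumed location

# Stand-in for the real (possibly slow) checks: prints a failure message
# and returns non-zero when something is wrong.
run_all_checks() {
    [ -d /tmp ] || { echo "/tmp is missing"; return 1; }
    return 0
}

detached_check() {
    # 1. Report the outcome recorded by the previous background run, if any.
    prev_rc=0
    prev_msg=
    if [ -r "$RESULT_FILE" ]; then
        read -r prev_rc prev_msg < "$RESULT_FILE" || true
        : "${prev_rc:=0}"                 # empty file counts as "healthy"
        [ "$prev_rc" -eq 0 ] || echo "ERROR: $prev_msg"
    fi
    # 2. Start a fresh run in the background.  It writes to a temporary
    #    file and renames it, so a reader never sees a partial result.
    (
        msg=$(run_all_checks); rc=$?
        printf '%s %s\n' "$rc" "$msg" > "$RESULT_FILE.tmp" &&
            mv "$RESULT_FILE.tmp" "$RESULT_FILE"
    ) >/dev/null 2>&1 &
    # 3. Return right away with the previous result.
    return "$prev_rc"
}
```

The caller (for example, a script spawned by the resource manager) only
ever waits for step 1, so a slow or hung check delays the next status
update rather than the caller itself.  The trade-off is that the
reported status is always one run old.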

6.  MORE NEW CHECKS -- A new set of checks which use the output of
    "df" has been added.  This allows NHC to look at filesystem size
    (check_fs_size <minsize> <maxsize>), amount of space used
    (check_fs_used <size>), and/or amount of free space (check_fs_free
    <size>).  Size values can be expressed numerically with an
    optional suffix (e.g., 100G, 10TB, etc.) or as a percentage (e.g.,
    97%).  Note that the "df" command is known to hang in certain
    circumstances, so use with caution!
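
As a rough illustration of the size syntax, the sketch below converts
suffixed values to kB and compares a usage figure against either an
absolute or a percentage limit.  The helper names size_to_kb and
fs_used_ok are hypothetical, and two details are assumptions rather than
documented NHC behavior: bare numbers are treated as kB, and suffixes
are powers of 1024.

```shell
# Convert a size such as 100G, 10TB or 512 into kB.
# Hypothetical helper for illustration; NHC's own parsing may differ.
size_to_kb() {
    num=${1%%[!0-9]*}                     # leading digits
    [ -n "$num" ] || return 1
    suffix=${1#"$num"}                    # whatever follows the digits
    suffix=$(printf '%s' "$suffix" | tr '[:lower:]' '[:upper:]')
    case $suffix in
        ''|K|KB) mult=1 ;;                # assumption: bare numbers are kB
        M|MB)    mult=1024 ;;
        G|GB)    mult=$((1024 * 1024)) ;;
        T|TB)    mult=$((1024 * 1024 * 1024)) ;;
        *) echo "unknown size suffix: $suffix" >&2; return 1 ;;
    esac
    echo $((num * mult))
}

# Succeed when <used_kb> of <total_kb> stays within <limit>, where
# <limit> is a size (e.g. 100G) or a percentage (e.g. 97%).
fs_used_ok() {
    used_kb=$1 total_kb=$2 limit=$3
    case $limit in
        *%) pct=${limit%\%}
            [ $((used_kb * 100)) -le $((total_kb * pct)) ] ;;
        *)  limit_kb=$(size_to_kb "$limit") || return 2
            [ "$used_kb" -le "$limit_kb" ] ;;
    esac
}
```

For example, "size_to_kb 100G" prints 104857600, and
"fs_used_ok 98 100 97%" fails because 98% exceeds the 97% limit.  The
used and total figures would come from the kB columns of "df" output.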

    Support for check_ps_blacklist was also added to look for
    processes that *shouldn't* be running on the nodes (opposite of
    check_ps_daemon).
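
The idea behind a process blacklist can be sketched as below.  This is
not the check_ps_blacklist code itself; the function name
no_blacklisted_procs and the process names in the usage line are
hypothetical.

```shell
# Read process names on stdin (one per line) and compare them with the
# blacklist given as arguments.  Succeeds only when none is running.
# Hypothetical helper, not NHC's check_ps_blacklist.
no_blacklisted_procs() {
    found=0
    while IFS= read -r proc; do
        for bad in "$@"; do
            if [ "$proc" = "$bad" ]; then
                echo "blacklisted process running: $proc"
                found=1
            fi
        done
    done
    [ "$found" -eq 0 ]
}

# Possible usage on a node (process names are examples only):
#   ps -e -o comm= | no_blacklisted_procs minerd nc
```

Taking the process list on stdin keeps the comparison separate from the
"ps" call, so the logic can be exercised against sample data in the same
spirit as the unit tests described in item 1.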

Please send feedback to the list or to me directly (m...@lbl.gov).

Michael

--
Michael Jennings <m...@lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615