ANNOUNCE: Warewulf Node Health Check 1.2 beta 1
Michael Jennings
9/10/12 3:52 PM

I'm happy to announce the first beta release of version 1.2 of the
Warewulf Node Health Check subproject. If you're not familiar, it's an
effort to create a framework and implementation for the node health check
scripts often used by resource managers and schedulers as well as for
periodic independent node sanity checks. More information and complete
documentation may be found at:

    http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

A number of significant changes have been made for the 1.2 release:

1. UNIT TESTS -- NHC now has a complete unit test framework and test
   suite. All checks are tested for correct functionality based on sample
   data as part of the package build. Even the driver script itself was
   split into parts and subjected to unit testing. To run the unit tests
   by hand, simply run "make test" in the top-level source directory.

2. VARIABLES IN CONFIG FILES -- You now have the ability to set shell
   variables in the NHC configuration file to alter settings for
   individual checks or for NHC as a whole. Things like timeouts, marking
   nodes offline, debugging, etc. may now be altered within the
   configuration file. In addition, the syntax is the same as that used
   for checks, so each variable assignment can be turned on/off based on
   the match expression just like any other check. This allows you to,
   for example, activate debugging only on a specific node or nodegroup,
   or specify the path to "pbsnodes" differently for different clusters.

   NOTE: A side effect of this is that shell metacharacters in config
   files (particularly those in regexps) now need to be quoted.

3. MORE SETTINGS -- Several items which were previously hard-coded, such
   as the pbsnodes path (and the options to it), the node daemon name,
   etc. are now alterable via variables (in the config file as mentioned
   in item #2, or in /etc/sysconfig/nhc). This helps to enhance the
   flexibility and portability of NHC.

4. NVIDIA HEALTHMON SUPPORT -- The folks at nVidia have just released
   version 3.304.3 of their Tesla Deployment Kit (TDK) as a
   development/release candidate.
   It can be downloaded from:

       http://developer.nvidia.com/cuda/tesla-deployment-kit

   It contains their new nvidia-healthmon script which runs a slew of
   checks on supported GPU devices. We've already got built-in support
   for this tool (via a new check called check_nv_healthmon) thanks to a
   collaboration with their developers, and we'll be continuing to work
   together with them on enhancements going forward.

5. DETACHED MODE SUPPORT -- Due to the single-threadedness of the health
   check in current versions of TORQUE (and possibly other RMs as well),
   many sites prefer to run checks independently and store the results in
   a file which a TORQUE-spawned health check script can act upon later.
   NHC now has native support for this mode of operation. It will detach
   and run all checks in a background process while the foreground
   process handles the status recorded from the previously-run checks.

6. MORE NEW CHECKS -- A new set of checks which use the output of "df"
   has been added. This allows NHC to look at filesystem size
   (check_fs_size <minsize> <maxsize>), amount of space used
   (check_fs_used <size>), and/or amount of free space
   (check_fs_free <size>). Size values can be expressed numerically with
   an optional suffix (e.g., 100G, 10TB, etc.) or as a percentage (e.g.,
   97%). Note that the "df" command is known to hang in certain
   circumstances, so use with caution!

   Support for check_ps_blacklist was also added to look for processes
   that *shouldn't* be running on the nodes (opposite of
   check_ps_daemon).

None of the above is documented on the wiki page yet since it's still in
beta. We're in the process of rolling this beta out to our systems here,
but since we may not be using all the new features, I could really use
some help with beta testing! :-)

You can download the beta from:

    http://warewulf.lbl.gov/downloads/beta/warewulf-nhc-1.2beta1.tar.gz

Please send feedback to the list or to me directly (m...@lbl.gov).

Enjoy!
:-)

Michael

--
Michael Jennings <m...@lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615
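
To illustrate the size values described in item 6 (a number with an optional suffix such as 100G or 10TB, or a percentage such as 97%), here is a minimal sketch of how such values might be interpreted. It is not taken from the NHC source. The function names size_to_kb and free_space_ok are hypothetical, and the sketch assumes binary multiples (1 MB = 1024 kB) and that a bare number is already in kB, matching the 1 kB blocks that "df -k" reports.

```shell
# Hypothetical helper (not part of NHC): convert a size value to kB.
# Assumptions: binary multiples (1 MB = 1024 kB); a bare number is in kB.
# Written for POSIX sh, so variables are global rather than local.
size_to_kb() {
    value=$1
    # Strip an optional trailing "B" so "10TB" and "10T" are equivalent.
    case "$value" in
        *[Bb]) value=${value%?} ;;
    esac
    # Split into the numeric part and a one-letter suffix, if any.
    case "$value" in
        *[KkMmGgTtPp]) number=${value%?}; suffix=${value#"$number"} ;;
        *)             number=$value;     suffix=k ;;
    esac
    # Reject anything that is not a plain non-negative integer.
    case "$number" in
        ''|*[!0-9]*) echo "invalid size: $1" >&2; return 1 ;;
    esac
    case "$suffix" in
        [Kk]) multiplier=1 ;;
        [Mm]) multiplier=1024 ;;
        [Gg]) multiplier=1048576 ;;
        [Tt]) multiplier=1073741824 ;;
        [Pp]) multiplier=1099511627776 ;;
    esac
    echo $((number * multiplier))
}

# Hypothetical helper: succeed if free space meets the given minimum.
# Arguments: free kB, total kB, minimum (size value or percentage of total).
# Returns 0 if enough space is free, 1 if not, 2 if the minimum is invalid.
free_space_ok() {
    free_kb=$1
    total_kb=$2
    minimum=$3
    case "$minimum" in
        *%) pct=${minimum%?}
            case "$pct" in
                ''|*[!0-9]*) return 2 ;;
            esac
            [ $((free_kb * 100)) -ge $((total_kb * pct)) ] ;;
        *)  min_kb=$(size_to_kb "$minimum") || return 2
            [ "$free_kb" -ge "$min_kb" ] ;;
    esac
}

size_to_kb 100G    # prints 104857600
size_to_kb 10TB    # prints 10737418240
```

In a real check, the free and total figures would come from parsing "df" output. Given the warning in item 6 that "df" can hang, any such call would presumably need to run under a timeout.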