Branch: refs/heads/mej/dev/hw-procfs-refactor
Home: https://github.com/mej/nhc
Commit: 5e18b7f5f803765a24cb281f8c4b452281482c4c
https://github.com/mej/nhc/commit/5e18b7f5f803765a24cb281f8c4b452281482c4c
Author: Michael Jennings <m...@eterm.org>
Date: 2023-01-30 (Mon, 30 Jan 2023)
Changed paths:
M scripts/lbnl_hw.nhc
M test/test_lbnl_hw.nhc
Log Message:
-----------
WIP: lbnl_hw: Fixes/speedups for procfs file reads
This branch/changeset extensively refactors `lbnl_hw.nhc` as a
possible solution/improvement for the `/proc` file I/O problem (e.g.,
- Split `nhc_hw_gather_data()` into distinct functions per section,
allowing more fine-grained control over which `procfs` and/or
`sysfs` files actually need to be read and parsed in the first
place;
- Convert the path+filename for each file being parsed into a
configuration variable that users can customize as needed;
- Alter the way NHC is pulling data from each file, by using a single
`read` invocation, to avoid the
[`lseek()`/rebuild problems described](
https://github.com/mej/nhc/issues/43#issuecomment-324690311)
by @mattmix in #43;
- Add support for using a cached copy of the original file (instead
of reading directly from `/proc` or `/sys`) to avoid the resulting
poor performance;
- Allow for the aforementioned configuration variables to specify a
process substitution expression (see `bash(1)` for details) rather
than a path+filename. This, in turn, permits the user to, for
example, `grep` out unused lines to minimize parsing time. (For a
specific example and rationale, see
[this comment](
https://github.com/mej/nhc/issues/30#issuecomment-331285668)
from @NateCrawford with relevant performance comparisons.)
- Improve the unit tests for this module; now that the hardware
checks are capable of reading from a user-defined location, the
unit tests feed auto-generated data into the checks via the
`/dev/stdin` "file" (really a shell variable). Not only does this
verify the new "user-specified custom data source" code, it also
expands the "code coverage" of the unit tests. The old tests just
directly assigned the test data to the right NHC-internal
variables; the new tests cover the parsing code as well, not just
the checks themselves.
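As a rough illustration of the ideas in the list above (a single `read` per source, a user-selectable data source, and test data fed in through `/dev/stdin`), here is a minimal sketch. It is not the code from `lbnl_hw.nhc`: the function names `slurp_source` and `meminfo_value`, the variable names, and the sample data are all hypothetical, and it assumes `bash` on a system that provides `/dev/stdin` and `/dev/fd`.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: function and variable names are hypothetical,
# not the ones used in lbnl_hw.nhc.

# Read an entire data source into the variable named by $1 using a single
# `read` call. $2 is any readable path: a procfs file, a cached copy,
# /dev/stdin, or the /dev/fd/N path produced by a process substitution.
slurp_source() {
    local var_name="$1" source_path="$2" contents=''
    # With -d '' the delimiter is NUL, so read consumes everything up to EOF.
    # It returns non-zero on hitting EOF, which is expected here.
    IFS= read -r -d '' contents < "$source_path" || true
    printf -v "$var_name" '%s' "$contents"
}

# Parse "Key:   value kB" lines from text already held in memory and print
# the value for the requested key.
meminfo_value() {
    local wanted="$1" text="$2" key value rest
    while read -r key value rest; do
        if [[ "$key" == "$wanted:" ]]; then
            printf '%s\n' "$value"
            return 0
        fi
    done <<< "$text"
    return 1
}

# Made-up sample standing in for /proc/meminfo so the sketch runs anywhere.
SAMPLE=$'MemTotal:       16384 kB\nMemFree:         8192 kB\nSwapTotal:       2048 kB'

# 1. Data fed through /dev/stdin, as a unit test might do.
slurp_source RAW_ALL /dev/stdin <<< "$SAMPLE"

# 2. Data pre-filtered by a process substitution to skip unused lines.
slurp_source RAW_MEM <(grep '^Mem' <<< "$SAMPLE")

MEM_TOTAL=$(meminfo_value MemTotal "$RAW_ALL")
SWAP_TOTAL=$(meminfo_value SwapTotal "$RAW_ALL")
MEM_FREE=$(meminfo_value MemFree "$RAW_MEM")
echo "MemTotal=$MEM_TOTAL MemFree=$MEM_FREE SwapTotal=$SWAP_TOTAL"
```

In this sketch the source file is opened and read exactly once, and all parsing happens on the in-memory copy. Pointing `slurp_source` at a real `/proc` file, a cached copy, or a filtering process substitution only changes the second argument.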
These changes are intended to fix #30, #39, #43, #47, and #118 as well
as some older LANL-internal issues with
[Trinity](
https://www.lanl.gov/projects/trinity/specifications.php)
(our Haswell/KNL-based, nineteen-thousand-node HPE/Cray XC40).
And with respect to Trinity, I would be remiss were I to fail to
express my sincere thanks to @grahamvh, my colleague at @lanl and one
of the main sysadmins for that system, who helped me immensely in
brainstorming, devising potential solutions, testing, and providing
critical feedback en route toward finally getting this problem licked!
Feedback on this approach is much appreciated!