[mej/nhc] 128aec: nhc: Global config & command line precedence fix

2 views
Skip to first unread message

Michael Jennings

unread,
Nov 7, 2024, 4:20:07 PM11/7/24
to nhc-...@lbl.gov
Branch: refs/heads/dev
Home: https://github.com/mej/nhc
Commit: 128aec7d4621c89a0b642a26ac2637197f17a021
https://github.com/mej/nhc/commit/128aec7d4621c89a0b642a26ac2637197f17a021
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-06-07 (Fri, 07 Jun 2024)

Changed paths:
M nhc

Log Message:
-----------
nhc: Global config & command line precedence fix

As far as I can tell, this code has been in NHC -- wrong -- for
literally years. I've given up trying to figure out how my memory and
Git history can be so extremely different; being almost 50, the answer
seems obvious.

Anyway... Two simple but very important fixes here:

- `$NAME` should never have been used for finding and loading the
global settings file in `/etc/sysconfig/`; it was (and is) "global"
for a reason! But if anyone was relying on this functionality, let
me know; in theory, we can allow for both at once.

- Command-line arguments are intended to be authoritative; if a
particular setting appears in both `/etc/sysconfig/nhc` and as an
argument to `nhc`, the latter should win. Unfortunately, the
"order of operations," so to speak, prior to this commit did the
exact opposite. This change corrects the ordering.


Commit: 80e19f24c2915ac9bc4fbc73721ff757df0e6463
https://github.com/mej/nhc/commit/80e19f24c2915ac9bc4fbc73721ff757df0e6463
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-06-07 (Fri, 07 Jun 2024)

Changed paths:
M nhc

Log Message:
-----------
WIP: nhc: Add JSON output support (!125)

Work-in-progress support for JSON output.


Commit: 44bbfc1146e0527bcaef0a845b9c920cefc4d5e9
https://github.com/mej/nhc/commit/44bbfc1146e0527bcaef0a845b9c920cefc4d5e9
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-06-17 (Mon, 17 Jun 2024)

Changed paths:
M .gitignore
M nhc

Log Message:
-----------
nhc: Minor refactoring of info output functions

This change is a "re-think" of what I previously labeled
"quick-and-dirty" functions for outputting informational messages to
either the `$LOGFILE` or the original, saved `stdout`/`stderr` file
descriptors. Even the function names *themselves* were inconsistent
and confusing; I found myself having to refresh my memory by reading
through all the code again -- just to determine which function to use
in a given situation -- one too many times.

As a result, I rewrote them all as short, simple "one-liner"
functions, which are grouped together for readability and consistency,
that provide a clear and consistent flow of information. In fact,
with the exception of `syslog()`, all other informational messaging
functions now use a single interface to actually generate the output.
Each and every output interface now has two separate functions for use
with or without a format string, similar to `echo` and `printf`; i.e.,
`dbg()` now also has `dbgf()`, `log()` now has `logf()`, and so forth.
And each pair of functions simply wraps `_out()` and `_outf()`,
meaning that all output generated by NHC comes from one of 3 short
functions: `_out()`, `_outf()`, or `syslog()`.

I don't normally do this, but I lined up and spaced out all the
function declarations to make it easier to tell how they each differ
from one another as well as how they are alike. Hopefully these
changes will, overall, make outputting messages from within NHC
easier...and make more sense! I'm also hoping this refactor will
enable me to write unit tests for the output functions, something I
haven't been able to do up 'til now.


Commit: ab1ba9ef3ae9d2cd32ecec7f678857bb6e195a2a
https://github.com/mej/nhc/commit/ab1ba9ef3ae9d2cd32ecec7f678857bb6e195a2a
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-06-19 (Wed, 19 Jun 2024)

Changed paths:
M .gitignore
M nhc
M test/nhc-test

Log Message:
-----------
nhc: Make `flush` an option instead of a function

This has been on my "TODO" list for ages now, so since I'm already
making I/O and logging changes, I decided to include this too.

Prior to this change, `syslog()` would only append message lines to a
string variable; nothing would actually be written to the `syslog`
socket (using the `logger` command by default) until the
`syslog_flush()` function was invoked.

As with many things in NHC, this was done to minimize context switches
and subcommand execution as much as possible, especially during the
"running checks" phase. There are, however, numerous instances in
which it's prudent to write information to the `syslog` right away
rather than saving it for later; this is made quite evident by the
number of times a call to `syslog_flush()` immediately follows
`syslog()` invocation in NHC's code. The repetition makes that code
jump out at you; in my humble opinion, it's just too distracting.

This commit adds support to `syslog()` for requesting immediate
flushing of the `syslog` data by specifying the `-f` option. All
references to `syslog_flush()` have been replaced with `syslog -f`
and, if they followed a call to `syslog()`, merged the two calls into
one. The `syslog_flush()` function is now deprecated and may be
removed in a future release; however, for compatibility reasons, the
`syslog_flush()` function does still exist and will still do the same
thing it did prior to this commit. It is now a simple wrapper around
a call to `syslog -f`.

(Another possible eventual change: If `syslog()` ever acquires any
additional options apart from just `-f`, I will convert it to use
`getopts` like other functions. For now, though, since it's the only
available flag and takes no argument(s), it's faster and simpler to
just look at the first argument...either it'll be `-f`, or it won't.)


Commit: 6b40f9eaced067e0c6d1c68fdf8a5dff4be73136
https://github.com/mej/nhc/commit/6b40f9eaced067e0c6d1c68fdf8a5dff4be73136
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-07-11 (Thu, 11 Jul 2024)

Changed paths:
M .gitignore
M nhc
M test/nhc-test

Log Message:
-----------
Merge branch 'mej/dev/minor-io-refactor' into mej/dev/major-feature-merge

* mej/dev/minor-io-refactor:
nhc: Make `flush` an option instead of a function
nhc: Minor refactoring of info output functions


Commit: 16b8924530865b42962c0d9c30f3b6eccff7fd6d
https://github.com/mej/nhc/commit/16b8924530865b42962c0d9c30f3b6eccff7fd6d
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-07-11 (Thu, 11 Jul 2024)

Changed paths:
M nhc

Log Message:
-----------
Merge branch 'mej/fix/cli-vs-sysconfig-vs-name' into mej/dev/major-feature-merge

* mej/fix/cli-vs-sysconfig-vs-name:
nhc: Global config & command line precedence fix


Commit: 8e4eceef0b499765380f4aca2f2dd46e20af6eff
https://github.com/mej/nhc/commit/8e4eceef0b499765380f4aca2f2dd46e20af6eff
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-07-11 (Thu, 11 Jul 2024)

Changed paths:
M nhc

Log Message:
-----------
Merge branch 'mej/dev/125-json-output' into mej/dev/major-feature-merge

* mej/dev/125-json-output:
WIP: nhc: Add JSON output support (!125)


Commit: 3285d5a9938bb38be64b7b74e7e2519017cd70b7
https://github.com/mej/nhc/commit/3285d5a9938bb38be64b7b74e7e2519017cd70b7
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-07-11 (Thu, 11 Jul 2024)

Changed paths:
M nhc
M test/test_common.nhc

Log Message:
-----------
nhc: TS -> NHC_TS to fix variable name collision

Not sure how long that's been there, but yikes!

When I started testing the newly merged code, I started getting weird
errors from the new timestamp-handling microfunction `tsup()`
complaining about an expression error. Come to find out, one of my
unit tests for the array-handling code used the same variable name of
`${TS}` that the timestamping feature was using. Doh!

I covered my bases here. I've renamed the variable in the test code.
I also changed the name of the config variable in NHC to the
much-more-gooderly-namespaced `${NHC_TS}`. Since these new feature
changes will be going straight into the `1.5` release, and never any
`1.4.x`-series releases, I think I'm okay to rename that config
variable. Besides, I doubt anyone would use it in production
anyway...in fact, I may be the only one who uses it at all!


Commit: d2d4ee1f82c53916a74ae21799c6f83b773cee4a
https://github.com/mej/nhc/commit/d2d4ee1f82c53916a74ae21799c6f83b773cee4a
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-07-15 (Mon, 15 Jul 2024)

Changed paths:
M scripts/lbnl_file.nhc
M test/nhc-test
M test/test_lbnl_file.nhc

Log Message:
-----------
lbnl_file.nhc: Rethink `stat`/`test` checks

This commit does a number of things:

- Support was added in `check_file_stat()` for handling (almost) every
portable field that can be selected from `stat -c`; specifically:

- There's now a separate pair of options for validating just the
*permissions* bits of the file mode (i.e., `mode | 00007777`) or
the entire `mode` value (including the higher-order "file type
bits," `mode | 00170000`).

- Another new pair of options, `-b`/`-B`, was added to test
whether or not the "birth timestamp" is older or newer,
respectively, than the given number of seconds (only on
filesystems which support it, of course).

- On Linux hosts, SELinux file contexts are also supported.

- The `check_file_test()` shell function's support for command-line
arguments was rewritten to use `getopts` like many other checks
already do. Among other things, this allows options to be bundled
together. This might seem trivial, but being able to turn
`check_file_test -f -G -O -e -r -w -s ~/.*profile ~/.*rc` into
`check_file_test -GOefrsw ~/.*profile ~/.*rc` is definitely not
nothing.

- Variables have replaced a hard-coded format string when executing
the `stat` command.

- The quality of the unit tests for both of these checks has been
substantially improved.

And of course there were other tweaks here and there as well.

This is not to say they're complete, though! As stated in this
commit, the `stat` stuff needs to be refactored to use hashmaps
instead (or "associative arrays" or "dictionaries," if you prefer) now
that the older versions of Bash are no longer supported. (In fact,
there are several items that need to be "modernized" to use newer Bash
features...!) But I'd like to get 1.5 released sooner rather than
later, so some features and improvements will be postponed until the
1.5.1 development cycle.


Commit: c848f3cb4563a2296f8c0c3bebe033dbcd5d5aac
https://github.com/mej/nhc/commit/c848f3cb4563a2296f8c0c3bebe033dbcd5d5aac
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-07-15 (Mon, 15 Jul 2024)

Changed paths:
M scripts/lbnl_file.nhc
M test/nhc-test
M test/test_lbnl_file.nhc

Log Message:
-----------
Merge branch 'mej/dev/file-stat-enhancements' into mej/dev/major-feature-merge

* mej/dev/file-stat-enhancements:
lbnl_file.nhc: Rethink `stat`/`test` checks


Commit: 6b3785f57c70e2209848ff83e85968beea141d90
https://github.com/mej/nhc/commit/6b3785f57c70e2209848ff83e85968beea141d90
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-07-29 (Mon, 29 Jul 2024)

Changed paths:
M scripts/lbnl_file.nhc

Log Message:
-----------
lbnl_file.nhc: Handle wonky SELinux supportitude

On Linux hosts with fully functional SELinux support in the `disabled`
state, the `stat` command returns some error messages complaining
about a lack of context data:

```
/usr/bin/stat: failed to get security context of '/my/file': No data available
```

On the other hand, a system with SELinux enabled and active will
return the correct value (e.g., `system_u:object_r:null_device_t:s0`
for `/dev/null`). To address this problem, at least for the short
term, the `%C` for the SELinux Context field will only be included if
SELinux appears to be supported *and* active. On RHEL-compatible
systems (and possibly others), this means `SELINUX` being set to
something neither empty nor `"disabled"`. The return value of
`selinuxenabled` is also checked. For now, that's it, but please do
suggest other methods you know of!

Interestingly, `stat(1)` seems to output each of the values requested
one-by-one in order -- meaning that, even though the error is written
to `stderr`, it's written at the correct point to correlate when that
field would have been output in a non-failure scenario. In other
words, if I use `'%n %m %C %u %g'` as my `stat -c` format string
(that's "name mountpoint context uid gid") and redirect `stderr` to
`stdout`, the output will contain the file's name in column 1, its
mountpoint in column 2, and column 3 will contain the error message
shown above followed by a newline. After the newline will come a
space, the numeric UID, another space, and finally the numeric GID.

The point being, all the other information `stat(1)` provides, even if
it gets no SELinux data back, is still valid and still shows up in the
correct columns/fields per the requested format. So the possibility
exists to better handle this case in the future -- more of a "best
effort" approach rather than "non-zero exit means failure." Another
approach would be to make it an admin-controlled setting. I can't
really take the time to do that just now, though.


Commit: 0a0f142b9c1b9ccee0ec87a5e589abaf320b460b
https://github.com/mej/nhc/commit/0a0f142b9c1b9ccee0ec87a5e589abaf320b460b
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-07-29 (Mon, 29 Jul 2024)

Changed paths:
M scripts/lbnl_file.nhc

Log Message:
-----------
Merge branch 'mej/dev/file-stat-enhancements' into mej/dev/major-feature-merge

* mej/dev/file-stat-enhancements:
lbnl_file.nhc: Handle wonky SELinux supportitude


Commit: dc2762775878cdf257645bd6fcc9fa19c1661f18
https://github.com/mej/nhc/commit/dc2762775878cdf257645bd6fcc9fa19c1661f18
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-09-10 (Tue, 10 Sep 2024)

Changed paths:
M configure.ac
M nhc
M scripts/common.nhc

Log Message:
-----------
nhc: Add support for flag to show version/info

A previous commit (63c5678d) added the `NHC_VERSION` variable,
transformed at install-time, to provide the ability to identify the
version of `nhc` being executed. This commit adds the `-V` option to
allow that information to be retrieved from the command line or a
shell script.

I'm still working out the specifics of what gets displayed by `-V`, so
the exact output is subject to change, but the basic idea is this:
`-V` displays just the name and version string; for each additional
`-V` added to the `nhc` invocation, additional info/details may be
provided. For right now, though, I'm keeping it pretty simple.


Commit: c571813d746bb2e4b19783fb46e61ca843656c66
https://github.com/mej/nhc/commit/c571813d746bb2e4b19783fb46e61ca843656c66
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-09-10 (Tue, 10 Sep 2024)

Changed paths:
M configure.ac
M nhc
M scripts/common.nhc

Log Message:
-----------
Merge branch 'mej/dev/new-version-cli-flag' into dev

* mej/dev/new-version-cli-flag:
nhc: Add support for flag to show version/info


Commit: 0b60291a217edc1c4b5bfaf67fdf32862fdfcc68
https://github.com/mej/nhc/commit/0b60291a217edc1c4b5bfaf67fdf32862fdfcc68
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-10-16 (Wed, 16 Oct 2024)

Changed paths:
M scripts/lbnl_file.nhc

Log Message:
-----------
lbnl_file.nhc: Remove extraneous `\n`


Commit: 1a430349e1d01b1472089058da56071bdb391e68
https://github.com/mej/nhc/commit/1a430349e1d01b1472089058da56071bdb391e68
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-10-17 (Thu, 17 Oct 2024)

Changed paths:
M nhc

Log Message:
-----------
nhc: Add new `-VVV` response and tidy up previous

This updates the code for handling `nhcmain_version_info()` invocation
with a count/argument of 0 or 1 (`-V`) to match the style of the
2-count (`-VV`) handler; it also adds support for a value of 3
(`-VVV`) to obtain a reasonably comprehensive view of configuration
variables/settings.

The existing behavior for `-V` and `-VV` was not altered; specifying
`-V` twice will still display the information that a single `-V`
displays followed by some additional details. Likewise, adding a
third `-V` still displays everything that was included in the output
for the first and second, followed by even more detailed information.
The output is additive, and I feel that the new layout/formatting of
the code makes this more clear and easier to follow. Conceptually, it
maps well to similar functionality in other software (e.g., the `-v`
flag in OpenSSH).


Commit: 511cd080e426052f1d95e0f8c37ea1a01ee82d42
https://github.com/mej/nhc/commit/511cd080e426052f1d95e0f8c37ea1a01ee82d42
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-10-17 (Thu, 17 Oct 2024)

Changed paths:
M nhc

Log Message:
-----------
nhc: Fix var collisions, etc., with JSON logging

All calls to `syslog()` now have their message buffers properly
converted to JSON (if requested) just prior to invocation, and the
multi-function variable name collisions for `$MSG` that were keeping
JSON from working at all have been remediated.

This changeset also adds the `-j` command-line toggle to activate JSON
mode (instead of having to use `NHC_FMT_JSON=1`).

Future work: Would a nameref fix this, or do we still have the same
problem with matching names up/down the call stack? Also, should we
be passing an "object" (hashmap) to the `syslog()` function with all
the metadata rather than having to pre-convert the message buffer
right beforehand? and making the `jsonify()` function figure out all
the metadata on its own from the message text?


Commit: 5cdb36b511040cccc813bb6d050c5df8fd829971
https://github.com/mej/nhc/commit/5cdb36b511040cccc813bb6d050c5df8fd829971
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-10-28 (Mon, 28 Oct 2024)

Changed paths:
M scripts/lbnl_fs.nhc

Log Message:
-----------
lbnl_fs.nhc: Minor adjustment to mount check

A small issue revealed itself during pre-release testing: When a
mountpoint does not exist, any attempt to mount that mountpoint's
filesystem will, obviously, fail. For that reason, NHC will
automatically create (or *attempt to create*) any non-existent
mountpoints it encounters for which the admin has specified a check.

(No option/flag was added to control this behavior; logically, if a
directory has a mount configuration for it specified in `fstab` *and*
the admin has configured NHC to verify that a filesystem is mounted
there, it stands to reason that the admin would want a filesystem to
be able to be successfully mounted there...and for that to happen, the
directory would have to exist.)

Previously, NHC would report that the missing directory issue had been
"`auto-fixed`" as part of the result message -- but it wouldn't
actually check to make sure the `mkdir` command had really worked!
And if it *had* failed, the `mkdir` command would usually report the
failure immediately prior to displaying the aforementioned claim of
success, resulting in something like the following:

```
Running check 06 of 89: "check_fs_mount -t "nfs" -s "netapp:/moo" -f "/mnt/remote/moo""
mkdir: cannot create directory '/mnt/remote/moo': No such file or directory
Health check failed: check_fs_mount: /mnt/remote/moo not mounted; directory /mnt/remote/moo missing (auto-fixed)
```

So the admin would see the `mkdir` failure and then NHC claiming
success...every time it ran. Not exactly confidence-inspiring, right?

This simply tests for the existence of the mountpoint directory, and
if it doesn't exist, reports the auto-fix failure rather than assuming
success. It's a little thing, perhaps, but the potential for user
confusion was, nonetheless, unquestionably present.


Commit: c8ea40a67997c58d8758a25ec0ab1c3489c01c65
https://github.com/mej/nhc/commit/c8ea40a67997c58d8758a25ec0ab1c3489c01c65
Author: Michael Jennings <m...@lanl.gov>
Date: 2024-10-29 (Tue, 29 Oct 2024)

Changed paths:
M nhc

Log Message:
-----------
nhc: Allow more exotic config file specifications

For various reasons I won't get into, I've recently discovered that
there's value in feeding NHC configuration data from non-standard
locations using non-standard techniques; unfortunately, one simple
test in `nhcmain_check_conffile()` ruined it because it would refuse
any config file specification that wasn't a plain ordinary file (or a
symlink to one).

Here's an example of (more or less) one of the things I tried that
didn't work (but does now!):

```
host# nhc -avl- -c <( egrep -hv '^[[:space:]]*(#|$)' /etc/nhc/nhc*.conf )
```

Other possibilities include appending additional check lines to your
NHC config - without having to change the config file! - by piping
check-lines into NHC's config "file" onto `stdout`, using named pipes
to collapse and feed multiple data streams into NHC's config
dynamically, etc.

The other problem was this: Since NHC was designed and implemented
with the primary focus of answering the question, "Can this node
reliably and correctly execute user jobs?" an empty config file (i.e.,
one with no checks) is conceptually identical to a config file in
which none of the check-lines' leading match strings matches. The
admin/user providing no checks for a given host does not declare it
unhealthy, nor does it warrant taking the node out of service.

A side-effect of that, though, meant that NHC not having any checks to
run was still considered healthy (a return code of "true"), a fact
which was then never reported back to the log, syslog, or anywhere.

NHC will now check both type *and* access for its config file;
further, warning messages will be provided if the supplied config is
invalid or unreadable. It will still return "true" in those cases,
for the reasons set forth above, but it won't just silently exit.


Compare: https://github.com/mej/nhc/compare/7d9cec9d2cde...c8ea40a67997

To unsubscribe from these emails, change your notification settings at https://github.com/mej/nhc/settings/notifications
Reply all
Reply to author
Forward
0 new messages