Branch: refs/heads/dev
Home: https://github.com/mej/nhc
Commit: 8fe57c05f284e48e15eb5f5a88f98a5455617b6c
https://github.com/mej/nhc/commit/8fe57c05f284e48e15eb5f5a88f98a5455617b6c
Author: Michael Jennings <m...@lanl.gov>
Date: 2023-04-25 (Tue, 25 Apr 2023)
Changed paths:
M scripts/common.nhc
M scripts/lbnl_cmd.nhc
Log Message:
-----------
scripts/lbnl_cmd.nhc: Use consistent spawn code
For some reason (probably simplicity), the `check_cmd_status()` check
function was using a different method of spawning a subprocess with a
timeout than `check_cmd_output()` was. The `nhc_cmd_with_timeout()`
function was written specifically so that this exact functionality
would be implemented consistently, but only `check_cmd_output()` was
using it.
The `check_cmd_status()` check function was launching the subcommand
directly and trying to use `nhcmain_watchdog_timer()` to create its
watchdog timer process.
As observed in #104, `check_cmd_status()` (but *not*
`check_cmd_output()`) was leaving behind watchdog timer and `sleep`
processes that should have been terminated when the subcommand
exited. Unfortunately, `nhcmain_watchdog_timer()` was not written
with this use case in mind, nor was `kill_watchdog()` expecting to
have to clean up multiple child processes.
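The sketch below shows one generic way a watchdog can leak a `sleep` like this. It is a hypothetical illustration, not NHC's actual `nhcmain_watchdog_timer()`/`kill_watchdog()` code; `naive_watchdog`, `cancel_watchdog`, and `WATCHDOG_PID` are made-up names. The point is that `kill` signals exactly one process, so killing the watchdog's PID does not reach any child it has forked.

```shell
#!/bin/bash
# Hypothetical illustration, not NHC's actual watchdog code.
# The watchdog is a background subshell that sleeps, then kills its target.
# Because `sleep` is not the last command in the list, the subshell forks it
# as a separate child process instead of exec'ing it.
naive_watchdog() {
    local timeout="$1" target="$2"
    ( sleep "$timeout"; kill -TERM "$target" 2>/dev/null ) &
    WATCHDOG_PID=$!     # PID of the subshell, *not* of its sleep child
}

# "Cancelling" the watchdog signals only the subshell. The subshell dies, but
# its sleep child is reparented (to PID 1 or a subreaper) and keeps running
# until its timeout expires, holding open any file descriptors it inherited.
cancel_watchdog() {
    kill -TERM "$WATCHDOG_PID" 2>/dev/null || true
    wait "$WATCHDOG_PID" 2>/dev/null || true
}
```

Cancelling this watchdog before it fires leaves a stray `sleep` running for the rest of its interval. If the watchdog was started inside `$(...)`, the command substitution does not return until that orphaned `sleep` exits, because the `sleep` still holds the pipe open.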
This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've
switched `check_cmd_status()` to use `nhc_cmd_with_timeout()`, since
it uses a different mechanism and has not displayed this behavior.
The longer-term fix will be to refactor the watchdog code in `nhc`
itself and use that code whenever subcommands need to be launched.
This should address #104 on the 1.4.4 branch; once tested and
verified, it will be applied to both branches for now.
Commit: f4ddfff46ff628a60ae5fb799ce54287fc8dbec6
https://github.com/mej/nhc/commit/f4ddfff46ff628a60ae5fb799ce54287fc8dbec6
Author: Michael Jennings <m...@lanl.gov>
Date: 2023-04-25 (Tue, 25 Apr 2023)
Changed paths:
M nhc
M scripts/common.nhc
M test/nhc-test
Log Message:
-----------
nhc: Refactor watchdog timer code for reuse
See commits b08769bb and 8fe57c05 for further details.
As referenced in the above commits, the longer-term fix for the 1.5+
branch is a refactoring of all the watchdog timer code in `nhc` so
that multiple distinct timers can be managed simultaneously, including
their termination in case of successful subprocess/program exit. Lack
of proper cleanup was ultimately the key cause of #104's leaked
shell+`sleep` processes.
**NOTE**: The `nhc` script itself does *not* keep track of all the
PIDs for all the timers it has spawned off, only the main one (for the
top-level `nhc` process). Any other timer PIDs must be tracked by
whatever spawned them. In particular, `nhc_cmd_with_timeout()` tracks
both the task and the timer PIDs and ensures that both processes have
exited before it returns.
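A minimal sketch of the "track both PIDs" pattern from the note above, under stated assumptions: `run_with_timeout` is a hypothetical name, and this is not the actual body of `nhc_cmd_with_timeout()`. It uses a bare `sleep` as the timer, so there is no wrapper shell to orphan anything. It polls with `kill -0` (Bash reaps finished background children itself) and reaps both PIDs with plain `wait PID`, avoiding newer `wait` options. `sleep 0.1` assumes a `sleep` that accepts fractional seconds, as GNU coreutils does.

```shell
#!/bin/bash
# Hypothetical sketch (not NHC's nhc_cmd_with_timeout()): run "$@" with a
# timeout in seconds, tracking BOTH the task PID and the timer PID, and
# return only after both have exited and been reaped.
run_with_timeout() {
    local timeout="$1"; shift
    local task_pid timer_pid status=0

    "$@" &                  # the task itself
    task_pid=$!
    sleep "$timeout" &      # the timer: a bare sleep, so no wrapper shell can leak
    timer_pid=$!

    # Poll until either process is gone. Bash reaps finished background
    # children promptly, so `kill -0` fails once a child has been reaped.
    # (Fractional sleep assumes GNU coreutils.)
    while kill -0 "$task_pid" 2>/dev/null && kill -0 "$timer_pid" 2>/dev/null; do
        sleep 0.1
    done

    # Terminate whichever one is still running; the other is already gone.
    kill -TERM "$task_pid" "$timer_pid" 2>/dev/null || true

    # Reap both with plain `wait PID`, which older Bash releases support.
    wait "$task_pid" || status=$?
    wait "$timer_pid" 2>/dev/null || true
    return "$status"
}
```

For example, `run_with_timeout 5 false` returns 1 within about a tenth of a second, while `run_with_timeout 1 sleep 30` returns 143 (128 + SIGTERM) after roughly one second. Polling costs up to 0.1 s of latency, but it works with any Bash that supports `wait PID`.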
Commit: 551fb34175c167b5c286ff109adc5b47438c3876
https://github.com/mej/nhc/commit/551fb34175c167b5c286ff109adc5b47438c3876
Author: Michael Jennings <m...@lanl.gov>
Date: 2023-04-25 (Tue, 25 Apr 2023)
Changed paths:
M nhc
M scripts/common.nhc
M test/nhc-test
Log Message:
-----------
nhc: Remove flags requiring newer Bash
Oopsie! Unfortunately, the versions of Bash installed on RHEL 7.x and
8.x do not support the flags I was passing to the `wait` builtin.
Going to have to use the older syntax for now...
Commit: 33c884f3e9584a1478c9a47eff464e6aef39a95d
https://github.com/mej/nhc/commit/33c884f3e9584a1478c9a47eff464e6aef39a95d
Author: Michael Jennings <m...@lanl.gov>
Date: 2023-06-05 (Mon, 05 Jun 2023)
Changed paths:
M nhc
M scripts/common.nhc
M scripts/lbnl_cmd.nhc
M test/nhc-test
Log Message:
-----------
Merge pull request #132 from mej/fix/104/watchdog-tracking
nhc: Refactor watchdog timer code for reuse
Compare: https://github.com/mej/nhc/compare/dc10825ea9a5...33c884f3e958