[mej/nhc] 1dd668: scripts/lbnl_cmd.nhc: Use consistent spawn code

0 views

Skip to first unread message

Michael Jennings

unread,

Apr 19, 2023, 4:56:36 PM4/19/23

to nhc-...@lbl.gov

Branch: refs/heads/fix/104/watchdog-tracking
Home: https://github.com/mej/nhc
Commit: 1dd668e57eaabd621153b08771c224ada6d806c0
https://github.com/mej/nhc/commit/1dd668e57eaabd621153b08771c224ada6d806c0
Author: Michael Jennings <m...@lanl.gov>
Date: 2023-04-13 (Thu, 13 Apr 2023)

Changed paths:
M scripts/common.nhc
M scripts/lbnl_cmd.nhc

Log Message:
-----------
scripts/lbnl_cmd.nhc: Use consistent spawn code

For some reason (probably simplicity), the `check_cmd_status()` check
function was using a different method of spawning a subprocess with a
timout than `check_cmd_status()` was. The `nhc_cmd_with_timeout()`
function was written specifically to facilitate consistency in coding
that exact functionality, but only `check_cmd_output()` was using it.
The `check_cmd_status()` check function was launching the subcommand
directly and trying to use `nhcmain_watchdog_timer()` to create its
watchdog timer process.

As observed in #104, `check_cmd_status()` (but *not*
`check_cmd_output()`) was leaving behind watchdog timer and `sleep`
processes that should have been terminated when the subcommand
exited. Unfortunately, `nhcmain_watchdog_timer()` was not written
with this use case in mind, nor was `kill_watchdog()` expecting to
have to clean up multiple child processes.

This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've
switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since
it uses a different mechanism and has not displayed this behavior.
The longer term fix will be to refactor the watchdog code in `nhc`
itself and use that code whenever launching subcommands is needed.

This should address #104 for the 1.4.4 branch but will be applied to
both for now, once tested and verified.

Commit: b171d831f492a08b06fa3fc52ce1dab578d5b4d2
https://github.com/mej/nhc/commit/b171d831f492a08b06fa3fc52ce1dab578d5b4d2
Author: Michael Jennings <m...@lanl.gov>
Date: 2023-04-19 (Wed, 19 Apr 2023)

Changed paths:
M nhc
M scripts/common.nhc
M test/nhc-test

Log Message:
-----------
nhc: Refactor watchdog timer code for reuse

See commits 1dd668e5 and b08769bb for further details.

As referenced in the above commits, the longer-term fix for the 1.5+
branch is a refactoring of all the watchdog timer code in `nhc` so
that multiple distinct timers can be managed simultaneously, including
their termination in case of successful subprocess/program exit. Lack
of proper cleanup was ultimately the key cause of #104's leaked
shell+`sleep` processes.

**NOTE**: The `nhc` script itself does *not* keep track of all the
PIDs for all the timers it has spawned off, only the main one (for the
top-level `nhc` process). Any other timer PIDs must be tracked by
whatever spawned them. In particular, `nhc_cmd_with_timeout()` tracks
both the task and the timer PIDs and ensures that both processes have
exited before it returns.

Compare: https://github.com/mej/nhc/compare/1dd668e57eaa%5E...b171d831f492

Reply all

Reply to author

Forward

0 new messages