[mej/nhc] 8fe57c: scripts/lbnl_cmd.nhc: Use consistent spawn code

Michael Jennings

Apr 25, 2023, 1:44:07 PM

to nhc-...@lbl.gov

Branch: refs/heads/fix/104/watchdog-tracking
Home: https://github.com/mej/nhc
Commit: 8fe57c05f284e48e15eb5f5a88f98a5455617b6c
https://github.com/mej/nhc/commit/8fe57c05f284e48e15eb5f5a88f98a5455617b6c
Author: Michael Jennings <m...@lanl.gov>
Date: 2023-04-25 (Tue, 25 Apr 2023)

Changed paths:
M scripts/common.nhc
M scripts/lbnl_cmd.nhc

Log Message:
-----------
scripts/lbnl_cmd.nhc: Use consistent spawn code

For some reason (probably simplicity), the `check_cmd_status()` check
function was using a different method of spawning a subprocess with a
timeout than `check_cmd_output()` was. The `nhc_cmd_with_timeout()`
function was written specifically to facilitate consistency in coding
that exact functionality, but only `check_cmd_output()` was using it.
The `check_cmd_status()` check function was launching the subcommand
directly and trying to use `nhcmain_watchdog_timer()` to create its
watchdog timer process.

As observed in #104, `check_cmd_status()` (but *not*
`check_cmd_output()`) was leaving behind watchdog timer and `sleep`
processes that should have been terminated when the subcommand
exited. Unfortunately, `nhcmain_watchdog_timer()` was not written
with this use case in mind, nor was `kill_watchdog()` expecting to
have to clean up multiple child processes.

This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've
switched `check_cmd_status()` to using `nhc_cmd_with_timeout()` since
it uses a different mechanism and has not displayed this behavior.
The longer term fix will be to refactor the watchdog code in `nhc`
itself and use that code whenever launching subcommands is needed.

This should address #104 for the 1.4.4 branch but will be applied to
both branches for now, once tested and verified.

Commit: f4ddfff46ff628a60ae5fb799ce54287fc8dbec6
https://github.com/mej/nhc/commit/f4ddfff46ff628a60ae5fb799ce54287fc8dbec6
Author: Michael Jennings <m...@lanl.gov>
Date: 2023-04-25 (Tue, 25 Apr 2023)

Changed paths:
M nhc
M scripts/common.nhc
M test/nhc-test

Log Message:
-----------
nhc: Refactor watchdog timer code for reuse

See commits b08769bb and 8fe57c05 for further details.

As referenced in the above commits, the longer-term fix for the 1.5+
branch is a refactoring of all the watchdog timer code in `nhc` so
that multiple distinct timers can be managed simultaneously, including
their termination in case of successful subprocess/program exit. Lack
of proper cleanup was ultimately the key cause of #104's leaked
shell+`sleep` processes.

**NOTE**: The `nhc` script itself does *not* keep track of all the
PIDs for all the timers it has spawned off, only the main one (for the
top-level `nhc` process). Any other timer PIDs must be tracked by
whatever spawned them. In particular, `nhc_cmd_with_timeout()` tracks
both the task and the timer PIDs and ensures that both processes have
exited before it returns.
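
The tracking described in the note above can be illustrated with a minimal
Bash sketch. This is not the code from this commit: `run_with_timeout` and its
variable names are hypothetical, and the return value of 124 on timeout is an
assumption borrowed from the `timeout(1)` convention. It only shows the general
pattern of a caller holding both the task PID and the timer PID and reaping
both before it returns.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; NOT the actual nhc_cmd_with_timeout() implementation.

# run_with_timeout SECS CMD [ARGS...]
# Runs CMD with a watchdog timer. Returns CMD's exit status, or 124 if the
# watchdog had to kill it (124 is an assumed convention, as in timeout(1)).
run_with_timeout() {
    local secs="$1"; shift
    local task_pid timer_pid task_rc=0 timer_rc=0

    # Launch the task in the background and remember its PID.
    "$@" &
    task_pid=$!

    # Watchdog subshell: sleep, then terminate the task.
    (
        # If told to stop early, kill our own sleep so nothing is left behind.
        trap 'kill "$sleep_pid" 2>/dev/null; exit 0' TERM
        sleep "$secs" &
        sleep_pid=$!
        wait "$sleep_pid"
        # Timer expired: ignore a late TERM from the caller, then kill the task.
        trap '' TERM
        kill -TERM "$task_pid" 2>/dev/null
        exit 1    # status 1 tells the caller that the timer fired
    ) >/dev/null 2>&1 &
    timer_pid=$!

    # Reap the task first, then stop and reap the timer.
    wait "$task_pid" || task_rc=$?
    kill -TERM "$timer_pid" 2>/dev/null
    wait "$timer_pid" || timer_rc=$?

    # A small startup race remains in this sketch; real code needs more care.
    if [ "$timer_rc" -eq 1 ]; then
        return 124
    fi
    return "$task_rc"
}
```

For example, `run_with_timeout 5 some_command` would return the status of the
(hypothetical) `some_command` if it finishes within 5 seconds. The watchdog
kills its own `sleep` from the `TERM` trap because terminating only the
watchdog shell would leave that `sleep` running.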

Compare: https://github.com/mej/nhc/compare/b171d831f492...f4ddfff46ff6
