cfengine watchdog is not biting


Xander Cage

Feb 20, 2024, 5:11:20 AM
to help-cfengine
Hi,

At times the cfengine watchdog script is unable to resolve hanging cf-agents. The reason is that when the system is under high load, the watchdog "hangs": it reads its own stale pid file and simply does nothing.

example:

i had 332 hanging cf-agent processes, but the watchdog log just shows this:

Tue Feb 20 10:57:00 CET 2024 Initiating watchdog 6816488
Tue Feb 20 10:57:00 CET 2024 Aborting execution of watchdog 6816488, existing watchdog process 16450014 running
Tue Feb 20 10:58:00 CET 2024 Initiating watchdog 44499432
Tue Feb 20 10:58:00 CET 2024 Aborting execution of watchdog 44499432, existing watchdog process 16450014 running
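
For illustration only: the repeated abort in the log above is what a pidfile check of roughly the following shape would do whenever the PID in a leftover pidfile belongs to some live process. This is a sketch, not the shipped watchdog code; the function name `naive_pidfile_check` is made up, and `kill -0` is used as a stand-in for whatever liveness test the real script performs.

```shell
# Hypothetical sketch, NOT the actual watchdog: a pidfile check that
# trusts any live PID it finds in the file.
naive_pidfile_check() {
    # kill -0 sends no signal; it only tests whether the PID can be signalled
    # (as root, that means "some process with this PID exists").
    if [ -s "$1" ] && kill -0 "$(cat "$1")" 2> /dev/null; then
        echo abort    # a process owns that PID, so assume it is a watchdog
    else
        echo run      # no pidfile, empty pidfile, or no such process
    fi
}

# Demo: a leftover pidfile whose PID belongs to a live non-watchdog process
# (here simply the current shell).
demo_pidfile=$(mktemp)
echo $$ > "$demo_pidfile"
naive_pidfile_check "$demo_pidfile"    # prints "abort"
rm -f "$demo_pidfile"
```

With a check like this, every new watchdog run keeps aborting for as long as the PID in the file stays in use by anything at all.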

after deleting the pid file...

Tue Feb 20 10:59:00 CET 2024 Initiating watchdog 25756150
Tue Feb 20 10:59:01 CET 2024 Found cf-execd not running
Tue Feb 20 10:59:03 CET 2024 Found 332 occurrences of cf-execd terminating unresponsive cf-agent
Tue Feb 20 10:59:04 CET 2024 Found 2 symptoms, threshold (0) breached.
Tue Feb 20 10:59:06 CET 2024 Initiating apoptosis
Tue Feb 20 10:59:10 CET 2024 Initiating anastasis

i guess there is room for improvement, as this can bring a system down quite easily.

wbr

chris

Xander Cage

Feb 20, 2024, 7:34:15 AM
to help-cfengine

Nick Anderson

Feb 21, 2024, 9:59:41 AM
to help-cfengine
Thanks for filing the ticket. :D

Mike Weilgart

Feb 21, 2024, 8:47:24 PM
to help-cfengine

Would like some help with testing.  Would be especially great to test on some of these systems where the existing pidfile logic isn't working, besides testing the usual cases.

Best,
--Mike Weilgart

Xander Cage

Feb 22, 2024, 8:25:03 AM
to help-cfengine
i don't think the problematic situation can be simulated, but running the changed script shows a missing "then" ;-)

after adding it, the script runs without errors....

Xander Cage

Feb 22, 2024, 8:26:24 AM
to help-cfengine
forgot to post the error:

root@aixtest01: /root # /var/cfengine/bin/watchdog_changed
/var/cfengine/bin/watchdog_changed[52]: syntax error at line 64 : `else' unexpected

Nick Anderson

Feb 22, 2024, 12:13:54 PM
to help-cfengine
Thanks Mike and Xander

this got merged.

Mike Weilgart

Feb 22, 2024, 12:32:06 PM
to Xander Cage, help-cfengine
Thanks Xander,

You could test it by running something like:

bash -c 'echo $$ > /var/cfengine/watchdog.pid && exec sleep infinity' &

Running bash separately ensures the pid will be different, the exec makes the sleep run with the same PID the shell had (i.e. that PID is taken over by the sleep), and the & makes it all happen in the background.

Then you've got a pidfile showing a valid, running process that is definitely using that pid.
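
The PID-preserving effect of exec described above can be checked directly. A small sketch (the function name `pid_before_and_after_exec` is hypothetical) that prints a shell's PID, then execs a second shell and prints the PID again:

```shell
# Print the PID of a fresh shell, then exec a second shell in its place
# and print that one's PID too. Because exec replaces the process image
# without forking, both lines should show the same number.
pid_before_and_after_exec() {
    sh -c 'echo $$; exec sh -c "echo \$\$"'
}

pid_before_and_after_exec
```

Without the exec, the second shell would be a child process and report a different PID, and the pidfile would then point at the wrapper shell rather than at the sleep.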

Simulating a stale pidfile with no process using that pid is easy: just write the shell's pid into the file and then exit:

bash -c 'echo $$ > /var/cfengine/watchdog.pid'

Simulating a stale pidfile where there is a process running on that pid, but the process is newer than the pidfile, is also possible; it requires touch to set an old mtime.  Checking the AIX docs (https://www.ibm.com/docs/en/aix/7.3?topic=t-touch-command), it looks like it should be something like:

bash -c 'echo $$ > /var/cfengine/watchdog.pid && touch -t 02210927 /var/cfengine/watchdog.pid && exec sleep infinity' &

(That would set the mtime to 24 hours ago at this writing - Feb 21 at 9:27 in the morning.)
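
A minimal sketch of the backdating step together with a newest-first `ls -t` comparison, using throwaway temp files instead of the real pidfile. `newest_of` is a hypothetical helper name, and the 12-digit `touch -t` form with an explicit year is used here so the result does not depend on the current date:

```shell
# Hypothetical helper: print whichever of the given paths has the most
# recent mtime (ls -t sorts newest first, -d keeps directories as entries).
newest_of() {
    ls -1dt "$@" | head -n 1
}

old=$(mktemp)    # will be backdated
new=$(mktemp)    # keeps its creation time (now)

# touch -t takes [[CC]YY]MMDDhhmm: this is Feb 21 2024, 09:27 local time.
touch -t 202402210927 "$old"

newest_of "$old" "$new"    # prints the path stored in $new
rm -f "$old" "$new"
```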

Then in each case run the watchdog and see what happens, maybe convert my inline comments to echo commands so you can see which branch is taken.  (E.g. echo "Pidfile is definitely correct")

I much prefer submitting fully tested code but unfortunately I don't have access to any AIX systems, so if you could try the above it would be very much appreciated.  :)  And this way any issues can be found before the updated code is included in any release.

Best,
--Mike Weilgart

Xander Cage

Feb 23, 2024, 3:20:49 AM
to help-cfengine
no problem...will do the tests next week...

Xander Cage

Feb 26, 2024, 4:44:01 AM
to help-cfengine
did the requested tests...


root@aixtest01: /root # bash -c 'echo $$ > /var/cfengine/watchdog_changed.pid && exec /opt/freeware/bin/sleep infinity' &
[2] 22020394
root@aixtest01: /root # /var/cfengine/bin/watchdog_changed

log output:

Mon Feb 26 10:36:45 CET 2024 Initiating watchdog 4456948
Mon Feb 26 10:36:47 CET 2024 Found 0 symptoms, threshold (0) not breached, no remediation or collection performed
Mon Feb 26 10:36:47 CET 2024 DONE watchdog 4456948


root@aixtest01: /root # bash -c 'echo $$ > /var/cfengine/watchdog_changed.pid'
root@aixtest01: /root # /var/cfengine/bin/watchdog_changed

log output:

Mon Feb 26 10:37:52 CET 2024 Initiating watchdog 17432960
Mon Feb 26 10:37:53 CET 2024 Found 0 symptoms, threshold (0) not breached, no remediation or collection performed
Mon Feb 26 10:37:53 CET 2024 DONE watchdog 17432960


root@aixtest01: /root # bash -c 'echo $$ > /var/cfengine/watchdog_changed.pid && touch -t 02210927 /var/cfengine/watchdog_changed.pid && exec /opt/freeware/bin/sleep infinity' &
[3] 32964974
root@aixtest01: /root # /var/cfengine/bin/watchdog_changed

[3]+  Stopped                 bash -c 'echo $$ > /var/cfengine/watchdog_changed.pid && touch -t 02210927 /var/cfengine/watchdog_changed.pid && exec /opt/freeware/bin/sleep infinity'

log output:

Mon Feb 26 10:39:28 CET 2024 Initiating watchdog 12321050
Mon Feb 26 10:39:31 CET 2024 Found 0 symptoms, threshold (0) not breached, no remediation or collection performed
Mon Feb 26 10:39:31 CET 2024 DONE watchdog 12321050

expected something more exciting...but it is what it is i guess....

Mike Weilgart

Feb 26, 2024, 1:54:42 PM
to Xander Cage, help-cfengine
Hi Xander,

From that output it doesn't appear that the edited watchdog is looking at the same modified location for the pidfile as you're using, since the first test should have produced an "abort" message.

Could you please run those same tests again but convert the inline comments to echo commands as shown below, so we can see which branch is being taken?

Expected results: Your first test should give "Pidfile is definitely correct" and an abort message; second test should say "Pidfile is stale, ignore it"; third test should say "No current process matching pid in file".

Best,
--Mike Weilgart

######
if [ -s $PIDFILE ]; then
    echo We have a pidfile
    if ps -p $(cat $PIDFILE) > /dev/null 2>&1 ; then
        echo 'There is a process with the PID in the file, but is it stale?'
        if [ -d /proc ]; then
            echo "We can know for sure if it's stale"
            actual_process="/proc/$(cat "$PIDFILE")"
            newer="$(ls -1dt "$PIDFILE" "$actual_process" | head -n 1)"
            if [ "$actual_process" = "$newer" ]; then
                echo Pidfile is stale, ignore it
                echo $$ > $PIDFILE
            else
                echo Pidfile is definitely correct
                echo "$(date) Aborting execution of watchdog $$, existing watchdog process $(cat $PIDFILE) running" >> ${LOGFILE}
                exit 1
            fi
        else
            echo "No /proc, pidfile shows a running process, we'll assume it's valid"
            echo "$(date) Aborting execution of watchdog $$, existing watchdog process $(cat $PIDFILE) running" >> ${LOGFILE}
            exit 1
        fi
    else
        echo No current process matching pid in file
        echo $$ > $PIDFILE
    fi
else
    echo No pidfile at all
    echo $$ > $PIDFILE
fi
######

Xander Cage

Feb 27, 2024, 10:15:41 AM
to help-cfengine
new run with echoes...

root@aixtest01: /root # bash -c 'echo $$ > /var/cfengine/watchdog_changed.pid && exec /opt/freeware/bin/sleep infinity' &
[1] 4456792
root@aixtest01: /root # /var/cfengine/bin/watchdog_changed
We have a pidfile

There is a process with the PID in the file, but is it stale?
We can know for sure if its stale

Pidfile is stale, ignore it
root@aixtest01: /root # bash -c 'echo $$ > /var/cfengine/watchdog_changed.pid'
root@aixtest01: /root # /var/cfengine/bin/watchdog_changed
We have a pidfile

No current process matching pid in file
root@aixtest01: /root # bash -c 'echo $$ > /var/cfengine/watchdog_changed.pid && touch -t 02210927 /var/cfengine/watchdog_changed.pid && exec /opt/freeware/bin/sleep infinity' &
[2] 12321240
root@aixtest01: /root # /var/cfengine/bin/watchdog_changed
We have a pidfile

There is a process with the PID in the file, but is it stale?
We can know for sure if its stale

Pidfile is stale, ignore it

[2]+  Stopped                 bash -c 'echo $$ > /var/cfengine/watchdog_changed.pid && touch -t 02210927 /var/cfengine/watchdog_changed.pid && exec /opt/freeware/bin/sleep infinity'

Mike Weilgart

Mar 6, 2024, 10:55:05 PM
to Xander Cage, help-cfengine
Thanks, Xander.

On closer inspection and some testing on a Linux system, there is a possible race condition where a process can actually have a /proc entry that is a fraction of a second newer than a file created by that very process.  That's why the commands I gave you for testing didn't function as expected.

Example output:

# bash -c 'echo $$ > pidfile; ls -ldt --full-time pidfile /proc/$$'
dr-xr-xr-x 9 root root 0 2024-03-07 03:31:15.961650417 +0000 /proc/742781
-rw-r--r-- 1 root root 7 2024-03-07 03:31:15.953650428 +0000 pidfile

On the surface of it this would look like a bug, wherein my modified watchdog could potentially see a factually valid pidfile that matched the actual running instance of the watchdog that generated the pidfile—but the new watchdog would disregard the pidfile because it's older than the running process, according to /proc.  This would be a problem.

However, when I added a run of 'ps -p $$' in my test command ahead of the pidfile creation, I was unable to trigger the race condition and the timestamps are much further apart than the discrepancy shown above.  (I tried many times; the sequence of events seems completely reliable and the time interval much more consistent.)

# bash -c 'ps -p $$; echo $$ > pidfile; ls -ldt --full-time pidfile /proc/$$'
    PID TTY          TIME CMD
 750818 pts/1    00:00:00 bash
-rw-r--r-- 1 root root 7 2024-03-07 03:36:57.569061554 +0000 pidfile
dr-xr-xr-x 9 root root 0 2024-03-07 03:36:57.537061615 +0000 /proc/750818

I speculate that the /proc entry has to be created before the ps command can run successfully, so the kernel will do that before continuing, whereas in the earlier command the pidfile can go ahead and get created when the kernel hasn't yet bothered with the /proc update.  In the watchdog code, there is a run of ps -p on the pidfile contents before pidfile creation, so I think it is likely impossible for this race condition to be hit with the script as written.  (I can't prove it definitively.)
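
To put numbers on this rather than rely on a handful of manual runs, the comparison could be repeated in a loop. A rough sketch, assuming a Linux-style /proc; the function name `count_proc_newer` and its two arguments are made up for illustration:

```shell
# Count how often /proc/<pid> sorts newer than a pidfile that the same
# process has just written.
#   $1 = number of child shells to start
#   $2 = "yes" to run ps -p $$ in the child before writing the pidfile
# The body runs in a subshell so its variables do not leak.
count_proc_newer() (
    runs=$1
    warmup=$2
    dir=$(mktemp -d)
    newer=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        newest=$(WARMUP=$warmup PIDFILE="$dir/pidfile" sh -c '
            [ "$WARMUP" = yes ] && ps -p $$ > /dev/null 2>&1
            echo $$ > "$PIDFILE"
            ls -1dt "$PIDFILE" "/proc/$$" 2> /dev/null | head -n 1
        ')
        case $newest in
            /proc/*) newer=$((newer + 1)) ;;
        esac
        i=$((i + 1))
    done
    rm -rf "$dir"
    echo "$newer"
)
```

Comparing `count_proc_newer 100 no` with `count_proc_newer 100 yes` would show whether the ps call changes the ordering on a given system. On a host without /proc/<pid> entries the count is simply 0.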

In any case—could you please try the following modified test command?  Just the one test case, since this is the only "interesting" code path left untested.

bash -c 'ps -p $$; echo $$ > /var/cfengine/watchdog_changed.pid; /opt/freeware/bin/sleep infinity' &
/var/cfengine/bin/watchdog_changed

Expected output from the second part:

We have a pidfile
There is a process with the PID in the file, but is it stale?
We can know for sure if it's stale
Pidfile is definitely correct

Best,
--Mike Weilgart

Xander Cage

Mar 11, 2024, 8:39:58 AM
to help-cfengine
here you go...not changed much tho...

root@aixtest01: /root # bash -c 'ps -p $$; echo $$ > /var/cfengine/watchdog_changed.pid; /opt/freeware/bin/sleep infinity' &
[1] 17432958
root@aixtest01: /root #       PID    TTY  TIME CMD
 17432958 pts/11  0:00 bash_64

root@aixtest01: /root # /var/cfengine/bin/watchdog_changed
We have a pidfile
No current process matching pid in file


Xander Cage

Mar 11, 2024, 8:47:02 AM
to help-cfengine
btw...i think we found out what the reason for all this was...it seems to have to do with lpar reboots...if the watchdog is running at reboot time, it gets killed mid-run and the pidfile is not removed, and the registered pid "MIGHT" be picked up by another process (even in aix, which has some countermeasures against this), which leads to the described problems.
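
One partial mitigation for the killed-mid-run case, sketched under assumptions: have the script remove its own pidfile when it exits or receives a catchable signal. `with_pidfile` is a hypothetical helper, not part of the watchdog. A trap cannot run on SIGKILL or a hard power-off, so a staleness check on the next start would still be needed.

```shell
# Hypothetical helper: run a command while holding a pidfile, and remove
# the pidfile on normal exit or on HUP/INT/TERM.
# The body runs in a subshell so the traps stay local to this call.
with_pidfile() (
    pidfile=$1
    shift
    trap 'rm -f "$pidfile"' EXIT    # cleanup on any exit of the subshell
    trap 'exit 1' HUP INT TERM      # turn these signals into an exit
    # Inside a subshell, $$ is still the PID of the invoking script.
    echo $$ > "$pidfile"
    "$@"
)
```

Usage would look like `with_pidfile /var/cfengine/watchdog.pid some_command`, where `some_command` is a placeholder for the actual work.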