Strange problem during system reboot

Berthold Cogel

Dec 19, 2022, 5:03:23 AM
to help-cfengine
Hello,

We are running into a problem during reboots of some of our systems.

OS: RHEL 7.9
cfengine: cfengine-community-3.18.2-1.el7.x86_64

We have /var in a separate logical volume, which contains the config for
cfengine. cfengine writes backups of changed files to /service/cfbackup
(on larger systems, /service is a separate LVM volume).

Now for the problem:
After rebooting a system, quite a lot of config files maintained by
cfengine ended up being empty (zero bytes), which obviously wreaks all
kinds of havoc. This was eventually fixed by a second reboot and
cfengine repairing the missing/empty config files (e.g. /etc/ntpd.conf).
Among others, also the rsyslog.conf was damaged, making debugging a lot
harder.

Findings so far:
When rebooting this system, no messages pertaining to stopping cfengine
are found in either /var/log/messages or journalctl. Testing another
system, we -do- see that cfengine is shut down after multi-user.target is
stopped.
It looks like cfengine is still running after filesystems are
unmounted, since we found files written by cfengine not inside the
mounted LVM volume /service, but rather in the plain mount directory
inside the root filesystem. The question is: Why is cfengine still
active while filesystems are being unmounted, and how did it manage to
write zero-length files when it should be maintaining the correct contents?
It looks somewhat like a race condition, since we do not see this
behavior on all our systems all the time, only in rare cases on large
systems with heavy services that take a bit of time to stop.


Regards
Berthold Cogel

craig.c...@northern.tech

Dec 19, 2022, 10:52:18 AM
to help-cfengine
Thanks for the report. This is indeed a serious situation.

We have an internal ticket about a similar issue which is also very hard to reproduce and seldom seen.

The ticket is different enough that I would be thankful if you could log a new ticket about your issue and possibly add a few more details, logs, that sort of thing.


-Craig

Nick Anderson

Dec 19, 2022, 1:19:47 PM
to craig.c...@northern.tech, help-c...@googlegroups.com

"'craig.c...@northern.tech' via help-cfengine" <help-c...@googlegroups.com> writes:

> Thanks for the report. This is indeed a serious situation.
>
> We have an internal ticket about a similar issue which is also very hard to reproduce and seldom seen.
>
> The ticket is different enough that I would be thankful if you could log a new ticket about your issue and possibly add a few more details, logs, that sort of thing.
>
> Log a ticket here: https://tracker.mender.io/projects/CFE

Given the mount points involved, it makes me wonder whether the systemd
units should be adjusted to require, and be ordered after, local-fs.target.
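
As a rough sketch of what that adjustment could look like (not a tested
fix), a systemd drop-in can add the dependency without editing the packaged
unit. The unit name cf-execd.service, the drop-in file name, and the use of
RequiresMountsFor= as a narrower alternative are assumptions for
illustration; the /var and /service paths come from the original report.

```ini
# Hypothetical drop-in file (path and unit name are assumptions):
#   /etc/systemd/system/cf-execd.service.d/10-local-fs.conf
[Unit]
# Start only after local filesystems are mounted. systemd stops units in
# the reverse of their start order, so this also asks for the service to
# be stopped before those filesystems are unmounted at shutdown.
Requires=local-fs.target
After=local-fs.target

# Assumed alternative: depend only on the mounts that matter here.
# RequiresMountsFor=/var /service
```

After adding such a file, `systemctl daemon-reload` picks it up, and
`systemctl show -p After cf-execd.service` should list the new ordering.
Any other cfengine units in use would presumably need the same treatment.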

--
Nick Anderson | Doer of Things | (+1) 785-550-1767 | https://northern.tech

t.d...@servicemusic.org.uk

Dec 19, 2022, 6:40:29 PM
to help-cfengine
(We, too, are on RHEL 7, although with an older version of CFE.)

This may be completely and totally unrelated.  But I offer it on the tenuous off-chance that it might, perhaps, be relevant.

We had a one-off significant incident many months ago, whose root cause was a nasty network disruption.  The observed net-effect symptom on many machines was that several vital CFE-maintained files (LDAP configs, etc.) became empty.  (On repair of the underlying network issue, CFE on those machines then gradually repaired those files.)

During post-mortem review, one thing that we identified was that the affected (emptied) files were templated, traditional-style, and included the clause "edit_defaults => empty".  The existence of that clause seems to have been ancient cargo-cult from one file, to another, to another, to another, etc.  But if a file is templated, then that clause seems superfluous.  And none of us could really explain or defend its ancient, cargo-cult presence.

We don't have an explanation for why CFE would have ended up emptying such files.  But it did lead us to consider whether that "edit_defaults => empty" clause might be superfluous in templated files, and to start removing its instances.  (We're also not sure whether that really would fully prevent the same effect under similar circumstances... nevertheless we're sure it would be no worse than before, which was bad enough anyway!)

So if you are templating and if those templates say "edit_defaults => empty", I wonder whether something vaguely similar to this (which was for us a one-off major network disruption) might be involved.

As I say, probably totally unrelated, so don't let it distract. 

-- David Lee