I just want to follow up on this that I believe I have found the reason for this bug and I have a solution which I am testing out.
In the output of `systemctl status systemd-journald` it says:
Active: failed (Result: start-limit)
and this is related to the systemd mechanism to stop cycles where services keep restarting. It's related to the configs
StartLimitIntervalSec= [defaults to DefaultStartLimitIntervalSec=10 sec by default]
StartLimitBurst= [defaults to DefaultStartLimitBurst=5 by default]
so that if a service is restarted 5 times within 10 seconds, it's not restarted any more to avoid wasting resources.
What seems to happen is that if systemd-journal-flush.service sends the signal SIGUSR1 to systemd-journald and systemd-journald wasn't ready to handle it for some reason (e.g. had just been started, or there was some hangup), then systemd-journald is restarted, and systemd-journal-flush.service immediately restarts also. Again systemd-journald won't be ready to handle the signal as it hasn't had time to run the signal handler, and the cycle repeats 5 times until the systemd-journald service hits the StartLimitBurst and dies. I verified using `lastcomm` (we have process accounting enabled) that this did in fact happen.
The way I am trying to fix it is to introduce a restart delay of 10 seconds into the systemd-journal-flush service. See the config file below; the only new line that differs from the defaults in /lib/systemd/system/systemd-journal-flush.service is the line RestartSec=10.
Dan
======
cat /etc/systemd/system/systemd-journal-flush.service
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
[Unit]
Description=Trigger Flushing of Journal to Persistent Storage
Documentation=man:systemd-journald.service(8) man:journald.conf(5)
DefaultDependencies=no
Requires=systemd-journald.service
After=systemd-journald.service local-fs.target remote-fs.target
Before=systemd-user-sessions.service
# Don't re-start this service too often, or we can get into a cycle where
# if this service was run just after systemd-journald was started, it
# fails to handle the signal, causing that service to be restarted
# and this servie to be immediately restarted also, until we hit
# the limit of StartLimitBurst=5 restarts and systemd-journald
# dies.
RestartSec=10
[Service]
ExecStart=/bin/systemctl kill --kill-who=main --signal=SIGUSR1 systemd-journald.service
Type=oneshot