Bug#915379: anacron.service: should probably use KillMode=process

Ansgar Burchardt

Dec 3, 2018, 5:30:03 AM

Package: anacron
Version: 2.3-26
Severity: normal

anacron.service currently uses KillMode=mixed. It probably should
not.

KillMode=mixed sends SIGTERM to anacron and then SIGKILL to any
processes started by anacron. The default (KillMode=control-group)
would send SIGTERM to all processes, which is probably what one wants
if processes started by anacron should be stopped when anacron is (for
example during upgrades).

More likely, anacron should use KillMode=process. Then
stopping anacron would only send SIGTERM to anacron itself and leave
the jobs anacron might have started untouched. (This is what
cron.service uses.)
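
For illustration, a minimal sketch of what such a change could look like as
a local drop-in. The path and file name are assumptions, not something taken
from the package:

```ini
# Hypothetical drop-in: /etc/systemd/system/anacron.service.d/override.conf
# (path and file name are assumptions for illustration)
[Service]
# On stop, signal only the main anacron process and leave its jobs alone.
KillMode=process
```

A "systemctl daemon-reload" would be needed afterwards for the override to
take effect.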

Ansgar

Tomas Janousek

Nov 11, 2020, 4:50:04 AM

Hi Boyuan.

On Mon, Dec 03, 2018 at 01:42:58PM -0500, Boyuan Yang wrote:
> I chose KillMode=mixed intentionally. Here's the reason: we want to stop
> anacron service elegantly and not to abruptly kill any process inside the
> control group. Anacron is accepting SIGUSR1 and interprets it as the request
> to exit gracefully; it will wait till all its jobs to finish before it exits
> by itself.

Unfortunately this isn't entirely true.

Anacron does wait for jobs to finish, but if any job invokes exim4's
/usr/sbin/sendmail to send its result to the user, that sendmail forks the
setuid /usr/sbin/exim4¹ to process the queue in the background and deliver the
mail. Anacron doesn't wait for this forked process, and systemd kills it
immediately, resulting in the mail not being delivered until the queue gets
processed by something else.

This, I believe, is a rather serious issue.

¹) That might be different if another MTA is being used, obviously, but exim4
is still the default MTA that gets installed if any Debian package needs to
deliver (local) mail.

Now setting KillMode=process is one way to fix this. Another is adding an
additional

ExecStart=/bin/sleep 5
Type=oneshot

which adds a few seconds for the forked processes to finish before they're
killed.
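
Spelled out as a drop-in, the workaround might look roughly like the sketch
below. The file name is hypothetical, and it assumes the packaged unit's own
ExecStart= line is left in place so that the sleep is appended after it:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/anacron.service.d/sleep.conf
[Service]
# Type=oneshot permits more than one ExecStart= command.
Type=oneshot
# List settings in a drop-in are appended to the existing ones, so this
# runs after the packaged anacron command has exited.
ExecStart=/bin/sleep 5
```

The 5 seconds are arbitrary; anything still running after that would be
killed as before.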

It would be a bit nicer if systemd's killing had a timeout for these processes
that need to do a bit more work after the main process finished, but it seems
there is no such thing. :-(

> 90 seconds (a default value provided by systemd) after sending SIGUSR1,
> systemd will send SIGKILL to all processes in the control group; this
> SIGKILL ensures that jobs are eventually killed if they are not responding.
> I believe this is better than abruptly sending SIGTERM to all processes and
> definitely better than KillMode=process since it will leak processes out of
> systemd's control if the daemon is not handling things properly.

--
Tomáš Janoušek, a.k.a. Pivník, a.k.a. Liskni_si, https://work.lisk.in/

Marc Haber

Feb 26, 2021, 3:20:03 AM

On Wed, Nov 11, 2020 at 09:34:16AM +0000, Tomas Janousek wrote:
> On Mon, Dec 03, 2018 at 01:42:58PM -0500, Boyuan Yang wrote:
> > I chose KillMode=mixed intentionally. Here's the reason: we want to stop
> > anacron service elegantly and not to abruptly kill any process inside the
> > control group. Anacron is accepting SIGUSR1 and interprets it as the request
> > to exit gracefully; it will wait till all its jobs to finish before it exits
> > by itself.
>
> Unfortunately this isn't entirely true.
>
> Anacron does wait for jobs to finish, but if any job invokes exim4's
> /usr/sbin/sendmail to send its result to the user, that sendmail forks the
> setuid /usr/sbin/exim4¹ to process the queue in the background and deliver the
> mail. Anacron doesn't wait for this forked process, and systemd kills it
> immediately, resulting in the mail not being delivered until the queue gets
> processed by something else.

Worse. If the receiving side does post-DATA checking of the message, and
systemd sends SIGKILL to the exim process on the sending side, the
receiving side might continue delivery without the sending side noticing
the confirmation (it's already dead by then). During the next exim queue
run, the message will be delivered a second time.

In this case, 5 seconds of extra wait would probably not be enough.

Greetings
Marc

Tomas Janousek

Dec 25, 2021, 3:40:03 PM

Hi,

On Fri, Feb 26, 2021 at 09:08:18AM +0100, Marc Haber wrote:

> Worse. If the receiving side does post-DATA checking of the message, and
> systemd sends SIGKILL to the exim process on the sending side, the
> receiving side might continue delivery without the sending side noticing
> the confirmation (it's already dead by then). During the next exim queue
> run, the message will be delivered a second time.
>
> In this case, 5 seconds of extra wait would probably not be enough.

This entry in the systemd 250 NEWS gives me hope this might be fixed in a nice way eventually:

 * A new service unit file setting ExitType= has been added that
   specifies when to assume a service has exited. By default systemd
   only watches the main process of a service. By setting
   ExitType=cgroup it can be told to wait for the last process in a
   cgroup instead.

I'll probably experiment with it once systemd 250 lands in testing, unless someone beats me to it.

-- Tomáš "liskin" ("Pivník") Janoušek, https://lisk.in/

Tomas Janousek

Jan 23, 2022, 7:50:03 AM

Hi again,

On Sat, Dec 25, 2021 at 08:27:58PM +0000, Tomas Janousek wrote:

> This entry in the systemd 250 NEWS gives me hope this might be fixed in a nice way eventually:
>
>  * A new service unit file setting ExitType= has been added that
>    specifies when to assume a service has exited. By default systemd
>    only watches the main process of a service. By setting
>    ExitType=cgroup it can be told to wait for the last process in a
>    cgroup instead.
>
> I'll probably experiment with it once systemd 250 lands in testing, unless someone beats me to it.

I can confirm that using ExitType=cgroup fixes the issue as well—systemd now waits for all processes spawned by anacron to exit.
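
A minimal sketch of such an override, assuming it is applied as a drop-in on
top of the packaged unit (the exact configuration used for the test is not
stated here, so treat this as an illustration only):

```ini
# Hypothetical drop-in; ExitType= requires systemd 250 or later.
[Service]
# Consider the service exited only once the last process in its cgroup
# is gone, rather than when the main anacron process exits.
ExitType=cgroup
```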

Melvin Vermeeren

Aug 12, 2022, 2:40:04 PM

Hi,

I ran into this problem today. Automated upgrades with unattended-upgrades
upgraded debianutils. Then needrestart decided that anacron.service needed to
be restarted.

However, backup cron jobs were running at the time, which take longer than a
minute or two. After the brief timeout every single process in the chain got
killed abruptly, without any of the actual cron job tasks even receiving a
SIGTERM to clean up nicely. Worst of all, this means the process does not even
get a chance to report error/failure, so no mail ends up in the mailbox.

The result (with borgbackup at least) is a lot of manual work: cleaning up
stale repository locks, checking some caches, and manually unmounting and
removing the snapshots made for backup purposes.

I strongly feel that stopping/restarting anacron.service should never, ever
time out at all. A very long-running (possibly stuck) cron job should result in
a blocking (or failing) stop action which can then be investigated properly by
the administrator. Such an event would be a bug in another package and not a
problem with the anacron daemon.

Forcefully killing long-running cron jobs can have severe consequences. In
today's case it was recoverable but similar cron jobs could also perform
automated cleanup/pruning tasks in databases, registries, etc, where killing
is very, very much undesired and effectively as bad as a system crash for data
integrity purposes.

I can think of two ways to improve this.

1. Always let jobs finish cleanly: TimeoutStopSec=infinity
I strongly prefer this option in all cases (desktop/server/...).

2. SIGUSR1 anacron as is the case now, then on timeout SIGTERM to all
processes in the group, then on timeout again SIGKILL all processes in the
group. I must admit I don't know how to implement this with systemd services.

Could you share thoughts regarding this issue?

Thanks,

--
Melvin Vermeeren
Systems engineer

Lance Lin

Aug 23, 2022, 9:00:04 AM

Hello Melvin,

Thank you for your report.

> I can think of two ways to improve this.
>
> 1. Always let jobs finish cleanly: TimeoutStopSec=infinity
> I strongly prefer this option in all cases (desktop/server/...).

Sure, I think this is easy enough to do and it does make sense.

I can push a patch for this.

> 2. SIGUSR1 anacron as is the case now, then on timeout SIGTERM to all
> processes in the group, then on timeout again SIGKILL all processes in the
> group. I must admit I don't know how to implement this with systemd services.

This change is more involved. At present, anacron is in "legacy" status. I recently
picked up the package as part of the cronie transition. cronie is expected to replace
cron/anacron in the future and is actively developed by Fedora. I would suggest we place
major changes/improvements in that project.

What do you think?

Lance

Melvin Vermeeren

Aug 23, 2022, 3:40:04 PM

Hi Lance,

Thanks for your reply!

On Tuesday, 23 August 2022 14:33:11 CEST Lance Lin wrote:
> > 1. Always let jobs finish cleanly: TimeoutStopSec=infinity
> > I strongly prefer this option in all cases (desktop/server/...).
>
> Sure, I think this is easy enough to do and it does make sense.
>
> I can push a patch for this.

Sounds great, I also think this is the best way to solve it. I did some local
testing already with long-running jobs and can confirm TimeoutStopSec=infinity
in the [Service] section works perfectly. Anacron will finish its current job
cleanly (cron.daily etc) and only then stop/restart.
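
For anyone wanting to try the same thing locally before a patch lands, the
override amounts to something like the following. This is a sketch only, and
placing it in a drop-in rather than editing the packaged unit is an
assumption:

```ini
# Hypothetical local drop-in for anacron.service
[Service]
# Disable the stop timeout, so systemd never escalates to SIGKILL and
# instead waits for running jobs to finish.
TimeoutStopSec=infinity
```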

> > 2. SIGUSR1 anacron as is the case now, then on timeout SIGTERM to all
> > processes in the group, then on timeout again SIGKILL all processes in the
> > group. I must admit I don't know how to implement this with systemd
> > services.
> This change is more involved. At present, anacron is in "legacy" status. I
> recently picked up the package as part of the cronie transition. cronie is
> expected to replace cron/anacron in the future and is actively developed by
> Fedora. I would suggest we place major changes/improvements in that
> project.

Makes sense, I fully agree with you. Hotfixing anacron by disabling timeout
should be all that's needed until the cronie transition is complete.

Cheers,
