incoming mail delays caused by deferred mail

Mike Smith

unread,

Dec 6, 2005, 9:02:34 AM12/6/05

to

Version: 2.2.4
Symptoms: constant deferred processing, almost to the exclusion of
the incoming queue.

I am seeing delivery delays of up to 1800 seconds on incoming mail that
is delivered locally. This seems to be caused by a large amount of mail
in the deferred queue. This never seemed to be a problem on the
current production system running version 2.1.1, where we actually
have 3 times as much deferred mail.

It appears that the real cause is that postfix is constantly looping
through and trying to send this deferred queue, instead of following
the backoff timers. I have the following backoffs set...
minimal_backoff_time = 2h
maximal_backoff_time = 6h

The queue spool is local hard drives and I see the following if I look
at one of the queue files...
[root@mailserver postfix]# perl -le '
map{
stat( $_ );
printf "%s\n\t%s\n\t%s\n\t%s\n", $_, scalar(localtime((stat(_))[8]));
} @ARGV;' deferred/7/7E581C15AB
deferred/7/7E581C15AB
Tue Dec 6 12:57:36 2005

But I see it constantly trying in the log file...
Dec 6 06:41:31 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=385220, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
Dec 6 06:57:36 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=386185, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
Dec 6 07:13:42 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=387151, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
Dec 6 07:37:11 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=388560, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
Dec 6 07:49:05 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=389274, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)

Now, with the above script, after the latest log line, I get a timestamp
of Tue Dec 6 13:49:05 2005. Yet, I can guarantee that within 15-20
minutes after the last log entry, it will try again. It's almost as if
as soon as it runs the deferred queue, it goes "I'm bored, I'm going to
run it again".

This constant spinning through the deferred queue is degrading the
incoming queue performance.

What can I do to correct this behavior? Let me know if you need additional
info.

--
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
| Michael Smith <msm...@datafoundry.com> | Voice - 512-684-9300 |
| Senior Systems Engineer | Fax - 512-684-9301 |
| Data Foundry, Inc. | |
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+

Wietse Venema

unread,

Dec 6, 2005, 9:25:46 AM12/6/05

to

Mike Smith:

> [root@mailserver postfix]# perl -le '
> map{
> stat( $_ );
> printf "%s\n\t%s\n\t%s\n\t%s\n", $_, scalar(localtime((stat(_))[8]));
> } @ARGV;' deferred/7/7E581C15AB
> deferred/7/7E581C15AB
> Tue Dec 6 12:57:36 2005
>
> But I see it constantly trying in the log file...
> Dec 6 06:41:31 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=385220, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)

[and other records with the same queue manager PID and queue ID]

The Postfix queue manager won't access files with "future" time
stamps, except when someone tells it to ignore the time stamp.

The time stamp is ignored when

1) someone executes the destination-independent "postfix flush",
"postqueue -f", "sendmail -q",

2) someone executes the destination-dependent commands: "sendmail
-qRdomain", "postqueue -s domain", or with the SMTP ETRN command.

3) The flush(8) daemon refreshes its logfiles. This is controlled
with fast_flush_refresh_time and fast_flush_purge_time.

Wietse

Mike Smith

unread,

Dec 6, 2005, 9:36:59 AM12/6/05

to

Hmm, something isn't working right then, since I'm the only one
on the box with root admin at the moment. So I know that nobody
is doing either #1 or first two parts of #2. As for the 2nd part
of #2, the box has no external smtp connections coming in to it
at the moment, except my 'telnet localhost 25' to inject a message
for delivery to test what kind of delays I'm getting.

As for #3, the values are default and postconf reports
fast_flush_domains = $relay_domains fast_flush_purge_time = 7d fast_flush_refresh_time = 12h

The timestamp of the file is now 6 hours in the future (Tue Dec 6 14:21:06 2005), and has tried to send out another 4 or 5 times since I sent the
first message.

--

Wietse Venema

unread,

Dec 6, 2005, 9:56:01 AM12/6/05

to

Mike Smith:

> > Mike Smith:
> > > [root@mailserver postfix]# perl -le '
> > > map{
> > > stat( $_ );
> > > printf "%s\n\t%s\n\t%s\n\t%s\n", $_, scalar(localtime((stat(_))[8]));
> > > } @ARGV;' deferred/7/7E581C15AB
> > > deferred/7/7E581C15AB
> > > Tue Dec 6 12:57:36 2005
> > >
> > > But I see it constantly trying in the log file...
> > > Dec 6 06:41:31 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=385220, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
> >
> > [and other records with the same queue manager PID and queue ID]
> >
> > The Postfix queue manager won't access files with "future" time
> > stamps, except when someone tells it to ignore the time stamp.
> >
> > The time stamp is ignored when
> >
> > 1) someone executes the destination-independent "postfix flush",
> > "postqueue -f", "sendmail -q",

These show up in queue manager logging (qmgr -v -v in master.cf).

> > 2) someone executes the destination-dependent commands: "sendmail
> > -qRdomain", "postqueue -s domain", or with the SMTP ETRN command.

These show up in flush daemon logging (flush -v -v in master.cf).

> > 3) The flush(8) daemon refreshes its logfiles. This is controlled
> > with fast_flush_refresh_time and fast_flush_purge_time.

This also shows up in flush daemon logging.

Wietse

Victor Duchovni

unread,

Dec 6, 2005, 10:00:11 AM12/6/05

to

On Tue, Dec 06, 2005 at 08:36:59AM -0600, Mike Smith wrote:

> Hmm, something isn't working right then, since I'm the only one
> on the box with root admin at the moment. So I know that nobody
> is doing either #1 or first two parts of #2.

By default neither require root privs, any user can flush the queue.

http://www.postfix.org/postconf.5.html#authorized_flush_users

--
Viktor.

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the "Reply-To" header.

To unsubscribe from the postfix-users list, visit
http://www.postfix.org/lists.html or click the link below:
<mailto:majo...@postfix.org?body=unsubscribe%20postfix-users>

If my response solves your problem, the best way to thank me is to not
send an "it worked, thanks" follow-up. If you must respond, please put
"It worked, thanks" in the "Subject" so I can delete these quickly.

Mike Smith

unread,

Dec 6, 2005, 12:21:59 PM12/6/05

to

Now I have no idea what is going on with it. I added the desired debugging
to qmgr and flush, then reloaded postfix. After it ran through
the queue once (took about 15-20 minutes and generated about 2.7gig
worth of logs), it stopped processing any messages. There are
incredibly few log lines now from postfix.

qmgr fifo n - n 300 1 qmgr -v -v
flush unix n - n 1000? 0 flush -v -v

It now seems to hit the watchdog for flush, do a few lines of logs...
scan_dir_next: found <some domain>
flush_refresh_service: skip logfile <some domain> - empty log
[repeat for various domains]
scan_dir_pop: close flush
connection closed
watchdog_stop: 0x639728
watchdog_start: 0x639728

I connected manually to port 25 and injected a message. I see it go in
to the incoming queue, where it has now sat for an hour, with nothing
in the logs indicating it even trying to process it. There seem
to be about 11,000 files in the incoming folder. Qshape only shows about
600 messages active. I only see about 69 files in the active
directory, and all of their timestamps are now 2 hours old.

qshape -c /etc/postfix-residential active
T 5 10 20 40 80 160 320 640 1280 1280+
TOTAL 594 0 0 0 0 0 0 0 0 0 594
obht.com 594 0 0 0 0 0 0 0 0 0 594

--

Victor Duchovni

unread,

Dec 6, 2005, 12:31:09 PM12/6/05

to

On Tue, Dec 06, 2005 at 11:21:59AM -0600, Mike Smith wrote:

> Now I have no idea what is going on with it. I added the desired debugging
> to qmgr and flush, then reloaded postfix. After it ran through
> the queue once (took about 15-20 minutes and generated about 2.7gig
> worth of logs), it stopped processing any messages. There are
> incredibly few log lines now from postfix.
>
> qmgr fifo n - n 300 1 qmgr -v -v
> flush unix n - n 1000? 0 flush -v -v
>
> It now seems to hit the watchdog for flush, do a few lines of logs...
> scan_dir_next: found <some domain>
> flush_refresh_service: skip logfile <some domain> - empty log
> [repeat for various domains]
> scan_dir_pop: close flush
> connection closed
> watchdog_stop: 0x639728
> watchdog_start: 0x639728
>
> I connected manually to port 25 and injected a message. I see it go in
> to the incoming queue, where it has now sat for an hour, with nothing
> in the logs indicating it even trying to process it.

Try "postkick -c /etc/postfix-residential public qmgr I", does that
start an incoming queue scan? Does cleanup succeed in sending triggers
to the queue manager? You are in too deep here. You could troubleshoot
for a long time and perhaps solve it, or build a new working system with
a known solid OS and forget this one.

Mike Smith

unread,

Dec 14, 2005, 9:57:38 AM12/14/05

to

Ok, we've managed to track down the problem, and uncovered another minor
problem (which shouldn't really affect many people at all).

After several hours of pouring through the source and following exactly
what was going on, turns out it was the fast flush service that was
constantly running the queues. It seems that another admin, in their
ultimate wisdom, decided that no filesystem should have atime updated
due to performance reasons, so had noatime added to all mount points.
Since fast flush uses atime to judge the age of a fast flush queue file,
they were always old and "needed to be run". That has been fixed.

The minor problem we found was touched on here. It seems that with that
level of debug on qmgr and flush, there is a leak somewhere. Doesn't
seem like a memory leak, but a leak in some resource. After having the
debug on for a while the entire system stops processing new or deferred
messages. The messages still go in to the queue, but are never processed.
The amount of time for this lockup seems to be dependant on the volume
of mail coming through. The higher the volume, the shorter the amount
of time until it stops processing. Since this is related to enabling the
debug, it is unlikely to really be a problem for anyone. Once the
debugging is removed and/or postfix restarted, all the stuck mail is
processed so nothing appears to be lost, only delayed.

On Tue, Dec 06, 2005 at 11:21:59AM -0600, Mike Smith wrote:

--

Victor Duchovni

unread,

Dec 14, 2005, 10:41:54 AM12/14/05

to

On Wed, Dec 14, 2005 at 08:57:38AM -0600, Mike Smith wrote:

> After several hours of pouring through the source and following exactly
> what was going on, turns out it was the fast flush service that was
> constantly running the queues. It seems that another admin, in their
> ultimate wisdom, decided that no filesystem should have atime updated
> due to performance reasons, so had noatime added to all mount points.
>
> Since fast flush uses atime to judge the age of a fast flush queue file,
> they were always old and "needed to be run". That has been fixed.

Good catch! Perhaps flush(8) should explicitly set the atime after
refreshing the file? Wietse?

Wietse Venema

unread,

Dec 14, 2005, 11:37:23 AM12/14/05

to

Victor Duchovni:

> On Wed, Dec 14, 2005 at 08:57:38AM -0600, Mike Smith wrote:
>
> > After several hours of pouring through the source and following exactly
> > what was going on, turns out it was the fast flush service that was
> > constantly running the queues. It seems that another admin, in their
> > ultimate wisdom, decided that no filesystem should have atime updated
> > due to performance reasons, so had noatime added to all mount points.
> >
> > Since fast flush uses atime to judge the age of a fast flush queue file,
> > they were always old and "needed to be run". That has been fixed.
>
> Good catch! Perhaps flush(8) should explicitly set the atime after
> refreshing the file? Wietse?

That would be wasteful, at least for myself and for many other
sites.

Postfix expects standard file system semantics; I don't wish to
work around or even attempt to detect non-standard mount options
that turn off "chmod +x" (Digital UNIX), atimes, mtimes or that
make the queue read-only, or that botch system calls for the sake
of security.

Wietse

Victor Duchovni

unread,

Dec 14, 2005, 11:49:21 AM12/14/05

to

Yet I seem to recall that the advice to set "noatime" has been routinely
given (by others) in the past as a possible way of improving Postfix
performance on slow IDE drives... We could concede the battle as lost
and allow people to disable atime updates and set an option that tells
flush to do it explicitly... Or the message to not disable "noatime"
(if fast_flush_domains is not empty) could be made more strongly.

Wietse Venema

unread,

Dec 14, 2005, 12:25:58 PM12/14/05

to

I don't recommend turning off atimes.

How would Postfix avoid the waste? How would Postfix reliably detect
that <some file system feature> is botched?

Adding text to documentation will be useful only to blame victims.

Wietse