Version: 2.2.4 Symptoms: constant deferred processing, almost to the exclusion of the incoming queue.
I am seeing delivery delays of up to 1800 seconds on incoming mail that is delivered locally. This seems to be caused by a large amount of mail in the deferred queue. This never seemed to be a problem on the current production system running version 2.1.1, where we actually have 3 times as much deferred mail.
It appears that the real cause is that postfix is constantly looping through and trying to send this deferred queue, instead of following the backoff timers. I have the following backoffs set... minimal_backoff_time = 2h maximal_backoff_time = 6h
The queue spool is local hard drives and I see the following if I look at one of the queue files... [root@mailserver postfix]# perl -le ' map{ stat( $_ ); printf "%s\n\t%s\n\t%s\n\t%s\n", $_, scalar(localtime((stat(_))[8]));
} @ARGV;' deferred/7/7E581C15AB
deferred/7/7E581C15AB Tue Dec 6 12:57:36 2005
But I see it constantly trying in the log file... Dec 6 06:41:31 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=385220, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out) Dec 6 06:57:36 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=386185, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out) Dec 6 07:13:42 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=387151, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out) Dec 6 07:37:11 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=388560, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out) Dec 6 07:49:05 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=389274, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
Now, with the above script, after the latest log line, I get a timestamp of Tue Dec 6 13:49:05 2005. Yet, I can guarantee that within 15-20 minutes after the last log entry, it will try again. It's almost as if as soon as it runs the deferred queue, it goes "I'm bored, I'm going to run it again".
This constant spinning through the deferred queue is degrading the incoming queue performance.
What can I do to correct this behavior? Let me know if you need additional info.
-- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+ | Michael Smith <msm...@datafoundry.com> | Voice - 512-684-9300 | | Senior Systems Engineer | Fax - 512-684-9301 | | Data Foundry, Inc. | | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
> But I see it constantly trying in the log file... > Dec 6 06:41:31 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=385220, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
[and other records with the same queue manager PID and queue ID]
The Postfix queue manager won't access files with "future" time stamps, except when someone tells it to ignore the time stamp.
Hmm, something isn't working right then, since I'm the only one on the box with root admin at the moment. So I know that nobody is doing either #1 or first two parts of #2. As for the 2nd part of #2, the box has no external smtp connections coming in to it at the moment, except my 'telnet localhost 25' to inject a message for delivery to test what kind of delays I'm getting.
As for #3, the values are default and postconf reports fast_flush_domains = $relay_domains fast_flush_purge_time = 7d fast_flush_refresh_time = 12h
The timestamp of the file is now 6 hours in the future (Tue Dec 6 14:21:06 2005), and has tried to send out another 4 or 5 times since I sent the first message.
On Tue, Dec 06, 2005 at 09:25:46AM -0500, Wietse Venema wrote: > Mike Smith: > > [root@mailserver postfix]# perl -le ' > > map{ > > stat( $_ ); > > printf "%s\n\t%s\n\t%s\n\t%s\n", $_, scalar(localtime((stat(_))[8])); > > } @ARGV;' deferred/7/7E581C15AB > > deferred/7/7E581C15AB > > Tue Dec 6 12:57:36 2005
> > But I see it constantly trying in the log file... > > Dec 6 06:41:31 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=385220, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
> [and other records with the same queue manager PID and queue ID]
> The Postfix queue manager won't access files with "future" time > stamps, except when someone tells it to ignore the time stamp.
> > > But I see it constantly trying in the log file... > > > Dec 6 06:41:31 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=385220, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
> > [and other records with the same queue manager PID and queue ID]
> > The Postfix queue manager won't access files with "future" time > > stamps, except when someone tells it to ignore the time stamp.
On Tue, Dec 06, 2005 at 08:36:59AM -0600, Mike Smith wrote: > Hmm, something isn't working right then, since I'm the only one > on the box with root admin at the moment. So I know that nobody > is doing either #1 or first two parts of #2.
By default neither require root privs, any user can flush the queue.
Disclaimer: off-list followups get on-list replies or get ignored. Please do not ignore the "Reply-To" header.
To unsubscribe from the postfix-users list, visit http://www.postfix.org/lists.html or click the link below: <mailto:majord...@postfix.org?body=unsubscribe%20postfix-users>
If my response solves your problem, the best way to thank me is to not send an "it worked, thanks" follow-up. If you must respond, please put "It worked, thanks" in the "Subject" so I can delete these quickly.
Now I have no idea what is going on with it. I added the desired debugging to qmgr and flush, then reloaded postfix. After it ran through the queue once (took about 15-20 minutes and generated about 2.7gig worth of logs), it stopped processing any messages. There are incredibly few log lines now from postfix.
qmgr fifo n - n 300 1 qmgr -v -v flush unix n - n 1000? 0 flush -v -v
It now seems to hit the watchdog for flush, do a few lines of logs... scan_dir_next: found <some domain> flush_refresh_service: skip logfile <some domain> - empty log [repeat for various domains] scan_dir_pop: close flush connection closed watchdog_stop: 0x639728 watchdog_start: 0x639728
I connected manually to port 25 and injected a message. I see it go in to the incoming queue, where it has now sat for an hour, with nothing in the logs indicating it even trying to process it. There seem to be about 11,000 files in the incoming folder. Qshape only shows about 600 messages active. I only see about 69 files in the active directory, and all of their timestamps are now 2 hours old.
> > > > But I see it constantly trying in the log file... > > > > Dec 6 06:41:31 mailserver postfix/qmgr[25079]: 7E581C15AB: to=<su...@dreco.com>, relay=none, delay=385220, status=deferred (delivery temporarily suspended: connect to srvhoushpexc01.natoil.com[209.16.57.21]: Connection timed out)
> > > [and other records with the same queue manager PID and queue ID]
> > > The Postfix queue manager won't access files with "future" time > > > stamps, except when someone tells it to ignore the time stamp.
On Tue, Dec 06, 2005 at 11:21:59AM -0600, Mike Smith wrote: > Now I have no idea what is going on with it. I added the desired debugging > to qmgr and flush, then reloaded postfix. After it ran through > the queue once (took about 15-20 minutes and generated about 2.7gig > worth of logs), it stopped processing any messages. There are > incredibly few log lines now from postfix.
> qmgr fifo n - n 300 1 qmgr -v -v > flush unix n - n 1000? 0 flush -v -v
> It now seems to hit the watchdog for flush, do a few lines of logs... > scan_dir_next: found <some domain> > flush_refresh_service: skip logfile <some domain> - empty log > [repeat for various domains] > scan_dir_pop: close flush > connection closed > watchdog_stop: 0x639728 > watchdog_start: 0x639728
> I connected manually to port 25 and injected a message. I see it go in > to the incoming queue, where it has now sat for an hour, with nothing > in the logs indicating it even trying to process it.
Try "postkick -c /etc/postfix-residential public qmgr I", does that start an incoming queue scan? Does cleanup succeed in sending triggers to the queue manager? You are in too deep here. You could troubleshoot for a long time and perhaps solve it, or build a new working system with a known solid OS and forget this one.
-- Viktor.
Disclaimer: off-list followups get on-list replies or get ignored. Please do not ignore the "Reply-To" header.
To unsubscribe from the postfix-users list, visit http://www.postfix.org/lists.html or click the link below: <mailto:majord...@postfix.org?body=unsubscribe%20postfix-users>
If my response solves your problem, the best way to thank me is to not send an "it worked, thanks" follow-up. If you must respond, please put "It worked, thanks" in the "Subject" so I can delete these quickly.
Ok, we've managed to track down the problem, and uncovered another minor problem (which shouldn't really affect many people at all).
After several hours of pouring through the source and following exactly what was going on, turns out it was the fast flush service that was constantly running the queues. It seems that another admin, in their ultimate wisdom, decided that no filesystem should have atime updated due to performance reasons, so had noatime added to all mount points. Since fast flush uses atime to judge the age of a fast flush queue file, they were always old and "needed to be run". That has been fixed.
The minor problem we found was touched on here. It seems that with that level of debug on qmgr and flush, there is a leak somewhere. Doesn't seem like a memory leak, but a leak in some resource. After having the debug on for a while the entire system stops processing new or deferred messages. The messages still go in to the queue, but are never processed. The amount of time for this lockup seems to be dependant on the volume of mail coming through. The higher the volume, the shorter the amount of time until it stops processing. Since this is related to enabling the debug, it is unlikely to really be a problem for anyone. Once the debugging is removed and/or postfix restarted, all the stuck mail is processed so nothing appears to be lost, only delayed.
On Tue, Dec 06, 2005 at 11:21:59AM -0600, Mike Smith wrote: > Now I have no idea what is going on with it. I added the desired debugging > to qmgr and flush, then reloaded postfix. After it ran through > the queue once (took about 15-20 minutes and generated about 2.7gig > worth of logs), it stopped processing any messages. There are > incredibly few log lines now from postfix.
> qmgr fifo n - n 300 1 qmgr -v -v > flush unix n - n 1000? 0 flush -v -v
> It now seems to hit the watchdog for flush, do a few lines of logs... > scan_dir_next: found <some domain> > flush_refresh_service: skip logfile <some domain> - empty log > [repeat for various domains] > scan_dir_pop: close flush > connection closed > watchdog_stop: 0x639728 > watchdog_start: 0x639728
> I connected manually to port 25 and injected a message. I see it go in > to the incoming queue, where it has now sat for an hour, with nothing > in the logs indicating it even trying to process it. There seem > to be about 11,000 files in the incoming folder. Qshape only shows about > 600 messages active. I only see about 69 files in the active > directory, and all of their timestamps are now 2 hours old.
On Wed, Dec 14, 2005 at 08:57:38AM -0600, Mike Smith wrote: > After several hours of pouring through the source and following exactly > what was going on, turns out it was the fast flush service that was > constantly running the queues. It seems that another admin, in their > ultimate wisdom, decided that no filesystem should have atime updated > due to performance reasons, so had noatime added to all mount points.
> Since fast flush uses atime to judge the age of a fast flush queue file, > they were always old and "needed to be run". That has been fixed.
Good catch! Perhaps flush(8) should explicitly set the atime after refreshing the file? Wietse?
-- Viktor.
Disclaimer: off-list followups get on-list replies or get ignored. Please do not ignore the "Reply-To" header.
To unsubscribe from the postfix-users list, visit http://www.postfix.org/lists.html or click the link below: <mailto:majord...@postfix.org?body=unsubscribe%20postfix-users>
If my response solves your problem, the best way to thank me is to not send an "it worked, thanks" follow-up. If you must respond, please put "It worked, thanks" in the "Subject" so I can delete these quickly.
> On Wed, Dec 14, 2005 at 08:57:38AM -0600, Mike Smith wrote:
> > After several hours of pouring through the source and following exactly > > what was going on, turns out it was the fast flush service that was > > constantly running the queues. It seems that another admin, in their > > ultimate wisdom, decided that no filesystem should have atime updated > > due to performance reasons, so had noatime added to all mount points.
> > Since fast flush uses atime to judge the age of a fast flush queue file, > > they were always old and "needed to be run". That has been fixed.
> Good catch! Perhaps flush(8) should explicitly set the atime after > refreshing the file? Wietse?
That would be wasteful, at least for myself and for many other sites.
Postfix expects standard file system semantics; I don't wish to work around or even attempt to detect non-standard mount options that turn off "chmod +x" (Digital UNIX), atimes, mtimes or that make the queue read-only, or that botch system calls for the sake of security.
On Wed, Dec 14, 2005 at 11:37:23AM -0500, Wietse Venema wrote: > Victor Duchovni: > > On Wed, Dec 14, 2005 at 08:57:38AM -0600, Mike Smith wrote:
> > > After several hours of pouring through the source and following exactly > > > what was going on, turns out it was the fast flush service that was > > > constantly running the queues. It seems that another admin, in their > > > ultimate wisdom, decided that no filesystem should have atime updated > > > due to performance reasons, so had noatime added to all mount points.
> > > Since fast flush uses atime to judge the age of a fast flush queue file, > > > they were always old and "needed to be run". That has been fixed.
> > Good catch! Perhaps flush(8) should explicitly set the atime after > > refreshing the file? Wietse?
> That would be wasteful, at least for myself and for many other > sites.
> Postfix expects standard file system semantics; I don't wish to > work around or even attempt to detect non-standard mount options > that turn off "chmod +x" (Digital UNIX), atimes, mtimes or that > make the queue read-only, or that botch system calls for the sake > of security.
Yet I seem to recall that the advice to set "noatime" has been routinely given (by others) in the past as a possible way of improving Postfix performance on slow IDE drives... We could concede the battle as lost and allow people to disable atime updates and set an option that tells flush to do it explicitly... Or the message to not disable "noatime" (if fast_flush_domains is not empty) could be made more strongly.
-- Viktor.
Disclaimer: off-list followups get on-list replies or get ignored. Please do not ignore the "Reply-To" header.
To unsubscribe from the postfix-users list, visit http://www.postfix.org/lists.html or click the link below: <mailto:majord...@postfix.org?body=unsubscribe%20postfix-users>
If my response solves your problem, the best way to thank me is to not send an "it worked, thanks" follow-up. If you must respond, please put "It worked, thanks" in the "Subject" so I can delete these quickly.
> On Wed, Dec 14, 2005 at 11:37:23AM -0500, Wietse Venema wrote:
> > Victor Duchovni: > > > On Wed, Dec 14, 2005 at 08:57:38AM -0600, Mike Smith wrote:
> > > > After several hours of pouring through the source and following exactly > > > > what was going on, turns out it was the fast flush service that was > > > > constantly running the queues. It seems that another admin, in their > > > > ultimate wisdom, decided that no filesystem should have atime updated > > > > due to performance reasons, so had noatime added to all mount points.
> > > > Since fast flush uses atime to judge the age of a fast flush queue file, > > > > they were always old and "needed to be run". That has been fixed.
> > > Good catch! Perhaps flush(8) should explicitly set the atime after > > > refreshing the file? Wietse?
> > That would be wasteful, at least for myself and for many other > > sites.
> > Postfix expects standard file system semantics; I don't wish to > > work around or even attempt to detect non-standard mount options > > that turn off "chmod +x" (Digital UNIX), atimes, mtimes or that > > make the queue read-only, or that botch system calls for the sake > > of security.
> Yet I seem to recall that the advice to set "noatime" has been routinely > given (by others) in the past as a possible way of improving Postfix > performance on slow IDE drives... We could concede the battle as lost > and allow people to disable atime updates and set an option that tells > flush to do it explicitly... Or the message to not disable "noatime" > (if fast_flush_domains is not empty) could be made more strongly.
I don't recommend turning off atimes.
How would Postfix avoid the waste? How would Postfix reliably detect that <some file system feature> is botched?
Adding text to documentation will be useful only to blame victims.