Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

skipped, still being delivered !!!!????

145 views
Skip to first unread message

Manuel Hendel

unread,
Oct 19, 2001, 5:14:34 PM10/19/01
to Postfix Users
Where does this message come from, and why does it come so often.
More than 5000 times a day. I didn't flushed postfix and just reloaded
once. For some reason the queue is flushed every minute, I don't know
why.
I already use "smtpd_etrn_restrictions = reject". I don't know what to
do anymore. I already tried to fix this problem with the help of Ralf
Hildebrand (Thank you very much!), but we couldn't get it.

Thanks for any hlep,

Manuel

-
To unsubscribe, send mail to majo...@postfix.org with content
(not subject): unsubscribe postfix-users

Loic Le Loarer

unread,
Oct 19, 2001, 5:34:41 PM10/19/01
to Postfix Users
Le vendredi 19 octobre 2001 à 23:14:34 +0200, Manuel Hendel a écrit:
> Where does this message come from, and why does it come so often.
> More than 5000 times a day. I didn't flushed postfix and just reloaded
> once. For some reason the queue is flushed every minute, I don't know
> why.
> I already use "smtpd_etrn_restrictions = reject". I don't know what to
> do anymore. I already tried to fix this problem with the help of Ralf
> Hildebrand (Thank you very much!), but we couldn't get it.
>
> Thanks for any hlep,
>
> Manuel

Postfix rerun the queue once a while (see queue_run_delay parameter) and
if a message takes a long time to be delivered (the smtp connection is
slow or the host is behind a firewall and does not answer), the message
"skipped, still being delivered" appear, one for each message.

Check why the messages takes so long time to be delivered.

--
Loic Le Loarer
Fench

Manuel Hendel

unread,
Oct 19, 2001, 5:40:19 PM10/19/01
to Postfix Users
We already checked this, the queue_run_delay = 1000s (default), all
smtp_* have dafault values as well.

Michael Tokarev

unread,
Oct 19, 2001, 7:34:48 PM10/19/01
to Postfix Users
Manuel Hendel wrote:
>
> Where does this message come from, and why does it come so often.
> More than 5000 times a day. I didn't flushed postfix and just reloaded
> once. For some reason the queue is flushed every minute, I don't know
> why.
> I already use "smtpd_etrn_restrictions = reject".

Hmmm... A cron job? For example, to update any maps
and than to reload (postfix usually doesn't need this)?

There are only 4 ways to explicitly flush the queue:

a) ETRN, disabled in your case
b) `postfix flush' command
c) `sendmail -q' command
d) direct usage of postfix's internal sockets to issue
flush command (cases b) and c) uses that too)

You can just try to rename `postfix' and `sendmail'
commands temporarily to see what will happen (note this
is somewhat dangerous to rename sendmail, *if* your
system sends mails using it).

Regards,
Michael.

Manuel Hendel

unread,
Oct 20, 2001, 1:35:06 AM10/20/01
to postfix-users
On Sat, Oct 20, 2001 at 03:34:48AM +0400, Michael Tokarev wrote:
> Manuel Hendel wrote:
> >
> > Where does this message come from, and why does it come so often.
> > More than 5000 times a day. I didn't flushed postfix and just reloaded
> > once. For some reason the queue is flushed every minute, I don't know
> > why.
> > I already use "smtpd_etrn_restrictions = reject".
>
> Hmmm... A cron job? For example, to update any maps
> and than to reload (postfix usually doesn't need this)?

Don't I have to reload postfix to update any maps?


>
> There are only 4 ways to explicitly flush the queue:
>
> a) ETRN, disabled in your case
> b) `postfix flush' command
> c) `sendmail -q' command
> d) direct usage of postfix's internal sockets to issue
> flush command (cases b) and c) uses that too)
>
> You can just try to rename `postfix' and `sendmail'
> commands temporarily to see what will happen (note this
> is somewhat dangerous to rename sendmail, *if* your
> system sends mails using it).
>

Ralf thought about the same, we already disabled postfix and sendmail
command (chmod 000 postfix|sendmail). That's why I'm writing to the
list now. There are no local users, nobady can execute the commands
postfix or sendmail and ETRN is disabled.

Manuel

--
The activist is not the man who says the river is dirty. The activist is the
man who cleans up the river.
-H. Ross Perot

Ralf Hildebrandt

unread,
Oct 20, 2001, 3:03:43 AM10/20/01
to Postfix Users
On Fri, Oct 19, 2001 at 02:34:41PM -0700, Loic Le Loarer wrote:

> Postfix rerun the queue once a while (see queue_run_delay parameter) and
> if a message takes a long time to be delivered (the smtp connection is
> slow or the host is behind a firewall and does not answer), the message
> "skipped, still being delivered" appear, one for each message.
>
> Check why the messages takes so long time to be delivered.

I'll add some details here:

master.cf is unchanged. No chrooting. No bad entries for qmgr (only 1
instance), no limits on cleanup.

main.cf: very basic. Only minuscule changes.

Fact: from the logs one can see that nqmgr works through the queue every
minute.

I suggested "chmod 000 sendmail postfix" to make sure NOTHING can issue a
"postfix flush" or "sendmail -q". To no avail.

Question: Can nqmgr be made to reprocess the queue by sending it a signal
(like SIGHUP or something)??

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
Echte Informatiker benutzen kein M$-Windows.

Ralf Hildebrandt

unread,
Oct 20, 2001, 3:05:26 AM10/20/01
to Postfix Users
On Sat, Oct 20, 2001 at 03:34:48AM +0400, Michael Tokarev wrote:

> Hmmm... A cron job? For example, to update any maps
> and than to reload (postfix usually doesn't need this)?

No maps were updated. At least there's nothing in the log.

Question to Manuel: How does your syslog.conf look like??
Do you log

mail.debug -/var/log/maillog

or only higher severitieS?

> a) ETRN, disabled in your case

correct

> b) `postfix flush' command
> c) `sendmail -q' command

Thus I recommended chmod 000

> d) direct usage of postfix's internal sockets to issue
> flush command (cases b) and c) uses that too)

e) a signal to (n)qmgr ?

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

Never ever relay emails based solely on a user-provided header.
Down that path leads madness, despair, and the RBL.

Manuel Hendel

unread,
Oct 20, 2001, 7:00:56 AM10/20/01
to Postfix Users
On Sat, Oct 20, 2001 at 09:05:26AM +0200, Ralf Hildebrandt wrote:
> On Sat, Oct 20, 2001 at 03:34:48AM +0400, Michael Tokarev wrote:
>
> > Hmmm... A cron job? For example, to update any maps
> > and than to reload (postfix usually doesn't need this)?
>
> No maps were updated. At least there's nothing in the log.
>
> Question to Manuel: How does your syslog.conf look like??
> Do you log
>
> mail.debug -/var/log/maillog

We don't use syslog for logging, sorry! This there anything else I can
provide?


>
> or only higher severitieS?
>
> > a) ETRN, disabled in your case
>
> correct
>
> > b) `postfix flush' command
> > c) `sendmail -q' command
>
> Thus I recommended chmod 000
>
> > d) direct usage of postfix's internal sockets to issue
> > flush command (cases b) and c) uses that too)
>
> e) a signal to (n)qmgr ?
>
> --
> Ralf Hildebrandt Tel. +49 (0)30-450 570-155
> Fax. +49 (0)30-450 570-916
> Never ever relay emails based solely on a user-provided header.
> Down that path leads madness, despair, and the RBL.
>
>
> -
> To unsubscribe, send mail to majo...@postfix.org with content
> (not subject): unsubscribe postfix-users

--
Conversation is the slowest form of human communication.
(contributed by Chris Johnston)

Ralf Hildebrandt

unread,
Oct 20, 2001, 7:05:04 AM10/20/01
to Postfix Users
On Sat, Oct 20, 2001 at 01:00:56PM +0200, Manuel Hendel wrote:

> > mail.debug -/var/log/maillog
>
> We don't use syslog for logging, sorry! This there anything else I can
> provide?

I meant to say: Do you log EVERYTHING postfix sends to syslog() or just
everything over a certain severity??

What do you use for logging then?

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

PowerPoint & Windows. Office and NT - I might like'em more after my lobotomy

Manuel Hendel

unread,
Oct 20, 2001, 8:41:00 AM10/20/01
to Postfix Users
On Sat, Oct 20, 2001 at 01:05:04PM +0200, Ralf Hildebrandt wrote:
> On Sat, Oct 20, 2001 at 01:00:56PM +0200, Manuel Hendel wrote:
>
> > > mail.debug -/var/log/maillog
> >
> > We don't use syslog for logging, sorry! This there anything else I can
> > provide?
>
> I meant to say: Do you log EVERYTHING postfix sends to syslog() or just
> everything over a certain severity??
>
> What do you use for logging then?
>
Sorry, I don't know actually. It might be syslog, but inside the
openpkg (http://www.openpkg.org) hierarchy, I guess I already
mentioned that we use this for installing and managing the software we
need.

Manuel

--
The whole aim of practical politics is to keep the populace in a continual
state of alarm (and hence clamorous to be led to safety) by menacing them with
an endless series of hobgoblins, all of them imaginary.
-H. L. Mencken

Ralf Hildebrandt

unread,
Oct 20, 2001, 11:36:36 AM10/20/01
to Postfix Users
On Sat, Oct 20, 2001 at 02:41:00PM +0200, Manuel Hendel wrote:

> Sorry, I don't know actually.

:) -- it's important to know if EVERYTHING is logged.

> It might be syslog, but inside the openpkg (http://www.openpkg.org)
> hierarchy, I guess I already mentioned that we use this for installing and
> managing the software we need.

Ralf Engelschall is a crazy guy :)

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

From empirical experience, your Exchange admin needs to put down the
crack pipe and open a window to disperse the fumes.

Manuel Hendel

unread,
Oct 21, 2001, 7:52:00 AM10/21/01
to Postfix Users
On Sat, Oct 20, 2001 at 05:36:36PM +0200, Ralf Hildebrandt wrote:
> On Sat, Oct 20, 2001 at 02:41:00PM +0200, Manuel Hendel wrote:
>
> > Sorry, I don't know actually.
>
> :) -- it's important to know if EVERYTHING is logged.
>
> > It might be syslog, but inside the openpkg (http://www.openpkg.org)
> > hierarchy, I guess I already mentioned that we use this for installing and
> > managing the software we need.
>
> Ralf Engelschall is a crazy guy :)
>
Now I got the right information. I asked Ralf E.

We are using "fakesyslog", a libary which redirects every syslog
calls, to write them in a file (hard coded). We log everything the
application would send to syslog. There's no possibility to change any
loglevel at the moment.

Manuel

--
If you knew what to say next, would you say it?

Ralf Hildebrandt

unread,
Oct 21, 2001, 7:52:12 AM10/21/01
to Postfix Users
On Sun, Oct 21, 2001 at 01:52:00PM +0200, Manuel Hendel wrote:

> We are using "fakesyslog", a libary which redirects every syslog
> calls, to write them in a file (hard coded). We log everything the
> application would send to syslog. There's no possibility to change any
> loglevel at the moment.

Excellent. So I've seen the whole log and nothing but the whole log.

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

Al Gore invented the Internet, Bill Gates deployed it. That's their
respective stories, anyways

Manuel Hendel

unread,
Oct 22, 2001, 6:04:10 AM10/22/01
to Postfix Users
On Sun, Oct 21, 2001 at 01:52:12PM +0200, Ralf Hildebrandt wrote:
> On Sun, Oct 21, 2001 at 01:52:00PM +0200, Manuel Hendel wrote:
>
> > We are using "fakesyslog", a libary which redirects every syslog
> > calls, to write them in a file (hard coded). We log everything the
> > application would send to syslog. There's no possibility to change any
> > loglevel at the moment.
>
> Excellent. So I've seen the whole log and nothing but the whole log.
>
That's it, but this doesn't solve the Problem at all, I even don't
know if this is a problem, I mean the server is running and doing
fine.

Manuel

--
Computers are unreliable, but humans are even more unreliable. Any system
which depends on human reliability is unreliable.
-Gilb

Ralf Hildebrandt

unread,
Oct 22, 2001, 7:03:22 AM10/22/01
to Postfix Users
On Mon, Oct 22, 2001 at 12:04:10PM +0200, Manuel Hendel wrote:

> That's it, but this doesn't solve the Problem at all, I even don't
> know if this is a problem, I mean the server is running and doing
> fine.

RIght now you told me that Postfix is running under the UserID of cw, like
almost all other software. The INSTALL file clearly says that POstfix must
have it's own group in order to insulate it from the rest of the system.

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

When you say 'I wrote a program that crashed Windows', people just
stare at you blankly and say 'Hey, I got those with the system, for
free'

Lutz Jaenicke

unread,
Oct 22, 2001, 7:02:24 AM10/22/01
to Postfix Users
On Fri, Oct 19, 2001 at 11:14:34PM +0200, Manuel Hendel wrote:
> Where does this message come from, and why does it come so often.
> More than 5000 times a day. I didn't flushed postfix and just reloaded
> once. For some reason the queue is flushed every minute, I don't know
> why.
> I already use "smtpd_etrn_restrictions = reject". I don't know what to
> do anymore. I already tried to fix this problem with the help of Ralf
> Hildebrand (Thank you very much!), but we couldn't get it.

Hmm. I step in on this message (I read the entire thread, though).
If there was an earlier thread on that topic, I have missed it, so please
excuse if some of my statements were already discussed.

The message is generated by qmgr (or nqmgr), when it tries to open a queue
file for processing and finds, that the file is locked using the
operating system dependant file system locking mechanism (flock(), fcntl(),
hidden inside Postfix's myflock() call). As the file is locked, [n]qmgr
concludes that the file is actually being processed (delivered).
* This can and will happen, if a queue file is still being delivered,
when [n]qmgr starts another queue run for whatever reason. The options
have already been discussed in this thread.
Please check out the PID of the [n]qmgr process. If [n]qmgr is restarted
due to a table change, actually a new process is spawned and a new PID
should be seen. If this does not happen, you can already rule out this
option.
* If a queue file is under processing (delivery), another postfix process
should be handling it. Thus, if the queue file 3FC9A2219 is being processed
and the message occurs, you see another process (e.g. smtp) handling it.
If you grep in the syslog file for 3FC9A2219, you should find
Oct 18 17:41:13 ws01 postfix/nqmgr[24139]: 3FC9A2219: from=<jaen...@ws01.aet.tu-cottbus.de>, size=2394, nrcpt=1 (queue active)
...
Oct 18 17:41:16 ws01 postfix/smtp[6324]: 3FC9A2219: to=<Ralf.Hil...@charite.de>, relay=serv01.aet.tu-cottbus.de[141.43.132.161], delay=4, status=sent (250Ok: queued as B85C0299E)

The "skipped, ..." message should only occur between these two messages
and would indicate, that another queue run was started in between.
From the log entries you should also see, whether delivery was abnormally
slow (here you see approx. 3 seconds between start and delivery, needed
because the TLS handshake took place, I have omitted the log entries).
With the log entries you can also see the PID of the (smtp) process
called by [n]qmgr (in the example it is 6324), so that you can check out
the other corresponding entries related to this process.
* Long delivery exceeding the queue manager time constant can happen with
slow connections and large mails. You should however be able to identify
these problems from the logfiles and they should not occur that often.
You may also identify them with the "mailq" command, as "active" messages
are indicated by an asterisk "*". Only those message may lead to the
"skipped, ..." log file entry.
* Some information missing in your post is, whether these messages occur
for the same queue file (same queue ID) or for "random" queue entries.
This might lead to some systematic effect (or not).
* It is also possible that your operating system has a buggy lockf()/flock()
implementation that returns false values and thus causes [n]qmgr to
issue false warnings (and unnecessarily delay delivery). Please post your
operating system (and file system) information.

Best regards,
Lutz
--
Lutz Jaenicke Lutz.J...@aet.TU-Cottbus.DE
BTU Cottbus http://www.aet.TU-Cottbus.DE/personen/jaenicke/
Lehrstuhl Allgemeine Elektrotechnik Tel. +49 355 69-4129
Universitaetsplatz 3-4, D-03044 Cottbus Fax. +49 355 69-4153

Manuel Hendel

unread,
Oct 22, 2001, 7:12:09 AM10/22/01
to Postfix Users
On Mon, Oct 22, 2001 at 01:03:22PM +0200, Ralf Hildebrandt wrote:
> On Mon, Oct 22, 2001 at 12:04:10PM +0200, Manuel Hendel wrote:
>
> > That's it, but this doesn't solve the Problem at all, I even don't
> > know if this is a problem, I mean the server is running and doing
> > fine.
>
> RIght now you told me that Postfix is running under the UserID of cw, like
> almost all other software. The INSTALL file clearly says that POstfix must
> have it's own group in order to insulate it from the rest of the system.
>
But that got nothing to do with the "skipped, still being delivered"
problem. That's just a security reason, to give postfix its own
user, etc. That doesn't impact the functionality of postfix!?

manuel
--
Men always want to be a woman's first love - women like to be a man's last
romance.
-Oscar Wilde
(contributed by Chris Johnston)

Manuel Hendel

unread,
Oct 22, 2001, 7:20:03 AM10/22/01
to Postfix Users
This server is running debian 2.2 and ext2. We got about 5000 entries
of these messages a day.

manuel
--
Do not believe in anything simply because you have heard it. Do not believe in
anything simply because it is spoken and rumored by many. Do not believe in
anything simply because it is found written in your religious books. Do not
believe in anything merely on the authority of your teachers and elders. Do
not believe in traditions because they have been handed down for many
generations. But after observation and analysis, when you find that anything
agrees with reason and is conducive to the good and benefit of one and all,
then accept it and live up to it.
-Buddha

Ralf Hildebrandt

unread,
Oct 22, 2001, 7:20:27 AM10/22/01
to Postfix Users
On Mon, Oct 22, 2001 at 01:12:09PM +0200, Manuel Hendel wrote:

> But that got nothing to do with the "skipped, still being delivered"
> problem. That's just a security reason, to give postfix its own
> user, etc. That doesn't impact the functionality of postfix!?

Not necessarily -- the files in /var/spool/postfix are ownerd by Postfix --
and Postfix expects exclusive access. So, like Lutz writes, some other
program, running as cw, can access the queue directory and hold locks,
you'll see the problem you see.

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

If you bought $1000 worth of Nortel stock one year ago, it would now
be worth $72.
If you bought $1000 worth of Budweiser (the beer, not the stock) one
year ago, drank all the beer, and traded in the cans for the nickel
deposit, you would have $79.
My advice is to start drinking heavily.

Ralf Hildebrandt

unread,
Oct 22, 2001, 9:47:29 AM10/22/01
to Postfix Users
On Mon, Oct 22, 2001 at 03:31:56PM +0200, Lutz Jaenicke wrote:

> Ralf Hildebrandt just sent me a copy of a 60MB logfile.

Lutz can be trusted, BTW. Normally I don't send (other people's) logfiles around.

> For my opinion, your system is completely messed up. I don't know, whether
> this has to do with some additional packages installed, but the system
> _as_is_ is beyond all reasonable behaviour and the error message you
> are reporting here "skipped, ..." is the least significant problem you
> have and only a consequence of other failures.
>
> Let's start some short analysis. The following error message causes nqmgr
> to die. It indicates, that nqmgr started (or at least tried to start) an
> smtp to send out message A944A20044.

> Oct 17 14:14:06 <crit> postfix/nqmgr[26763]: fatal: A944A20044: timeout receiving delivery status from transport: smtp

> So I had a look back into the past, searching for A944A20044:

> Oct 17 09:14:06 <crit> postfix/smtpd[8224]: A944A20044: client=aaaaaaaa.aaaaaaa.aaa[111.11.111.111]
> Oct 17 09:14:06 <crit> postfix/smtpd[8224]: disconnect from aaaaaaaa.aaaaaaa.aaa[111.11.111.111]
> Oct 17 09:14:06 <crit> postfix/nqmgr[26763]: A944A20044: from=<bbbbb...@aaaaaaa.aaa>, size=2362, nrcpt=1 (queue active)

> As you can see, there was a delivery attempt started 6 hours earlier that
> did not finish and nqmgr did not receive the smtp's answer. The size of the
> message (2632) does not indicate any reason why the delivery should have
> taken that long.
> I rather see other messages in between, that I however cannot related to the
> examined queue ID, because the queue ID is only logged on successfull delivery
> (or probably in verbose mode):
> Oct 17 14:14:05 <crit> postfix/smtp[7422]: fatal: watchdog timeout
>
> There are things going on that are even more strange.
> The "skipped," message is listed by
> qmgr_message_alloc()
> called from
> qmgr_active_feed()
> called from
> qmgr_loop()
> As qmgr_active_feed() moves the message from the incoming or deferred
> queue to the active queue, qmgr_loop() would not find it on the next
> loop run with qmgr_scan_next(qmgr_deferred). As the message however
> pops up, this means that either the file was not moved to the correct
> queue, or that it was moved back because it could not be delivered
> (in which case we should see an error message indicating the reason,
> however).

What FS is being used for the Queue?

> The first thing to resolve is
> Oct 18 01:29:30 <crit> postfix/smtp[25432]: fatal: watchdog timeout
> Oct 18 02:22:20 <crit> postfix/smtp[29195]: fatal: watchdog timeout
> Oct 18 04:35:35 <crit> postfix/smtp[4431]: fatal: watchdog timeout
> Oct 18 04:51:32 <crit> postfix/smtp[6767]: fatal: watchdog timeout
> Once this is done, the "skipped, ..." problem can be tackled again.
> But very probably it will magically be gone at that time anyway.

But the number of "fatal:.* timeouts" is << against the number of "skipped"
entries.

> The "watchdog timeout" message should never appear in the first place.
> It indicates, that the process is hanging at a location, which is not
> protected by a normal timeout. During the SMTP protocol, the exchange
> of messages is protected by normal timeout values. Therefore the
> processes should hang at another location, e.g. when looking up
> values from a database or file. Look out for hanging smtp processes
> on your system and find out what they are currently trying to do.

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

Do daemons dream of electric sleep()?

Lutz Jaenicke

unread,
Oct 22, 2001, 9:31:56 AM10/22/01
to Postfix Users
On Mon, Oct 22, 2001 at 01:20:03PM +0200, Manuel Hendel wrote:
> > * It is also possible that your operating system has a buggy lockf()/flock()
> > implementation that returns false values and thus causes [n]qmgr to
> > issue false warnings (and unnecessarily delay delivery). Please post your
> > operating system (and file system) information.
> >
> This server is running debian 2.2 and ext2. We got about 5000 entries
> of these messages a day.

Ralf Hildebrandt just sent me a copy of a 60MB logfile.

The first thing to resolve is
Oct 18 01:29:30 <crit> postfix/smtp[25432]: fatal: watchdog timeout
Oct 18 02:22:20 <crit> postfix/smtp[29195]: fatal: watchdog timeout
Oct 18 04:35:35 <crit> postfix/smtp[4431]: fatal: watchdog timeout
Oct 18 04:51:32 <crit> postfix/smtp[6767]: fatal: watchdog timeout
Once this is done, the "skipped, ..." problem can be tackled again.
But very probably it will magically be gone at that time anyway.

The "watchdog timeout" message should never appear in the first place.


It indicates, that the process is hanging at a location, which is not
protected by a normal timeout. During the SMTP protocol, the exchange
of messages is protected by normal timeout values. Therefore the
processes should hang at another location, e.g. when looking up
values from a database or file. Look out for hanging smtp processes
on your system and find out what they are currently trying to do.

If you are lucky, a lsof call will show you the open files and may
give you an indication. A system call tracer may not help, as it would
not show within which call the process is actually hanging.
You may therefore consider attaching a debugger to the process.

Best regards,
Lutz
--
Lutz Jaenicke Lutz.J...@aet.TU-Cottbus.DE
BTU Cottbus http://www.aet.TU-Cottbus.DE/personen/jaenicke/
Lehrstuhl Allgemeine Elektrotechnik Tel. +49 355 69-4129
Universitaetsplatz 3-4, D-03044 Cottbus Fax. +49 355 69-4153

Lutz Jaenicke

unread,
Oct 22, 2001, 10:06:59 AM10/22/01
to Postfix Users
On Mon, Oct 22, 2001 at 03:47:29PM +0200, Ralf Hildebrandt wrote:
> On Mon, Oct 22, 2001 at 03:31:56PM +0200, Lutz Jaenicke wrote:
> What FS is being used for the Queue?
The orginal author pointed out "ext2".

> > The first thing to resolve is
> > Oct 18 01:29:30 <crit> postfix/smtp[25432]: fatal: watchdog timeout
> > Oct 18 02:22:20 <crit> postfix/smtp[29195]: fatal: watchdog timeout
> > Oct 18 04:35:35 <crit> postfix/smtp[4431]: fatal: watchdog timeout
> > Oct 18 04:51:32 <crit> postfix/smtp[6767]: fatal: watchdog timeout
> > Once this is done, the "skipped, ..." problem can be tackled again.
> > But very probably it will magically be gone at that time anyway.
>
> But the number of "fatal:.* timeouts" is << against the number of "skipped"
> entries.

Yes, but with
mail_params.h:#define VAR_DAEMON_TIMEOUT "daemon_timeout"
mail_params.h:#define DEF_DAEMON_TIMEOUT "18000s"
one smtp may hang up to 18000s / 3600s/h = 5h.
[n]qmgr will try to run the queue quite often in this time. And the
"skipped, ..." message tends to appear quite often for the same messages.

Best regards,
Lutz
--
Lutz Jaenicke Lutz.J...@aet.TU-Cottbus.DE
BTU Cottbus http://www.aet.TU-Cottbus.DE/personen/jaenicke/
Lehrstuhl Allgemeine Elektrotechnik Tel. +49 355 69-4129
Universitaetsplatz 3-4, D-03044 Cottbus Fax. +49 355 69-4153

Ralf Hildebrandt

unread,
Oct 22, 2001, 10:10:15 AM10/22/01
to Postfix Users
On Mon, Oct 22, 2001 at 04:06:59PM +0200, Lutz Jaenicke wrote:

> > What FS is being used for the Queue?
> The orginal author pointed out "ext2".

My brain FIFO is just so small :)

> Yes, but with
> mail_params.h:#define VAR_DAEMON_TIMEOUT "daemon_timeout"
> mail_params.h:#define DEF_DAEMON_TIMEOUT "18000s"
> one smtp may hang up to 18000s / 3600s/h = 5h.

Whoa!

> [n]qmgr will try to run the queue quite often in this time. And the
> "skipped, ..." message tends to appear quite often for the same messages.

In fact, every minute :(
I did some awk/sort/uniq magic to find out how often which message was retried.

So I'd say: Reinstall Postfix as user Postfix, group Postfix, setgid()
Mailqueue just to be sure.

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

echo Mhbqnrnes Stbjr | tr "[a-y]" "[b-z]"

Ralf Hildebrandt

unread,
Oct 23, 2001, 3:36:00 AM10/23/01
to Postfix Users
On Tue, Oct 23, 2001 at 09:34:17AM +0200, Manuel Hendel wrote:

> I guess, that's not that easy with OpenPKG? I don't know actually, I
> try to find it out, and let you know.

Idea: Punish the packager of Postfix on OpenPKG. Hard. Severely.
He built a package ignoring the EXPLICIT installation instructions.
I'd say this falls within the "Gewaehrleitungspflicht".

> By the way, our amount of mail has increased to 122.000 mails, this is
> about 6.7 GB of mails.

Not bad.

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

Like medieval peasants, computer manufacturers and millions of users
are locked in a seemingly eternal lease with their evil landlord, who
comes around every two years to collect billions of dollars of taxes
in return for mediocre services. --Mark Harris, Electronics Times

Manuel Hendel

unread,
Oct 23, 2001, 3:34:17 AM10/23/01
to Postfix Users
On Tue, Oct 23, 2001 at 09:29:33AM +0200, Ralf Hildebrandt wrote:
> On Tue, Oct 23, 2001 at 09:27:49AM +0200, Manuel Hendel wrote:
>
> > As you know, this is a productiv system, I can't just reinstall. This
> > would lead to a downtime of atleast 2 hours. You see, I've to find
> > another way.
>
> This is not true. You can simply "make && make install", and then "postfix reload"

>
I guess, that's not that easy with OpenPKG? I don't know actually, I
try to find it out, and let you know.
By the way, our amount of mail has increased to 122.000 mails, this is
about 6.7 GB of mails.

Manuel

--
The problem with the rat race is that even if you win, you're still a rat.
-Jane Wagner

Ralf Hildebrandt

unread,
Oct 23, 2001, 3:29:33 AM10/23/01
to Postfix Users
On Tue, Oct 23, 2001 at 09:27:49AM +0200, Manuel Hendel wrote:

> As you know, this is a productiv system, I can't just reinstall. This
> would lead to a downtime of atleast 2 hours. You see, I've to find
> another way.

This is not true. You can simply "make && make install", and then "postfix reload"

--

Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

Do not meddle in the affairs of sysadmins,
they are quick to anger and have no need for subtlety.

Manuel Hendel

unread,
Oct 23, 2001, 3:27:49 AM10/23/01
to Postfix Users
On Mon, Oct 22, 2001 at 04:10:15PM +0200, Ralf Hildebrandt wrote:
> On Mon, Oct 22, 2001 at 04:06:59PM +0200, Lutz Jaenicke wrote:
>
> > > What FS is being used for the Queue?
> > The orginal author pointed out "ext2".
>
> My brain FIFO is just so small :)
>
> > Yes, but with
> > mail_params.h:#define VAR_DAEMON_TIMEOUT "daemon_timeout"
> > mail_params.h:#define DEF_DAEMON_TIMEOUT "18000s"
> > one smtp may hang up to 18000s / 3600s/h = 5h.
>
> Whoa!
>
> > [n]qmgr will try to run the queue quite often in this time. And the
> > "skipped, ..." message tends to appear quite often for the same messages.
>
> In fact, every minute :(
> I did some awk/sort/uniq magic to find out how often which message was retried.
>
> So I'd say: Reinstall Postfix as user Postfix, group Postfix, setgid()
> Mailqueue just to be sure.
>
As you know, this is a productiv system, I can't just reinstall. This
would lead to a downtime of atleast 2 hours. You see, I've to find
another way.

Manuel

--
If you want a thing done well, do it yourself.
-Napoleon Bonaparte

Manuel Hendel

unread,
Oct 23, 2001, 11:14:45 AM10/23/01
to Postfix Users
On Tue, Oct 23, 2001 at 09:36:00AM +0200, Ralf Hildebrandt wrote:
> On Tue, Oct 23, 2001 at 09:34:17AM +0200, Manuel Hendel wrote:
>
> > I guess, that's not that easy with OpenPKG? I don't know actually, I
> > try to find it out, and let you know.
>
> Idea: Punish the packager of Postfix on OpenPKG. Hard. Severely.
> He built a package ignoring the EXPLICIT installation instructions.
> I'd say this falls within the "Gewaehrleitungspflicht".

I talked to the packager, and he promised me that this couldn't have
anything to do with the built of the package.
Another thing is, that we haven't got these problems on other systems
with less mail traffic. It seems to be a problem of the amount of
mails, or of the nqmgr itself. We also didn't have these problems with
the "old" qmgr. What can I do now. There's no way in just installing a
new system.
Yesterday we had 19 Watchdog Timeouts. Where does a this error come
from?


>
> > By the way, our amount of mail has increased to 122.000 mails, this is
> > about 6.7 GB of mails.
>
> Not bad.
>

Manuel

Lutz Jaenicke

unread,
Oct 23, 2001, 11:52:43 AM10/23/01
to Postfix Users
On Tue, Oct 23, 2001 at 05:14:45PM +0200, Manuel Hendel wrote:
> On Tue, Oct 23, 2001 at 09:36:00AM +0200, Ralf Hildebrandt wrote:
> > On Tue, Oct 23, 2001 at 09:34:17AM +0200, Manuel Hendel wrote:
> >
> > > I guess, that's not that easy with OpenPKG? I don't know actually, I
> > > try to find it out, and let you know.
> >
> > Idea: Punish the packager of Postfix on OpenPKG. Hard. Severely.
> > He built a package ignoring the EXPLICIT installation instructions.
> > I'd say this falls within the "Gewaehrleitungspflicht".
>
> I talked to the packager, and he promised me that this couldn't have
> anything to do with the built of the package.
> Another thing is, that we haven't got these problems on other systems
> with less mail traffic. It seems to be a problem of the amount of
> mails, or of the nqmgr itself. We also didn't have these problems with
> the "old" qmgr. What can I do now. There's no way in just installing a
> new system.
> Yesterday we had 19 Watchdog Timeouts. Where does a this error come
> from?

The watchdog timeout is the last measure of postfix to shut down hanging
processes.
In your case I have seen "smtp" experiencing watchdog timeouts. This means,
that smtp was sitting around for 5 hours. This indicates that there is
a problem around: inside the SMTP protocol handling, all transactions
have their own (much shorter) timeout. So it is not that smtp is sitting
and waiting for the peer to react (network problems etc). Rather smtp
is hanging at a point at which no delay is expected and after 5 hours
finally the watchdog timeout occurs. Actually, you should never see
one. I have only seen them on once: that was on a system with a badly
synchronized clock on which I had to change the time in one step...

It is hard to say, how much effort it would take to find out why the processes
are hanging. If you have compiled postfix with debugging support, you
could easily attach a debugger to the hanging process and from the
traceback one should get a good idea on what is going on.
One of the nice properties of postfix is, that it is modular. You can thus
recompile postfix with "-g" and then replace the smtp executable in the
LIBEXEC subdirectory with the new one while postfix is still running.
All newly started smtp processes will then have debugging support and once
you find a hanging one (with 5 hours timeout you should just look out
for a process sitting around for some hours), you attach the debugger.

Best regards,
Lutz
--
Lutz Jaenicke Lutz.J...@aet.TU-Cottbus.DE
BTU Cottbus http://www.aet.TU-Cottbus.DE/personen/jaenicke/
Lehrstuhl Allgemeine Elektrotechnik Tel. +49 355 69-4129
Universitaetsplatz 3-4, D-03044 Cottbus Fax. +49 355 69-4153

Meng Weng Wong

unread,
Oct 23, 2001, 12:48:35 PM10/23/01
to Postfix Users
On Tue, Oct 23, 2001 at 09:27:49AM +0200, Manuel Hendel wrote:
| As you know, this is a productiv system, I can't just reinstall. This
| would lead to a downtime of atleast 2 hours. You see, I've to find
| another way.

btw, have you done a "postfix stop; postfix start" since the
problem began, or have you been doing only "postfix reload"?

Manuel Hendel

unread,
Oct 23, 2001, 1:11:52 PM10/23/01
to Meng Weng Wong, Postfix Users
On Tue, Oct 23, 2001 at 12:48:35PM -0400, Meng Weng Wong wrote:
> On Tue, Oct 23, 2001 at 09:27:49AM +0200, Manuel Hendel wrote:
> | As you know, this is a productiv system, I can't just reinstall. This
> | would lead to a downtime of atleast 2 hours. You see, I've to find
> | another way.
>
> btw, have you done a "postfix stop; postfix start" since the
> problem began, or have you been doing only "postfix reload"?
>
Not really! I will do this and let you know if this helps.

Manuel

--
Space: the final frontier. These are the voyages of the starship Enterprise.
Its five-year mission: to explore strange new worlds; to seek out new life and
new civilizations; to boldly go where no man has gone before.
-Captain James T. Kirk

Ralf Hildebrandt

unread,
Oct 23, 2001, 1:30:56 PM10/23/01
to Postfix Users
On Tue, Oct 23, 2001 at 05:14:45PM +0200, Manuel Hendel wrote:

> Another thing is, that we haven't got these problems on other systems
> with less mail traffic.

That's interesting.

> It seems to be a problem of the amount of
> mails, or of the nqmgr itself. We also didn't have these problems with
> the "old" qmgr. What can I do now. There's no way in just installing a
> new system.

In that case, switch back to qmgr by editing master.cf
On the other hand I bet Wietse *might* like to have a look around that machine
to see what the problem with nqmgr might be.

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

Our continuing mission: To seek out knowledge of C, to explore strange
UNIX commands, and to boldly code where no one has man page 4.

Ralf Hildebrandt

unread,
Oct 23, 2001, 1:32:51 PM10/23/01
to Postfix Users
On Tue, Oct 23, 2001 at 05:52:43PM +0200, Lutz Jaenicke wrote:

> The watchdog timeout is the last measure of postfix to shut down hanging
> processes.

Idea: Build Postfix WITHOUT optimziation. This often helps. What compiler
was used? Was the package built on the box it runs on?

--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916

Your mantra for today is: Don't let data from the network near a
shell. Bad things happen. -- Randall Schwartz

Wietse Venema

unread,
Oct 23, 2001, 7:05:58 PM10/23/01
to Ralf Hildebrandt, Postfix Users
Ralf Hildebrandt:

> On Tue, Oct 23, 2001 at 05:14:45PM +0200, Manuel Hendel wrote:
>
> > Another thing is, that we haven't got these problems on other systems
> > with less mail traffic.
>
> That's interesting.
>
> > It seems to be a problem of the amount of
> > mails, or of the nqmgr itself. We also didn't have these problems with
> > the "old" qmgr. What can I do now. There's no way in just installing a
> > new system.
>
> In that case, switch back to qmgr by editing master.cf
> On the other hand I bet Wietse *might* like to have a look around that machine
> to see what the problem with nqmgr might be.

I don't think so.

If there is a problem with Postfix on a specific platform then I
can try to build a machine (in VMware :-) and test Postfix there.
I will not visit someone elses machine. IBM's pockets are too deep.

I deliver Postfix as working on a half-dozen systems including
Solaris, FreeBSD, BSD/OS, a slew of Linuxes, and AIX :-). Postfix
works on all those systems. And that is where my effort ends.

Watchdog timeouts in SMTP servers etc. are the result of a broken
build, the result of a broken library, or the result of a broken
kernel. I have seen problems of each type.

Wietse

0 new messages