Thanks for any hlep,
Manuel
-
To unsubscribe, send mail to majo...@postfix.org with content
(not subject): unsubscribe postfix-users
Postfix rerun the queue once a while (see queue_run_delay parameter) and
if a message takes a long time to be delivered (the smtp connection is
slow or the host is behind a firewall and does not answer), the message
"skipped, still being delivered" appear, one for each message.
Check why the messages takes so long time to be delivered.
--
Loic Le Loarer
Fench
Hmmm... A cron job? For example, to update any maps
and than to reload (postfix usually doesn't need this)?
There are only 4 ways to explicitly flush the queue:
a) ETRN, disabled in your case
b) `postfix flush' command
c) `sendmail -q' command
d) direct usage of postfix's internal sockets to issue
flush command (cases b) and c) uses that too)
You can just try to rename `postfix' and `sendmail'
commands temporarily to see what will happen (note this
is somewhat dangerous to rename sendmail, *if* your
system sends mails using it).
Regards,
Michael.
Don't I have to reload postfix to update any maps?
>
> There are only 4 ways to explicitly flush the queue:
>
> a) ETRN, disabled in your case
> b) `postfix flush' command
> c) `sendmail -q' command
> d) direct usage of postfix's internal sockets to issue
> flush command (cases b) and c) uses that too)
>
> You can just try to rename `postfix' and `sendmail'
> commands temporarily to see what will happen (note this
> is somewhat dangerous to rename sendmail, *if* your
> system sends mails using it).
>
Ralf thought about the same, we already disabled postfix and sendmail
command (chmod 000 postfix|sendmail). That's why I'm writing to the
list now. There are no local users, nobady can execute the commands
postfix or sendmail and ETRN is disabled.
Manuel
--
The activist is not the man who says the river is dirty. The activist is the
man who cleans up the river.
-H. Ross Perot
> Postfix rerun the queue once a while (see queue_run_delay parameter) and
> if a message takes a long time to be delivered (the smtp connection is
> slow or the host is behind a firewall and does not answer), the message
> "skipped, still being delivered" appear, one for each message.
>
> Check why the messages takes so long time to be delivered.
I'll add some details here:
master.cf is unchanged. No chrooting. No bad entries for qmgr (only 1
instance), no limits on cleanup.
main.cf: very basic. Only minuscule changes.
Fact: from the logs one can see that nqmgr works through the queue every
minute.
I suggested "chmod 000 sendmail postfix" to make sure NOTHING can issue a
"postfix flush" or "sendmail -q". To no avail.
Question: Can nqmgr be made to reprocess the queue by sending it a signal
(like SIGHUP or something)??
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
Echte Informatiker benutzen kein M$-Windows.
> Hmmm... A cron job? For example, to update any maps
> and than to reload (postfix usually doesn't need this)?
No maps were updated. At least there's nothing in the log.
Question to Manuel: How does your syslog.conf look like??
Do you log
mail.debug -/var/log/maillog
or only higher severitieS?
> a) ETRN, disabled in your case
correct
> b) `postfix flush' command
> c) `sendmail -q' command
Thus I recommended chmod 000
> d) direct usage of postfix's internal sockets to issue
> flush command (cases b) and c) uses that too)
e) a signal to (n)qmgr ?
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
Never ever relay emails based solely on a user-provided header.
Down that path leads madness, despair, and the RBL.
We don't use syslog for logging, sorry! This there anything else I can
provide?
>
> or only higher severitieS?
>
> > a) ETRN, disabled in your case
>
> correct
>
> > b) `postfix flush' command
> > c) `sendmail -q' command
>
> Thus I recommended chmod 000
>
> > d) direct usage of postfix's internal sockets to issue
> > flush command (cases b) and c) uses that too)
>
> e) a signal to (n)qmgr ?
>
> --
> Ralf Hildebrandt Tel. +49 (0)30-450 570-155
> Fax. +49 (0)30-450 570-916
> Never ever relay emails based solely on a user-provided header.
> Down that path leads madness, despair, and the RBL.
>
>
> -
> To unsubscribe, send mail to majo...@postfix.org with content
> (not subject): unsubscribe postfix-users
--
Conversation is the slowest form of human communication.
(contributed by Chris Johnston)
> > mail.debug -/var/log/maillog
>
> We don't use syslog for logging, sorry! This there anything else I can
> provide?
I meant to say: Do you log EVERYTHING postfix sends to syslog() or just
everything over a certain severity??
What do you use for logging then?
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
PowerPoint & Windows. Office and NT - I might like'em more after my lobotomy
Manuel
--
The whole aim of practical politics is to keep the populace in a continual
state of alarm (and hence clamorous to be led to safety) by menacing them with
an endless series of hobgoblins, all of them imaginary.
-H. L. Mencken
> Sorry, I don't know actually.
:) -- it's important to know if EVERYTHING is logged.
> It might be syslog, but inside the openpkg (http://www.openpkg.org)
> hierarchy, I guess I already mentioned that we use this for installing and
> managing the software we need.
Ralf Engelschall is a crazy guy :)
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
From empirical experience, your Exchange admin needs to put down the
crack pipe and open a window to disperse the fumes.
We are using "fakesyslog", a libary which redirects every syslog
calls, to write them in a file (hard coded). We log everything the
application would send to syslog. There's no possibility to change any
loglevel at the moment.
Manuel
--
If you knew what to say next, would you say it?
> We are using "fakesyslog", a libary which redirects every syslog
> calls, to write them in a file (hard coded). We log everything the
> application would send to syslog. There's no possibility to change any
> loglevel at the moment.
Excellent. So I've seen the whole log and nothing but the whole log.
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
Al Gore invented the Internet, Bill Gates deployed it. That's their
respective stories, anyways
Manuel
--
Computers are unreliable, but humans are even more unreliable. Any system
which depends on human reliability is unreliable.
-Gilb
> That's it, but this doesn't solve the Problem at all, I even don't
> know if this is a problem, I mean the server is running and doing
> fine.
RIght now you told me that Postfix is running under the UserID of cw, like
almost all other software. The INSTALL file clearly says that POstfix must
have it's own group in order to insulate it from the rest of the system.
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
When you say 'I wrote a program that crashed Windows', people just
stare at you blankly and say 'Hey, I got those with the system, for
free'
Hmm. I step in on this message (I read the entire thread, though).
If there was an earlier thread on that topic, I have missed it, so please
excuse if some of my statements were already discussed.
The message is generated by qmgr (or nqmgr), when it tries to open a queue
file for processing and finds, that the file is locked using the
operating system dependant file system locking mechanism (flock(), fcntl(),
hidden inside Postfix's myflock() call). As the file is locked, [n]qmgr
concludes that the file is actually being processed (delivered).
* This can and will happen, if a queue file is still being delivered,
when [n]qmgr starts another queue run for whatever reason. The options
have already been discussed in this thread.
Please check out the PID of the [n]qmgr process. If [n]qmgr is restarted
due to a table change, actually a new process is spawned and a new PID
should be seen. If this does not happen, you can already rule out this
option.
* If a queue file is under processing (delivery), another postfix process
should be handling it. Thus, if the queue file 3FC9A2219 is being processed
and the message occurs, you see another process (e.g. smtp) handling it.
If you grep in the syslog file for 3FC9A2219, you should find
Oct 18 17:41:13 ws01 postfix/nqmgr[24139]: 3FC9A2219: from=<jaen...@ws01.aet.tu-cottbus.de>, size=2394, nrcpt=1 (queue active)
...
Oct 18 17:41:16 ws01 postfix/smtp[6324]: 3FC9A2219: to=<Ralf.Hil...@charite.de>, relay=serv01.aet.tu-cottbus.de[141.43.132.161], delay=4, status=sent (250Ok: queued as B85C0299E)
The "skipped, ..." message should only occur between these two messages
and would indicate, that another queue run was started in between.
From the log entries you should also see, whether delivery was abnormally
slow (here you see approx. 3 seconds between start and delivery, needed
because the TLS handshake took place, I have omitted the log entries).
With the log entries you can also see the PID of the (smtp) process
called by [n]qmgr (in the example it is 6324), so that you can check out
the other corresponding entries related to this process.
* Long delivery exceeding the queue manager time constant can happen with
slow connections and large mails. You should however be able to identify
these problems from the logfiles and they should not occur that often.
You may also identify them with the "mailq" command, as "active" messages
are indicated by an asterisk "*". Only those message may lead to the
"skipped, ..." log file entry.
* Some information missing in your post is, whether these messages occur
for the same queue file (same queue ID) or for "random" queue entries.
This might lead to some systematic effect (or not).
* It is also possible that your operating system has a buggy lockf()/flock()
implementation that returns false values and thus causes [n]qmgr to
issue false warnings (and unnecessarily delay delivery). Please post your
operating system (and file system) information.
Best regards,
Lutz
--
Lutz Jaenicke Lutz.J...@aet.TU-Cottbus.DE
BTU Cottbus http://www.aet.TU-Cottbus.DE/personen/jaenicke/
Lehrstuhl Allgemeine Elektrotechnik Tel. +49 355 69-4129
Universitaetsplatz 3-4, D-03044 Cottbus Fax. +49 355 69-4153
manuel
--
Men always want to be a woman's first love - women like to be a man's last
romance.
-Oscar Wilde
(contributed by Chris Johnston)
manuel
--
Do not believe in anything simply because you have heard it. Do not believe in
anything simply because it is spoken and rumored by many. Do not believe in
anything simply because it is found written in your religious books. Do not
believe in anything merely on the authority of your teachers and elders. Do
not believe in traditions because they have been handed down for many
generations. But after observation and analysis, when you find that anything
agrees with reason and is conducive to the good and benefit of one and all,
then accept it and live up to it.
-Buddha
> But that got nothing to do with the "skipped, still being delivered"
> problem. That's just a security reason, to give postfix its own
> user, etc. That doesn't impact the functionality of postfix!?
Not necessarily -- the files in /var/spool/postfix are ownerd by Postfix --
and Postfix expects exclusive access. So, like Lutz writes, some other
program, running as cw, can access the queue directory and hold locks,
you'll see the problem you see.
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
If you bought $1000 worth of Nortel stock one year ago, it would now
be worth $72.
If you bought $1000 worth of Budweiser (the beer, not the stock) one
year ago, drank all the beer, and traded in the cans for the nickel
deposit, you would have $79.
My advice is to start drinking heavily.
> Ralf Hildebrandt just sent me a copy of a 60MB logfile.
Lutz can be trusted, BTW. Normally I don't send (other people's) logfiles around.
> For my opinion, your system is completely messed up. I don't know, whether
> this has to do with some additional packages installed, but the system
> _as_is_ is beyond all reasonable behaviour and the error message you
> are reporting here "skipped, ..." is the least significant problem you
> have and only a consequence of other failures.
>
> Let's start some short analysis. The following error message causes nqmgr
> to die. It indicates, that nqmgr started (or at least tried to start) an
> smtp to send out message A944A20044.
> Oct 17 14:14:06 <crit> postfix/nqmgr[26763]: fatal: A944A20044: timeout receiving delivery status from transport: smtp
> So I had a look back into the past, searching for A944A20044:
> Oct 17 09:14:06 <crit> postfix/smtpd[8224]: A944A20044: client=aaaaaaaa.aaaaaaa.aaa[111.11.111.111]
> Oct 17 09:14:06 <crit> postfix/smtpd[8224]: disconnect from aaaaaaaa.aaaaaaa.aaa[111.11.111.111]
> Oct 17 09:14:06 <crit> postfix/nqmgr[26763]: A944A20044: from=<bbbbb...@aaaaaaa.aaa>, size=2362, nrcpt=1 (queue active)
> As you can see, there was a delivery attempt started 6 hours earlier that
> did not finish and nqmgr did not receive the smtp's answer. The size of the
> message (2632) does not indicate any reason why the delivery should have
> taken that long.
> I rather see other messages in between, that I however cannot related to the
> examined queue ID, because the queue ID is only logged on successfull delivery
> (or probably in verbose mode):
> Oct 17 14:14:05 <crit> postfix/smtp[7422]: fatal: watchdog timeout
>
> There are things going on that are even more strange.
> The "skipped," message is listed by
> qmgr_message_alloc()
> called from
> qmgr_active_feed()
> called from
> qmgr_loop()
> As qmgr_active_feed() moves the message from the incoming or deferred
> queue to the active queue, qmgr_loop() would not find it on the next
> loop run with qmgr_scan_next(qmgr_deferred). As the message however
> pops up, this means that either the file was not moved to the correct
> queue, or that it was moved back because it could not be delivered
> (in which case we should see an error message indicating the reason,
> however).
What FS is being used for the Queue?
> The first thing to resolve is
> Oct 18 01:29:30 <crit> postfix/smtp[25432]: fatal: watchdog timeout
> Oct 18 02:22:20 <crit> postfix/smtp[29195]: fatal: watchdog timeout
> Oct 18 04:35:35 <crit> postfix/smtp[4431]: fatal: watchdog timeout
> Oct 18 04:51:32 <crit> postfix/smtp[6767]: fatal: watchdog timeout
> Once this is done, the "skipped, ..." problem can be tackled again.
> But very probably it will magically be gone at that time anyway.
But the number of "fatal:.* timeouts" is << against the number of "skipped"
entries.
> The "watchdog timeout" message should never appear in the first place.
> It indicates, that the process is hanging at a location, which is not
> protected by a normal timeout. During the SMTP protocol, the exchange
> of messages is protected by normal timeout values. Therefore the
> processes should hang at another location, e.g. when looking up
> values from a database or file. Look out for hanging smtp processes
> on your system and find out what they are currently trying to do.
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
Do daemons dream of electric sleep()?
Ralf Hildebrandt just sent me a copy of a 60MB logfile.
The first thing to resolve is
Oct 18 01:29:30 <crit> postfix/smtp[25432]: fatal: watchdog timeout
Oct 18 02:22:20 <crit> postfix/smtp[29195]: fatal: watchdog timeout
Oct 18 04:35:35 <crit> postfix/smtp[4431]: fatal: watchdog timeout
Oct 18 04:51:32 <crit> postfix/smtp[6767]: fatal: watchdog timeout
Once this is done, the "skipped, ..." problem can be tackled again.
But very probably it will magically be gone at that time anyway.
The "watchdog timeout" message should never appear in the first place.
It indicates, that the process is hanging at a location, which is not
protected by a normal timeout. During the SMTP protocol, the exchange
of messages is protected by normal timeout values. Therefore the
processes should hang at another location, e.g. when looking up
values from a database or file. Look out for hanging smtp processes
on your system and find out what they are currently trying to do.
If you are lucky, a lsof call will show you the open files and may
give you an indication. A system call tracer may not help, as it would
not show within which call the process is actually hanging.
You may therefore consider attaching a debugger to the process.
Best regards,
Lutz
--
Lutz Jaenicke Lutz.J...@aet.TU-Cottbus.DE
BTU Cottbus http://www.aet.TU-Cottbus.DE/personen/jaenicke/
Lehrstuhl Allgemeine Elektrotechnik Tel. +49 355 69-4129
Universitaetsplatz 3-4, D-03044 Cottbus Fax. +49 355 69-4153
> > The first thing to resolve is
> > Oct 18 01:29:30 <crit> postfix/smtp[25432]: fatal: watchdog timeout
> > Oct 18 02:22:20 <crit> postfix/smtp[29195]: fatal: watchdog timeout
> > Oct 18 04:35:35 <crit> postfix/smtp[4431]: fatal: watchdog timeout
> > Oct 18 04:51:32 <crit> postfix/smtp[6767]: fatal: watchdog timeout
> > Once this is done, the "skipped, ..." problem can be tackled again.
> > But very probably it will magically be gone at that time anyway.
>
> But the number of "fatal:.* timeouts" is << against the number of "skipped"
> entries.
Yes, but with
mail_params.h:#define VAR_DAEMON_TIMEOUT "daemon_timeout"
mail_params.h:#define DEF_DAEMON_TIMEOUT "18000s"
one smtp may hang up to 18000s / 3600s/h = 5h.
[n]qmgr will try to run the queue quite often in this time. And the
"skipped, ..." message tends to appear quite often for the same messages.
Best regards,
Lutz
--
Lutz Jaenicke Lutz.J...@aet.TU-Cottbus.DE
BTU Cottbus http://www.aet.TU-Cottbus.DE/personen/jaenicke/
Lehrstuhl Allgemeine Elektrotechnik Tel. +49 355 69-4129
Universitaetsplatz 3-4, D-03044 Cottbus Fax. +49 355 69-4153
> > What FS is being used for the Queue?
> The orginal author pointed out "ext2".
My brain FIFO is just so small :)
> Yes, but with
> mail_params.h:#define VAR_DAEMON_TIMEOUT "daemon_timeout"
> mail_params.h:#define DEF_DAEMON_TIMEOUT "18000s"
> one smtp may hang up to 18000s / 3600s/h = 5h.
Whoa!
> [n]qmgr will try to run the queue quite often in this time. And the
> "skipped, ..." message tends to appear quite often for the same messages.
In fact, every minute :(
I did some awk/sort/uniq magic to find out how often which message was retried.
So I'd say: Reinstall Postfix as user Postfix, group Postfix, setgid()
Mailqueue just to be sure.
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
echo Mhbqnrnes Stbjr | tr "[a-y]" "[b-z]"
> I guess, that's not that easy with OpenPKG? I don't know actually, I
> try to find it out, and let you know.
Idea: Punish the packager of Postfix on OpenPKG. Hard. Severely.
He built a package ignoring the EXPLICIT installation instructions.
I'd say this falls within the "Gewaehrleitungspflicht".
> By the way, our amount of mail has increased to 122.000 mails, this is
> about 6.7 GB of mails.
Not bad.
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
Like medieval peasants, computer manufacturers and millions of users
are locked in a seemingly eternal lease with their evil landlord, who
comes around every two years to collect billions of dollars of taxes
in return for mediocre services. --Mark Harris, Electronics Times
Manuel
--
The problem with the rat race is that even if you win, you're still a rat.
-Jane Wagner
> As you know, this is a productiv system, I can't just reinstall. This
> would lead to a downtime of atleast 2 hours. You see, I've to find
> another way.
This is not true. You can simply "make && make install", and then "postfix reload"
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
Do not meddle in the affairs of sysadmins,
they are quick to anger and have no need for subtlety.
Manuel
--
If you want a thing done well, do it yourself.
-Napoleon Bonaparte
I talked to the packager, and he promised me that this couldn't have
anything to do with the built of the package.
Another thing is, that we haven't got these problems on other systems
with less mail traffic. It seems to be a problem of the amount of
mails, or of the nqmgr itself. We also didn't have these problems with
the "old" qmgr. What can I do now. There's no way in just installing a
new system.
Yesterday we had 19 Watchdog Timeouts. Where does a this error come
from?
>
> > By the way, our amount of mail has increased to 122.000 mails, this is
> > about 6.7 GB of mails.
>
> Not bad.
>
Manuel
The watchdog timeout is the last measure of postfix to shut down hanging
processes.
In your case I have seen "smtp" experiencing watchdog timeouts. This means,
that smtp was sitting around for 5 hours. This indicates that there is
a problem around: inside the SMTP protocol handling, all transactions
have their own (much shorter) timeout. So it is not that smtp is sitting
and waiting for the peer to react (network problems etc). Rather smtp
is hanging at a point at which no delay is expected and after 5 hours
finally the watchdog timeout occurs. Actually, you should never see
one. I have only seen them on once: that was on a system with a badly
synchronized clock on which I had to change the time in one step...
It is hard to say, how much effort it would take to find out why the processes
are hanging. If you have compiled postfix with debugging support, you
could easily attach a debugger to the hanging process and from the
traceback one should get a good idea on what is going on.
One of the nice properties of postfix is, that it is modular. You can thus
recompile postfix with "-g" and then replace the smtp executable in the
LIBEXEC subdirectory with the new one while postfix is still running.
All newly started smtp processes will then have debugging support and once
you find a hanging one (with 5 hours timeout you should just look out
for a process sitting around for some hours), you attach the debugger.
Best regards,
Lutz
--
Lutz Jaenicke Lutz.J...@aet.TU-Cottbus.DE
BTU Cottbus http://www.aet.TU-Cottbus.DE/personen/jaenicke/
Lehrstuhl Allgemeine Elektrotechnik Tel. +49 355 69-4129
Universitaetsplatz 3-4, D-03044 Cottbus Fax. +49 355 69-4153
btw, have you done a "postfix stop; postfix start" since the
problem began, or have you been doing only "postfix reload"?
Manuel
--
Space: the final frontier. These are the voyages of the starship Enterprise.
Its five-year mission: to explore strange new worlds; to seek out new life and
new civilizations; to boldly go where no man has gone before.
-Captain James T. Kirk
> Another thing is, that we haven't got these problems on other systems
> with less mail traffic.
That's interesting.
> It seems to be a problem of the amount of
> mails, or of the nqmgr itself. We also didn't have these problems with
> the "old" qmgr. What can I do now. There's no way in just installing a
> new system.
In that case, switch back to qmgr by editing master.cf
On the other hand I bet Wietse *might* like to have a look around that machine
to see what the problem with nqmgr might be.
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
Our continuing mission: To seek out knowledge of C, to explore strange
UNIX commands, and to boldly code where no one has man page 4.
> The watchdog timeout is the last measure of postfix to shut down hanging
> processes.
Idea: Build Postfix WITHOUT optimziation. This often helps. What compiler
was used? Was the package built on the box it runs on?
--
Ralf Hildebrandt Tel. +49 (0)30-450 570-155
Fax. +49 (0)30-450 570-916
Your mantra for today is: Don't let data from the network near a
shell. Bad things happen. -- Randall Schwartz
I don't think so.
If there is a problem with Postfix on a specific platform then I
can try to build a machine (in VMware :-) and test Postfix there.
I will not visit someone elses machine. IBM's pockets are too deep.
I deliver Postfix as working on a half-dozen systems including
Solaris, FreeBSD, BSD/OS, a slew of Linuxes, and AIX :-). Postfix
works on all those systems. And that is where my effort ends.
Watchdog timeouts in SMTP servers etc. are the result of a broken
build, the result of a broken library, or the result of a broken
kernel. I have seen problems of each type.
Wietse