Postfix locking up, not accepting connections / smtp not sending emails out

Christian Rohmann

unread,

Oct 29, 2010, 11:35:58 AM10/29/10

to

Hey postfix-users,

we are currently analyzing very strange postfix behavior which I can
only describe as lockup or freeze.
Honestly we reached our abilities to gather more info and to find the
root cause of this issue.

You are my last hope obi wan ... eh Wietse ....

--- Setup / Configuration ---
We are running three VMware ESX 4.1 guests (multiple VMware hosts) with
Debian Lenny amd64.
Each machine has 8 virtual CPUs and 4GB of RAM. We are using the Debian
lenny postfix package (2.5.5) and now moved up to 2.7.1 which we
backported (problems remained unchanged).
Each instance of postfix recieves about 200-300 mails/minute. The mail
addresses of the incoming emails are checked using a relay_recipient_map
(proxymap + mysql) are rewritten (domain is changed) and then relayed
using a transport map towards a multi-A record containing two IPs.

--- Lockup Issues ---
The setup and configuration works like a charm for hours at a time and
all of a sudden it stops working leading to two issues (not at the same
time):

1) First issue was that suddenly smtp stopped delivering email to that
mutli-A record. We noticed a few thousand emails in the active queue (I
guess all emails where in the active queue by that time). We rule out
problems with the destination servers since the remaining postfix
instances still delivered mail during that time. Even the last
submission of email from the now locked up postfix finished without
issue. There are just no more tries to reach the destination. Postfix
also stops to deliver to both destination IPs at the same time. There
was no logging anymore, but from anvil giving us some statistics about
connection rates.

2) The second issue, occurring more often is that smtpd stopped
responding or doing anything acutally. Sometimes the number of processes
went to max (500) at other occurences it just stayed at like 30-50 but
symptoms where still the same.

Looking at netstat all tcp connections to the smtpd processes went away
after some time.We believe that clients trying to deliver email to us
did disconnect due to reaching their timeout.

Here is a sniplet of log from before and after the issue:
--- connect before and after lockup ---
Oct 29 12:57:35 mailserver postfix/smtpd[12457]: connect from
newsletter.xxx.de[xxx.162.xxx.28]
Oct 29 13:29:02 mailserver postfix/smtpd[12457]: 826C611FA8:
client=newsletter.xxx.de[xxx.162.xxx.28]
Oct 29 13:29:03 mailserver postfix/smtpd[12457]: lost connection after
RCPT from newsletter.xxx.de[xxx.162.xxx.28]
Oct 29 13:29:03 mailserver postfix/smtpd[12457]: disconnect from
newsletter.xxx.de[xxx.162.xxx.28]
--- / connect before and after lockup ---
There is not a single line of log from smtpd processes while being in
the locked up state. At 12:57:35 there is a last connect from a client,
no more log happens until we run "postfix reload" (! a reload is
sufficient) about half an hour later. Then the last few lines show that
the smtpd process is "alive" again talking to the now already
disconnected client.

Also the master process is still there but not accepting anymore
connections until we issue "postfix reload". Here is an strace of the
master process:
--- strace on master-pid while locked up ---
10648 alarm(333) = 307
10648 epoll_wait(13,
--- / strace on master-pid while locked up ---

Another strange thing is, that some of the last log entries contain
"UNKNOWN" instead of the IP:
--- smtpd connects towards UNKNOWN ---
Oct 29 13:05:02 mailserver postfix/smtpd[14196]: connect from
unknown[unknown]
Oct 29 13:05:02 mailserver postfix/smtpd[14196]: NOQUEUE: reject:
CONNECT from unknown[unknown]: 550 5.7.1 Client host rejected: cannot
find your reverse hostname, [unknown]; proto=SMTP
Oct 29 13:05:02 mailserver postfix/smtpd[14196]: too many errors after
CONNECT from unknown[unknown]
Oct 29 13:05:02 mailserver postfix/smtpd[14196]: disconnect from
unknown[unknown]
--- / smtpd connects towards UNKNOWN ---

Anvil also gives statistics about this obscure client-IP "UNKNOWN":
--- anvil statistics for UNKNOWN ---
Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max
connection rate 92/60s for (smtp:unknown) at Oct 29 13:05:02
Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max
connection count 1 for (smtp:unknown) at Oct 29 13:05:02
Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max cache
size 413 at Oct 29 13:05:02
--- / anvil statistics for UNKNOWN ---

During the lock up we still see the *:25 listening socket of the master
process. When we tried to connect ourselves we just waited forever for a
connection or even a SMTP prompt.

New incoming connections are in state "SYN_RECV":
--- netstat while locked up---
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 x.x.x.143:25 xx.0.xx.217:58556 SYN_RECV
tcp 0 0 x.x.x.143:25 xx.115.xx.166:46355 SYN_RECV
tcp 0 0 x.x.x.143:25 xx.63.xx.197:37216 SYN_RECV
.
.
.
---/ netstat while locked up---

Again a "postfix reload" does fix the problem (for a few hours) and
postfix delivers the > 10,000 emails in the active queue in a few
seconds towards the two IPs from the DNS round robin and everything goes
back to normal for a few hours.

I'm hoping for suggestions on what to look for or check,
TIA

Christian

Wietse Venema

unread,

Oct 29, 2010, 11:49:04 AM10/29/10

to

I will assume that this is a bug in OS software or in emulated hardware.

Does this server run in a virtual machine?

What is the output from "grep fatal" on today's and yesterday's maillog file?

What is the output from "grep watchdog" on all your maillog files?

Wietse

Christian Rohmann

unread,

Oct 29, 2010, 12:05:44 PM10/29/10

to

Hey Wietse,

thanks for the quick reply. Sorry for the delay, was a few GBs of logs
to grep through ;-)

On 10/29/2010 05:49 PM, Wietse Venema wrote:
> I will assume that this is a bug in OS software or in emulated hardware.

Possible, but we are not really having a special setup ... just VMware +
Debian amd64 which hundreds of other shops run I suppose.

> Does this server run in a virtual machine?

Yeah, the Debian Lenny (amd64) runs on VMware ESX 4.1 hosts. The guests
itself are Vmware HW revision 7.

> What is the output from "grep fatal" on today's and yesterday's maillog file?

None, not a single line.

> What is the output from "grep watchdog" on all your maillog files?

Same as above -> nothing.

I guess that rules out this issue here?

On 10/29/2010 05:43 PM, lst_...@kwsoft.de wrote:
> Maybe another instance of this problem?
> http://tech.groups.yahoo.com/group/postfix-users/message/269786

Even though at some point postfix stopped at EPOLL_WAIT...

Christian

Wietse Venema

unread,

Oct 29, 2010, 1:04:40 PM10/29/10

to

Christian Rohmann:

> > Does this server run in a virtual machine?
> Yeah, the Debian Lenny (amd64) runs on VMware ESX 4.1 hosts. The guests
> itself are Vmware HW revision 7.

VMware has an entire KB article on problems with delivering timer
interrupts to guest machines, and the hoops that they are jumping
through to avoid poor performance. See
http://tech.groups.yahoo.com/group/postfix-users/message/269786

> > What is the output from "grep fatal" on today's and yesterday's maillog file?
> None, not a single line.
>
> > What is the output from "grep watchdog" on all your maillog files?
> Same as above -> nothing.
>
> I guess that rules out this issue here?

No, it confirms my suspicion that either a) you run Postfix < 2.4
and do "postfix stop" or "reload" frequently, or b) your virtual
timers are broken, or c) you used "grep" on compressed files instead
of using "zgrep" or "bzgrep".

All Postfix daemons including the master have an alarm(3) timer
that aborts the process when it becomes stuck.

Normally all processes reset their alarm timer frequently; when
they become stuck, they stop resetting their alarm timer. When the
timer goes off, it logs a watchdog error and kills the process.

> On 10/29/2010 05:43 PM, lst_...@kwsoft.de wrote:
> > Maybe another instance of this problem?
> > http://tech.groups.yahoo.com/group/postfix-users/message/269786
>
> Even though at some point postfix stopped at EPOLL_WAIT...

That does not look like the problem with "postfix stop" or "reload"
with Postfix < 2.4 which sometimes triggers a deadlock in syslog().

So we still have the possibility that your timer support is broken
such that even the per-process alarm timer is no longer working.

Postfix relies heavily on timer support to enforce sanity.

Specifically, Postfix relies on short-term timers (implemented with
poll and epoll on Linux) to enforce time limits on read/write
operations, and relies on long-term alarm timers to kill off a
process that hangs because some short-timer failed to go off.

If both layers of safety fail due to broken (virtual) timer support,
then it is not possible to run Postfix reliably.

Wietse

Wietse Venema

unread,

Oct 29, 2010, 1:23:30 PM10/29/10

to

> > On 10/29/2010 05:43 PM, lst_...@kwsoft.de wrote:
> > > Maybe another instance of this problem?
> > > http://tech.groups.yahoo.com/group/postfix-users/message/269786
> >
> > Even though at some point postfix stopped at EPOLL_WAIT...

The main loop in the master is as follows:

forever {
set an alarm for 1000s
do an EPOLL_WAIT for up to 500s and handle any child process
events, or short-term timer requests that are implemented
around the EPOLL_WAIT timer.
respond to sighup (the sighup flag is set by a signal handler)
respond to sigchld (the sigchld flag is set by a signal handler)
}

It would be worthwhile to see what strace reports when you leave
it running. If strace reports nothing in 500s then EPOLL_WAIT is
not working. If strace reports nothing after 1000s then the alarm
timer is also not working.

Wietse

Christian Rohmann

unread,

Oct 29, 2010, 2:19:38 PM10/29/10

to

Hey again,

On 10/29/2010 07:23 PM, Wietse Venema wrote:
> The main loop in the master is as follows:
>
> forever {
> set an alarm for 1000s
> do an EPOLL_WAIT for up to 500s and handle any child process
> events, or short-term timer requests that are implemented
> around the EPOLL_WAIT timer.
> respond to sighup (the sighup flag is set by a signal handler)
> respond to sigchld (the sigchld flag is set by a signal handler)
> }

Just now one machine had the issue again. I checked and saw that we
where down to just two smtpd processes and even though master was still
bound to port 25 no new connections where accepted. I did telnet to it,
but the connection was not accepted and ran into timeout.

How does the timer issue relate to the master process not accepting
anymore TCP/IP connections on port 25?

> It would be worthwhile to see what strace reports when you leave
> it running. If strace reports nothing in 500s then EPOLL_WAIT is
> not working. If strace reports nothing after 1000s then the alarm
> timer is also not working.

I'll try to gather you some strace data. I guess the strace should be of
the master? Could you give me a hint on what options you might want?

On 10/29/2010 07:04 PM, Wietse Venema wrote:
> VMware has an entire KB article on problems with delivering timer
> interrupts to guest machines, and the hoops that they are jumping
> through to avoid poor performance. See
> http://tech.groups.yahoo.com/group/postfix-users/message/269786

Thanks for the hint, I already printed that article to read over the
weekend.

Thanks for your help,

Christian

Jeroen Geilman

unread,

Oct 29, 2010, 2:38:32 PM10/29/10

to

On 10/29/2010 05:35 PM, Christian Rohmann wrote:
> Hey postfix-users,
>
> we are currently analyzing very strange postfix behavior which I can
> only describe as lockup or freeze.
> Honestly we reached our abilities to gather more info and to find the
> root cause of this issue.
>
> You are my last hope obi wan ... eh Wietse ....
>
>
> --- Setup / Configuration ---
> We are running three VMware ESX 4.1 guests (multiple VMware hosts) with
> Debian Lenny amd64.
> Each machine has 8 virtual CPUs and 4GB of RAM.

Wow. Seriously ? For *email* ?

> We are using the Debian
> lenny postfix package (2.5.5) and now moved up to 2.7.1 which we
> backported (problems remained unchanged).
> Each instance of postfix recieves about 200-300 mails/minute.

That's... frankly, that's nothing.
We do that volume on a single-vcpu with 512MB.

At 5% CPU utilization.

Which makes me suspect you have other issues.
Is your networking sane/stable/provably working 100% ?
Which vmware NIC are you using ?
I have seen incompatibilities in Linux with anything but the standard
Intel E1000 emulator, so I suggest you use only that.

> The setup and configuration works like a charm for hours at a time and
> all of a sudden it stops working leading to two issues (not at the same
> time):
>
> 1) First issue was that suddenly smtp stopped delivering email to that
> mutli-A record. We noticed a few thousand emails in the active queue (I
> guess all emails where in the active queue by that time). We rule out
> problems with the destination servers since the remaining postfix
> instances still delivered mail during that time. Even the last
> submission of email from the now locked up postfix finished without
> issue. There are just no more tries to reach the destination. Postfix
> also stops to deliver to both destination IPs at the same time. There
> was no logging anymore, but from anvil giving us some statistics about
> connection rates.
>
> 2) The second issue, occurring more often is that smtpd stopped
> responding or doing anything acutally. Sometimes the number of processes
> went to max (500) at other occurences it just stayed at like 30-50 but
> symptoms where still the same.
>

Again looks like a networking issue to me - however, the issue may well
be *caused* by a timer malfunction.

Timing is rather important for dependable TCP, after all.

> Looking at netstat all tcp connections to the smtpd processes went away
> after some time.We believe that clients trying to deliver email to us
> did disconnect due to reaching their timeout.
>

That means the symptom definitely surfaces in master(8), as that accepts
all connections.

It's a tossup between networking and timing, or network-related timing,
for all I can see.

I have to tell you we run a dozen Linux boxes on ESX 4.1, and we have
zero problems on any of them.

We sync the vmware hosts from NTP, and sync the local guest time from
the vmware tools - or sometimes also from NTP.

Both work fine.

> I'm hoping for suggestions on what to look for or check,
>

Configure reliable network time with ntp.
Works for me.

--
J.

Wietse Venema

unread,

Oct 29, 2010, 2:57:04 PM10/29/10

to

Christian Rohmann:

> > It would be worthwhile to see what strace reports when you leave
> > it running. If strace reports nothing in 500s then EPOLL_WAIT is
> > not working. If strace reports nothing after 1000s then the alarm
> > timer is also not working.
>
> I'll try to gather you some strace data. I guess the strace should be of
> the master? Could you give me a hint on what options you might want?

I was replying to your text that strace reported the master was in
EPOLL_WAIT. I suggest that you find out what strace reports when
you leave it running for a longer time.

Wietse

Wietse Venema

unread,

Oct 29, 2010, 4:35:21 PM10/29/10

to

Christian Rohmann:

> Hey again,
>
> On 10/29/2010 07:23 PM, Wietse Venema wrote:
> > The main loop in the master is as follows:
> >
> > forever {
> > set an alarm for 1000s
> > do an EPOLL_WAIT for up to 500s and handle any child process
> > events, or short-term timer requests that are implemented
> > around the EPOLL_WAIT timer.
> > respond to sighup (the sighup flag is set by a signal handler)
> > respond to sigchld (the sigchld flag is set by a signal handler)
> > }
>
> Just now one machine had the issue again. I checked and saw that we
> where down to just two smtpd processes and even though master was still
> bound to port 25 no new connections where accepted. I did telnet to it,
> but the connection was not accepted and ran into timeout.

This means that the smtpd processes are hanging, the master is
hanging, or both.

At this point I will not speculate further until you report the
result of following the instructions in
http://www.postfix.org/DEBUG_README.html#logging

If I don't see a credible report about warnings etc. in Postfix
logfiles, then that means that you are flying blind, and that needs
to be addressed first.

The following is for background information only.

The master daemon watches the SMTP port only when all existing
smtpd processes have reported that they are busy (i.e. talking to
an SMTP client). Otherwise, some idle smtpd process will watch
the port.

When all smtpd processes have reported that they are busy, the
master starts a new smtpd processes in response to a new connection,
provided that the per-service process limit is not reached (otherwise
the master logs a warning that all ports are busy).

In your case, the two smtpd processes got stuck before sending the
"I am busy" message to the master daemon, so the master daemon
still believes that the two processes are idle. I don't know if
this has anything to do with broken virtual timers.

Regardless of why a process hangs, if it hangs then you should see
watchdog errors in the Postfix logs. If you don't see those then
either your virtual timer is busted, or your logging is busted.

Wietse

Ravindra Gupta // Viva

unread,

Oct 30, 2010, 5:03:00 AM10/30/10

to

Hiii,

How to improve postfix server sending speed, any parameters required in main.cf file so confirm me.

My requirement is we will send minimum 500 to 600 email in less then 1 minutes.

Please give the suggestion

On Fri, Oct 29, 2010 at 9:05 PM, Christian Rohmann <croh...@netcologne.de> wrote:

Hey postfix-users,

we are currently analyzing very strange postfix behavior which I can
only describe as lockup or freeze.
Honestly we reached our abilities to gather more info and to find the
root cause of this issue.

You are my last hope obi wan ... eh Wietse ....

--- Setup / Configuration ---
We are running three VMware ESX 4.1 guests (multiple VMware hosts) with
Debian Lenny amd64.

Each machine has 8 virtual CPUs and 4GB of RAM. We are using the Debian

lenny postfix package (2.5.5) and now moved up to 2.7.1 which we
backported (problems remained unchanged).

Each instance of postfix recieves about 200-300 mails/minute. The mail
addresses of the incoming emails are checked using a relay_recipient_map
(proxymap + mysql) are rewritten (domain is changed) and then relayed
using a transport map towards a multi-A record containing two IPs.

--- Lockup Issues ---

The setup and configuration works like a charm for hours at a time and
all of a sudden it stops working leading to two issues (not at the same
time):

1) First issue was that suddenly smtp stopped delivering email to that
mutli-A record. We noticed a few thousand emails in the active queue (I
guess all emails where in the active queue by that time). We rule out
problems with the destination servers since the remaining postfix
instances still delivered mail during that time. Even the last
submission of email from the now locked up postfix finished without
issue. There are just no more tries to reach the destination. Postfix
also stops to deliver to both destination IPs at the same time. There
was no logging anymore, but from anvil giving us some statistics about
connection rates.

2) The second issue, occurring more often is that smtpd stopped
responding or doing anything acutally. Sometimes the number of processes
went to max (500) at other occurences it just stayed at like 30-50 but
symptoms where still the same.

Looking at netstat all tcp connections to the smtpd processes went away
after some time.We believe that clients trying to deliver email to us
did disconnect due to reaching their timeout.

Again a "postfix reload" does fix the problem (for a few hours) and
postfix delivers the > 10,000 emails in the active queue in a few
seconds towards the two IPs from the DNS round robin and everything goes
back to normal for a few hours.

I'm hoping for suggestions on what to look for or check,

TIA

Christian

--

Incase of any further queries, Please feel free to mail me or contact me on the numbers provided below.

Thanks & Regards,
Ravindra Gupta
Asst Manager - Tech Support

Viva Infomedia Pvt. Ltd.
242, Oshiwara Industrial Centre,
New Link Road, Opp. Oshiwara Bus Depot,
Goregaon West, Mumbai 400104.
Direct: +91.22.40310353
Board: +91.22.40310310
Email: ravi...@vivaconnect.in

Viva Infomedia: Awarded as Best SME (E-Commerce) at CNBC Emerging India Awards 2009

Jeroen Geilman

unread,

Oct 30, 2010, 5:50:54 AM10/30/10

to

On 10/30/2010 11:03 AM, Ravindra Gupta // Viva wrote:

Hiii,

How to improve postfix server sending speed, any parameters required in main.cf file so confirm me.

My requirement is we will send minimum 500 to 600 email in less then 1 minutes.

Please give the suggestion

Don't respond to an old thread with a new question!

And don't top-post.

-- 
J.

Christian Rohmann

unread,

Nov 3, 2010, 11:02:08 AM11/3/10

to

Hallo,

sorry for the late update ... but here it is ...

On 10/29/2010 10:35 PM, Wietse Venema wrote:
> If I don't see a credible report about warnings etc. in Postfix
> logfiles, then that means that you are flying blind, and that needs
> to be addressed first.

Agreed. I did read the document about clocksource and timekeeping with
VMware. I then configured each machine differently to see if there is
any change in stability.

> Regardless of why a process hangs, if it hangs then you should see
> watchdog errors in the Postfix logs. If you don't see those then
> either your virtual timer is busted, or your logging is busted.

We guessed ourself that there was an issue with VMware or the config of
the guest machine's kernel. I tried to reduce the number of CPUs on one
machine to two, as well as setting the clocksource to "acpi_pm" on
another one (left at eight cores); And the winner is: "less cpus".

The machine with two virtual CPUs didn't crash once. Clocksource
"acpi_pm" didn't help at all. So I now have all three instances running
with two virtual CPUs for days without ANY issue whatsoever.

As I don't need to performance of eight cores currently, I will leave it
like that for now. Maybe any1 has more ideas based on the fact that the
thing is stable with two cores now, but wasn't with eight.

Regards,

Christian

Stan Hoeppner

unread,

Nov 3, 2010, 3:33:17 PM11/3/10

to

Christian Rohmann put forth on 11/3/2010 10:02 AM:

> Maybe any1 has more ideas based on the fact that the
> thing is stable with two cores now, but wasn't with eight.

Absolutely. With 8 virtual CPUs (gasp OMG! big no-no) your guest kernel
will be generating a vastly larger number of timer interrupts/second
that the VMkernel has to service. I believe this is covered in the
VMware article Wietse referred you to.

This is one of the reasons VMware recommends you only configure a VM
guest with the minimum number of vCPUs needed, which is almost always
ONE vCPU. VMware is a consolidation platform, not a performance
platform. The only reason ESX was given the ability to use virtual SMP
is for customers who deploy massive databases and such atop ESX in a SAN
environment strictly for 2 main reasons:

1. VMotion HA
2. Consolidated backup

VMware did not envision customers doing consolidation _and_ giving each
guest a high count of vCPUs.

I've seen cases studies of real environments running huge Oracle
databases on 8 socket x86 servers with 128GB RAM, with ESX underneath
providing 8 vCPUs. This is why the feature was added. It wasn't added
so Joe admin could run his 20 LAMP servers with 8 vCPUs each. You're
not the first OP I've seen or heard of getting in trouble with this.

Unfortunately EMC is all about making money. One usually only learns of
this when going through paid training. Another avenue is asking the
right questions of one's vendor ESX engineer who has gone through the
training. One other avenue to this information is talking up a VMware
company tech at a trade show/conference. This is where I learned of
this vCPUs issue, some 4 years ago. Obviously you can learn of it on
the Postfix mailing list as well. ;)

--
Stan