So we're using postfix to help alleviate some slowdowns on our mailman
servers at SourceForge.net. I've got one of our list servers relaying all
mail out via another machine running postfix. The machine is a 4-way
Opteron with 8 GB of RAM and the only other thing that it's doing is
running SpamAssassin against all of our incoming mail (Which only takes
about 3% of the available CPU... yes, this thing is a beast :).
The disks are all 320 MB/s SCSI disks and the spool data is on a 5-disk
software RAID.
Here's the uname -a output from the machine:
Linux sc8-sf-uberspam1 2.4.21-178-numa #1 SMP Fri Jan 9 17:06:44 UTC 2004
x86_64 unknown
The OS, should that matter, is SuSE Enterprise Linux 8.0.
The postfix version, from the default RPM:
sc8-sf-uberspam1:~ # rpm -qi postfix
Name : postfix Relocations: (not relocateable)
Version : 1.1.12 Vendor: UnitedLinux LLC
Release : 12 Build Date: Tue Jul 29 04:34:58 2003
Our postfix configuration:
alias_maps = hash:/etc/aliases
canonical_maps = hash:/etc/postfix/canonical
command_directory = /usr/sbin
config_directory = /etc/postfix
content_filter =
daemon_directory = /usr/lib/postfix
debug_peer_level = 2
default_destination_concurrency_limit = 20
defer_transports =
disable_dns_lookups = no
in_flow_delay = 0
mail_owner = postfix
mail_spool_directory = /var/mail
mailbox_command =
mailbox_transport =
mailq_path = /usr/bin/mailq
manpage_directory = /usr/share/man
masquerade_classes = envelope_sender, header_sender, header_recipient
masquerade_domains =
masquerade_exceptions = root
mydestination = $myhostname, localhost.$mydomain
myhostname = sc8-sf-uberspam1.sourceforge.net
mynetworks = 10.3.0.0/16
newaliases_path = /usr/sbin/sendmail
queue_directory = /var/local/postfix
readme_directory = /usr/share/doc/packages/postfix/README_FILES
relayhost =
relocated_maps = hash:/etc/postfix/relocated
sample_directory = /usr/share/doc/packages/postfix/samples
sender_canonical_maps = hash:/etc/postfix/sender_canonical
sendmail_path = /usr/sbin/sendmail
setgid_group = maildrop
smtp_tls_cert_file = /etc/ssl/certs/star.sourceforge.pem
smtp_tls_key_file = /etc/ssl/certs/star.sourceforge.pem
smtp_use_tls = yes
smtpd_client_restrictions =
smtpd_helo_required = no
smtpd_helo_restrictions =
smtpd_recipient_restrictions = permit_mynetworks,check_relay_domains
smtpd_sender_restrictions = hash:/etc/postfix/access
strict_rfc821_envelopes = no
syslog_facility = local6
transport_maps = hash:/etc/postfix/transport
virtual_maps = hash:/etc/postfix/virtual
From master.cf:
smtp unix - - n - 999 smtp
Our queue is this shape, at the moment:
active 4183
bounce 143
defer 1193
deferred 808
flush 1
incoming 34091
pid 10
Currently, there are 1000 smtp instances running: one to take in incoming
mail, and 999 to send out the outbound mail.
However, over time, that number ratchets downward until it hits what
appears to be a floor. Last time I found it at 247 instances and that
number wasn't budging. I'm currently monitoring it by logging this loop to
a file:
while /bin/true; do date; echo "smtp count: `ps ax | grep "smtp -t unix -u" | grep -v grep | wc -l`"; sleep 30; done
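The counting step in that loop can be isolated into a small function, which makes it easier to test against canned "ps ax" output. This is only a sketch: the function name count_smtp is made up for illustration, and the bracket in the pattern is the usual trick for keeping grep from matching its own command line, so the extra "grep -v grep" is not needed.

```shell
# count_smtp: read "ps ax"-style output on stdin and print how many
# lines belong to Postfix smtp client processes ("smtp -t unix -u").
# The pattern '[s]mtp ...' matches "smtp ..." but not the literal
# text "[s]mtp ...", so a grep process listed by ps is not counted.
count_smtp() {
    grep -c '[s]mtp -t unix -u'
}
```

It would be used as "ps ax | count_smtp" inside the same while loop.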
I've found that issuing a "postfix reload" shoots it back up to 999.
I'll post any more details as I learn them, but I'm wondering if there's
something obvious that I'm missing here.
P.S. I got the queue stats with this script. It works well on machines with
fast disks:
#!/bin/sh
PF_ROOT=/var/spool/postfix
cd $PF_ROOT
for name in *; do
    count=`find $name -type f | wc -l`
    if [ $count -gt 0 ]; then
        echo $name $count
    fi
done
Thanksmuch,
--
Ari Gordon-Schlosberg http://www.nebcorp.com/~regs/pgp for PGP public key
> I've found that issuing a "postfix reload" shoots it back up to 999.
Of course it shoots up. But it does not deliver more mail.
It just runs a lot of processes that do nothing.
The harder you bang on the machine with "postfix reload"
and "postfix flush" the worse it will perform.
> I'll post any more details as I learn them, but I'm wondering if there's
> something obvious that I'm missing here.
Yes, you never looked in the logfile to see why mail is not
being delivered.
Wietse
Well, that was quick. As you can see below, the number of postfix
instances drops pretty quickly and then stays down. Is it possible
that the qmgr is blocked on something? A strace of the process
shows it doing a lot of selects and then making connections via
unix sockets to private/smtp, so it seems to be working.
Here's the breakdown of the number of SMTP processes running, sampled at
30-second intervals. In 22 minutes it drops to under 200, then it comes
back up slightly and goes down again. Shouldn't there just be 1000 running
all the time, so that as soon as one finishes a new one is spawned to take
its place?
Tue Feb 10 10:14:37 PST 2004
smtp count: 999
Tue Feb 10 10:15:07 PST 2004
smtp count: 999
Tue Feb 10 10:15:37 PST 2004
smtp count: 999
Tue Feb 10 10:16:07 PST 2004
smtp count: 999
Tue Feb 10 10:16:37 PST 2004
smtp count: 999
Tue Feb 10 10:17:07 PST 2004
smtp count: 999
Tue Feb 10 10:17:37 PST 2004
smtp count: 999
Tue Feb 10 10:18:08 PST 2004
smtp count: 999
Tue Feb 10 10:18:38 PST 2004
smtp count: 999
Tue Feb 10 10:19:08 PST 2004
smtp count: 999
Tue Feb 10 10:19:38 PST 2004
smtp count: 983
Tue Feb 10 10:20:08 PST 2004
smtp count: 976
Tue Feb 10 10:20:38 PST 2004
smtp count: 993
Tue Feb 10 10:21:08 PST 2004
smtp count: 984
Tue Feb 10 10:21:38 PST 2004
smtp count: 975
Tue Feb 10 10:22:09 PST 2004
smtp count: 971
Tue Feb 10 10:22:39 PST 2004
smtp count: 959
Tue Feb 10 10:23:09 PST 2004
smtp count: 951
Tue Feb 10 10:23:39 PST 2004
smtp count: 945
Tue Feb 10 10:24:09 PST 2004
smtp count: 928
Tue Feb 10 10:24:39 PST 2004
smtp count: 920
Tue Feb 10 10:25:09 PST 2004
smtp count: 905
Tue Feb 10 10:25:39 PST 2004
smtp count: 904
Tue Feb 10 10:26:10 PST 2004
smtp count: 884
Tue Feb 10 10:26:40 PST 2004
smtp count: 882
Tue Feb 10 10:27:10 PST 2004
smtp count: 698
Tue Feb 10 10:27:40 PST 2004
smtp count: 650
Tue Feb 10 10:28:10 PST 2004
smtp count: 607
Tue Feb 10 10:28:40 PST 2004
smtp count: 596
Tue Feb 10 10:29:10 PST 2004
smtp count: 573
Tue Feb 10 10:29:40 PST 2004
smtp count: 553
Tue Feb 10 10:30:10 PST 2004
smtp count: 277
Tue Feb 10 10:30:40 PST 2004
smtp count: 312
Tue Feb 10 10:31:10 PST 2004
smtp count: 459
Tue Feb 10 10:31:41 PST 2004
smtp count: 459
Tue Feb 10 10:32:11 PST 2004
smtp count: 458
Tue Feb 10 10:32:41 PST 2004
smtp count: 312
Tue Feb 10 10:33:11 PST 2004
smtp count: 206
Tue Feb 10 10:33:41 PST 2004
smtp count: 248
Tue Feb 10 10:34:11 PST 2004
smtp count: 247
Tue Feb 10 10:34:41 PST 2004
smtp count: 247
Tue Feb 10 10:35:11 PST 2004
smtp count: 244
Tue Feb 10 10:35:41 PST 2004
smtp count: 211
Tue Feb 10 10:36:11 PST 2004
smtp count: 194
Tue Feb 10 10:36:41 PST 2004
smtp count: 167
Tue Feb 10 10:37:11 PST 2004
smtp count: 167
Tue Feb 10 10:37:41 PST 2004
smtp count: 90
Tue Feb 10 10:38:11 PST 2004
smtp count: 269
Tue Feb 10 10:38:41 PST 2004
smtp count: 269
Tue Feb 10 10:39:11 PST 2004
smtp count: 449
Tue Feb 10 10:39:41 PST 2004
smtp count: 449
Tue Feb 10 10:40:12 PST 2004
smtp count: 449
Tue Feb 10 10:40:42 PST 2004
smtp count: 449
Tue Feb 10 10:41:12 PST 2004
smtp count: 447
Tue Feb 10 10:41:42 PST 2004
smtp count: 402
Tue Feb 10 10:42:12 PST 2004
smtp count: 303
Tue Feb 10 10:42:42 PST 2004
smtp count: 276
Tue Feb 10 10:43:12 PST 2004
smtp count: 275
Tue Feb 10 10:43:42 PST 2004
smtp count: 273
Tue Feb 10 10:44:12 PST 2004
smtp count: 73
Tue Feb 10 10:44:42 PST 2004
smtp count: 76
Tue Feb 10 10:45:12 PST 2004
smtp count: 96
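A log in this shape is easier to read when reduced to a few numbers. The following is a rough sketch that assumes the log alternates "date" lines with "smtp count: N" lines exactly as above; the function name summarize_smtp_log is made up for illustration.

```shell
# summarize_smtp_log: read the monitoring log (file arguments or stdin)
# and print the number of samples plus the lowest and highest count.
# Only lines starting with "smtp count:" are considered; date lines
# are ignored.
summarize_smtp_log() {
    awk '/^smtp count:/ {
             n = $3 + 0
             if (seen == 0 || n < min) min = n
             if (seen == 0 || n > max) max = n
             seen++
         }
         END {
             if (seen) printf "samples=%d min=%d max=%d\n", seen, min, max
         }' "$@"
}
```

Run as "summarize_smtp_log logfile", where logfile is whatever file the loop output was redirected to.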
--
> Our queue is this shape, at the moment:
> active 4183
> bounce 143
> defer 1193
> deferred 808
> flush 1
> incoming 34091
> pid 10
>
Your active queue is not full, but your incoming queue is large. This
indicates that the queue manager is not able to move email into the active
queue fast enough, a problem exacerbated by unnecessary queue manager
restarts. How often do you rebuild your virtual table?
> However, over time, that number ratchets downward until it hits what
> appears to be a floor. Last time I found it at 247 instances and that
> number wasn't budging. I'm currently monitoring it by logging this loop to
> a file:
Some of the high volume destinations may be getting throttled. Show the
output of "qshape" (from Ralf Hildebrandt's website, but change the code to
not count the "incoming" directory in the "A" column).
Of the above, only "incoming", "active" and "deferred" are "real" queues.
The other directories don't hold messages.
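Following that distinction, a variant of the counting script could look only at the three directories that hold messages. This is a sketch under assumptions: the function name count_real_queues is made up, and the default path is simply the PF_ROOT from the earlier script, so the real queue_directory (here /var/local/postfix according to the posted configuration) should be passed as the argument.

```shell
# count_real_queues: print the number of files in the incoming, active
# and deferred queues under the given Postfix queue directory.
# $1 is the queue directory; the default is an assumption taken from
# the PF_ROOT value in the script earlier in the thread.
count_real_queues() {
    root=${1:-/var/spool/postfix}
    for q in incoming active deferred; do
        if [ -d "$root/$q" ]; then
            n=$(find "$root/$q" -type f | wc -l)
            # $((n)) strips any padding that wc adds on some systems
            echo "$q $((n))"
        fi
    done
}
```

For example: "count_real_queues /var/local/postfix".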
--
Viktor.
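> Shouldn't there just be 1000 running all the time, so that as soon as one
> finishes a new one is spawned to take its place?
Of course not, that would be completely nonsensical.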
Start looking IN THE LOG FILES.
Wietse
Mail is arriving faster than Postfix can deliver it. This is
because the mail receiving processes are keeping the disk so
busy that the queue manager loses the race.
To increase outgoing concurrency, reduce the inflow rate.
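One knob that may be relevant here is in_flow_delay, which the posted configuration has set to 0. The fragment below is only a guess at how inflow could be slowed; whether this is the mechanism meant above, and how well it behaves in Postfix 1.1, are assumptions to check against the documentation for the installed version.

```
# main.cf fragment (illustrative only, value is an assumption):
# pause briefly before accepting a new message when mail arrives
# faster than it is delivered
in_flow_delay = 1s
```

Lowering the process limit of the smtpd service in master.cf would be another way to limit how fast mail comes in.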
> sc8-sf-uberspam1:~ # rpm -qi postfix
> Name : postfix Relocations: (not relocateable)
> Version : 1.1.12 Vendor: UnitedLinux LLC
> Release : 12 Build Date: Tue Jul 29 04:34:58 2003
Do use a more recent version, if only so that the people here can
help you better.
Postfix 1.1 is two years old.
Wietse
This machine does not have a virtual table. It's a pure mail relay.
And I've only done the postfix reload thing four or five times in the past
12 hours, as I'm testing tweaks and whatnot.
>
> > However, over time, that number ratchets downward until it hits what
> > appears to be a floor. Last time I found it at 247 instances and that
> > number wasn't budging. I'm currently monitoring it by logging this loop to
> > a file:
>
> Some of the high volume destinations may be getting throttled. Show the
> output "qshape" (from Ralf Hildebrandt's website, but change the code to
> not count the "incoming" directory in the "A" column).
>
Hmm... that is interesting:
                           T     A   5  10  20  40  80 160 320 320+
                TOTAL  11775  9996   0   0   1   1  42  94 221 1420
 user.sourceforge.net   7678  7678   0   0   0   0   0   0   0    0
lists.sourceforge.net   2313  2313   0   0   0   0   0   0   0    0
       gzd.gotdns.com    102     0   0   0   0   0   0   0   2  100
We have a problem with a mail loop or something with one of our lists... I
haven't tracked it down yet. It looks like one of our list admin's
addresses is incorrect, so it's bouncing to our list admin's address.
I'm going to craft a shell script that sends everything to /dev/null and
then set up a pipe transport to blackhole this address. Is there a better
way? Something like (/etc/postfix/blackhole):
#!/bin/sh
cat - > /dev/null
And in master.cf:
blackhole unix - n n - - pipe
  user=nobody argv=/etc/postfix/blackhole
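For the new service to receive any mail, something has to route the destination to it. A minimal sketch of a matching transport table entry follows, assuming the file named by transport_maps in the configuration above; loop.example.com is a hypothetical placeholder, not the real domain involved.

```
# /etc/postfix/transport (illustrative entry)
# loop.example.com is a placeholder for the looping destination
loop.example.com    blackhole:
```

Since the map is of type hash, it would need to be rebuilt with "postmap /etc/postfix/transport" after editing.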
And as soon as I did that, the number of processes jumped to 999 and has
stayed there. And the mail queue appears to be draining rather quickly.
Cranking it up to 1800 makes it go even faster. As I/O starvation happens,
it'll dip down to 1588 or so, but it comes back up.
Thanks, Viktor and Wietse.
>                            T     A   5  10  20  40  80 160 320 320+
>                 TOTAL  11775  9996   0   0   1   1  42  94 221 1420
>  user.sourceforge.net   7678  7678   0   0   0   0   0   0   0    0
> lists.sourceforge.net   2313  2313   0   0   0   0   0   0   0    0
>        gzd.gotdns.com    102     0   0   0   0   0   0   0   2  100
>
> We have a problem with a mail loop or something with one of our lists... I
> haven't tracked it down yet. It looks like one of our list admin's
> addresses is incorrect, so it's bouncing to our list admin's address.
>
I am glad "qshape" worked for you. I wrote it to help answer the
questions:
- Is the queue congested or simply large?
- Is the congestion new, old or uniformly distributed in time?
- Which recipient domains dominate the active and deferred queues?
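The third question can be illustrated with a small tally. This is not how qshape itself works (it reads the queue files directly); it is a simplified sketch that assumes a list of recipient addresses, one per line, is already available on stdin. The function name tally_domains is made up.

```shell
# tally_domains: read one recipient address per line on stdin and print
# "count domain" lines, busiest domain first. Lines that do not contain
# exactly one "@" are skipped. Domains are folded to lower case.
tally_domains() {
    awk -F'@' 'NF == 2 { count[tolower($2)]++ }
               END { for (d in count) print count[d], d }' |
        sort -k1,1nr -k2,2
}
```

A domain that dominates the output while nothing for it is being delivered is a candidate for the kind of throttled destination discussed earlier in the thread.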
Before I had "qshape", I would get useless "the queue is large" alerts. I
no longer need to waste time on these.
And by the way you do have a virtual table. From your "postconf -n":
virtual_maps = hash:/etc/postfix/virtual
With Postfix 1.x, if you update this frequently, the queue manager may
restart frequently which can really hurt performance.
--
Viktor.
> We have a problem with a mail loop or something with one of our lists... I
> haven't tracked it down yet. It looks like one of our list admin's
> addresses is incorrect, so it's bouncing to our list admin's address.
>
One more thing: the most common source of "postmaster" or
"list-administrator" bounce loops is "procmail". Teach the administrator in
question to use procmail safely (primarily to not forward messages to
other email addresses via procmail), or take the matches away from the
children and don't let them play with procmail at all.
--
Viktor.
That's good to know. Yeah, SuSE has the virtual_maps parameter set up by
default, but the table itself is empty.
If this were not an emergency, must-drain-the-queue-now situation, I would
have installed the newest version of postfix and taken some more time to
tweak things.
> That's good to know. Yeah, SuSE has the virtual_map parameter setup, by
> default, but the table itself is empty.
Then lose it.
--
Ralf Hildebrandt Ralf.Hildebrandt@charite.de
my current spamtrap spam...@charite.de
http://www.arschkrebs.de/postfix/ Tel. +49 (0)30-450 570-155
"Where a calculator on the ENIAC is equipped with 18 000 vacuum tubes
and weighs 30 tons, computers of the future may have only 1 000 vacuum
tubes and perhaps weigh 1½ tons." - Popular Mechanics, March 1949.