Sendmail queue runner query - with deep queues starting from scratch...


JonB

Jan 22, 2010, 6:12:03 AM
Hi,

We have a number of sendmail servers that have particularly 'deep'
queues (30-60k messages) with queue run times often in tens of minutes
(a lot more for very deep queues).

We're looking at the best strategy for handling this.

At the moment we just run:

sendmail -q30s


Although 30 seconds may sound too aggressive - despite the deep queues
the machines aren't heavily disk bound so they do seem to cope.

Can someone confirm that the .cf setting (with no queue groups apart
from the default one defined):

O MaxRunnersPerQueue=80

Will dictate the maximum number of queue runners the above will ever
have running in parallel?


Also the above gives a very 'slow start' should sendmail be restarted
(assuming that the MaxRunnersPerQueue limit is adhered to) - it could
take up to 40 minutes for us to be back to the 80 runners again.
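
The 40-minute figure is simply the runner limit multiplied by the queue interval. A minimal sketch of that arithmetic, assuming the daemon forks one new runner per interval and no earlier runner exits in the meantime (the function name is only illustrative):

```python
def ramp_up_seconds(max_runners: int, interval_seconds: int) -> int:
    """Time until max_runners are alive, assuming one new runner is
    forked per queue interval and no runner exits before the limit
    is reached (an assumption, not measured sendmail behaviour)."""
    return max_runners * interval_seconds


# 80 runners at -q30s: 2400 seconds, i.e. 40 minutes.
minutes = ramp_up_seconds(80, 30) / 60
```

With a 15-minute interval the same limit would take 20 hours to reach, which is why the interval dominates the cold-start behaviour.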


We've also looked at 'sendmail -qp' - with a different strategy to be
run up say 80 of these '-qp' processes (obviously staggered).

This would appear to get us a much faster 'ramp up' time - we realise
there'll be a chance that at some point multiple sendmails are going
to be going through the queue files at the same time but that can
happen with the current system anyway.

Would that work?

I guess the 'best' strategy would be to somehow split the queue
equally between the 80 runners (using a modulo of the queue ID would
be ideal) - but I can't see any way to do that unless we had queue
groups (and if we do that - I can't see any way to get sendmail to
'round robin' between the queue groups when it receives the mail) -
only filter to a particular group based on domain, priority etc.
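
To make the modulo idea concrete, here is a rough sketch of how queue IDs could be spread over 80 buckets. It is purely hypothetical: nothing in this thread suggests sendmail offers such an option, the function names are invented, and the sample IDs are made up. Because queue IDs are not plain numbers, the sketch takes a CRC-32 of the ID first.

```python
import zlib


def bucket_for(queue_id: str, buckets: int = 80) -> int:
    """Map a queue ID to a bucket in [0, buckets).

    Queue IDs are not plain numbers, so a stable checksum (CRC-32)
    is taken first and the modulo applied to that.
    """
    return zlib.crc32(queue_id.encode("ascii")) % buckets


def split_queue(queue_ids, buckets: int = 80):
    """Group queue IDs by bucket; each runner would take one group."""
    groups = {}
    for qid in queue_ids:
        groups.setdefault(bucket_for(qid, buckets), []).append(qid)
    return groups
```

Each of the 80 runners would then only touch the IDs in its own group, so no two runners would compete for the same message.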

-Jon

Jose-Marcio Martins da Cruz

Jan 22, 2010, 5:09:16 PM
to JonB

Hi,

JonB wrote:
> Hi,
>
> We have a number of sendmail servers that have particularly 'deep'
> queues (30-60k messages) with queue run times often in tens of minutes
> (a lot more for very deep queues).
>
> We're looking at the best strategy for handling this.
>
> At the moment we just run:
>
> sendmail -q30s

The right strategy depends on where you sit between two extremes:

* you run a big list server and send out many messages that are delivered quickly,
i.e., most of them go out on the first tries;

* most of the messages fail temporarily and stay in the queue for a very long time.

In other words: what is the "mean stay time" of messages on your server, and why do they stay
so long?

Here is one idea, and only an idea: you should adapt it, tune it, and add some home-made sauce (queue runners
and other things).

Instead of running the queue every 30 seconds (too fast), set something like this:

O MinQueueAge=30m

This way no message in the queue will be retried at intervals shorter than 30 minutes (or even
longer, if you prefer). Then run the queue no more often than every 5 or 10 minutes:

sendmail -q10m
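
To illustrate how the two settings interact, here is a small model of the rule "skip a message unless at least MinQueueAge has passed since its last attempt". It is a simplified sketch, not sendmail's actual code, and the function names are invented for the example.

```python
def due_for_retry(now: float, last_attempt: float, min_queue_age: float) -> bool:
    """True if a queued message may be tried in this queue run."""
    return now - last_attempt >= min_queue_age


def retries_in(window: float, queue_interval: float, min_queue_age: float) -> int:
    """Count retries of one always-failing message in `window` seconds,
    with queue runs every queue_interval seconds and the first attempt
    made at t=0."""
    attempts = 0
    last_attempt = 0.0
    t = queue_interval
    while t <= window:
        if due_for_retry(t, last_attempt, min_queue_age):
            attempts += 1
            last_attempt = t
        t += queue_interval
    return attempts
```

In this model, with runs every 10 minutes and MinQueueAge=30m, a stuck message is retried twice in the first hour (retries_in(3600, 600, 1800)); without MinQueueAge it would be retried on all six runs.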

You can also have other queues, depending on message age. The idea is: sendmail puts the message
in the normal queue, which you run as explained above. Every hour you scan the queue and move messages
older than, say, 12 or 18 hours to another queue with lower priority. For this queue you still run
sendmail with "sendmail -q10m", but with something like "O MinQueueAge=2h", so older messages
are retried less frequently. You can use qtool.pl, a Perl script found in the
contrib directory, to move messages from one queue to the other. These queues are just different
directories.
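
The hourly scan needs a way to decide which messages count as "old". One possible sketch is below; it assumes the qf control file carries the creation time on a line of the form T<epoch seconds>, and it only decides which messages qualify. The move itself is best left to qtool.pl as described above. The function names and sample text are illustrative.

```python
import time


def created_at(qf_text: str):
    """Return the creation time (epoch seconds) from a qf file's text,
    or None if no T line is found. Assumes the 'T<seconds>' line format."""
    for line in qf_text.splitlines():
        if line.startswith("T") and line[1:].strip().isdigit():
            return int(line[1:].strip())
    return None


def is_old(qf_text: str, max_age_seconds: int, now=None) -> bool:
    """True if the message was created more than max_age_seconds ago."""
    created = created_at(qf_text)
    if created is None:
        return False  # unknown age: leave the message where it is
    if now is None:
        now = time.time()
    return now - created > max_age_seconds
```

The file's modification time is deliberately not used here, since that would change whenever the control file is rewritten and so would not reflect the age of the message.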

Also, take a look at the sendmail book by Bryan Costales; there are some hints there. Many years
ago there was an interesting book (Sendmail Performance Tuning), but it covers old sendmail
versions (the ideas are still valid) and it's out of print.

Hope this helps

David F. Skoll

Jan 22, 2010, 5:51:00 PM
JonB wrote:

> We have a number of sendmail servers that have particularly 'deep'
> queues (30-60k messages) with queue run times often in tens of minutes
> (a lot more for very deep queues).

> We're looking at the best strategy for handling this.

We use a variety of strategies, some of which might not be appropriate for
you:

1) We use define(`confQUEUE_SORT_ORDER',`random')dnl (or `none')
so the queue runner doesn't have to read all the qf files before starting up.
This is not appropriate if out-of-order delivery can't be tolerated.
It does greatly reduce the "ramp-up" time, though.

2) We sometimes limit confMAX_QUEUE_RUN_SIZE on very large queues just
to pick them off a manageable chunk at a time. This also greatly
reduces the ramp-up time.

3) We use a fallback MX host (with appropriately-tuned queue settings)
to keep the queues on our main mail server small.

Regards,

David.

Andrzej Adam Filip

Jan 23, 2010, 3:47:38 AM
JonB <jfr...@googlemail.com> wrote:
> [...]

> I guess the 'best' strategy would be to somehow split the queue
> equally between the 80 runners (using a modulo of the queue ID would
> be ideal) - but I can't see any way to do that unless we had queue
> groups (and if we do that - I can't see any way to get sendmail to
> 'round robin' between the queue groups when it receives the mail) -
> only filter to a particular group based on domain, priority etc.

Sendmail can use multiple queue directories even in the "default queue group".
I would suggest using
define(`QUEUE_DIR', `/var/spool/mqueue*')
to allow adding /var/spool/mqueue1, /var/spool/mqueue2, ... to the
default/existing /var/spool/mqueue.

As I understand it, sendmail starts separate queue runner(s) for each queue directory.

Adding separate queue groups for "top destinations" would be wise anyway.

URL(s):
http://www.sendmail.org/~gshapiro/8.10.Training/mqueue.html
[source of keywords]

--
[pl>en Andrew] Andrzej Adam Filip : an...@onet.eu : Andrze...@gmail.com
Open-Sendmail: http://open-sendmail.sourceforge.net/
Eureka!
-- Archimedes

Andrzej Adam Filip

Jan 23, 2010, 9:32:14 AM
Andrzej Adam Filip <an...@onet.eu> wrote:
> JonB <jfr...@googlemail.com> wrote:
>> [...]
>> I guess the 'best' strategy would be to somehow split the queue
>> equally between the 80 runners (using a modulo of the queue ID would
>> be ideal) - but I can't see any way to do that unless we had queue
>> groups (and if we do that - I can't see any way to get sendmail to
>> 'round robin' between the queue groups when it receives the mail) -
>> only filter to a particular group based on domain, priority etc.
>
> Sendmail can use multiple queue directories even in the "default queue group".
> I would suggest using
> define(`QUEUE_DIR', `/var/spool/mqueue*')
> to allow adding /var/spool/mqueue1, /var/spool/mqueue2, ... to the
> default/existing /var/spool/mqueue.
>
> As I understand it, sendmail starts separate queue runner(s) for each queue directory.
>
> Adding separate queue groups for "top destinations" would be wise anyway.
>
> URL(s):
> http://www.sendmail.org/~gshapiro/8.10.Training/mqueue.html
> [source of keywords]

<quote src="RELEASE_NOTES">
[...]
8.10.0/8.10.0 2000/03/01
[...]
Support multiple queue directories. To use multiple queues, supply a
QueueDirectory option value ending with an asterisk. For example,
/var/spool/mqueue/q* will use all of the directories or symbolic links
to directories beginning with 'q' in /var/spool/mqueue as queue
directories. Keep in mind, the queue directory structure should not
be changed while sendmail is running. Queue runs create a separate
process for running each queue unless the verbose flag is given on a
non-daemon queue run. New items are randomly assigned to a queue.
Contributed by Exactis.com, Inc.
</quote>
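
As a rough illustration of the two behaviours described in the release notes (a trailing asterisk selecting every matching directory, and new items being assigned to a queue at random), the following sketch mimics them with ordinary globbing. It is not sendmail's implementation, and the function names are made up for the example.

```python
import glob
import os
import random


def queue_dirs(pattern: str):
    """Expand a QueueDirectory-style pattern ending in '*' into the
    matching directories (symlinks to directories included)."""
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def pick_queue(dirs, rng=random):
    """Pick one queue directory at random for a new message."""
    return rng.choice(dirs)
```

Random assignment keeps the directories roughly balanced over time, but it is not the strict round robin asked about earlier in the thread.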

--
[pl>en Andrew] Andrzej Adam Filip : an...@onet.eu : Andrze...@gmail.com

It is the nature of extreme self-lovers, as they will set an house on fire,
and it were but to roast their eggs.
-- Francis Bacon

Chih-Cherng Chin

Jan 24, 2010, 8:35:30 AM
On 2010-01-22, JonB <jfr...@googlemail.com> wrote:
> We have a number of sendmail servers that have particularly 'deep'
> queues (30-60k messages) with queue run times often in tens of minutes
> (a lot more for very deep queues).
[snip]
It is my understanding that traditionally, unix performs badly when
there are too many files in a single directory. Of course that depends
on what kind of file system you are using.

Have you investigated why so many messages were queued in
the first place? Were most messages successfully delivered, or did a lot
of them fail and generate even more messages (delivery failure
reports)?

If a large portion of queued messages fail to be sent, maybe
effective mail filtering is needed. But the correct solution
depends on what the real problem is.

--
Chih-Cherng Chin
Botnet Detection with Greylisting
http://botnet-tracker.blogspot.com/search/label/greylisting

David F. Skoll

Jan 24, 2010, 10:40:05 AM
Chih-Cherng Chin wrote:

> It is my understanding that traditionally, unix performs badly when
> there are too many files in a single directory.

I don't think that's true any more for modern file systems. Even ext3
has a "dir_index" option that speeds up operations in large directories
(and most distros enable it.)

Regards,

David.

JonB

Jan 25, 2010, 11:13:33 AM

> Sendmail can use multiple queue directories even in the "default queue group".
> I would suggest using
>    define(`QUEUE_DIR', `/var/spool/mqueue*')
> to allow adding /var/spool/mqueue1, /var/spool/mqueue2, ... to the
> default/existing /var/spool/mqueue.

We're already running that - and we've even gone so far as setting
'noatime' on the queue filesystems - file system / queue performance
isn't really the problem...

> Adding separate queue groups for "top destinations" would be wise anyway.

We did briefly look at that - but it becomes a real pain / 'manual
intervention' - and you have to also work out your top destinations,
or the ones currently having problems (I'm not saying you should never
have to look at the mail queues - but the less 'tweaks' the better).

The problem we currently have - is - imagine sendmail starting from
'scratch' - it starts up.

The first queue runner runs, reads the entire queue (which afaik - is
what they do even if in 'random' queue sort order - I think even
'filename' order will still read the entire queue - it just doesn't do
any queue sort/processing before starting the run). That process even
with a queue depth of 60k takes around 1-2 seconds.

If that queue runner gets bogged down in 'timeout land' - it can be
over an hour or so before it finishes. During that time - if we
started sendmail with a traditional 'sendmail -q15m' - only 4 more
queue runners would have been launched.

Even if (and I haven't looked yet) - our '/var/spool/mqueue/
queue1,2,3,4,5,6,7,8' start their own queue runners (4 per directory)
- there's so many entries in 'each queue' - that they're simply not
launched quickly enough.

We already have the 'MinQueueAge' set - what we have problems with is
keeping the queue churned fast enough, to keep to that.

The machine will do it - but only if we 'lean' on sendmail (e.g.
'sendmail -q30s').

My original question was:

If we use 'sendmail -q30s' will 'O MaxRunnersPerQueue=80' in
sendmail.cf keep the number of queue runners limited to 80 for that
machine?

Otherwise, with such an aggressive rate as '30s' - obviously the
machine will (even with its fast disks, and FS tweaks) eventually
start chasing its own tail (and bottleneck somewhere).

If we can't rely on 'O MaxRunnersPerQueue=80' to keep us to 80 queue
runners, can we rely on running:

80 * 'sendmail -qp' as separate processes?


The only other option we can see is to have a 'custom' sendmail launcher
- that will look at how many sendmail queue runners are running - and
automatically launch another one (if under a limit) - and perhaps
'delay' that launch if it can detect there are already too many
existing processes actually doing the fscan through the queue.
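
The decision logic of such a launcher could be as small as the sketch below. How the two counts are obtained (for example from the process list) is left out, and the function name and the default thresholds are hypothetical.

```python
def should_launch(running: int, scanning: int,
                  max_runners: int = 80, max_scanning: int = 2) -> bool:
    """Decide whether to start another queue runner.

    running      -- queue runners currently alive
    scanning     -- runners still reading the queue (not yet delivering)
    max_runners  -- overall cap
    max_scanning -- how many concurrent queue scans are tolerated
    """
    if running >= max_runners:
        return False
    if scanning >= max_scanning:
        return False  # delay: too many processes still scanning the queue
    return True
```

A wrapper would call this every few seconds and start one more runner whenever it returns True, which fills the machine much faster than waiting for a fixed queue interval.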

The machine will handle the load - sendmail's default options just
don't seem to be aggressive enough to get it up to 'full capacity' from
a cold start (for a long time) - and even then they don't really
'push' the machine - unless we push sendmail hard.

-Jon

Andrzej Adam Filip

Jan 25, 2010, 3:16:34 PM
JonB <jfr...@googlemail.com> wrote:
>> Sendmail can use multiple queue directories even in the "default queue group".
>> I would suggest using
>>    define(`QUEUE_DIR', `/var/spool/mqueue*')
>> to allow adding /var/spool/mqueue1, /var/spool/mqueue2, ... to the
>> default/existing /var/spool/mqueue.
>
> We're already running that - and we've even gone so far as setting
> 'noatime' on the queue filesystems - file system / queue performance
> isn't really the problem...
>
>> Adding separate queue groups for "top destinations" would be wise anyway.
>
> We did briefly look at that - but it becomes a real pain / 'manual
> intervention' - and you have to also work out your top destinations,
> or the ones currently having problems (I'm not saying you should never
> have to look at the mail queues - but the less 'tweaks' the better).

I remember one study from a few years ago showing that the top 50 domains received
70%+ of outgoing traffic at one specific site. Your mileage may vary,
but it gives you some idea of what to expect.

> The problem we currently have - is - imagine sendmail starting from
> 'scratch' - it starts up.
>
> The first queue runner runs, reads the entire queue (which afaik - is
> what they do even if in 'random' queue sort order - I think even
> 'filename' order will still read the entire queue - it just doesn't do
> any queue sort/processing before starting the run). That process even
> with a queue depth of 60k takes around 1-2 seconds.
>
> If that queue runner gets bogged down in 'timeout land' - it can be
> over an hour or so before it finishes.

You use the hoststatus directory to avoid excessive retries to "inaccessible
sites", don't you?

> During that time - if we
> started sendmail with a traditional 'sendmail -q15m' - only 4 more
> queue runners would have been launched.
>
> Even if (and I haven't looked yet) - our '/var/spool/mqueue/
> queue1,2,3,4,5,6,7,8' start their own queue runners (4 per directory)
> - there's so many entries in 'each queue' - that they're simply not
> launched quickly enough.
>
> We already have the 'MinQueueAge' set - what we have problems with is
> keeping the queue churned fast enough, to keep to that.
>
> The machine will do it - but only if we 'lean' on sendmail (e.g.
> 'sendmail -q30s').
>
> My original question was:
>
> If we use 'sendmail -q30s' will 'O MaxRunnersPerQueue=80' in
> sendmail.cf keep the number of queue runners limited to 80 for that
> machine?

0) I think MaxQueueChildren would be a better choice.
1) I think it should deliver what you want

> Otherwise, with such an aggressive rate as '30s' - obviously the
> machine will (even with its fast disks, and FS tweaks) eventually
> start chasing its own tail (and bottleneck somewhere).
>
> If we can't rely on 'O MaxRunnersPerQueue=80' to keep us to 80 queue
> runners, can we rely on running:
>
> 80 * 'sendmail -qp' as separate processes?
>
> The only other option we can see is to have a 'custom' sendmail launcher
> - that will look at how many sendmail queue runners are running - and
> automatically launch another one (if under a limit) - and perhaps
> 'delay' that launch if it can detect there are already too many
> existing processes actually doing the fscan through the queue.
>
> The machine will handle the load - sendmail's default options just
> don't seem to be aggressive enough to get it up to 'full capacity' from
> a cold start (for a long time) - and even then they don't really
> 'push' the machine - unless we push sendmail hard.
>
> -Jon

--

[pl>en Andrew] Andrzej Adam Filip : an...@onet.eu : Andrze...@gmail.com

The opposite of a correct statement is a false statement. But the opposite
of a profound truth may well be another profound truth.
-- Niels Bohr

JonB

Jan 26, 2010, 5:46:18 AM
On Jan 25, 8:16 pm, Andrzej Adam Filip <a...@onet.eu> wrote:

> You use the hoststatus directory to avoid excessive retries to "inaccessible
> sites", don't you?

We did look at using that - but it caused more issues than it solved,
firstly because some of the sites we deliver to use load-balancers,
with a single published MX (not nice - as Sendmail thinks that IP is
down - and will 'not try again' for some time, when in reality
subsequent connects are likely to get through).

Also - some sites will return 4xx defers for certain email addresses.
Sendmail appears to cache this fact, and again - will not even attempt
the other addresses for that MX for a time period - sure, this is
correct behaviour if the destination MX is having issues, but it hurts if
it's just that one destination address that's having issues.

(In fact, I've just posted separately asking for confirmation about
Timeout.hoststatus vs. 4xx and 5xx responses.)

-Jon

Andrzej Adam Filip

Jan 26, 2010, 1:45:33 PM

Have you considered using both "queue run by daemon" without hoststatus
and "queue runs from cron" with hoststatus?

--
[pl>en Andrew] Andrzej Adam Filip : an...@onet.eu : Andrze...@gmail.com

Open-Sendmail: http://open-sendmail.sourceforge.net/
Nothing succeeds like success.
-- Alexandre Dumas

JonB

Jan 27, 2010, 3:56:38 AM
On Jan 26, 6:45 pm, Andrzej Adam Filip <a...@onet.eu> wrote:

> Have you considered using both "queue run by daemon" without hoststatus
> and "queue runs from cron" with hoststatus?

Hmmm, no - I hadn't... Interesting, though I'm not too sure what that
will get me...

As an example of the problem - a server today has 34k messages in its
queue (which isn't *that* many) and has been running uninterrupted for
several hours. It's running (for better or for worse):

sendmail -q30s

As a background process to launch queue runners.

Number of queue runners running at the moment? 21.

Each queue runner is taking just shy of an hour to go through messages
it finds in the queue.

'MaxRunnersPerQueue' on this server is set to 230.

If they take nearly an hour to finish, and it's launching one every 30
seconds - how come there are only 21 running?
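
Going by the figures above (a new runner every 30 seconds, each living close to an hour, a cap of 230), the expected steady-state count can be estimated as run time divided by launch interval. This is a back-of-the-envelope sketch that assumes every scheduled launch really happens; the function name is illustrative.

```python
def expected_runners(run_seconds: float, interval_seconds: float,
                     max_runners: int) -> int:
    """Runners alive in steady state if one starts every interval and
    each lives run_seconds, capped at max_runners."""
    return min(max_runners, int(run_seconds // interval_seconds))
```

For this server, expected_runners(3600, 30, 230) comes to 120, so seeing only 21 suggests that most of the scheduled launches are not producing a long-lived runner.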

If I manually launch at 5 second intervals a whole bunch (say, 50
queue runners) they all go and find mail to try [and a whole bunch
gets delivered].

It's almost like 'sendmail -q30s' is ultra conservative about
launching runners... The machine isn't heavily loaded (LA = 0.04 at
the moment) - there is no disk bottleneck there is plenty of memory -
but it just isn't launching as many queue runners as you would expect
(or as are needed).

I might try removing 'sendmail -q30s' on this server today and instead
run up say 80 'sendmail -qp's - though in the documentation I've seen
it says:

"With -qp sendmail at the start of every queue run reads all the qf
files, sorts them and uses it to control the order in which the queue
is run, it then **forks multiple processes** to deliver the mail it
has sorted and sleeps the queue interval before awakening again. This
ensures no other queue runners are started while it's reading the
queue".

Nowhere can I find out how many 'multiple processes' is. Is it up to
the 'MaxRunnersPerQueue' limit? Is it a standard 'One queue runner
per queue group'?

Guess I'll have to go and find out :)

-Jon

JonB

Jan 27, 2010, 4:08:43 AM
On Jan 27, 8:56 am, JonB <jfre...@googlemail.com> wrote:

> "With -qp sendmail at the start of every queue run reads all the qf
> files, sorts them and uses it to control the order in which the queue
> is run, it then **forks multiple processes** to deliver the mail it
> has sorted and sleeps the queue interval before awakening again. This
> ensures no other queue runners are started while it's reading the
> queue".
>
> Nowhere can I find out how many 'multiple processes' is. Is it up to
> the 'MaxRunnersPerQueue' limit? Is it a standard 'One queue runner
> per queue group'?
>
> Guess I'll have to go and find out :)

To answer my own question - it appears to fork up as many queue
runners as is allowed in the .cf file as 'MaxRunnersPerQueue' - the
system originally had 230 odd running.

However after a promising start this has fallen to 90 runners now.

I just hope '-qp' doesn't hold off on doing another 'read the queue
and fork' run until *every* queue runner it started has finished... I
have a horrible feeling it might.

-Jon

Andrzej Adam Filip

Jan 27, 2010, 5:09:55 AM
JonB <jfr...@googlemail.com> wrote:
> On Jan 26, 6:45 pm, Andrzej Adam Filip <a...@onet.eu> wrote:
>
>> Have you considered using both "queue run by daemon" without hoststatus
>> and "queue runs from cron" with hoststatus?
>
> Hmmm, no - I hadn't... Interesting, though I'm not too sure what that
> will get me...
> [...]

You wrote that you cannot use it for *some* sites, but you can arrange to get
its benefits for most sites :-)

BTW, have you considered creating an "old-messages" queue group?

sendmail.cf will not put anything in it, but your cron scripts will move
messages that are a few hours old (e.g. 4h) to it using the re-mqueue.pl script
from the contrib directory in the sendmail distribution.

--
[pl>en Andrew] Andrzej Adam Filip : an...@onet.eu : Andrze...@gmail.com
Open-Sendmail: http://open-sendmail.sourceforge.net/

Paradise is exactly like where you are right now ... only much, much better.
-- Laurie Anderson

David F. Skoll

Jan 27, 2010, 8:00:19 PM
JonB wrote:

> The first queue runner runs, reads the entire queue (which afaik - is
> what they do even if in 'random' queue sort order - I think even
> 'filename' order will still read the entire queue

'filename' calls readdir() a bunch of times, but I don't believe it
actually opens and reads each qf file.

I think the parameter you want is confMAX_QUEUE_RUN_SIZE aka
MaxQueueRunSize. If you set that to 1000, then the queue runners stop
reading the queue once they hit 1000 entries and start processing it. If
you run lots of queue runners, your queue will be processed in manageable
chunks with a relatively low start-up cost for each run.

Usual caveats about potential out-of-order delivery apply.
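
The effect of capping the run size can be pictured with a short sketch that stops scanning a directory once a limit is reached. This only illustrates the idea of taking a manageable chunk; it is not how sendmail reads its queue, and the function name is invented.

```python
import itertools
import os


def first_entries(queue_dir: str, limit: int = 1000):
    """Return at most `limit` qf file names from queue_dir, stopping the
    directory scan as soon as the limit is reached."""
    with os.scandir(queue_dir) as entries:
        qf_names = (e.name for e in entries if e.name.startswith("qf"))
        return list(itertools.islice(qf_names, limit))
```

Because the entries come back in directory order rather than by age or priority, the chunk is arbitrary, which is where the out-of-order caveat comes from.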

Regards,

David.

JonB

Jan 28, 2010, 5:31:52 AM
On Jan 28, 1:00 am, "David F. Skoll" <d...@roaringpenguin.com> wrote:
> JonB wrote:
> > The first queue runner runs, reads the entire queue (which afaik - is
> > what they do even if in 'random' queue sort order - I think even
> > 'filename' order will still read the entire queue
>
> 'filename' calls readdir() a bunch of times, but I don't believe it
> actually opens and reads each qf file.
>
> I think the parameter you want is confMAX_QUEUE_RUN_SIZE aka
> MaxQueueRunSize.  If you set that to 1000, then the queue runners stop
> reading the queue once they hit 1000 entries and start processing it.  If
> you run lots of queue runners, your queue will be processed in manageable
> chunks with a relatively low start-up cost for each run.

I don't think it's the actual startup cost that's an issue - the
biggest problem is where people have 3 MX's listed, and each waits the
full 10 minutes timeout for "end of data" before that forked delivery
process ends... Without sufficient deliveries going on - you run the
risk of having other 'candidates' languishing in the queue.

So far, I've found:

- Limiting the max queue runners per queue to 16

- Running 16 * 'sendmail -qp' (staggered)

Gives the kind of 'out of the gate' performance we need when
we're starting from cold - and also seems to keep
considerably more (>200) deliverers up and running consistently (as opposed
to 40 odd previously).
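
For the staggering, one simple approach is to spread the 16 launches evenly over a fixed window. The sketch below only computes the start delays; the 80-second window used in the note after it is an assumption chosen to give 5-second gaps, and the function name is illustrative.

```python
def stagger_offsets(count: int, window_seconds: float):
    """Start delays (seconds) that spread `count` launches evenly
    across window_seconds, the first one at 0."""
    if count <= 0:
        return []
    step = window_seconds / count
    return [round(i * step, 3) for i in range(count)]
```

Calling stagger_offsets(16, 80) gives delays of 0, 5, 10 and so on up to 75 seconds; a wrapper script would wait for each delay and then start one 'sendmail -qp' process.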

Sure - some may find they have no work to do (especially as queues
empty) but I'd rather that than having older entries not attended to in
time.

Longer term we may look to moving hard to deliver email to separate
queues - I think the last time we looked at that (few years ago) -
there were issues because our queue directories are spread over
spindle sets (something to do with hard links, and the tools to shift
queued mail) - but definitely worth rechecking.

-Jon
