Switching to mod_wsgi 4.0 to avoid listener backlog starvation


stefanoC

Dec 6, 2011, 5:47:49 AM
to modwsgi
Finally managed to jump from
http://serverfault.com/questions/335633/apachemod-wsgi-configuration-for-django-projects-on-a-quad-core
to this mailing list.

I am ready to try the switch; I have already compiled and run mod_wsgi 4.0
on a test machine with the same config as usual.

Before going to production I would need to know (Graham, I understand
you have a full-time job at New Relic, so thanks in advance for whenever
you have the time to jump on this!):

* What are the new mod_wsgi settings for the apache2 conf? What about
"listen-backlog"
(http://groups.google.com/group/modwsgi/browse_thread/thread/b6d66d3fe5a53d2c/)?
And what about "blocked-requests" and "blocked-timeout"
(http://groups.google.com/group/modwsgi/msg/2a968d820e18e97d)?

In Graham's answer to my serverfault question:
> if number of processes/threads across Apache child worker processes is less than 100, the daemon process listener backlog, then all those threads can also get stuck and you will not know

I currently use these settings on a quadcore:

<IfModule mpm_worker_module>
StartServers 2
ServerLimit 4
MinSpareThreads 2
MaxSpareThreads 4
ThreadLimit 32
ThreadsPerChild 16
MaxClients 64#128
MaxRequestsPerChild 10000
</IfModule>

WSGIDaemonProcess subdomain.domain user=www-data group=www-data threads=25

Is this sensible?
After reading http://groups.google.com/group/modwsgi/browse_thread/thread/edffb22b2eac134b
and again http://groups.google.com/group/modwsgi/browse_thread/thread/b6d66d3fe5a53d2c/
I see that my thread settings might not be a good fit...

Anything else I'd need to know?

Graham Dumpleton

Dec 7, 2011, 10:33:10 PM
to mod...@googlegroups.com
Sorry for not getting to this in a timely manner. Am having a bit of a
mind block on this stuff at the moment. I still need to go in and
tweak some stuff in mod_wsgi and not able to get my head around it.

Most of the information is covered in those posts you link to.

First up, just suggest using blocked-timeout of 60 as a fail-safe to at
least trigger a restart when everything has been blocked up for 60 seconds.
Wouldn't worry about blocked-requests yet as I need to tweak stuff
related to that option.

As to listen-backlog, still trying to work out what is the best thing to
do with the new ability to change it. Even if dropped to a low value, a
retry mechanism kicks in with mod_wsgi when it tries to connect to
daemon processes. One needs this to ensure that when daemon processes
all restart at the same time and a new process is not quite in a state to
accept new connections, the request doesn't fail straight away. Am not
totally sure that is valid though and have to dig into it further. The
retry may in part not be needed as strictly speaking it may only kick in
when the listen backlog of the daemon is full.

So, have to do some further analysis.

BTW, make sure you have:

LogLevel info

in order to get stack trace dumps. I still need to change things so
that they are logged at error level so always visible, and change the
message about why they are being logged.

Graham


stefanoC

Dec 8, 2011, 7:19:01 AM
to modwsgi
Thanks Graham,

this is a very timely answer for a side-project!

I'll try this then, and post my experience back here, though I expect
it will not be for a few days (we don't want to put anything on the
production server on a Friday for obvious reasons, and I don't think
I'll manage today).

I'll be monitoring the mailing list should you come up with some
updates!

thanks
Stefano


Rodrigo Campos

Dec 11, 2011, 1:54:13 PM
to mod...@googlegroups.com
On Tue, Dec 06, 2011 at 02:47:49AM -0800, stefanoC wrote:
> Finally managed to jump from
> http://serverfault.com/questions/335633/apachemod-wsgi-configuration-for-django-projects-on-a-quad-core
> to this mailing list.

Interesting thread. We might be hitting some similar issue

> I am ready to try the switch, already compiled and run mod_wsgi 4.0 on
> a test machine with the same config as usual.

We were about to try it, but some other stuff got in the way. We plan to do it,
but not sure when :(

But if you do it before us, please let us know how it went and how you did it.
Are you using Debian/Ubuntu? Did you make a Debian package? I was planning to
make a Debian package (we are using Ubuntu) based on Debian's package, using
uupdate (I think it should be that easy, although I haven't tried it yet).

Thanks,
Rodrigo

stefanoC

Dec 13, 2011, 6:53:33 AM
to modwsgi
I'm building / installing from source on Ubuntu (an old one!).

I'm currently running some tests on the pre-production machine before
we jump to production, and the biggest headache is still finding the best
apache2 worker conf that will not overload the machine.

But I'll keep you posted!


stefanoC

Feb 6, 2012, 4:52:44 AM
to modwsgi
I'm back!

I've got both good and bad news.

The good: mod_wsgi 4.0 worked perfectly well.

The bad: I still did not manage to fine tune the apache2+mod_wsgi
+monitoring settings to a point where I was in control.

I have an issue, definitely to be solved outside wsgi, of too many
slow requests piling up at times.
When it happens, there's a starvation effect that completely locks
apache2 - it has to be restarted, or it won't serve any more requests.
The slow requests are database intensive, and can't easily be made
asynchronous. There are really few of them, but when more than a
handful happen together, problems start.

I had set up monit to restart apache2 when it becomes unresponsive,
but getting the timing right for this is difficult, and I end up
overloading the machine more than anything by restarting slow
processes that were doing OK.
Also, at times of high traffic I could end up with monit restarting
apache2 several times in a row and eventually giving up, leaving me with
a zombie apache2.

Again, I repeat this has nothing to do with mod_wsgi itself. I tried
switching to gunicorn (I already had nginx as a frontend);
performance itself is quite similar (I did not benchmark, but at
least New Relic was showing similar throughput).
But it's so much easier to tune (the settings are few, and there isn't
an apache2 to take care of) and monitor (with supervisord) that I'm
sticking with this solution for the time being. Had I a better
knowledge of Linux/Apache internals (or maybe if 4.0 were official with
precise docs for the new settings) I might have achieved the same
results with mod_wsgi.

Thanks Graham for your support. I definitely do not mean to invite
others to move away from mod_wsgi, which had been serving me well until
this problem arose, but I felt I should share my feedback. BTW, the
apache2/mod_wsgi confs are still there and I can switch from one to
the other with a few line changes...

Stefano

Graham Dumpleton

Feb 6, 2012, 5:33:21 AM
to mod...@googlegroups.com
On 6 February 2012 20:52, stefanoC <stefano...@gmail.com> wrote:
> I'm back!
>
> I've got both good and bad news.
>
> The good: mod_wsgi 4.0 worked perfectly well.
>
> The bad: I still did not manage to fine tune the apache2+mod_wsgi
> +monitoring settings to a point where I was in control.
>
> I have an issue, definitely to be solved outside wsgi, of too many
> slow requests piling up at times.
> When it happens, there's a starvation effect that completely locks
> apache2 - it has to be restarted, or it won't serve any request
> anymore.

Yep. I am aware of the issue. So long as the request threads don't
block completely, which is a different issue handled by the
blocked-requests option, it isn't so much that it locks Apache up, but
that it causes an internal backlog of requests. Even when the long
requests finish, the daemon processes will still process that backlog
even though the original users may have given up. In processing the big
backlog, because you then get a big influx of requests together, you
might again end up with a lot of longer requests all coinciding and so
it starts over. Thus it can take a while for things to stabilise,
although if the resources of the system as a whole aren't sufficient,
it could simply make the whole box grind to a halt.
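
To put a rough number on how long such a backlog takes to clear, here is a minimal Python sketch. It is not anything from mod_wsgi; the function name and the figures are purely illustrative, and it assumes every queued request takes the same time and that no new requests arrive while the backlog drains.

```python
import math


def backlog_drain_seconds(backlog, threads, seconds_per_request):
    """Rough time for `threads` workers to clear `backlog` queued requests.

    Simplifying assumptions (hypothetical, for illustration only): every
    request takes `seconds_per_request` and nothing new arrives meanwhile.
    """
    if threads <= 0:
        raise ValueError("threads must be positive")
    # Requests are handled in waves of at most `threads` at a time.
    waves = math.ceil(backlog / threads)
    return waves * seconds_per_request


# Hypothetical numbers: 100 queued requests, 25 daemon threads,
# 10 seconds per slow request.
print(backlog_drain_seconds(100, 25, 10))  # 40
```

In practice new requests keep arriving during those 40 seconds, which is why the backlog can refill as fast as it drains.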

This can to a degree also happen when nginx is used as a front end,
as any multi-hop solution will introduce these potential backlog
points solely due to the socket listen queue size for each socket.
Apache/mod_wsgi currently makes it a bit worse and easier to trigger
though.

For Apache/mod_wsgi you have the default (but configurable) listen
backlog of 100. So, if all processes/threads were busy, 100 more
requests could still queue up before clients start getting connection
refused. At the same time, you will have as many requests as you have
processes/threads in an accepted state and being handled within Apache
child worker processes themselves, or if using daemon mode, being
proxied to the daemon process.

Normally the number of Apache MPM processes/threads would be more than
the number of mod_wsgi daemon processes/threads, but because the daemon
processes also have a listen backlog of 100, when all daemon
processes/threads are busy, proxied requests will queue up internally
and, depending on how the numbers work, it all acts as a big funnel with
no way for things to break out. In other words, if the total number of
Apache MPM threads across all processes is less than daemon threads
+ 100, you will never get a connection refused from the daemon process.
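
The funnel condition can be checked with simple arithmetic. The Python sketch below is only an illustration: the function name is made up, and it treats the listen backlog as an exact limit, which is a simplification of how operating systems actually size the queue. The first call uses the figures posted earlier in this thread (MaxClients 64, one daemon process with threads=25).

```python
def daemon_can_refuse(mpm_threads_total, daemon_threads_total,
                      daemon_listen_backlog=100):
    """Return True if Apache could ever have more proxied requests
    outstanding than the daemon side can hold (busy threads plus backlog).

    Hypothetical helper for illustration; the backlog is treated as an
    exact limit, which real kernels do not strictly guarantee.
    """
    daemon_capacity = daemon_threads_total + daemon_listen_backlog
    return mpm_threads_total > daemon_capacity


# Figures from earlier in this thread: MaxClients 64, threads=25.
print(daemon_can_refuse(64, 25))   # False, since 64 <= 25 + 100

# Hypothetical larger MPM setup for comparison.
print(daemon_can_refuse(150, 25))  # True, since 150 > 25 + 100
```

With the settings from the original post the daemon can always absorb every request Apache hands it, so nothing is ever refused and the backlog simply builds up.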

Even if it was exceeded and you got a connection refused, the proxy
code for talking to the daemon process makes further attempts to
connect to the daemon process. This was done to cope with issues where
daemon processes are not quite ready due to restarts or otherwise.

The problem here is the combination of the large daemon listen backlog
(which hasn't even been configurable until 4.0) as well as the retry
mechanism.

What I have started playing with, but never got a chance to finish
what I was doing, was to make the daemon listen backlog configurable,
or automatically adjust based on daemon and MPM config, but also
change when/how reconnect attempts are made.

The eventual aim was to introduce a way out of the funnel so you don't
get the backlogging problem and the issues it causes with daemon
processes getting overwhelmed and not being able to catch up again.

So, by one way or another, the aim is for 503 errors to be generated
when internal backlog occurs with the 503 going back to the client.
That way if the daemon processes get overwhelmed, then new requests
coming in will time out and get thrown away. Yes, it will mean users see
errors, but at least then you don't get a backlog and so when the daemon
recovers, it has not got a pipe stuffed full of requests it will then
not be able to handle.
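
The general idea of shedding load rather than queueing it can be shown with a toy bounded queue in Python. This is a sketch of the concept only, not how mod_wsgi does or will implement it: the class name, the size-based limit (rather than a timeout) and the 202 status for accepted work are all assumptions made for the example.

```python
class SheddingQueue:
    """Toy bounded queue: accept work up to `limit`, otherwise answer 503."""

    def __init__(self, limit):
        self.limit = limit
        self.pending = []

    def submit(self, request):
        # Reject immediately instead of letting the backlog grow unbounded.
        if len(self.pending) >= self.limit:
            return 503
        self.pending.append(request)
        return 202  # accepted for processing (illustrative status only)

    def take(self):
        # Called by a free worker; returns None if nothing is queued.
        return self.pending.pop(0) if self.pending else None


queue = SheddingQueue(limit=2)
print([queue.submit(r) for r in ("a", "b", "c")])  # [202, 202, 503]
```

Once a worker takes an item off the queue, the next submission is accepted again, so the queue never holds more work than the workers can realistically catch up on.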

Graham

stefanoC

Feb 12, 2012, 7:06:38 PM
to mod...@googlegroups.com
Graham,

thanks for your feedback. Very interesting to understand better what's going on. Indeed, it sounds like this could happen in my current configuration too, though in practice supervisor kills very slow processes one by one without minding what's really going on. Less drastic than my previous solution (monit, which would restart the whole apache!) yet not fully optimal.

I agree that I'd rather users see a 503 when too many requests start piling up rather than having a long backlog building up. In addition, as you mention, many users might simply give up when loading gets too slow, and just retry a bit later.
It should definitely work better as an average overall experience.

I'll definitely keep following the mailing list and the google docs wiki (PS: I do know how difficult, tedious and long that is, but by all means keep the wiki up to date, especially when you feel 4.0 is mature!)

thanks,
Stefano

Graham Dumpleton

Feb 12, 2012, 7:47:15 PM
to mod...@googlegroups.com
I am actually going to be abandoning the docs on the Google Code wiki
as am shifting to ReadTheDocs for the mod_wsgi 4.0 docs.

Either way, just don't have the time right now to even work on
mod_wsgi 4.0 and make that behaviour better.

Hopefully things will ease off a little after PyCon and will have some
more time then to work on mod_wsgi and also follow up questions on the
mailing list properly.

Graham

