processes and thread sizing


Kent

Sep 6, 2012, 2:14:10 PM
to mod...@googlegroups.com
Is there a way to tell when/whether/how often we hit the condition that an http request is fed to mod_wsgi for which there is no currently available process thread, so it must wait in queue?  Can this be logged?  I'm trying to figure out how to appropriately size my processes and threads parameters, any help there is much appreciated!

Thank you!

Graham Dumpleton

Sep 7, 2012, 1:43:10 AM
to mod...@googlegroups.com
First up, go watch:

http://lanyrd.com/2012/pycon/spcdg/
http://lanyrd.com/2012/pycon-au/swkdq/

as they talk a bit about these issues.

So, what one can do depends on how you are using mod_wsgi. Embedded
mode or daemon mode?

With embedded mode there is not much you can do just within
Apache/mod_wsgi, because the connection gets queued in the listener
socket queue for Apache itself, and there isn't a great deal of
visibility into that. Apache doesn't know how long a connection may
have been sitting in the listener socket backlog queue before it
accepts it.

This arises because Apache will only accept a request when it actually
has the resources to handle it. Thus, when all processes/threads are
busy, requests will back up in that listener socket queue.

If you are using daemon mode, you can do a little bit better, because
the web application processes sit behind Apache. Thus you can time
stamp a request when Apache does accept it and look at the difference
between that and the current time when the application in the daemon
process actually gets to handle it.

What this therefore shows is where the daemon mode processes get
overloaded. It does, however, require that the Apache worker processes
still have enough threads to keep accepting requests and let them back
up in the worker processes rather than the listener queue; otherwise
the time stamp is never applied.

In mod_wsgi 3.4 (just released recently), it will automatically time
stamp all requests and make that available in the WSGI request environ
dictionary as 'mod_wsgi.queue_start'. Doing:

queue_start = int(value) / 1000000.0

will give you a time stamp in seconds that can then be compared to
time.time() to work out how much time elapsed between Apache accepting
the request and the web application getting passed the request. You
could write a little middleware that monitors that.

Beyond queueing time, the next measure one can use is thread
utilisation. This is a measure of how much of the capacity of the WSGI
server is being used. In effect it is time spent serving requests
divided by time it could have spent serving requests based on
available number of processes/threads.

The value of thread utilisation is that once you head towards 100% and
stay at high levels, you know you are starting to run out of capacity.

In combination with queueing time: as thread utilisation increases,
queueing time due to the backlog will also increase.

Measuring thread utilisation is interesting, and a bit tricky to do in
pure Python without doing lots of thread locking, which could impact
performance. Using a C extension one can do it with acceptable
overhead.
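As a rough illustration only of the metric being described (not how any particular product implements it), a naive pure-Python tracker might look like this. The `ThreadUtilisation` name is made up, it assumes you know the daemon process's thread count, and the lock taken on every request is exactly the overhead warned about above.

```python
import threading
import time


class ThreadUtilisation(object):
    """Crude utilisation tracker: total time spent serving requests
    across all worker threads, divided by the time that could have
    been spent serving requests (elapsed time * thread count)."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.lock = threading.Lock()
        self.busy_time = 0.0
        self.start = time.time()

    def record(self, duration):
        # Called at the end of each request with its handling time.
        # A lock acquisition per request is the cost mentioned above.
        with self.lock:
            self.busy_time += duration

    def utilisation(self):
        # Fraction of total serving capacity used since start.
        with self.lock:
            elapsed = time.time() - self.start
            capacity = elapsed * self.num_threads
            return self.busy_time / capacity if capacity else 0.0
```

A value heading towards 1.0 and staying there is the running-out-of-capacity signal described above.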

Important to realise though is that these sorts of measures should
only be seen as one part of what you should be monitoring. Needing to
fiddle these settings to increase capacity most likely means you are
doing a poor job of making your application itself perform better.

You know you are doing the right thing when these measures prove that
you can safely drop processes/threads and not the other way around.

Anyway, the two talks I link to talk a bit about these issues and give examples.

Although you can easily measure queueing time yourself, because thread
utilisation is tricky and because all this stuff is better seen as one
part of an overall monitoring strategy, it is going to be much easier
to just use New Relic, which does all this and more.

Queueing time is visible in the New Relic Lite plan if you don't want
to pay for New Relic after its trial period ends. The thread
utilisation and the resulting capacity analysis reporting based on it
are, however, part of the paid level, so once you drop to Lite you no
longer get access to them. You still have the trial period, though, to
get an answer to your question.

The normal trial period for New Relic is 14 days. Use this URL at the
moment and you can get an extended trial.

http://newrelic.com/30

So having monitoring in place is the best way of working out what is
going on, and then using the results to tune your configuration.

Another area one can investigate, especially if using embedded mode,
is whether you have totally screwed up your MPM settings, or are using
the defaults Apache ships with, which aren't very good for Python,
especially with the prefork MPM.

I have been doing some work in that area as well, writing some scripts
which will validate the Apache configuration and produce some charts
showing how it behaves under certain simulated conditions. These tell
you if you have stuffed it up and are going to cause Apache to perform
badly through basic process management.

I have this stuff working for worker MPM, but not prefork MPM yet. I
am not sure I want to make it available just yet though.

Enough words; a couple of images to whet your appetite.

https://dl.dropbox.com/u/22571016/CapacityAnalysisExample.jpg

This one shows the capacity analysis page in New Relic giving how much
your server is being used.

https://skitch.com/grahamdumpleton/e1dqj/figure-1

This shows evaluation of worker MPM settings for Apache shipped as
source code. Not ideal for Apache, but can still be okay.

https://skitch.com/grahamdumpleton/e1dqa/figure-1

This shows evaluation of poorly chosen MPM settings done by user.

Too many processes were created initially, which were immediately
killed because they were excess to requirements. As the number of
concurrent requests increased, the incorrect configuration meant
Apache would swap between thinking it needed more processes and
thinking it had too many, so the potential existed for it to
continually kill off and then restart processes.

You can all mull over those images.

Since I am about to go on holidays and won't be online much, my best
suggestion is just to try New Relic and find that capacity analysis
report.

Also keep an eye out on the New Relic blog as there will be a post
going up in the next week sometime about the Capacity Analysis report.
It also includes additional information about using it to tune one
aspect of mod_wsgi daemon mode.

Enjoy the carrots for now. This exploration of MPM settings and
evaluating their effectiveness is something I intend to talk about at
the next PyCon US if my talk gets accepted.

Graham

Kent

Sep 10, 2012, 9:16:08 AM
to mod...@googlegroups.com
Graham,
Thanks very much for this excellent information.  The videos are very informative and you've got me on a great start. 

We are using daemon mode with Apache that is apparently compiled with prefork, the Linux default:
httpd -l
Compiled in modules:
  core.c
  prefork.c
  http_core.c
  mod_so.c

You warned against bad MPM settings, but I don't know where to look to determine what *good* MPM settings are.  Can you point me there, or is this largely a problem for embedded mode only?

I'd also submit that clearly much of our app time is spent in database waits as our app (for now) is in the cloud and speaks with remote database, hundreds of miles away.  Further, many of our requests take quite long, some up to 10 seconds or more, since we are often saving quite complex orders with many business rules requiring many database round trips.

I can see this type of situation as being ripe for backlog, do you agree?  We have 8 CPU cores and, since we had the RAM available anyway, after watching your videos I've increased from processes=4 threads=8 to processes=16 threads=10 and will monitor from here.
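For reference, a sketch of what such a daemon mode configuration might look like in the Apache configuration. The `myapp` process group name and script path are hypothetical; only the processes/threads figures match the ones mentioned above.

```apache
# 16 processes x 10 threads = capacity for 160 concurrent requests
# in the daemon process group.
WSGIDaemonProcess myapp processes=16 threads=10
WSGIProcessGroup myapp
WSGIScriptAlias / /path/to/myapp.wsgi
```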

I'd love to find the time to monitor more extensively and try out new relic..., but if anything I've mentioned is throwing red flags in your mind, please let me know: in all humility, I know that this is not my area of expertise.

Kent

Sep 21, 2012, 3:54:30 PM
to mod...@googlegroups.com
On Friday, September 7, 2012 1:43:12 AM UTC-4, Graham Dumpleton wrote:

What this therefore shows is where the daemon mode processes get
overloaded. It does, however, require that the Apache worker processes
still have enough threads to keep accepting requests and let them back
up in the worker processes rather than the listener queue; otherwise
the time stamp is never applied.

Should I understand that the prefork MPM will defeat being able to use the mod_wsgi.queue_start value?  That is, with prefork MPM, are all requests doomed to stay in the listener queue until the daemon process is ready for them anyway, making the mod_wsgi.queue_start value meaningless?

Graham Dumpleton

Sep 23, 2012, 9:02:04 AM
to mod...@googlegroups.com, mod...@googlegroups.com
The queue start time in embedded mode only really shows where a request was delayed due to the initial loading of the WSGI script file.

In both embedded and daemon mode, if connections start backing up in the external listener socket, e.g. on port 80, then you don't get visibility of that.

The only partial solution for that is using end user monitoring with something like New Relic, which can show network time between the browser client and the server. It can't distinguish, though, whether that was general network time or time spent waiting for the connection to be accepted.

Graham

Kent

Sep 24, 2012, 8:03:39 AM
to mod...@googlegroups.com
Thanks.  I was concerned that prefork wasn't really compatible.