Every few days, when we experience higher loads, we get SQLAlchemy's:
TimeoutError: QueuePool limit of size 5 overflow 10 reached,
connection timed out, timeout 30
Along with that I see an increase in (2-3 a minute):
(104)Connection reset by peer: core_output_filter: writing data to the
network
and
(32)Broken pipe: core_output_filter: writing data to the network
in my apache error logs.
Having checked over my Pylons code a few times, Session.remove()
should always be called. I'm worried that the broken pipe or
connection reset by peer errors mean that remove() isn't being called.
The server is running mod_wsgi with Apache's mpm_worker with the
following config:
<IfModule mpm_worker_module>
StartServers 16
MaxClients 480
MinSpareThreads 50
MaxSpareThreads 300
ThreadsPerChild 30
MaxRequestsPerChild 0
</IfModule>
and using mod_wsgi's daemon mode:
WSGIDaemonProcess somename user=www-data group=www-data processes=4
threads=32
Is this somehow overkill? The server is a well-specced quad core with
8 gigs of RAM and fast hard drives.
It also runs the database server (postgres).
Has anyone else experienced this kind of problem? I've cross-posted
this to both the mod_wsgi and sqlalchemy mailing lists - hope that's
ok as I believe this may be relevant to both groups.
Thanks,
Damian
I presume this is a PostgreSQL client-side error message, not one
from the database server.
> Along with that I see an increase in (2-3 a minute):
>
> (104)Connection reset by peer: core_output_filter: writing data to the
> network
>
> and
>
> (32)Broken pipe: core_output_filter: writing data to the network
>
> in my apache error logs.
These errors would normally just indicate that the HTTP client severed
the connection before the request completed. So, the machine bogs
down, users give up and possibly hit reload.
> Having checked over my Pylons code a few times, Session.remove()
> should always be called.
Presumably you are using a recent psycopg2. The remove() method wasn't
properly calling close(); that was fixed back in 2007:
http://www.mail-archive.com/sqlal...@googlegroups.com/msg05485.html
> I'm worried that the broken pipe or
> connection reset by peer errors mean that remove() isn't being called.
Even when the client connection fails, mod_wsgi will, as the WSGI
specification requires, call close() on the iterable returned by the
WSGI application. So, as long as the WSGI application cleans up
correctly when close() is called in that way, even if not all of the
response could be returned, it should be okay.
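To illustrate the cleanup path being described, here is a minimal sketch of a WSGI middleware that does its teardown in close(), which mod_wsgi calls even when the client drops the connection. The names (`CleanupIterable`, `cleanup_middleware`) are hypothetical, not from Pylons or the thread; `cleanup` stands in for something like `Session.remove`:

```python
# Sketch: a WSGI middleware that guarantees cleanup runs via close(),
# which mod_wsgi calls on the returned iterable even when the client
# connection is reset mid-response.

class CleanupIterable:
    def __init__(self, result, cleanup):
        self.result = result
        self.cleanup = cleanup

    def __iter__(self):
        return iter(self.result)

    def close(self):
        try:
            # Propagate close() to the wrapped iterable if it has one.
            if hasattr(self.result, "close"):
                self.result.close()
        finally:
            self.cleanup()  # always runs, even after a broken pipe


def cleanup_middleware(app, cleanup):
    def wrapper(environ, start_response):
        return CleanupIterable(app(environ, start_response), cleanup)
    return wrapper
```

If the framework's session teardown is wired up this way, a "connection reset by peer" on the HTTP side should not by itself leak pooled database connections.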
> The server is running mod_wsgi with Apache's mpm_worker with the
> following config:
>
> <IfModule mpm_worker_module>
> StartServers 16
> MaxClients 480
> MinSpareThreads 50
> MaxSpareThreads 300
> ThreadsPerChild 30
> MaxRequestsPerChild 0
> </IfModule>
>
> and using mod_wsgi's daemon mode:
Are you serving static media or running non-Python web applications on
the same Apache?
If not, then the worker MPM configuration is creating many more
processes/threads than are needed to handle proxying to the available
processes/threads on the daemon process side.
> WSGIDaemonProcess somename user=www-data group=www-data processes=4
> threads=32
Presumably you also have WSGIProcessGroup set and are in fact running
in daemon mode.
> Is this somehow overkill? The server is a well-specced quad core with
> 8 gigs of RAM and fast hard drives.
Use of 32 threads is possibly overkill. The default is 15 per process,
and even that could well be overkill.
If your request times are quick, you can usually get away with under 5
threads. The only reason to run more is if you need a buffer due to
having some number of long-running requests.
> It also runs the database server (postgres).
>
> Has anyone else experienced this kind of problem? I've cross-posted
> this to both the mod_wsgi and sqlalchemy mailing lists - hope that's
> ok as I believe this may be relevant to both groups.
The way I read this is that you have 32 potential threads in a process
which want to access the database, but SQLAlchemy is set up with only
5 connections in its pool plus an overflow of 10 (15 in total). Thus,
on a per-process basis, if things bog down and the database is
overloaded with requests arriving quicker than they can be handled,
you could run out of connection resources in the pool and exceed the
wait time for queued requests wanting a connection.
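For reference, the numbers in the original error correspond to SQLAlchemy's engine pool settings, which can be set explicitly. A runnable sketch (the values shown are SQLAlchemy's defaults, which is exactly what produces the "size 5 overflow 10 ... timeout 30" message; a file-based SQLite URL is used here so the example is self-contained, whereas the real DSN would be a postgresql:// one):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# pool_size=5 plus max_overflow=10 means at most 15 connections per
# process; a request waiting longer than pool_timeout seconds for a
# free connection raises the TimeoutError seen in the thread.
engine = create_engine(
    "sqlite:///example.db",  # placeholder; the thread's DB is PostgreSQL
    poolclass=QueuePool,
    pool_size=5,      # connections kept open in the pool
    max_overflow=10,  # extra connections allowed under load
    pool_timeout=30,  # seconds to wait before raising TimeoutError
)
```

With 32 threads per daemon process all wanting connections at once, those defaults can clearly be exhausted, which matches the analysis above.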
Dropping down to 5 threads per mod_wsgi daemon process would avoid
this, as that is then fewer than the number of connections in the
pool, which couldn't be exceeded. When the system does bog down, with
fewer threads across all daemon processes, if all are used up, it just
means that requests effectively get queued up within the Apache server
child processes. Presuming that the backend recovers, those queued
requests will then be handled in turn.
If the number of requests arriving is sufficient that all the threads
across the Apache server child processes also become busy, and the
socket listener queue length for the main HTTP port on Apache is
exceeded, only then would clients start to see connection refused.
Dropping down the number of daemon threads in this way can therefore
actually be used as a way of throttling connections where it is known
that your database isn't going to be able to handle more than a
certain number of requests at the same time. In other words, rather
than letting a large number of requests through, simply overloading
the database even more and making things worse, the limit, with
subsequent queueing of requests within Apache, allows one to trickle
connections through when in an overloaded state.
Anyway, since I don't know much about SQLAlchemy and psycopg2, that is
my guess at what is happening.
Graham
Thank you very much for that. I've reduced the number of threads per
process to the default 15.
The Apache webserver still serves static data - I have not gotten to a
point where we could shift that off to another machine which could
proxy our requests to the app server or just serve static data. I'm
not convinced it makes sense to run two webservers on the same machine.
I also run two sets of daemon processes - one in Apache's https
config and one in the non-https config. I'm guessing it would make
more sense to run only one set of daemon processes for both the https
and http versions. At the moment I define two sets of wsgi process
groups, one called bla, one called bla-ssl, which have 4 processes
with 15 threads each.
The website runs a javascript process that polls the site a second
after each request to deliver the latest information to the client.
Over 99% of these requests never touch the database and are served out
of memcached with little processing done by the wsgi app - these
requests take as little as 20ms to process. What I may be seeing here
could be the result of the 'dog pile effect' - when we change the
database we delete the memcached entry and it gets regenerated by the
next request. If enough requests come in simultaneously they probably
cause the sqlalchemy error - too many clients are trying to regenerate
the memcached data.
Reducing the number of wsgi processes should reduce the effect, as
requests will get queued by Apache and we won't run out of
sqlalchemy's pooled connections. In the current setup we could be
seeing up to 80 simultaneous db queries (8 daemon processes * 10
sqlalchemy pooled connections). Does that sound realistic?
I should also try and see if I can cut down on the number of processes
and see if it still runs happily, and see if I can make the cache a
bit smarter to avoid dog-pile.
Thanks a lot for your time!
Damian
On Jan 15, 3:38 am, Graham Dumpleton <graham.dumple...@gmail.com>
wrote:
> 2010/1/15 Damian <damiandimm...@gmail.com>:
Suggestions seem to be that it helps, and on a technical level there
are valid reasons why it does. These include nginx being better at
static files, and nginx isolating Apache from slow clients, meaning
you can provision fewer processes/threads in Apache, thus keeping down
Apache memory usage and, in the case of threads, avoiding some of the
scaling issues around the GIL with multiple threads and CPU cores.
> I also run two sets of daemon processes - one in Apache's https
> config and one in the non-https config. I'm guessing it would make
> more sense to run only one set of daemon processes for both the https
> and http versions. At the moment I define two sets of wsgi process
> groups, one called bla, one called bla-ssl, which have 4 processes
> with 15 threads each.
Yes, using one process group would be better.
Just place the WSGIDaemonProcess directive inside the VirtualHost that
comes first in order in the Apache configuration. The WSGIProcessGroup
directive in the second VirtualHost can reference the daemon process
group defined in the first, as it is allowed to reference a definition
in a prior VirtualHost so long as the ServerName is the same.
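Spelled out, the arrangement being described might look something like this (server name, paths and the process group name are placeholders, not the actual config from the thread):

```apache
# Daemon process group defined once, in the first VirtualHost.
<VirtualHost *:80>
    ServerName www.example.com
    WSGIDaemonProcess somename user=www-data group=www-data processes=4 threads=15
    WSGIProcessGroup somename
    WSGIScriptAlias / /srv/app/app.wsgi
</VirtualHost>

# The https VirtualHost reuses the same group rather than defining its
# own; this is allowed because its ServerName matches the VirtualHost
# where the group was defined.
<VirtualHost *:443>
    ServerName www.example.com
    WSGIProcessGroup somename
    WSGIScriptAlias / /srv/app/app.wsgi
</VirtualHost>
```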
> The website runs a javascript process that polls the site a second
> after each request to deliver the latest information to the client.
> Over 99% of these requests never touch the database and are served out
> of memcached with little processing done by the wsgi app - these
> requests take as little as 20ms to process. What I may be seeing here
> could be the result of the 'dog pile effect' - when we change the
> database we delete the memcached entry and it gets regenerated by the
> next request. If enough requests come in simultaneously they probably
> cause the sqlalchemy error - too many clients are trying to regenerate
> the memcached data.
>
> Reducing the number of wsgi processes should reduce the effect as
> requests will get queued by apache and we won't run out of
> sqlalchemy's pooled connections. In the current setup we could be
> seeing up to 80 simultaneous db queries (8 daemon processes * 10
> sqlalchemy pooled connections). Does that sound realistic?
If there is no locking around the cache check, such that a subsequent
request can still trigger another database operation even though one
may already have been started by another request, and you can get that
many requests coming in within the time it takes to do the database
query, then yes, it sounds entirely plausible.
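One common way to add the locking being described is to let only one request regenerate a missing cache entry while the others wait. A minimal in-process sketch under assumed names (`get_or_regenerate`, a plain dict standing in for memcached); a real multi-process deployment would need a cross-process mutex instead, e.g. using memcached's add() as a lock:

```python
import threading

_cache = {}                 # stands in for memcached in this sketch
_lock = threading.Lock()    # in-process only; not shared across daemons

def get_or_regenerate(key, regenerate):
    """Return the cached value, letting only one thread regenerate it."""
    value = _cache.get(key)
    if value is not None:
        return value
    with _lock:
        # Re-check under the lock: another thread may have just
        # finished regenerating the entry while we were waiting.
        value = _cache.get(key)
        if value is None:
            value = regenerate()   # the expensive database query
            _cache[key] = value
    return value
```

With this pattern, deleting the cache entry after a database change causes exactly one regeneration per process rather than a dog-pile of simultaneous queries.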
Graham
Just to follow up - the culprit was the default max_connections of 100
that postgres has on debian, and a script kiddie running a primitive
DoS attack. Continuous http requests caused postgres to start up to
100 processes, which ate all the RAM and resulted in swapping.
I've now tuned it so that postgres only has the actually required max
connections, reduced the number of Apache threads, and set up an nginx
proxy which handles our keepalives and static files, freeing up Apache
to serve just dynamic content. This meant we no longer needed modules
loaded such as mod_ssl/rewrite/deflate etc., as nginx handles that
with a much lower memory footprint.
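In outline, that kind of nginx front end looks something like the following (hostnames, paths and the backend port are illustrative, not the actual config):

```nginx
server {
    listen 80;
    server_name www.example.com;

    # Serve static files directly, with a low memory footprint.
    location /static/ {
        root /srv/app;
    }

    # Everything else goes to Apache/mod_wsgi on a backend port; nginx
    # absorbs keepalive and slow clients, so Apache threads are freed
    # up as soon as the response is handed over.
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```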
The memory footprint has gone down significantly and we can actually
handle a lot more users now.
Next is seeing if getting the app to run in embedded mode in Apache
will speed it up a lot - it's the only WSGI app we are running on that
server. Finally we're going to see if we can cache more db queries
(we already use memcached heavily) and look at nginx's request
throttling support.
Thank you very much for your help!
Damian
On Jan 18, 12:43 am, Graham Dumpleton <graham.dumple...@gmail.com>
wrote:
> 2010/1/17 Damian <damiandimm...@gmail.com>:
Provided it is just that Python web application, with no PHP or other
non-Python web applications, no static file handling and nginx in
front, then embedded mode does avoid an extra proxy step. If the
application is multithread safe, prefer the worker MPM, though you
should play with the MPM configuration so that all or most of the
processes, up to the maximum, are created from the outset. Also avoid
a maximum-requests setting if you can and just keep the processes
persistent. This avoids problems caused by Apache spinning up extra
processes when load increases, thereby creating even more load and
slowing the box down even more. In other words, have all the processes
you need for maximum load running all the time, which is effectively
what daemon mode does given that it doesn't support dynamic process
creation.
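Pinning the worker MPM in the way being suggested means making the spare-thread settings match the maximum, so the full set of processes exists from startup and is never recycled. A sketch (the specific numbers are illustrative, not a recommendation for this server):

```apache
<IfModule mpm_worker_module>
    # Create all processes at startup and keep them: no dynamic
    # spawning under load, and no process recycling.
    ServerLimit          8
    StartServers         8
    ThreadsPerChild      15
    MaxClients           120   # ServerLimit * ThreadsPerChild
    MinSpareThreads      120   # equal to MaxClients: never reap
    MaxSpareThreads      120   # equal to MaxClients: never spawn late
    MaxRequestsPerChild  0     # keep processes persistent
</IfModule>
```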
For a bit of background on the problems that can occur due to Apache
dynamic process creation and recycling, read:
http://blog.dscpl.com.au/2009/03/load-spikes-and-excessive-memory-usage.html
Graham