I've been using Apache + mod_wsgi + Django for quite some time now,
and everything works great:
WSGIDaemonProcess myapp.com.br processes=2 threads=25
WSGIProcessGroup myapp.com.br
WSGIReloadMechanism Process
WSGIScriptAlias / "/var/django/myapp/apache/wsgi.conf"
Lately though, probably due to the fact that the site is being used
more intensely, I'm dealing with a problem I can't pinpoint. Sometimes
it seems that a request to get the site (say, the main page) fails -
I've seen it once myself, the (Safari) browser gave me a blank page
stating that it "sometimes happened when the server was too busy". I
didn't take the time to take note of the message, but as far as I
remember it wasn't a 500 error (but then again, maybe it was). Then a
simple refresh presented me with the correct page.
The problem is, I have a script that will get a "static" version of
the site by first issuing a wget, and then copying all the static
pages to another server (so the general public actually sees the
static version, not the Django version). In my logs, I can see that
sometimes wget fails, meaning it just ends without getting any page,
but it issues no error message that I can log. During normal
operation, wget will get all (+1000) pages of the site, but when that
problem happens, it gets none. It seems it's an "all or nothing"
problem. Maybe wget just fails when the very first page fails, or
something like that.
I'm sure I'm giving too little information to track down this
problem, but I'm not sure what I should monitor. My Apache2 log shows
several "IOError: client connection closed", but I've read this list's
archive, and it looks like this is not really an error, so I don't
have anything else at this point.
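
One way to make the silent wget failures visible is to run wget from a
small wrapper that records its exit status and whatever it writes to
stderr. The sketch below is only an illustration: the run_and_log name,
the logger name, and the example command in the usage note are
assumptions, not part of the original script.

```python
import logging
import subprocess

log = logging.getLogger("static_mirror")  # hypothetical logger name


def run_and_log(cmd):
    """Run cmd (a list of arguments) and log its exit status and stderr.

    Returns (returncode, stderr_text) so the caller can decide whether
    to go ahead and copy the static pages.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _out, err = proc.communicate()
    err_text = err.decode("utf-8", "replace")
    if proc.returncode != 0:
        # A non-zero status means the command did not finish cleanly.
        log.error("%s exited with status %d: %s",
                  cmd[0], proc.returncode, err_text.strip())
    else:
        log.info("%s finished normally", cmd[0])
    return proc.returncode, err_text
```

It could be called as run_and_log(["wget", "--mirror", "http://myapp.com.br/"]),
skipping the copy step whenever the returned status is not 0. wget writes
its progress and error messages to stderr, so the second return value is
where any explanation would show up.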
Here is a typical excerpt from the Apache2 log:
[Wed Jun 03 20:16:24 2009] [notice] Apache/2.2.8 (Ubuntu) mod_wsgi/2.4
Python/2.5.2 configured -- resuming normal operations
[Fri Jun 05 12:14:42 2009] [error]
['/usr/lib/python2.5/site-packages/ipython-0.9.1-py2.5.egg',
'/usr/lib/python2.5/site-packages/Django-1.0.2_final-py2.5.egg',
'/usr/lib/python2.5/site-packages/MySQL_python-1.2.3c1-py2.5-linux-i686.egg',
'/usr/lib/python25.zip', '/usr/lib/python2.5',
'/usr/lib/python2.5/plat-linux2', '/usr/lib/python2.5/lib-tk',
'/usr/lib/python2.5/lib-dynload',
'/usr/local/lib/python2.5/site-packages',
'/usr/lib/python2.5/site-packages',
'/usr/lib/python2.5/site-packages/PIL', '/var/django',
'/var/django/myapp']
[Fri Jun 05 13:19:43 2009] [error]
['/usr/lib/python2.5/site-packages/ipython-0.9.1-py2.5.egg',
'/usr/lib/python2.5/site-packages/Django-1.0.2_final-py2.5.egg',
'/usr/lib/python2.5/site-packages/MySQL_python-1.2.3c1-py2.5-linux-i686.egg',
'/usr/lib/python25.zip', '/usr/lib/python2.5',
'/usr/lib/python2.5/plat-linux2', '/usr/lib/python2.5/lib-tk',
'/usr/lib/python2.5/lib-dynload',
'/usr/local/lib/python2.5/site-packages',
'/usr/lib/python2.5/site-packages',
'/usr/lib/python2.5/site-packages/PIL', '/var/django',
'/var/django/myapp']
[Fri Jun 05 14:46:08 2009] [error] [client 201.6.63.43] mod_wsgi
(pid=29343): Exception occurred processing WSGI script
'/var/django/myapp/apache/wsgi.conf'.
[Fri Jun 05 14:46:08 2009] [error] [client 201.6.63.43] IOError:
client connection closed
[Fri Jun 05 14:46:19 2009] [error] [client 201.6.63.43] mod_wsgi
(pid=29344): Exception occurred processing WSGI script
'/var/django/myapp/apache/wsgi.conf'.
[Fri Jun 05 14:46:19 2009] [error] [client 201.6.63.43] IOError:
client connection closed
[Fri Jun 05 15:56:54 2009] [error] [client 201.6.63.43] mod_wsgi
(pid=29344): Exception occurred processing WSGI script
'/var/django/myapp/apache/wsgi.conf'.
[Fri Jun 05 15:56:54 2009] [error] [client 201.6.63.43] IOError:
client connection closed
Any help is greatly appreciated!
Regards,
Rubens
What database are you using for your Django app? If you are using
PostgreSQL you probably need to manually VACUUM your database. We had
this same sort of problem just yesterday: pages took too long to load
or showed up blank, etc. We had been running for nearly 6 months
without a full vacuum, and it seems like autovacuum can only do so much
for you if you have a lot of data inserted and updated.
I had to take the site down, put it in maintenance mode, stop all
other apps that were using the databases and run a "VACUUM FULL
VERBOSE ANALYZE" on every database we had. Took over an hour to finish
in our case. Once VACUUM was done I took the site back up and it was
responsive again.
See: http://www.postgresql.org/docs/8.2/interactive/routine-vacuuming.html
--
Best Regards,
Nimrod A. Abing
W http://arsenic.ph/
W http://preownedcar.com/
W http://preownedbike.com/
W http://abing.gotdns.com/
Ensure you have LogLevel set to info in main Apache configuration. If
you have separate LogLevel settings for main Apache server and
VirtualHost, make sure both are set and look at error logs for both
main Apache server and VirtualHost. See:
http://code.google.com/p/modwsgi/wiki/DebuggingTechniques
>> What OS type are you running the Apache web server and wget on? Is
>> Safari browser on MacOS X the only place you have got an error response?
>
> The only occasion when I personally "saw" this error was with Safari,
> but on the server box I'm using only wget.
>
>> Finally, check both main and virtual host error logs for any instances
>> of 'Segmentation fault'.
>
> Nope, no segmentation faults... The only 'error' messages I have are:
>
> IOError: client connection closed
Generally not a problem. Although if the client was actually
prematurely killed, can also indicate a network issue in between.
> and occasionally:
>
> (104)Connection reset by peer: mod_wsgi (pid=9145): Unable to get
> bucket brigade for request., referer: http://myapp.com.br/admin/website/gallery/add/
This error is more serious. Have had some discussion about this in the
past, but can't remember what conclusion was. My memory is getting bad
in my old age. :-(
Graham
We also get this error usually when Googlebot crawls our websites. I
haven't found the time to examine the issue in full yet but I would
guess this is because Googlebot disconnects if the site fails to send
a response after N seconds, where N is known only to Google.
We also get this error and a traceback email from Django when someone
cancels a file upload.
>> and occasionally:
>>
>> (104)Connection reset by peer: mod_wsgi (pid=9145): Unable to get
>> bucket brigade for request., referer: http://myapp.com.br/admin/website/gallery/add/
>
> This error is more serious. Have had some discussion about this in the
> past, but can't remember what conclusion was. My memory is getting bad
> in my old age. :-(
Was it really ever resolved? Looks like it was not resolved in this thread:
We also hit this "bucket brigade" thing several times a while back and
IIRC I fixed it by raising TimeOut to 600 after a bit of trial and
error. In addition I needed to rewrite some Django views that were
hitting the database so that they did not tie up database transactions.
Django uses a database table to maintain session data. If you are
using transactions and the transaction middleware there are things
that you should keep an eye on. Example:
from django.db import transaction

@transaction.commit_on_success
def view_that_updates_the_db(request):
    # DB updating stuff here
    pass

def just_a_regular_view(request):
    # regular stuff here, no DB updates
    pass
Suppose a request comes in through view_that_updates_the_db(), and
moments later another request comes in through just_a_regular_view().
If you specify session middleware after transaction middleware in your
MIDDLEWARE_CLASSES, just_a_regular_view() would block until
view_that_updates_the_db() finishes. The transaction.commit_on_success
decorator causes *all* database operations, including session
creation, to be wrapped inside a transaction[1]. As a result, the
table for sessions is locked up inside the transaction until
view_that_updates_the_db() finishes running. If
view_that_updates_the_db() ties up the transaction long enough, it
will cause other incoming requests to wait and this wait time inches
it closer to your TimeOut value. Specifying transaction middleware
*after* session middleware should fix this, but you should also look
into view_that_updates_the_db() and try to optimize it to run as
quickly as possible.
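
The ordering suggested above might look like this in settings.py. This
is a sketch that assumes the stock Django session and transaction
middleware and nothing else; a real project would list its other
middleware as well.

```python
# settings.py (fragment)
MIDDLEWARE_CLASSES = (
    # Session middleware comes first ...
    'django.contrib.sessions.middleware.SessionMiddleware',
    # ... and transaction middleware is listed after it.
    'django.middleware.transaction.TransactionMiddleware',
)
```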
Note that using wget and other clients that do not have cookie support
will cause your Django app to create a new session (thereby hitting
the database) for *each* request. I needed to write an alternative
session middleware which does not create a new session if it detects
bots or other clients that do not have cookie support.
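
The alternative session middleware itself is not shown in this thread.
As a rough illustration only, the decision it has to make could be
sketched like this. The function name and the list of User-Agent
markers are made up for the example; "sessionid" is Django's default
session cookie name.

```python
# Illustrative substrings only; a real list would need tuning.
BOT_MARKERS = ("googlebot", "wget", "curl", "bot", "spider", "crawler")


def should_create_session(user_agent, cookies, cookie_name="sessionid"):
    """Return True if a new session is worth creating for this request.

    user_agent is the User-Agent header (or None); cookies is a dict of
    the cookies the client sent.
    """
    if cookie_name in cookies:
        return True  # client already carries a session cookie
    ua = (user_agent or "").lower()
    if not ua:
        return False  # no User-Agent at all: treat as a script
    for marker in BOT_MARKERS:
        if marker in ua:
            return False  # looks like a crawler or a command-line client
    return True
```

A middleware built on this would consult the helper before saving a new
session, so that a wget run over 1000 pages does not write 1000 session
rows.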
[1] http://docs.djangoproject.com/en/dev/topics/db/transactions/#tying-transactions-to-http-requests
>> and occasionally:
>>
>> (104)Connection reset by peer: mod_wsgi (pid=9145): Unable to get
>> bucket brigade for request., referer: http://myapp.com.br/admin/website/gallery/add/
>
> This error is more serious. Have had some discussion about this in the
> past, but can't remember what conclusion was. My memory is getting bad
> in my old age. :-(
Django uses a database table to maintain session data.
I'm looking into this error as well, trying to correlate it with
requests, uploads, reloads, etc. to figure out what's going on. More
info as I find it...
But was it still random, or was it reproducible to some degree?
> Other people seem to be able to complete uploads happily... so i'm a bit
> puzzled.
For this upload case, have you used curl to do the upload, or some
other way which has a means to monitor the progress of the upload, so
it is possible to see whether upload progress stops some time before
the error is raised? Obviously if it isn't readily reproducible, that
is going to be hard to do.
Anyway, your post has reminded me that the "Unable to get bucket
brigade" error is just a lower-level version of my higher-level "client
connection closed" error, i.e., effectively duplicated error logging.
Graham
Right. But by default, it will use a database table to store sessions
and that is what most people will use until performance bottlenecks force
them to eventually use the alternative session engines. Also in our
own case, we are still using our own patched version of Django 0.96
which does not have the SESSION_ENGINE option, so we are not only stuck
with 0.96, we are stuck with DB-based sessions as well.
Looks like we have pretty much the same mod_wsgi setup. Except that we
run 1 daemon process with multiple threads and we have Timeout set to
600.
I ran grep on our error_log files just now and it seems we are still
getting them for one particular URL and sure enough this URL handles
file uploads. Got four entries in a recent error_log, here is one of
them:
error_log.2:[Mon Jun 01 12:43:13 2009] [error] [client
xxx.xxx.xxx.xxx] (70007)The timeout specified has expired: mod_wsgi
(pid=18680): Unable to get bucket brigade for request., referer:
http://example.com/post/
The corresponding entry on the access_log file:
access_log.2:xxx.xxx.xxx.xxx - - [01/Jun/2009:12:36:17 +0100] "POST
/offer/post/ HTTP/1.1" 500 541 "http://example.com/post/" "Mozilla/5.0
(Windows; U; Windows NT 5.1; en-US; rv:1.8.1.20) Gecko/20081217
Firefox/2.0.0.20"
It looks like it's causing an internal server error, however I get no
email tracebacks from Django which probably means this happens early
on in the request.
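
For anyone who wants more than grep, the same search can be done with a
few lines of Python that also pull out the timestamp and client address.
This is a sketch that assumes the Apache 2.2 error_log layout shown in
this thread; the function name and the regular expression are
illustrative.

```python
import re

# [timestamp] [level] [client addr] message   (the client part is optional)
LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] "
    r"(?:\[client (?P<client>[^\]]+)\] )?(?P<message>.*)$"
)


def find_entries(lines, needle="Unable to get bucket brigade"):
    """Return a list of dicts for the log lines whose message contains needle."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and needle in match.group("message"):
            entries.append(match.groupdict())
    return entries
```

Calling find_entries(open("error_log")) returns one dict per matching
line, with "time", "level", "client" and "message" keys, which makes it
easier to line the entries up against the access_log.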
I did a whois on the IP addresses that were triggering these errors
and most of them come from AOL and some from a dial-up ISP in Africa.
Maybe it only happens for dial up? In the past we also had email
backtraces coming in from people who are behind proxies. All of them
saying "IOError: request data read error". Here is a sample of the
last 5 lines of one such traceback:
File "/usr/local/lib/python2.5/site-packages/django/core/handlers/wsgi.py",
line 136, in _get_post
self._load_post_and_files()
File "/usr/local/lib/python2.5/site-packages/django/core/handlers/wsgi.py",
line 114, in _load_post_and_files
self._post, self._files = http.parse_file_upload(header_dict,
self.raw_post_data)
File "/usr/local/lib/python2.5/site-packages/django/core/handlers/wsgi.py",
line 165, in _get_raw_post_data
safe_copyfileobj(self.environ['wsgi.input'], buf, size=content_length)
File "/usr/local/lib/python2.5/site-packages/django/core/handlers/wsgi.py",
line 67, in safe_copyfileobj
buf = fsrc.read(min(length, size))
IOError: request data read error
The GET and POST MultiValueDict would be empty for the request.
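
The traceback shows the IOError surfacing while the request body is read
from wsgi.input. An application that reads the body itself could catch
the error there and treat it as an aborted upload instead of letting it
turn into a traceback email. A minimal sketch of that idea, using plain
WSGI rather than Django's own handler (the function name is
hypothetical):

```python
def read_request_body(environ):
    """Read the request body from a WSGI environ.

    Returns the body as bytes, or None if the client went away before
    the whole body arrived.
    """
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0  # malformed header: treat as an empty body
    try:
        return environ["wsgi.input"].read(length)
    except IOError:
        # e.g. "request data read error" when an upload is cancelled
        return None
```

A caller that gets None back can return a short error response and log
the event at a low level, since there is no client left to read a
detailed reply.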
I've turned info logging on, and also got my ISP to change the virtual
machine to some other host - if I still find anything in the logs I'll
report back...
Also, I have surrounded my wget requests with more timeout-forgiving
settings, and yesterday I didn't have any problems. I know it's not a
good thing to change many variables at once, but the site is being
used to cover a live event these days, so I couldn't take any
chances...
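
The exact wget settings are not given here. Purely as an illustration,
a more forgiving command line could be assembled like this; the helper
name and the numbers are assumptions, while --mirror, --tries, --timeout
and --waitretry are standard wget options.

```python
def build_wget_command(url, tries=5, timeout=60, wait_retry=10):
    """Build a wget argument list that retries instead of giving up at once."""
    return [
        "wget",
        "--mirror",                      # recursive fetch of the whole site
        "--tries=%d" % tries,            # attempts per page
        "--timeout=%d" % timeout,        # seconds before a stalled request is dropped
        "--waitretry=%d" % wait_retry,   # maximum seconds to wait between retries
        url,
    ]
```

The resulting list can be handed to subprocess as is, for example
subprocess.call(build_wget_command("http://myapp.com.br/")).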
> > (104)Connection reset by peer: mod_wsgi (pid=9145): Unable to get
> > bucket brigade for request., referer: http://myapp.com.br/admin/website/gallery/add/
> This error is more serious. Have had some discussion about this in the
> past, but can't remember what conclusion was. My memory is getting bad
> in my old age. :-(
As for Robert, this is related to the process of uploading files, and
it looks like it's not that frequent, so I can live with that...
Thanks again,
Rubens
In my case, I have only one customer using the site, and he's
definitely not using a dial-up connection or behind a proxy - but I
guess you're on the right track assuming it has to do with some
problem during the transmission. My client uploads lots of photos,
some of them quite big, and I assume the logs are referring to the
occasional mishaps.
Regards,
Rubens
> In the past we also had email
> backtraces coming in from people who are behind proxies. All of them
> saying "IOError: request data read error". Here is a sample of the
> last 5 lines of one such traceback:
> IOError: request data read error
FWIW, the Apache documentation says:
"""The timer used to default to 1200 before 1.2, but has been lowered
to 300 which is still far more than necessary in most situations."""
So, I am not sure that increasing it to 600 from 300 is going to do
much, especially since the timeout only applies to a certain very
small number of events. The mod_wsgi daemon mode also applies the
timeout as the deadlock break value across the socket connection
between Apache server child processes, but I doubt that a deadlock
scenario would be coming into play either.
Graham