Detection of deadlocked or unresponsive mod_wsgi daemon processes.

Graham Dumpleton

Nov 14, 2007, 6:05:24 AM
to mod...@googlegroups.com, al...@swapoff.org
FYI.

As a means to combat the problem recently discussed, whereby a third
party C extension might block on some external resource, or get stuck
in a tight loop, without first having released the Python GIL, I have
added some Python GIL deadlock detection code to the subversion trunk
of mod_wsgi.

Note though that this feature is only implemented for mod_wsgi daemon
processes, as there is no real practical way of doing it for embedded
mode and normal Apache child processes.

Anyway, what the change means is that if mod_wsgi thinks that some
thread has held the Python GIL for what would be considered an
excessive amount of time without releasing it, by default 300 seconds,
then the mod_wsgi process will be shut down and restarted.
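
To make the idea concrete, here is a rough Python model of the timeout logic. It is only a sketch and is not the mod_wsgi code: the class name HeartbeatWatchdog and its methods are made up for illustration. The assumption is that one thread refreshes a heartbeat each time it manages to get the GIL, and a monitor restarts the process if the heartbeat goes stale. In a real implementation the monitor would have to be native code that does not itself need the GIL, which pure Python cannot show.

```python
import threading
import time


class HeartbeatWatchdog:
    """Toy model: a heartbeat that must keep being refreshed (hypothetical)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds allowed between heartbeats
        self._clock = clock         # injectable so the logic can be tested
        self._last_beat = clock()
        self._lock = threading.Lock()

    def beat(self):
        # Called by the thread that can only run once it obtains the GIL.
        with self._lock:
            self._last_beat = self._clock()

    def expired(self):
        # Called by the monitor to decide whether a restart is warranted.
        with self._lock:
            return self._clock() - self._last_beat > self.timeout
```

With a timeout of 60, expired() stays false as long as beat() is called at least once a minute, and turns true once the heartbeat has been silent for longer than that.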

At the moment I have set the default timeout period quite high, but
you can reduce it if desired using the 'deadlock-timeout' option to
WSGIDaemonProcess.

WSGIDaemonProcess trac inactivity-timeout=300 deadlock-timeout=60

I am open to opinions on what the default for this value should be
based on what people think is reasonable for sensibly behaving Python
applications. It may well be better off being somewhat less, because
for web applications the last thing you want is something hogging the
GIL for a long time, as no other Python code can run.

BTW, the way this is different to the inactivity timeout is that the
latter causes a shutdown when no new requests have arrived, and there
has been no new input or output for currently executing requests, for
a specified period. The inactivity timeout can indirectly detect
deadlocks, but it also means you are forcing a process restart when it
is idle for some period, which may not be desired. The deadlock
detection will only cause a restart when a possibly real problem has
occurred.

Note that neither really deals with the case where a specific request
handler gets stuck in a tight loop in Python code, where the GIL would
still be able to be released periodically. The inactivity timeout will
detect this where that request was the only one running, but not if
there are other normally executing requests still occurring.
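
The point about a tight loop in Python code still releasing the GIL can be seen with a small experiment. This is just an illustrative sketch, with ticks_while_spinning being a made-up name: one thread spins in pure Python while the calling thread keeps counting, which it could not do if the spinner never gave up the GIL.

```python
import threading
import time


def ticks_while_spinning(duration=0.5):
    """Count how often the caller gets to run while another thread spins."""
    stop = threading.Event()

    def spin():
        # Tight loop in pure Python: the interpreter still switches
        # threads periodically, so the GIL is not held indefinitely.
        while not stop.is_set():
            pass

    spinner = threading.Thread(target=spin, daemon=True)
    spinner.start()

    ticks = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        ticks += 1
        time.sleep(0.01)

    stop.set()
    spinner.join()
    return ticks
```

Because the caller keeps making progress, a watchdog based purely on the GIL being obtainable would see nothing wrong here, which is why this case needs a different kind of check.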

I have thought about implementing a request timeout previously,
whereby individual requests taking longer than a certain period would
cause a restart, but that gets a bit tricky. I'll have a bit of a
think about it again though, because if I can do that then it would
probably cover all cases of how an application could get stuck and
provide a means of restarting such a process so things will recover
automatically.
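
As a rough idea of the bookkeeping a request timeout would need, here is a hypothetical sketch in Python. Nothing like this exists in mod_wsgi at this point, and the RequestTracker name and its methods are invented for illustration. It assumes each request is registered when it starts and removed when it finishes, with a monitor periodically asking which requests have run too long.

```python
import threading
import time


class RequestTracker:
    """Hypothetical record of in-flight requests and their start times."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds a request is allowed to run
        self._clock = clock         # injectable so the logic can be tested
        self._started = {}          # request id -> start time
        self._lock = threading.Lock()

    def start(self, request_id):
        with self._lock:
            self._started[request_id] = self._clock()

    def finish(self, request_id):
        with self._lock:
            self._started.pop(request_id, None)

    def overdue(self):
        # Ids of requests that have been running longer than the timeout.
        now = self._clock()
        with self._lock:
            return [r for r, t in self._started.items()
                    if now - t > self.timeout]
```

The tricky part is not this bookkeeping but deciding what to do when overdue() returns something, since restarting the process also interrupts any other requests that are behaving normally.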

If you want to play with this code for me and in general test it out,
you'll need to check out the latest code from subversion trunk for
mod_wsgi.

Graham
