diagnosing "missing" daemons


Alec Flett

unread,
Jan 27, 2010, 5:50:45 PM1/27/10
to mod...@googlegroups.com
We have a mod_wsgi 3.0c5 instance (just upgraded to 3.1 yesterday, but
haven't deployed the Apache install with that version yet) that seems
to have a problem where some of the daemons don't restart once
they've lived out their active lifetime. What happens is that Apache
runs along happily for days at a time, and then all of a sudden (over
the course of a few hours) all the requests get slow and the number
of occupied Apache children (i.e. Apache slots, not WSGI daemons) goes
way up. When I log into the machine, there are only a few (in one
example, only 7 of the original 48 we started with) WSGI daemons
hanging on for dear life, handling as many requests as they can.

At the same time, we recently discovered (about a month ago) that we
had a C extension (jsonlib2) that was crashing the Python interpreter
because of a refcounting bug. (We have since fixed the bug in
jsonlib2, if anyone here is using it!)

What I'm wondering is: if the WSGI daemon crashes hard (e.g. with a
bus error), what is the expected behavior of mod_wsgi? It looks like
mod_wsgi handles a SIGINT sent to a daemon process and restarts the
daemon, but what about a crash?

Alec

Graham Dumpleton

unread,
Jan 27, 2010, 6:07:00 PM1/27/10
to mod...@googlegroups.com
2010/1/28 Alec Flett <al...@metaweb.com>:

Should restart on a crash automatically.

One cause of what you are seeing is Python threads being deadlocked
and over time causing available threads to be used up.

Are you using multithread daemons? Is your code and third party
modules thread safe?

Try setting 'inactivity-timeout=120' as option to WSGIDaemonProcess.

This will cause a process to be automatically restarted if the process
is idle for 2 minutes, but it will also force a restart of a process
where there are active requests but none of them have read any request
body or written any response content in 2 minutes. It can therefore be
used as a fail-safe for processes that get stuck due to all threads
deadlocking.
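In Apache configuration terms this is just another option on your existing WSGIDaemonProcess directive; the other values in this fragment are placeholders, not anyone's actual settings:

```apache
WSGIDaemonProcess example processes=24 threads=1 inactivity-timeout=120
```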

I would also suggest setting LogLevel to 'info' so that additional
information is printed in the error logs about process restarts.

As to finding out whether Python threads are stuck, modify your
program to have a URL which does the following:

import sys
import traceback

def stacktraces():
    code = []
    for threadId, stack in sys._current_frames().items():
        code.append("\n# ThreadID: %s" % threadId)
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % (line.strip()))
    return code

def application(environ, start_response):
    status = '200 OK'
    output = '\n'.join(stacktraces())
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]

and either log the results or send them as the response.

This way you might get an idea of what the request threads are actually doing.

Let me know what you find and also post your actual daemon mode configuration.

Graham

Alec Flett

unread,
Jan 28, 2010, 12:21:42 PM1/28/10
to mod...@googlegroups.com

On Jan 27, 2010, at 3:07 PM, Graham Dumpleton wrote:

> Should restart on a crash automatically.
>
> One cause of what you are seeing is Python threads being deadlocked
> and over time causing available threads to be used up.
>
> Are you using multithread daemons? Is your code and third party
> modules thread safe?
>

nope, single-threaded! threads=1 on the WSGIDaemonProcess line.

> Try setting 'inactivity-timeout=120' as option to WSGIDaemonProcess.
>

great, that seems like a good idea anyway.


>
> I would also suggest setting LogLevel to 'info' so that additional
> information printed out in error logs about process restarts.
>

That was going to be my next question ...:)

>
> This way you might get an idea what request threads are actually
> doing.
>

So none of this explains the "missing daemons" problem, where the
daemons are not actually starting back up again. As you can see
below, I set display-name so that I can look at the daemons with
"ps". When I do a ps ax | grep <group> I only see a few processes (in
fact one of my servers in production has dropped from the original 24
processes, down to 7 yesterday, and now to only 3 today!)

> Let me know what you find and also post your actual daemon mode
> configuration.
>

Here's one of them:

#############################
# Project: client
##############################

WSGIDaemonProcess client-freebase.com processes=24 threads=1 display-name=%{GROUP} python-path=/mw/app/client_88277/_install/lib/python2.6/site-packages maximum-requests=1000

WSGIScriptAlias / /mw/app/client_88277/_install/bin/client.wsgi

# Server configuration for client
<Directory /mw/app/client_88277/_install/bin>
WSGIProcessGroup client-freebase.com
</Directory>

> Graham
>
> --
> You received this message because you are subscribed to the Google
> Groups "modwsgi" group.
> To post to this group, send email to mod...@googlegroups.com.
> To unsubscribe from this group, send email to modwsgi+u...@googlegroups.com
> .
> For more options, visit this group at http://groups.google.com/group/modwsgi?hl=en
> .
>

Graham Dumpleton

unread,
Jan 29, 2010, 12:51:16 AM1/29/10
to mod...@googlegroups.com
2010/1/29 Alec Flett <al...@metaweb.com>:

>
> On Jan 27, 2010, at 3:07 PM, Graham Dumpleton wrote:
>
>> Should restart on a crash automatically.
>>
>> One cause of what you are seeing is Python threads being deadlocked
>> and over time causing available threads to be used up.
>>
>> Are you using multithread daemons? Is your code and third party
>> modules thread safe?
>>
>
> nope, single-threaded! threads=1 on the WSGIDaemonProcess line.
>
>> Try setting 'inactivity-timeout=120' as option to WSGIDaemonProcess.
>>
>
> great, that seems like a good idea anyway.
>>
>> I would also suggest setting LogLevel to 'info' so that additional
>> information printed out in error logs about process restarts.
>>
> That was going to be my next question ...:)
>
>>
>> This way you might get an idea what request threads are actually doing.
>>
> So none of this explains the "missing daemons" problem - where the daemons
> are not actually starting back up again... as you can see below, I set the
> display-name so that I can look at the daemons with "ps" - when I do a ps ax
> | grep <group> I only see a few processes

The extra level of logging may show whether the processes are doing
some sort of shutdown. If they are crashing, then you should already
see segmentation fault messages in the main Apache error log, not the
virtual host log, so make sure you check both logs.

The processes should be restarted if they truly exit or crash. If it
is an orderly process restart, due to maximum requests or the WSGI
script file being touched, there is also a fail-safe which defaults to
5 seconds: if the process doesn't die in that time, a thread should
cause it to kill itself. The only way this wouldn't work is if some C
extension module for Python had registered a competing C code level
signal handler, or blocked signals, and it interfered with mod_wsgi.
In that case though the process would still exist and you should still
see it.

If it was an Apache restart that triggered the process restart, you
presumably would have known about it, unless you have some automated
system which does that. Even so, Apache will kill off any daemon
processes which don't shut down within 3 seconds.

It also can't be the case that the processes are zombies, because that
would mean Apache isn't doing a wait on their exit code, which it
should be.

So, all quite confusing.

> (in fact one of my servers in
> production has dropped from the original 24 process, down to 7 yesterday,
> and now only at 3 today!)

Unless you have long-lived requests, 24 processes is actually quite a
lot. Any well-tuned system should manage with a lot less.

Even with that number of processes, since they are not multithreaded,
I wouldn't expect you to run out of resources unless your code has a
problem with not releasing file descriptors. You might though use
lsof, ofiles, or some other tool to work out whether a large number of
file descriptors are in use. Even then, if Apache/mod_wsgi can't
restart processes because of that, you should see error messages in
the main Apache error log.
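As a rough sketch of checking descriptor usage from inside a process (lsof does the same from outside), the Linux-only snippet below reads /proc; the `fd_usage` helper name is made up for illustration and is not part of mod_wsgi:

```python
import os
import resource  # Unix-only module


def fd_usage(pid="self"):
    """Return (open_fds, soft_limit), or None where /proc is unavailable."""
    fd_dir = "/proc/%s/fd" % pid  # e.g. /proc/1234/fd for another process
    if not os.path.isdir(fd_dir):
        return None  # not on Linux, or no permission to inspect that pid
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return len(os.listdir(fd_dir)), soft


usage = fd_usage()
if usage is not None:
    print("%d of %d file descriptors in use" % usage)
```

A leak would show up as the first number creeping toward the second across requests.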

Graham

Alec Flett

unread,
Feb 8, 2010, 1:35:40 PM2/8/10
to mod...@googlegroups.com
So I'm still seeing this problem: our Python processes are crashing
for some reason (our problem, I'm sure) but mod_wsgi isn't restarting
them.

I just perused the mod_wsgi.c source and I don't see anything that
would restart children if they crashed. In particular, I don't see
anything catching SIGCHLD, but I'm willing to believe that the apr_
APIs are doing this in a different way.

Also, is there some kind of scoreboard telling which children are
available to receive new requests? Because the server continues to
serve requests except for the missing children, leading me to believe
mod_wsgi has somehow figured out that the dead children are not
allowed to handle new requests.

Can you point me at the crash-recovery code?

Alec

Alec Flett

unread,
Feb 8, 2010, 2:38:33 PM2/8/10
to mod...@googlegroups.com
Ok, I've now found wsgi_manage_process...

FWIW I haven't been able to reproduce the crash by calling
os.kill(os.getpid(), signal.SIGBUS), and frankly I'm not even sure how
specifically our children are crashing, whether it's a SIGBUS or
something else. All I know is the state I find the appserver in, and
there's little to nothing in the logs.
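For what it's worth, a crash of this sort can be simulated outside Apache entirely; this standalone sketch forks a child, has the child deliver a fatal signal to its own pid, and shows that the parent observes the termination signal via waitpid, which is the same information Apache's process management gets about a crashed daemon:

```python
import os
import signal

pid = os.fork()
if pid == 0:
    # Child: deliver an uncaught fatal signal to our own pid.
    os.kill(os.getpid(), signal.SIGSEGV)
    os._exit(0)  # never reached; the signal kills the process
else:
    # Parent: reap the child and inspect how it died.
    _pid, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        print("child died from signal", os.WTERMSIG(status))
```

Running the daemon's application code under a harness like this, rather than under Apache, can make it easier to see which signal is actually killing it.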

I'm going to keep digging...

Alec

Alec Flett

unread,
Feb 8, 2010, 6:42:52 PM2/8/10
to mod...@googlegroups.com

Ok, I think I'm starting to get a handle on what's going on.

For background, we run in prefork mode. We currently have:
StartServers 5
MinSpareServers 5
MaxSpareServers 10
ServerLimit 600
MaxClients 600
MaxRequestsPerChild 1000

For mod_wsgi I've got maximum-requests=1000

For a bunch of PIDs, these are the mod_wsgi log messages I see:
pids: 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425,
434, 473,
Initializing Python.
Attach interpreter ''.
Destroying interpreters.
Cleanup interpreter ''.
Terminating Python.
Python has shutdown.

Now I did some exploring and it turns out those PIDs are Apache
children, NOT mod_wsgi daemons.

I think that Apache is quietly shutting down Apache children, perhaps
when they reach MaxRequestsPerChild, and this is taking the mod_wsgi
children down with them, and mod_wsgi is not restarting those
children. Could there possibly be some off-by-one bug where, if we're
on the 1000th request, mod_wsgi thinks "kill this child and restart
it", but then Apache comes in and kills the child just before it
restarts?

Alec

Graham Dumpleton

unread,
Feb 8, 2010, 6:56:43 PM2/8/10
to mod...@googlegroups.com
If you are not using embedded mode, ie., only using daemon mode, then
add the directive:

WSGIRestrictEmbedded On

This will tell mod_wsgi not to bother to initialise the Python
interpreter in the Apache server child processes, given it will not be
required.
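In a configuration where everything is delegated to daemon mode, the directive sits at server scope alongside the daemon definitions; this is an illustrative skeleton with placeholder names, not anyone's actual config:

```apache
# No WSGI applications run embedded, so don't initialise Python
# in the Apache server child processes at all.
WSGIRestrictEmbedded On

WSGIDaemonProcess example processes=24 threads=1
WSGIScriptAlias / /srv/example/app.wsgi
<Directory /srv/example>
    WSGIProcessGroup example
</Directory>
```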

This presumes mod_wsgi 3.X is being used as 2.X behaves differently.

That should eliminate those messages and make it clearer what is going on.

I will explain more later when I have the time to catch up on all my email.

Graham

Graham Dumpleton

unread,
Feb 8, 2010, 8:25:01 PM2/8/10
to mod...@googlegroups.com
I should add that you should also read:

http://blog.dscpl.com.au/2009/03/load-spikes-and-excessive-memory-usage.html

That explains a bit about how Apache will kill off excess Apache
server child processes and/or create more when needed.

The fact that by default the Python interpreter still gets created in
those processes is probably why you are getting confused. That is,
those processes wouldn't be replaced straight away, unlike the daemon
mode processes, which would be, as they are part of a static-size
pool, whereas the main Apache server child processes are effectively
part of a dynamic-size pool.

Thus, setting WSGIRestrictEmbedded, and so disabling the default
behaviour that sees the Python interpreter initialised in those
processes, may clear things up.

Graham

Alec Flett

unread,
Feb 9, 2010, 12:12:53 PM2/9/10
to mod...@googlegroups.com

On Feb 8, 2010, at 5:25 PM, Graham Dumpleton wrote:

I have read this a few times... I feel fairly enlightened, but it
still doesn't explain how I'm losing daemons.

>
> Because by default the Python interpreter still gets created in those
> processes, is probably why you are getting confused. That is, those
> processes wouldn't be replaced straight away, unlike the daemon mode
> processes which would be as they are part of a static pool size,
> whereas main Apache server child processes are effectively part of a
> dynamic pool size.
>

But I guess what I'm seeing is that the daemons really are failing to
recirculate, or something. I'm still in this world where we're
suddenly at 2-3 daemons when we started with 24. (I should add that
we chose 24 because we do have big beefy machines with lots of RAM and
cores, plus this particular application tends to be CPU-heavy.)

> Thus, setting WSGIRestrictEmbedded and disabling default behaviour
> that sees Python interpreter still initialised in those processes may
> clear things up.
>

I'll definitely give that a try... at least it should further reduce
the number of log messages that may be confusing me.

Alec

Graham Dumpleton

unread,
Feb 9, 2010, 7:39:50 PM2/9/10
to mod...@googlegroups.com
BTW, do you have:

LogLevel info

(or debug) set in your Apache configuration?

If you do, then the main Apache error log should be logging when
daemon processes are stopped and started. For example:

[Wed Feb 10 11:18:29 2010] [info] mod_wsgi (pid=1563): Starting
process 'tests' with uid=501, gid=20 and threads=15.

For a case where the daemon process is explicitly killed, you will see:

[Wed Feb 10 11:18:44 2010] [info] mod_wsgi (pid=1563): Process 'tests'
has died, restarting.
[Wed Feb 10 11:18:44 2010] [info] mod_wsgi (pid=1568): Starting
process 'tests' with uid=501, gid=20 and threads=15.

Because this was via SIGINT, and so a graceful shutdown, in the
virtual-host-specific error log (or the main error log if there is no
virtual host error log) you will see:

[Wed Feb 10 11:18:44 2010] [info] mod_wsgi (pid=1563): Shutdown
requested 'tests'.
[Wed Feb 10 11:18:44 2010] [info] mod_wsgi (pid=1563): Stopping process 'tests'.
[Wed Feb 10 11:18:44 2010] [info] mod_wsgi (pid=1563): Destroying interpreters.
[Wed Feb 10 11:18:44 2010] [info] mod_wsgi (pid=1563): Cleanup interpreter ''.
[Wed Feb 10 11:18:44 2010] [info] mod_wsgi (pid=1563): Terminating Python.
[Wed Feb 10 11:18:44 2010] [info] mod_wsgi (pid=1563): Python has shutdown.

If the daemon process had just crashed, you wouldn't see the later messages.

If it crashed because of segmentation fault, then the main Apache
error log will show the segmentation fault message.

There will be different shutdown messages if something like maximum
requests or an inactivity timeout is defined for the daemon process
and it is triggered. I would supply examples, but for whatever reason
in my current code base the inactivity timeout doesn't seem to be
working. This may or may not be related and I am investigating.

BTW, setting:

LogLevel debug
WSGIVerboseDebugging On

will give you even more debug messages, but the latter causes logging
for every request, which may not be what you want as it will fill up
the logs.

Graham

Graham Dumpleton

unread,
Feb 9, 2010, 8:06:38 PM2/9/10
to mod...@googlegroups.com

The messages you would see for inactivity timeout occurring are:

[Wed Feb 10 11:51:42 2010] [info] mod_wsgi (pid=2521): Daemon process
inactivity timer expired, stopping process 'tests'.
[Wed Feb 10 11:51:42 2010] [info] mod_wsgi (pid=2521): Shutdown
requested 'tests'.
[Wed Feb 10 11:51:42 2010] [info] mod_wsgi (pid=2521): Stopping process 'tests'.
[Wed Feb 10 11:51:42 2010] [info] mod_wsgi (pid=2521): Destroying interpreters.
[Wed Feb 10 11:51:42 2010] [info] mod_wsgi (pid=2521): Destroy
interpreter 'tests.example.com|/echo.wsgi'.
[Wed Feb 10 11:51:43 2010] [info] mod_wsgi (pid=2521): Cleanup interpreter ''.
[Wed Feb 10 11:51:43 2010] [info] mod_wsgi (pid=2521): Terminating Python.
[Wed Feb 10 11:51:43 2010] [info] mod_wsgi (pid=2521): Python has shutdown.

The reason I wasn't getting these is that there is a possible bug in
the mod_wsgi code. Specifically, a single check loop is used to
monitor both the deadlock timeout and the inactivity timeout. When it
goes into this loop before any request has been received, it doesn't
take the inactivity timeout into consideration and instead waits for
the duration of the deadlock timeout. I had the inactivity timeout at
30 seconds and the deadlock timeout at 300 seconds, and made the
request at the start of that 300 seconds, so the first time it took
300 seconds of inactivity to trigger a restart instead of 30 seconds.

I don't believe this issue could be related to anything you are
seeing, if indeed there is an issue.

If the maximum requests limit is reached, you would see:

[Wed Feb 10 12:04:54 2010] [info] mod_wsgi (pid=2689): Maximum
requests reached 'tests'.
[Wed Feb 10 12:04:54 2010] [info] mod_wsgi (pid=2689): Shutdown
requested 'tests'.
[Wed Feb 10 12:04:54 2010] [info] mod_wsgi (pid=2689): Stopping process 'tests'.
[Wed Feb 10 12:04:54 2010] [info] mod_wsgi (pid=2689): Destroying interpreters.
[Wed Feb 10 12:04:54 2010] [info] mod_wsgi (pid=2689): Destroy
interpreter 'tests.example.com|/echo.wsgi'.
[Wed Feb 10 12:04:54 2010] [info] mod_wsgi (pid=2689): Cleanup interpreter ''.
[Wed Feb 10 12:04:54 2010] [info] mod_wsgi (pid=2689): Terminating Python.
[Wed Feb 10 12:04:54 2010] [info] mod_wsgi (pid=2689): Python has shutdown.

So, if you have:

LogLevel info

do you see these sorts of messages in the normal course of operation?

Graham

Alex

unread,
Mar 13, 2010, 10:57:39 PM3/13/10
to modwsgi
The vanishing daemons problem seems to be causing an issue on a new
server I've recently deployed. It may have been a very rare
occurrence on the old single-core machine (I really can't remember),
but now random daemons are disappearing since we've switched to a more
modern quad-core Xeon server.

I've tried raising the debug level in Apache to see if it can show any
useful details. A typical working restart:

[Sun Mar 14 01:14:02 2010] [info] mod_wsgi (pid=6795): Daemon process
inactivity timer expired, stopping process 'fatfluffs'.
[Sun Mar 14 01:14:02 2010] [info] mod_wsgi (pid=6795): Shutdown
requested 'fatfluffs'.
[Sun Mar 14 01:14:02 2010] [info] mod_wsgi (pid=6795): Stopping
process 'fatfluffs'.
[Sun Mar 14 01:14:02 2010] [info] mod_wsgi (pid=6795): Destroying
interpreters.
[Sun Mar 14 01:14:02 2010] [debug] mod_wsgi.c(5172): mod_wsgi
(pid=6795): Create thread state for thread 0 against interpreter
'www.fatfluffs.com|/site.wsgi'.
[Sun Mar 14 01:14:02 2010] [info] mod_wsgi (pid=6795): Destroy
interpreter 'www.fatfluffs.com|/site.wsgi'.
[Sun Mar 14 01:14:02 2010] [info] mod_wsgi (pid=6795): Cleanup
interpreter ''.
[Sun Mar 14 01:14:02 2010] [info] mod_wsgi (pid=6795): Terminating
Python.
[Sun Mar 14 01:14:02 2010] [info] mod_wsgi (pid=6795): Python has
shutdown.
[Sun Mar 14 01:14:03 2010] [info] mod_wsgi (pid=6795): Process
'fatfluffs' has died, restarting.
[Sun Mar 14 01:14:03 2010] [info] mod_wsgi (pid=7134): Starting
process 'fatfluffs' with uid=33, gid=33 and threads=15.
[Sun Mar 14 01:14:03 2010] [info] mod_wsgi (pid=7134): Initializing
Python.
[Sun Mar 14 01:14:03 2010] [debug] mod_wsgi.c(11151): mod_wsgi
(pid=7134): Process 'fatfluffs' logging to 'www.fatfluffs.com' with
log level 7.
[Sun Mar 14 01:14:03 2010] [info] mod_wsgi (pid=7134): Attach
interpreter ''.
[Sun Mar 14 01:14:03 2010] [debug] mod_wsgi.c(10662): mod_wsgi
(pid=7134): Starting 15 threads in daemon process 'fatfluffs'.
[Sun Mar 14 01:14:03 2010] [debug] mod_wsgi.c(10491): mod_wsgi
(pid=7134): Enable monitor thread in process 'fatfluffs'.
[Sun Mar 14 01:14:03 2010] [debug] mod_wsgi.c(10672): mod_wsgi
(pid=7134): Starting thread 1 in daemon process 'fatfluffs'.
[Sun Mar 14 01:14:03 2010] [debug] mod_wsgi.c(10495): mod_wsgi
(pid=7134): Deadlock timeout is 300.
[Sun Mar 14 01:14:03 2010] [debug] mod_wsgi.c(10498): mod_wsgi
(pid=7134): Inactivity timeout is 300.
[Sun Mar 14 01:14:03 2010] [debug] mod_wsgi.c(10461): mod_wsgi
(pid=7134): Enable deadlock thread in process 'fatfluffs'.
[Sun Mar 14 01:14:03 2010] [debug] mod_wsgi.c(10672): mod_wsgi
(pid=7134): Starting thread 2 in daemon process 'fatfluffs'.
(removed repeated lines)
[Sun Mar 14 01:14:03 2010] [debug] mod_wsgi.c(10672): mod_wsgi
(pid=7134): Starting thread 15 in daemon process 'fatfluffs'.
[Sun Mar 14 01:14:06 2010] [debug] mod_wsgi.c(11925): mod_wsgi
(pid=7104): Request server was 'www.fatfluffs.com|0'.
[Sun Mar 14 01:14:06 2010] [debug] mod_wsgi.c(12676): mod_wsgi
(pid=7134): Server listener address '|80'.
[Sun Mar 14 01:14:06 2010] [debug] mod_wsgi.c(12685): mod_wsgi
(pid=7134): Server listener address '|80' was found.
[Sun Mar 14 01:14:06 2010] [debug] mod_wsgi.c(12697): mod_wsgi
(pid=7134): Connection server matched was 'drake.hawkz.com|80'.
[Sun Mar 14 01:14:06 2010] [debug] mod_wsgi.c(12713): mod_wsgi
(pid=7134): Request server matched was 'www.fatfluffs.com|0'.

Then this is the last restart in the logs before it started failing
requests:

[Sun Mar 14 01:44:07 2010] [info] mod_wsgi (pid=7471): Daemon process
inactivity timer expired, stopping process 'fatfluffs'.
[Sun Mar 14 01:44:07 2010] [info] mod_wsgi (pid=7471): Shutdown
requested 'fatfluffs'.
[Sun Mar 14 01:44:07 2010] [info] mod_wsgi (pid=7471): Stopping
process 'fatfluffs'.
[Sun Mar 14 01:44:07 2010] [info] mod_wsgi (pid=7471): Destroying
interpreters.
[Sun Mar 14 01:44:07 2010] [debug] mod_wsgi.c(5172): mod_wsgi
(pid=7471): Create thread state for thread 0 against interpreter
'www.fatfluffs.com|/site.wsgi'.
[Sun Mar 14 01:44:07 2010] [info] mod_wsgi (pid=7471): Destroy
interpreter 'www.fatfluffs.com|/site.wsgi'.
[Sun Mar 14 01:44:07 2010] [info] mod_wsgi (pid=7471): Cleanup
interpreter ''.
[Sun Mar 14 01:44:07 2010] [info] mod_wsgi (pid=7471): Terminating
Python.
[Sun Mar 14 01:44:07 2010] [info] mod_wsgi (pid=7471): Python has
shutdown.
[Sun Mar 14 01:44:08 2010] [debug] mod_wsgi.c(11925): mod_wsgi
(pid=7608): Request server was 'www.fatfluffs.com|0'.

After this point requests slowly start ending up with a 500 error,
until an Apache reload is needed to get the missing daemon back.

We're running mostly standard Debian Lenny stuff, Apache 2.2.9, and
have tried most versions of mod_wsgi (packaged, backport, latest); all
seem to end up with the same problem. WSGIRestrictEmbedded is turned
on with mod_wsgi 3.2, and all daemons are a standard 1 process, 15
threads, 1000 maximum requests, 300 second timeout.
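For reference, a WSGIDaemonProcess line matching that description would look roughly like the following; the group name here is a placeholder, not the actual configuration:

```apache
WSGIDaemonProcess fatfluffs processes=1 threads=15 maximum-requests=1000 inactivity-timeout=300
```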

It's certainly occurring every few days now and is getting highly
annoying. :(

If there's any other details needed to figure this one out then let me
know.

Graham Dumpleton

unread,
Mar 13, 2010, 11:36:53 PM3/13/10
to mod...@googlegroups.com

And what appears in any error logs for the 500 errors or in the error
response page itself?

Can you enable multi-language error documents in Apache? When these
are used, they will sometimes display additional error notes about the
reason for the error.

If for some reason the mod_wsgi daemon processes are disappearing, I
would not expect to see 500 errors; I would expect to see 503 errors,
and there would be a series of mod_wsgi-specific warning/error
messages in the Apache error logs resulting from the failures and
retries in attempting to connect to the mod_wsgi daemon processes.

Graham

Ben Drees

unread,
Mar 17, 2010, 4:26:21 PM3/17/10
to modwsgi
Hi,

I'm working with the same stack and problem that Alec has been
describing here. I have a new (but still vague) theory about what's
happening for your consideration. In trying to understand exactly how
the daemon process restart mechanism works, I noticed that the switch
statement in wsgi_manage_process() has no default block. There's at
least one documented 'reason' code which is not handled:
APR_OC_REASON_UNWRITABLE. I'm not exactly sure what that code means,
but I was able to reproduce the problem (declining number of daemon
children) by adding these lines to the top of wsgi_manage_process():

if (daemon->process.pid % 2) {
    reason = APR_OC_REASON_UNWRITABLE;
}

This is admittedly contrived, but it demonstrates that the problem
*could* be related to an unexpected condition (an unhandled reason) or
error (a transient resource shortage) occurring during the execution
of wsgi_manage_process(). There seems to be at most one opportunity to
get the terminating daemon restarted. If there were a
wsgi_stork_thread() similar to wsgi_reaper_thread(), then perhaps the
restart could be retried later in the event of transient difficulties.

-Ben

Graham Dumpleton

unread,
Mar 18, 2010, 2:00:03 AM3/18/10
to mod...@googlegroups.com

That particular value is never used in APR code or Apache. It more
than likely exists so that user code can use it as a reason when
alerting that a process has died, but mod_wsgi doesn't do that.

It wouldn't hurt to have a default which at least logs when an
unexpected reason arrives.

FWIW, mod_cgid doesn't have that reason code or a default for the
switch statement either, and that is what the code was modelled on.
That other process management in APR is a bit magic though, and there
are certain parts of it, and of how it interacts with Apache graceful
restarts, that I remember being uncertain about for quite a while. I
can't remember if I ever resolved the issue in my mind, but for a long
time I couldn't work out where the signal came from which shut down
other processes on an Apache graceful restart. This was an issue at
the time because I wasn't seeing proper shutdown messages from
processes, and the Python exit code wasn't being called; instead the
processes just got killed. One time when I went to investigate again,
I couldn't duplicate the issue and all seemed to work properly, even
when I went back and tried older versions of Apache. Even though
processes were being killed, they were always replaced.

When I have time, I'll look over that APR code for managing processes again.

Graham
