Re: [modwsgi] Mod_wsgi daemon processes stopped and not restarted when maximum-requests reached

1,358 views
Skip to first unread message

Graham Dumpleton

unread,
Oct 2, 2012, 1:40:13 AM10/2/12
to mod...@googlegroups.com
Sorry for the slow reply, been on long holiday.

Why are you using maximum-requests and inactivity-timeout?

I would always recommended against using these options if there is no
real requirement as unnecessary restarts should always be avoided
because the startup costs of loading up fat Python web applications
can significantly impact server performance if your server is under
constant load.

IOW, the last thing you want to happen if you are under loading,
especially a load spike, is to restart the application due to hitting
maximum-requests.

Anyway, I don't recollect specifically a case where daemon process
wouldn't come back after being killed off. Are you absolutely sure,
from running 'ps' and checking process IDs that they didn't exist in
any state in the process table? Are the process perhaps still there
and haven't actually shutdown properly?

The message:

[Fri Sep 21 09:10:23 2012] [info] mod_wsgi (pid=3845): Aborting process
'XXusername-omittedXX'.

indicates there certainly wasn't a clean shutdown, possibly because of
a threading dead lock or hung request thread. However, when you get
that message there is a C level call to exit() immediately after so
the process should die.

Technically calling exit() doesn't absolutely guarantee shutdown as a
C level atexit() handler could block but neither Apache or mod_wsgi
should be setting up any of those.

The problem with maximum-requests and inactivity-timeout is that they
are a self initiated restart and a graceful one at that. There is
therefore no absolute failsafe if exit() doesn't work as there would
if it was an Apache restart. In the case of an Apache restart, the
Apache parent process will send a SIGKILL to the child processes if
they haven't exited after 3 seconds. For a self initiated restart, the
parent process doesn't know it is restarting so if something prevents
the process from properly dying, then it will not be notified and
replace the process.

In summary, validate for sure whether the daemon process exist still
or not in the process table and in what state.

If they do still exist, try and use gdb as described in:

http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Debugging_Crashes_With_GDB

to work out where they are blocked.

That all said, don't recollect seeing this specific issue with
maximum-requests, have seen similar outcome of processes not handling
requests in the past. Generally this was always because of application
level code blocking exhausting threads. There were some unexplained
cases, but this seemed to vanish with no changes to mod_wsgi. There
was suspicion though at the time that for some reason the distro
Apache was causing problems. From memory this was for Apache in patch
version range around 15-17 of 2.2 and you are using 16. Is there a
reason why you can't upgrade to a newer Apache version than what you
are using?

Graham

On 24 September 2012 10:19, Greg Lowe <greg...@gmail.com> wrote:
> I've recently run into some stability issues with three django sites that we
> host. These are running on Apache and Mod_wsgi, and are low traffic sites.
> I’m more concerned with reliability than maximising throughput.
>
> I’ve included a summary of what has been observed so far below. I would
> greatly appreciate any advice on how to diagnose and/or fix this problem.
>
> The sites are configured to use daemon processes with a maximum-requests
> setting. Apache is running in prefork mode. i.e. WSGIDaemonProcess
> usernamehere user=usernamehere group=usernamehere processes=2 threads=25
> maximum-requests=100 inactivity-timeout=60 display-name=%{GROUP}
>
> We’ve experienced a number of outages where both of the daemon processes
> disappear, and are not recreated by mod_wsgi. Restarting Apache causes them
> to be restarted correctly.
>
> Occasionally I’ve seen the sites being served from only a single daemon
> process although it’s configured to use two. An apache restart fixes this.
>
> Our server admin has managed to reproduce the problem on our test
> environment by running apache bench at around 100 requests per second for
> ~70k requests. (He’s also done this with INFO level logging turned on - see
> the log snippets at the end of the email.)
>
> These are the typical errors seen in the error log from an outage:
>
> Many of these: Script timed out before returning headers: django.wsgi
>
> Many of these: Premature end of script headers: django.wsgi
>
> Occasionally one of these: (4)Interrupted system call: mod_wsgi (pid=19990):
> Unable to connect to WSGI daemon process 'usernamehere’' on
> '/var/run/apache2/wsgi.25099.19.3.sock' after multiple attempts.
>
> Occasionally one of these: (2)No such file or directory: mod_wsgi
> (pid=10877): Unable to connect to WSGI daemon process 'usernamehere' on
> '/var/run/apache2/wsgi.5194.16.3.sock' after multiple attempts.
>
>
> One of the sites has been running for about a year, the others only a few
> months. These problems first started a few weeks ago.
>
> The sites were migrated from a server running an older version of Debian a
> little over a month ago. There was a few weeks in between the migration and
> the problems appearing. So I’m fairly confident that we can attribute these
> problems to the newer version of mod_wsgi, but due to the timing I can’t be
> certain.
>
> Old server versions:
>
> Debian: Lenny
>
> Apache: apache2 2.2.9-10+lenny8, apache-mpm-prefork 2.2.9-10+lenny8
>
> mod_wsgi: libapache2-mod-wsgi 2.5-1~lenny1
>
> Django: 1.3.1
>
>
> Current server versions:
>
> Debian: Squeeze
>
> Apache: apache2 2.2.16-6+squeeze7, apache2-mpm-prefork 2.2.16-6+squeeze7
>
> Mod_wsgi: libapache2-mod-wsgi 3.3-2
>
> Django: 1.3.1
>
>
> Our system admin has reproduced the problem on our test environment with
> INFO level logging turned on. Here are his comments:
>
> I set the apache log level for the test site and ab'd it to death.
>
> The pertinent log entries are:
>
> [Fri Sep 21 09:10:22 2012] [info] Initial (No.1) HTTPS request received
> for child 70 (server XXXXXomittedXXXXXX:443)
> [Fri Sep 21 09:10:22 2012] [info] mod_wsgi (pid=3845): Maximum requests
> reached 'XXusername-omittedXX'.
> [Fri Sep 21 09:10:22 2012] [info] mod_wsgi (pid=3845): Shutdown
> requested 'XXusername-omittedXX'.
> [Fri Sep 21 09:10:22 2012] [info] Initial (No.1) HTTPS request received
> for child 27 (server XXXXXomittedXXXXXX:443)
> ....
> [Fri Sep 21 09:10:23 2012] [info] Initial (No.1) HTTPS request received
> for child 28 (server XXXXXomittedXXXXXX:443)
> [Fri Sep 21 09:10:23 2012] [info] mod_wsgi (pid=3845): Aborting process
> 'XXusername-omittedXX'.
> [Fri Sep 21 09:10:23 2012] [error] [client 192.168.146.19] Premature end
> of script headers: django.wsgi
> [Fri Sep 21 09:10:23 2012] [info] Initial (No.1) HTTPS request received
> for child 85 (server XXXXXomittedXXXXXX:443)
> ....
> [Fri Sep 21 09:10:23 2012] [info] Initial (No.1) HTTPS request received
> for child 46 (server XXXXXomittedXXXXXX:443)
> [Fri Sep 21 09:15:20 2012] [error] [client 192.168.146.19] Script timed
> out before returning headers: django.wsgi
> [Fri Sep 21 09:15:20 2012] [error] [client 192.168.146.19] Script timed
> out before returning headers: django.wsgi
>
> So it seems that there is an unclean shutdown after wsgi's max requests
> count is reached.
>
> A normal shutdown looks like this:
>
> [Fri Sep 21 09:10:17 2012] [info] Initial (No.1) HTTPS request received
> for child 73 (server XXXXXomittedXXXXXX:443)
> [Fri Sep 21 09:10:17 2012] [info] mod_wsgi (pid=3691): Maximum requests
> reached 'XXusername-omittedXX'.
> [Fri Sep 21 09:10:17 2012] [info] mod_wsgi (pid=3691): Shutdown
> requested 'XXusername-omittedXX'.
> [Fri Sep 21 09:10:17 2012] [info] Initial (No.1) HTTPS request received
> for child 79 (server XXXXXomittedXXXXXX:443)
> ....
> [Fri Sep 21 09:10:18 2012] [info] Initial (No.1) HTTPS request received
> for child 27 (server XXXXXomittedXXXXXX:443)
> [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Stopping process
> 'XXusername-omittedXX'.
> [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Destroying
> interpreters.
> [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Destroy
> interpreter 'XXXXXomittedXXXXXX|'.
> [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Cleanup
> interpreter ''.
> [Fri Sep 21 09:10:18 2012] [info] Initial (No.1) HTTPS request received
> for child 61 (server XXXXXomittedXXXXXX:443)
> ....
>
> [Fri Sep 21 09:10:18 2012] [info] Initial (No.1) HTTPS request received
> for child 19 (server XXXXXomittedXXXXXX:443)
> [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Terminating Python.
> [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3691): Python has shutdown.
> [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3845): Attach
> interpreter ''.
> [Fri Sep 21 09:10:18 2012] [info] mod_wsgi (pid=3845): Create
> interpreter 'XXXXXomittedXXXXXX|'.
> [Fri Sep 21 09:10:18 2012] [info] [client 192.168.146.19] mod_wsgi
> (pid=3845, process='XXusername-omittedXX',
> application='XXXXXomittedXXXXXX|'): Loading WSGI script
> '/home/XXusername-omittedXX/scripts/django.wsgi'.
> [Fri Sep 21 09:10:19 2012] [info] Initial (No.1) HTTPS request received
> for child 85 (server XXXXXomittedXXXXXX:443)
>
>
> Thanks for making all the way through this long post ;)
>
> --
> You received this message because you are subscribed to the Google Groups
> "modwsgi" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/modwsgi/-/NSg2A39f7DYJ.
> To post to this group, send email to mod...@googlegroups.com.
> To unsubscribe from this group, send email to
> modwsgi+u...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/modwsgi?hl=en.

Greg Lowe

unread,
Dec 10, 2012, 4:26:10 PM12/10/12
to mod...@googlegroups.com

Sorry for the slow reply, been on long holiday.

Hope you had a good holiday ;)  And sorry for an even slower reply ;)
 
Why are you using maximum-requests and inactivity-timeout? 
 
I would always recommended against using these options if there is no 
real requirement as unnecessary restarts should always be avoided 
because the startup costs of loading up fat Python web applications 
can significantly impact server performance if your server is under 
constant load. 

IOW, the last thing you want to happen if you are under loading, 
especially a load spike, is to restart the application due to hitting 
maximum-requests. 


We are running 5 low usage Django sites on a single host. Our system
admin is worried about one the sites getting out of control and affecting
the others, i.e. memory leak. This is why we're using maximum requests.
I'm not sure why about inactivity-timeout - I'll ask.

These sites are not public web-facing sites, they have small
user bases. We're more interested in reliability and low maintenance
than performance under heavy load.

If the daemon processes are not periodically restarted using
maximum-requests, then should they be periodically restarted? The
thought of having a python VM running for years, makes me a bit nervous.

I guess a cron job and an Apache reload perhaps?

 
Anyway, I don't recollect specifically a case where daemon process
wouldn't come back after being killed off. Are you absolutely sure,
from running 'ps' and checking process IDs that they didn't exist in
any state in the process table? Are the process perhaps still there
and haven't actually shutdown properly?


The processes are definitely gone. We have been keeping an eye on the
processes in ps. If Django or python crashed, I would expect to see
some error messages in our logs.

I had a brief look at the source code for mod_wsgi to understand what
is happening. Based on the info log messages it appears that the restart
fails whenever the "shutdown reaper" runs.

As I understand it, the reaper runs if Django takes too long to shutdown.

Is mod_wsgi supposed to restart a process after the "reaper" kills it?
 
We've also experimented with increasing the shutdown-timeout, so that
the reaper is less likely to kill the process. This improved things but did
not make the problem go away entirely.
I assume this is just the stock Debian Apache version. I'm not the system admin,
but my guess is we'd then need to manually apply security patches, rather than
using the package management system.

We're still experiencing the problem. We are monitoring it and restarting processes
when needed. I've started looking into this again, and will keep you, and the list up
to date with how it goes.

It sounds like the best approach is to remove the maximum requests and
inactivity timeout settings, and to set up a cron job to periodically restart
the daemon processes, perhaps every week or so.

Thanks for your input, and again, sorry for ignoring your reply for so long.

Cheers,
Greg.

Michael Bayer

unread,
Jan 3, 2013, 6:06:06 PM1/3/13
to mod...@googlegroups.com
We're seeing behavior that also might be this same issue.   This is again a low-use server with a maximum-requests of 1000, which is set that way since we were observing slow memory leaks in one application.   We're on Apache 2.2.3, I'm not seeing here what version of mod_wsgi we're on (unless there's some way to divine that from the .so file).

The symptom is, the server serves its last request, and the next one is a 500 server error.  The app server doesn't seem to play any role at that point and the error log has the error "Premature end of script headers: ".   It's not clear to me without asking the admins some more if the server no longer writes any more logs after that, as well as whether or not the daemon process is still present in the ps listing.

Graham Dumpleton

unread,
Jan 5, 2013, 4:18:17 AM1/5/13
to modwsgi
Okay, here is the history of this issue.

I have seen reports of it probably less than a dozen times that I recollect. If more, it would not be many more.

Although it is mod_wsgi that triggers the shutdown of the daemon process, it is process management code within the Apache Runtime Library that deals with the consequences of the child processing disappearing. The only thing mod_wsgi provides is some simple callback code to be notified when a process has died and depending on the circumstances restart it. The only case one shouldn't be restarted to replace one which has died is if Apache is being restarted or shutdown.

These two recent reports are the first time I have seen the issue brought up for a long time. Previously, based on the fact that reports of the problem went away, it was suspected that it was some Apache issue that was fixed around Apache version 2.2.16/2.2.17 due to a change in the version of the Apache Runtime Library being used at the time.

Michael, in your case you say you are using Apache 2.2.3 and Greg you said you were using stock Debian version. Debian has a history of not keeping up with the latest versions of software, so depending on what Debian version you are using, you could also be using an old version of Apache.

So that is about all I can tell you at the moment. That is, that I would suspect that it is going to be due to an older version of Apache being used. The issue just hasn't been seen since about Apache 2.2.16/2.2.17.

Sorry about the slow replies. Life is confusing right now and finding time and motivation to keep on top of the few mod_wsgi questions that do come up is proving challenging.

Graham



To view this discussion on the web visit https://groups.google.com/d/msg/modwsgi/-/Lc4lrAOyyvwJ.

Greg Lowe

unread,
Jan 7, 2014, 3:07:05 PM1/7/14
to mod...@googlegroups.com
Dredging up this old thread.

Till now we've been limping along waiting to upgrade Debian - and performing manual restarts whenever a process goes down.

We upgraded Debian a couple of days ago and were hoping this problem would be gone for good, but unfortunately have experienced it again.

Software versions:
apache2     2.2.22-13
libapache2-mod-wsgi    3.3-4

Current config:
WSGIScriptAlias / /home/XXXXX/scripts/XXXXX.wsgi
WSGIDaemonProcess epal user=XXXXX group=XXXXX processes=2 threads=25 maximum-requests=0 shutdown-timeout=15 inactivity-timeout=60 display-name=%{GROUP}
WSGIProcessGroup XXXXX

So I guess the next step is to remove the inactivity timeout and set up a cron job to periodically restart apache.

Cheers,
Greg.
Reply all
Reply to author
Forward
0 new messages