mod_wsgi on trac-hacks.org

Alec Thomas

Aug 21, 2007, 10:45:22 AM
to mod...@googlegroups.com
Hi Graham,

Just wanted to say that I recently migrated trac-hacks.org to a new
machine and took the opportunity to also move from mod_python to
mod_wsgi. It was a largely painless experience, and much simpler than
a mod_python deployment. Nice work.

The only slight issue was that the user= and group= options to
WSGIDaemonProcess failed to set the user, though this could just be
something I'm not comprehending. I have "User www-data" also set, but
it wasn't clear to me whether this would cause issues or not. Anyway,
it's a minor issue.

Again, nice work.

Alec

--
Evolution: Taking care of those too stupid to take care of themselves.

Graham Dumpleton

Aug 21, 2007, 7:35:02 PM
to mod...@googlegroups.com
On 22/08/07, Alec Thomas <al...@swapoff.org> wrote:
>
> Hi Graham,
>
> Just wanted to say that I recently migrated trac-hacks.org to a new
> machine and took the opportunity to also move from mod_python to
> mod_wsgi. It was a largely painless experience, and much simpler than
> a mod_python deployment. Nice work.
>
> The only slight issue was the user= group= options to
> WSGIDaemonProcess failed to set the user, though this could be just
> something I'm not comprehending. I have "User www-data " also set, but
> it wasn't clear to me whether this would cause issues or not. Anyway,
> it's a minor issue.

Are you starting Apache up as root or some other user? The user/group
switch can only occur for mod_wsgi daemon processes when Apache is
started as root.

You can see what mod_wsgi is doing by changing the Apache LogLevel to 'info'.

LogLevel info

This will result in messages being displayed in the main Apache error
log about daemon processes: when they are started, shut down, etc. If
Apache is started as root, you should see:

mod_wsgi (pid=707): Starting process 'grahamd-1' with uid=501,
gid=501 and threads=10.

I.e., it will show the uid/gid. If Apache is not started as root, you
will not see the uid/gid in the logged output.

Also ensure that you are actually delegating your Python application
to run in the daemon process, using the WSGIProcessGroup directive.
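
For example, a minimal delegation might look like the following (the
process group name and paths here are made up for illustration):

WSGIDaemonProcess example user=www-data group=www-data
WSGIScriptAlias / /usr/local/wsgi/scripts/app.wsgi
WSGIProcessGroup example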

With 'info' logging enabled, you should also see, in the Apache log
file for that host, messages of the form:

mod_wsgi (pid=707, process='grahamd-1',
application='localhost:8004|/mod_wsgi/restart.py'): Loading WSGI
script '/usr/local/wsgi/scripts/restart.py'.

This will occur the first time the script is loaded by mod_wsgi. If
'process' is an empty string, it means the script isn't being delegated
to a daemon process.

BTW, how many Trac instances are you running? Just the one? And how
many processes/threads did you decide to use for the WSGIDaemonProcess
directive? I'm curious, so I can get some information about what people
find effective for a site with a reasonable amount of traffic.

Graham

Alec Thomas

Aug 21, 2007, 11:42:32 PM
to mod...@googlegroups.com
On 8/22/07, Graham Dumpleton <graham.d...@gmail.com> wrote:
> With 'info' logging enabled, you should also see in Apache log file
> for that host, messages of the form:
>
> mod_wsgi (pid=707, process='grahamd-1',
> application='localhost:8004|/mod_wsgi/restart.py'): Loading WSGI
> script '/usr/local/wsgi/scripts/restart.py'.
>
> This will occur the first time script is loaded by mod_wsgi. If
> process is empty string, means it isn't being delegated to daemon
> process.

I must be on crack. I just tried the user/group settings again and it worked
fine. Thanks for the explanation though.

> BTW, how many Trac instances are you running? Just the one? And how
> many processes/threads did you decide to use for WSGIDaemonProcess
> directive? Curious so I can get some information about what people
> find effective for a site with a reasonable amount of traffic.

I'm running two at the moment, trac-hacks.org and swapoff.org.

Here are my mod_wsgi settings:


WSGIDaemonProcess trachacks maximum-requests=1000 user=trachacks group=trachacks
WSGIScriptAlias / /srv/httpd/trac-hacks.org/trac.wsgi
WSGIProcessGroup trachacks
WSGIApplicationGroup %{GLOBAL}

Each Trac environment is running in a "workingenv" which trac.wsgi
activates before passing control to Trac:


import os
BASE_DIR = os.path.dirname(__file__)

import sys

# Activate workingenv
sys.path.insert(0, BASE_DIR)
import wenv


# Fix for plugins that print to stdout
sys.stdout = sys.stderr

os.environ['TRAC_ENV'] = os.path.join(BASE_DIR, 'trac')
os.environ['PYTHON_EGG_CACHE'] = os.path.join(BASE_DIR, 'tmp', 'egg-cache')

import trac.web.main
application = trac.web.main.dispatch_request

Graham Dumpleton

Aug 22, 2007, 12:04:03 AM
to mod...@googlegroups.com
On 22/08/07, Alec Thomas <al...@swapoff.org> wrote:
> > BTW, how many Trac instances are you running? Just the one? And how
> > many processes/threads did you decide to use for WSGIDaemonProcess
> > directive? Curious so I can get some information about what people
> > find effective for a site with a reasonable amount of traffic.
>
> I'm running two at the moment, trac-hacks.org and swapoff.org.
>
> Here are my mod_wsgi settings:
>
>
> WSGIDaemonProcess trachacks maximum-requests=1000 user=trachacks group=trachacks
> WSGIScriptAlias / /srv/httpd/trac-hacks.org/trac.wsgi
> WSGIProcessGroup trachacks
> WSGIApplicationGroup %{GLOBAL}

Which means one process in each process group, with that process
running the default of 15 threads.

I take it then that everything is running fine and you don't see
enough load to justify increasing the number of threads and/or adding
a second process to the group. E.g.:

WSGIDaemonProcess trachacks processes=2 threads=25 ........

Sounds good.

> Each Trac environment is running in a "workingenv" which trac.wsgi
> activates before passing control to Trac:
>
>
> import os
> BASE_DIR = os.path.dirname(__file__)
>
> import sys
>
> # Activate workingenv
> sys.path.insert(0, BASE_DIR)
> import wenv

Your 'wenv' module, is that a custom thing you made up, or is it based
on what others have done, such as described in:

http://docs.pythonweb.org/pages/viewpage.action?pageId=5439610

I haven't really played much with the idea of working environments,
except to test the above description to see how it worked. As part of
that, I did investigate whether it was worthwhile having mod_wsgi
directly support setting up working environments. In the end I decided
it wasn't worth the trouble, as it is just as easy to do in the script
file, and easier for the user to control.

I'll have to keep your Trac hacks site in mind, as I would like at
some point to put up a page of significant sites which are using
mod_wsgi, as some evidence that it is stable.

Graham

Alec Thomas

Aug 22, 2007, 3:18:35 AM
to mod...@googlegroups.com
On 8/22/07, Graham Dumpleton <graham.d...@gmail.com> wrote:
> > I'm running two at the moment, trac-hacks.org and swapoff.org.
> >
> > Here are my mod_wsgi settings:
> >
> >
> > WSGIDaemonProcess trachacks maximum-requests=1000 user=trachacks group=trachacks
> > WSGIScriptAlias / /srv/httpd/trac-hacks.org/trac.wsgi
> > WSGIProcessGroup trachacks
> > WSGIApplicationGroup %{GLOBAL}
>
> Which means 1 process in each process group, with that process running
> default of 15 threads.
>
> I take it then everything is running fine and don't see enough load to
> justify increasing number of threads and/or add a second process to
> the group. Eg.
>
> WSGIDaemonProcess trachacks processes=2 threads=25 ........
>
> Sounds good.

I've not noticed any issues yet, but it might be an idea to do this
anyway as I have the CPU and memory to spare.

Another (very minor) thing I've noticed is Apache taking quite a while
to shut down, reflected by messages like these:

[Tue Aug 21 22:32:51 2007] [warn] child process 32700 still did not
exit, sending a SIGTERM
[Tue Aug 21 22:32:53 2007] [warn] child process 32700 still did not
exit, sending a SIGTERM

Could be a Trac issue, though I didn't see it under mod_python.


> Your 'wenv' module, is that a custom thing you made up or is it based
> on what others done such as described in:
>
> http://docs.pythonweb.org/pages/viewpage.action?pageId=5439610

They based their code on a blog post of mine actually :). They link to
it in their document, but it can be found here:

http://swapoff.org/wiki/blog/2007-03-20-activating-a-workingenv-from-python

> I haven't really played too much with the idea of working environments
> except to test the above description to see how it worked. Although as
> part of that I did investigate whether it was worthwhile having
> mod_wsgi directly support setting up working environments. In the end
> decided it wasn't worth the trouble as just as easy to do in script
> file and easier to control by user.

Agreed, that would seem to be overloading the functionality of mod_wsgi
a bit.

>
> I'll have to keep your Trac hacks site in mind as would like at some
> point to put up a page of significant sites which are using mod_wsgi
> as some evidence that it is stable.

Feel free to reference trac-hacks, I'm very happy so far :)

Graham Dumpleton

Aug 22, 2007, 6:39:34 PM
to mod...@googlegroups.com
> > I take it then everything is running fine and don't see enough load to
> > justify increasing number of threads and/or add a second process to
> > the group. Eg.
> >
> > WSGIDaemonProcess trachacks processes=2 threads=25 ........
> >
> > Sounds good.
>
> I've not noticed any issues yet, but it might be an idea to do this
> anyway as I have the CPU and memory to spare.
>
> Another (very minor) thing I've noticed is apache taking quite a while
> to shutdown, reflected by messages like these:
>
> [Tue Aug 21 22:32:51 2007] [warn] child process 32700 still did not
> exit, sending a SIGTERM
> [Tue Aug 21 22:32:53 2007] [warn] child process 32700 still did not
> exit, sending a SIGTERM
>
> Could be a Trac issue, though I didn't see it under mod_python.

The reason you don't see this under mod_python is that mod_python
doesn't try to properly shut down the Python interpreters; it just
abruptly exits the process. As a result, when running mod_python it
doesn't wait for non-daemonised threads to finish or call atexit
registered functions. Thus, in mod_python, if you have any code
registered using atexit, it will never be called.

mod_wsgi tries to do the correct thing and ensure that these steps are
done properly. What it means, though, is that if those things take too
long (3 seconds for an Apache stop/restart), the Apache parent process
will kill off the process anyway.

What we probably need to do is see whether Trac is using any
Python-created threads, especially non-daemonised threads, and whether,
when tracd is being used, it does something to signal that they should
be shut down. We may also need to see whether tracd does other things
to trigger cleanup of the database layer or anything else, as not doing
this may also be causing the problems on shutdown.

When using TurboGears, for example, it is necessary to add to the WSGI
script file the registration of a callback which stops the internal
CherryPy engine. If this isn't done, it prevents a clean process
shutdown. A similar thing may be needed for Trac.
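
As a rough sketch only (this assumes a TurboGears 1.x setup where the
internal engine is stopped via cherrypy.server.stop(); the exact call
is version dependent and should be checked against the CherryPy
release in use), the kind of registration meant here would be:

import atexit
import cherrypy

def shutdown_engine():
    # Stop the internal CherryPy engine so its worker threads exit
    # and the daemon process can shut down cleanly.
    cherrypy.server.stop()

atexit.register(shutdown_engine)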

Most importantly, the fact that you are seeing this means you should
definitely look at using more than one daemon process. I.e., use:

WSGIDaemonProcess trachacks processes=2 maximum-requests=1000

The reason for this is that when the 1000 requests are reached, the
process will be shut down and restarted. When a process restart is done
in this case, up to 5 seconds is allowed for it to shut down gracefully
before it is killed off. If you have only one daemon process, this may
be 5 seconds during which no requests are being served, if it doesn't
shut down promptly.

By using two daemon processes, the chances are that when one process is
restarting due to reaching its maximum requests, the other process will
still be accepting requests. Thus any delay in restarting one will not
be noticed, as the other will handle requests in the meantime.

I'll try to dig through the tracd code and see if I can spot anything
that might need to be added to ensure a clean shutdown.

> > Your 'wenv' module, is that a custom thing you made up or is it based
> > on what others done such as described in:
> >
> > http://docs.pythonweb.org/pages/viewpage.action?pageId=5439610
>
> They based their code on a blog post of mine actually :). They link to
> it in their document, but it can be found here:
>
> http://swapoff.org/wiki/blog/2007-03-20-activating-a-workingenv-from-python

Okay. I will look again at all this when I get a chance, as I would
like to include a good example in the mod_wsgi documentation, since it
is something that many may want to do.

Thanks.

Graham

Graham Dumpleton

Aug 23, 2007, 6:15:23 AM
to mod...@googlegroups.com
On 23/08/07, Graham Dumpleton <graham.d...@gmail.com> wrote:
> I'll try and dig through tracd code and see if I can see anything that
> might need to be added to ensure clean shutdown.

I can't see tracd doing anything special on shutdown. :-(

> > > Your 'wenv' module, is that a custom thing you made up or is it based
> > > on what others done such as described in:
> > >
> > > http://docs.pythonweb.org/pages/viewpage.action?pageId=5439610
> >
> > They based their code on a blog post of mine actually :). They link to
> > it in their document, but it can be found here:
> >
> > http://swapoff.org/wiki/blog/2007-03-20-activating-a-workingenv-from-python
>
> Okay. Will look again at all this when get a chance. As would like to
> include a good example in mod_wsgi documentation since it is something
> that many may want to do.

One thing I don't understand in your blog post is why you have:

# Add ./lib to linker path
lib_dir = os.path.join(root, './lib')
try:
    os.environ['LD_LIBRARY_PATH'] = \
        os.path.pathsep.join([lib_dir, os.environ['LD_LIBRARY_PATH']])
except KeyError:
    os.environ['LD_LIBRARY_PATH'] = lib_dir

This will not actually achieve anything, as even for modules loaded
after that point the dynamic linker doesn't look in os.environ, and
adding entries to os.environ doesn't result in them being pushed into
the true process environment variables, i.e., it doesn't call
os.putenv(). I'm also not sure that one can change LD_LIBRARY_PATH for
the current process anyway, as the dynamic linker may take a snapshot
only at process start.

The same probably applies to:

# Add ./bin directory to path.
bin_dir = os.path.join(root, './bin')
try:
    os.environ['PATH'] = os.path.pathsep.join([bin_dir,
                                               os.environ['PATH']])
except KeyError:
    os.environ['PATH'] = bin_dir

on Windows boxes, for DLLs and/or the directories searched when using
os.system() etc.

BTW, can you explain to me what:

# Load all distributions into the working set.
from pkg_resources import working_set, Environment

env = Environment(root)
env.scan()

distributions, errors = working_set.find_plugins(env)
for dist in distributions:
    working_set.add(dist)

does? I couldn't find appropriate documentation on the innards of all
this when I went looking some time back, and disabling this section
made no difference, with everything still working.

If you can at least point at some good documentation which describes
what 'distributions' are all about in the context of this stuff, that
would be helpful.

Thanks.

Graham

Alec Thomas

Aug 23, 2007, 9:40:28 AM
to mod...@googlegroups.com
On 8/23/07, Graham Dumpleton <graham.d...@gmail.com> wrote:
> I cant see tracd doing anything special on shutdown. :-(

:(

Ah well, it's not a huge deal.

> One thing I don't understand in your blog post is why you have:
>
> # Add ./lib to linker path
> lib_dir = os.path.join(root, './lib')
> try:
>     os.environ['LD_LIBRARY_PATH'] = \
>         os.path.pathsep.join([lib_dir, os.environ['LD_LIBRARY_PATH']])
> except KeyError:
>     os.environ['LD_LIBRARY_PATH'] = lib_dir
>
> This will not actually achieve anything as even for modules loaded
> after that point, the dynamic linker doesn't look in os.environ and
> adding stuff to os.environ doesn't end up in it being pushed into true
> process environment variables. Ie., doesn't call os.putenv(). I'm also

That's not what http://docs.python.org/lib/os-procinfo.html implies.

> not sure that one can change LD_LIBRARY_PATH for current process
> anyway, as dynamic linker may take a snapshot only at process start.

I'm not sure about this either. I looked around at the time, but
couldn't find any definitive answer. Short of testing it myself, I
decided to just put it in anyway on the off-chance that it would work.

> BTW, can you explain to me what:
>
> # Load all distributions into the working set.
> from pkg_resources import working_set, Environment
>
> env = Environment(root)
> env.scan()
>
> distributions, errors = working_set.find_plugins(env)
> for dist in distributions:
>     working_set.add(dist)
>
> does. I couldn't find the appropriate documentation on the innards of
> all this when I went looking some time back and disabling this section
> made no difference, with everything still working.

That's strange, it doesn't work for me at all. Any eggs installed by
setuptools don't get activated until this code runs.

> If you can at least point at some good documentation which describes
> what 'distributions' are all about in the context of this stuff would
> be helpful.

It's all documented here:

http://peak.telecommunity.com/DevCenter/PkgResources

But TBH, while it's useful API documentation, it's not helpful for
understanding how it all fits together.

My understanding is that a Distribution maps to a single .egg, an
Environment contains Distributions in a single path, and a WorkingSet
brings it all together - doing dependency resolution, handling egg
metadata, etc.

Graham Dumpleton

Aug 24, 2007, 12:51:55 AM
to mod...@googlegroups.com
On 23/08/07, Alec Thomas <al...@swapoff.org> wrote:
> > One thing I don't understand in your blog post is why you have:
> >
> > # Add ./lib to linker path
> > lib_dir = os.path.join(root, './lib')
> > try:
> >     os.environ['LD_LIBRARY_PATH'] = \
> >         os.path.pathsep.join([lib_dir, os.environ['LD_LIBRARY_PATH']])
> > except KeyError:
> >     os.environ['LD_LIBRARY_PATH'] = lib_dir
> >
> > This will not actually achieve anything as even for modules loaded
> > after that point, the dynamic linker doesn't look in os.environ and
> > adding stuff to os.environ doesn't end up in it being pushed into true
> > process environment variables. Ie., doesn't call os.putenv(). I'm also
>
> That's not what http://docs.python.org/lib/os-procinfo.html implies.

Hmmm, so it seems. Stupid me has always looked at the posixmodule.c
code and just seen a dictionary being used, not realising that the
os.py module overrode that behaviour.

This behaviour of calling putenv() when updating os.environ could
actually cause subtle problems in applications using multiple Python
sub-interpreters. I'll have to do some research and thinking, but it
may explain some obscure problems people have seen when using multiple
Django instances in different sub-interpreters under mod_python
(although it could affect mod_wsgi as well).

> > BTW, can you explain to me what:
> >
> > # Load all distributions into the working set.
> > from pkg_resources import working_set, Environment
> >
> > env = Environment(root)
> > env.scan()
> >
> > distributions, errors = working_set.find_plugins(env)
> > for dist in distributions:
> >     working_set.add(dist)
> >
> > does. I couldn't find the appropriate documentation on the innards of
> > all this when I went looking some time back and disabling this section
> > made no difference, with everything still working.
>
> That's strange, it doesn't work for me at all. Any eggs installed by
> setuptools don't get activated until this code runs.

Does it perhaps only apply to eggs which aren't already expanded? From
memory, I was always using expanded eggs.

> > If you can at least point at some good documentation which describes
> > what 'distributions' are all about in the context of this stuff would
> > be helpful.
>
> It's all documented here:
>
> http://peak.telecommunity.com/DevCenter/PkgResources
>
> But TBH, while it's useful API documentation it's not helpful for
> understanding how it all fits together.
>
> My understanding is that a Distribution maps to a single .egg, an
> Environment contains Distributions in a single path, and a WorkingSet
> brings it all together - doing dependency resolution, handling egg
> metadata, etc.

I do remember looking there, but it didn't seem too helpful for
understanding it at a higher level. :-(

I'll just have to try to understand it again, I guess.

Graham

Juergen Brendel

Aug 24, 2007, 5:23:45 PM
to mod...@googlegroups.com

Graham,

Since you know Apache and the capabilities and architecture of mod_wsgi,
I hope you can advise me on whether mod_wsgi (or WSGI in general) is
the right approach for me to take at all:

I am thinking about using mod_wsgi for an application which provides
RESTful web services. Some of the 'services' will eventually run in
separate threads, since they may perform long-running calculations or
produce large amounts of data, which may have to be fed in (HTTP POST)
or read out (HTTP GET).

So, the idea would be for some request to come in, which is then sent
via mod_wsgi to my application, which runs in daemon mode. The
application may create a thread to perform whatever operations are
necessary.

Subsequent requests may be directed by my application to be handled by
the thread, which may get input (provided via POST) or may produce
output (received via GET).

There are a number of issues which appear not necessarily ideally
suited to WSGI in general: very long requests, objects persisting
across requests, but also the need to handle timeouts on sockets and
such.

I have even heard that Apache may have problems with many long-running
requests.

What is your take on this? Is WSGI still suited for such an environment,
or should I move to something like Twisted, and forgo the thought of
having WSGI with an Apache front end?

Thank you very much...

Juergen

jbrendel

Aug 24, 2007, 6:24:44 PM
to modwsgi

[ For some reason, with my first posting of this I managed to change
the subject of an already existing thread in the group. I'm very sorry
about that. So, to give this a proper thread of its own, I am posting
it again as a new topic. My apologies for causing the confusion... ]

Graham Dumpleton

Aug 25, 2007, 12:35:45 AM
to mod...@googlegroups.com
On 23/08/07, Graham Dumpleton <graham.d...@gmail.com> wrote:
> On 23/08/07, Graham Dumpleton <graham.d...@gmail.com> wrote:
> > I'll try and dig through tracd code and see if I can see anything that
> > might need to be added to ensure clean shutdown.
>
> I cant see tracd doing anything special on shutdown. :-(

One thing I forgot to mention: if you have time to do some testing,
add to the WSGI script file the code:

import sys

def cleanup():
    print >> sys.stderr, 'wsgi script file cleanup'

import atexit
atexit.register(cleanup)

When Python is shut down, it will do three steps.

On Python 2.5 the order is:

1. Join with non-daemonised threads and wait for them to complete.
2. Call atexit-registered cleanup callbacks.
3. Clear sys.modules (and possibly other stuff) in an attempt to
trigger destruction of all Python objects.

Using Python 2.5, if that cleanup function above does print something
to the error logs when the daemon process is being shut down, it at
least indicates that shutdown got to step 2.

If it got stuck in step 1, that would be a concern, as it would imply
that some code in the application has created a non-daemonised thread
when it perhaps should have used a daemonised thread, or should have
overridden Thread.join() in a derived class such that, when joining
with threads, the derived join() can first signal the thread in some
way to stop.
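
As a minimal sketch of that last approach (illustrative code only, not
something from Trac or mod_wsgi), a derived thread class whose join()
first signals the thread to stop:

import threading

class StoppableThread(threading.Thread):

    def __init__(self, *args, **kwargs):
        threading.Thread.__init__(self, *args, **kwargs)
        self._stop_event = threading.Event()

    def run(self):
        # Do periodic work until signalled to stop. The wait() call
        # here stands in for whatever real work the thread performs.
        while not self._stop_event.isSet():
            self._stop_event.wait(1.0)

    def join(self, timeout=None):
        # Signal the thread to stop before joining, so interpreter
        # shutdown does not block indefinitely waiting on it.
        self._stop_event.set()
        threading.Thread.join(self, timeout)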

Prior to Python 2.5, things are a bit different, as joining with
non-daemonised threads was actually done from an atexit-registered
cleanup function, so there was no real well-defined order for steps 1
and 2.

If the cleanup function does display something, then that perhaps shows
the process is getting stuck in cleaning up Python objects in general.

If trying to track this down, it is best to set:

LogLevel debug

as that will show more information about where mod_wsgi is up to. I
may want to consider extra debug logging to show the points between
steps 1, 2 and 3, so one can see progress if things don't shut down as
promptly as expected.

I should also perhaps consider adding a directive to allow proper
cleanup to be skipped on process shutdown, to allow a quicker exit of
the process.

Graham

Graham Dumpleton

Aug 25, 2007, 6:17:34 AM
to mod...@googlegroups.com
On 25/08/07, jbrendel <jbre...@gmail.com> wrote:
> Since you know Apache and the capabilities and architecture of
> mod_wsgi,
> I hope you can advise me if mod_wsgi (or WSGI in general) is the right
> approach at all for me to take:
>
> I am thinking about using mod_wsgi for an application, which provides
> RESTful web-services. Some of the 'services' will eventually run in
> separate threads, since they may perform long-running calculations, or
> produce large amounts of data, which may have to be fed in (HTTP POST)
> or read out (HTTP GET).
>
> So, the idea would be for some request to come in, which is then send
> via mod_wsgi to my application, which runs in daemon mode. The
> application may create a thread to perform whatever operations are
> necessary.
>
> Subsequent requests may be directed by my application to be handled by
> the thread, which may get input (provided via POST) or may produce
> output (received via GET).
>
> There are a number of issues, which appear to be not necessarily
> ideally
> suited for WSGI in general: Very long request,

Very long requests are not a problem for WSGI; they are more an issue
for the hosting platform (web server) on which the WSGI adapter is
running.

> objects persisting
> across
> requests,

Again, not a problem for WSGI. Whether this is an issue depends on
whether the hosting platform (web server) uses multiple processes for
handling requests.

> but also the need to handle timeouts on sockets and such.

I'm not entirely sure why you see this as an issue. It does mean,
though, that if communicating with a backend database/application, you
might want it to have the ability to time out. This, though, is going
to apply to any sort of application, not just WSGI applications or
long-running requests.

> I have heard even that Apache may have problems with many, long-
> running
> requests.

As I said, the problem is not WSGI but the hosting platform, which in
this case is Apache.

In short, do not use the 'prefork' MPM for Apache if you are going to
have lots of long-running requests. It would only be practical using
the 'worker' MPM.

The problem with using the 'prefork' MPM is that you have lots of
Apache child processes, each of which can only be doing one thing at a
time. Thus, if you have a long-running request, it will tie up the
whole process, and the process will sit there just chewing up memory
for the life of the request. This applies whether the processing the
request triggers runs within the Apache child process, or is performed
in a distinct process to which the request has been proxied.

If using the 'worker' MPM, when one thread is tied up doing something,
other requests can still be handled in other threads. Thus the memory
of the process is still being utilised.

This waste from using 'prefork' is worst where the Apache child
process is merely proxying the request through to a backend web server,
CGI process, FastCGI process or mod_wsgi daemon process.

> What is your take on this? Is WSGI still suited for such an
> environment,
> or should I move to something like Twisted, and forgo the thought of
> having WSGI with an Apache front end?

WSGI and Apache can be made to work for this. As stated above, the
first thing is simply to avoid the 'prefork' MPM and use the 'worker'
MPM instead, as it allows much better utilisation of processes and
memory.

The next thing to deal with, as far as the length of the request is
concerned, is avoiding any restarts of Apache child processes, as such
a restart would interrupt a request and result in the connection to the
client being dropped. To avoid this, you can't use the
MaxRequestsPerChild directive for Apache to cause process restarts when
a certain number of requests is exceeded. Thus, this should be set to
0, indicating that the processes will never expire.

For the Apache child processes at least, this means you cannot use
MaxRequestsPerChild as a way to reclaim memory for applications which
leak memory over time.
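
As an illustration only (the numbers are placeholders to tune for your
own site, not recommendations), the relevant 'worker' MPM settings
might look like:

<IfModule mpm_worker_module>
StartServers 2
MaxClients 150
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
# Processes never expire, so long running requests are not
# interrupted by child process recycling.
MaxRequestsPerChild 0
</IfModule>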

If using daemon mode of mod_wsgi, you should similarly avoid using the
'maximum-requests' option to the WSGIDaemonProcess directive.

That probably covers the issues related to long-running requests.

The next issue is the persisting of objects across requests. The
problem here in respect of Apache is that Apache is a multiprocess web
server. For an explanation of how processes and threads are used by
Apache/mod_wsgi, read:

As described there, if a web server is multiprocess, then data that
must be accessible from one request to the next, no matter which
process serves a request, must be held in an external database or the
filesystem.

In mod_wsgi, though, there is a way around that. This is to use daemon
mode to create a single daemon process and delegate the whole WSGI
application, or some subset of it, to run within that daemon process.
Because all requests for the application, or for a selected set of
URLs, would always be handled by the same process, in-memory data
structures and caches would persist across requests.
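
A sketch of such a configuration (the group name and paths are made
up):

# A single daemon process, so in-memory data structures and caches
# built up by the application are seen by every request.
WSGIDaemonProcess cached processes=1 threads=15
WSGIScriptAlias / /usr/local/wsgi/scripts/app.wsgi
WSGIProcessGroup cached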

There has been previous discussion on the list about how to use
mod_wsgi daemon processes associated with a subset of the URL
namespace to allow a series of requests to always end up back at the
same process. See:

http://groups.google.com/group/modwsgi/browse_frm/thread/3d1cd0cc6f2d89ae

I've also mentioned the idea of forcing URLs which use a lot of memory
into daemon processes which have a low maximum-requests setting, to
occasionally reset memory back to low levels. One could also delegate
long requests to specific processes where maximum-requests is 0, so the
process never expires. Thus, using multiple daemon process groups and
spreading your application across them allows quite a bit of
flexibility in dealing with how different parts of an application may
have different requirements.
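
For example (a hypothetical split; the group names and URLs are made
up):

# Memory-hungry URLs recycle their process frequently; long running
# URLs go to a process which never expires.
WSGIDaemonProcess heavy processes=2 threads=10 maximum-requests=100
WSGIDaemonProcess longrun processes=1 threads=25

<Location /reports>
WSGIProcessGroup heavy
</Location>

<Location /stream>
WSGIProcessGroup longrun
</Location>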

Now, since it would be possible to use a daemon process to ensure a URL
subset always ends up back at the same process, you could handle your
long-running requests a bit differently, by having an initial request
only initiate the action. Subsequent requests could then poll to see if
the action has completed, and download any result.
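
A toy WSGI sketch of that initiate/poll pattern (illustrative only; it
assumes all requests reach the same single daemon process, and it skips
locking and error handling that real code would need):

import threading
import time

jobs = {}

def long_action(job_id):
    # Stand-in for the real long-running calculation.
    time.sleep(30)
    jobs[job_id] = 'done'

def application(environ, start_response):
    if environ.get('PATH_INFO') == '/start':
        # Initiate the action and hand back a job id immediately.
        job_id = str(int(time.time() * 1000))
        jobs[job_id] = 'running'
        worker = threading.Thread(target=long_action, args=(job_id,))
        worker.setDaemon(True)  # don't block interpreter shutdown
        worker.start()
        body = job_id
    else:
        # Poll with GET /status?<job_id> to check for the result.
        body = jobs.get(environ.get('QUERY_STRING', ''), 'unknown')
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [body]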

Hopefully that gives you a few things to think about, and I haven't
confused too much the idea of a long-running request versus a short
request that initiates a long-running action which subsequent requests
interact with. I wasn't clear on whether you were only using the latter
or a combination of both. Anyway, I have tried to address both issues,
and the prior discussion on the mailing list I referred to may also
help with all of this, with some actual examples of how to possibly
configure mod_wsgi.

Hopefully you will give some feedback on whether this is helpful or
not and, if necessary, seek clarification or ask for example
configuration that may suit your application better. I want to add a
document on long-running requests, interacting with long-running
actions in the same process, and streaming data over a long time, but
feel I need a bit more idea of the sorts of things people try to do
first. :-)

Graham
