Sending specific requests to specific Python instances?

Juergen Brendel

Jul 31, 2007, 3:09:31 PM

to mod...@googlegroups.com

Hello!

I'm using mod_wsgi in daemon mode, with a bunch of processes serving my
app.

Problem description
-------------------
I need to implement a sort of 'session' capability. This means that
first a request to my Python app may create state, and for a second,
subsequent request I need to get to that state again. The 'state' here
is an ongoing computation (running in a thread in one of the Python
processes), so this cannot readily be solved by having a special
centralized storage for shared state (a separate process or database of
sorts). I need to get back to the same Python instance, so that I can
get to that still ongoing computation.

So, basically, is there a way to determine the particular Python
interpreter to which an incoming request should be scheduled?

A hack-ish solution
-------------------
Not being much of an Apache expert, I came up with the following hack
for now, which doesn't look very beautiful, though:

1. Make multiple identical copies (or create hardlinks) of the
application in different directories.
2. Then do something like this:

WSGIScriptAlias /myapp_0 /var/www/wsgi_0/hello.py
WSGIScriptAlias /myapp_1 /var/www/wsgi_1/hello.py

WSGIDaemonProcess foo processes=1 threads=10
WSGIDaemonProcess bar processes=1 threads=10

<Directory /var/www/wsgi_0/>
WSGIProcessGroup foo
</Directory>

<Directory /var/www/wsgi_1/>
WSGIProcessGroup bar
</Directory>

This forces the processing into different process groups, based on the
directory prefix. For each process group, we specify that only one
process should run and which executable should be used.

So, if I create my computing resource under /myapp_0, for example, then
I can just make sure that subsequent requests are also directed
to /myapp_0.

The problem of course is that I have to copy the executable around and
that I need to introduce multiple paths (/myapp_0 and /myapp_1) in my
application. Also, this does not seem to be very elegant.

A better solution?
------------------
It would be great if the process could discover a 'process ID' of sorts
and then allow scheduling to specific processes based on the process ID
in the URL (or even in some HTTP header field). Even in absence of the
process ID discovery, just something like this (with some added ugly
syntax) would be helpful:

WSGIDaemonProcess foo processes=2 threads=10

WSGIScriptAlias /myapp /var/www/wsgi/hello.py
WSGIScriptAlias /myapp_0 /var/www/wsgi/hello.py[process=0]
WSGIScriptAlias /myapp_1 /var/www/wsgi/hello.py[process=1]

<Directory /var/www/wsgi/>
WSGIProcessGroup foo
</Directory>

Here I did not have to specify different directories. I just specified
the same application program for any request into the /var/www/wsgi
directory, but I also specified that particular path prefixes should be
mapped to particular processes of the process group that is serving
those requests. This would be a great start!

Is there anything like this (or better) that can be done already?

Some wishful thinking
---------------------
As an aside, would it be possible to just 'script alias' to the process
group, rather than the individual application? This of course only makes
sense if I only have a single application which makes up the process
group:

WSGIScriptAlias /myapp foo
WSGIScriptAlias /myapp_0 foo[process=0]
WSGIScriptAlias /myapp_1 foo[process=1]

I think this would be ideal for my special case...

Thank you very much...

Juergen Brendel

Graham Dumpleton

Jul 31, 2007, 9:02:26 PM

to mod...@googlegroups.com

On 01/08/07, Juergen Brendel <jbre...@gmail.com> wrote:
>
>
> Hello!
>
> I'm using mod_wsgi in daemon mode, with a bunch of processes serving my
> app.
>
>
> Problem description
> -------------------
> I need to implement a sort of 'session' capability. This means that
> first a request to my Python app may create state, and for a second,
> subsequent request I need to get to that state again. The 'state' here
> is an ongoing computation (running in a thread in one of the Python
> processes), so this cannot readily be solved by having a special
> centralized storage for shared state (a separate process or database of
> sorts). I need to get back to the same Python instance, so that I can
> get to that still ongoing computation.

A common problem at that. For those who may not understand what the
issue may be, some information can be found in:

http://code.google.com/p/modwsgi/wiki/ProcessesAndThreading

I haven't finished this document yet and still need to cover daemon
processes, but the distinction there is similar to the one between the
prefork and worker MPMs in the way you can set them up.

> So, basically, is there a way to determine the particular Python
> interpreter to which an incoming request should be scheduled?

There are some ways. Some are slightly hackish as you have found, but
I'll explain a couple of other ways which may make it slightly
cleaner. I'll also mention something I have been thinking about for
future companion module to mod_wsgi.

> A hack-ish solution
> -------------------
> Not being much of an Apache expert, I came up with the following hack
> for now, which doesn't look very beautiful, though:
>
> 1. Make multiple identical copies (or create hardlinks) of the
> application in different directories.
> 2. Then do something like this:
>
> WSGIScriptAlias /myapp_0 /var/www/wsgi_0/hello.py
> WSGIScriptAlias /myapp_1 /var/www/wsgi_1/hello.py
>
> WSGIDaemonProcess foo processes=1 threads=10
> WSGIDaemonProcess bar processes=1 threads=10

Do note that there is a subtle difference between using:

WSGIDaemonProcess foo processes=1 threads=10

and:

WSGIDaemonProcess foo threads=10

Quoting from configuration directives documentation:

"""Note that if this option is defined as 'processes=1', then the WSGI
environment attribute called 'wsgi.multiprocess' will be set to be
True whereas not providing the option at all will result in the
attribute being set to be False. This distinction is to allow for
where some form of mapping mechanism might be used to distribute
requests across multiple process groups and thus in effect it is still
a multiprocess application. If you need to ensure that
'wsgi.multiprocess' is False so that interactive debuggers will work,
simply do not specify the 'processes' option and allow the default
single daemon process to be created in the process group."""
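
To check which value an application actually sees, a small WSGI script
along the following lines could stand in for hello.py. This is only a
sketch: 'wsgi.multiprocess' is the standard WSGI key, whereas reading
'mod_wsgi.process_group' assumes mod_wsgi passes the daemon process
group name under that key in the request environment.

```python
import os


def application(environ, start_response):
    # Standard WSGI flag: True when the same application may also be
    # running in another process at the same time.
    multiprocess = environ.get('wsgi.multiprocess')
    # Assumption: mod_wsgi exposes the daemon process group name under
    # this key; fall back to a placeholder if it is absent.
    group = environ.get('mod_wsgi.process_group', '(unknown)')
    body = 'pid=%d group=%s multiprocess=%s\n' % (
        os.getpid(), group, multiprocess)
    data = body.encode('utf-8')
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(data)))])
    return [data]
```

Requesting the same URL repeatedly and comparing the reported pid
should show whether requests are staying within one process.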

> <Directory /var/www/wsgi_0/>
> WSGIProcessGroup foo
> </Directory>
>
> <Directory /var/www/wsgi_1/>
> WSGIProcessGroup bar
> </Directory>
>
> This forces the processing into different process groups, based on the
> directory prefix. For each process group, we specify that only one
> process should run and which executable should be used.
>
> So, if I create my computing resource under /myapp_0, for example, then
> I can just make sure that subsequent requests are also directed
> to /myapp_0.
>
> The problem of course is that I have to copy the executable around and
> that I need to introduce multiple paths (/myapp_0 and /myapp_1) in my
> application. Also, this does not seem to be very elegant.

A slightly cleaner variation on this which avoids needing to create
two copies of your application is:

WSGIScriptAlias /myapp_0 /var/www/wsgi/hello.py
WSGIScriptAlias /myapp_1 /var/www/wsgi/hello.py

WSGIDaemonProcess foo threads=10
WSGIDaemonProcess bar threads=10

<Location /myapp_0>
WSGIProcessGroup foo
</Location>

<Location /myapp_1>
WSGIProcessGroup bar
</Location>

In this example, two different URLs are mapped to the same application
script file. A Location directive (instead of Directory) is then used
to match the URL and set the process group differently.

> A better solution?
> ------------------
> It would be great if the process could discover a 'process ID' of sorts
> and then allow scheduling to specific processes based on the process ID
> in the URL (or even in some HTTP header field). Even in absence of the
> process ID discovery, just something like this (with some added ugly
> syntax) would be helpful:
>
>
> WSGIDaemonProcess foo processes=2 threads=10
>
> WSGIScriptAlias /myapp /var/www/wsgi/hello.py
> WSGIScriptAlias /myapp_0 /var/www/wsgi/hello.py[process=0]
> WSGIScriptAlias /myapp_1 /var/www/wsgi/hello.py[process=1]
>
> <Directory /var/www/wsgi/>
> WSGIProcessGroup foo
> </Directory>

I can think of a number of different ways of doing this, but let's show
one first where everything is done in the configuration file, but
which uses a few more processes than perhaps necessary.

# Create process group for main application.
WSGIDaemonProcess main processes=2 threads=10

# Create process groups (one process each) for jobs.
WSGIDaemonProcess jobs/1 threads=10
WSGIDaemonProcess jobs/2 threads=10

# Match /myapp or /myapp_[0-9] to script file.
WSGIScriptAliasMatch /myapp(_[0-9])? /var/www/wsgi/hello.py

# Set the default process group.
SetEnv PROCESS_GROUP main

# Override default process group where URL matches job.
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/myapp_([0-9])/
RewriteRule . - [E=PROCESS_GROUP:jobs/%1]

# Set the process group from request environment variable.
WSGIProcessGroup %{ENV:PROCESS_GROUP}

The tricky bit here is that by passing a %{ENV} value to
WSGIProcessGroup, one can set the process group based on the value of a
request environment variable. In the example above we use some rewrite
rules to match URLs on the fly, extracting part of the URL and using
that as part of the request environment variable which selects which
process group handles that request.
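
The selection logic in that configuration can be mirrored in a few
lines of Python, which may help when checking which URLs end up in
which group. This is only an illustration of the rules as written
above, not something mod_wsgi runs; the function name is made up for
the example.

```python
import re

# Same pattern as the RewriteCond in the configuration above.
JOB_URL = re.compile(r'^/myapp_([0-9])/')


def select_process_group(request_uri, default='main'):
    """Return the process group name the rules above would choose."""
    match = JOB_URL.match(request_uri)
    if match:
        # Equivalent of E=PROCESS_GROUP:jobs/%1, where %1 is the digit
        # captured by the RewriteCond.
        return 'jobs/' + match.group(1)
    # Equivalent of "SetEnv PROCESS_GROUP main".
    return default
```

Note that the pattern only matches when a slash follows the digit, so
a request for /myapp_1 with nothing after it would fall through to the
default group.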

In this example there is a separation between the main application
processes and those for jobs. Rather than waste processes like that,
though, you could instead combine them by using a rewrite map that
randomly selects a process for us.

# Create whole lot of daemon processes. Use processes=1 so that
# wsgi.multiprocess is still set to True for main application.
WSGIDaemonProcess app/1 processes=1 threads=10
WSGIDaemonProcess app/2 processes=1 threads=10

# Match /myapp or /myapp_[0-9] to script file.
WSGIScriptAliasMatch /myapp(_[0-9])? /var/www/wsgi/hello.py

RewriteEngine On

# Select default process group randomly from set of all processes.
RewriteMap wsgiprocmap rnd:/var/www/wsgi/map.txt
RewriteRule . - [E=PROCESS_GROUP:${wsgiprocmap:app}]

# Override default process group where URL matches job.
RewriteCond %{REQUEST_URI} ^/myapp_([0-9])/
RewriteRule . - [E=PROCESS_GROUP:jobs/%1]

# Set the process group from request environment variable.
WSGIProcessGroup %{ENV:PROCESS_GROUP}

Here we create a bunch of independent daemon process groups, each with
a single process. We set processes=1 because, for the main application,
we merge them together using a rewrite map with random selection, and
so we want the application to still know it is multiprocess. The
contents of the map.txt file would be:

app app/1|app/2

In other words, random selects one process group or the other. Rest is
similar to before.
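
For anyone unfamiliar with the 'rnd' map type, its effect on a file
like map.txt can be sketched in Python as follows. This only imitates
the behaviour described here (one key, alternatives separated by '|',
one alternative picked at random); the function names are invented for
the example.

```python
import random


def parse_map(text):
    """Parse map lines of the form 'key alt1|alt2|...' into a dict."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        key, value = line.split(None, 1)
        entries[key] = value.split('|')
    return entries


def lookup(entries, key, rng=random):
    """Pick one of the alternatives for key at random."""
    return rng.choice(entries[key])


entries = parse_map('app app/1|app/2\n')
group = lookup(entries, 'app')  # either 'app/1' or 'app/2'
```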

Now I have followed your URL scheme here, but you could also use other
URL schemes. For example, I would prefer the process group number/name
to be a distinct part of the URL, i.e.:

/app
/app/0
/app/1

You might even have the entry point be /app and then immediately
redirect them to sub URL for a specific process.
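
A rough sketch of such a redirect, done inside the application itself,
might look like the following. It assumes the application is mounted
at /app, that the process groups are named like 'app/1', and that
mod_wsgi passes the group name in the request environment as
'mod_wsgi.process_group'; none of that is required by the
configuration above, so treat the names as placeholders.

```python
from wsgiref.util import application_uri


def application(environ, start_response):
    # Assumption: the daemon process group name (for example 'app/1')
    # is available under this key.
    group = environ.get('mod_wsgi.process_group', '')
    path_info = environ.get('PATH_INFO', '')

    if path_info in ('', '/') and '/' in group:
        # Entry point was requested: send the client to the sub URL
        # for the process group that happened to handle this request.
        number = group.rsplit('/', 1)[1]
        location = application_uri(environ).rstrip('/') + '/' + number + '/'
        start_response('302 Found', [('Location', location),
                                     ('Content-Length', '0')])
        return [b'']

    data = ('handled by process group %s\n'
            % (group or '(unknown)')).encode('utf-8')
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(data)))])
    return [data]
```

Subsequent requests to the sub URL would then be pinned to that group
by rewrite rules of the kind shown earlier.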

> Here I did not have to specify different directories. I just specified
> the same application program for any request into the /var/www/wsgi
> directory, but I also specified that particular path prefixes should be
> mapped to particular processes of the process group that is serving
> those requests. This would be a great start!
>
> Is there anything like this (or better) that can be done already?

Using rewrite rules to select on parts of the URL and dispatch to
different processes is certainly one way. Since rewrite rules can also
see %{QUERY_STRING} one could use URL query string arguments as well.

Unfortunately rewrite rules don't have an easy way of using cookies,
except perhaps by writing the cookie string out to an external program
which returns the name of a process group based on that. See the 'prg'
rewrite map type for that.
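
As a sketch of what such an external program might look like: a 'prg'
map program reads one lookup key per line on standard input and writes
one answer per line on standard output, with the string NULL meaning
no match. The version below assumes the key passed in is the raw
Cookie header and that the application stores the group name in a
cookie called 'wsgi_process_group'; both the cookie name and the list
of groups are made up for the example.

```python
import sys
from http.cookies import SimpleCookie, CookieError

COOKIE_NAME = 'wsgi_process_group'  # hypothetical cookie name
KNOWN_GROUPS = {'app/1', 'app/2'}   # groups defined in the Apache config


def lookup(cookie_header):
    """Map a raw Cookie header to a process group name, or 'NULL'."""
    try:
        cookie = SimpleCookie(cookie_header)
    except CookieError:
        return 'NULL'
    morsel = cookie.get(COOKIE_NAME)
    # Only accept names we know about, so a client cannot pick an
    # arbitrary process group by forging the cookie.
    if morsel is not None and morsel.value in KNOWN_GROUPS:
        return morsel.value
    return 'NULL'


def serve(stdin, stdout):
    """Answer one lookup per input line, flushing after each reply."""
    for line in stdin:
        stdout.write(lookup(line.rstrip('\n')) + '\n')
        stdout.flush()
```

A real script would finish by calling serve(sys.stdin, sys.stdout).
The flush after every reply matters because the web server waits for
each answer before continuing with the request.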

What I have been thinking about for a future companion module for
mod_wsgi is something that can look at cookies and from that set a
request note picked up by %{ENV} which selects a particular daemon
process group. That way the application itself could set a cookie
which on the next request would anchor the user's requests to a
particular daemon process.

> Some wishful thinking
> ---------------------
> As an aside, would it be possible to just 'script alias' to the process
> group, rather than the individual application? This of course only makes
> sense if I only have a single application which makes up the process
> group:
>
> WSGIScriptAlias /myapp foo
> WSGIScriptAlias /myapp_0 foo[process=0]
> WSGIScriptAlias /myapp_1 foo[process=1]
>
> I think this would be ideal for my special case...

Hopefully the examples above provide viable alternatives for you to play with.

One could certainly do some interesting things with all of this, just
might need to add a bit more stuff in future mod_wsgi or companion
modules to make it easier to make use of.

Graham

Graham Dumpleton

Jul 31, 2007, 10:37:24 PM

to mod...@googlegroups.com

Whoops, in the second example the rule that overrides the process group
for job URLs should use 'app' not 'jobs', that is,
E=PROCESS_GROUP:app/%1.

Juergen Brendel

Jul 31, 2007, 10:46:09 PM

to mod...@googlegroups.com, Graham Dumpleton

I tried to send this e-mail earlier already, but for some reason it
never showed up on the group. So please allow me to resend...

- - - - - - - - - - - - - - -

Hello!

I'm using mod_wsgi in daemon mode, with a bunch of processes serving my
app.

So, basically, is there a way to determine the particular Python
interpreter to which an incoming request should be scheduled?