Large number of open files and sockets?


JP

Jul 31, 2008, 5:44:30 PM
to modwsgi
Hi,

I'm having some trouble with mod_wsgi holding a seemingly unwarranted
number of open files. The behavior varies somewhat from machine to
machine (all running RHEL4, Apache 2.2, mod_wsgi 2.0 or 2.1), but the
constants are:

1. Each mod_wsgi daemon process has an open fd for every apache log
file. Is this normal?

2. mod_wsgi daemon processes seem to have an awful lot of open
sockets, and the number grows. Even processes running apps that I know
are getting no traffic see their number of open sockets grow and grow
over time -- each one eventually having hundreds to thousands of open
sockets.

Can anyone help me to understand what's going on here? I can post a
sample config or two, but they are nothing special --
WSGIDaemonProcesses are set up with varying settings for the number of
processes and threads, WSGIScriptAlias is used to dispatch URLs, and
WSGIProcessGroup is set in the virtual host or location, as
appropriate. None of the problematic apps are set for more than 2
processes or 100 threads.

Thanks for any help,

JP

Graham Dumpleton

Jul 31, 2008, 7:14:19 PM
to mod...@googlegroups.com
2008/8/1 JP <jpel...@gmail.com>:

>
> Hi,
>
> I'm having some trouble with mod_wsgi holding a seemingly unwarranted
> number of open files. The behavior varies somewhat from machine to
> machine (all running RHEL4, Apache 2.2, mod_wsgi 2.0 or 2.1), but the
> constants are:
>
> 1. Each mod_wsgi daemon process has an open fd for every apache log
> file. Is this normal?

Yes, open log file descriptors are inherited across the fork from the
Apache parent process. This should not present any problem.
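
For what it's worth, that inheritance is just standard POSIX fork()
semantics, nothing mod_wsgi specific. A minimal standalone sketch of
it (not mod_wsgi's own code; the file name is made up for
illustration):

/* inherit.c: cc -o inherit inherit.c */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* The parent opens a "log file" before forking, as the Apache
     * parent process does. */
    int fd = open("/tmp/example.log",
                  O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: the descriptor opened by the parent is still open
         * here, which is exactly why the Apache log descriptors show
         * up in the mod_wsgi daemon processes. */
        (void) write(fd, "child writes via inherited fd\n", 30);
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    close(fd);
    return 0;
}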

> 2. mod_wsgi daemon processes seem to have an awful lot of open
> sockets, and the number grows. Even processes running apps that I know
> are getting no traffic see their number of open sockets grow and grow
> over time -- each one eventually having hundreds to thousands of open
> sockets.
>
> Can anyone help me to understand what's going on here? I can post a
> sample config or two, but they are nothing special --
> WSGIDaemonProcesses are set up with varying settings for # of
> processes and threads, WSGIScriptAlias is used to dispatch urls, and
> WSGIProcessGroup is set in the virtual host or location, as
> appropriate. None of the problematic apps are set for more than 2
> processes or 100 threads.

I would need to see the Apache/mod_wsgi configuration bits and know
what Python web application you are running.

Are you able to work out using 'lsof' or 'ofiles' what the sockets are
connecting to? Are you sure you are looking at the mod_wsgi daemon
processes and not the normal Apache worker child processes? Do you use
'display-name' option with WSGIDaemonProcess so 'ps' can show which
are the mod_wsgi daemon processes?

A sample of 'lsof' or 'ofiles' output for one such process would be
good for helping to work it out.

Graham

JP

Aug 4, 2008, 6:30:34 PM
to modwsgi
Apologies for taking so long to reply with the additional details --
the most problematic machine happened to hit the kernel's file-max
limit just after I posted, and it's taken a while to build up a
suitably impressive number of open socket fds. ;)

> I would need to see the Apache/mod_wsgi configuration bits and know
> what Python web application you are running.

This is the config for a Django app running under mod_wsgi on our qa
server, with some irrelevant bits removed:

<VirtualHost *:80>
    ServerName qa.jack
    DocumentRoot /path/to/jack/webroot/public

    # Application
    WSGIScriptAlias / /path/to/bin/jack.wsgi
    WSGIDaemonProcess jack_public threads=100 processes=2 \
        maximum-requests=10000 home=/var/www display-name=%{GROUP}

    WSGIProcessGroup jack_public
    WSGIScriptReloading off

    <Directory /path/to/jack/webroot/public>
        Include /etc/httpd/access/open.conf
    </Directory>

    <Directory /path/to/bin>
        Options ExecCGI
        Include /etc/httpd/access/open.conf
    </Directory>
</VirtualHost>

The Django app is essentially our app skeleton, and gets no traffic at
all. Immediately after the last full Apache restart, each of its wsgi
procs had under 10 sockets open; now each has around 250. Here's a
small sample, from lsof:

httpd 24343 apache 158u unix 0xccd75300 35386246 /var/run/wsgi.29033.0.1.sock
httpd 24343 apache 159u unix 0xccd75700 35757064 /var/run/wsgi.29038.1.1.sock
httpd 24343 apache 160u unix 0xccd75900 35386248 /var/run/wsgi.29033.0.2.sock
httpd 24343 apache 161u unix 0xe1409cc0 35757491 /var/run/wsgi.29038.2.1.sock
httpd 24343 apache 162u unix 0xccd75500 35386252 /var/run/wsgi.29033.0.3.sock
httpd 24343 apache 163u unix 0xccd75d00 35386255 /var/run/wsgi.29033.0.4.sock

> Are you able to work out using 'lsof' or 'ofiles' what the sockets are
> connecting to?

I don't know how to tell what a socket is connected to, sorry. I do
see that the same sockets seem to be open in many (or all) of the wsgi
processes on the machine, and many of them seem to contain long-dead
PIDs in the socket name. /var/run/ currently contains about 8500
wsgi*.sock files.

> Are you sure you are looking at the mod_wsgi daemon
> processes and not the normal Apache worker child processes? Do you use
> 'display-name' option with WSGIDaemonProcess so 'ps' can show which
> are the mod_wsgi daemon processes?

Yes to all of the above. Anything else I can look at? Any ideas about
what might be going on?

Thanks,

jp

Graham Dumpleton

Aug 4, 2008, 7:40:38 PM
to mod...@googlegroups.com
2008/8/5 JP <jpel...@gmail.com>:

>
> Apologies for taking so long to reply with the additional details --
> the most problematic machine happened to hit the kernel's file-max
> limit just after I posted, and it's taken a while to build up a
> suitably impressive number of open socket fds. ;)
>
>> I would need to see the Apache/mod_wsgi configuration bits and know
>> what Python web application you are running.
>
> This is the config for a Django app running under mod_wsgi on our qa
> server, with some irrelevant bits removed:
>
> <VirtualHost *:80>
> ServerName qa.jack
> DocumentRoot /path/to/jack/webroot/public
>
> # Application
> WSGIScriptAlias / /path/to/bin/jack.wsgi
> WSGIDaemonProcess jack_public threads=100 processes=2 \
>     maximum-requests=10000 home=/var/www display-name=%{GROUP}

I am not sure that I would recommend running as many as 100 threads;
I would normally suggest more processes and fewer threads per process,
as that reduces any impact of the GIL.

I am not concerned that the sockets show up in daemon processes, as
they are inherited across the fork from the Apache parent process. I
would be concerned if they referenced non-existent process IDs. The
PID in that socket path should always be that of the Apache parent
process.

Can you run lsof on the Apache parent process as well and tell me
what the PID of that Apache parent process is?

BTW, elsewhere in the main part of the Apache configuration, are you
setting WSGISocketPrefix at all?

Graham

JP

Aug 5, 2008, 10:17:23 AM
to modwsgi

> I am not concerned that the sockets show up in daemon processes, as
> they are inherited across the fork from the Apache parent process.

I guess it just seems odd to me to keep these (and all of the log
file fds). It's definitely become a problem for us, as it makes it
appear, at least, that mod_wsgi is leaking fds, to the point where it
brings Apache down every couple of days. I know that Passenger (aka
mod_rails), which has a similar mode of operation, does close all
inherited fds in its daemon processes -- is there a reason why
mod_wsgi doesn't?

> I would be concerned if they referenced non-existent process IDs. The
> PID in that socket path should always be that of the Apache parent
> process.

Some of them reference a PID that is no longer active, but most
reference the current active Apache parent PID. Right now, out of
roughly 300 open sockets per wsgi process, 15 reference a PID that no
longer exists; the rest reference the current Apache parent PID.

> Can you run lsof on the Apache parent process as well and tell me
> what the PID of that Apache parent process is?

The Apache parent PID is 29038. It seems to have the same set of
sockets open as all of the wsgi procs.

> BTW, elsewhere in the main part of the Apache configuration, are you
> setting WSGISocketPrefix at all?

No. Should we be?

JP

Graham Dumpleton

Aug 5, 2008, 7:47:01 PM
to mod...@googlegroups.com
2008/8/6 JP <jpel...@gmail.com>:

>
>
>> I am not concerned that the sockets show up in daemon processes, as
>> they are inherited across the fork from the Apache parent process.
>
> I guess it just seems odd to me to keep these (and all of the log
> file fds). It's definitely become a problem for us, as it makes it
> appear, at least, that mod_wsgi is leaking fds, to the point where it
> brings Apache down every couple of days. I know that Passenger (aka
> mod_rails), which has a similar mode of operation, does close all
> inherited fds in its daemon processes -- is there a reason why
> mod_wsgi doesn't?

The Passenger software may appear to have a similar mode of operation,
but it is implemented quite differently, from what I understand. The
Passenger software follows the more traditional fastcgi style of
implementation, performing a fork/exec to run up the actual
application process, whereas mod_wsgi does a fork only. Because a
separate application is exec'd, all that sort of stuff gets cleaned up
automatically.
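
As a rough illustration of the difference (a sketch only, not the
actual code of either project): with a plain fork() every descriptor
the parent holds stays open in the child, whereas any descriptor
marked close-on-exec is closed by the kernel the moment a separate
program is exec'd:

#include <fcntl.h>
#include <unistd.h>

/* Sketch only: why an exec'd daemon starts out "clean" while a
 * fork-only daemon inherits everything. */
void spawn_daemon(const char *daemon_path, int listener_fd)
{
    /* Mark the descriptor close-on-exec: it survives fork(), but is
     * closed automatically by execv(). */
    fcntl(listener_fd, F_SETFD, FD_CLOEXEC);

    if (fork() == 0) {
        /* Fork-only model (mod_wsgi): listener_fd, log files and
         * every other parent descriptor are still open here, and
         * have to be closed one by one if they are not wanted. */

        /* Fork/exec model (fastcgi style, Passenger): after this
         * call the kernel has closed all FD_CLOEXEC descriptors
         * for us. */
        char *const argv[] = { (char *)daemon_path, NULL };
        execv(daemon_path, argv);
        _exit(1); /* Only reached if the exec fails. */
    }
}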

In mod_wsgi, the daemon processes could at least close off all daemon
listener sockets but their own. Log files are a different matter
though. If the WSGIDaemonProcess is outside of all VirtualHost
containers, it wouldn't be able to close any. If the WSGIDaemonProcess
is inside a VirtualHost, then it could close any which are for a
different VirtualHost.

It can't close everything for global scope daemon processes, because
what is delegated to a daemon process is actually dynamic and so you
can't know in advance what you might need. In other words,
applications from any VirtualHost could be delegated to a global
daemon process.

All the same, yes, there is some scope here for doing some cleanup. I
have added a ticket:

http://code.google.com/p/modwsgi/issues/detail?id=94

>> I would be concerned if they referenced non-existent process IDs.
>> The PID in that socket path should always be that of the Apache
>> parent process.
>
> Some of them reference a PID that is no longer active, but most
> reference the current active Apache parent PID. Right now, out of
> roughly 300 open sockets per wsgi process, 15 reference a PID that no
> longer exists; the rest reference the current Apache parent PID.

Still don't understand how that could be. :-(

>> Can you run lsof on the Apache parent process as well and tell me
>> what the PID of that Apache parent process is?
>
> The Apache parent PID is 29038. It seems to have the same set of
> sockets open as all of the wsgi procs.
>
>> BTW, elsewhere in the main part of the Apache configuration, are you
>> setting WSGISocketPrefix at all?
>
> No. Should we be?

No, just wanted to ask in case that code path had some bug in it. I
don't usually use that directive myself when doing development and
testing.

I'll see if I can make a change to close off stuff that isn't needed,
and if you are able, maybe you can try it and see whether that makes
it more obvious what actual problem may exist.

Graham

Graham Dumpleton

Aug 6, 2008, 7:47:58 AM
to mod...@googlegroups.com
2008/8/6 Graham Dumpleton <graham.d...@gmail.com>:

>>> I would be concerned if they referenced non-existent process IDs.
>>> The PID in that socket path should always be that of the Apache
>>> parent process.
>>
>> Some of them reference a PID that is no longer active, but most
>> reference the current active Apache parent PID. Right now, out of
>> roughly 300 open sockets per wsgi process, 15 reference a PID that no
>> longer exists; the rest reference the current Apache parent PID.
>
> Still don't understand how that could be. :-(

Hmmm, now that I look at this on my own system, I see odd things as
well. Doing lsof on the parent process yields:

$ sudo lsof -p 5225 | grep sock
httpd 5225 root 11u unix 0x02780ad0 0t0 /usr/local/apache-2.2.4/logs/wsgi.5224.0.1.sock
httpd 5225 root 12u unix 0x02780150 0t0 /usr/local/apache-2.2.4/logs/wsgi.5224.0.2.sock
httpd 5225 root 13u unix 0x02076bf0 0t0 /usr/local/apache-2.2.4/logs/wsgi.5224.0.3.sock

Logging shows:

[Wed Aug 06 21:15:25 2008] [debug] mod_wsgi.c(8327): mod_wsgi (pid=5225): Socket for 'wsgi' is '/usr/local/apache-2.2.4/logs/wsgi.5224.0.1.sock'.
[Wed Aug 06 21:15:25 2008] [debug] mod_wsgi.c(8327): mod_wsgi (pid=5225): Socket for 'wsgi-2' is '/usr/local/apache-2.2.4/logs/wsgi.5224.0.2.sock'.
[Wed Aug 06 21:15:25 2008] [debug] mod_wsgi.c(8327): mod_wsgi (pid=5225): Socket for 'wsgi-3' is '/usr/local/apache-2.2.4/logs/wsgi.5224.0.3.sock'.

All I can say is that I mustn't understand the Apache startup process
quite as well as I thought. It must be performing a fork at some point
after it first reads the configuration, but before it then starts
initialising things based on that configuration.

Anyway, when a 'restart' is now done, the next version of the sockets
uses the actual parent process ID, rather than whatever process
preceded it. Thus:

$ sudo lsof -p 5225 | grep sock
httpd 5225 root 11u unix 0x02780ad0 0t0 /usr/local/apache-2.2.4/logs/wsgi.5225.1.1.sock
httpd 5225 root 12u unix 0x02780150 0t0 /usr/local/apache-2.2.4/logs/wsgi.5225.1.2.sock
httpd 5225 root 13u unix 0x02076bf0 0t0 /usr/local/apache-2.2.4/logs/wsgi.5225.1.3.sock

Thus, although odd, it should be quite okay. The issue would be if in
one process you had a mix of open sockets from different process IDs,
or from different generations of the parent process; that is, if that
middle number is not the same for all sockets.

If this does occur, then I could only see it being caused by one
thing. When a restart occurs, the listener socket is only closed when
the parent process detects the first daemon process in the group
dying, i.e.:

if (daemon->instance == 1) {
    if (close(daemon->group->listener_fd) < 0) {
        ap_log_error(APLOG_MARK, WSGI_LOG_ERR(errno),
                     wsgi_server, "mod_wsgi (pid=%d): "
                     "Couldn't close unix domain socket '%s'.",
                     getpid(), daemon->group->socket);
    }

    if (unlink(daemon->group->socket) < 0 && errno != ENOENT) {
        ap_log_error(APLOG_MARK, WSGI_LOG_ERR(errno),
                     wsgi_server, "mod_wsgi (pid=%d): "
                     "Couldn't unlink unix domain socket '%s'.",
                     getpid(), daemon->group->socket);
    }
}

If that process was hanging, and the parent process wasn't detecting
it as having shut down properly, it would never close the listener
socket for that daemon process group. Thus it may persist.

So, if you are seeing this, see if you can find daemon processes that
aren't dying completely.

BTW, if you have a box on which to experiment with the same
application, I have checked in, at revision 969 of the Subversion
trunk for mod_wsgi, changes that will close all daemon listener
sockets in Apache worker processes, and close all but its own listener
socket in daemon processes.

This will at least mean you will not see the open sockets in child
processes, but if listener sockets aren't being cleaned up properly in
the parent process, you would still see them accumulate in the parent
process.

Graham

Graham Dumpleton

Aug 6, 2008, 8:24:36 AM
to mod...@googlegroups.com
2008/8/6 Graham Dumpleton <graham.d...@gmail.com>:

I've updated trunk at revision 971 to defer incorporating the process
ID into the name of the listener socket until the point it is being
created, rather than when the configuration is being read. This now
means that the process ID in the path matches the actual Apache parent
process ID, and it doesn't change when a restart is done.

Graham

JP

Aug 6, 2008, 10:48:12 AM
to modwsgi
Wow! I'll get this rev built and installed on our qa server today, and
report back tonight or tomorrow on how it impacts the open fd issues.

Thank you!

JP


Graham Dumpleton

Aug 7, 2008, 6:56:32 AM
to mod...@googlegroups.com
When you do testing, can you pay particular attention to whether
listener sockets have more of a tendency to accumulate when a
'graceful' restart is done, i.e. 'apachectl graceful'.

Graham


JP

Aug 7, 2008, 11:11:21 AM
to modwsgi
That's exactly what happens. With r971, each wsgi process starts out
with 1 socket. On each graceful restart, it accumulates one more for
each wsgi daemon process running at the time of the restart. In the
case of our qa server at this moment, there are 15 daemon processes
running (various apps), so each graceful restart results in an
apparent leak of 15 fds.

JP


Graham Dumpleton

Aug 7, 2008, 8:55:39 PM
to mod...@googlegroups.com
Have logged issue at:

http://code.google.com/p/modwsgi/issues/detail?id=95

The problem goes beyond the leaked file descriptors. It is possible
that on a graceful restart and shutdown the daemon processes aren't
being given an opportunity to shut down in an orderly manner, and
instead Apache is just killing them outright (through some mechanism I
don't yet understand).

So, graceful restart and shutdown potentially have problems, although
it is interesting that no one has noticed that processes are just
getting killed off.

Graham


Graham Dumpleton

Aug 8, 2008, 7:35:34 AM
to mod...@googlegroups.com
A fix for the leaking of the listener socket on a graceful restart is
committed at revision 978 of trunk for 3.0.

I still have to work out the issue around how daemon processes are
being shut down on the graceful restart and shutdown operations.

Testing of the code is appreciated. Look out for UNIX listener socket
files left in the file system, and use lsof to watch for leaking file
descriptors in the Apache parent process. Hopefully neither should now
occur.

Graham


JP

Aug 8, 2008, 11:28:11 AM
to modwsgi
Thank you! r979 is installed on our qa machine, and should get a good
workout today, as a few products should be hitting qa, which means
*many* graceful restarts.

Would you prefer updates here, or on the ticket, or both?

JP


JP

Aug 8, 2008, 6:37:13 PM
to modwsgi
r979 appears to have solved the problem. It's been running on our qa
server all day, through many graceful restarts, and there have been no
socket leaks. Another week or so and there should be a window for us
to get it into production. Thanks very much for nailing this down so
quickly (and for mod_wsgi itself)!

JP

Graham Dumpleton

Aug 9, 2008, 1:42:26 AM
to mod...@googlegroups.com
2008/8/9 JP <jpel...@gmail.com>:

>
> r979 appears to have solved the problem. It's been running on our qa
> server all day, through many graceful restarts, and there have been no
> socket leaks. Another week or so and there should be a window for us
> to get it into production. Thanks very much for nailing this down so
> quickly (and for mod_wsgi itself)!

Hmmm, I was going to suggest that I would backport the changes to
include in mod_wsgi 2.2 and release that, rather than you relying on
working versions from the Subversion repository. If however you are
happy with using that version out of the repository for now, then
fine. That will give me a little bit more time to work out the orderly
shutdown issues with processes and also get it included in version
2.2.

Graham

Graham Dumpleton

Aug 9, 2008, 8:03:23 AM
to mod...@googlegroups.com
Note: if you don't run any daemon mode processes, and thus use
embedded mode only, then Apache child processes will crash. Fixed in
r981.

Graham


JP

Aug 11, 2008, 11:39:24 AM
to modwsgi
On Aug 9, 1:42 am, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:
> 2008/8/9 JP <jpelle...@gmail.com>:
>
> > r979 appears to have solved the problem. It's been running on our qa
> > server all day, through many graceful restarts, and there have been no
> > socket leaks. Another week or so and there should be a window for us
> > to get it into production. Thanks very much for nailing this down so
> > quickly (and for mod_wsgi itself)!
>
> Hmmm, I was going to suggest that I would backport the changes to
> include in mod_wsgi 2.2 and release that, rather than you relying on
> working versions from the Subversion repository. If however you are
> happy with using that version out of the repository for now, then
> fine. That will give me a little bit more time to work out the
> orderly shutdown issues with processes and also get it included in
> version 2.2.

A 2.2 release would be much better than running production out of
svn; I just got a little over-excited about the fix there for a
minute. ;) Since our rate of graceful restarts on production is far
lower than on qa, we can wait for a full 2.2 release, if it's going to
be not much more than a few weeks away. Do you have a date you're
shooting for, and is there anything we can do to help (aside from lots
of testing)?

Thanks again,

JP

Graham Dumpleton

Aug 11, 2008, 7:39:22 PM
to mod...@googlegroups.com
2008/8/12 JP <jpel...@gmail.com>:

I had actually found some time yesterday to backport the changes, so
can you test the code by checking out from the 2.2 branch instead of
from trunk?

https://modwsgi.googlecode.com/svn/branches/mod_wsgi-2.X

If your testing doesn't raise any issues, I will checkpoint it and
release it as 2.2.

Graham

Hongli Lai

Aug 13, 2008, 7:34:21 AM
to modwsgi
On Aug 6, 1:47 am, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:
> The Passenger software may appear to have a similar mode of
> operation, but it is implemented quite differently, from what I
> understand. The Passenger software follows the more traditional
> fastcgi style of implementation, performing a fork/exec to run up the
> actual application process, whereas mod_wsgi does a fork only.
> Because a separate application is exec'd, all that sort of stuff gets
> cleaned up automatically.

Speaking about Phusion Passenger: during its development we actually
found a conflict with mod_wsgi. Phusion Passenger spawns a daemon
which we call the "application pool server". This daemon is supposed
to exit when the Apache control process exits or when it is
restarting. The Apache control process is connected to the daemon via
an anonymous Unix socket. Apache child processes close this file
descriptor immediately, so only the control process has the
connection. The daemon will exit when this connection is closed by the
other side.

When Apache is restarting, Phusion Passenger will close the connection
and waitpid() the application pool server. But mod_wsgi's daemon
doesn't close file descriptors, so it keeps this connection open. As a
result the application pool server will not exit, so we had to write
some timeout code which forcefully kills the application pool server
if it doesn't exit within a specified interval.

Regards,
Hongli Lai

Graham Dumpleton

Aug 13, 2008, 7:43:56 AM
to mod...@googlegroups.com
2008/8/13 Hongli Lai <hong...@gmail.com>:

I saw a comment in your code about that.

The problem is that mod_wsgi daemon mode processes can't just go
arbitrarily closing file descriptors, as they have no idea what the
descriptors are for, and the descriptors may be needed for the correct
operation of the Apache internals mod_wsgi uses. Haven't got time to
explain now; more another day.

Graham

Hongli Lai

Aug 13, 2008, 10:12:04 AM
to modwsgi
On Aug 13, 1:43 pm, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:
> I saw a comment in your code about that.
>
> The problem is that mod_wsgi daemon mode processes can't just go
> arbitrarily closing file descriptors, as they have no idea what the
> descriptors are for, and the descriptors may be needed for the
> correct operation of the Apache internals mod_wsgi uses. Haven't got
> time to explain now; more another day.

In Phusion Passenger, after a fork() we close all file descriptors
except a few that are on a whitelist. This works very well. But if the
WSGI daemon uses the Apache logging functions or other Apache API
calls, then I suppose you can't just go around and close everything
you don't recognize.

JP

Aug 13, 2008, 3:02:51 PM
to modwsgi
Sorry for the late reply -- busy week this week. I'll try to get the
2.X branch built and onto our qa server tomorrow.

Thanks again,

JP


Graham Dumpleton

Aug 13, 2008, 8:00:31 PM
to mod...@googlegroups.com
2008/8/14 Hongli Lai <hong...@gmail.com>:

Still no time to sit down properly and respond because of work load,
but yes, that is part of it.

BTW, at a guess, I don't believe your problem would be specific to
mod_wsgi; it could also occur with mod_cgid and mod_fastcgi, as both
fork off daemons from the Apache parent process, and neither, from
memory, goes around and closes file descriptors which may have been
set up by other modules.

Also, using a timeout may not be the best way of dealing with it, but
I really need to sit down and work out what your code is actually
doing and what the issue is before I pass that judgement.

Sorry, I will say more later when I get some time.

Graham

Graham Dumpleton

Aug 15, 2008, 3:35:04 AM
to mod...@googlegroups.com
It's Friday afternoon and I'm sick of working. Time maybe to follow
up on some emails. :-)

2008/8/13 Hongli Lai <hong...@gmail.com>:

Relying only on the socket being closed is not a good idea, as you
will never be able to control what other Apache modules do. As
mentioned before, what you are seeing isn't going to be specific to
mod_wsgi and its daemon processes; it will also be an issue with the
master processes created by mod_cgid, mod_fastcgi and perhaps other
modules.

If you look at how Apache itself handles its pipe of death for worker
processes, it doesn't rely only on the socket being closed. Instead it
will actually write a character onto the pipe. When a process
monitoring the pipe reads a character off it, it knows it is time to
die. In your case your application pool server would be the only
process reading from the socket, so the only thing which would get the
notification. This will work whether or not some other process is
somehow holding the other end of the socket open.
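
The basic shape of that pattern, as a sketch of the idea only (not
Apache's actual code), is something like:

#include <poll.h>
#include <unistd.h>

/* The parent holds pod_write; each monitored process holds pod_read. */

/* Parent side: tell the monitoring processes it is time to die. */
void pod_signal(int pod_write)
{
    char ch = '!';
    (void) write(pod_write, &ch, 1);
}

/* Daemon side: block until either real work or the pipe of death
 * becomes readable. Returns 1 when the process should exit. */
int pod_check(int pod_read, int work_fd)
{
    struct pollfd fds[2] = {
        { .fd = pod_read, .events = POLLIN },
        { .fd = work_fd, .events = POLLIN },
    };

    if (poll(fds, 2, -1) < 0)
        return 0;

    /* A character arriving on the pipe means it is time to die,
     * whether or not some unrelated process still holds the other
     * end of the socket open. */
    return (fds[0].revents & (POLLIN | POLLHUP)) != 0;
}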

To better understand how Apache uses its pipe of death, have a look
at server/mpm_common.c in the Apache source code. A document which is
also very helpful in understanding how Apache works is:

http://www.fmc-modeling.org/category/projects/apache/amp/Apache_Modeling_Project.html

As to why mod_wsgi can't close all file descriptors: yes, it is in
part because of the need to keep the Apache log file descriptors open,
but that isn't the real problem. The problem is knowing which file
descriptors are the ones it can close.

Under normal circumstances, if an Apache module has a need to ensure
that file descriptors are closed in the context of an Apache child
worker process, it would register a child init handler, which Apache
would trigger in the child, for the module to do it. Thus triggering
the child init handlers would be the only way of ensuring any cleanup
is done before worker processes start doing anything. Unfortunately,
in a mod_wsgi daemon process, triggering the child init handlers would
have other, more undesirable consequences.

As an example, imagine mod_python were being loaded at the same time
and someone had used PythonImport to preload a lot of Python code into
the worker child processes. Because the imports defined by
PythonImport are done in the child init handler for mod_python, if
mod_wsgi triggered child init handlers it would cause the imports to
happen in the mod_wsgi daemon process, even though the process isn't
being used for executing stuff in the context of mod_python.

So, you are potentially damned if you don't call child init handlers,
in case they do some important cleanup, and damned if you do call
them, because then all sorts of resource-consuming code related to
other Apache modules could be executed.

As such, mod_wsgi does not call child init handlers. This behaviour
is exactly the same as for mod_cgid and mod_fastcgi, which is why you
would have the same problem there, and why you just need to find a
better way of dealing with it.

For your specific problem, actually writing characters onto your
equivalent of the pipe of death should be sufficient, and would be
better than a timeout, although factoring a timeout in as well may
also be a good idea to ensure your pool server does actually die.

If you look at the Apache internals, there are actually two ways in
which restarts can occur. The first is a normal restart, and what
Apache does there is send a SIGTERM to any child processes it was
asked to monitor. While the child processes haven't died, it will keep
sending SIGTERM every second, giving up after a few seconds and
sending a SIGKILL to make sure they die.

The other type of restart is a graceful restart. In this case it
doesn't send the SIGTERM, and uses the pipe of death instead. That way
the processes know they are meant to die, but can delay exiting a bit
until they have finished handling current requests.

The odd one is the graceful restart, which the OP's original problem
has revealed as an issue for mod_wsgi daemon processes, something I
didn't realise before. Since the SIGTERM sequence isn't sent, even to
processes which aren't Apache child worker processes, such as the
mod_wsgi daemon processes which Apache manages, the mod_wsgi daemon
processes don't get an opportunity to do an orderly shutdown and
trigger Python atexit handlers, destroy interpreters, etc. Instead, it
seems the mod_wsgi daemon processes get a SIGKILL at the point that
the Apache configuration memory pool, against which the process
structures were allocated, is destroyed.

At least, this is what I think is happening; I still need to find
some time to properly work through it. What I do know at this point is
that the daemon processes do get killed and replaced on a graceful
restart, but there is nothing logged in the Apache error logs in
relation to an orderly shutdown. At this point I am not sure whether
that is because they are just getting a SIGKILL when the memory pool
is destroyed, or whether the Apache error logs get closed off somehow
such that logging isn't being done.

Anyway, I have a bit of work to sort this out. It seems all I might
be able to do is register my own cleanup function against the
configuration memory pool and try to send SIGTERM first myself. The
other option is to use a pipe of death myself and somehow register it
so Apache triggers it automatically.

I am not sure that the latter is possible, however, as that is all
controlled from the main loop of the specific MPM used. Like the
signal, the pipe of death could be triggered from destruction of the
configuration memory pool though. Either way, I need to work out how
to prevent the SIGKILL getting sent, if it is. For normal Apache
worker processes it doesn't appear to be, so they can survive until
they want to die, but not so for the mod_wsgi daemon processes at the
moment; so even if I can trigger an orderly shutdown, they may not get
much time to complete it.
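
In rough terms, the first option would look something like this. This
is a sketch of the idea only, not a tested change, and 'daemon_pid'
just stands in for however the process would really be tracked:

#include <signal.h>
#include <sys/types.h>
#include "apr_pools.h"

static pid_t daemon_pid;

/* Run by APR when the configuration pool is torn down: ask the
 * daemon to shut down in an orderly fashion before anything gets a
 * chance to SIGKILL it. */
static apr_status_t wsgi_terminate_daemon(void *data)
{
    (void) data;
    kill(daemon_pid, SIGTERM);
    return APR_SUCCESS;
}

static void wsgi_register_cleanup(apr_pool_t *pconf)
{
    apr_pool_cleanup_register(pconf, NULL, wsgi_terminate_daemon,
                              apr_pool_cleanup_null);
}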

Graham

Hongli Lai

Aug 15, 2008, 7:38:36 AM
to modwsgi
On Aug 15, 9:35 am, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:
>  http://www.fmc-modeling.org/category/projects/apache/amp/Apache_Model...
> [...]

Very interesting read. :) The tip that you gave about writing a
character into the pipe is a good one.


> If you look at the Apache internals, there are actually two ways in
> which restarts can occur. The first is a normal restart, and what
> Apache does there is send a SIGTERM to any child processes it was
> asked to monitor. While the child processes haven't died, it will
> keep sending SIGTERM every second, giving up after a few seconds and
> sending a SIGKILL to make sure they die.

Yes, we do something similar as well. First we close the socket, then
we send SIGINT every 100 msec, and if after 5 seconds the daemon still
hasn't exited, we send SIGTERM. We actually use both the socket and
SIGINT for graceful termination. SIGINT interrupts any blocking system
calls in the daemon, and in the daemon we check for such
interruptions. We've written a C++ framework to facilitate this; it
gives us a Java-style interruption API.
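
Reduced to plain C, the core of the pattern is just a flag plus EINTR
handling. This is a sketch only, not our actual framework code:

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t interrupted = 0;

static void on_sigint(int sig)
{
    (void) sig;
    interrupted = 1;
}

int serve(int conn_fd)
{
    /* No SA_RESTART, so blocking calls return -1/EINTR on SIGINT. */
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    char buf[4096];
    while (!interrupted) {
        ssize_t n = read(conn_fd, buf, sizeof(buf));
        if (n < 0 && errno == EINTR)
            continue;   /* Interrupted: loop condition re-checks flag. */
        if (n <= 0)
            break;      /* Error or end of stream. */
        /* ... handle n bytes of request data ... */
    }
    return interrupted; /* Non-zero: asked to terminate gracefully. */
}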

Graham Dumpleton

Aug 15, 2008, 7:53:30 AM
to mod...@googlegroups.com
2008/8/15 Hongli Lai <hong...@gmail.com>:

>
> On Aug 15, 9:35 am, "Graham Dumpleton" <graham.dumple...@gmail.com>
> wrote:
>> http://www.fmc-modeling.org/category/projects/apache/amp/Apache_Model...
>> [...]
>
> Very interesting read. :) The tip that you gave about writing a
> character into the pipe is a good one.

I've just started doing some more reading about graceful restart; in
the documentation I pointed you at, it says:

"""Each time the master server processes the restart loop, it
increments the generation ID. All child servers it creates have this
generation ID in their scoreboard entry. Whenever a child server
completes the handling of a request, it checks its generation ID
against the global generation ID. If they don't match, it exits the
request-response loop and terminates.
This behavior is important for the graceful restart."""

According to that, there is more again to graceful restart than what
I said, and I could be a bit wrong about the pipe of death. It may
come more into play when the Apache parent wants to cull some of the
child processes, rather than all of them. Ensuring that all die on a
graceful restart may depend more on the generation ID.

The problem is that it is all quite hard to work out from the code
without a lot of study. :-(

>> If you look at the Apache internals, there are actually two ways
>> in which restarts can occur. The first is a normal restart, and what
>> Apache does there is send a SIGTERM to any child processes it was
>> asked to monitor. While the child processes haven't died, it will
>> keep sending SIGTERM every second, giving up after a few seconds and
>> sending a SIGKILL to make sure they die.
>
> Yes, we do something similar as well. First we close the socket,
> then we send SIGINT every 100 msec, and if after 5 seconds the daemon
> still hasn't exited, we send SIGTERM. We actually use both the socket
> and SIGINT for graceful termination. SIGINT interrupts any blocking
> system calls in the daemon, and in the daemon we check for such
> interruptions. We've written a C++ framework to facilitate this; it
> gives us a Java-style interruption API.

One of the issues with relying on signals, when using an exec model
to create daemon processes, is that if the daemon process code is all
provided by the user and relies on something like flup for
fastcgi/scgi/ajp, or other management shell code, it is too easy for
user code to screw things up by registering signal handlers that block
or intercept the signals the management shell code is using to provide
a graceful shutdown mechanism.

I haven't looked at the code for your stuff, to understand how you
handle the daemon side after an exec, so I don't know whether this is
an issue for you. In mod_wsgi it actually replaces signal.signal() in
Python, so Python code can't go screwing things up by registering
signal handlers for signals you care about. Certain CherryPy versions
want to install signal handlers by default, even when CherryPy might
be used in an embedded system. If it isn't prevented from doing this,
the daemon processes don't shut down in an orderly manner and have to
be SIGKILL'd. Of course, if someone wants to write a C extension
module which installs a signal handler, there is not much you can do
about it.

Graham
