Yes, open log file descriptors are inherited across the fork from the
Apache parent process. This should not present any problem.
> 2. mod_wsgi daemon processes seem to have an awful lot of open
> sockets, and the number grows. Even processes running apps that I know
> are getting no traffic see their number of open sockets grow and grow
> over time -- each one eventually having hundreds to thousands of open
> sockets.
>
> Can anyone help me to understand what's going on here? I can post a
> sample config or two, but they are nothing special --
> WSGIDaemonProcesses are set up with varying settings for # of
> processes and threads, WSGIScriptAlias is used to dispatch urls, and
> WSGIProcessGroup is set in the virtual host or location, as
> appropriate. None of the problematic apps are set for more than 2
> processes or 100 threads.
I would need to see the Apache/mod_wsgi configuration bits and know
what Python web application you are running.
Are you able to work out using 'lsof' or 'ofiles' what the sockets are
connecting to? Are you sure you are looking at the mod_wsgi daemon
processes and not the normal Apache worker child processes? Do you use
'display-name' option with WSGIDaemonProcess so 'ps' can show which
are the mod_wsgi daemon processes?
A sample of 'lsof' or 'ofiles' output for one such process would help
in working out what is going on.
Graham
Not sure that I would recommend running as high as 100 threads; I
would normally suggest more processes and fewer threads per process,
as that reduces any impact of the GIL.
I am not concerned that the sockets show in the daemon processes, as
they are inherited across the fork from the Apache parent process. I
would be concerned if they referenced non-existent process IDs. The
PID in that socket path should always be that of the Apache parent
process.
Can you run lsof on the Apache parent process as well and tell me
what the PID of that Apache parent process is?
BTW, elsewhere in the main part of the Apache configuration, are you
setting WSGISocketPrefix at all?
Graham
The Passenger software may appear to have a similar mode of operation,
but it is implemented quite differently from what I understand.
Passenger follows a more traditional fastcgi-style implementation,
performing a fork/exec to run up the actual application process,
whereas mod_wsgi does a fork only. Because a separate application is
exec'd, all that sort of stuff gets cleaned up automatically.
In mod_wsgi, the daemon processes could at least close off all daemon
listener sockets but their own. Log files are a different matter
though. If WSGIDaemonProcess is outside of all VirtualHost containers,
it wouldn't be able to close any. If WSGIDaemonProcess is inside a
VirtualHost, then it could close any which are for a different
VirtualHost.
It can't close everything for global scope daemon processes, because
what is delegated to a daemon process is actually dynamic and so you
can't know in advance what you might need. In other words,
applications from any VirtualHost could be delegated to a global
daemon process.
All the same, yes there is some scope here for doing some cleanup. I
have added ticket:
http://code.google.com/p/modwsgi/issues/detail?id=94
>> I would be
>> concerned if they reference non existent process IDs. The PID in that
>> socket path should always be the Apache parent process.
>
> Some of them reference a PID that is not longer active, but most
> reference the current active apache parent PID. Right now out of
> roughly 300 open sockets per wsgi process, 15 are referencing a PID
> that no longer exists, the rest the current apache parent PID.
Still don't understand how that could be. :-(
>> Can you run the lsof on the Apache parent process as well and tell me
>> what the PID of that Apache parent process is.
>
> The apache parent PID is 29038. It seems to have the same set of
> sockets open as all of the wsgi procs.
>
>> BTW, elsewhere in main part of Apache configuration, are you setting
>> WSGISocketPrefix at all?
>
> No. Should we be?
No, just wanted to ask in case that code path had some bug in it. I
don't usually use that directive myself when doing development and
testing.
I'll see if I can make a change to close off stuff that isn't needed;
if you are able to, maybe you can try it and see if that makes it more
obvious what actual problem may exist.
Graham
Hmmm, now that I look at this on my own system, I see odd things as
well. Doing lsof on the parent process yields:
$ sudo lsof -p 5225 | grep sock
httpd 5225 root 11u unix 0x02780ad0 0t0
/usr/local/apache-2.2.4/logs/wsgi.5224.0.1.sock
httpd 5225 root 12u unix 0x02780150 0t0
/usr/local/apache-2.2.4/logs/wsgi.5224.0.2.sock
httpd 5225 root 13u unix 0x02076bf0 0t0
/usr/local/apache-2.2.4/logs/wsgi.5224.0.3.sock
Logging shows:
[Wed Aug 06 21:15:25 2008] [debug] mod_wsgi.c(8327): mod_wsgi
(pid=5225): Socket for 'wsgi' is
'/usr/local/apache-2.2.4/logs/wsgi.5224.0.1.sock'.
[Wed Aug 06 21:15:25 2008] [debug] mod_wsgi.c(8327): mod_wsgi
(pid=5225): Socket for 'wsgi-2' is
'/usr/local/apache-2.2.4/logs/wsgi.5224.0.2.sock'.
[Wed Aug 06 21:15:25 2008] [debug] mod_wsgi.c(8327): mod_wsgi
(pid=5225): Socket for 'wsgi-3' is
'/usr/local/apache-2.2.4/logs/wsgi.5224.0.3.sock'.
All I can say is that I mustn't understand the Apache startup process
as well as I thought. It must be performing a fork at some point after
it first reads the configuration, but before it then starts
initialising things based on that configuration.
Anyway, when a 'restart' is now done, the next version of the sockets
uses the actual parent process ID, rather than whatever process
preceded it. Thus:
$ sudo lsof -p 5225 | grep sock
httpd 5225 root 11u unix 0x02780ad0 0t0
/usr/local/apache-2.2.4/logs/wsgi.5225.1.1.sock
httpd 5225 root 12u unix 0x02780150 0t0
/usr/local/apache-2.2.4/logs/wsgi.5225.1.2.sock
httpd 5225 root 13u unix 0x02076bf0 0t0
/usr/local/apache-2.2.4/logs/wsgi.5225.1.3.sock
Thus, although odd, it should be quite okay. The issue would be if in
one process you had a mix of open sockets from different process IDs,
or different generations of the parent process. That is, if that
middle number is not the same for all sockets.
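To make that check concrete, here is a small Python sketch (purely illustrative; the paths are from the lsof output in this thread) which groups socket paths of the form 'wsgi.<pid>.<generation>.<instance>.sock' by the embedded PID and generation. A healthy parent process should show exactly one (pid, generation) pair across all its sockets:

```python
import re

# Matches the naming scheme seen above: wsgi.<pid>.<generation>.<instance>.sock
SOCK_RE = re.compile(r"wsgi\.(\d+)\.(\d+)\.(\d+)\.sock$")

def group_sockets(paths):
    """Group socket paths by their embedded (pid, generation) pair."""
    groups = {}
    for path in paths:
        m = SOCK_RE.search(path)
        if m:
            pid, gen, instance = (int(x) for x in m.groups())
            groups.setdefault((pid, gen), []).append(instance)
    return groups

paths = [
    "/usr/local/apache-2.2.4/logs/wsgi.5225.1.1.sock",
    "/usr/local/apache-2.2.4/logs/wsgi.5225.1.2.sock",
    "/usr/local/apache-2.2.4/logs/wsgi.5225.1.3.sock",
]
print(group_sockets(paths))  # {(5225, 1): [1, 2, 3]}
```

A single key means the middle numbers are consistent; multiple keys would indicate sockets leaked from an earlier process or generation.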
If this does occur, I could only see it being caused by one thing.
When a restart is occurring, the listener socket is only closed when
the parent process detects the first daemon process in the group
dying. Ie.,
if (daemon->instance == 1) {
    if (close(daemon->group->listener_fd) < 0) {
        ap_log_error(APLOG_MARK, WSGI_LOG_ERR(errno),
                     wsgi_server, "mod_wsgi (pid=%d): "
                     "Couldn't close unix domain socket '%s'.",
                     getpid(), daemon->group->socket);
    }

    if (unlink(daemon->group->socket) < 0 && errno != ENOENT) {
        ap_log_error(APLOG_MARK, WSGI_LOG_ERR(errno),
                     wsgi_server, "mod_wsgi (pid=%d): "
                     "Couldn't unlink unix domain socket '%s'.",
                     getpid(), daemon->group->socket);
    }
}
If that process was hanging and the parent process wasn't detecting it
as having shut down properly, it would never close the listener socket
for that daemon process group. Thus it may persist.
So, if you are seeing this, see if you can find daemon processes that
aren't dying completely.
BTW, if you have a box on which to experiment with the same
application, I have checked in to the subversion trunk for mod_wsgi,
at revision 969, changes that will close all daemon listener sockets
in Apache worker processes and close all but its own listener socket
in each daemon process.
This will at least mean you will not see the open sockets in child
processes, but if listener sockets aren't being cleaned up properly in
the parent process, you would still see them accumulate there.
Graham
I've updated trunk at revision 971 to defer incorporating the process
ID into the name of the listener socket until the point it is being
created, rather than when the configuration is being read. This now
means that the process ID in the path matches the actual Apache parent
process ID and doesn't change when a restart is done.
Graham
2008/8/7 JP <jpel...@gmail.com>:
http://code.google.com/p/modwsgi/issues/detail?id=95
The problem goes beyond the leaked file descriptors. It is possible
that on graceful restart and shutdown the daemon processes aren't
being given an opportunity to shut down in an orderly manner and
instead Apache is just killing them outright (through some mechanism I
don't yet understand).
So, graceful restart and shutdown potentially have problems, although
it is interesting that no one has noticed that processes are just
getting killed off.
Graham
2008/8/8 JP <jpel...@gmail.com>:
Still have to work out the issue of how daemon processes are being
shut down on the graceful restart and shutdown operations.
Testing of the code is appreciated. Look out for UNIX listener socket
files left in the file system, and use lsof to watch for leaking file
descriptors in the Apache parent process. Hopefully neither should now
occur.
Graham
2008/8/8 Graham Dumpleton <graham.d...@gmail.com>:
Hmmm, I was going to suggest that I would back port the changes to
include in mod_wsgi 2.2 and release that, rather than you relying on
working versions from the subversion repository. If however you are
happy with using that version out of the repository for now, then
fine. That will give me a little bit more time to work out the orderly
shutdown issues with processes and also get the changes included in
version 2.2.
Graham
2008/8/9 JP <jpel...@gmail.com>:
I had actually found some time yesterday to back port the changes, so
can you test the code by checking out from the 2.2 branch instead of
from trunk:
https://modwsgi.googlecode.com/svn/branches/mod_wsgi-2.X
If your testing doesn't raise any issues, I will checkpoint it and
release it as 2.2.
Graham
I saw a comment in your code about that.
The problem is that mod_wsgi daemon mode processes can't just go
arbitrarily closing file descriptors, as they have no idea what the
descriptors are for and they may be needed for correct operation of
the Apache internals they use. Haven't got time to explain now; more
another day.
Graham
Still no time to sit down properly and respond because of work load,
but yes, that is part of it.
BTW, at a guess I don't believe your problem would be specific to
mod_wsgi; it could also occur with mod_cgid and mod_fastcgi, as both
fork off daemons from the Apache parent process and neither, from
memory, goes around closing file descriptors which may have been set
up by other modules.
Also, using a timeout may not be the best way of dealing with it, but
I really need to sit down and work out what your code is actually
doing and what the issue is before passing that judgement.
Sorry, I will say more later when get some time.
Graham
2008/8/13 Hongli Lai <hong...@gmail.com>:
Relying only on the socket being closed is not a good idea, as you
will never be able to control what other Apache modules do. As
mentioned before, what you are seeing isn't going to be specific to
mod_wsgi and its daemon processes; it will also be an issue with the
master processes created by mod_cgid, mod_fastcgi and perhaps other
modules.
If you look at how Apache itself handles its pipe of death for worker
processes, it doesn't rely only on the socket being closed. Instead it
will actually write a character onto the pipe. When a process
monitoring the pipe reads a character off it, it knows it is time to
die. In your case your application pool server would be the only
process reading from the socket, so the only thing which would get the
notification. This will work whether or not some other process is
somehow holding the other end of the socket open.
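A minimal Python sketch of that pipe-of-death pattern (purely an illustration of the idea described above, not Apache's actual code): the master writes a single character onto a pipe, and the monitoring process treats a readable pipe as its cue to die, regardless of who else holds the descriptor open:

```python
import os
import select

r, w = os.pipe()

def should_die(read_fd, timeout=0.0):
    # Wake up only if a character has actually been written onto the pipe;
    # a zero timeout makes this a non-blocking poll.
    ready, _, _ = select.select([read_fd], [], [], timeout)
    return bool(ready and os.read(read_fd, 1))

assert not should_die(r)   # nothing written yet: keep handling requests
os.write(w, b"!")          # the master signals shutdown with one character
assert should_die(r)       # the monitoring process reads it and exits its loop
```

The point of reading a character, rather than just detecting closure, is that the notification works even if some unrelated process inherited the other end of the descriptor and is keeping it open.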
To better understand how Apache uses its pipe of death have a look at
server/mpm_common.c in Apache source code. A document which is also
very helpful in understanding how Apache works is:
http://www.fmc-modeling.org/category/projects/apache/amp/Apache_Modeling_Project.html
As to why mod_wsgi can't close all file descriptors, yes it is in part
because of the need to keep Apache log file descriptors open, but that
isn't the real problem. The problem is knowing which file descriptors
are the ones it can close.
Under normal circumstances if an Apache module has a need to ensure
that file descriptors are closed in the context of an Apache child
worker process, it would register a child init handler, which Apache
would trigger in the child, for the module to do it. Thus triggering
the child init handlers would be the only way of ensuring any cleanup
is done before worker processes start doing anything. Unfortunately,
in a mod_wsgi daemon process, triggering the child init handlers would
have other, more undesirable consequences.
As an example, imagine mod_python were being loaded at the same time and
someone had used PythonImport to preload a lot of Python code into the
worker child processes. Because the imports defined by PythonImport
are done in the child init handler for mod_python, if mod_wsgi
triggered child init handlers it would cause the imports to happen in
the mod_wsgi daemon process even though the process isn't being used
for executing stuff in the context of mod_python.
So, you are potentially damned if you don't call child init handlers,
as they may do some important cleanup, and you are damned if you do
call them, because then all sorts of resource-consuming code related
to other Apache modules could be executed.
As such, mod_wsgi does not call child init handlers. This behaviour is
exactly the same as for mod_cgid and mod_fastcgi, which is why you
would have the same problem there and why you just need to find a
better way of dealing with it.
In your specific problem, actually writing characters onto your
equivalent of the pipe of death should be sufficient and would be
better than a timeout, although factoring in a timeout as well may be
a good idea to ensure your pool server does actually die.
If you look at Apache internals, there are actually two ways in which
restarts can occur. The first is a normal restart; what Apache does
there is send a SIGTERM to any child processes it has been asked to
monitor. While the child processes don't die, it will keep sending
SIGTERM every second, giving up after a few seconds and sending a
SIGKILL to make sure they die.
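That escalation sequence might be sketched in Python as follows. This is a hedged illustration of the behaviour described, not Apache's implementation; it assumes the target PID is a direct child of the calling process:

```python
import os
import signal
import time

def reclaim(pid, attempts=3, interval=1.0):
    """Send SIGTERM repeatedly; fall back to SIGKILL if the child won't die.

    Assumes 'pid' is a direct child of the calling process, so it can
    be reaped with waitpid().
    """
    for _ in range(attempts):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            return True                      # already gone
        time.sleep(interval)
        done, _ = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return True                      # exited after SIGTERM
    os.kill(pid, signal.SIGKILL)             # give up: make sure it dies
    os.waitpid(pid, 0)
    return False
```

Returning False here marks a process that ignored every SIGTERM, i.e. the kind of hung daemon process discussed earlier in this thread.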
The other type of restart is a graceful restart. In this case it
doesn't send the SIGTERM and instead uses the pipe of death. That way
the processes know they are meant to die, but can delay exiting a bit
until they have finished handling current requests.
The odd one with a graceful restart, and which the OP's original
problem has revealed as an issue for mod_wsgi daemon processes which I
didn't realise before, is that since the SIGTERM sequence isn't sent,
even to non-Apache child worker processes such as the mod_wsgi daemon
processes which Apache manages, the mod_wsgi daemon processes don't
get an opportunity to do an orderly shutdown and trigger Python atexit
handlers, destroy interpreters etc. Instead it seems the mod_wsgi
daemon processes get a SIGKILL at the point that the Apache
configuration memory pool, against which the process structures were
allocated, is destroyed.
At least this is what I think is happening; I still need to find some
time to properly work through it. What I do know at this point is that
the daemon processes do get killed and replaced on the graceful
restart, but there is nothing logged in the Apache error logs in
relation to an orderly shutdown. At this point I am not sure whether
it is because they are just getting a SIGKILL when the memory pool is
destroyed, or whether the Apache error logs get closed off somehow
such that logging isn't being done.
Anyway, I have a bit of work to sort this out. It seems all I might be
able to do is register my own cleanup function against the
configuration memory pool and try to send SIGTERM first myself. The
other option is to use a pipe of death myself and somehow register it
so Apache triggers it automatically. I am not sure that this is
possible, however, as that is all controlled from the main loop of the
specific MPM used. Like the signal, the pipe of death could be
triggered from destruction of the configuration memory pool though.
Either way, I need to work out how to prevent the SIGKILL getting
sent, if that is what is happening. Normal Apache worker processes
don't appear to get it, so they can survive until they want to die,
but not so the mod_wsgi daemon processes at the moment; so even if I
can trigger an orderly shutdown, they may not get much time to
complete it.
Graham
Just started doing some more reading about graceful restart; in the
documentation I pointed you at, it says:
"""Each time the master server processes the restart loop, it
increments the generation ID. All child servers it creates have this
generation ID in their scoreboard entry. Whenever a child server
completes the handling of a request, it checks its generation ID
against the global generation ID. If they don't match, it exits the
request-response loop and terminates.
This behavior is important for the graceful restart."""
According to that, there is more to graceful restart than what I said.
I could be a bit wrong about the pipe of death. It may come more into
play when the Apache parent wants to cull some of the child processes,
rather than all. Ensuring that all die on a graceful restart may
depend more on the generation ID.
The problem is that it is all quite hard to work out from the code
without a lot of study. :-(
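The generation-ID mechanism quoted above can be sketched like this (all names are illustrative; the dict merely stands in for Apache's shared scoreboard):

```python
class Worker:
    """Illustrative worker that remembers the generation it was born in."""

    def __init__(self, scoreboard):
        self.scoreboard = scoreboard
        self.generation = scoreboard["generation"]

    def still_current(self):
        # Checked after each request: a mismatch means the master has been
        # through its restart loop, so this worker should exit its
        # request-response loop and terminate.
        return self.generation == self.scoreboard["generation"]

scoreboard = {"generation": 1}
worker = Worker(scoreboard)
assert worker.still_current()      # generations match: keep serving
scoreboard["generation"] += 1      # master increments it on graceful restart
assert not worker.still_current()  # mismatch: terminate after this request
```

The appeal of this scheme is that a busy worker notices the restart on its own schedule, between requests, without needing a signal delivered at an arbitrary moment.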
>> If you actually look at Apache internals there are actually two ways
>> in which restarts can occur. The first is a normal restart and what
>> Apache does there is send a SIGTERM to any child processes it asked to
>> monitor. While the child processes don't die, it will keep sending
>> SIGTERM every second, giving up after a few seconds and sending a
>> SIGKILL to make sure it dies.
>
> Yes, we do something similar as well. First we close the socket, then
> we sent SIGINT every 100 msec, and if after 5 seconds the daemon still
> hasn't exited, we send SIGTERM. We actually use both the socket and
> SIGINT for graceful termination. SIGINT interrupts any blocking system
> calls in daemon, and in the daemon we check for such interruptions.
> We've written a C++ framework to facilitate this; it gives us a Java-
> style interruption API.
One of the issues with relying on signals, if using an exec model to
create daemon processes, is that when the daemon process code is all
provided by the user and relies on something like flup for
fastcgi/scgi/ajp, or other management shell code, it is too easy for
user code to screw things up by registering signal handlers which
block or intercept signals that the management shell code is using to
provide a graceful shutdown mechanism.
I haven't looked at the code for your stuff to understand how you
handle the daemon side after an exec, so I don't know if this is an
issue for you. In mod_wsgi it actually replaces signal.signal() in
Python, so Python code can't go screwing things up by registering
signal handlers for signals you care about. Certain CherryPy versions
want to install signal handlers by default even when they might be
used in an embedded system. If it isn't prevented from doing this, the
daemon processes don't shut down in an orderly manner and have to be
SIGKILL'd. Of course, if someone wants to write a C extension module
which installs a signal handler, there is not much you can do about
it.
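A rough Python sketch of that kind of guard (an illustration of the idea, not mod_wsgi's actual implementation; the set of protected signals here is an assumption):

```python
import signal

# Signals the host environment relies on for orderly shutdown.
# This particular set is illustrative, not what mod_wsgi protects.
PROTECTED = {signal.SIGTERM, signal.SIGINT}

_original_signal = signal.signal

def _guarded_signal(signum, handler):
    # Silently ignore attempts by application code to override handlers
    # for protected signals, keeping the host's own handlers installed.
    if signum in PROTECTED:
        return None
    return _original_signal(signum, handler)

signal.signal = _guarded_signal
```

After this replacement, application code calling signal.signal() for, say, SIGTERM gets a no-op, while registrations for other signals still work; as noted above, this only constrains Python code, not C extension modules that call the C-level signal APIs directly.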
Graham