mod_wsgi returning 503 service unavailable


Kelvin Wong

Oct 28, 2014, 12:43:12 AM10/28/14
to mod...@googlegroups.com
Hi Graham and everyone else

I'm running multiple sites on Django 1.6.7, Apache/2.4.7 (Ubuntu 14.04), OpenSSL/1.0.1f, mod_wsgi/4.2.5, Python/2.7.6, Server MPM: worker.
I found that the server starts returning 504 and then 503, and the following errors show up.
I researched some related issues and even added "WSGISocketPrefix /var/run/apache2/wsgi", but the issue still occurred.
I have no idea why it happens. Can anyone give some direction on this issue?

Thanks!

apache error log
[Sun Oct 26 07:34:34.732934 2014] [wsgi:error] [pid 29268:tid 140053011478272] [client xx.xxx.xxx.xxx:xxxxx] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
[Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27816:tid 140052910765824] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.

occasionally
[Tue Oct 28 02:20:40.722140 2014] [wsgi:error] [pid 24158:tid 140182690981632] (2)No such file or directory: [client 24.171.250.159:60769] mod_wsgi (pid=24158): Unable to connect to WSGI daemon process 'snaptee-production-api-ssl' on '/var/run/apache2/wsgi.30188.7.3.sock'.

wsgi config for that site
WSGIDaemonProcess site-1 display-name=site-1 user=www-data threads=25 python-path=/home/ubuntu/site-1/django:/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages
WSGIProcessGroup site-1
WSGIApplicationGroup %{GLOBAL}
WSGIScriptAlias / /home/ubuntu/site-1/apache/wsgi.py

worker.conf
<IfModule mpm_worker_module>
       StartServers                 2
       MinSpareThreads             25
       MaxSpareThreads             75
       ThreadLimit                 64
       ThreadsPerChild             25
       MaxRequestWorkers          150
       MaxConnectionsPerChild    1000
</IfModule>

Graham Dumpleton

Oct 28, 2014, 1:30:06 AM10/28/14
to modwsgi
I would suggest upgrading to mod_wsgi 4.3.0 if you can, as the error messages for communication problems between the Apache child worker processes and the mod_wsgi daemon processes have been improved.

More comments below.

On 28 October 2014 15:43, Kelvin Wong <kelvi...@gmail.com> wrote:
Hi Graham and everyone else

I'm running multiple sites on Django 1.6.7, Apache/2.4.7 (Ubuntu 14.04), OpenSSL/1.0.1f, mod_wsgi/4.2.5, Python/2.7.6, Server MPM: worker.
I found that the server starts returning 504 and then 503, and the following errors show up.
I researched some related issues and even added "WSGISocketPrefix /var/run/apache2/wsgi", but the issue still occurred.
I have no idea why it happens. Can anyone give some direction on this issue?

Thanks!

apache error log
[Sun Oct 26 07:34:34.732934 2014] [wsgi:error] [pid 29268:tid 140053011478272] [client xx.xxx.xxx.xxx:xxxxx] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
[Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27816:tid 140052910765824] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.

This one can occur when the mod_wsgi daemon process crashes. There should be a segmentation fault error message or similar in the main Apache error log (not a VirtualHost-specific log).

It can also occur if there are incomplete requests still running when a mod_wsgi daemon process is shut down on being restarted, either because the WSGI script file was touched or because Apache was restarted. In the latter case, the mod_wsgi daemon process would have had to be killed off by Apache before the Apache child worker process that was proxying to it had been. This is especially likely when an Apache graceful restart is being done.
 
occasionally
[Tue Oct 28 02:20:40.722140 2014] [wsgi:error] [pid 24158:tid 140182690981632] (2)No such file or directory: [client 24.171.250.159:60769] mod_wsgi (pid=24158): Unable to connect to WSGI daemon process 'snaptee-production-api-ssl' on '/var/run/apache2/wsgi.30188.7.3.sock'.

This can also be due to an Apache graceful restart being done while there were keep alive connections still being handled from an HTTP client. In an Apache graceful restart, because of how Apache handles the mod_wsgi daemon processes, they don't get a graceful shutdown in the same way as the Apache child worker processes.

So what happens is the following:

1. Apache graceful restart is triggered.
2. Apache parent process sends SIGUSR1 to Apache child worker process to signal graceful shutdown.
3. Apache parent process sends SIGINT to mod_wsgi daemon processes to signal shutdown.
4. The mod_wsgi daemon processes complete their requests and restart. After an Apache restart, the next incarnation of the mod_wsgi daemon processes expects a different path for the proxy socket, with a number in the name increasing based on the Apache generation number.
5. The Apache child worker process, because it was in graceful restart mode, operates on the understanding that it can keep handling requests on a keep alive socket connection from an HTTP client until there are no more. It therefore takes the next request on the same connection and tries to connect to the mod_wsgi daemon process using the proxy socket name as it was before, but that name has changed for the next Apache configuration generation and no longer exists, so the connection fails.

The name of the proxy socket changes across Apache restarts because otherwise you could have Apache child worker processes under an old configuration sending requests to a mod_wsgi daemon process using the new configuration, which could cause problems including security issues. There are therefore specific protections in place to ensure that only Apache child worker processes and mod_wsgi daemon mode processes created against the same Apache configuration generation can talk to each other.
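As an aside, the generation-specific naming can be seen directly in the logged socket paths. A small sketch (the meaning of each numeric field is my inference from the logged paths above, not taken from the mod_wsgi source):

```python
import re

def parse_wsgi_socket(path):
    """Split a mod_wsgi proxy socket path into its three numeric components.

    Based on the logged paths above (wsgi.17227.2.3.sock versus
    wsgi.30188.7.3.sock), these numbers change across Apache restarts;
    exactly which component is the generation counter is an assumption here.
    """
    m = re.search(r"wsgi\.(\d+)\.(\d+)\.(\d+)\.sock$", path)
    if m is None:
        raise ValueError("not a mod_wsgi proxy socket path: %r" % path)
    return tuple(int(n) for n in m.groups())

# The two paths from the error log differ in their leading components,
# i.e. they belong to different Apache incarnations/generations.
print(parse_wsgi_socket("/var/run/apache2/wsgi.17227.2.3.sock"))
print(parse_wsgi_socket("/var/run/apache2/wsgi.30188.7.3.sock"))
```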
 
wsgi config for that site
WSGIDaemonProcess site-1 display-name=site-1 user=www-data threads=25 python-path=/home/ubuntu/site-1/django:/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages
WSGIProcessGroup site-1
WSGIApplicationGroup %{GLOBAL}
WSGIScriptAlias / /home/ubuntu/site-1/apache/wsgi.py

worker.conf
<IfModule mpm_worker_module>
       StartServers                 2
       MinSpareThreads             25
       MaxSpareThreads             75
       ThreadLimit                 64
       ThreadsPerChild             25
       MaxRequestWorkers          150
       MaxConnectionsPerChild    1000
</IfModule>

So my best guess is that you are doing Apache graceful restarts when these are occurring.

Are you using Apache graceful restarts as suspected?

Graham 

Kelvin Wong

Oct 28, 2014, 2:18:13 AM10/28/14
to mod...@googlegroups.com
Hi Graham,

Thanks a lot for your detailed explanations.

I reload the Apache processes rather than restarting them.
So is there any relation to the "MaxConnectionsPerChild" setting, whereby when a process hits the limit the child process is restarted? If so, is there any alternative to this setting? I use it to bound the memory usage of Apache.
Will upgrading to mod_wsgi 4.3.0 solve this problem? Does mod_wsgi 4.3.0 improve handling of segmentation fault errors?

Graham Dumpleton

Oct 28, 2014, 6:21:22 AM10/28/14
to mod...@googlegroups.com
On 28/10/2014, at 5:18 PM, Kelvin Wong <kelvi...@gmail.com> wrote:

Hi Graham,

Thanks a lot for your detailed explanations.

I reload the Apache processes rather than restarting them.
So is there any relation to the "MaxConnectionsPerChild" setting, whereby when a process hits the limit the child process is restarted?

There shouldn't be, as MaxConnectionsPerChild only pertains to the Apache child worker processes and the number of connections they handle. When a specific Apache child worker process is restarted it is a form of graceful restart, but the Apache configuration isn't being re-read by Apache as a whole, so the 'generation' counter inside Apache doesn't change and the name of the proxy socket file doesn't change either. So that option shouldn't cause those errors.

If so, any alternative to this setting? I used this setting to bound the memory usage of apache.

The issue is why you would be seeing memory growth in the Apache child worker processes to start with, and how much. By rights they shouldn't keep increasing in memory usage; they can increase a bit, but then should plateau. For example:

[chart: Apache Child Process Memory (Average)]

If using mod_wsgi daemon mode, where the Apache child worker processes are only proxying requests or serving static files, this growth up to a ceiling, as reflected in 'Apache Child Process Memory (Average)', is generally the result of the per worker thread memory pools that Apache uses.

The problem is that there may be no limit on the upper size of the per worker thread memory pools, in which case their size is unbounded. This is especially the case in Apache 2.2, where the compiled-in default is unlimited; if the configuration file doesn't set it, the ceiling can grow quite high, as it depends a lot on how much data gets buffered in memory due to slow HTTP clients.

In Apache 2.4 there is now at least a compiled-in default, but it may still be higher than desirable. In Apache 2.4 that default is:

#define ALLOCATOR_MAX_FREE_DEFAULT (2048*1024)

This means that for each worker thread in a process, the memory pool associated with it can retain 2MB of memory. As you have 25 worker threads, these memory pools can consume up to 50MB per worker process. You then have up to 6 worker processes, so that is 300MB in the worst case if the throughput was enough to keep all the processes active and Apache didn't start killing them off as not needed. Even then, the number of processes will not drop all the way back to 1, because MaxSpareThreads is 75, so at least 3 processes will always be kept around.
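The arithmetic above can be checked directly, using the values from the worker.conf shown earlier and the 2MB compiled-in ALLOCATOR_MAX_FREE_DEFAULT:

```python
# Worst-case memory retained in per worker thread memory pools, using the
# worker MPM settings from the configuration above.
threads_per_child = 25          # ThreadsPerChild
max_request_workers = 150       # MaxRequestWorkers
pool_limit_mb = 2               # ALLOCATOR_MAX_FREE_DEFAULT (2048*1024 bytes)

max_processes = max_request_workers // threads_per_child
per_process_mb = threads_per_child * pool_limit_mb
total_mb = max_processes * per_process_mb
print(max_processes, per_process_mb, total_mb)  # 6 50 300
```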

Anyway, if memory usage in the Apache child worker processes is a big issue, especially where you are delegating the WSGI application to run in mod_wsgi daemon mode, meaning that the Apache child worker processes should be able to run quite lean, then you can adjust MaxMemFree down from the quite high default in Apache 2.4 (there is no default at all in Apache 2.2).

There are two other changes you can also make related to memory usage in the Apache child worker processes.

The first is if you are always using mod_wsgi daemon mode and never requiring embedded mode, then turn off initialisation of the Python interpreter in the Apache child worker processes.

The second is that on Linux the default per thread stack size is 8MB. This much shouldn't usually be required and really only counts towards virtual memory usage, but some VPS systems count virtual memory for billing purposes so it can become a problem that way.

So rather than thinking that MaxConnectionsPerChild is the only solution, use directives to control how much memory may be getting used and/or retained by the worker threads.

In mod_wsgi-express, for example, the generated configuration uses saner defaults, since mod_wsgi daemon mode is always used:

# Turn off Python interpreter initialisation in the Apache child worker processes as it is
# not required if using mod_wsgi daemon mode exclusively. This will be overridden if you
# enable the use of Python scripts for access control/authentication/authorisation, which
# have to run in the Apache child worker processes.

WSGIRestrictEmbedded On

# Set a limit on the amount of memory which will be retained in per worker thread memory
# pools. More memory than this can still be used if need be, but when no longer required,
# anything above this limit will be released back to the process-level memory allocator for
# reuse rather than being retained for exclusive use by the thread, with the risk of a
# higher memory level.

MaxMemFree 64

# Reduce the notional per thread stack size for all the worker threads. This relates more
# to virtual memory usage, but some VPS systems count virtual memory for billing purposes.

ThreadStackSize 262144

Will upgrading to mod_wsgi 4.3.0 solve this problem? Does mod_wsgi 4.3.0 improve handling of segmentation fault errors?

As far as what was being discussed before, version 4.3.0 makes the error messages more descriptive of the issue and introduces new messages where cases couldn't be distinguished before. It wasn't possible to do this as well previously because Apache code was being relied on to handle reading back data from the mod_wsgi daemon processes. The mod_wsgi code now does this itself and can better control things. In mod_wsgi version 4.4.0, if I can get the changes completed (more likely 4.5.0), there will be better messages again for some cases, as error codes that were previously being thrown away by Apache will actually be known.

So it will not change how a segmentation fault is handled, as that can't be changed; just the wording of the error message when a mod_wsgi daemon process may have died or been shut down.

I have thrown a lot of information at you here, but do you actually have figures on what the memory usage of the Apache child worker processes grows to? Are you using any Apache modules for implementing caching in Apache?

Graham 


Graham Dumpleton

Oct 28, 2014, 7:07:48 AM10/28/14
to mod...@googlegroups.com
One further comment about those per worker thread memory pools. It isn't an issue for you, as you are using Apache 2.4, but Apache 2.2 in particular may be susceptible to further problems with them when doing proxying, as is the case with mod_wsgi daemon mode. This is exacerbated by the unbounded limit.

To understand the problem, I will explain what happens in Apache 2.4 when proxying.

So when a request is received by Apache, it is accepted by an Apache child worker process. Apache works out that it needs to be handled by the WSGI application in a mod_wsgi daemon process. The request details (headers) are then proxied across to the mod_wsgi daemon process. A little handshake is done to see whether the mod_wsgi daemon process will accept the request, or whether something has changed and the mod_wsgi daemon process wants to do a restart at that time. This might occur where the WSGI script file was touched.

If the mod_wsgi daemon process is going to accept the request, all the request content is then sent across to the mod_wsgi daemon process. The WSGI application should be reading this content. Only once the request content has all been sent, will the Apache child worker process start looking for response headers and content from the mod_wsgi daemon process.

Now, when dealing with the response, the Apache child worker process simply goes into a loop of reading data from the mod_wsgi daemon process and writing it back out to the client. Writing data back to the HTTP client isn't done directly onto the socket connection though. Instead it goes via Apache's output filter stack, where various transformations may be applied. When it finally gets down to the lowest output filter, the core output filter, and a write onto the socket would block, rather than blocking it will buffer the data and return. This allows whatever is producing the data to keep working.

The problem that can arise here is that if the producer of the data, in this case the code reading the response from the mod_wsgi daemon process, is generating data quicker than the HTTP client is consuming it, then data could keep getting buffered up and grow memory. Because of this there is some protection: there is a threshold on the amount of data that will be allowed to be buffered. When this threshold is reached, the writing of data back to the HTTP client is forced to block until the HTTP client reads more data.
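As a toy model only (not Apache's actual code or thresholds), the buffer-until-threshold behaviour looks something like this:

```python
THRESHOLD = 64 * 1024   # illustrative only; Apache's real limit differs

def proxy_response(read_chunk, write_blocking, try_write):
    """Read chunks from the daemon process and push them to the client.

    Writes that would block are buffered, but once the buffered volume
    reaches THRESHOLD a blocking write is forced, so the buffer cannot
    grow without bound.
    """
    buffered = []
    pending = 0
    while True:
        chunk = read_chunk()
        if not chunk:
            break
        if try_write(chunk):        # non-blocking write succeeded
            continue
        buffered.append(chunk)      # would block: buffer and keep reading
        pending += len(chunk)
        if pending >= THRESHOLD:
            write_blocking(b"".join(buffered))   # forced to wait for client
            buffered, pending = [], 0
    if buffered:
        write_blocking(b"".join(buffered))       # final flush

# Demo: a client that never accepts non-blocking writes, so everything
# goes through the bounded buffer.
sent = []
chunks = iter([b"x" * 8192] * 20 + [b""])
proxy_response(lambda: next(chunks), sent.append, lambda c: False)
print(len(sent), sum(len(b) for b in sent))  # 3 163840
```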

So there is definitely protection against buffering up too much data in memory in Apache 2.4. Right now my understanding is that this protection may not exist in Apache 2.2. Certainly the code in Apache 2.2 is quite different, and it doesn't have the extensive comments that Apache 2.4 has explaining how the buffering works.

For Apache 2.2 then, if there is indeed no limit on the amount of buffering, what this may mean is that with a slow HTTP client, part of the response coming back from a WSGI application in mod_wsgi daemon mode might get buffered up in the memory of the Apache child worker process. This will be allocated against the per worker thread memory pool, so memory can blow out. Further, with there being no MaxMemFree default in Apache 2.2, even when the request is done, that memory is retained against the per worker thread memory pool and will not be released back to the process for use by other things.

There are therefore two potential problems with Apache 2.2: potentially no limit on how much can be buffered in memory, and then, even when a request completes, all the memory which was buffered being retained by that worker thread.

The moral of this story: use Apache 2.4, not Apache 2.2.

Since I have never gone and tested Apache 2.2 to validate my understanding on this point, probably about time I went and did that. :-)

Graham



Graham Dumpleton

Oct 28, 2014, 7:39:53 AM10/28/14
to mod...@googlegroups.com
On 28/10/2014, at 10:07 PM, Graham Dumpleton <graham.d...@gmail.com> wrote:

Since I have never gone and tested Apache 2.2 to validate my understanding on this point, probably about time I went and did that. :-)

And after a bit playing around, this is definitely a problem in Apache 2.2.

I don't quite understand how its algorithm works, because if I actually suspend the HTTP client process, the Apache child worker process will block and not try to send more data.

The problem case is therefore specifically where the HTTP client is still reading some data, but not at a fast enough rate. My best guess, based on what I am observing, is that if no data at all can be written then it will block until it can write some more, and so everything halts at that point. If it can do a partial write, then it will allow the remainder to be buffered. So as long as the HTTP client is reading at a fast enough rate that the write doesn't block entirely, the excess coming through from the mod_wsgi daemon process can backlog and get buffered.

So it is quite nasty, but the only proper solution is simply to use Apache 2.4, where the problem is addressed. The only workaround I can do at the mod_wsgi level is to periodically force a flush of data based on the volume being written. Probably something that I should finally do, now that I am trying to be more active in working on it. :-)
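The workaround idea can be sketched as a WSGI middleware (an illustration of the principle only; mod_wsgi itself would do this internally in C, and this is not its actual code):

```python
def chunked(app, max_chunk=8192):
    """Re-chunk a WSGI response so data is pushed out in bounded pieces,
    giving regular flush points based on the volume written."""
    def middleware(environ, start_response):
        for block in app(environ, start_response):
            for i in range(0, len(block), max_chunk):
                yield block[i:i + max_chunk]
    return middleware

# Demo app returning one large block.
def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"x" * 20000]

pieces = list(chunked(app)({}, lambda status, headers: None))
print(len(pieces), max(len(p) for p in pieces))  # 3 8192
```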

Kelvin Wong

Oct 28, 2014, 7:58:08 AM10/28/14
to mod...@googlegroups.com
do you actually have figures on what the memory usage of the Apache child worker processes grows to?

I do. I used New Relic to monitor the system resource usage. I found that as time goes on, the Apache processes take a lot of memory. That's why I want to control the memory usage of Apache.
Also, my application mainly serves APIs for a mobile application, which involve uploading files/images. I found that a lot of IOErrors occurred; it seems the uploads are unexpectedly terminated by the mobile application.
Do you have any suggestions on this? And will these kinds of requests be kept in memory forever because they were handled incorrectly, and make the memory usage grow?


Are you using any Apache modules for implementing caching in Apache?

No. I just have application-level caching (ORM caching and memcache for some of the requests).

Kelvin Wong

Oct 28, 2014, 10:12:39 PM10/28/14
to mod...@googlegroups.com


On Tuesday, October 28, 2014 1:30:06 PM UTC+8, Graham Dumpleton wrote:
So what happens is the following:

1. Apache graceful restart is triggered.
2. Apache parent process sends SIGUSR1 to Apache child worker process to signal graceful shutdown.
3. Apache parent process sends SIGINT to mod_wsgi daemon processes to signal shutdown.
4. The mod_wsgi daemon processes complete their requests and restart. In the next incarnation of the mod_wsgi daemon processes after an Apache restart they expect a different path for the proxy socket, with the number at the end increasing based on Apache generation number.
5. The Apache child worker process because it was in a graceful restart mode, operates on the understanding that it can keep handling any requests on a keep alive socket connection from a HTTP client until there are no more. It therefore takes next request on same connection and tries to connect to mod_wsgi daemon process, but using the proxy socket name as was used before, but that name has changed for the next Apache configuration generation and no longer exists, thus it fails.

Is there any way to avoid Apache graceful restarts? Is a graceful restart triggered by the "MaxConnectionsPerChild" or other settings?
If so, is it better to control this with the "maximum-requests" option in the mod_wsgi settings?

Graham Dumpleton

Oct 28, 2014, 10:36:13 PM10/28/14
to mod...@googlegroups.com
On 28/10/2014, at 10:58 PM, Kelvin Wong <kelvi...@gmail.com> wrote:

do you actually have figures on what the memory usage of the Apache child worker processes grows to?

I do. I used New Relic to monitor the system resource usage. I found that as time goes on, the Apache processes take a lot of memory. That's why I want to control the memory usage of Apache.

Okay, but where in New Relic are you monitoring this? I am concerned now as to whether you are even looking at just the Apache child worker processes that MaxConnectionsPerChild pertains to.

If you were looking at the host breakout chart on the overview dashboard for the WSGI application being monitored by the Python web application agent, and you are using daemon mode, then what you are looking at is the memory taken by the mod_wsgi daemon processes and not the Apache child worker processes. As a consequence, the MaxConnectionsPerChild directive doesn't apply.

If you were looking at the server monitoring charts at the Apache httpd/apache2 process, then that is all processes under Apache, which counts both the Apache child worker processes and the mod_wsgi daemon processes. From those charts you can't tell whether it is the Apache child worker processes or the mod_wsgi daemon processes.

So you can't tell from either the Python web application agent or the server monitoring agent how much memory is being used by just the Apache child worker processes.

The chart I included earlier relies on a platform plugin agent for Apache/mod_wsgi. Unlike the others, it does pull out memory for just the Apache child worker processes. I then created a custom dashboard which includes charts for metrics from both the Python web application agent and the Apache/mod_wsgi platform plugin so they can be cross-compared. That is how I got all the charts I showed.

So right now I am questioning whether you should be using MaxConnectionsPerChild at all, as it is more likely that you are looking at the size of the mod_wsgi daemon processes, which actually contain your WSGI application.

Also, my application mainly serves APIs for a mobile application, which involve uploading files/images. I found that a lot of IOErrors occurred; it seems the uploads are unexpectedly terminated by the mobile application.
Do you have any suggestions on this?

You can't stop connections being dropped, especially with mobile clients.

What size are the images?

One thing you can do, and which is actually a good idea overall, independent of your specific situation, is to place an nginx front end proxy in front of Apache/mod_wsgi. The preferable way of configuring nginx is to have it use HTTP/1.1 and keep alive connections to Apache, but you have to be on top of understanding your configuration. If you aren't, you are better off using the default of HTTP/1.0 for the proxy connections from nginx to Apache.

Either way, the reason nginx helps is that when proxying, nginx can pre-buffer up to a certain amount of request content, and for requests below that limit it will only proxy the request to Apache once it has successfully received all the content. Thus Apache will not get troubled by requests which got dropped, and so the IOError issue can be cut dramatically.

In short, nginx helps to isolate Apache from slow HTTP clients and can make Apache perform better with less resources.
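As a concrete illustration only (the server name, backend port and buffer sizes below are assumptions, not values from this thread), a minimal nginx front end that buffers uploads before proxying to Apache might look like:

```nginx
# Hypothetical nginx front end sitting in front of Apache/mod_wsgi.
server {
    listen 80;
    server_name example.com;            # assumption

    location / {
        # Reject bodies over this size, and buffer smaller ones in full
        # before proxying, so dropped mobile uploads never reach Apache.
        client_max_body_size    10m;    # illustrative limit
        client_body_buffer_size 128k;

        proxy_set_header Host            $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Proxy connections default to HTTP/1.0, which, as noted above,
        # is the safer choice unless you know your configuration well.
        proxy_pass http://127.0.0.1:8080;   # Apache listening here (assumption)
    }
}
```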

And will these kinds of requests be kept in memory forever if handled incorrectly, making the memory usage grow?

No, they aren't held indefinitely.

The problem with slow HTTP clients is that, even though no data is coming through, Apache still holds the connection open until a timeout occurs based on the Timeout directive, and only then is the connection dropped.

What are you using for the Timeout directive?

The compiled-in default for Timeout is 60 seconds, but the sample configuration files often have 300 seconds. 300 seconds is way too high, and for many situations 60 seconds is also too much, but you have to be a bit careful about dropping it too low.
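As a plain illustration of the directive (60 here simply restates the compiled-in default, not a tuned recommendation):

```apache
# Seconds Apache will wait for I/O on a connection before dropping it.
# 60 is the compiled-in default; distro sample configs often ship 300.
Timeout 60
```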

This again though is where nginx as a front end proxy helps, because the request would simply never get through to Apache if the content wasn't coming through and expected content was under the limit.

Yes, nginx still has to deal with the hung connection, but it is much more efficient at that than Apache, as nginx uses an async event driven system to manage many connections in one thread, whereas Apache uses a thread per connection.

So what happens is the following:

1. Apache graceful restart is triggered.
2. Apache parent process sends SIGUSR1 to Apache child worker process to signal graceful shutdown.
3. Apache parent process sends SIGINT to mod_wsgi daemon processes to signal shutdown.
4. The mod_wsgi daemon processes complete their requests and restart. In the next incarnation of the mod_wsgi daemon processes after an Apache restart they expect a different path for the proxy socket, with the number at the end increasing based on Apache generation number.
5. The Apache child worker process, because it was in graceful restart mode, operates on the understanding that it can keep handling requests on a keep-alive socket connection from an HTTP client until there are no more. It therefore takes the next request on the same connection and tries to connect to the mod_wsgi daemon process using the same proxy socket name as before, but that name has changed for the next Apache configuration generation and no longer exists, so the connection fails.

Is there any way to avoid the Apache graceful restart? Is the Apache graceful restart triggered by "MaxConnectionsPerChild" or other settings?
If so, is it better to control this with "maximum-requests" in the mod_wsgi settings?

The maximum-requests option only applies to mod_wsgi daemon mode processes.

This is where you have to correctly identify which processes are actually growing: the Apache child worker processes, or the mod_wsgi daemon processes that your WSGI application is running in.

If it is the one your WSGI application is in, there are many things you could be doing wrong which would cause memory usage to keep growing.

You might even be encountering bugs in third party packages you use. Django, for example, until at least 1.6.? has had an issue with its signal mechanism that could result in deadlocks when garbage collection is being done. This could lock up a request thread, but then also cause the garbage collector to never run again. The result is that memory usage could keep growing and growing, since the garbage collector would never be able to reclaim objects.
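A generic, hedged way to poke at this from the application side (nothing Django-specific, just the standard gc module) is to force a collection pass and watch whether uncollectable objects accumulate:

```python
import gc

# Force a full collection pass and see whether uncollectable objects are
# piling up. On Python 2, cycles involving __del__ methods end up in
# gc.garbage and can never be freed, so a growing gc.garbage is a red flag.
found = gc.collect()      # unreachable objects found in this pass
stuck = len(gc.garbage)   # objects the collector could not free

print(found >= 0, stuck)  # stuck staying at 0 is the healthy case
```

If the collector has deadlocked as described above, even this call would hang, which is itself diagnostic.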

So the big question still is: which processes are the ones growing in memory usage? Only then can one say what you really need to do and give suggestions on how to track it down.

For reference, the Apache/mod_wsgi platform plugin is detailed at:


The Django GC deadlock issue as it came to our attention, with links to Django bug reports, can be found at:


Graham

Kelvin Wong

unread,
Oct 29, 2014, 3:53:22 AM10/29/14
to mod...@googlegroups.com
I just set up the Apache/mod_wsgi platform plugin to collect the resource usage of mod_wsgi.
It needs some time to collect data in production. I will get back to you when I have some insights.

Thanks for your help. :)

Kelvin Wong

unread,
Oct 29, 2014, 10:12:19 PM10/29/14
to mod...@googlegroups.com
Hi Graham

I upgraded mod_wsgi to 4.3.0. The 503 situation happened again.
The error message is the same as before. But this time, there was no "Timeout when reading response headers from daemon process" before the "(11)Resource temporarily unavailable".
Is there anything I can do to prevent this kind of situation?
Or is there any way to make Apache self-heal?

Thanks.

Kelvin Wong

unread,
Oct 29, 2014, 10:30:41 PM10/29/14
to mod...@googlegroups.com
Here is part of the main Apache error log from when the 503 happened.
[Wed Oct 29 12:56:26.727197 2014] [mpm_worker:error] [pid 1322:tid 139958218430336] AH00287: server is within MinSpareThreads of MaxRequestWorkers, consider raising the MaxRequestWorkers setting
[Wed Oct 29 12:56:30.730902 2014] [mpm_worker:error] [pid 1322:tid 139958218430336] AH00286: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting

What does it mean? Can it be solved by raising "MaxRequestWorkers"?

Thanks.

Kelvin Wong

unread,
Oct 30, 2014, 9:40:24 AM10/30/14
to mod...@googlegroups.com
I found from the main Apache log that, when the issue is occurring, Apache is doing a graceful restart.
How can I avoid the Apache graceful restart?

Thanks.



Graham Dumpleton

unread,
Oct 30, 2014, 3:30:58 PM10/30/14
to mod...@googlegroups.com
Before I can help you, you need to start answering my questions. If you pepper me with email after email on different things and don't answer the questions I ask you, then we will get nowhere. When I ask questions, it is because that information is likely key to my being able to understand the problem and help you.

The main question I put to you was about how you are determining which processes are growing in memory.

It needs to be properly determined whether the processes which were growing in memory were Apache child worker processes or mod_wsgi daemon processes.

The reason for this is that if they are in fact mod_wsgi daemon processes, then you are possibly in part causing the problem yourself by setting MaxConnectionsPerChild. That setting has got nothing to do with the mod_wsgi daemon processes and by setting it, you are creating the conditions which cause the sorts of errors you are seeing.

So, which processes are growing in memory?

Graham 

Kelvin Wong

unread,
Oct 30, 2014, 10:12:18 PM10/30/14
to mod...@googlegroups.com
I'm sorry for throwing so many questions at you. I just wanted to provide more information to make it easier to find the cause.
Also, I rushed to you because this situation happened during sleep hours, so I'm a little stressed about it.

Back to your question.
After following your instructions to set up the mod_wsgi plugin to monitor the mod_wsgi processes in New Relic, I took a look at the data.
It seemed normal to me. Before the 503 situation, "Average Memory Used (bytes/process)" and "Process Count" did not change greatly.
But looking at the data from "Server", the Apache processes are still accumulating memory.
So, can we conclude that the Apache child worker processes are growing in memory?

Also, as I posted previously, from the main Apache error log:
[Wed Oct 29 12:56:26.727197 2014] [mpm_worker:error] [pid 1322:tid 139958218430336] AH00287: server is within MinSpareThreads of MaxRequestWorkers, consider raising the MaxRequestWorkers setting
[Wed Oct 29 12:56:30.730902 2014] [mpm_worker:error] [pid 1322:tid 139958218430336] AH00286: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting

I changed the config accordingly, but the 503 situation still happened.
<IfModule mpm_worker_module>
        StartServers                 2
        MinSpareThreads             25
        MaxSpareThreads             75
        ThreadLimit                 64
        ThreadsPerChild             25
        MaxRequestWorkers          400
        MaxConnectionsPerChild       0
</IfModule>


Thanks.

Graham Dumpleton

unread,
Oct 31, 2014, 12:29:12 AM10/31/14
to mod...@googlegroups.com
So let's look at just memory initially.

When we talk about Apache processes, there are actually going to be three different classes of processes.

These are:

1. The Apache parent process.

This runs as root. Because it runs as root, and not the Apache user (www-data) that the Apache child worker processes run as, it can be seen separately in the NR server monitoring.

There is no memory increase here. It is actually very important that the Apache parent process doesn't increase in size over time with restarts, as when it forks child processes they would inherit that increased memory size. One of the reasons that using mod_python is a bad idea is that it used to initialise the Python interpreter in the Apache parent process, and when restarted, the Python interpreter would leak memory, so the Apache parent process would grow in size on each restart.

Anyway, you are good here and nothing to worry about.

2. The Apache child worker processes.

These run as the Apache user which on your system is www-data.

The Apache child worker processes are what accept all inbound HTTP requests. Static files will be served up by these processes.

In your case, where you are using mod_wsgi daemon processes, the Python web application does not run in the Apache child worker processes. The Apache child worker processes still accept the requests initially though, and then proxy each request through to the daemon process running the WSGI application.

3. The mod_wsgi daemon processes.

By default these also run as the Apache user which on your system is www-data.

This is where your WSGI application is running.

Now both the Apache child worker processes and the mod_wsgi daemon processes are forked children of the Apache parent process. They will therefore all show as being the process 'apache2' on your system.

Because both (2) and (3) are running as the Apache user, NR server monitoring will not allow you to distinguish between them.

Thus when you get:

you aren't looking at just the Apache child worker processes; you are also looking at the mod_wsgi daemon processes.

The only way you could use NR server monitoring to tell the two apart is if you used the 'user' option to WSGIDaemonProcess and ran the mod_wsgi daemon processes as a different user again. Then you could distinguish them.

Although you can't distinguish using NR server monitoring, you can on Linux if you ensure you are naming the processes. That is, when using WSGIDaemonProcess directive, supply the display-name=%{GROUP} option. This will allow you to see the mod_wsgi daemon processes as distinct if using 'ps' or 'htop'.
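A sketch of that, reusing the site-1 daemon group from earlier in the thread:

```apache
# Name the daemon processes after their process group so they are
# distinguishable from the Apache child workers in process listings.
WSGIDaemonProcess site-1 display-name=%{GROUP} user=www-data threads=25

# The daemon processes should then show up as "(wsgi:site-1)" rather than
# "apache2" in, e.g.:
#   ps auxwww | grep wsgi
```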

Let's instead now look at what the mod_wsgi plugin tells you:

Over the 24 hours matching last charts we get:

so nothing conclusive. For a shorter period of one hour we get:

This shows an increase over time, but you have to be a little bit careful in interpreting that. If you hover over the memory chart and look at the popup you will see:

This shows that the processes actually range in size. There is a range of sizes because the process churn chart shows that processes are being restarted at different times.

Important though is to look at the max value. This is the largest size of any process. If one moves along the chart for that period one sees that the maximum is always about 12.9-13.0 MB.

This indicates to me that the Apache child worker process size has reached a plateau, and that it is normal behaviour for Apache child processes to grow over time up to a plateau, which I believe I noted previously.

So for the Apache child worker processes this isn't ringing any alarm bells for me.

If you have already removed MaxConnectionsPerChild and that maximum continues along without increasing too much more over time, then I don't see a need to be setting MaxConnectionsPerChild.

The next chart to look at is the memory chart where we are monitoring the actual Python web application.

Over the last 6 hours this shows:

I am not 100% sure if you can get to this chart on the Lite subscription level you are on but I can still see it.

If it is available on the Lite subscription, go to the bottom of the overview dashboard and select the chart icon for the server breakout.

So as far as which process is growing in memory size goes, it is very much the mod_wsgi daemon processes.

The problem therefore is going to be an issue with caching in your application, construction of object cycles that cannot be broken by the garbage collector, or in worst case the garbage collector isn't even running due to it dead locking.

As a short term remedy to at least keep the memory growth under control for the mod_wsgi daemon processes, you have two choices.

The first is to use the 'inactivity-timeout' option to WSGIDaemonProcess directive.

The behaviour of this is slightly different between mod_wsgi 3.X and 4.X.

Under 4.X, what will happen when inactivity-timeout is set is that if no requests at all are received by a daemon process group, and there are none active, for the specified period, then the process will be restarted.

You run up to 150 requests per minute across about 7 process instances.


In practice this means that any one process is not even handling one request per second.
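The arithmetic behind that, using the figures above:

```python
# Rough per-process request rate from the figures quoted above.
requests_per_minute = 150.0   # peak rate across the whole server
processes = 7                 # approximate number of daemon processes

per_process_per_second = requests_per_minute / processes / 60
mean_gap_seconds = 1 / per_process_per_second

print(round(per_process_per_second, 2))  # prints 0.36 (requests/second)
print(round(mean_gap_seconds, 1))        # prints 2.8 (seconds between requests)
```

So on average a request arrives at each process every few seconds, which is why a 30 second idle window essentially never occurs.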

Even so, what it means is that if you set this to 30 seconds, a process is never going to be idle long enough to be restarted. You therefore can't use this.

On mod_wsgi 3.X, the difference was that the reading of request content and the writing of response content were monitored, and if there was no activity on those after the timeout period, then the process would be restarted. The way it worked meant that a long running request, or a stuck request, would be interrupted. In mod_wsgi 4.X, you should use request-timeout instead if the intent is to be able to interrupt long running or stuck requests.

The second option you can use instead of inactivity-timeout is the maximum-requests option to WSGIDaemonProcess.

This also has changed a bit in mod_wsgi 4.X from 3.X.

Under 3.X once that many requests were processed by a process, the process would be restarted. In restarting though, it followed the typical shutdown sequence for mod_wsgi daemon processes.

That is, no new requests would be accepted by that process. Currently running requests would be given a grace period to complete (default 2.5 seconds). If they did not complete in that time, the process would be shut down more brutally.

The result of this was that long running requests could be interrupted, resulting in some of those errors you saw before.

In mod_wsgi 4.X this option has the same default behaviour, but there is a new way of having the restart be more graceful.

What you instead use is both maximum-requests and graceful-timeout to WSGIDaemonProcess at the same time.

When you also have graceful-timeout option specified, what will happen when the maximum requests is reached is that a new initial grace period is entered based on what graceful-timeout is set to.

During this initial grace period, new requests will still be accepted, but if at any point during that time there are no active requests, the process will be restarted immediately.

If for some reason the initial grace period expires, then the normal restart starts. That is, stop accepting requests, provide a further 2.5 seconds grace and then brutally restart.

So this new graceful-timeout option in conjunction with maximum-requests means that a process can be restarted but with a better ability to do so without interrupting any requests.

In your case because there is less than 1 request per second on average across all processes, there should be a lot of opportunity for a restart to occur in the gaps in between those infrequent requests.

My recommendations at this point are:

1. Set MaxConnectionsPerChild to 0 so that Apache child worker processes aren't being recycled unnecessarily.

Monitor the mod_wsgi plugin chart 'Average Memory Used (bytes/process)' to ensure that the Apache child worker processes aren't growing in size in an unbounded way above the typical max you are seeing now. They should simply plateau after a while. Keep an eye on the max in the popup rather than the average, as the latter can be deceiving: Apache child worker processes can still be restarted by Apache because of its dynamic scaling feature, thereby confusing the results you see.

2. Add to WSGIDaemonProcess the maximum-requests and graceful-timeout option.

As you are running only about 20 requests per minute per process, set maximum-requests at about 250 initially, which would mean a restart roughly every 10-15 minutes at that rate. If the processes are getting recycled too often and the startup cost is too much, then increase it. At the same time though, keep an eye on that memory chart in the server breakout on the main dashboard to ensure memory is not growing too much before the process gets recycled.

At the same time, make sure you set graceful-timeout. A safe figure for this initially would be 30; that is the number of seconds to wait to see if a restart can occur while there are no requests, before a more brutal restart.

3. If you haven't done it already and you are definitely only using mod_wsgi daemon mode for running your WSGI application, ensure you are setting WSGIRestrictEmbedded to On at global scope in the Apache configuration outside of any VirtualHost.
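Pulling the three recommendations together as a sketch (the group name and paths come from earlier in this thread; the exact values are starting points, not tuned figures):

```apache
# Global scope, outside any VirtualHost: don't initialise interpreters in
# the Apache child worker processes when only daemon mode is used.
WSGIRestrictEmbedded On

# worker MPM: don't recycle Apache child worker processes on a request count.
MaxConnectionsPerChild 0

# Recycle each daemon process after ~250 requests, but prefer to restart in
# a gap between requests, waiting up to 30 seconds for such a gap.
WSGIDaemonProcess site-1 display-name=site-1 user=www-data threads=25 \
    maximum-requests=250 graceful-timeout=30 \
    python-path=/home/ubuntu/site-1/django:/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages
```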

So as an initial remedial action only, to combat memory growth, do that for now. This will give you some breathing space so you can work out why your web application is growing memory usage continually.

When done, and when it has been running for 6-12 hours, come back and we can review how the memory growth charts are looking. I can also get to some other charts you can't when on a Lite subscription, which will show details about the mod_wsgi daemon process restarts, and we can see how far apart they are and whether we need to adjust maximum-requests.

As I say though, this is just to get you in a better position while you work out your memory leakage problems.

A next step I would like to do is have you install an extension to the NR Python agent which allows the activities of the garbage collector to be monitored to make sure it is running and whether uncollectable objects are accumulating. You would not normally be able to use this with a Lite account but I will be able to see the data. Before I can give you that extension to try I need to wait on some internal approvals to release it, which are already in train due to us needing to provide it to some other users who are seeing memory growth issues which we are trying to help them with.

For interest's sake, the charts it can produce are:

custom-dashboard.jpg
Anyway, work slowly through what I have said above and understand it. Get those few configuration changes made and then we will look at the next steps.

Graham

Kelvin Wong

unread,
Oct 31, 2014, 2:04:20 AM10/31/14
to mod...@googlegroups.com
Thanks for the details!

I now have a better picture on how it works.
I employed all the recommendations you suggested. I will keep monitoring the processes.

But what about the graceful restart signaled by Apache? Why did it happen?

Thanks again.

Graham Dumpleton

unread,
Oct 31, 2014, 2:42:05 AM10/31/14
to mod...@googlegroups.com
On 31/10/2014, at 5:04 PM, Kelvin Wong <kelvi...@gmail.com> wrote:

Thanks for the details!

I now have a better picture on how it works.
I employed all the recommendations you suggested. I will keep monitoring the processes.

But what about the graceful restart signaled by Apache? Why did it happen?

As far as the message:

[Tue Oct 28 02:20:40.722140 2014] [wsgi:error] [pid 24158:tid 140182690981632] (2)No such file or directory: [client 24.171.250.159:60769] mod_wsgi (pid=24158): Unable to connect to WSGI daemon process 'snaptee-production-api-ssl' on'/var/run/apache2/wsgi.30188.7.3.sock'.

goes, as I said I believe that could only be the result of an Apache graceful restart.

My recollection is that a 'reload' under a Linux system using its process management tools triggers an Apache graceful restart, whereas a 'restart' does a normal Apache restart.

I believe you said you used 'reload' and so that would be the source of that.

The whole reason I have been trying to sort out which processes actually had memory issues was to get you off MaxConnectionsPerChild and ensure that it wasn't a source of the problem as well.

Now that we have you on a better configuration, let's see whether it happens again, rather than worrying too much about what happened in the past. Just note when you happen to do a 'reload', and understand that if the errors are seen at that time, that was likely the cause.

Once things settle down for a while, we can then look at your processes/threads configuration and capacity utilisation and work out whether a poor configuration there is causing problems. We can also see whether you are getting stuck requests which are eating into capacity and which could result in some of the errors you were seeing when any sort of restart was done.

So slowly, slowly. Resolve one issue at a time. Changing too many things at once will only confuse matters, and you will not get a sense of which changes are actually making things better.

Graham

Graham Dumpleton

unread,
Oct 31, 2014, 2:55:32 AM10/31/14
to mod...@googlegroups.com
BTW, in your NR agent configuration file find the error_collector.ignore_errors setting and change it to:

error_collector.ignore_errors = django.http.request:UnreadablePostError

These indicate the HTTP client connection dropping before all content had been sent.

Generally you want to ignore these and eliminate them from your error count as they do not represent an error in your application but are the result of a client problem.

Adding that configuration setting will cause them to be ignored and not recorded as an error.

Graham

Kelvin Wong

unread,
Oct 31, 2014, 3:36:22 AM10/31/14
to mod...@googlegroups.com
I checked the memory of the overall Apache processes; it goes up and down a little. It seems "maximum-requests" is doing its job.

I found that when I do "sudo service apache reload", Apache will do a graceful restart, as shown in the main Apache error log.


[Sun Oct 26 07:34:37.198806 2014] [wsgi:error] [pid 27816:tid 140052910765824] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.
Before the above situation happened, Apache was also doing a graceful restart.
But this should not have been triggered by anyone, as I mentioned it happened during sleep hours.
Are there any reasons Apache would do a graceful restart by itself?

Thanks.


Graham Dumpleton

unread,
Oct 31, 2014, 3:45:35 AM10/31/14
to mod...@googlegroups.com
Your operating system could be signalling Apache during the middle of the night so that the log files can be rotated.

Graham

Kelvin Wong

unread,
Oct 31, 2014, 3:53:08 AM10/31/14
to mod...@googlegroups.com
You mean the OS triggered the Apache graceful restart? If so, why does the OS do that?


Thanks again.

Graham Dumpleton

unread,
Oct 31, 2014, 4:16:55 AM10/31/14
to mod...@googlegroups.com, mod...@googlegroups.com


On 31/10/2014, at 6:53 PM, Kelvin Wong <kelvi...@gmail.com> wrote:

You mean the OS triggered the Apache graceful restart? If so, why does the OS do that?

Yes and I just told you why, as part of a log rotation mechanism.

Try googling 'log file rotation' and see what mechanism is used on your specific Linux system.

One link about one such solution which is often used is:

http://www.cyberciti.biz/faq/how-do-i-rotate-log-files/

Apache does have a way of handling log rotation itself, but it is likely ignored by Linux distros, which use their own preferred mechanism.

Graham



Kelvin Wong

unread,
Oct 31, 2014, 4:26:59 AM10/31/14
to mod...@googlegroups.com
Oh. I see. Thanks.

One more question.
As I have set graceful-timeout on the mod_wsgi processes, when Apache does a graceful restart, will mod_wsgi obey that timeout setting?

Graham Dumpleton

unread,
Oct 31, 2014, 5:08:27 AM10/31/14
to mod...@googlegroups.com
On 31/10/2014, at 7:26 PM, Kelvin Wong <kelvi...@gmail.com> wrote:

Oh. I see. Thanks.

One more question.
As I have set graceful-timeout on the mod_wsgi processes, when Apache does a graceful restart, will mod_wsgi obey that timeout setting?

No. It is impossible. There is simply no way for the mod_wsgi daemon processes to act in coordination with the Apache child worker processes on an Apache graceful restart.

This is because the only way that mod_wsgi can create its distinct child processes and have them managed by Apache at all is by using a mechanism that the Apache runtime library refers to as 'other processes'.

It doesn't matter whether Apache as a whole is signalled for a normal restart (SIGHUP) or a graceful restart (SIGUSR1); Apache will in turn always send a SIGINT on to 'other processes' such as those created by mod_wsgi. If those 'other processes' don't shut down immediately, it will signal them again at one second intervals until 3 seconds have elapsed, at which point it will send them a SIGKILL. So Apache only supports a brutal shutdown of that sort for 'other processes'.

For its own child worker processes, even supporting a graceful restart is a non-trivial thing. It isn't just a matter of sending the child worker processes a SIGUSR1. Instead there is an elaborate scheme involving what is called a pipe of death between the Apache parent and the child worker processes, as well as state shared via shared memory with the Apache parent process, so the child worker processes know what sort of shutdown has been triggered. That sort of mechanism to support a graceful shutdown for 'other processes' simply does not exist.

So it is how it is and there is no way to change things. In practice it hasn't caused any issues and web requests being interrupted is always a risk that a web application has to deal with anyway.

Kelvin Wong

unread,
Nov 1, 2014, 11:18:41 AM11/1/14
to mod...@googlegroups.com

I have some updates on the issue as it happened again. This time I captured some data from New Relic.

From Requests.png, I found that there is a spike around 22:23, so I checked the Apache log. I found many entries like the following.

Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py

A spike is also found in Workers.png during that time period. The graph shows that most of the workers are doing writes?
After a few minutes, the following log appeared again, and the server started returning 503.

(11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.

Can you tell what happened from those graphs?

Thanks.
Requests.png
Workers.png

Graham Dumpleton

unread,
Nov 2, 2014, 7:00:10 PM11/2/14
to mod...@googlegroups.com
On 02/11/2014, at 2:18 AM, Kelvin Wong <kelvi...@gmail.com> wrote:


I have some updates on the issue as it happened again. This time I captured some data from New Relic.

From Requests.png, I found that there is a spike around 22:23, so I checked the Apache log. I found many entries like the following.
Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py

A spike is also found in Workers.png during that time period. The graph shows that most of the workers are doing writes?
After a few minutes, the following log appeared again, and the server started returning 503.
(11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.

Can you tell what happened from those graphs?

What I need to see is more of the log file around those events.

Ensure you have LogLevel set to 'info' in the Apache configuration, as mod_wsgi will then log more messages about why processes restarted.

Also ensure you are checking for messages around that time in the global Apache error log and not just in any virtual host specific error log, as some messages, such as evidence of a restart or process crashes, will only appear in the main Apache error log.

With memory growth under control:


turning to look at daemon process restarts:


they are obviously more frequent. Zooming in, each process is restarting about every 5 minutes, although this varies over time.

I would suggest you can increase maximum-requests, possibly as high as 4 times what you are using.

Reducing the restart frequency that way will also, with 'info' logging enabled, cut the amount of noise in the error logs which may otherwise hide the specific anomalous event you are trying to track down.

BTW. Did you look into whether any log file rotation is set up or not and what did you find?

Depending on the system, you might be able to find out whether something is running by looking at what cron jobs are set up for root.

    sudo crontab -l

Graham



Kelvin Wong

unread,
Nov 2, 2014, 9:56:33 PM11/2/14
to mod...@googlegroups.com
Yes, the memory usage is under control now and I increased maximum-requests by 4 times.

For the error, 
Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
it came all of a sudden; that's why I captured only this message. I also found nothing special in the main Apache error log.

Now, I turned the log level to 'info' in order to obtain more messages.

About log file rotation, I only found the Ubuntu default one, which uses logrotate.
/var/log/apache2/*.log {
        weekly
        missingok
        rotate 52
        compress
        delaycompress
        notifempty
        create 640 root adm
        sharedscripts
        postrotate
                if /etc/init.d/apache2 status > /dev/null ; then \
                    /etc/init.d/apache2 reload > /dev/null; \
                fi;
        endscript
        prerotate
                if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
                        run-parts /etc/logrotate.d/httpd-prerotate; \
                fi; \
        endscript
}
It will do an Apache reload on postrotate. I researched a little; the reload is needed so Apache reopens its log files.
Some suggested "copytruncate" in the config to avoid the reload. Would you also suggest that?
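For reference, the copytruncate variant I mean would drop the postrotate reload and look roughly like this (a sketch only; note that copytruncate can lose log lines written between the copy and the truncate, which is one reason distros prefer the reload approach):

```
/var/log/apache2/*.log {
        weekly
        missingok
        rotate 52
        compress
        delaycompress
        notifempty
        copytruncate
}
```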


Thanks.


On Monday, November 3, 2014 8:00:10 AM UTC+8, Graham Dumpleton wrote:
On 02/11/2014, at 2:18 AM, Kelvin Wong <kelvi...@gmail.com> wrote:


I have some updates on the issue as it happened again. This time I captured some data from New Relic.

From the Requests.png, I found that there is an spike aroung 22:23 and check the Apache log. I found there are many as following.
Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py

Spike also found on the Workers.png during that time period. The graph is showing most of the worker is doing write?
After a few minute, the following log appear again, and start returning 503.
(11)Resource temporarily unavailable: [client xx.xx.xx.xx:xxxxx] mod_wsgi (pid=27816): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.17227.2.3.sock'.

Can you tell what happened from those graphs?

What I need to see is more of the log file around those events.

Ensure you have LogLevel set to 'info' in the Apache configuration as mod_wsgi will then log more messages about why processes restarted.

Also ensure you are checking for message around the time in the global Apache error log and not just in any virtual host specific error log, as some messages, such as evidence of a restart or process crashes will only appear in the main Apache error log.

With memory growth under control:


turning to look at daemon process restarts:


they are obviously more frequent. Zooming in, each process is restarting about every 5 minutes, although it varies over time.

...

Graham Dumpleton

unread,
Nov 2, 2014, 10:00:10 PM11/2/14
to mod...@googlegroups.com
And when is 'cron' configured to run logrotate? Does it correspond to when you are seeing the issue?

Kelvin Wong

unread,
Nov 2, 2014, 10:09:33 PM11/2/14
to mod...@googlegroups.com
I found no cron jobs running for root or my user.

According to the logrotate config, the logs rotate weekly.
But the error situation tends to happen nightly.

So I don't think that is the cause.
...

Kelvin Wong

unread,
Nov 3, 2014, 4:28:23 AM11/3/14
to mod...@googlegroups.com
I tried some load testing on the server to see if it is caused by overloading.

I found that different log messages appear when the server is overloaded.
2014-11-03 07:47:01.591 [Mon Nov 03 07:46:53.795864 2014] [wsgi:error] [pid 3724:tid 140372768429824] [client xx.xx.xx.xx:26561] Truncated or oversized response headers received from daemon process 'site-1': /home/ubuntu/site-1/apache/wsgi.py
2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.

Since those messages differ from the original ones, I concluded that the situation is not caused by overloading the server.

Do you have any suggestions on how to reproduce the situation?

Thanks.

Graham Dumpleton

unread,
Nov 3, 2014, 4:33:46 AM11/3/14
to mod...@googlegroups.com
As I said before, I need to see messages either side of the event and with Apache log level set to info.

Graham

Graham Dumpleton

unread,
Nov 3, 2014, 4:58:47 AM11/3/14
to mod...@googlegroups.com
I still want any log messages either side of that point as they may confirm what is going on, but what you might be encountering could be due to Linux behaviour when performing a connect() on a UNIX socket in non-blocking mode not being POSIX compliant. Details:


The mod_wsgi code is possibly not dealing with Linux's non-POSIX compliance, and instead of interpreting it the same as connection refused (when the listener socket backlog is full), it is regarding it as an outright failure.

So if you are doing load testing, the cause would be that you are overwhelming the capacity of the mod_wsgi daemon processes, causing the socket listener queue to fill up. When it fills up, it triggers the non-POSIX behaviour of Linux.

Can you confirm that you only saw this when you overwhelmed the server with more traffic than you would realistically expect it to receive, and way beyond the capacity it could handle based on how it is configured?

Graham

Graham Dumpleton

unread,
Nov 3, 2014, 5:23:21 AM11/3/14
to mod...@googlegroups.com
Try with this version of mod_wsgi.


and see if you can reproduce it.

Let me know the result.

If you do manage to overload it again, you may not see the same error message, but it is possible that things get so backlogged you still end up seeing this message:

                   ap_log_rerror(APLOG_MARK, APLOG_ERR, rv, r,
                                 "mod_wsgi (pid=%d): Unable to connect to "
                                 "WSGI daemon process '%s' on '%s' after "
                                 "multiple attempts as listener backlog "
                                 "limit was exceeded.", getpid(),
                                 daemon->name, daemon->socket_path);

You can also get this when your daemon processes have all hung due to blocked requests.

If you were to set WSGIVerboseLogging to On then you would also see:

                    if (wsgi_server_config->verbose_debugging) {
                        ap_log_rerror(APLOG_MARK, APLOG_DEBUG, rv, r,
                                     "mod_wsgi (pid=%d): Connection attempt "
                                     "#%d to WSGI daemon process '%s' on "
                                     "'%s' failed, sleeping before retrying "
                                     "again.", getpid(), retries,
                                     daemon->name, daemon->socket_path);
                    }

You should not set WSGIVerboseLogging to On in a production system, or anywhere with high throughput, as it will produce too much logging and fill up the Apache error log.

If you get a backlog, it means one of the following:

You are being totally unrealistic about how much traffic you are sending your site in doing load testing. That is not load testing but failure testing.

You have bad mod_wsgi daemon processes/threads configuration.

The problem with a bad backlog is that the server may never catch up, effectively killing the whole server until you can stop and then start Apache to clear the backlog.

Latest mod_wsgi versions do have a queue-time concept that can be set up if necessary as a safeguard against a server being overwhelmed. The setting lets you set a maximum for how long a request should wait in the queue, and causes requests to be discarded if the time is exceeded. This helps throw away any backlog if the server gets bogged down, so it has a chance of recovering.

I will not explain the queue timeout at this point and will instead wait for you to confirm whether the patched mod_wsgi works.

Graham

Kelvin Wong

unread,
Nov 3, 2014, 5:35:49 AM11/3/14
to mod...@googlegroups.com
I don't think the original error situation is caused by server overloading, as the traffic on the server at that time was fine.
I just wanted to rule out some conditions, as returning 503 can be caused by overloading the service.
Overloading the server can produce 503s, but those error log messages do not match the original error log message.
...

Graham Dumpleton

unread,
Nov 3, 2014, 5:43:19 AM11/3/14
to mod...@googlegroups.com
On 03/11/2014, at 9:35 PM, Kelvin Wong <kelvi...@gmail.com> wrote:

I don't think the original error situation is caused by server overloading, as the traffic on the server at that time was fine.

I didn't say it was.

I just wanted to rule out some conditions, as returning 503 can be caused by overloading the service.
Overloading the server can produce 503s, but those error log messages do not match the original error log message.

And I am only addressing the message:

2014-11-03 07:47:01.591 [Mon Nov 03 07:46:54.109098 2014] [wsgi:error] [pid 4615:tid 140372718073600] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:26793] mod_wsgi (pid=4615): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.3387.0.4.sock'.

That area of the code was rewritten at some point in the 4.X versions and introduced non-blocking connects. Looking more at that specific message, and knowing that you may have been load testing, I came across that difference in Linux behaviour compared to what POSIX specifies.

So all I can do is address that specific point, until I can get those info-level messages from the error logs and can understand whether the mod_wsgi daemon processes are actually restarting at the same time.

Graham


Kelvin Wong

unread,
Nov 3, 2014, 6:25:38 AM11/3/14
to mod...@googlegroups.com
I obtained log messages on production when the error happened. The log level is set to 'info'.

[Mon Nov 03 11:11:14.947304 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Shutdown requested 'site-2'.
[Mon Nov 03 11:11:14.947499 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Stopping process 'site-2'.
[Mon Nov 03 11:11:14.947521 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Destroying interpreters.
[Mon Nov 03 11:11:14.947536 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Cleanup interpreter ''.
[Mon Nov 03 11:11:14.951446 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Shutdown requested 'site-1'.
[Mon Nov 03 11:11:14.951672 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Stopping process 'site-1'.
[Mon Nov 03 11:11:14.951693 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Destroying interpreters.
[Mon Nov 03 11:11:14.951708 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Cleanup interpreter ''.
[Mon Nov 03 11:11:14.955690 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Shutdown requested 'site-3'.
[Mon Nov 03 11:11:14.955911 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Stopping process 'site-3'.
[Mon Nov 03 11:11:14.955933 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Destroying interpreters.
[Mon Nov 03 11:11:14.955947 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Cleanup interpreter ''.
[Mon Nov 03 11:11:14.967577 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Terminating Python.
[Mon Nov 03 11:11:14.970224 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Python has shutdown.
[Mon Nov 03 11:11:14.970246 2014] [wsgi:info] [pid 4476:tid 139644458678144] mod_wsgi (pid=4476): Exiting process 'site-3'.
[Mon Nov 03 11:11:14.975368 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Terminating Python.
[Mon Nov 03 11:11:14.978051 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Python has shutdown.
[Mon Nov 03 11:11:14.978072 2014] [wsgi:info] [pid 4477:tid 139644458678144] mod_wsgi (pid=4477): Exiting process 'site-1'.
[Mon Nov 03 11:11:14.983240 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Terminating Python.
[Mon Nov 03 11:11:14.986019 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Python has shutdown.
[Mon Nov 03 11:11:14.986041 2014] [wsgi:info] [pid 4478:tid 139644458678144] mod_wsgi (pid=4478): Exiting process 'site-2'.
[Mon Nov 03 11:11:18.985926 2014] [wsgi:error] [pid 1712:tid 140701685757696] [client xx.xx.xx.xx:13183] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py
[Mon Nov 03 11:11:57.511176 2014] [wsgi:error] [pid 2417:tid 140701627008768] [client xx.xx.xx.xx:9079] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py
[Mon Nov 03 11:11:59.504832 2014] [wsgi:error] [pid 2026:tid 140701568259840] [client xx.xx.xx.xx:13194] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py
[Mon Nov 03 11:12:02.893114 2014] [wsgi:error] [pid 1712:tid 140701593437952] [client xx.xx.xx.xx:9123] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py
[Mon Nov 03 11:12:15.471793 2014] [wsgi:error] [pid 2026:tid 140701643794176] [client xx.xx.xx.xx:9145] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py
[Mon Nov 03 11:12:18.741610 2014] [wsgi:error] [pid 1712:tid 140701840639744] [client xx.xx.xx.xx:13230] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py
[Mon Nov 03 11:19:32.470117 2014] [wsgi:error] [pid 1712:tid 140701840639744] [client xx.xx.xx.xx:9723] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py
[Mon Nov 03 11:19:33.336583 2014] [wsgi:error] [pid 2417:tid 140701807068928] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10126] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:36.421071 2014] [wsgi:error] [pid 1712:tid 140701660579584] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10148] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:38.102948 2014] [wsgi:error] [pid 1712:tid 140701677364992] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10157] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:38.186446 2014] [wsgi:error] [pid 1712:tid 140701685757696] [client xx.xx.xx.xx:13868] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py
[Mon Nov 03 11:19:42.206195 2014] [wsgi:error] [pid 1712:tid 140701823854336] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14302] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:42.366307 2014] [wsgi:error] [pid 3039:tid 140701568259840] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14306] mod_wsgi (pid=3039): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:46.209203 2014] [wsgi:error] [pid 2417:tid 140701593437952] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14308] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:47.026183 2014] [wsgi:error] [pid 2026:tid 140701677364992] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14312] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:49.092478 2014] [wsgi:error] [pid 1712:tid 140701652186880] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10163] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:49.505943 2014] [wsgi:error] [pid 2417:tid 140701840639744] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14314] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:49.692284 2014] [wsgi:error] [pid 2417:tid 140701719328512] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10173] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:49.725615 2014] [wsgi:error] [pid 2026:tid 140701568259840] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14316] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:52.273296 2014] [wsgi:error] [pid 2026:tid 140701685757696] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10175] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:53.356815 2014] [wsgi:error] [pid 2026:tid 140701823854336] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14318] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:57.364586 2014] [wsgi:error] [pid 2026:tid 140701585045248] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10178] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:19:57.620572 2014] [wsgi:error] [pid 2417:tid 140701652186880] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14324] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:01.508145 2014] [wsgi:error] [pid 2026:tid 140701719328512] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14331] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:03.685263 2014] [wsgi:error] [pid 3039:tid 140701576652544] [client xx.xx.xx.xx:9777] Timeout when reading response headers from daemon process 'site-1': /home/ubuntu/snaptee-production/apache/wsgi.py
[Mon Nov 03 11:20:07.131191 2014] [wsgi:error] [pid 2026:tid 140701610223360] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14337] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:07.210243 2014] [wsgi:error] [pid 2026:tid 140701601830656] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10185] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:10.990187 2014] [wsgi:error] [pid 2417:tid 140701660579584] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10206] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:12.343626 2014] [wsgi:error] [pid 2026:tid 140701807068928] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14353] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:12.383353 2014] [wsgi:error] [pid 2417:tid 140701601830656] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10210] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:12.514056 2014] [wsgi:error] [pid 2026:tid 140701568259840] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10213] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:13.106436 2014] [wsgi:error] [pid 2026:tid 140701677364992] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14356] mod_wsgi (pid=2026): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:13.367651 2014] [wsgi:error] [pid 3039:tid 140701668972288] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10219] mod_wsgi (pid=3039): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:17.423692 2014] [wsgi:error] [pid 2417:tid 140701610223360] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:14359] mod_wsgi (pid=2417): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:17.450428 2014] [wsgi:error] [pid 3039:tid 140701815461632] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10221] mod_wsgi (pid=3039): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.
[Mon Nov 03 11:20:20.733221 2014] [wsgi:info] [pid 3794:tid 140056943240960] mod_wsgi (pid=3794): Maximum requests reached, triggering immediate shutdown 'site-1'.
[Mon Nov 03 11:20:20.733323 2014] [wsgi:info] [pid 3794:tid 140057235269504] mod_wsgi (pid=3794): Shutdown requested 'site-1'.
[Mon Nov 03 11:20:20.733511 2014] [wsgi:info] [pid 3794:tid 140057235269504] mod_wsgi (pid=3794): Stopping process 'site-1'.
[Mon Nov 03 11:20:20.733528 2014] [wsgi:info] [pid 3794:tid 140057235269504] mod_wsgi (pid=3794): Destroying interpreters.
[Mon Nov 03 11:20:20.733538 2014] [wsgi:info] [pid 3794:tid 140057235269504] mod_wsgi (pid=3794): Cleanup interpreter ''.
[Mon Nov 03 11:20:20.734308 2014] [wsgi:error] [pid 3794:tid 140057235269504] 2014-11-03 11:20:20,734 (3794/Dummy-7) newrelic.core.agent INFO - New Relic Python Agent Shutdown
[Mon Nov 03 11:20:20.680893 2014] [wsgi:error] [pid 1712:tid 140701710935808] (11)Resource temporarily unavailable: [client xx.xx.xx.xx:10226] mod_wsgi (pid=1712): Unable to connect to WSGI daemon process 'site-1' on '/var/run/apache2/wsgi.1237.0.4.sock'.

Please advise. Thanks.
...

Kelvin Wong

unread,
Nov 3, 2014, 6:30:46 AM11/3/14
to mod...@googlegroups.com
It's running mod_wsgi 4.3.0, not the version you gave me.
...

Graham Dumpleton

unread,
Nov 3, 2014, 7:10:00 AM11/3/14
to mod...@googlegroups.com
Can you provide logs from further back in time and further forward in time?

Basically, from when everything looked normal until everything was back to normal.

I get the feeling that this has been chopped off and not everything I might want to see is there.

Sleep time here. Will then look tomorrow.

Graham

Kelvin Wong

unread,
Nov 3, 2014, 10:53:57 AM11/3/14
to mod...@googlegroups.com
I have attached the error log. See if it is enough.

...
error.log

Graham Dumpleton

unread,
Nov 4, 2014, 4:43:28 AM11/4/14
to mod...@googlegroups.com
Can you supply the main Apache error log for the same period of time? Not all information is in the virtual host specific error log, including key information which captures when an Apache restart was triggered.

Based on the incomplete information from just that log, it opens the possibility that the restart was actually the outcome of problems which arose and not the trigger. I really need that main Apache error log to be sure.

As well as supplying the main Apache error log, can you indicate whether your system is set up with any of these things:

1. Any sort of backup process, be it file system backups or database backups. If there are such backup services, what time of day do they run?

2. Any sort of intensive batch update processing which loads data into any database at a specific time of day.

3. Do you have any monitoring system which checks for the liveness of Apache and which would force a restart if it was unable to poll Apache and thus thought it was hung?

Lack of a permalink from the NR UI (bottom right of pages) makes it hard to find the specific time of day for the event you are talking about because of timezone uncertainties, but I see evidence of various other strange things going on at times. Let's see those main Apache error logs first though, and then we can properly come up with a plausible explanation of what may have triggered things.

Graham

--
You received this message because you are subscribed to the Google Groups "modwsgi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to modwsgi+u...@googlegroups.com.
To post to this group, send email to mod...@googlegroups.com.
Visit this group at http://groups.google.com/group/modwsgi.
For more options, visit https://groups.google.com/d/optout.
<error.log>

Kelvin Wong

unread,
Nov 4, 2014, 4:58:23 AM11/4/14
to mod...@googlegroups.com
I attached the main Apache error log for the same period of time.


For the system setup,
1. I have no backup process for that instance. It's a web app server only.

2. Apart from the web application, there is no process connecting to the database.

3. I use an AWS Elastic Load Balancer and the New Relic pinger to monitor the liveness of the website, not Apache specifically. All restarts are triggered manually for now.


Thanks.

...
main_error.log

Graham Dumpleton

unread,
Nov 4, 2014, 7:26:38 AM11/4/14
to mod...@googlegroups.com
Here is what I have.

You appear to be running daemon process groups with only a single process. For one of those daemon process groups, the daemon process hung.

The lifetime of the process is shown by the following logs.

Daemon logging:

[Mon Nov 03 10:39:25.824186 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Attach interpreter ''.
[Mon Nov 03 10:39:25.824339 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Adding '/home/ubuntu/site-1/django' to path.
[Mon Nov 03 10:39:25.824488 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Adding '/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages' to path.
[Mon Nov 03 10:39:25.826052 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Imported 'mod_wsgi'.
[Mon Nov 03 10:39:25.828107 2014] [wsgi:info] [pid 2517:tid 140701830735616] [remote xx.xx.xx.xx:64697] mod_wsgi (pid=2517, process='site-1', application=''): Loading WSGI script '/home/ubuntu/site-1/apache/wsgi.py'.
[Mon Nov 03 10:39:27.205378 2014] [wsgi:error] [pid 2517:tid 140701830735616] 2014-11-03 10:39:27,204 (2517/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.32.0.28)
[Mon Nov 03 10:39:27.205566 2014] [wsgi:error] [pid 2517:tid 140701637703424] 2014-11-03 10:39:27,205 (2517/Dummy-1) newrelic.core.agent INFO - New Relic Python Agent (2.32.0.28)
[Mon Nov 03 10:39:30.871519 2014] [wsgi:error] [pid 2517:tid 140701294765824] 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) newrelic.core.data_collector INFO - Successfully registered New Relic Python agent where app_name='Snaptee', pid=2517, redirect_host=u'collector-129.newrelic.com' and agent_run_id=1824811776, in 0.50 seconds.
[Mon Nov 03 10:39:30.871653 2014] [wsgi:error] [pid 2517:tid 140701294765824] 2014-11-03 10:39:30,871 (2517/NR-Activate-Session/Snaptee) newrelic.core.data_collector INFO - Reporting to: https://rpm.newrelic.com/accounts/375291/applications/2915135
[Mon Nov 03 11:21:04.860039 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Shutdown requested 'site-1'.

The above logging is missing the full set of messages one would expect to see on process shutdown. This means it was hung, and the Apache parent process would have brutally killed it when it didn't shut down itself.

Apache logging:

[Mon Nov 03 10:39:25.812190 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Starting process 'site-1-api-ssl' with uid=33, gid=33 and threads=25.
[Mon Nov 03 10:39:25.812744 2014] [wsgi:info] [pid 2517:tid 140701954910080] mod_wsgi (pid=2517): Initializing Python.
[…
[Mon Nov 03 11:21:04.857993 2014] [mpm_worker:notice] [pid 1237:tid 140701954910080] AH00297: SIGUSR1 received.  Doing graceful restart
[Mon Nov 03 11:21:04.858106 2014] [wsgi:info] [pid 1237] mod_wsgi (pid=2517): Process 'site-1-api-ssl' has been deregistered and will no longer be monitored.

From our own tracking of what the New Relic agent is up to, we had a lifetime for that process of:

5288315 / 1824811776 ip-10-0-21-120 Snaptee 2517 2014-11-03 02:39:31 2014-11-03 03:01:34 2.32.0.28 Apache/mod_wsgi (daemon)

Our time zone is different, but you will note that it has the process life ending at 1 minute past the hour, whereas the logs show it was killed at 21 minutes past the hour.

That again confirms to me that the process was hung. More significantly, whatever occurred also hung our agent background thread which does the reporting of data. If it were still running, the life of the process wouldn't have been recorded as ending prematurely. In reality, data reporting would have stopped even before that time, as we don't mark a process dead straight away when we stop getting data and it didn't send a proper disconnect message.

Unfortunately I can't get anything more useful from the NR metrics, as the Lite plan starts summarising data after a day and loses the ability to meaningfully zoom into historical data.

So the issue is why the process hung.

The main culprit for that in a multithreaded system is thread deadlock.

This definitely could be an issue in your code, but I would expect that, if that were the case, it would only affect the request handler threads and not the NR agent background thread, as the latter doesn't call into your code.

The one scenario I do know of that can explain it, and which we have experienced ourselves and even reported as a bug with tests to replicate it, is an issue in the PostgreSQL client libraries affecting Python web applications which also make external HTTP requests over SSL connections. For whatever reason, using the NR agent changes the dynamics of a web application enough that it seems to trigger the PostgreSQL bug more easily.

The details of that PostgreSQL bug can be found at:


We are still waiting to see a fix from the PostgreSQL folks about that.

Now, given that the process is hanging, there are a couple of things that can be done.

The first is to set some options on the mod_wsgi daemon process group which will help with detecting a stuck process and automatically restarting it so you don't have to get out of bed in the middle of the night and manually restart it.

The second is to get some Python tracebacks for where it is stuck. Luckily (provided it works), the stuck process detection feature will try and log Python stack trace data for all threads on shutdown when it detects such a problem.

The option you want to use here with WSGIDaemonProcess is request-timeout. The trick here is going to be to work out a good value that is appropriate for your web application and the fact you are using 25 threads.

When one has 1 thread per process, the request-timeout works as a hard limit.

That is, if handling a web request takes more than the request-timeout, the process will be restarted to eject the potentially stuck request.

When you have more than 1 thread per process handling requests, a hard limit isn't necessarily a good idea due to the potential to kill off other active requests which are running fine. The graceful-timeout doesn't really help in this case because the stuck thread isn't going to complete, so you end up going into the more brutal shutdown sequence anyway.

Also, if only one request gets stuck, for example on a backend service call, in your case you would still have 24 other threads, so it seems a bit excessive to restart the whole process when it could chug along quite happily and the request may still complete after a while.

In a multithreaded situation, request-timeout is therefore interpreted differently: if you have set it to 45 seconds, for example, the process will only be restarted when the average accumulated running time for active requests, across all request thread slots, has reached 45 seconds.

Imagine a worst case scenario where 25 requests came in at the same time and they all managed to deadlock on this PostgreSQL issue. As they started at the same time, 45 seconds later they will all have been running for 45 seconds, and so the process will be restarted.

Now imagine instead that only one thread had got stuck. After 45 seconds the process wouldn't be restarted, as the average across all request thread slots would be 45/25 seconds if no other requests were running.

So the algorithm for now at least, tries to deal with the fact that there may still be threads which can be used and allow the process to run as long as possible before forcing a restart.

If further threads start getting stuck, then as their running time grows the average will keep growing, and eventually you will hit the 45 second average. This will obviously occur quite quickly if all threads start getting stuck at the same time, as seemed to happen in your case.
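The averaging rule above can be sketched in a few lines of Python (an illustration of the described behaviour only, not the actual mod_wsgi implementation):

```python
# Sketch of the request-timeout averaging rule described above.
# Illustration only; the function name and arguments are made up here.

def should_restart(active_request_runtimes, total_threads, request_timeout):
    """Restart when the running time of active requests, averaged across
    all request thread slots (busy and idle), reaches the timeout."""
    average = sum(active_request_runtimes) / float(total_threads)
    return average >= request_timeout

# All 25 threads deadlocked for 45 seconds: average is 45, so restart.
print(should_restart([45.0] * 25, 25, 45))   # -> True

# Only one thread stuck for 45 seconds: average is 45/25 = 1.8, no restart.
print(should_restart([45.0], 25, 45))        # -> False
```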

My suggestion therefore is to set request-timeout on WSGIDaemonProcess. Select a timeout value (in seconds) which is above your typical worst-case response time. The 99th percentile for your requests seems to be up around 25 seconds, so perhaps start with 60 seconds for the timeout initially.

So like with maximum-requests, we are putting in place something that will allow your system to keep going and recover, or avoid problems, while you work out the cause of why the process is hanging.

One caveat with request-timeout: although the process will now recover, you will have a bunch of backlogged requests which will immediately be sent into the daemon process and temporarily overload it. Those requests could have been queued for quite a while, and the users are likely no longer waiting. For these I would suggest setting the queue-timeout option on WSGIDaemonProcess. This will cause the backlog to be discarded with a service unavailable response, allowing the daemon process to recover more quickly.

Initially set queue-timeout to 45 seconds and see how it goes.
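Putting the two options together, the WSGIDaemonProcess line from your configuration would become something like the following (a sketch only; the timeout values are the starting points suggested above and will need tuning for your traffic):

```apache
WSGIDaemonProcess site-1 display-name=site-1 user=www-data threads=25 \
    request-timeout=60 queue-timeout=45 \
    python-path=/home/ubuntu/site-1/django:/home/ubuntu/.virtualenvs/site-1/lib/python2.7/site-packages
```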

If the problem occurs again, you may only know from messages in the Apache error logs about request and queue timeouts. If it does occur, also keep an eye out for stack trace information in the logs.

There are also various other timeouts you can set to override defaults, some of which are based on the value of the Timeout directive in Apache.

Can you tell me what your Apache configuration currently sets the Timeout directive to, or whether it is using the default? The builtin default for Timeout is 60, but some sample configuration files override it to 300, which is a ridiculously high value. Based on some of the log messages I suspect your system has it set to 300.
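If it turns out to be set to 300 in the main Apache configuration file (on Ubuntu that would typically be /etc/apache2/apache2.conf), reducing it is a one line change:

```apache
# Lower the general Apache I/O timeout from the distribution's 300 seconds
# back towards the builtin default.
Timeout 60
```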

Depending on your answer, we may set a few other timeout values as well, so as to best set up recovery for failure cases.

Graham

--
You received this message because you are subscribed to the Google Groups "modwsgi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to modwsgi+u...@googlegroups.com.
To post to this group, send email to mod...@googlegroups.com.
Visit this group at http://groups.google.com/group/modwsgi.
For more options, visit https://groups.google.com/d/optout.

Kelvin Wong

Nov 4, 2014, 10:02:19 PM
to mod...@googlegroups.com
Thanks for your suggestions.
I will try the request-timeout and queue-timeout settings for process recovery.

In my Apache configuration the Timeout directive is currently set to the default of 300; I will set it to 60.

But can you suggest some guides or ways to spot the deadlocked requests?
Can they be spotted in the code, or by adding some logging to it?

Thanks again.

Graham Dumpleton

Nov 4, 2014, 10:10:54 PM
to mod...@googlegroups.com
If the request-timeout mechanism of WSGIDaemonProcess triggers, it will attempt to log stack frames for each request thread into the Apache error log. That will be the first thing to look at.

The alternative is:


but you need to be present to trigger it to dump the stack traces, which is why it is worth seeing first whether the new mod_wsgi feature that dumps them automatically yields results.
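If you want to capture stack traces from your own code as well, for example from a debug view or a signal handler you wire up yourself, the underlying technique is a sketch along these lines (the function name here is illustrative, not a mod_wsgi API; it uses the same low-level sys._current_frames() mechanism such tools are built on):

```python
import sys
import threading
import traceback

def dump_thread_stacks():
    """Return a formatted dump of the current stack of every thread."""
    # Map thread idents to names so the dump is easier to read.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    parts = []
    for thread_id, frame in sys._current_frames().items():
        parts.append("Thread %s (%s):\n" % (thread_id, names.get(thread_id, "unknown")))
        # format_stack() walks from the outermost caller down to this frame.
        parts.extend(traceback.format_stack(frame))
    return "".join(parts)

print(dump_thread_stacks())
```

A thread that is deadlocked will show up in this dump stuck at the same line, e.g. inside a database connect() call, every time you capture it.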

Graham

Kelvin Wong

Nov 5, 2014, 2:45:13 AM
to mod...@googlegroups.com
I have attached the stack trace from when the request-timeout triggered.
Can you give me some hints on what is happening?
It seems related to the issue you mentioned.

Thanks.
stacktrace.log

Graham Dumpleton

Nov 5, 2014, 5:33:29 AM
to mod...@googlegroups.com
If all the request threads reported as being stuck in the PostgreSQL connect() call, then it would appear to be the same issue.

The problem is caused by a bug in the libpq client library for PostgreSQL. That there is a problem has been acknowledged by the PostgreSQL developers in:


Unfortunately the PostgreSQL developers have no public issue tracker that I know of, so I have no idea of the status of the issue, whether they have a fix, or whether one has been released yet. Given that it was only acknowledged recently, likely not. Even if they had come out with a fix, Linux distributions do not update packages for past releases except for security fixes, so if you are using a packaged PostgreSQL you will likely never see the fix while you are on the older Linux version.

As to the scenario which causes the problem, it arises when you are talking to your PostgreSQL database over an SSL connection while, at the same time, other code within your application in parallel threads is making separate SSL connections out to backend web services or remote web sites.

So if, for example, any of your HTTP call outs to Instagram, Facebook or Amazon S3 are over SSL connections, you can hit this problem. The fact that the NR agent also makes HTTPS call outs once a minute adds to the possibility that you could hit the bug in PostgreSQL, but the issue can arise independently of whether you are using the NR agent.

Getting down to the actual bug in PostgreSQL: when the client library makes an SSL connection and doesn't already have an SSL connection to the database open, it first does some initialisation in the SSL libraries, registering callbacks which allow it to manage locks on shared resources. It keeps these callbacks in place so long as it has at least one active SSL connection. When all SSL connections to the database are closed, it deregisters its callbacks; they are re-registered the next time it creates a database connection.

The problem arises where a separate SSL connection is created, for an HTTP request or something else, whose lifetime starts while there is an active database SSL connection but finishes after the last database SSL connection has closed.

In this situation, the HTTP request over SSL will still go through the locking callbacks set up by PostgreSQL, but by the time the HTTP request completes, PostgreSQL has removed those callbacks, so they aren't called and the locks acquired at the start of the HTTP request are never released. The next time PostgreSQL creates an SSL connection, it sets up its callbacks again and tries to acquire the locks, but since they were never released at the end of the HTTP request, it deadlocks instead. Over time, as each request handler runs, this happens to every thread and so the whole process gets stuck.
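That sequence of events can be sketched as a toy model (purely illustrative Python; the real interaction is C-level, via OpenSSL's locking callbacks and libpq, but the shape of the bug is the same):

```python
import threading

lock = threading.Lock()   # stands in for an internal OpenSSL lock
callback = None           # the locking callback currently registered

def libpq_locking_callback(acquire):
    # libpq's callback: acquire or release the lock on OpenSSL's behalf.
    if acquire:
        lock.acquire()
    else:
        lock.release()

# 1. libpq opens its first database SSL connection and registers callbacks.
callback = libpq_locking_callback

# 2. An unrelated HTTPS request starts; OpenSSL acquires the lock through
#    whatever callback is currently registered.
callback(True)

# 3. The last database SSL connection closes; libpq deregisters its callbacks.
callback = None

# 4. The HTTPS request finishes; with no callback registered, nothing
#    releases the lock.
if callback is not None:
    callback(False)

# 5. libpq re-registers its callbacks for a new database connection and tries
#    to acquire the lock again. The real code blocks forever at this point;
#    this sketch uses a non-blocking acquire so that it terminates.
callback = libpq_locking_callback
deadlocked = not lock.acquire(False)
print(deadlocked)  # True: the lock is still held, so connect() would hang
```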

Overall it is a pretty evil bug, so I am not sure why it hasn't been noticed before. It does explain cases which have popped up over the years where mod_wsgi users' web applications were deadlocking, but where I never got sufficient information out of people to track it down.

What can you do?

If you can't avoid using SSL connections for outbound HTTP requests from your application, it is going to be hard to avoid the problem completely.

There are perhaps a couple of things you could do to lessen the possibility of it occurring though.

The first is to not use SSL connections for HTTP calls where you don't absolutely require them. Obviously this usually isn't going to be possible.

You might for example configure the NR agent not to use SSL connections by setting:

ssl = false

in the agent configuration file.

This does mean that data goes in the clear. For metric data this isn't going to be a problem as it is generally innocuous. The only area of concern may be error details, which can contain Python traceback information.

Making that change for the NR agent isn't going to help you with Instagram, Facebook and Amazon S3 if they must use SSL.

The other thing you can do, if using Django, is to use a new enough Django version that you can use its database connection pooling feature. This will help because normally Django creates a new database connection per request, so there are lots of windows in which there is no active database connection, and an HTTP request over an SSL connection can overlap the point at which the database connection was closed.

By using the database connection pooling mechanism in Django, database connections will be kept active for the life of the whole process, thereby pretty well eliminating the probability that this can occur.
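In settings terms this is a sketch along these lines (assumption: the feature referred to is Django's persistent database connections, configured via CONN_MAX_AGE from Django 1.6 onwards; the database name, user and host below are placeholders):

```python
# Fragment of a Django settings.py. CONN_MAX_AGE = None keeps each database
# connection open for the lifetime of the process, rather than opening and
# closing one per request.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "mydb",            # placeholder database name
        "USER": "myuser",          # placeholder user
        "HOST": "db.example.com",  # placeholder host
        "CONN_MAX_AGE": None,      # None = unlimited persistence
        "OPTIONS": {"sslmode": "require"},  # SSL connection to the database
    }
}
```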

Graham


Kelvin Wong

Nov 5, 2014, 11:47:50 PM
to mod...@googlegroups.com
Thanks a lot.

I have enabled the Django connection pooling feature to avoid making database connections frequently.


