inactivity-timeout


Kent

Jan 14, 2015, 4:32:53 PM
to mod...@googlegroups.com
Graham, the docs state: "For the purposes of this option, being idle means no new requests being received, or no attempts by current requests to read request content or generate response content for the defined period."

This implies to me that a running request that is taking a long time could actually be killed as if it were idle (suppose it were fetching a very slow database query). Is this the case?

Also, I'm looking for an ultra-conservative and graceful method of recycling memory. I've read your article on url partitioning, which was useful, but sooner or later, one must rely on either inactivity-timeout or maximum-requests, is that accurate? But both these will eventually, after graceful timeout/shutdown timeout, potentially kill active requests. It is valid for our app to handle long-running reports, so I was hoping for an ultra-safe mechanism.
Do you have any advice here?

Graham Dumpleton

Jan 14, 2015, 9:48:02 PM
to mod...@googlegroups.com

On 15/01/2015, at 8:32 AM, Kent <jkent...@gmail.com> wrote:

> Graham, the docs state: "For the purposes of this option, being idle means no new requests being received, or no attempts by current requests to read request content or generate response content for the defined period."
>
> This implies to me that a running request that is taking a long time could actually be killed as if it were idle (suppose it were fetching a very slow database query). Is this the case?

This is the case for mod_wsgi prior to version 4.0.

Things have changed in mod_wsgi 4.X.

How long are your long running requests though? The inactivity-timeout was more about restarting infrequently used applications so that memory can be taken back.

> Also, I'm looking for an ultra-conservative and graceful method of recycling memory. I've read your article on url partitioning, which was useful, but sooner or later, one must rely on either inactivity-timeout or maximum-requests, is that accurate? But both these will eventually, after graceful timeout/shutdown timeout, potentially kill active requests. It is valid for our app to handle long-running reports, so I was hoping for an ultra-safe mechanism.
> Do you have any advice here?

The options available in mod_wsgi 4.X are much better in this area than 3.X. The changes in 4.X aren't covered in the main documentation though and are only described in the release notes for the release where each change was made.

In 4.X the concept of an inactivity-timeout is now separate from the idea of a request-timeout. There is also a graceful-timeout that can be applied to maximum-requests, and in some other situations as well, to allow requests to finish properly before being more brutal. One can also signal the daemon processes to do a more graceful restart.

You cannot totally avoid having to be brutal and kill things, though; otherwise you have no fail safe for a stuck process where all request threads are blocked on back end services and are never going to recover. Use of multithreading in a process also complicates the implementation of request-timeout.
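To make that concrete, a daemon process group using the 4.X options might be configured something like this. This is a rough sketch only: the process group name, script path, and all values here are made up for illustration, and option availability assumes mod_wsgi 4.1+.

```apache
# Sketch only: names, paths and values are illustrative, not recommendations.
# inactivity-timeout now restarts the process only when it is completely idle,
# request-timeout is the separate fail safe for stuck or blocked requests, and
# graceful-timeout gives active requests a chance to finish before a forced
# shutdown is used.
WSGIDaemonProcess example processes=2 threads=15 \
    inactivity-timeout=300 request-timeout=600 graceful-timeout=900
WSGIScriptAlias / /var/www/example/app.wsgi process-group=example
```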

Anyway, the big question is what version are you using?

Graham

Kent

Jan 15, 2015, 3:28:01 PM
to mod...@googlegroups.com
I'm running 4 (a very early version of it, possibly before you officially released it).   We upgraded to take advantage of the amazingly-helpful SIGUSR1 signaling for graceful process restarting, which we use somewhat regularly to gracefully deploy software changes (minor ones which won't matter if 2 processes have different versions loaded) without disrupting users.  Thanks a ton for that!

We are also multi-threading our processes (plural processes, plural threads).

Some requests could be (validly) running for very long periods of time (database reporting, maybe even half hour, though that would be very extreme).

Some processes (especially those generating .pdfs, for example) hog tons of RAM, as you know, so I'd like these to eventually check their RAM back in, so to speak, by utilizing either inactivity-timeout or maximum-requests, but always in a very gentle way, since, as I mentioned, some requests might be properly running, even though for many minutes.  maximum-requests seems too brutal for my use-case since the threshold request sends the process down the graceful-timeout/shutdown-timeout path, even if there are valid requests running, and then SIGKILLs.  My ideal vision of "maximum-requests", since it is primarily for memory management, is to be very gentle, sort of a "ok, now that I've hit my threshold, at my next earliest convenience, I should die, but only once all my current requests have ended of their own accord."

"inactivity-timeout" seems to function exactly as I want in that it seems like it won't ever kill a process with a thread with an active request (at least, I can't get it to, even by adding a long import time; time.sleep(longtime)... it doesn't seem to die until the request is finished).  But that's why the documentation made me nervous, because it implies that it could, in fact, kill a proc with an active request: "For the purposes of this option, being idle means no new requests being received, or no attempts by current requests to read request content or generate response content for the defined period."

I'd rather have a more gentle "maximum-requests" than "inactivity-timeout" because then, even on very heavy days (when RAM is most likely to choke), I could gracefully turn over these processes a couple times a day, which I couldn't do with "inactivity-timeout" on an extremely heavy day.

Hope this makes sense.  I'm really asking:
  1. whether inactivity-timeout triggering will ever SIGKILL a process with an active request, as the docs intimate
  2. whether there is any way to get maximum-requests to behave more gently under all circumstances
  3. for your ideas/best advice 

Thanks for your help!

Graham Dumpleton

Jan 16, 2015, 4:54:50 AM
to mod...@googlegroups.com
On 16/01/2015, at 7:28 AM, Kent <jkent...@gmail.com> wrote:

> I'm running 4 (a very early version of it, possibly before you officially released it).   We upgraded to take advantage of the amazingly-helpful SIGUSR1 signaling for graceful process restarting, which we use somewhat regularly to gracefully deploy software changes (minor ones which won't matter if 2 processes have different versions loaded) without disrupting users.  Thanks a ton for that!

SIGUSR1 support was released in version 4.1.0.


That same version has all the other stuff which was changed, so long as the actual 4.1.0 release is being used and you aren't still using the repo from before the official release.

If not sure, best to just upgrade to the latest version if you can.

> We are also multi-threading our processes (plural processes, plural threads).

> Some requests could be (validly) running for very long periods of time (database reporting, maybe even half hour, though that would be very extreme).

> Some processes (especially those generating .pdfs, for example) hog tons of RAM, as you know, so I'd like these to eventually check their RAM back in, so to speak, by utilizing either inactivity-timeout or maximum-requests, but always in a very gentle way, since, as I mentioned, some requests might be properly running, even though for many minutes.  maximum-requests seems too brutal for my use-case since the threshold request sends the process down the graceful-timeout/shutdown-timeout path, even if there are valid requests running, and then SIGKILLs.  My ideal vision of "maximum-requests", since it is primarily for memory management, is to be very gentle, sort of a "ok, now that I've hit my threshold, at my next earliest convenience, I should die, but only once all my current requests have ended of their own accord."

That is where, if you vertically partition those URLs out to a separate daemon process group, you can simply set a very high graceful-timeout value.

So relying on the feature:

"""
2. Add a graceful-timeout option to WSGIDaemonProcess. This option is applied in a number of circumstances.

When maximum-requests and this option are used together, when maximum requests is reached, rather than immediately shutting down, potentially interrupting active requests if they don't finish within the shutdown timeout, one can specify a separate graceful shutdown period. If all requests are completed within this time frame then it will shutdown immediately, otherwise the normal forced shutdown kicks in. In some respects this is just allowing a separate shutdown timeout for cases where requests could be interrupted, so that can be avoided if possible.
"""

You could set:

    maximum-requests=20 graceful-timeout=600

So as soon as it hits 20 requests, it switches to a mode where it will restart once there are no active requests. You can set that timeout as high as you want, even hours, and it will not care.
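In context, that combination would sit on the WSGIDaemonProcess directive for the partitioned daemon process group, something like the following. The group name, process/thread counts, and the exact values are hypothetical.

```apache
# Hypothetical daemon process group for the long running report URLs.
# After 20 requests the process waits up to 10 minutes for active
# requests to drain before a forced shutdown would apply.
WSGIDaemonProcess reports processes=2 threads=5 \
    maximum-requests=20 graceful-timeout=600
```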

> "inactivity-timeout" seems to function exactly as I want in that it seems like it won't ever kill a process with a thread with an active request (at least, I can't get it to, even by adding a long import time; time.sleep(longtime)... it doesn't seem to die until the request is finished).  But that's why the documentation made me nervous, because it implies that it could, in fact, kill a proc with an active request: "For the purposes of this option, being idle means no new requests being received, or no attempts by current requests to read request content or generate response content for the defined period."

The release notes for 4.1.0 say:

"""
4. The inactivity-timeout option to WSGIDaemonProcess now only results in the daemon process being restarted after the idle timeout period where there are no active requests. Previously it would also interrupt a long running request. See the new request-timeout option for a way of interrupting long running, potentially blocked requests and restarting the process.
"""

> I'd rather have a more gentle "maximum-requests" than "inactivity-timeout" because then, even on very heavy days (when RAM is most likely to choke), I could gracefully turn over these processes a couple times a day, which I couldn't do with "inactivity-timeout" on an extremely heavy day.

> Hope this makes sense.  I'm really asking:
>   1. whether inactivity-timeout triggering will ever SIGKILL a process with an active request, as the docs intimate

Not from 4.1.0 onwards.

>   2. whether there is any way to get maximum-requests to behave more gently under all circumstances

By setting a very, very long graceful-timeout.

>   3. for your ideas/best advice

Have a good read through the release notes for 4.1.0.

Not necessarily useful in your case, but also look at request-timeout. It can act as a final fail safe for when things are stuck, but since it is more of a fail safe, it doesn't make use of graceful-timeout.
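If you did want that fail safe as well, it is just another option on the same directive. A sketch, with a deliberately generous placeholder value given the half-hour reports mentioned earlier in the thread:

```apache
# Placeholder value: allow up to an hour before a stuck request causes
# the process to be restarted; this bypasses graceful-timeout by design.
WSGIDaemonProcess reports threads=5 request-timeout=3600
```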

Graham



Kent

Jan 16, 2015, 8:27:50 AM
to mod...@googlegroups.com
Thanks again.  Yes, I did take our current version from the repo because you hadn't released the SIGUSR1 bit yet...  I should upgrade now.

As for the very long graceful-timeout, I was skirting around that solution because I like where the setting is currently for SIGUSR1.  I would like to ask, "Is there a way to indicate a different graceful-timeout for handling SIGUSR1 vs. maximum-requests?" but I already have the answer from the release notes: "No."

I don't know if you can see the value in distinguishing the two, but maximum-requests is sort of a "standard operating mode," so it might make sense for a modwsgi user to want a higher, very gentle mode of operation there, whereas SIGUSR1, while beautifully more graceful than SIGKILL, still "means business," so the same user may want a shorter responsive timeout there (while still allowing a decent chunk of time for being graceful to running requests).   That is the case for me at least.  Any chance you'd entertain that as a feature request?

Even if not, you've been extremely helpful, thank you!  And thanks for pointing me to the correct version of the documentation: I thought I was reading the current version's.
Kent

P.S. I also have ideas for possible vertical URL partitioning, but unfortunately, our app has much cross-over by URL, so that's why I'm down this maximum-requests path...

Graham Dumpleton

Jan 18, 2015, 12:43:08 AM
to mod...@googlegroups.com
There are a few possibilities here of how this could be enhanced/changed.

The problem with maximum-requests is that it can be dangerous. People can set it too low and when their site gets a big spike of traffic then the processes can be restarted too quickly only adding to the load of the site and causing things to slow down and hamper their ability to handle the spike. This is where setting a longer amount of time for graceful-timeout helps because you can set it to be quite large. The use of maximum-requests can still be like using a hammer though, and one which can be applied unpredictably.

The maximum-requests option also doesn't help in the case where you are running background threads which do stuff and it is them and not the number of requests coming in that dictate things like memory growth that you want to counter.

So the other option I have contemplated adding a number of times is one to periodically restart the process. The way this would work is that a process restart would be done periodically based on what time was specified. You could therefore say the restart interval was 3600 and it would restart the process once an hour.

The start of the time period for this would either be when the process was created, if any Python code or a WSGI script was preloaded at process start time, or from when the first request arrived if the WSGI application was lazily loaded. This restart-interval could be tied to the graceful-timeout option so that you can set an extended period if you want to try and ensure that requests are not interrupted.

Now we have the ability to send the process a graceful restart signal (usually SIGUSR1), to force an individual process to restart.

Right now this is tied to the graceful-timeout duration as well, which as you point out, would perhaps be better off having its own time duration for the notional grace period.

Using the name restart-timeout for this could be confusing if I have a restart interval option.

I also have another type of process restart I am trying to work out how to accommodate and the naming of options again complicates the problem. In this case we want to introduce an artificial restart delay.

This would be an option to combat the problem which is being caused by Django 1.7, in that WSGI script file loading for Django isn't stateless. If a transient problem occurs, such as the database not being ready, the loading of the WSGI script file can fail. On the next request an attempt is made to load it again, but now Django kicks up a stink because it was halfway through setting things up last time when it failed and the setup code cannot be run a second time. The result is that the process then keeps failing.

The idea of the restart delay option therefore is to allow you to set it to a number of seconds, normally just 1. If set like that, and a WSGI script file import fails, it will effectively block for the delay specified, and when that is over it will kill the process so the whole process is thrown away and the WSGI script file can be reloaded in a fresh process. This gets rid of the problem of Django initialisation not being able to be retried.

A delay is needed to avoid an effective fork bomb, where a WSGI script file not loading under high request throughput would cause a constant cycle of processes dying and being replaced. It is possible it wouldn't be as bad as I think, as Apache only checks for dead processes to replace once a second, but I'd still prefer my own fail safe in case that changes.

I am therefore totally fine with a separate graceful time period for when SIGUSR1 is used, I just need to juggle these different features and come up with an option naming scheme that makes sense.

How about then that I add the following new options:

    maximum-lifetime - Similar to maximum-requests in that it will cause the processes to be shut down and restarted, but in this case it will occur based on the time period given as argument, measured from the first request, or from when the WSGI script file or any other Python code was preloaded, that is, in the latter case when the process was started.

    restart-timeout - Specifies a separate grace period for when the process is being forcibly restarted using the graceful restart signal. If restart-timeout is not specified and graceful-timeout is specified, then the value of graceful-timeout is used. If neither are specified, then the restart signal will behave similar to the process being sent a SIGINT.

    linger-timeout - When a WSGI script file, or other Python code, is being imported by mod_wsgi directly, if that fails the default is that the error is ignored. For a WSGI script file, reloading will be attempted on the next request. But if preloading code, it will fail and merely be logged. If linger-timeout is specified with a non-zero value, the value being seconds, then the daemon process will instead be shut down and restarted to try and allow a successful reloading of the code to occur if it was a transient issue. To avoid a fork bomb in the case of a persistent issue, a delay will be introduced based on the value of the linger-timeout option.
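If it helps to visualise the proposal, usage might eventually look something like the following. To be clear, these are only proposed options at this point, so the names, defaults, and behaviour could all change before any release; the values are placeholders.

```apache
# Proposed options only -- not available in any released mod_wsgi version
# at the time of writing; names and semantics may change.
WSGIDaemonProcess example threads=15 \
    maximum-lifetime=3600 \
    graceful-timeout=600 \
    restart-timeout=60 \
    linger-timeout=1
```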

How does that all sound, if it makes sense that is. :-)

Graham

Kent

Jan 20, 2015, 7:50:13 AM
to mod...@googlegroups.com
On Sunday, January 18, 2015 at 12:43:08 AM UTC-5, Graham Dumpleton wrote:
> There are a few possibilities here of how this could be enhanced/changed.
>
> The problem with maximum-requests is that it can be dangerous. People can set it too low and when their site gets a big spike of traffic then the processes can be restarted too quickly only adding to the load of the site and causing things to slow down and hamper their ability to handle the spike. This is where setting a longer amount of time for graceful-timeout helps because you can set it to be quite large. The use of maximum-requests can still be like using a hammer though, and one which can be applied unpredictably.

Yes, I can see that. (It may be overkill, but you could default a separate minimum-lifetime parameter so only users who specifically mess with that as well as maximum-requests shoot themselves in the foot, but it is starting to get confusing with all the different timeouts, I'll agree there...)
 

> The maximum-requests option also doesn't help in the case where you are running background threads which do stuff and it is them and not the number of requests coming in that dictate things like memory growth that you want to counter.


True, but solving with maximum lifetime... well, actually, solving memory problems with any of these mechanisms isn't measuring the heart of the problem, which is RAM.  I imagine there isn't a good way to measure RAM or you would have added that option by now.  Seems what we are truly after for the majority of these isn't how many requests or how long it's been up, etc., but how much RAM it is taking (or perhaps, optionally, average RAM per thread, instead).  If my process exceeds consuming 1.5GB, then trigger a graceful restart at the next appropriate convenience, being gentle to existing requests.  That may arguably be the most useful parameter.

 
> So the other option I have contemplated adding a number of times is one to periodically restart the process. The way this would work is that a process restart would be done periodically based on what time was specified. You could therefore say the restart interval was 3600 and it would restart the process once an hour.
>
> The start of the time period for this would either be when the process was created, if any Python code or a WSGI script was preloaded at process start time, or from when the first request arrived if the WSGI application was lazily loaded. This restart-interval could be tied to the graceful-timeout option so that you can set an extended period if you want to try and ensure that requests are not interrupted.

We just wouldn't want it to die having never even served a single request, so my vote would be against the birth of the process as the beginning point (and, rather, at first request).
 

> Now we have the ability to send the process a graceful restart signal (usually SIGUSR1), to force an individual process to restart.
>
> Right now this is tied to the graceful-timeout duration as well, which as you point out, would perhaps be better off having its own time duration for the notional grace period.
>
> Using the name restart-timeout for this could be confusing if I have a restart interval option.


In my opinion, SIGUSR1 is different from the automatic parameters because it was (most likely) triggered by user intervention, so that one should ideally have its own parameter.  If that is the case and this parameter becomes dedicated to SIGUSR1, then the least ambiguous name I can think of is sigusr1-timeout.
 
> I also have another type of process restart I am trying to work out how to accommodate and the naming of options again complicates the problem. In this case we want to introduce an artificial restart delay.
>
> This would be an option to combat the problem which is being caused by Django 1.7, in that WSGI script file loading for Django isn't stateless. If a transient problem occurs, such as the database not being ready, the loading of the WSGI script file can fail. On the next request an attempt is made to load it again, but now Django kicks up a stink because it was halfway through setting things up last time when it failed and the setup code cannot be run a second time. The result is that the process then keeps failing.
>
> The idea of the restart delay option therefore is to allow you to set it to a number of seconds, normally just 1. If set like that, and a WSGI script file import fails, it will effectively block for the delay specified, and when that is over it will kill the process so the whole process is thrown away and the WSGI script file can be reloaded in a fresh process. This gets rid of the problem of Django initialisation not being able to be retried.


(We are using turbogears... I don't think I've seen something like that happen, but rarely have seen start up anomalies.)
 
> A delay is needed to avoid an effective fork bomb, where a WSGI script file not loading under high request throughput would cause a constant cycle of processes dying and being replaced. It is possible it wouldn't be as bad as I think, as Apache only checks for dead processes to replace once a second, but I'd still prefer my own fail safe in case that changes.
>
> I am therefore totally fine with a separate graceful time period for when SIGUSR1 is used, I just need to juggle these different features and come up with an option naming scheme that makes sense.
>
> How about then that I add the following new options:
>
>     maximum-lifetime - Similar to maximum-requests in that it will cause the processes to be shut down and restarted, but in this case it will occur based on the time period given as argument, measured from the first request, or from when the WSGI script file or any other Python code was preloaded, that is, in the latter case when the process was started.
>
>     restart-timeout - Specifies a separate grace period for when the process is being forcibly restarted using the graceful restart signal. If restart-timeout is not specified and graceful-timeout is specified, then the value of graceful-timeout is used. If neither are specified, then the restart signal will behave similar to the process being sent a SIGINT.
>
>     linger-timeout - When a WSGI script file, or other Python code, is being imported by mod_wsgi directly, if that fails the default is that the error is ignored. For a WSGI script file, reloading will be attempted on the next request. But if preloading code, it will fail and merely be logged. If linger-timeout is specified with a non-zero value, the value being seconds, then the daemon process will instead be shut down and restarted to try and allow a successful reloading of the code to occur if it was a transient issue. To avoid a fork bomb in the case of a persistent issue, a delay will be introduced based on the value of the linger-timeout option.
>
> How does that all sound, if it makes sense that is. :-)



That sounds absolutely great!  How would I get on the notification cc: of the ticket or whatever so I'd be informed of progress on that?

 

Graham Dumpleton

Jan 20, 2015, 5:53:26 PM
to mod...@googlegroups.com
On 20/01/2015, at 11:50 PM, Kent <jkent...@gmail.com> wrote:

> On Sunday, January 18, 2015 at 12:43:08 AM UTC-5, Graham Dumpleton wrote:
>> There are a few possibilities here of how this could be enhanced/changed.
>>
>> The problem with maximum-requests is that it can be dangerous. People can set it too low and when their site gets a big spike of traffic then the processes can be restarted too quickly only adding to the load of the site and causing things to slow down and hamper their ability to handle the spike. This is where setting a longer amount of time for graceful-timeout helps because you can set it to be quite large. The use of maximum-requests can still be like using a hammer though, and one which can be applied unpredictably.
>
> Yes, I can see that. (It may be overkill, but you could default a separate minimum-lifetime parameter so only users who specifically mess with that as well as maximum-requests shoot themselves in the foot, but it is starting to get confusing with all the different timeouts, I'll agree there...)
 

The minimum-lifetime option is an interesting idea. It may have to do nothing by default to avoid conflicts with existing expected behaviour.


>> The maximum-requests option also doesn't help in the case where you are running background threads which do stuff and it is them and not the number of requests coming in that dictate things like memory growth that you want to counter.
>
> True, but solving with maximum lifetime... well, actually, solving memory problems with any of these mechanisms isn't measuring the heart of the problem, which is RAM.  I imagine there isn't a good way to measure RAM or you would have added that option by now.  Seems what we are truly after for the majority of these isn't how many requests or how long it's been up, etc., but how much RAM it is taking (or perhaps, optionally, average RAM per thread, instead).  If my process exceeds consuming 1.5GB, then trigger a graceful restart at the next appropriate convenience, being gentle to existing requests.  That may arguably be the most useful parameter.


The problem with calculating memory is that there isn't one cross platform portable way of doing it. On Linux you have to dive into the /proc file system. On MacOS X you can use C API calls. On Solaris I think you again need to dive into a /proc file system but it obviously has a different file structure for getting details out compared to Linux. Adding such cross platform stuff in gets a bit messy.

What I was moving towards, as an extension of the monitoring stuff I am doing for mod_wsgi, was to have a special daemon process you can set up which has access to some sort of management API. You could then create your own Python script that runs in that and which, using the management API, can get the daemon process pids, then use Python psutil to get memory usage on a periodic basis, and then decide whether a process should be restarted and send it a signal to stop. Or the management API could provide a way to notify the daemon process, maybe by signal, or maybe using a shared memory flag, that it should shut down.

>> So the other option I have contemplated adding a number of times is one to periodically restart the process. The way this would work is that a process restart would be done periodically based on what time was specified. You could therefore say the restart interval was 3600 and it would restart the process once an hour.
>>
>> The start of the time period for this would either be when the process was created, if any Python code or a WSGI script was preloaded at process start time, or from when the first request arrived if the WSGI application was lazily loaded. This restart-interval could be tied to the graceful-timeout option so that you can set an extended period if you want to try and ensure that requests are not interrupted.
>
> We just wouldn't want it to die having never even served a single request, so my vote would be against the birth of the process as the beginning point (and, rather, at first request).


It would effectively be from first request if lazily loaded. If preloaded though, as background threads could be created which do stuff and consume memory over time, would then be from when process started, ie., when Python code was preloaded.


>> Now we have the ability to send the process a graceful restart signal (usually SIGUSR1), to force an individual process to restart.
>>
>> Right now this is tied to the graceful-timeout duration as well, which as you point out, would perhaps be better off having its own time duration for the notional grace period.
>>
>> Using the name restart-timeout for this could be confusing if I have a restart interval option.
>
> In my opinion, SIGUSR1 is different from the automatic parameters because it was (most likely) triggered by user intervention, so that one should ideally have its own parameter.  If that is the case and this parameter becomes dedicated to SIGUSR1, then the least ambiguous name I can think of is sigusr1-timeout.
 

Except that it isn't guaranteed to be called SIGUSR1. Technically it could be a different signal dependent on platform that Apache runs as. But then, as far as I know all UNIX systems do use SIGUSR1.

I also have another type of process restart I am trying to work out how to accommodate and the naming of options again complicates the problem. In this case we want to introduce an artificial restart delay.

This would be an option to combat the problem which is being caused by Django 1.7 in that WSGI script file loading for Django isn't stateless. If a transient problem occurs, such as the database not being ready, the loading of the WSGI script file can fail. On the next request an attempt is made to load it again but now Django kicks a stink because it was half way setting things up last time when it failed and the setup code cannot be run a second time. The result is that the process then keeps failing.

The idea of the restart delay option therefore is to allow you to set it to number of seconds, normally just 1. If set like that, if a WSGI script file import fails, it will effectively block for the delay specified and when over it will kill the process so the whole process is thrown away and the WSGI script file can be reloaded in a fresh process. This gets rid of the problem of Django initialisation not being able to be retried.


(We are using turbogears... I don't think I've seen something like that happen, but rarely have seen start up anomalies.)
 
A delay is needed to avoid an effective fork bomb, where a WSGI script file not loading with high request throughput would cause a constant cycle of processes dying and being replaced. It is possible it wouldn't be as bad as I think as Apache only checks for dead processes to replace once a second, but still prefer my own failsafe in case that changes.

I am therefore totally fine with a separate graceful time period for when SIGUSR1 is used, I just need to juggle these different features and come up with an option naming scheme that make sense.

How about then that I add the following new options:

    maximum-lifetime - Similar to maximum-requests in that it will cause the processes to be shutdown and restarted, but in this case it will occur based on the time period given as argument, measured from the first request or when the WSGI script file or any other Python code was preloaded, that is, in the latter case when the process was started.

    restart-timeout - Specifies a separate grace period for when the process is being forcibly restarted using the graceful restart signal. If restart-timeout is not specified and graceful-timeout is specified, then the value of graceful-timeout is used. If neither are specified, then the restart signal will be have similar to the process being sent a SIGINT.

    linger-timeout - When a WSGI script file, of other Python code is being imported by mod_wsgi directly, if that fails the default is that the error is ignored. For a WSGI script file reloading will be attempted on the next request. But if preloading code then it will fail and merely be logged. If linger-timeout is specified to a non zero value, with the value being seconds, then the daemon process will instead be shutdown and restarted to try and allow a successful reloading of the code to occur if it was a transient issue. To avoid a fork bomb if a persistent issue, a delay will be introduced based on the value of the linger-timeout option.
 
How does that all sound, if it makes sense that is. :-)



That sounds absolutely great!  How would I get on the notification cc: of the ticket or whatever so I'd be informed of progress on that?

These days my turn around time is pretty quick so long as I am happy and know what to change and how. So I just need to think a bit more about it and gets some day job stuff out of the way before I can do something.

So don't be surprised if you simply get a reply to this email within a week pointing at a development version to try.

Graham

Kent

unread,
Jan 21, 2015, 7:15:00 AM1/21/15
to mod...@googlegroups.com

On Tuesday, January 20, 2015 at 5:53:26 PM UTC-5, Graham Dumpleton wrote:

On 20/01/2015, at 11:50 PM, Kent <jkent...@gmail.com> wrote:

On Sunday, January 18, 2015 at 12:43:08 AM UTC-5, Graham Dumpleton wrote:
There are a few possibilities here of how this could be enhanced/changed.

The problem with maximum-requests is that it can be dangerous. People can set it too low, and when their site gets a big spike of traffic the processes can be restarted too quickly, only adding to the load of the site, slowing things down and hampering their ability to handle the spike. This is where setting a longer graceful-timeout helps, because you can set it to be quite large. The use of maximum-requests can still be like using a hammer though, and one which can be applied unpredictably.

Yes, I can see that. (It may be overkill, but you could default a separate minimum-lifetime parameter so that only users who specifically mess with that as well as maximum-requests can shoot themselves in the foot, but it is starting to get confusing with all the different timeouts, I'll agree there...)
 

The minimum-lifetime option is an interesting idea. It may have to do nothing by default to avoid conflicts with existing expected behaviour.


The maximum-requests option also doesn't help in the case where you are running background threads which do stuff, and it is those threads, not the number of requests coming in, that dictate things like the memory growth you want to counter.


True, but solving it with maximum lifetime... well, actually, solving memory problems with any of these mechanisms isn't measuring the heart of the problem, which is RAM. I imagine there isn't a good way to measure RAM or you would have added that option by now. It seems what we are truly after in the majority of these cases isn't how many requests have been served or how long the process has been up, but how much RAM it is taking (or perhaps, optionally, average RAM per thread instead). If my process exceeds 1.5GB, then trigger a graceful restart at the next appropriate convenience, being gentle to existing requests. That may arguably be the most useful parameter.


The problem with calculating memory is that there isn't one cross platform portable way of doing it. On Linux you have to dive into the /proc file system. On MacOS X you can use C API calls. On Solaris I think you again need to dive into a /proc file system but it obviously has a different file structure for getting details out compared to Linux. Adding such cross platform stuff in gets a bit messy.

What I was moving towards, as an extension of the monitoring work I am doing for mod_wsgi, was a special daemon process you can set up which has access to some sort of management API. You could then run your own Python script in it which, using the management API, gets the daemon process pids, uses Python psutil to check memory usage on a periodic basis, and then decides whether a process should be restarted and sends it a signal to stop. Alternatively, the management API could provide a way to notify the daemon process, maybe by signal or a shared memory flag, that it should shut down.
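As a rough sketch of what such a monitoring script might do, assuming the daemon process pids are already known by some means (the management API described here did not exist at the time, so pid discovery is left out), a Linux-only check using the /proc filesystem mentioned above could look like:

```python
import os
import signal

def rss_bytes(pid):
    """Resident set size of `pid` in bytes, read from Linux /proc."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024  # value is reported in kB
    return 0

def restart_if_bloated(pid, limit=1536 * 1024 * 1024):
    """Send the mod_wsgi graceful restart signal if `pid` exceeds `limit`."""
    if rss_bytes(pid) > limit:
        os.kill(pid, signal.SIGUSR1)
```

Run periodically (e.g. from cron) against each daemon pid; psutil would give the same information portably, without the Linux-specific /proc parsing.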


I figured there was something making that a pain...
 
So the other option I have contemplated adding a number of times is one to periodically restart the process. The way this would work is that a process restart would be done periodically based on the time period specified. You could therefore say the restart interval was 3600 and it would restart the process once an hour.

The start of the time period for this would either be when the process was created, if any Python code or a WSGI script was preloaded at process start time, or from when the first request arrived if the WSGI application was lazily loaded. This restart-interval could be tied to the graceful-timeout option so that you can set an extended period if you want to try and ensure that requests are not interrupted.

We just wouldn't want it to die having never even served a single request, so my vote would be against the birth of the process as the beginning point (and, rather, at first request).


It would effectively be from the first request if lazily loaded. If preloaded though, since background threads could be created which do stuff and consume memory over time, it would then be from when the process started, i.e., when the Python code was preloaded.


But then, for preloaded, processes life-cycle themselves for no reason throughout inactive periods, like maybe overnight. That's not the end of the world, but I wonder if we're catering to the wrong design. (These are, after all, web server processes, so it seems a fair assumption that they exist primarily to handle requests; else why even run under Apache?) My vote, for what it's worth, would still be timing from the first request, but I probably won't use that particular option. Either way would be useful for some, I'm sure.
 

Now we have the ability to send the process a graceful restart signal (usually SIGUSR1) to force an individual process to restart.

Right now this is tied to the graceful-timeout duration as well, which as you point out, would perhaps be better off having its own time duration for the notional grace period.

Using the name restart-timeout for this could be confusing if I also have a restart-interval option.


In my opinion, SIGUSR1 is different from the automatic parameters because it was (most likely) triggered by user intervention, so that one should ideally have its own parameter.  If that is the case and this parameter becomes dedicated to SIGUSR1, then the least ambiguous name I can think of is sigusr1-timeout.
 

Except that it isn't guaranteed to be called SIGUSR1. Technically it could be a different signal depending on the platform Apache runs on. But then, as far as I know, all UNIX systems do use SIGUSR1.


In any case, they are "signals": do you like signal-timeout? (That could also be taken ambiguously, but maybe less so than restart-timeout?)
 
I also have another type of process restart I am trying to work out how to accommodate and the naming of options again complicates the problem. In this case we want to introduce an artificial restart delay.

This would be an option to combat a problem caused by Django 1.7, in that WSGI script file loading for Django isn't stateless. If a transient problem occurs, such as the database not being ready, the loading of the WSGI script file can fail. On the next request an attempt is made to load it again, but now Django kicks up a stink because it was halfway through setting things up when it failed and the setup code cannot be run a second time. The result is that the process then keeps failing.

The idea of the restart delay option therefore is to allow you to set it to a number of seconds, normally just 1. If set like that, and a WSGI script file import fails, it will effectively block for the delay specified and then kill the process, so the whole process is thrown away and the WSGI script file can be reloaded in a fresh process. This gets rid of the problem of Django initialisation not being able to be retried.
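Pending such an option, a user-land approximation of the same idea can be sketched in the WSGI script file itself. The importer and exit hooks are parameters here only so the sketch is self-contained and testable; any module you load with it would be your own:

```python
import os
import time

def load_or_die(importer, delay=1.0, die=os._exit):
    """Try to import the WSGI application via `importer`. On failure, pause
    briefly to avoid a tight respawn loop, then exit so the daemon process
    is thrown away and Apache starts a fresh one to retry the import."""
    try:
        return importer()
    except Exception:
        time.sleep(delay)
        die(1)
```

A WSGI script would then do something like `application = load_or_die(my_import_function)`, where `my_import_function` is a hypothetical callable that imports and returns your application object.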


(We are using turbogears... I don't think I've seen something like that happen, but rarely have seen start up anomalies.)
 
A delay is needed to avoid an effective fork bomb, where a WSGI script file failing to load under high request throughput would cause a constant cycle of processes dying and being replaced. It is possible it wouldn't be as bad as I think, as Apache only checks for dead processes to replace once a second, but I still prefer my own failsafe in case that changes.

I am therefore totally fine with a separate graceful time period for when SIGUSR1 is used; I just need to juggle these different features and come up with an option naming scheme that makes sense.

How about then that I add the following new options:

    maximum-lifetime - Similar to maximum-requests in that it will cause processes to be shut down and restarted, but in this case based on the time period given as argument, measured from the first request, or from when the process was started if the WSGI script file or any other Python code was preloaded.

    restart-timeout - Specifies a separate grace period for when the process is being forcibly restarted using the graceful restart signal. If restart-timeout is not specified and graceful-timeout is specified, then the value of graceful-timeout is used. If neither is specified, then the restart signal will behave as if the process had been sent a SIGINT.

    linger-timeout - When a WSGI script file, or other Python code, is being imported by mod_wsgi directly and that fails, the default is that the error is ignored. For a WSGI script file, reloading will be attempted on the next request; for preloaded code it will simply fail and be logged. If linger-timeout is set to a non-zero value, in seconds, then the daemon process will instead be shut down and restarted, to allow the code to be reloaded successfully if the problem was transient. To avoid a fork bomb when the issue is persistent, a delay based on the value of the linger-timeout option will be introduced first.
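As a sketch of how these proposed options might combine in a directive (option names as proposed above; none of them had been released at the time of this exchange, so treat this purely as an illustration):

```apache
WSGIDaemonProcess myapp processes=3 threads=2 \
    maximum-lifetime=3600 graceful-timeout=300 \
    restart-timeout=60 linger-timeout=1
```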
 
How does that all sound, if it makes sense that is. :-)



That sounds absolutely great!  How would I get on the notification cc: of the ticket or whatever so I'd be informed of progress on that?

These days my turnaround time is pretty quick so long as I am happy and know what to change and how. So I just need to think a bit more about it and get some day job stuff out of the way before I can do something.

So don't be surprised if you simply get a reply to this email within a week pointing at a development version to try.


Well tons of thanks again.

Graham Dumpleton

unread,
Jan 26, 2015, 12:44:13 AM1/26/15
to mod...@googlegroups.com
Want to give:


a go?

The WSGIDaemonProcess directive is 'eviction-timeout'. For mod_wsgi-express the command line option is '--eviction-timeout'.

So the terminology I am using around this is that sending the signal is like forcibly evicting the WSGI application, allowing the process to be restarted. At least this way I can have an option name that is distinct enough from a generic 'restart' so as not to be confusing.

Graham

Kent

unread,
Jan 26, 2015, 8:04:10 AM1/26/15
to mod...@googlegroups.com
Excellent.  I will certainly try this out, thanks!

Kent

unread,
Jan 27, 2015, 9:34:12 AM1/27/15
to mod...@googlegroups.com
I think I might understand the difference between 'graceful-timeout' and 'shutdown-timeout', but can you please just clarify the difference?  Are they additive?

Also, will 'eviction-timeout' interact with either of those, or simply override them?

Thanks,
Kent

Kent

unread,
Jan 27, 2015, 11:02:32 AM1/27/15
to mod...@googlegroups.com
Let me be more specific.  I'm having a hard time getting this to test as I expected.  Here is my WSGIDaemonProcess directive:

WSGIDaemonProcess rarch processes=3 threads=2 inactivity-timeout=1800 display-name=%{GROUP} graceful-timeout=140 eviction-timeout=60 python-eggs=/home/rarch/tg2env/lib/python-egg-cache

I put a 120 sec sleep in one of the processes' requests and then sent SIGUSR1 (Linux) to all three processes. The two inactive ones immediately restart, as I expect. However, the third (sleeping) one is allowed to run past the 60 second eviction-timeout and runs straight to the graceful-timeout before it is terminated. Shouldn't it have been killed at 60 sec?

(And then, as my previous question, how does shutdown-timeout factor into all this?)

Thanks again!
Kent

Graham Dumpleton

unread,
Jan 27, 2015, 9:48:01 PM1/27/15
to mod...@googlegroups.com
Can you ensure that LogLevel is set to at least info and provide the messages that are in the Apache error log file?

If I use:

    $ mod_wsgi-express start-server hack/sleep.wsg --log-level=debug --verbose-debugging --eviction-timeout 30 --graceful-timeout 60

which is equivalent to:

    WSGIDaemonProcess … graceful-timeout=60 eviction-timeout=30

and fire a request against an application that sleeps a long time, I see in the Apache error logs at the time of the signal:

[Wed Jan 28 13:34:34 2015] [info] mod_wsgi (pid=29639): Process eviction requested, waiting for requests to complete 'localhost:8000'.

At the end of the 30 seconds given by the eviction timeout I see:

[Wed Jan 28 13:35:05 2015] [info] mod_wsgi (pid=29639): Daemon process graceful timer expired 'localhost:8000'.
[Wed Jan 28 13:35:05 2015] [info] mod_wsgi (pid=29639): Shutdown requested 'localhost:8000'.

Up till that point the process would still have been accepting new requests and was waiting for a point where there were no active requests, to allow it to shut down.

As the timeout tripped at 30 seconds, it then instead goes into the more brutal shutdown process. No new requests are accepted from this point.

For my setup the shutdown-timeout defaults to 5 seconds, and because the request still hadn't completed within 5 seconds, the process is exited anyway and allowed to shut down.

[Wed Jan 28 13:35:10 2015] [info] mod_wsgi (pid=29639): Aborting process 'localhost:8000'.
[Wed Jan 28 13:35:10 2015] [info] mod_wsgi (pid=29639): Exiting process 'localhost:8000'.

Because the application never returned a response, that results in the Apache child worker who was trying to talk to the daemon process seeing a truncated response.

[Wed Jan 28 13:35:10 2015] [error] [client 127.0.0.1] Truncated or oversized response headers received from daemon process 'localhost:8000': /tmp/mod_wsgi-localhost:8000:502/htdocs/

When the Apache parent process notices the daemon process has died, it cleans up and starts a new one.

[Wed Jan 28 13:35:11 2015] [info] mod_wsgi (pid=29639): Process 'localhost:8000' has died, deregister and restart it.
[Wed Jan 28 13:35:11 2015] [info] mod_wsgi (pid=29639): Process 'localhost:8000' has been deregistered and will no longer be monitored.
[Wed Jan 28 13:35:11 2015] [info] mod_wsgi (pid=29764): Starting process 'localhost:8000' with threads=5.

So the shutdown phase specified by shutdown-timeout is subsequent to eviction-timeout. It is one last chance to shut down, during a time when no new requests are accepted, in case it is the constant flow of requests that is preventing shutdown rather than one long running request.

The shutdown-timeout should always be kept quite short because no new requests will be accepted during that time. So changing it from the default isn't something one would normally do.

Graham

Kent

unread,
Jan 28, 2015, 3:30:15 PM1/28/15
to mod...@googlegroups.com
Ok, I plan to run those tests with debug and post, but please, in the meantime:

For our app, not interrupting existing requests is a higher priority than being able to accept new requests, particularly since we typically run many wsgi processes, each with a handful of threads.  So, I'm not really concerned about maintaining always available threads (statistically, I will be fine... that isn't the issue for me).  

In these circumstances, it would be much better for all these triggering events (SIGUSR1, maximum-requests, inactivity-timeout, etc.) to immediately stop accepting new requests and "concentrate" on shutting down. (Unless that means requests waiting in Apache are terminated because they were queued for this particular process, but I doubt Apache has already determined the request's process if none are available, has it?) With a high graceful-timeout/eviction-timeout and low shutdown-timeout, I run a pretty high risk of accepting a new request at the tail end of graceful-timeout or eviction-timeout, only to have it basically doomed to an ungraceful death, because many of our requests are long running (very often well over 5 or 10 sec).

I guess that's why, through experimentation with SIGUSR1 a few years back, I ended up with "graceful-timeout=5 shutdown-timeout=300" ... the opposite of how it would default, because this works well when trying to signal these processes to recycle themselves: they basically immediately stop accepting new requests, so your "guaranteed" graceful timeout is 300. It seems I have no way to "guarantee" a very large graceful timeout for each and every request, even if affected by maximum-requests or inactivity-timeout, and specify a different (lower) one for SIGUSR1, because the only truly guaranteed lifetime in seconds is "shutdown-timeout", is that accurate?

The ideal for our app, which may accept certain request that run for several minutes is this:
  • if maximum-requests or inactivity-timeout is hit, stop taking new requests immediately and shutdown as soon as possible, but give existing requests basically all the time they need to finish (say, up to 40 minutes (for long-running db reports)).
  • if SIGUSR1, stop taking new requests immediately and shutdown as soon as possible, but give existing requests a really good chance to complete, maybe 3-5 minutes, but not the 40 minutes, because this is slightly more urgent (was triggered manually and a user is monitoring/waiting for turnover and wants new code in place)
I don't think I can accomplish the above, if I understand the design correctly, because a request may have been accepted at the tail end of graceful-timeout/eviction-timeout and so is only guaranteed a lifetime of shutdown-timeout, regardless of what the trigger was (SIGUSR1 vs. automatic).

Is my understanding of this accurate?

Graham Dumpleton

unread,
Jan 30, 2015, 6:34:28 AM1/30/15
to mod...@googlegroups.com
If you have web requests generating reports which take 40 minutes to run, you are going the wrong way about it.

What would be regarded as best practice for long running requests is to use a task queuing system to queue up the task to be run and run it in a distinct set of processes to the web server. Your web request can then return immediately, with some sort of polling system used as necessary to check the progress of the task and allow the result to be downloaded when complete. By using a separate system to run the tasks, it doesn't matter whether the web server is restarted as the tasks will still run and after the web server is restarted, a user can still check on progress of the tasks and get back his response.

The most common such task execution system for doing this sort of thing is Celery.

So it is because you aren't using the correct tool for the job here that you are fighting against things like timeouts in the web server. No web server is really a suitable environment to be used as an in-process task execution system. The web server should handle requests quickly and offload longer processing tasks to a separate task system which is purpose built for managing long running tasks.
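The submit-then-poll flow being described could be sketched like this (Celery would normally provide the queue and run the workers in separate processes; the thread pool and names below are illustrative stand-ins only):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a task queue such as Celery: the web request submits the
# job and returns a token immediately; the client polls with the token.
_executor = ThreadPoolExecutor(max_workers=2)
_jobs = {}

def run_report(params):
    # Placeholder for the long-running database report.
    return "report for %r" % (params,)

def submit_report(params):
    job_id = str(uuid.uuid4())
    _jobs[job_id] = _executor.submit(run_report, params)
    return job_id  # hand this back to the client for polling

def report_status(job_id):
    future = _jobs[job_id]
    return future.result() if future.done() else None  # None = still running
```

With Celery the worker pool lives outside the web server, so a daemon process restart cannot interrupt a running report; that separation is the point Graham is making.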

I am not inclined to keep fiddling with how the timeouts work now that I understand what you are trying to do. I am even questioning whether I should have introduced the separate eviction timeout I already did, given that it is turning out to be a questionable use case.

I would really recommend you look at re-architecting how you do things. I don't think I would have any trouble finding others on the list who would advise the same thing and who could also give you further advice on using something like Celery instead for task execution.

Graham

Kent

unread,
Jan 30, 2015, 8:31:28 AM1/30/15
to mod...@googlegroups.com
Thanks for your reply and recommendations.  We're aware of the issues, but I didn't give the full picture for brevity's sake.  The reports are user generated reports.  Ultimately, the users know whether the reports should return quickly (which many, many will), or whether they are long-running.  There is no way for the application to know that, so to avoid some sort of polling (which we've done in the past and was a pain in the rear to users), the design is to allow the user to decide whether to run the report in the background or "foreground" via a check box.  Since most reports will return in a matter of a minute or so, we wanted to avoid the pain of making them poll, but I need to look at Celery.  However, I'm not comfortable punishing users for accidentally choosing foreground on a long-running report.  That is, not for an automatic turn-over mechanism like maximum-requests or inactivity-timeout.  In my mind, those are inherently different than something like a SIGUSR1 mechanism because the former are automatic.  

So, while admitting there are edge cases we are using that don't have a perfect solution (or even admitting we need a better mechanism in that case), it still seems to me mod_wsgi should be somewhat agnostic of design choices.  In other words, when it comes to automatic turning over of processes, it seems mod_wsgi shouldn't be involved with length of time considerations, except to allow the user to specify timeouts.  See, the long running reports are only one of my concerns: we also fight with database locks sometimes, held by another application attached to the same database and wholly out of our control.  Sometimes those locks can be held for many minutes on a request that normally should complete within seconds.  There too, it seems mod_wsgi should be very gentle in the automatic turnover cases.

Thanks for pointing to Celery.  I really wonder whether I can get a message broker to work with Adobe Flash, our current client, but I haven't looked into this much yet.

Also, my apologies if you believe this to have been a waste of time on your part.  You've been extremely helpful, though and I'm quite thankful for your time!  I understand you not wanting to redesign the shutdown-timeout thing and mess with what otherwise isn't broken.  Would you still like me to post the apache debug logs regarding 'eviction-timeout' or have you changed your mind about releasing that?  (In which case, extra apologies.)

Kent
...

Graham Dumpleton

unread,
Feb 1, 2015, 7:08:49 PM2/1/15
to mod...@googlegroups.com
Your Flash client doesn't need to know about Celery, as your web application accepts requests as normal and it is your Python code which would queue the job with Celery.

Now looking back, the only configuration I can find, though I don't know if it is your actual production configuration, is:

    WSGIDaemonProcess rarch processes=3 threads=2 inactivity-timeout=1800 display-name=%{GROUP} graceful-timeout=140 eviction-timeout=60 python-eggs=/home/rarch/tg2env/lib/python-egg-cache

Provided that you don't then start to have overall host memory issues, the simplest way around this whole issue is not to use a multithreaded process.

What you would do is vertically partition your URL namespace so that just the URLs which do the long running report generation are delegated to single threaded processes. Everything else would keep going to the multithreaded processes.

    WSGIDaemonProcess rarch processes=3 threads=2
    WSGIDaemonProcess rarch-long-running processes=6 threads=1 maximum-requests=20

    WSGIProcessGroup rarch

    <Location /suburl/of/long/running/report/generator>
    WSGIProcessGroup rarch-long-running
    </Location>

You wouldn't even have to worry about the graceful-timeout on rarch-long-running, as that is only relevant for maximum-requests where it is a multithreaded process.

So what would happen is that when the request has finished, if maximum-requests is reached, the process would be restarted even before any new request was accepted by the process, so there is no chance of a new request being interrupted.

You could still set an eviction-timeout of some suitably large value to allow SIGUSR1 to be sent to processes in that daemon process group to shut them down.

In this case, having eviction-timeout able to be set independently of graceful-timeout (for maximum-requests) is probably useful, so I will retain the option.

So is there any reason you couldn't use a daemon process group with many single threaded processes instead?

Note that since only a subset of URLs would go to that daemon process group, the memory usage profile will change, as you aren't potentially loading the complete application code into those processes, only what is needed for those URLs and that report. So it could use less memory than the application as a whole, allowing you to have multiple single threaded processes with no issue.

Graham

Kent Bower

unread,
Feb 2, 2015, 12:15:28 PM2/2/15
to mod...@googlegroups.com
On Sun, Feb 1, 2015 at 7:08 PM, Graham Dumpleton <graham.d...@gmail.com> wrote:
Your Flash client doesn't need to know about Celery, as your web application accepts requests as normal and it is your Python code which would queue the job with Celery.

Now looking back, the only configuration I can find, though I don't know if it is your actual production configuration, is:

    WSGIDaemonProcess rarch processes=3 threads=2 inactivity-timeout=1800 display-name=%{GROUP} graceful-timeout=140 eviction-timeout=60 python-eggs=/home/rarch/tg2env/lib/python-egg-cache

Provided that you don't then start to have overall host memory issues, the simplest way around this whole issue is not to use a multithreaded process.
 
What you would do is vertically partition your URL namespace so that just the URLs which do the long running report generation are delegated to single threaded processes. Everything else would keep going to the multithreaded processes.

    WSGIDaemonProcess rarch processes=3 threads=2
    WSGIDaemonProcess rarch-long-running processes=6 threads=1 maximum-requests=20

    WSGIProcessGroup rarch

    <Location /suburl/of/long/running/report/generator>
    WSGIProcessGroup rarch-long-running
    </Location>

You wouldn't even have to worry about the graceful-timeout on rarch-long-running, as that is only relevant for maximum-requests where it is a multithreaded process.

So what would happen is that when the request has finished, if maximum-requests is reached, the process would be restarted even before any new request was accepted by the process, so there is no chance of a new request being interrupted.

You could still set an eviction-timeout of some suitably large value to allow SIGUSR1 to be sent to processes in that daemon process group to shut them down.

In this case, having eviction-timeout able to be set independently of graceful-timeout (for maximum-requests) is probably useful, so I will retain the option.

So is there any reason you couldn't use a daemon process group with many single threaded processes instead?


This is very good to know (that single-threaded procs would behave more ideally in these circumstances). The above was just my configuration for testing 'eviction-timeout'. Our software generally runs with many more processes and threads, on servers with maybe 16 or 32 GB RAM. And unfortunately, RAM is the limiting resource here, as our Python app, built on TurboGears, is a memory hog and we have yet to find the resources to dissect that. I was aiming to head in the direction of URL partitioning, but there are big obstacles. (Chiefly, RAM consumption would make threads=1 with yet more processes very difficult unless we spend the huge effort of dissecting the app to locate and pull out the many unused memory-hogging libraries.)

So, URL partitioning is sort of the ideal, distant solution, as well as a Celery-like polling solution, but out of my reach for now.

Another question about multithreaded graceful-timeout with maximum-requests: during a period of heavy traffic, it seems the graceful-timeout setting just pushes the real timeout out to shutdown-timeout because, if traffic is heavy enough, you'll be getting requests during graceful-timeout. That diminishes the fidelity of "graceful-timeout". Do you see where I'm coming from (even if you're happy with the design and don't want to mess with it, which I'd understand)?


Ok, here is the log demonstrating the troubles I saw with eviction-timeout.  For demonstration purposes, here is the simplified directive I'm using:

WSGIDaemonProcess rarch processes=1 threads=1 display-name=%{GROUP} graceful-timeout=140 eviction-timeout=60 python-eggs=/home/rarch/tg2env/lib/python-egg-cache

Here is the log:

[Mon Feb 02 11:36:16 2015] [info] Init: Initializing (virtual) servers for SSL
[Mon Feb 02 11:36:16 2015] [info] Server: Apache/2.2.3, Interface: mod_ssl/2.2.3, Library: OpenSSL/0.9.8e-fips-rhel5
[Mon Feb 02 11:36:16 2015] [notice] Digest: generating secret for digest authentication ...
[Mon Feb 02 11:36:16 2015] [notice] Digest: done
[Mon Feb 02 11:36:16 2015] [info] APR LDAP: Built with OpenLDAP LDAP SDK
[Mon Feb 02 11:36:16 2015] [info] LDAP: SSL support available
[Mon Feb 02 11:36:16 2015] [info] Init: Seeding PRNG with 256 bytes of entropy
[Mon Feb 02 11:36:16 2015] [info] Init: Generating temporary RSA private keys (512/1024 bits)
[Mon Feb 02 11:36:16 2015] [info] Init: Generating temporary DH parameters (512/1024 bits)
[Mon Feb 02 11:36:16 2015] [info] Shared memory session cache initialised
[Mon Feb 02 11:36:16 2015] [info] Init: Initializing (virtual) servers for SSL
[Mon Feb 02 11:36:16 2015] [info] Server: Apache/2.2.3, Interface: mod_ssl/2.2.3, Library: OpenSSL/0.9.8e-fips-rhel5
[Mon Feb 02 11:36:16 2015] [info] mod_wsgi (pid=29447): Starting process 'rarch' with uid=48, gid=48 and threads=1.
[Mon Feb 02 11:36:16 2015] [info] mod_wsgi (pid=29447): Python home /home/rarch/tg2env.
[Mon Feb 02 11:36:16 2015] [info] mod_wsgi (pid=29447): Initializing Python.
[Mon Feb 02 11:36:16 2015] [notice] Apache/2.2.3 (CentOS) configured -- resuming normal operations
[Mon Feb 02 11:36:16 2015] [info] Server built: Aug 30 2010 12:28:40
[Mon Feb 02 11:36:16 2015] [info] mod_wsgi (pid=29447): Attach interpreter ''.
[Mon Feb 02 11:36:16 2015] [info] mod_wsgi (pid=29447, process='rarch', application=''): Loading WSGI script '/home/rarch/trunk/src/appserver/wsgi-config/wsgi-deployment.py'.
[Mon Feb 02 11:39:13 2015] [info] mod_wsgi (pid=29447): Process eviction requested, waiting for requests to complete 'rarch'.
[Mon Feb 02 11:41:00 2015] [info] mod_wsgi (pid=29447): Daemon process graceful timer expired 'rarch'.
[Mon Feb 02 11:41:00 2015] [info] mod_wsgi (pid=29447): Shutdown requested 'rarch'.
[Mon Feb 02 11:41:05 2015] [info] mod_wsgi (pid=29447): Aborting process 'rarch'.
[Mon Feb 02 11:41:05 2015] [info] mod_wsgi (pid=29447): Exiting process 'rarch'.
[Mon Feb 02 11:41:06 2015] [info] mod_wsgi (pid=29447): Process 'rarch' has died, deregister and restart it.
[Mon Feb 02 11:41:06 2015] [info] mod_wsgi (pid=29447): Process 'rarch' has been deregistered and will no longer be monitored.
[Mon Feb 02 11:41:06 2015] [info] mod_wsgi (pid=31331): Starting process 'rarch' with uid=48, gid=48 and threads=1.
[Mon Feb 02 11:41:06 2015] [info] mod_wsgi (pid=31331): Python home /home/rarch/tg2env.
[Mon Feb 02 11:41:06 2015] [info] mod_wsgi (pid=31331): Initializing Python.
[Mon Feb 02 11:41:06 2015] [info] mod_wsgi (pid=31331): Attach interpreter ''.
[Mon Feb 02 11:41:06 2015] [info] mod_wsgi (pid=31331, process='rarch', application=''): Loading WSGI script '/home/rarch/trunk/src/appserver/wsgi-config/wsgi-deployment.py'.

The process was signaled at 11:39:13 with eviction-timeout=60, but 11:40:13 came and went and nothing happened until 107 seconds had passed, at which point the graceful timer expired.


Next, I changed the parameters a little:

WSGIDaemonProcess rarch processes=1 threads=1 display-name=%{GROUP} eviction-timeout=30 graceful-timeout=240 python-eggs=/home/rarch/tg2env/lib/python-egg-cache

[Mon Feb 02 12:06:57 2015] [info] mod_wsgi (pid=3381): Starting process 'rarch' with uid=48, gid=48 and threads=1.
[Mon Feb 02 12:06:57 2015] [info] mod_wsgi (pid=3381): Python home /home/rarch/tg2env.
[Mon Feb 02 12:06:57 2015] [info] mod_wsgi (pid=3381): Initializing Python.
[Mon Feb 02 12:06:57 2015] [notice] Apache/2.2.3 (CentOS) configured -- resuming normal operations
[Mon Feb 02 12:06:57 2015] [info] Server built: Aug 30 2010 12:28:40
[Mon Feb 02 12:06:57 2015] [info] mod_wsgi (pid=3381): Attach interpreter ''.
[Mon Feb 02 12:06:57 2015] [info] mod_wsgi (pid=3381, process='rarch', application=''): Loading WSGI script '/home/rarch/trunk/src/appserver/wsgi-config/wsgi-deployment.py'.
[Mon Feb 02 12:07:19 2015] [info] mod_wsgi (pid=3381): Process eviction requested, waiting for requests to complete 'rarch'.
[Mon Feb 02 12:11:01 2015] [info] mod_wsgi (pid=3381): Daemon process graceful timer expired 'rarch'.
[Mon Feb 02 12:11:01 2015] [info] mod_wsgi (pid=3381): Shutdown requested 'rarch'.
[Mon Feb 02 12:11:06 2015] [info] mod_wsgi (pid=3381): Aborting process 'rarch'.
[Mon Feb 02 12:11:06 2015] [info] mod_wsgi (pid=3381): Exiting process 'rarch'.
[Mon Feb 02 12:11:07 2015] [info] mod_wsgi (pid=3381): Process 'rarch' has died, deregister and restart it.
[Mon Feb 02 12:11:07 2015] [info] mod_wsgi (pid=3381): Process 'rarch' has been deregistered and will no longer be monitored.
[Mon Feb 02 12:11:07 2015] [info] mod_wsgi (pid=7028): Starting process 'rarch' with uid=48, gid=48 and threads=1.
[Mon Feb 02 12:11:07 2015] [info] mod_wsgi (pid=7028): Python home /home/rarch/tg2env.
[Mon Feb 02 12:11:07 2015] [info] mod_wsgi (pid=7028): Initializing Python.
[Mon Feb 02 12:11:07 2015] [info] mod_wsgi (pid=7028): Attach interpreter ''.
[Mon Feb 02 12:11:07 2015] [info] mod_wsgi (pid=7028, process='rarch', application=''): Loading WSGI script '/home/rarch/trunk/src/appserver/wsgi-config/wsgi-deployment.py'.


So, for me, eviction-timeout is apparently being ignored...

Thanks again for all your time and help,
Kent



Graham Dumpleton

Feb 3, 2015, 1:02:13 AM2/3/15
to modwsgi
On 3 February 2015 at 04:15, Kent Bower <ke...@bowermail.net> wrote:
On Sun, Feb 1, 2015 at 7:08 PM, Graham Dumpleton <graham.d...@gmail.com> wrote:
Your Flask client doesn't need to know about Celery, as your web application accepts requests as normal and it is your Python code which would queue the job with Celery.
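The Celery-like pattern being described can be sketched with just the standard library (no Celery required for the illustration): the web request enqueues a job and returns a job id immediately, a separate worker runs the slow report outside any request, and the client polls a cheap status URL.  All the names here (submit_report, poll, JOBS) are illustrative, not part of mod_wsgi or Celery.

```python
import queue
import threading
import time
import uuid

JOBS = {}                  # job_id -> {"status": ..., "result": ...}
WORK_QUEUE = queue.Queue()

def worker():
    # Dedicated worker thread: pulls jobs off the queue and runs them
    # outside the request/response cycle, so no HTTP request is ever
    # long-running from the server's point of view.
    while True:
        job_id, params = WORK_QUEUE.get()
        JOBS[job_id]["status"] = "running"
        time.sleep(0.01)   # stand-in for the slow report query
        JOBS[job_id]["result"] = "report for %r" % (params,)
        JOBS[job_id]["status"] = "done"

def submit_report(params):
    # Called from the WSGI request handler: enqueue and return at once.
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "queued", "result": None}
    WORK_QUEUE.put((job_id, params))
    return job_id

def poll(job_id):
    # Called from a lightweight status URL that the client polls.
    return JOBS[job_id]["status"]

threading.Thread(target=worker, daemon=True).start()
job = submit_report({"year": 2015})
while poll(job) != "done":
    time.sleep(0.01)
print(JOBS[job]["result"])
```

With the work moved off the request thread, neither inactivity-timeout nor maximum-requests can ever interrupt a report in flight.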

Now looking back, the only configuration I can find (though I don't know whether it is your actual production configuration) is:

    WSGIDaemonProcess rarch processes=3 threads=2 inactivity-timeout=1800 display-name=%{GROUP} graceful-timeout=140 eviction-timeout=60 python-eggs=/home/rarch/tg2env/lib/python-egg-cache

Provided that you don't then start to have overall host memory issues, the simplest way around this whole issue is not to use a multithreaded process.
 
What you would do is vertically partition your URL namespace so that just the URLs which do the long-running report generation are delegated to single-threaded processes. Everything else would keep going to the multithreaded processes.

    WSGIDaemonProcess rarch processes=3 threads=2
    WSGIDaemonProcess rarch-long-running processes=6 threads=1 maximum-requests=20

    WSGIProcessGroup rarch

    <Location /suburl/of/long/running/report/generator>
    WSGIProcessGroup rarch-long-running
    </Location>

You wouldn't even have to worry about the graceful-timeout on rarch-long-running, as that is only relevant for maximum-requests where it is a multithreaded process.
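The reason a single-threaded daemon process can never kill an in-flight request on maximum-requests becomes obvious if you sketch its main loop: the request count is only ever checked *between* requests.  This is a simplified model for illustration, not mod_wsgi's actual implementation.

```python
def run_worker(get_request, handle, maximum_requests):
    # Simplified model of a single-threaded daemon process: with only
    # one thread, the process is never mid-request when the restart
    # check runs, so recycling on maximum-requests is always safe.
    handled = 0
    while True:
        request = get_request()
        if request is None:
            return handled, "idle-shutdown"
        handle(request)                # runs to completion, uninterrupted
        handled += 1
        if handled >= maximum_requests:
            return handled, "restart"  # exit *between* requests

# Toy drive: five requests queued, restart limit of three.
pending = iter([1, 2, 3, 4, 5])
done = []
count, reason = run_worker(lambda: next(pending, None), done.append, 3)
print(count, reason, done)   # 3 restart [1, 2, 3]
```

In a multithreaded process the equivalent check can fire while sibling threads are still inside handle(), which is exactly why graceful-timeout exists there.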


Have you ever run a test where you compare the whole memory usage of your application where all URLs are visited, to how much memory is used if only the URL which generates the long running report is visited?

In Django at least, a lot of stuff is lazily loaded only when a URL requiring it is first accessed. So even with a heavy code base, there can still be benefits in splitting out URLs to their own processes because the whole code base wouldn't be loaded due to the lazy loading.
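The lazy-loading effect can be demonstrated from plain Python: a module imported inside a view function costs nothing in processes whose URLs never hit that view.  Here the stdlib `wave` module merely stands in for a heavy report/PDF dependency.

```python
import sys

def report_view():
    # Import deferred to first use: processes serving only the other
    # URLs never pay the memory cost of this module at all.
    import wave  # stand-in for a heavy PDF/report dependency
    return wave.__name__

assert "wave" not in sys.modules   # not loaded at process start-up
print(report_view())               # wave
assert "wave" in sys.modules       # loaded only after the first hit
```

Whether this helps in practice depends on how much of the code base is imported eagerly at start-up, which is the question the memory test above would answer.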

So do you have any actual memory figures from doing that?

How many URLs are there that generate these reports vs those that don't, or is generating reports all the application does?

Are your most frequently visited URLs those generating the reports or something else?
The background monitor thread which monitors for expiry wasn't taking into consideration that the eviction timeout period could be less than the graceful timeout. I didn't see the problem because I was also setting a request timeout, which changes how the monitor thread works, waking it up every second regardless. I will work on a fix for that.

Another issue for consideration is: if a graceful timeout is already in progress and a signal comes in for eviction, which timeout wins? Right now the eviction time will trump a graceful time already set by maximum requests. The converse isn't true though: if already in the eviction cycle and maximum requests arrives, the eviction time wouldn't be trumped by the graceful timeout. So the eviction time has authority, given that it was triggered by an explicit user signal. It does mean that the signal could effectively extend whatever graceful time was in progress.
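The precedence rule described above can be stated as a small function.  This is a model of the described behaviour for clarity, not the actual mod_wsgi source: an eviction signal replaces a graceful deadline set by maximum-requests, but a later maximum-requests event never displaces an eviction already in progress.

```python
def effective_deadline(current, current_cause, new_deadline, new_cause):
    # Model of the described rule: an explicit eviction signal trumps a
    # graceful deadline set by maximum-requests; a later maximum-requests
    # event never alters an eviction already in progress.
    if current is None:
        return new_deadline, new_cause
    if new_cause == "eviction" and current_cause == "graceful":
        return new_deadline, new_cause
    return current, current_cause

# Graceful timer running with deadline 140; eviction signal asks for 200.
print(effective_deadline(140, "graceful", 200, "eviction"))  # (200, 'eviction')
# Eviction running with deadline 60; maximum-requests then asks for 30.
print(effective_deadline(60, "eviction", 30, "graceful"))    # (60, 'eviction')
```

The first case shows how the signal can extend a graceful period already in progress, as noted above.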

Graham

Graham Dumpleton

Feb 3, 2015, 4:50:09 AM2/3/15
to modwsgi
The application of the eviction timeout is not yet fixed in the develop branch.


Graham

Graham Dumpleton

Feb 3, 2015, 5:20:06 AM2/3/15
to modwsgi
Should now be fixed.

Kent Bower

Feb 3, 2015, 3:19:59 PM2/3/15
to mod...@googlegroups.com
Awesome, I'll take a look pretty soon; got to get unburied. 

As far as memory tests for URL partitioning, I haven't tried such a test specifically, but I'm acutely aware of how terrible our app/TurboGears is at (not) lazy-loading modules, so I'm not optimistic there.  Further, the amount of RAM is largely dependent on how much data is queried (not just the URL) as well as, even more so, what the format of the output is, with .pdf generation being the worst, of course.

Thanks, I'll get back to you to confirm the fix, if you like.  I tend to agree with your decision that the eviction timeout should trump a graceful timeout already in progress.  I suppose another solid decision would be "whichever is smaller," with the rationale that two events have been triggered, and either timeout by itself would have justified ending the process; thus the smaller of the new timeout and the already-in-progress timeout could be honored without dispute.  (I think your current design for that scenario is very acceptable, too.)
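The "whichever is smaller" alternative is an equally simple policy to state.  Again, this is a model of the proposal, not actual mod_wsgi behaviour:

```python
def smaller_deadline(current, proposed):
    # Both triggers independently justify a shutdown, so the nearer of
    # the two deadlines can be honored without dispute.
    return proposed if current is None else min(current, proposed)

print(smaller_deadline(140, 60))   # 60
print(smaller_deadline(None, 200)) # 200
```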

Kent Bower

Feb 4, 2015, 4:57:34 PM2/4/15
to mod...@googlegroups.com
Yes, sir, my tests also seem to show it works as you intend it to.

Thanks.

On Tue, Feb 3, 2015 at 5:20 AM, Graham Dumpleton <graham.d...@gmail.com> wrote: