I'm running 4 (a very early version of it, possibly before you officially released it). We upgraded to take advantage of the amazingly-helpful SIGUSR1 signaling for graceful process restarting, which we use somewhat regularly to gracefully deploy software changes (minor ones which won't matter if 2 processes have different versions loaded) without disrupting users. Thanks a ton for that!
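The graceful-restart contract being praised here can be modeled in a few lines. This is an illustrative POSIX sketch only, not mod_wsgi's actual implementation; the sleep intervals and the flag are arbitrary stand-ins:

```python
import os
import signal
import time

# Toy model (not mod_wsgi internals) of the graceful-restart contract:
# the child treats SIGUSR1 as "finish current work, then exit cleanly",
# rather than dying mid-request. Assumes a POSIX system with fork().

pid = os.fork()
if pid == 0:
    # Child: pretend to serve requests until asked to wind down.
    flag = {"stop": False}
    signal.signal(signal.SIGUSR1, lambda s, f: flag.update(stop=True))
    while not flag["stop"]:
        time.sleep(0.05)  # stand-in for request handling
    os._exit(0)           # clean exit once the current "work" is done

# Parent: give the child time to install its handler, then signal it.
time.sleep(0.3)
os.kill(pid, signal.SIGUSR1)
_, status = os.waitpid(pid, 0)
clean = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
print("child exited cleanly:", clean)
```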
We are also multi-threading our processes (plural processes, plural threads). Some requests can validly run for very long periods of time (database reporting, maybe even half an hour, though that would be extreme). Some processes (especially those generating PDFs, for example) hog tons of RAM, as you know, so I'd like these to eventually check their RAM back in, so to speak, by using either inactivity-timeout or maximum-requests, but always in a very gentle way, since, as I mentioned, some requests might be properly running for many minutes. maximum-requests seems too brutal for my use case, since the threshold request sends the process down the graceful-timeout/shutdown-timeout path, even if there are valid requests still running, and then SIGKILLs it. My ideal vision of maximum-requests, since it is primarily for memory management, is very gentle, sort of an "ok, now that I've hit my threshold, at my next earliest convenience I should die, but only once all my current requests have ended of their own accord."
"inactivity-timeout" seems to function exactly as I want, in that it seems like it won't ever kill a process with a thread serving an active request (at least, I can't get it to, even by adding a long import time; time.sleep(longtime); the process doesn't seem to die until the request is finished). But the documentation made me nervous, because it implies that it could, in fact, kill a process with an active request: "For the purposes of this option, being idle means no new requests being received, or no attempts by current requests to read request content or generate response content for the defined period."
I'd rather have a more gentle "maximum-requests" than "inactivity-timeout" because then, even on very heavy days (when RAM is most likely to choke), I could gracefully turn over these processes a couple of times a day, which I couldn't do with "inactivity-timeout" on an extremely heavy day. Hope this makes sense. I'm really asking:
- whether inactivity-timeout triggering will ever SIGKILL a process with an active request, as the docs intimate
- whether there is any way to get maximum-requests to behave more gently under all circumstances
- for your ideas/best advice
Thanks for your help!
On Wednesday, January 14, 2015 at 9:48:02 PM UTC-5, Graham Dumpleton wrote:
On 15/01/2015, at 8:32 AM, Kent <jkent...@gmail.com> wrote:
> Graham, the docs state: "For the purposes of this option, being idle means no new requests being received, or no attempts by current requests to read request content or generate response content for the defined period."
>
> This implies to me that a running request that is taking a long time could actually be killed as if it were idle (suppose it were fetching a very slow database query). Is this the case?
This is the case for mod_wsgi prior to version 4.0.
Things have changed in mod_wsgi 4.X.
How long are your long running requests though? The inactivity-timeout was more about restarting infrequently used applications so that memory can be taken back.
> Also, I'm looking for an ultra-conservative and graceful method of recycling memory. I've read your article on url partitioning, which was useful, but sooner or later, one must rely on either inactivity-timeout or maximum-requests, is that accurate? But both these will eventually, after graceful timeout/shutdown timeout, potentially kill active requests. It is valid for our app to handle long-running reports, so I was hoping for an ultra-safe mechanism.
> Do you have any advice here?
The options available in mod_wsgi 4.X are much better in this area than 3.X. The changes in 4.X aren't covered in the main documentation though, and are only described in the release notes for the version where each change was made.
In 4.X the concept of an inactivity-timeout is now separate to the idea of a request-timeout. There is also a graceful-timeout that can be applied to maximum-requests and some other situations as well to allow requests to finish out properly before being more brutal. One can also signal the daemon processes to do a more graceful restart as well.
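As a concrete sketch of those 4.X options combined on one directive (the daemon process group name "myapp" and all values here are illustrative placeholders, not recommendations):

```apache
# Illustrative values only:
#   inactivity-timeout - restart a process that has been idle this long
#   request-timeout    - fail-safe ceiling for genuinely stuck requests
#   graceful-timeout   - grace period allowed before a forced shutdown
WSGIDaemonProcess myapp processes=3 threads=2 \
    inactivity-timeout=1800 request-timeout=600 graceful-timeout=300
```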
You cannot totally avoid having to be brutal and kill things, though; otherwise you have no fail-safe for a stuck process where all request threads are blocked on back-end services and are never going to recover. Use of multithreading in a process also complicates the implementation of request-timeout.
Anyway, the big question is what version are you using?
Graham
--
You received this message because you are subscribed to the Google Groups "modwsgi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to modwsgi+u...@googlegroups.com.
To post to this group, send email to mod...@googlegroups.com.
Visit this group at http://groups.google.com/group/modwsgi.
For more options, visit https://groups.google.com/d/optout.
There are a few possibilities here for how this could be enhanced/changed. The problem with maximum-requests is that it can be dangerous. People can set it too low, and when their site gets a big spike of traffic the processes can be restarted too quickly, only adding to the load of the site, causing things to slow down and hampering their ability to handle the spike. This is where setting a longer amount of time for graceful-timeout helps, because you can set it to be quite large. The use of maximum-requests can still be like using a hammer though, and one which can be applied unpredictably.
The maximum-requests option also doesn't help in the case where you are running background threads which do stuff, and it is them, and not the number of requests coming in, that dictates things like the memory growth you want to counter.
So the other option I have contemplated adding a number of times is one to periodically restart the process. The way this would work is that a process restart would be done periodically based on the time specified. You could therefore say the restart interval was 3600 and it would restart the process once an hour. The start of the time period would either be when the process was created, if any Python code or a WSGI script was preloaded at process start time, or from when the first request arrived if the WSGI application was lazily loaded. This restart-interval could be tied to the graceful-timeout option so that you can set an extended period if you want to try and ensure that requests are not interrupted.
Now we have the ability to send the process a graceful restart signal (usually SIGUSR1) to force an individual process to restart. Right now this is tied to the graceful-timeout duration as well, which, as you point out, would perhaps be better off having its own time duration for the notional grace period. Using the name restart-timeout for this could be confusing if I have a restart interval option.
I also have another type of process restart I am trying to work out how to accommodate, and the naming of options again complicates the problem. In this case we want to introduce an artificial restart delay. This would be an option to combat the problem being caused by Django 1.7, in that WSGI script file loading for Django isn't stateless. If a transient problem occurs, such as the database not being ready, the loading of the WSGI script file can fail. On the next request an attempt is made to load it again, but now Django kicks up a stink because it was half way through setting things up last time when it failed, and the setup code cannot be run a second time. The result is that the process then keeps failing. The idea of the restart delay option therefore is to allow you to set it to a number of seconds, normally just 1. If set like that, and a WSGI script file import fails, it will effectively block for the delay specified, and when that is over it will kill the process, so the whole process is thrown away and the WSGI script file can be reloaded in a fresh process. This gets rid of the problem of Django initialisation not being able to be retried.
A delay is needed to avoid an effective fork bomb, where a WSGI script file not loading under high request throughput would cause a constant cycle of processes dying and being replaced. It is possible it wouldn't be as bad as I think, as Apache only checks for dead processes to replace once a second, but I still prefer my own fail-safe in case that changes.

I am therefore totally fine with a separate graceful time period for when SIGUSR1 is used; I just need to juggle these different features and come up with an option naming scheme that makes sense. How about then that I add the following new options:

maximum-lifetime - Similar to maximum-requests in that it will cause the processes to be shut down and restarted, but in this case it will occur based on the time period given as argument, measured from the first request, or from when the WSGI script file or any other Python code was preloaded, that is, in the latter case when the process was started.

restart-timeout - Specifies a separate grace period for when the process is being forcibly restarted using the graceful restart signal. If restart-timeout is not specified and graceful-timeout is specified, then the value of graceful-timeout is used. If neither is specified, then the restart signal will behave similarly to the process being sent a SIGINT.

linger-timeout - When a WSGI script file, or other Python code, is being imported by mod_wsgi directly, if that fails the default is that the error is ignored. For a WSGI script file, reloading will be attempted on the next request. But if preloading code, then it will fail and merely be logged. If linger-timeout is set to a non-zero value, with the value being seconds, then the daemon process will instead be shut down and restarted, to try to allow a successful reloading of the code to occur if it was a transient issue. To avoid a fork bomb if it is a persistent issue, a delay will be introduced based on the value of the linger-timeout option.
How does that all sound, if it makes sense that is. :-)
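If the three options sketched above landed as described, a configuration might read as follows. Note that these option names are proposals from this email, not shipped mod_wsgi directives, and the values are placeholders:

```apache
# Hypothetical: maximum-lifetime, restart-timeout and linger-timeout are
# proposed options from this discussion, not (yet) released features.
WSGIDaemonProcess myapp processes=3 threads=2 \
    maximum-lifetime=3600 restart-timeout=120 linger-timeout=1
```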
On Sunday, January 18, 2015 at 12:43:08 AM UTC-5, Graham Dumpleton wrote:
> There are a few possibilities here for how this could be enhanced/changed. The problem with maximum-requests is that it can be dangerous. People can set it too low, and when their site gets a big spike of traffic the processes can be restarted too quickly, only adding to the load of the site, causing things to slow down and hampering their ability to handle the spike. This is where setting a longer amount of time for graceful-timeout helps, because you can set it to be quite large. The use of maximum-requests can still be like using a hammer though, and one which can be applied unpredictably.
Yes, I can see that. (It may be overkill, but you could default a separate minimum-lifetime parameter so only users who specifically mess with that as well as maximum-requests shoot themselves in the foot, but it is starting to get confusing with all the different timeouts, I'll agree there...)
> The maximum-requests option also doesn't help in the case where you are running background threads which do stuff, and it is them, and not the number of requests coming in, that dictates things like the memory growth you want to counter.
True, but solving it with maximum-lifetime... well, actually, solving memory problems with any of these mechanisms isn't measuring the heart of the problem, which is RAM. I imagine there isn't a good way to measure RAM or you would have added that option by now. It seems what we are truly after, for the majority of these, isn't how many requests have been served or how long it's been up, etc., but how much RAM it is taking (or perhaps, optionally, average RAM per thread instead). If my process exceeds 1.5GB, then trigger a graceful restart at the next appropriate convenience, being gentle to existing requests. That may arguably be the most useful parameter.
> So the other option I have contemplated adding a number of times is one to periodically restart the process. The way this would work is that a process restart would be done periodically based on the time specified. You could therefore say the restart interval was 3600 and it would restart the process once an hour. The start of the time period would either be when the process was created, if any Python code or a WSGI script was preloaded at process start time, or from when the first request arrived if the WSGI application was lazily loaded. This restart-interval could be tied to the graceful-timeout option so that you can set an extended period if you want to try and ensure that requests are not interrupted.
We just wouldn't want it to die having never even served a single request, so my vote would be against the birth of the process as the beginning point (and, rather, at first request).
> Now we have the ability to send the process a graceful restart signal (usually SIGUSR1) to force an individual process to restart. Right now this is tied to the graceful-timeout duration as well, which, as you point out, would perhaps be better off having its own time duration for the notional grace period. Using the name restart-timeout for this could be confusing if I have a restart interval option.
In my opinion, SIGUSR1 is different from the automatic parameters because it was (most likely) triggered by user intervention, so that one should ideally have its own parameter. If that is the case and this parameter becomes dedicated to SIGUSR1, then the least ambiguous name I can think of is sigusr1-timeout.
> I also have another type of process restart I am trying to work out how to accommodate, and the naming of options again complicates the problem. In this case we want to introduce an artificial restart delay. This would be an option to combat the problem being caused by Django 1.7, in that WSGI script file loading for Django isn't stateless. If a transient problem occurs, such as the database not being ready, the loading of the WSGI script file can fail. On the next request an attempt is made to load it again, but now Django kicks up a stink because it was half way through setting things up last time when it failed, and the setup code cannot be run a second time. The result is that the process then keeps failing. The idea of the restart delay option therefore is to allow you to set it to a number of seconds, normally just 1. If set like that, and a WSGI script file import fails, it will effectively block for the delay specified, and when that is over it will kill the process, so the whole process is thrown away and the WSGI script file can be reloaded in a fresh process. This gets rid of the problem of Django initialisation not being able to be retried.
(We are using TurboGears... I don't think I've seen something like that happen, though we have rarely seen start-up anomalies.)
> A delay is needed to avoid an effective fork bomb, where a WSGI script file not loading under high request throughput would cause a constant cycle of processes dying and being replaced. It is possible it wouldn't be as bad as I think, as Apache only checks for dead processes to replace once a second, but I still prefer my own fail-safe in case that changes.
>
> I am therefore totally fine with a separate graceful time period for when SIGUSR1 is used; I just need to juggle these different features and come up with an option naming scheme that makes sense. How about then that I add the following new options:
>
> maximum-lifetime - Similar to maximum-requests in that it will cause the processes to be shut down and restarted, but in this case it will occur based on the time period given as argument, measured from the first request, or from when the WSGI script file or any other Python code was preloaded, that is, in the latter case when the process was started.
>
> restart-timeout - Specifies a separate grace period for when the process is being forcibly restarted using the graceful restart signal. If restart-timeout is not specified and graceful-timeout is specified, then the value of graceful-timeout is used. If neither is specified, then the restart signal will behave similarly to the process being sent a SIGINT.
>
> linger-timeout - When a WSGI script file, or other Python code, is being imported by mod_wsgi directly, if that fails the default is that the error is ignored. For a WSGI script file, reloading will be attempted on the next request. But if preloading code, then it will fail and merely be logged. If linger-timeout is set to a non-zero value, with the value being seconds, then the daemon process will instead be shut down and restarted, to try to allow a successful reloading of the code to occur if it was a transient issue. To avoid a fork bomb if it is a persistent issue, a delay will be introduced based on the value of the linger-timeout option.
>
> How does that all sound, if it makes sense that is. :-)
That sounds absolutely great! How would I get on the notification cc: of the ticket or whatever so I'd be informed of progress on that?
On 20/01/2015, at 11:50 PM, Kent <jkent...@gmail.com> wrote:
> On Sunday, January 18, 2015 at 12:43:08 AM UTC-5, Graham Dumpleton wrote:
>> There are a few possibilities here for how this could be enhanced/changed. The problem with maximum-requests is that it can be dangerous. People can set it too low, and when their site gets a big spike of traffic the processes can be restarted too quickly, only adding to the load of the site, causing things to slow down and hampering their ability to handle the spike. This is where setting a longer amount of time for graceful-timeout helps, because you can set it to be quite large. The use of maximum-requests can still be like using a hammer though, and one which can be applied unpredictably.
> Yes, I can see that. (It may be overkill, but you could default a separate minimum-lifetime parameter so only users who specifically mess with that as well as maximum-requests shoot themselves in the foot, but it is starting to get confusing with all the different timeouts, I'll agree there...)
The minimum-lifetime option is an interesting idea. It may have to do nothing by default to avoid conflicts with existing expected behaviour.

>> The maximum-requests option also doesn't help in the case where you are running background threads which do stuff, and it is them, and not the number of requests coming in, that dictates things like the memory growth you want to counter.
> True, but solving it with maximum-lifetime... well, actually, solving memory problems with any of these mechanisms isn't measuring the heart of the problem, which is RAM. I imagine there isn't a good way to measure RAM or you would have added that option by now. It seems what we are truly after, for the majority of these, isn't how many requests have been served or how long it's been up, etc., but how much RAM it is taking (or perhaps, optionally, average RAM per thread instead). If my process exceeds 1.5GB, then trigger a graceful restart at the next appropriate convenience, being gentle to existing requests. That may arguably be the most useful parameter.

The problem with calculating memory is that there isn't one cross-platform portable way of doing it. On Linux you have to dive into the /proc file system. On MacOS X you can use C API calls. On Solaris I think you again need to dive into a /proc file system, but it obviously has a different file structure for getting details out compared to Linux. Adding such cross-platform stuff gets a bit messy.

What I was moving towards, as an extension of the monitoring stuff I am doing for mod_wsgi, was to have a special daemon process you can set up which has access to some sort of management API. You could then create your own Python script that runs in that, and which, using the management API, can get the daemon process pids and then use the Python psutil module to get memory usage on a periodic basis. You then decide if a process should be restarted and send it a signal to stop; or a management API could be provided which allows you to notify the daemon process in some way, maybe by signal, or maybe using a shared memory flag, that it should shut down.
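A rough sketch of that watchdog idea, assuming Linux's /proc filesystem. The function names here are made up for illustration, and there is no mod_wsgi management API involved; a real script would likely use psutil rather than parsing /proc by hand:

```python
import os
import re
import signal

# Hedged sketch of the external-watchdog idea described above; nothing here
# is a mod_wsgi API. On Linux, a process's resident memory is readable from
# /proc/<pid>/status (the VmRSS line); psutil wraps the same data portably.

def vmrss_kib(status_text):
    """Extract the VmRSS value (in KiB) from /proc/<pid>/status content."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def restart_if_bloated(pid, limit_kib):
    """Send the graceful-restart signal if the daemon process is over budget."""
    with open("/proc/%d/status" % pid) as f:
        rss = vmrss_kib(f.read())
    if rss is not None and rss > limit_kib:
        os.kill(pid, signal.SIGUSR1)  # mod_wsgi treats this as graceful restart
        return True
    return False

# The parser can be checked against a captured /proc status fragment:
sample = "Name:\thttpd\nVmPeak:\t 180000 kB\nVmRSS:\t 157284 kB\n"
print(vmrss_kib(sample))  # -> 157284
```

A periodic loop calling restart_if_bloated() for each daemon pid (obtained however the hypothetical management API would expose them) would give the "restart at 1.5GB" behaviour discussed, with graceful-timeout still governing how gently the restart treats in-flight requests.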
>> So the other option I have contemplated adding a number of times is one to periodically restart the process. The way this would work is that a process restart would be done periodically based on the time specified. You could therefore say the restart interval was 3600 and it would restart the process once an hour. The start of the time period would either be when the process was created, if any Python code or a WSGI script was preloaded at process start time, or from when the first request arrived if the WSGI application was lazily loaded. This restart-interval could be tied to the graceful-timeout option so that you can set an extended period if you want to try and ensure that requests are not interrupted.
> We just wouldn't want it to die having never even served a single request, so my vote would be against the birth of the process as the beginning point (and, rather, at first request).

It would effectively be from the first request if lazily loaded. If preloaded though, since background threads could be created which do stuff and consume memory over time, it would then be from when the process started, i.e., when the Python code was preloaded.
>> Now we have the ability to send the process a graceful restart signal (usually SIGUSR1) to force an individual process to restart. Right now this is tied to the graceful-timeout duration as well, which, as you point out, would perhaps be better off having its own time duration for the notional grace period. Using the name restart-timeout for this could be confusing if I have a restart interval option.
> In my opinion, SIGUSR1 is different from the automatic parameters because it was (most likely) triggered by user intervention, so that one should ideally have its own parameter. If that is the case and this parameter becomes dedicated to SIGUSR1, then the least ambiguous name I can think of is sigusr1-timeout.
Except that it isn't guaranteed to be called SIGUSR1. Technically it could be a different signal depending on the platform Apache runs on. But then, as far as I know, all UNIX systems do use SIGUSR1.
>> I also have another type of process restart I am trying to work out how to accommodate, and the naming of options again complicates the problem. In this case we want to introduce an artificial restart delay. This would be an option to combat the problem being caused by Django 1.7, in that WSGI script file loading for Django isn't stateless. If a transient problem occurs, such as the database not being ready, the loading of the WSGI script file can fail. On the next request an attempt is made to load it again, but now Django kicks up a stink because it was half way through setting things up last time when it failed, and the setup code cannot be run a second time. The result is that the process then keeps failing. The idea of the restart delay option therefore is to allow you to set it to a number of seconds, normally just 1. If set like that, and a WSGI script file import fails, it will effectively block for the delay specified, and when that is over it will kill the process, so the whole process is thrown away and the WSGI script file can be reloaded in a fresh process. This gets rid of the problem of Django initialisation not being able to be retried.
> (We are using TurboGears... I don't think I've seen something like that happen, though we have rarely seen start-up anomalies.)
>> A delay is needed to avoid an effective fork bomb, where a WSGI script file not loading under high request throughput would cause a constant cycle of processes dying and being replaced. It is possible it wouldn't be as bad as I think, as Apache only checks for dead processes to replace once a second, but I still prefer my own fail-safe in case that changes.
>>
>> I am therefore totally fine with a separate graceful time period for when SIGUSR1 is used; I just need to juggle these different features and come up with an option naming scheme that makes sense. How about then that I add the following new options:
>>
>> maximum-lifetime - Similar to maximum-requests in that it will cause the processes to be shut down and restarted, but in this case it will occur based on the time period given as argument, measured from the first request, or from when the WSGI script file or any other Python code was preloaded, that is, in the latter case when the process was started.
>>
>> restart-timeout - Specifies a separate grace period for when the process is being forcibly restarted using the graceful restart signal. If restart-timeout is not specified and graceful-timeout is specified, then the value of graceful-timeout is used. If neither is specified, then the restart signal will behave similarly to the process being sent a SIGINT.
>>
>> linger-timeout - When a WSGI script file, or other Python code, is being imported by mod_wsgi directly, if that fails the default is that the error is ignored. For a WSGI script file, reloading will be attempted on the next request. But if preloading code, then it will fail and merely be logged. If linger-timeout is set to a non-zero value, with the value being seconds, then the daemon process will instead be shut down and restarted, to try to allow a successful reloading of the code to occur if it was a transient issue. To avoid a fork bomb if it is a persistent issue, a delay will be introduced based on the value of the linger-timeout option.
>>
>> How does that all sound, if it makes sense that is. :-)
> That sounds absolutely great! How would I get on the notification cc: of the ticket or whatever so I'd be informed of progress on that?

These days my turn-around time is pretty quick, so long as I am happy and know what to change and how. So I just need to think a bit more about it and get some day-job stuff out of the way before I can do something. So don't be surprised if you simply get a reply to this email within a week pointing at a development version to try.
...
Your Flask client doesn't need to know about Celery, as your web application accepts requests as normal and it is your Python code which would queue the job with Celery.

Now looking back, the only configuration I can find, though I don't know if it is your actual production configuration, is:

    WSGIDaemonProcess rarch processes=3 threads=2 inactivity-timeout=1800 display-name=%{GROUP} graceful-timeout=140 eviction-timeout=60 python-eggs=/home/rarch/tg2env/lib/python-egg-cache

Provided that you don't then start to have overall host memory issues, the simplest way around this whole issue is not to use a multithreaded process.
What you would do is vertically partition your URL namespace so that just the URLs which do the long-running report generation would be delegated to single-threaded processes. Everything else would keep going to the multithreaded processes.

    WSGIDaemonProcess rarch processes=3 threads=2
    WSGIDaemonProcess rarch-long-running processes=6 threads=1 maximum-requests=20

    WSGIProcessGroup rarch

    <Location /suburl/of/long/running/report/generator>
    WSGIProcessGroup rarch-long-running
    </Location>

You wouldn't even have to worry about the graceful-timeout on rarch-long-running, as that is only relevant for maximum-requests where it is a multithreaded process. So what would happen is that when the request has finished, if maximum-requests is reached, the process would be restarted even before any new request was accepted by the process, so there is no chance of a new request being interrupted.

You could still set an eviction-timeout of some suitably large value to allow you to use SIGUSR1 sent to processes in that daemon process group to shut them down. In this case, having eviction-timeout able to be set independently of graceful-timeout (for maximum-requests) is probably useful, and so I will retain the option.

So is there any reason you couldn't use a daemon process group with many single-threaded processes instead?
On Sun, Feb 1, 2015 at 7:08 PM, Graham Dumpleton <graham.d...@gmail.com> wrote:
> Your Flask client doesn't need to know about Celery, as your web application accepts requests as normal and it is your Python code which would queue the job with Celery.
>
> Now looking back, the only configuration I can find, though I don't know if it is your actual production configuration, is:
>
>     WSGIDaemonProcess rarch processes=3 threads=2 inactivity-timeout=1800 display-name=%{GROUP} graceful-timeout=140 eviction-timeout=60 python-eggs=/home/rarch/tg2env/lib/python-egg-cache
>
> Provided that you don't then start to have overall host memory issues, the simplest way around this whole issue is not to use a multithreaded process.
>
> What you would do is vertically partition your URL namespace so that just the URLs which do the long-running report generation would be delegated to single-threaded processes. Everything else would keep going to the multithreaded processes.
>
>     WSGIDaemonProcess rarch processes=3 threads=2
>     WSGIDaemonProcess rarch-long-running processes=6 threads=1 maximum-requests=20
>
>     WSGIProcessGroup rarch
>
>     <Location /suburl/of/long/running/report/generator>
>     WSGIProcessGroup rarch-long-running
>     </Location>
>
> You wouldn't even have to worry about the graceful-timeout on rarch-long-running, as that is only relevant for maximum-requests where it is a multithreaded process. So what would happen is that when the request has finished, if maximum-requests is reached, the process would be restarted even before any new request was accepted by the process, so there is no chance of a new request being interrupted.
>
> You could still set an eviction-timeout of some suitably large value to allow you to use SIGUSR1 sent to processes in that daemon process group to shut them down. In this case, having eviction-timeout able to be set independently of graceful-timeout (for maximum-requests) is probably useful, and so I will retain the option.
>
> So is there any reason you couldn't use a daemon process group with many single-threaded processes instead?

This is very good to know (that single-threaded procs would behave more ideally in these circumstances). The above was just my configuration for testing eviction-timeout. Our software generally runs with many more processes and threads, on servers with maybe 16 or 32 GB of RAM. And unfortunately, RAM is the limiting resource here, as our Python app, built on TurboGears, is a memory hog and we have yet to find the resources to dissect that. I was aiming to head in the direction of URL partitioning, but there are big obstacles. (Chiefly, RAM consumption would make threads=1, and yet more processes, very difficult unless we spend the huge effort of dissecting the app to locate and pull out the many unused memory-hogging libraries.) So URL partitioning is sort of the ideal, distant solution, as is a Celery-like polling solution, but out of my reach for now.