How to determine a good value for graceful_timeout

andreas....@gmail.com

Apr 9, 2015, 12:39:14 PM
to mojol...@googlegroups.com
We hope that we are missing something and that there is a solution to
our problem that we just do not see yet. Please bear with us if that
is the case.

In the following problem description we come to the conclusion that
the configuration parameter *graceful_timeout* is not usable for us.
The current behaviour of hypnotoad never satisfies all of our
requirements. The problem is relatively new; we traced it back to
v5.10.

Falsification and corrections very welcome.

Our server is a completely synchronous prefork hypnotoad farm with
several hundred workers. Our accepts parameter is 1000; we cannot go
higher because we have memory leaks. Our heartbeat_timeout is 100000;
we cannot go lower because we must serve slow, long-running requests.
Now we are trying to determine a good value for graceful_timeout.
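
For concreteness, our setup corresponds roughly to the following
hypnotoad configuration (a sketch only; the listen address and the
exact worker count are placeholders, and the graceful_timeout value
is precisely what is in question):

    # myapp.conf -- rough sketch of the setup described above
    {
      hypnotoad => {
        listen            => ['http://*:8080'],  # placeholder address
        workers           => 400,                # "several hundred" synchronous workers
        accepts           => 1000,               # limited because of memory leaks
        heartbeat_timeout => 100000,             # must accommodate slow, long-running requests
        graceful_timeout  => 10,                 # 10? 100000? that is the question
      }
    };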

As far as we can see, *graceful_timeout* is used in four different
situations:

(1) When a worker's *heartbeat_timeout* is reached, the manager
process sends a SIGQUIT to that worker and, *graceful_timeout* seconds
later, a SIGKILL.

A good value for *graceful_timeout* in this context depends on the
time the server needs once it turns out that a request cannot be
finished within *heartbeat_timeout*. Our need here would be around 10
seconds for cleanup.

(2) When a graceful server shutdown is triggered by human
intervention, the manager process sends a SIGQUIT to all running
workers and, *graceful_timeout* seconds later, a SIGKILL to each of
them. Note that the person who triggers the graceful shutdown may
have to wait that long until all processes have finished or have been
killed.

Determining a good value for *graceful_timeout* in this context is
very similar to (1).

(3) When *accepts* has been reached for a worker, the worker process
lets the manager process know about that in its heartbeat (since
v5.10); the manager process then sends a SIGQUIT to the old worker
and, *graceful_timeout* seconds later, a SIGKILL.

A good value for *graceful_timeout* in this context has to cover a
request that may legitimately block for up to *heartbeat_timeout*,
i.e. 100000 seconds.

(4) When a graceful server restart is triggered by human
intervention, the manager process sends a SIGQUIT to all running
workers and, *graceful_timeout* seconds later, a SIGKILL to each of
them.

Determining a good value for *graceful_timeout* in this context is
very similar to (3).

To sum up: we have two situations (1 and 2) in which we want to set
graceful_timeout to 10 seconds, and two situations (3 and 4) in which
we need to set it to 100000 seconds.

Is there a way out? If we choose 10, we get too many broken
connections. If we set it to 100000, we frustrate our DevOps team.
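
Put as a tiny back-of-the-envelope script (just the arithmetic from
our four situations above, nothing taken from hypnotoad itself):

    # Check both candidate values against the two conflicting requirements.
    use strict;
    use warnings;

    my $cleanup_needed  = 10;       # situations 1 and 2: cleanup time after SIGQUIT
    my $longest_request = 100_000;  # situations 3 and 4: slow requests must be allowed to finish

    for my $graceful_timeout (10, 100_000) {
        my $breaks_requests = $graceful_timeout < $longest_request ? 'yes' : 'no';
        my $slow_restarts   = $graceful_timeout > $cleanup_needed  ? 'yes' : 'no';
        print "graceful_timeout=$graceful_timeout: ",
              "broken connections: $breaks_requests, ",
              "drawn-out shutdowns/restarts: $slow_restarts\n";
    }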

Please advise.

Dan Book

Apr 9, 2015, 12:43:32 PM
to mojol...@googlegroups.com
You are correct that hypnotoad is not well designed for such long-running requests. Browsers are usually not going to wait more than a few minutes for a request to complete. If you have background processes whose results you don't intend to return to the browser, perhaps something like Minion is a better solution for that.
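
Roughly, that could look like this (a minimal sketch assuming Minion with the Postgres backend; the task name, route and connection string are made up for illustration):

    # Offload the slow, blocking work to a Minion job queue instead of a hypnotoad worker.
    use Mojolicious::Lite;

    plugin Minion => {Pg => 'postgresql://user:pass@/minion_db'};

    # The long-running work happens in a separate "minion worker" process,
    # so no HTTP connection is held open while it runs.
    app->minion->add_task(slow_report => sub {
        my ($job, @args) = @_;
        # ... long-running blocking work goes here ...
        $job->finish('done');
    });

    post '/reports' => sub {
        my $c  = shift;
        my $id = $c->minion->enqueue(slow_report => [$c->req->json]);
        $c->render(json => {job => $id});  # respond right away with a job id
    };

    app->start;

The jobs are then processed by a separate process started with the "minion worker" command, which can take as long as it needs without any hypnotoad timeouts getting in the way.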

-Dan


andreas....@gmail.com

Apr 9, 2015, 4:18:00 PM
to mojol...@googlegroups.com
Thanks Dan. I find Minion attractive, but then I also found Mojolicious
very well suited to our tasks until the small change in v5.10 turned
out to be a bit of a challenge. I would hope there is enough expertise
around to suggest a less intrusive change to our (or Mojo's?) codebase.

sri

Apr 9, 2015, 4:41:56 PM
to mojol...@googlegroups.com
To sum it up, you want no safety timeouts for restarts; old workers should just be able to finish whatever they are doing indefinitely.

--
sebastian

sri

Apr 9, 2015, 4:58:59 PM
to mojol...@googlegroups.com
> The current behaviour of hypnotoad never satisfies all
> of our requirements. The problem is relatively new; we traced it back
> to v5.10.

I'm afraid that no matter how this thread ends, going back to the old bug-ridden behavior from before 5.10 will not be an option; it used to cause way too many problems. And one thing is clear: your use case will never be a perfect fit for Hypnotoad, which is at its heart an event loop manager. Perhaps there are things we can do to make long-running blocking requests more pleasant, but we will not make sacrifices for it.

--
sebastian

sri

Apr 9, 2015, 5:52:16 PM
to mojol...@googlegroups.com
> To sum it up, you want no safety timeouts for restarts; old workers should just be able to finish whatever they are doing indefinitely.

Or to be more clear, what you want is the removal of this line.


--
sebastian 

sri

Apr 9, 2015, 11:20:39 PM
to mojol...@googlegroups.com
> Or to be more clear, what you want is the removal of this line.


And before anyone asks why we don't just do that... Mojo::IOLoop::max_accepts would not work anymore if you have, for example, Mojo::Pg::PubSub listening for notifications, or anything else that uses Mojo::Reactor directly.

--
sebastian

sri

Apr 9, 2015, 11:39:37 PM
to mojol...@googlegroups.com
> ...or anything else that uses Mojo::Reactor directly.

Actually that's not quite correct, but you can test the problem with a one-liner like this.

    perl -Mojo -E 'Mojo::IOLoop->stream(Mojo::IOLoop::Stream->new(*STDOUT)->timeout(0)); a({text => "Hello World!"})->start' prefork -a 1 -w 1

--
sebastian

andreas....@gmail.com

Apr 10, 2015, 3:27:49 AM
to mojol...@googlegroups.com
On Thursday, April 9, 2015 at 11:52:16 PM UTC+2, sri wrote:
> To sum it up, you want no safety timeouts for restarts; old workers should just be able to finish whatever they are doing indefinitely.

For restarts, yes, if you include reaching the accepts limit under the term "restart". And not indefinitely, but with precise and well-known timeouts. The timeouts, however, would need to be conditional; the single parameter graceful_timeout is not sufficient. If the worker could communicate the required timeout to the manager, that would probably lead to a solution for us.

Another promising tweak is to get our workers to the point where they do not grow. Then we can set accepts=0 and no longer have this problem. But we need time to get there, and we would need something like a Mojolicious::Plugin::SizeLimit for the days when we make mistakes.
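
Something along these lines is what we have in mind (a rough sketch of a hypothetical Mojolicious::Plugin::SizeLimit; Linux-only because it reads /proc, and a graceful stop via SIGQUIT still goes through graceful_timeout, so it does not answer the timeout question by itself):

    package Mojolicious::Plugin::SizeLimit;  # hypothetical plugin, name as wished for above
    use Mojo::Base 'Mojolicious::Plugin';

    sub register {
      my ($self, $app, $conf) = @_;
      my $max_kb = $conf->{max_rss_kb} // 500_000;  # assumed default, tune per application

      $app->hook(after_dispatch => sub {
        # Read this worker's resident set size from /proc (Linux only).
        return unless open my $fh, '<', '/proc/self/status';
        my ($rss_kb) = map /^VmRSS:\s+(\d+)/, <$fh>;
        close $fh;

        # Ask hypnotoad for a graceful stop of this worker once it has grown too big.
        kill 'QUIT', $$ if $rss_kb && $rss_kb > $max_kb;
      });
    }

    1;

A worker that stops itself this way still finishes its current request first, just like any other graceful stop.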

andreas....@gmail.com

Apr 11, 2015, 7:19:45 AM
to mojol...@googlegroups.com
If this thread has come to an end, I wonder whether the documentation should be amended to help other users to get their configurations right. The surprise factor in particular is not amusing when you are hit by this. Believe me, I've been there.

So I would suggest adding a sentence like this to the graceful_timeout documentation:

Note that if you are using accepts>0, then you should always set graceful_timeout to the same value as heartbeat_timeout, otherwise you get inconsistent timeouts.

Thanks for your considerations.

sri

Apr 11, 2015, 10:25:41 AM
to mojol...@googlegroups.com
> Note that if you are using accepts>0, then you should always set graceful_timeout to the same value as heartbeat_timeout, otherwise you get inconsistent timeouts.

For the record, this is actually not good advice. To anyone stumbling over this in the future (so far this is the first time the topic has come up): just use a job queue for blocking operations that take a very long time.

--
sebastian 

sri

Apr 11, 2015, 10:36:23 AM
to mojol...@googlegroups.com
> ...just use a job queue for blocking operations that take a very long time.

And one more data point: browsers tend to wait between 1 and 5 minutes for a response. So you should never have to set either timeout to a value greater than 300 seconds.

--
sebastian

Stefan Adams

Apr 11, 2015, 10:41:19 AM
to mojolicious

On Sat, Apr 11, 2015 at 9:36 AM, sri <kra...@googlemail.com> wrote:
> browsers tend to wait between 1 and 5 minutes for a response

Is this a hard and fast number per browser type / version, or does a specific browser vary wait times from one request to the next?

That is, does Chrome wait, e.g., 3 minutes, FF 4 wait 2 minutes, and FF 20 wait 5 minutes? Or do all browsers wait between 1 and 5 minutes without predictability?

sri

Apr 11, 2015, 11:11:30 AM
to mojol...@googlegroups.com
> That is, does Chrome wait, e.g., 3 minutes, FF 4 wait 2 minutes, and FF 20 wait 5 minutes? Or do all browsers wait between 1 and 5 minutes without predictability?

This, and it varies between browser versions. Personally, I would never go above 60 seconds, which is already much longer than most users are willing to wait.

--
sebastian 

sri

Apr 11, 2015, 11:18:48 AM
to mojol...@googlegroups.com
> If this thread has come to an end, I wonder whether the documentation should be amended to help other users to get their configurations right.

If I had to make up rules for setting timeouts correctly, they would probably be along the lines of "heartbeat_timeout: should be a little higher than the longest blocking operation you're performing, but smaller than 60 seconds", and "graceful_timeout: should be a little higher than the longest time it takes your application to generate a response, but smaller than 60 seconds". And "if you need more than 60 seconds, some of those operations should really be performed with a job queue".
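
Translated into a configuration sketch (the numbers below are placeholders that merely follow those rules of thumb, not recommendations for any particular application):

    # myapp.conf -- placeholder numbers following the rules of thumb above
    {
      hypnotoad => {
        heartbeat_timeout => 40,  # e.g. longest blocking call takes ~30s, stay below 60
        graceful_timeout  => 50,  # e.g. slowest full response takes ~45s, stay below 60
      }
    };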

--
sebastian