We hope that we are missing something and that there is a solution to
our problem that we just do not see yet. Please bear with us if that
is the case.
Our current conclusion from the problem description below is that the
configuration parameter *graceful_timeout* is not usable for us: with
the current behaviour of hypnotoad, no single value satisfies all of
our requirements. The problem is relatively new; we traced it back to
v5.10.
Falsification and corrections are very welcome.
Our server is a completely synchronous prefork hypnotoad farm with
several hundred workers. Our accepts parameter is 1000. We cannot go
higher because we have memory leaks. Our heartbeat_timeout is 100000.
We cannot go lower because we must serve slow, long-running requests.
We are now trying to determine a good value for *graceful_timeout*.
As far as we can see, *graceful_timeout* is used in four different
situations:
(1) When a *heartbeat_timeout* is reached, the manager process sends a
SIGQUIT to the worker and then, *graceful_timeout* seconds later, sends
a SIGKILL.
A good value for *graceful_timeout* in this context depends on how
long the server needs for cleanup once it turns out that a request
cannot be finished within *heartbeat_timeout*. We would need around
10 seconds for that.
(2) When a graceful server shutdown is triggered by human
intervention, the manager process sends a SIGQUIT to all running
workers and then, *graceful_timeout* seconds later, sends a SIGKILL to
each of them. Note that the human who triggers the graceful shutdown
may have to wait that long until all processes have finished or have
been killed.
Determining a good value for *graceful_timeout* in this context is
very similar to (1).
(3) When *accepts* has been reached for a worker, the worker process
lets the manager process know about that in its heartbeat (since
v5.10); the manager process then sends a SIGQUIT to the old process
and, *graceful_timeout* seconds later, sends a SIGKILL.
A good value for *graceful_timeout* in this context is the same as
for *heartbeat_timeout*, i.e. 100000 seconds, because the worker may
still be serving a slow request that must be allowed to finish.
(4) When a graceful server restart is triggered by human
intervention, the manager process sends a SIGQUIT to all running
workers and then, *graceful_timeout* seconds later, sends a SIGKILL to
each of them.
Determining a good value for *graceful_timeout* in this context is
very similar to (3).
To sum up: we have two situations (1 and 2) in which we want to set
*graceful_timeout* to 10 seconds and two situations (3 and 4) in which
we need to set it to 100000 seconds.
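To make the conflict concrete, here is a minimal Python sketch of the reasoning above. It does not talk to hypnotoad at all; the names `Situation`, `SITUATIONS` and `evaluate` are made up for illustration, and the wanted values are simply the ones we stated for each situation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    number: int          # numbering used in the list above
    trigger: str         # what makes the manager send SIGQUIT
    wanted_seconds: int  # graceful_timeout we would like in this case


# Wanted values as stated in the text; nothing here is read from hypnotoad.
SITUATIONS = [
    Situation(1, "heartbeat_timeout reached", 10),
    Situation(2, "graceful shutdown by a human", 10),
    Situation(3, "accepts reached", 100_000),
    Situation(4, "graceful restart by a human", 100_000),
]


def evaluate(graceful_timeout):
    """Return (too_short, too_long) situation numbers for one global value.

    too_short: SIGKILL would arrive before the time we want to allow.
    too_long:  SIGKILL would arrive later than we are willing to wait.
    """
    too_short = [s.number for s in SITUATIONS
                 if graceful_timeout < s.wanted_seconds]
    too_long = [s.number for s in SITUATIONS
                if graceful_timeout > s.wanted_seconds]
    return too_short, too_long


if __name__ == "__main__":
    for candidate in (10, 100_000):
        print(candidate, evaluate(candidate))
```

Under these assumptions, every candidate value leaves at least one of the two lists non-empty, which is exactly the dilemma: one global setting cannot serve both groups of situations.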
Is there a way out? If we choose 10, we get too many broken
connections. If we choose 100000, we frustrate our DevOps team.
Please advise.