keep-alives and graceful restarts

Bill Moseley

Aug 17, 2012, 12:38:06 PM

to psgi-plack

We currently have a pool of Apache servers behind an F5 load balancer. We have long-lived keep-alives set between the F5 and the "backend" Apache app servers.

The problem with this setup is when we do a graceful restart on an Apache instance the keep-alive children don't exit until their connection is closed -- which can be a very long time due to how the F5 keeps connections alive.

What we really want for a "graceful" restart is to allow any current request to finish, close the connection, and then kill the process and reload.

We have added code to our app that catches SIGUSR1 and sets a flag, and when this flag is set we force a "Connection: close" header, which allows Apache to kill the worker child process.
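The flag-and-header approach described above can be sketched in a few lines. This is only an illustration of the pattern, written in Python rather than taken from the actual mod_perl app; the names `draining`, `request_drain`, `install_drain_handler` and `finalize_headers` are hypothetical, and the response is modelled as a plain list of header pairs.

```python
import signal

# Hypothetical module-level flag: set once a graceful restart was requested.
draining = False


def request_drain(signum=None, frame=None):
    """Signal handler: only records that this worker should stop soon."""
    global draining
    draining = True


def install_drain_handler(sig=None):
    """Register the handler (SIGUSR1 by default). Call from the main thread."""
    if sig is None:
        sig = signal.SIGUSR1  # the signal named in the post; Unix only
    signal.signal(sig, request_drain)


def finalize_headers(headers):
    """Return the response headers, forcing 'Connection: close' while draining."""
    if not draining:
        return list(headers)
    # Drop any existing Connection header and ask the peer to close.
    kept = [(k, v) for k, v in headers if k.lower() != "connection"]
    kept.append(("Connection", "close"))
    return kept
```

The handler does nothing but set the flag, so the request in progress still finishes normally; the connection is only given up when the next response is written.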

Could someone please explain how all this works with Starman and Server::Starter?

Now, with Server::Starter a "graceful" restart sends a SIGHUP -- I'm not clear if that's the same "graceful" Apache uses, where Apache waits for a child's connection to close before killing off that process.

Would I still need to catch a SIGHUP and send a "Connection: close" header as I do with Apache?

Thanks,

--
Bill Moseley
mos...@hank.org

Tatsuhiko Miyagawa

Aug 17, 2012, 12:47:27 PM

to psgi-...@googlegroups.com

On Aug 17, 2012, at 9:38 AM, Bill Moseley wrote:

> We currently have a pool of Apache servers behind an F5 load balancer. We have long-lived keep-alives set between the F5 and the "backend" Apache app servers.
>
> The problem with this setup is when we do a graceful restart on an Apache instance the keep-alive children don't exit until their connection is closed -- which can be a very long time due to how the F5 keeps connections alive.
>
> What we really want for a "graceful" restart is to allow any current request to finish, close the connection, and then kill the process and reload.
>
> We have added code to our app that catches SIGUSR1 and sets a flag, and when this flag is set we force a "Connection: close" header, which allows Apache to kill the worker child process.
>
> Could someone please explain how all this works with Starman and Server::Starter?
>
> Now, with Server::Starter a "graceful" restart sends a SIGHUP -- I'm not clear if that's the same "graceful" Apache uses, where Apache waits for a child's connection to close before killing off that process.

You have to convert the SIGHUP to SIGQUIT, as described in the Starman documentation, so that Starman gracefully shuts down the current workers while Server::Starter launches the new set of workers.

The way it works is that Server::Starter holds the listening ports and manages two clusters of Starman during the restart: the existing (serving) workers keep running until all their requests are handled and then quit gracefully, while a new set of workers is launched to take new requests.
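As a rough illustration of the two-cluster restart described above, here is a toy model in Python. It is not Server::Starter's code: the `Generation` and `Supervisor` classes and their methods are hypothetical, and routing is simplified to "new requests go to the newest generation" (in a real setup the workers accept on the shared listening socket themselves).

```python
class Generation:
    """One set of workers started by the supervisor (toy stand-in)."""

    def __init__(self, number):
        self.number = number
        self.in_flight = 0     # requests currently being served
        self.draining = False  # True once told to quit gracefully
        self.exited = False

    def finish_request(self):
        self.in_flight -= 1
        self.check_exit()

    def check_exit(self):
        # A draining generation exits only when nothing is in flight.
        if self.draining and self.in_flight == 0:
            self.exited = True


class Supervisor:
    """Keeps the listening port open across restarts (toy stand-in)."""

    def __init__(self):
        self.generations = [Generation(1)]

    @property
    def current(self):
        return self.generations[-1]

    def route_request(self):
        gen = self.current
        gen.in_flight += 1
        return gen

    def graceful_restart(self):
        old = self.current
        # Start the new generation first, then ask the old one to drain.
        self.generations.append(Generation(old.number + 1))
        old.draining = True
        old.check_exit()
```

In this model a request that was already running on generation 1 when `graceful_restart()` is called keeps generation 1 alive until `finish_request()` runs, while every later `route_request()` lands on generation 2.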

That said, because of the problem you described, I strongly recommend against enabling keep-alives between your frontend servers and Starman. I would say it's not even supported. (The documentation suggests disabling keep-alives when the frontend proxy holds long-open keep-alive connections.)


Bill Moseley

Aug 18, 2012, 12:32:43 PM

to psgi-...@googlegroups.com

On Fri, Aug 17, 2012 at 9:47 AM, Tatsuhiko Miyagawa <miya...@gmail.com> wrote:

>> Now, with Server::Starter a "graceful" restart sends a SIGHUP -- I'm not clear if that's the same "graceful" apache uses where Apache waits for a child's connection to close before killing off that process.
>
> You have to convert the SIGHUP to SIGQUIT as in the Starman documentation so that Starman gracefully shuts down the current workers while Server::Starter launches the new set of the workers.

Right, because by default Server::Starter sends a TERM.

I intended to ask about the docs. The docs say:

... sending "QUIT" signal to the master process will gracefully shutdown the workers (meaning the currently running requests will shut down once the request is complete).

What does "request is complete" mean? Is that like Apache, where it waits for the socket to close on the worker? As in my case, where a long-running keep-alive will keep the process busy until the connection is closed?

> That said, because of the problem you described, I strongly do not recommend enabling keep-alives between your frontend servers and Starman. I would say it's not even supported. (The documentation suggests disabling keep-alives when the frontend proxy has a long-open keep-alive connections)

Again, we have client-facing load balancers (F5s) that currently talk to Apache/mod_perl. So, to be clear, I'm talking about replacing Apache/mod_perl with Starman so the F5 talks directly to Starman (or Server::Starter, I suppose when making a connection).

Now, we are on a WAN with multiple datacenters, so it's possible that the F5 and Apache (or Starman) app servers are not in the same location. I don't know how significant that latency is, but that's why we would prefer to use keep-alives on the backend. At least according to our network engineers.

I'd have to check, but IIRC, the F5s don't pre-connect -- that is, when a client request comes in and needs to be passed to a backend server, the F5 creates a new connection if one does not already exist (from a keep-alive). In other words, I don't think the F5 pre-connects before it has a request to handle.

I guess it's time to test, but if I really needed to use the keep-alives would I then need to do something similar with Starman and catch the SIGQUIT and then add a "Connection: close" header to the response to free up the child to exit after that response? I'm not clear if that's just a "problem" with Apache or if it also would apply to Starman.

Thanks!

--
Bill Moseley
mos...@hank.org

Tatsuhiko Miyagawa

Aug 18, 2012, 3:06:16 PM

to psgi-...@googlegroups.com

On Aug 18, 2012, at 9:32 AM, Bill Moseley wrote:

> On Fri, Aug 17, 2012 at 9:47 AM, Tatsuhiko Miyagawa <miya...@gmail.com> wrote:
>
>>> Now, with Server::Starter a "graceful" restart sends a SIGHUP -- I'm not clear if that's the same "graceful" apache uses where Apache waits for a child's connection to close before killing off that process.
>>
>> You have to convert the SIGHUP to SIGQUIT as in the Starman documentation so that Starman gracefully shuts down the current workers while Server::Starter launches the new set of the workers.
>
> Right, because by default Server::Starter sends a TERM.
>
> I intended to ask about the docs. The docs say:
>
> ... sending "QUIT" signal to the master process will gracefully shutdown the workers (meaning the currently running requests will shut down once the request is complete).
>
> What does "request is complete" mean? Is that like Apache where it waits for the socket to close on the worker?

When the current request is completely served and the client has disconnected. Take a look at the code (it's all pure Perl) if you're wondering.

> As in my case where a long running keep-alive will keep the process busy until the connection is closed?

I suppose so.
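To make the consequence concrete, here is a small simulation in Python. It is a toy model and not Starman's code: it assumes a worker that notices the shutdown request only once its current connection has ended, and the function and parameter names are hypothetical.

```python
def serve_connection(requests_on_connection, quit_after, close_when_draining):
    """Simulate one worker serving one keep-alive connection.

    requests_on_connection: requests the client sends before it disconnects.
    quit_after: the shutdown signal arrives while this request (1-based) runs.
    close_when_draining: if True, the worker answers with "Connection: close"
        once shutdown was requested, ending the connection itself.
    Returns how many requests were served before the worker is free to exit.
    """
    served = 0
    quit_requested = False
    for n in range(1, requests_on_connection + 1):
        if n == quit_after:
            quit_requested = True
        served += 1  # the request in progress always runs to completion
        if quit_requested and close_when_draining:
            break  # the server ended the connection; the worker can exit
    return served


# The client keeps the connection open for 1000 requests;
# the shutdown signal arrives during the third one.
waits_for_client = serve_connection(1000, 3, close_when_draining=False)
closes_itself = serve_connection(1000, 3, close_when_draining=True)
```

Under these assumptions the first worker stays busy for all 1000 requests, while the second is free after the third, which is the behaviour the "Connection: close" workaround is after.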

> Now, we are on a WAN with multiple datacenters, so it's possible that the F5 and Apache (or Starman) app servers are not in the same location. I don't know how significant that latency is, but that's why we would prefer to use keep-alives on the backend. At least according to our network engineers.
>
> I'd have to check, but IIRC, the F5s don't pre-connect -- that is when a client request comes in and needs to be passed to a backend server the F5 creates a new connection if one does not exist (from a keep-alive). In other words, I don't think the F5 pre-connects before it has a request to handle.
>
> I guess it's time to test, but if I really needed to use the keep-alives would I then need to do something similar with Starman and catch the SIGQUIT and then add a "Connection: close" header to the response to free up the child to exit after that response? I'm not clear if that's just a "problem" with Apache or if it also would apply to Starman.

I suppose so, but as far as I can see there's no way to catch the QUIT signal in your worker to close the connection. In other words, it's not supported.
