Using the node 0.6.x cluster module, what is the best way to gracefully restart a cluster, including both the workers *and* the master, without any downtime? My current naive attempt is as follows:
0. Start a cluster of workers. Each worker calls 'server.listen(80)'.
When master receives SIGHUP signal:
1. Spawn a new master and set of workers. Each new worker calls 'server.listen(80)'. (uh, problem here - same port)
2. Original master sends message to each of its workers telling it to close.
3. Upon receipt of message, each worker calls server.close(). No new connections are accepted.
4. On server 'close' event, each worker tells its master that it is closed, and calls process.exit().
5. When the original master has received 'closed' messages back from all its workers, original master calls process.exit().
The original master and its workers are now dead, and a new master and worker processes are running. However, if a long-running connection is still open on one of the original workers and the new cluster starts before that connection finishes, the new cluster doesn't receive any HTTP requests. I assume this is because the port is already in use.
If I don't start the new cluster (step 1) until after all the original workers have exited, just before closing the original master (step 5), everything works fine. However, there is an interval of time where I am not accepting any connections. If I don't start a new master process but just close and restart individual workers, everything also works fine. However, my goal is no downtime and reloading of all node processes including the master (to pick up new master code, a new node version, etc).
The node-cluster module from LearnBoost accomplished this, but I'm trying to use node 0.6 and its built-in cluster support.
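A minimal sketch of the worker side of steps 2 to 4 above, written against a current Node release rather than 0.6 (whose cluster API differs in details). The `{ cmd: 'close' }` and `{ cmd: 'closed' }` message shapes and both function names are assumptions made for illustration; they are not part of the cluster module.

```javascript
// Hypothetical helper: stop accepting connections, then report back and
// exit once the server has fully closed (steps 3 and 4).
function closeGracefully(server, notifyMaster, exit) {
  server.close(() => {
    notifyMaster({ cmd: 'closed' });
    exit(0);
  });
}

// Worker-side wiring. Only meaningful inside a cluster worker, where
// process.send() exists and the master sends { cmd: 'close' } (step 2).
function installShutdownHandler(server) {
  process.on('message', (msg) => {
    if (msg && msg.cmd === 'close') {
      closeGracefully(
        server,
        (reply) => process.send(reply),
        (code) => process.exit(code)
      );
    }
  });
}
```

Note that the `server.close()` callback only fires after the open connections have ended, so a long-running connection delays the worker's exit for as long as it lasts.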
For really zero downtime, you have to do it this way:
0. Start master and workers
- SIGHUP comes in
1. Send that information to, for example, half of the workers
2. This half should stop accepting connections, and as soon as they serve their last request they should exit
3. The master knows when the workers start exiting gracefully and starts new workers
4. When half of your workers have restarted you can do the same to the others
Remember that you will possibly have workers running different code versions at the same time, so ensure this won't be a problem.
The half 1st, half later is just a strategy. You can do it one by one or any other way. Just don't notify all workers at the same time, or they might close too fast for you to start new ones.
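The batching strategy above could be sketched on the master side roughly as follows. This assumes a current Node release, and it assumes each worker exits by itself after receiving a `{ cmd: 'close' }` message; that message and the names `splitIntoBatches`, `replaceWorker` and `rollingRestart` are hypothetical.

```javascript
const cluster = require('cluster');

// Split the workers into groups; groups = 2 gives "half first, half later",
// groups = workers.length gives one by one.
function splitIntoBatches(items, groups) {
  const size = Math.ceil(items.length / groups);
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Ask one worker to close; when it has exited, fork a replacement and
// resolve once the replacement is listening.
function replaceWorker(worker) {
  return new Promise((resolve) => {
    worker.once('exit', () => {
      cluster.fork().once('listening', () => resolve());
    });
    worker.send({ cmd: 'close' }); // hypothetical message, handled by the worker
  });
}

// Restart the batches one after another so some workers are always serving.
async function rollingRestart(groups) {
  const workers = Object.values(cluster.workers);
  for (const batch of splitIntoBatches(workers, groups)) {
    await Promise.all(batch.map(replaceWorker));
  }
}
```

In the master this would be hooked up with something like `process.on('SIGHUP', () => rollingRestart(2));`. It restarts workers only; the master keeps running.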
On Tue, 24 Jan 2012 17:22:29 -0600, Steve Molitor wrote:
> Using the node 0.6.x cluster module, what is the best way to gracefully restart a cluster, including both the workers *and* the master, without any downtime?
Thanks for the suggestion, but I'm trying to kill the original master and start a new master, in addition to the workers. However, it seems I can't have two master processes with workers handling requests on the same port at the same time. I know how to get zero downtime if I don't restart the master process. And I know how to kill and start a new master if I give up on zero downtime. But I'm trying to both restart the master and have zero downtime.
Steve
On Tue, Jan 24, 2012 at 5:31 PM, Diogo Resende <drese...@thinkdigital.pt> wrote:
> For really zero downtime, you have to do it this way:
On 25 January 2012 15:50, Steve Molitor <stevemoli...@gmail.com> wrote:
> Thanks for the suggestion, but I'm trying to kill the original master and start a new master, in addition to the workers. However, it seems I can't have two master processes with workers handling requests on the same port at the same time. I know how to get zero downtime if I don't restart the master process. And I know how to kill and start a new master if I give up on zero downtime. But I'm trying to both restart the master and have zero downtime.
There are other ways of doing it too. For example in a recent project of mine, I start two node.js servers listening on different local ports. Both are proxied to by Nginx in the same server{} stanza. Basically I rarely need to restart Nginx but I _can_ restart one nodejs server, then the other to get new code running.
I know this isn't using cluster like you asked, but extrapolating can make this work for your use-case too. I'm afraid I don't know how to do it with a single master process without downtime, but I'd be keen on knowing if this is possible. I think Nginx can also be restarted with zero downtime - so the hard problem is already solved - so that is also an option (as well as making it serve your static content) which means you can still get what you want.
In an older project of mine where we had 4 webservers, we load balanced to each of them, but took one out of the load balancer at a time to load new code. Again, no downtime but a different solution to what you asked for. Maybe these have given you some ideas. :)
With software only on 1 system, it's not really feasible. You would basically need a load balancer in front of 2 systems, where you can shut 1 down and still restart the other. You can't really overcome those limitations without downtime; otherwise you have to add another layer (hardware or virtual).
Not sure what load balancers you used, but don't most support not serving to a down server already? Seems like extra work to reconfig for that purpose ;)
On Tue, Jan 24, 2012 at 9:07 PM, Andrew Chilton <chi...@appsattic.com> wrote:
> In an older project of mine where we had 4 webservers, we load balanced to each of them, but took one out of the load balancer at a time to load new code. Again, no downtime but a different solution to what you asked for. Maybe these have given you some ideas. :)
On 25 January 2012 16:14, Karl Tiedt <kti...@gmail.com> wrote:
> Not sure what load balancers you used, but don't most support not serving to a down server already? Seems like extra work to reconfig for that purpose ;)
Absolutely, nothing will come through to a dead server (which is the point of this solution).
However, just blithely stopping a server may cause some requests to fail (maybe this was an artifact of our system), so we found it useful to take a server out of the lb, wait until zero requests were being sent to it, and only then restart it, knowing nobody would see any problems. :) This could have been automated quite easily so it's not really a problem - just an extra safety net. :)
You're probably right, but this was what we found in our environment. Others may differ and playing with your own system will teach you what you need to do. Hope that makes sense.
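One way the "wait until zero requests" step could be automated is sketched below. `createDrainTracker` and its method names are hypothetical, and the idea of exposing `isHealthy()` through a health-check route that the load balancer polls is an assumption about the setup, not something described above.

```javascript
// Hypothetical tracker: counts in-flight requests and reports when a
// draining server has gone idle and can be restarted.
function createDrainTracker() {
  let inFlight = 0;
  let draining = false;
  let onIdle = null;

  function checkIdle() {
    if (draining && inFlight === 0 && onIdle) {
      const callback = onIdle;
      onIdle = null; // fire at most once
      callback();
    }
  }

  return {
    requestStarted() { inFlight += 1; },
    requestFinished() { inFlight -= 1; checkIdle(); },
    // A health-check route can answer 503 while this returns false, so the
    // load balancer stops sending new traffic to this server.
    isHealthy() { return !draining; },
    // Mark the server as draining and call back once it is idle.
    drain(callback) { draining = true; onIdle = callback; checkIdle(); },
  };
}
```

In an HTTP server, `requestStarted()` would be called at the top of the request handler and `requestFinished()` when the response emits 'finish' or 'close'.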
Any specific reason why you want to restart the master process as well?
I recently started working on a wrapper/helper for the core cluster module. It's pretty much a work in progress, but a lot of what you have said would be handy to have in this module.
On Wed, Jan 25, 2012 at 12:36 AM, knc <kishor...@gmail.com> wrote:
> Any specific reason why you want to restart the master process as well?
>
> I recently started working on a wrapper/helper for the core cluster module. Pretty much a work in progress, but a lot of what you have said would be handy to have in this module.
If I somehow passed the original socket to the new master and had it use that, would that work? Could I have two clusters servicing requests on the same port (temporarily)? Is that what LearnBoost's cluster module does?
you won't be able to have two unrelated processes listening on the
same port at the same time so if you want to restart the master
process without downtime you will need to have a load balancer in
front of it. there's a good summary of the options here:
http://www.loadbalancing.org/
as far as i know you can only pass a socket to a related (child)
process and not to an unrelated process which would pretty much rule
out what you suggest above. it might be possible to spawn a child
process (which should use the newly installed version of node) and
send the socket to that before killing the parent, as long as there is
nothing that breaks between the two versions of node, but you'll still
have to deal with the issue of not being able to copy the new version
of node over the old one while it's running.
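A rough sketch of the "spawn a child and send the socket to it" idea, assuming a current Node release where `child.send(message, sendHandle, callback)` can carry a `net.Server`. The `'server'` message string and the names `handOffServer` and `adoptServer` are hypothetical, and whether the hand-off is really seamless between two different node versions is exactly the open question raised above.

```javascript
// Parent side: hand the listening socket to an already-forked child, then
// stop accepting in the parent. The child keeps its own copy of the socket.
function handOffServer(server, child, done) {
  child.send('server', server, (err) => {
    if (err) return done(err);
    server.close(); // parent stops accepting new connections
    done(null);
  });
}

// Child side: wait for the handle and start serving connections on it.
function adoptServer(onConnection) {
  process.on('message', (msg, handle) => {
    if (msg === 'server' && handle) {
      handle.on('connection', onConnection);
    }
  });
}
```

The child here would be created with `child_process.fork()`, which sets up the IPC channel that `send()` needs. This only works parent to child, which matches the point above about unrelated processes.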
On Jan 25, 4:16 pm, Steve Molitor <stevemoli...@gmail.com> wrote:
> If I somehow passed the original socket to the new master and had it use
> that, would that work? Could I have two clusters servicing requests on the
> same port (temporarily)? Is that what learn boost's cluster module does?
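With respect to two processes using the same port, can someone please explain what's said on this page:
It says "All sockets in Node set SO_REUSEADDR already". In TCP, can't we have two programs listening on the same socket, if we set SO_REUSEADDR on the socket before bind?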
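A small experiment along these lines, assuming a current Node release on Linux or macOS: a second listener on a port that already has an active listener is expected to fail with EADDRINUSE, even with SO_REUSEADDR set. It uses two servers in one process for brevity, and `tryDoubleListen` is a hypothetical name; as far as I know the bind conflict is the same when the two listeners live in unrelated processes.

```javascript
const net = require('net');

// Listen on a free port, then try to listen on the same address and port
// again. Calls back with the error code of the second attempt, or null if
// the second listen unexpectedly succeeded.
function tryDoubleListen(callback) {
  const first = net.createServer();
  first.listen(0, '127.0.0.1', () => {
    const port = first.address().port;
    const second = net.createServer();
    second.once('error', (err) => {
      first.close();
      callback(err.code);
    });
    second.once('listening', () => {
      second.close();
      first.close();
      callback(null);
    });
    second.listen(port, '127.0.0.1');
  });
}
```

As I understand it, SO_REUSEADDR mainly lets a restarted server bind again while old connections on that port are still in TIME_WAIT; it does not let two sockets actively listen on the same address and port.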
On Jan 26, 1:02 am, billywhizz <apjohn...@gmail.com> wrote:
> you won't be able to have two unrelated processes listening on the
> same port at the same time so if you want to restart the master
> process without downtime you will need to have a load balancer in
> front of it. there's a good summary of the options here: http://www.loadbalancing.org/
On Wed, Jan 25, 2012 at 10:48 PM, knc <kishor...@gmail.com> wrote:
> With respect to two processes using the same port, can someone please explain what's said on this page:
>
> It says "All sockets in Node set SO_REUSEADDR already". In TCP, can't we have two programs listening on the same socket, if we set the SO_REUSEADDR on the socket before bind?
If your master restarts quickly enough, you can have clients wait for
it to come back instead of rejecting their connections. Users see a
pause of a couple of seconds, but not an error.
If not, you can put a proxy in front that listens on the one port and
switches automatically between the old and new master. Presumably this
proxy would be shut down less frequently than your master process,
allowing most upgrades to avoid downtime. Perhaps use something like
nginx for that proxy.
If you can run the new master on a different port and have new clients
and workers use the new port, the proxying logic could be coded right
into the master to save running an extra process. Perhaps the old
master is told there's a new master and re-wires itself to proxy all
requests to the new master instead of processing them itself.
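A minimal sketch of that re-wiring idea, assuming a plain TCP forwarder built on node's 'net' module whose target port can be swapped at runtime. The helper name createSwitchableProxy and the ports mentioned below are made up for illustration, and error handling is kept to the bare minimum:

```javascript
const net = require('net');

// Hypothetical helper: a TCP forwarder whose upstream port can be changed
// while it keeps listening on the public port.
function createSwitchableProxy(initialUpstreamPort, upstreamHost = '127.0.0.1') {
  let upstreamPort = initialUpstreamPort;

  const server = net.createServer((client) => {
    // Each new connection goes to whichever upstream is current right now.
    // Connections already in flight stay on the upstream they started with.
    const upstream = net.connect(upstreamPort, upstreamHost);
    client.pipe(upstream);
    upstream.pipe(client);
    // If either side fails, tear down the other so sockets don't leak.
    client.on('error', () => upstream.destroy());
    upstream.on('error', () => client.destroy());
  });

  return {
    server,
    getUpstreamPort: () => upstreamPort,
    setUpstreamPort: (port) => { upstreamPort = port; },
  };
}
```

The process holding the public port would call something like proxy.server.listen(80) with the old cluster on, say, port 8001. Once a new cluster is accepting connections on 8002, proxy.setUpstreamPort(8002) sends new connections there while the old workers drain.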
On Jan 25, 10:50 am, Steve Molitor <stevemoli...@gmail.com> wrote:
> Thanks for the suggestion, but I'm trying to kill the original master and
> start a new master, in addition to the workers. However, it seems I can't
> have two master processes with workers handling requests on the same port
> at the same time. I know how to get zero downtime, if I don't restart
> the master process. And I know how to kill and start a new master, if I
> give up on zero downtime. But I'm trying to both restart the master, and
> have zero downtime.
> Steve
> On Tue, Jan 24, 2012 at 5:31 PM, Diogo Resende <drese...@thinkdigital.pt> wrote:
> > For really zero downtime, you have to do it this way:
> > 0. Start master and workers
> > - SIGHUP comes in
> > 1. Send that information to, for example, half of the workers
> > 2. This half should stop accepting connections and, as soon as they
> > have served their last request, they should exit
> > 3. The master knows when the workers start exiting gracefully and
> > starts new workers
> > 4. When half of your workers have restarted you can do the same to
> > the others
> > Remember that you will possibly have workers running different
> > code versions at the same time, so ensure this won't be a problem.
> > The half first, half later split is just a strategy. You can go one by
> > one or use any other scheme. Just don't notify all workers at the same
> > time, or they might close faster than you can start new ones.
> > ---
> > Diogo R.
> > On Tue, 24 Jan 2012 17:22:29 -0600, Steve Molitor wrote:
> >> Using the node 0.6.x cluster module, what is the best way to
> >> gracefully restart a cluster, including both the workers *and* the
> >> master, without any downtime? My current naive attempt is as
> >> follows:
> >> 0. Start a cluster of workers. Each worker calls
> >> 'server.listen(80)'.
> >> When master receives SIGHUP signal:
> >> 1. Spawn a new master and set of workers. Each new worker calls
> >> 'server.listen(80)'. (uh, problem here - same port)
> >> 2. Original master sends message to each of its workers telling it to
> >> close.
> >> 3. Upon receipt of message, each worker calls server.close(). No new
> >> connections are accepted.
> >> 4. On server 'close' event, each worker tells its master that it is
> >> closed, and calls process.exit().
> >> 5. When the original master has received 'closed' messages back from
> >> all its workers, original master calls process.exit().
> >> The original master and its workers are now dead, and a new master and
> >> worker processes are running. However, if a long running connection is
> >> running on one of the original workers and the new cluster starts
> >> before that connection finishes, the new cluster doesn't receive any
> >> HTTP requests. I assume this is because the port is already in use.
> >> If I don't start the new cluster (step 1) until after all the original
> >> workers have exited just before closing the original master (step 5),
> >> everything works fine. However there is an interval of time where
> >> I am not accepting any connections. If I don't start a new master
> >> process but just close and restart individual workers everything works
> >> fine also. However, my goal is no downtime, and reloading of all
> >> node processes including the master (to pick up new master code, a new
> >> node version, etc).
> >> The node-cluster module from LearnBoost accomplished this, but I'm
> >> trying to use node 0.6 and its built in cluster support.
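The batch-by-batch worker restart Diogo describes in the quoted message above could be sketched roughly as follows. This assumes a current Node.js cluster API (worker.disconnect() and the worker 'listening' event), which is newer than 0.6; on 0.6 you would instead send each worker a message and have it call server.close() itself. The names toBatches, replaceWorker and rollingRestart are made up for illustration:

```javascript
const cluster = require('cluster');

// Split a list into consecutive groups of at most `size` items.
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Ask one worker to finish its open connections and exit, then fork a
// replacement and wait until the replacement is listening.
function replaceWorker(worker) {
  return new Promise((resolve) => {
    worker.once('exit', () => {
      const replacement = cluster.fork();
      replacement.once('listening', resolve);
    });
    worker.disconnect();
  });
}

// Restart all current workers, one batch at a time, so the remaining
// workers keep accepting connections while a batch is being replaced.
async function rollingRestart(batchSize) {
  const workers = Object.values(cluster.workers);
  for (const batch of toBatches(workers, batchSize)) {
    await Promise.all(batch.map(replaceWorker));
  }
}
```

In the master this might be wired up as process.on('SIGHUP', () => rollingRestart(2)). Note that it only replaces workers; the master itself keeps running its old code, which is the limitation the thread is about.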