Graceful cluster restart of master and workers with node 0.6, with zero downtime


Steve Molitor

Jan 24, 2012, 6:22:29 PM
to nod...@googlegroups.com
Using the node 0.6.x cluster module, what is the best way to gracefully restart a cluster, including both the workers *and* the master, without any downtime?  My current naive attempt is as follows:

0. Start a cluster of workers.  Each worker calls 'server.listen(80)'. 

When master receives SIGHUP signal:

1. Spawn a new master and set of workers.  Each new worker calls 'server.listen(80)'.  (uh, problem here - same port)
2. Original master sends message to each of its workers telling it to close.
3. Upon receipt of message, each worker calls server.close().  No new connections are accepted.
4. On server 'close' event, each worker tells its master that it is closed, and calls process.exit().
5. When the original master has received 'closed' messages back from all its workers, original master calls process.exit().
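
Steps 2 through 5 above could be sketched roughly as follows. This is a minimal sketch, not code from the thread: the function names and the 'close' / 'closed' message strings are hypothetical, and it assumes a Node.js version in which process.send accepts a completion callback (so the worker does not exit before its message is flushed).

```javascript
// Worker side (steps 3 and 4). `proc` is the worker's `process` object and
// `server` is the net/http server that called listen(). Names are hypothetical.
function installShutdownHandler(server, proc) {
  proc.on('message', function (msg) {
    if (msg !== 'close') return;
    // Step 4: 'close' fires once the server has stopped listening and every
    // open connection has ended. Report back, then exit.
    server.once('close', function () {
      proc.send('closed', function () {
        proc.exit(0);
      });
    });
    // Step 3: stop accepting new connections; existing ones keep running.
    server.close();
  });
}

// Master side (steps 2 and 5). `workers` is an array of cluster workers.
function awaitWorkers(workers, onAllClosed) {
  let remaining = workers.length;
  if (remaining === 0) return onAllClosed();
  workers.forEach(function (worker) {
    worker.on('message', function (msg) {
      // Step 5: act only after the last worker has reported in.
      if (msg === 'closed' && --remaining === 0) onAllClosed();
    });
    worker.send('close'); // step 2
  });
}
```

In a worker this would be wired up as installShutdownHandler(server, process); in the master, awaitWorkers(listOfWorkers, function () { process.exit(0); }). It does nothing about the port conflict in step 1.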

The original master and its workers are now dead, and a new master and its worker processes are running.  However, if a long-running connection is still open on one of the original workers and the new cluster starts before that connection finishes, the new cluster doesn't receive any HTTP requests.  I assume this is because the port is already in use.

If I don't start the new cluster (step 1) until after all the original workers have exited, just before closing the original master (step 5), everything works fine.  However, there is an interval of time where I am not accepting any connections.  If I don't start a new master process but just close and restart individual workers, everything also works fine.  However, my goal is no downtime, and reloading of all node processes including the master (to pick up new master code, a new node version, etc).

The node-cluster module from LearnBoost accomplished this, but I'm trying to use node 0.6 and its built in cluster support.

Thanks,

Steve

Diogo Resende

Jan 24, 2012, 6:31:28 PM
to nod...@googlegroups.com
For truly zero downtime, you have to do it this way:

0. Start master and workers

- SIGHUP comes in

1. Send that information to, for example, half of the workers
2. This half should stop accepting connections, and as soon as they
have served their last request they should exit
3. The master knows when the workers start exiting gracefully and
starts new workers
4. When half of your workers have restarted you can do the same to
the others

Remember that you will possibly have workers running different
code versions at the same time, so ensure this won't be a problem.

The first half, then the second half is just one strategy. You can go one by one
or use any other split. Just don't notify all workers at the same time, or
you might have them closed too fast for you to start new ones.
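
As a rough illustration of the batching idea above, here is a minimal sketch. The helper names restartInBatches and replaceWorker are hypothetical, and it assumes a Node.js version whose cluster module provides worker.disconnect() and the worker 'listening' event (newer than the 0.6 API discussed in this thread).

```javascript
const cluster = require('cluster');

// Restart `workers` in groups of `batchSize`. The next group is only touched
// once every worker in the current group has been replaced, so some workers
// are always left accepting connections.
function restartInBatches(workers, batchSize, restartOne, done) {
  const queue = workers.slice();
  (function nextBatch() {
    const batch = queue.splice(0, batchSize);
    if (batch.length === 0) return done();
    let pending = batch.length;
    batch.forEach(function (worker) {
      restartOne(worker, function () {
        if (--pending === 0) nextBatch();
      });
    });
  })();
}

// One possible `restartOne` for cluster workers: ask the worker to stop
// accepting and drain, fork a successor when it exits, and report back once
// the successor is listening.
function replaceWorker(worker, cb) {
  worker.once('exit', function () {
    cluster.fork().once('listening', function () { cb(); });
  });
  worker.disconnect();
}
```

On SIGHUP the master could call restartInBatches(Object.values(cluster.workers), Math.ceil(n / 2), replaceWorker, callback), where n is the worker count. A batch size of 1 gives the one-by-one variant.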

---
Diogo R.

> --
> Job Board: http://jobs.nodejs.org/ [1]
> Posting guidelines:
> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> [2]
> You received this message because you are subscribed to the Google
> Groups "nodejs" group.
> To post to this group, send email to nod...@googlegroups.com
> To unsubscribe from this group, send email to
> nodejs+un...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/nodejs?hl=en?hl=en [3]
>
>
> Links:
> ------
> [1] http://jobs.nodejs.org/
> [2]
> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> [3] http://groups.google.com/group/nodejs?hl=en?hl=en

Steve Molitor

Jan 24, 2012, 9:50:08 PM
to nod...@googlegroups.com
Thanks for the suggestion, but I'm trying to kill the original master and start a new master, in addition to the workers.  However, it seems I can't have two master processes with workers handling requests on the same port at the same time.  I know how to get zero downtime if I don't restart the master process.  And I know how to kill and start a new master if I give up on zero downtime.  But I'm trying to both restart the master and have zero downtime.

Steve



Andrew Chilton

Jan 24, 2012, 10:07:47 PM
to nod...@googlegroups.com
On 25 January 2012 15:50, Steve Molitor <stevem...@gmail.com> wrote:
> Thanks for the suggestion, but I'm trying to kill the original master and
> start a new master, in addition to the workers.  However, it seems I can't
> have two master processes with workers handling requests on the same port,
> and the same time.   I know how to get zero downtime, if I don't restart the
> master process.  And I know how to kill and start a new master, if I give up
> on zero downtime.  But I'm trying to both restart the master, and have zero
> downtime.

There are other ways of doing it too. For example in a recent project
of mine, I start two node.js servers listening on different local
ports. Both are proxied to by Nginx in the same server{} stanza.
Basically I rarely need to restart Nginx but I _can_ restart one
nodejs server, then the other to get new code running.

I know this isn't using cluster like you asked, but extrapolating can
make this work for your use-case too. I'm afraid I don't know how to
do it with a single master process without downtime, but I'd be keen
on knowing if this is possible. I think Nginx can also be restarted
with zero downtime - so the hard problem is already solved - so that
is also an option (as well as making it serve your static content)
which means you can still get what you want.

In an older project of mine where we had 4 webservers, we load
balanced to each of them, but took one out of the load balancer at a
time to load new code. Again, no downtime but a different solution to
what you asked for. Maybe these have given you some ideas. :)

Let us know when you come to a solution you like!

Cheers,
Andy

--
Andrew Chilton
e: chi...@appsattic.com
w: http://www.appsattic.com/

Karl Tiedt

Jan 24, 2012, 10:08:58 PM
to nod...@googlegroups.com
Doing this in software alone on one system isn't really feasible. You would
basically need a load balancer in front of two systems, so that you
can shut one down and restart it while the other keeps serving. You can't
really overcome those limitations without downtime unless you add another
layer (hardware or virtual).

-Karl Tiedt

Karl Tiedt

Jan 24, 2012, 10:14:04 PM
to nod...@googlegroups.com
Not sure what load balancers you used, but don't most support not
serving to a down server already? Seems like extra work to reconfig
for that purpose ;)

-Karl Tiedt

Andrew Chilton

Jan 24, 2012, 10:22:48 PM
to nod...@googlegroups.com
On 25 January 2012 16:14, Karl Tiedt <kti...@gmail.com> wrote:
> Not sure what load balancers you used, but don't most support not
> serving to a down server already? Seems like extra work to reconfig
> for that purpose ;)

Absolutely, nothing will come through to a dead server (which is the
point of this solution).

However, just blithely stopping a server may cause some requests to
fail (maybe this was an artifact of our system) but we found it useful
to take a server out of the lb, wait until zero requests were being
sent to it and then we knew we could restart it without anyone seeing
any problems. :) This could have been automated quite easily so it's
not really a problem - just an extra safety net. :)
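
The "wait until zero requests" check can also be approximated inside the node process itself. The following is only a sketch under assumptions: trackInflight is a hypothetical helper, not part of node or of any load balancer, and it counts HTTP requests via the server's 'request' event and the response's 'finish' / 'close' events.

```javascript
// Count in-flight HTTP requests on `server` and let callers wait for zero.
function trackInflight(server) {
  const state = { inflight: 0, waiters: [] };
  server.on('request', function (req, res) {
    state.inflight++;
    let settled = false;
    function settle() {
      // 'finish' and 'close' can both fire for one response; count it once.
      if (settled) return;
      settled = true;
      state.inflight--;
      if (state.inflight === 0) {
        state.waiters.splice(0).forEach(function (cb) { cb(); });
      }
    }
    res.on('finish', settle); // response fully handed off
    res.on('close', settle);  // connection ended, possibly early
  });
  return {
    inflight: function () { return state.inflight; },
    // Call `cb` right away if idle, otherwise when the count next reaches zero.
    whenIdle: function (cb) {
      if (state.inflight === 0) cb(); else state.waiters.push(cb);
    }
  };
}
```

After taking the server out of the load balancer, tracker.whenIdle(restart) would run the restart only once the last outstanding response has completed.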

You're probably right, but this was what we found in our environment.
Others may differ and playing with your own system will teach you what
you need to do. Hope that makes sense.

Karl Tiedt

Jan 24, 2012, 10:29:05 PM
to nod...@googlegroups.com
Very true Andrew, I was neglecting to consider open connections - good catch :)

-Karl Tiedt

knc

Jan 25, 2012, 1:36:35 AM
to nodejs
Any specific reason why you want to restart the master process as
well?

I recently started working on a wrapper/helper for the core cluster
module. It's pretty much a work in progress, but a lot of what you have
described would be handy to have in this module.

https://github.com/kishorenc/clusterize

Regards,

Kishore.

Alan Hoffmeister

Jan 25, 2012, 10:03:04 AM
to nod...@googlegroups.com
Kishore, any chance of having auto reload on file changes and
CoffeeScript support? Also, some API for getting worker/master status
would be nice.

--
Att,
Alan Hoffmeister

Steve Molitor

Jan 25, 2012, 10:29:50 AM
to nod...@googlegroups.com
One reason would be to seamlessly deploy new node versions.  

Steve


Steve Molitor

Jan 25, 2012, 11:16:10 AM
to nod...@googlegroups.com
If I somehow passed the original socket to the new master and had it use that, would that work?  Could I have two clusters servicing requests on the same port (temporarily)?  Is that what LearnBoost's cluster module does?

Steve


billywhizz

Jan 25, 2012, 3:02:09 PM
to nodejs
you won't be able to have two unrelated processes listening on the
same port at the same time so if you want to restart the master
process without downtime you will need to have a load balancer in
front of it. there's a good summary of the options here:
http://www.loadbalancing.org/

if you are on linux/FreeBSD then LVS is probably the best option:
http://www.linuxvirtualserver.org/whatis.html

as far as i know you can only pass a socket to a related (child)
process and not to an unrelated process which would pretty much rule
out what you suggest above. it might be possible to spawn a child
process (which should use the newly installed version of node) and
send the socket to that before killing the parent, as long as there is
nothing that breaks between the two versions of node, but you'll still
have to deal with the issue of not being able to copy the new version
of node over the old one while it's running.
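
The spawn-a-child-and-send-the-socket idea could look roughly like this. It is a sketch under assumptions, not a tested upgrade path: handOffServer, adoptServer and the 'server' message tag are hypothetical names, and it relies on child_process.fork() plus the sendHandle argument of child.send(). Whether the child cleanly outlives its parent, and whether a handle can cross between two different node versions, is not verified here.

```javascript
const childProcess = require('child_process');

// Parent side: fork the replacement process (running `modulePath`) and pass
// it the listening server handle. After this, both processes can accept
// connections on the same port until the parent closes its copy.
function handOffServer(server, modulePath) {
  const child = childProcess.fork(modulePath);
  child.send('server', server);
  return child;
}

// Child side: wait for the handle and start handling its connections.
// `proc` is the child's `process` object.
function adoptServer(proc, onConnection) {
  proc.on('message', function (msg, handle) {
    if (msg === 'server' && handle) {
      handle.on('connection', onConnection);
    }
  });
}
```

The handle arrives in the child as a plain TCP server. For HTTP, one option is to create an http.Server in the child and forward each socket to it with httpServer.emit('connection', socket). Once the child is serving, the parent would call server.close() and exit when its own connections have drained.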

knc

Jan 25, 2012, 10:48:26 PM
to nodejs
With respect to two processes using the same port, can someone please
explain what's said on this page:

http://nodejs.org/docs/latest/api/net.html#server.listen

It says "All sockets in Node set SO_REUSEADDR already". In TCP, can't
we have two programs listening on the same socket, if we set the
SO_REUSEADDR on the socket before bind?

Matt

Jan 25, 2012, 11:53:58 PM
to nod...@googlegroups.com
No.

dhruvbird

Jan 27, 2012, 7:29:47 AM
to nodejs
IIRC, you can use Unix domain sockets to pass file descriptors between unrelated
processes - I might be wrong on this though.

Regards,
-Dhruv.

Dobes

Jan 28, 2012, 8:52:03 AM
to nodejs
If your master restarts quickly enough you can have the client wait
for it to restart without rejecting the connection - it appears as a
couple second pause for users but not an error.

If not, you can put a proxy in front that listens on the one port and
switches automatically. Presumably this proxy would be shut down less
frequently than your master process, allowing most upgrades to avoid
downtime. Perhaps use something like nginx for that proxy.

If you can run the new master on a different port and have new clients/
workers use the new port the proxying logic could be coded right into
the master to save running an extra process - perhaps the master is
told there's a new master and it re-wires itself to proxy all requests
to the new master instead of processing them itself.
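
A bare-bones version of that forwarding step might look like the sketch below. createForwarder is a hypothetical helper, and it forwards at the TCP level with the net module rather than parsing HTTP; the port numbers in the usage note are assumptions as well.

```javascript
const net = require('net');

// Build a TCP server that relays every incoming connection to another
// local port, where the new master (or its workers) would be listening.
function createForwarder(targetPort, targetHost) {
  return net.createServer(function (client) {
    const upstream = net.connect(targetPort, targetHost || '127.0.0.1');
    // Relay bytes in both directions.
    client.pipe(upstream);
    upstream.pipe(client);
    // If either side fails, tear down the other so no socket is left hanging.
    client.on('error', function () { upstream.destroy(); });
    upstream.on('error', function () { client.destroy(); });
  });
}
```

For example, createForwarder(8081).listen(80) would relay traffic arriving on port 80 to a new master on port 8081. The forwarding process still owns the public port, so it cannot itself be replaced without the same problem coming back.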

