HA Storm & Nimbus fail-over


Dan

Jan 28, 2012, 8:00:38 AM
to storm...@googlegroups.com
Hey,

We're evaluating Storm and looking into how we can use it in an HA
setup for dealing with real-time requests.
From the wiki I can see there is quite a bit of automated fault
tolerance in the design, but I've got a few questions on parts of it.

With Nimbus, if you start a pair of these for fail-over in case a
machine fails, can both run connected to ZooKeeper and fail over when
one dies? Or would we have to use linux-ha to ensure we've only got
one running at a time? We'll need the latter for IP fail-over for
Nimbus anyway.

For the worker processes, if a machine with a set of workers on it
dies, what timeouts have to elapse before they get reassigned to
another task slot?

Also, with the task slots: if the Storm cluster has fewer task slots
free than a topology asks for, does it just get assigned what's left
and spread its workers over those instead?

Any other tips/experiences on production setups for Storm would be
great to hear about too!

Thanks,
Dan

Nathan Marz

Jan 29, 2012, 6:26:17 AM
to storm...@googlegroups.com
Hey Dan,

This page talks about Storm's fault-tolerance: https://github.com/nathanmarz/storm/wiki/Fault-tolerance

To summarize, the only event that can cause a problem is a total loss of the Nimbus machine. The Nimbus daemon process dying or the machine restarting is not an issue. 

You can't currently run multiple Nimbus nodes, because Nimbus stores some state on local disk (the topology jars and configs). If you're really worried about losing the Nimbus node, you can reduce the probability by beefing up the node with redundant power, etc.

The configuration "nimbus.task.timeout.secs" sets the timeout after which Nimbus considers a task dead and restarts it on another worker slot.
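For reference, this is roughly how that setting looks in storm.yaml — a sketch, with the value believed to be the usual default rather than a recommendation:

```yaml
# storm.yaml (illustrative)
# Seconds without a heartbeat before Nimbus considers a task dead
# and reassigns it to another worker slot (30 is believed to be the default).
nimbus.task.timeout.secs: 30
```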

If there are fewer worker slots available than configured for a topology, Storm will squeeze all the tasks for the topology into fewer workers. If more slots become available (like you add a node), Storm will automatically rebalance the topology onto the proper number of workers.
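To make the squeezing behaviour concrete, here's a toy round-robin sketch (not Storm's actual scheduler code) of how tasks end up spread over however many slots are actually free:

```python
def assign_tasks(task_ids, requested_workers, free_slots):
    """Toy round-robin assignment: use at most the slots actually free."""
    n_workers = min(requested_workers, free_slots)
    workers = [[] for _ in range(n_workers)]
    for i, task in enumerate(task_ids):
        workers[i % n_workers].append(task)
    return workers

# Topology asks for 4 workers but only 3 slots are free:
# all 10 tasks get squeezed into 3 workers instead of 4.
print(assign_tasks(list(range(10)), requested_workers=4, free_slots=3))
```

Once a fourth slot frees up, re-running the assignment with free_slots=4 spreads the same tasks over 4 workers — which is the effect of Storm's automatic rebalance.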

It's recommended to run your clusters with some extra capacity to avoid squeezing topologies.

-Nathan
--
Twitter: @nathanmarz
http://nathanmarz.com
