How to set up a high availability framework for nimbus


charming

Apr 5, 2012, 10:03:13 AM4/5/12
to storm...@googlegroups.com

     Hi, I found that Nimbus is an important component in Storm. Now I want to set up a highly available Storm deployment, and one task is to improve the availability of Nimbus. Can I run two or more Nimbus instances in Storm? What would the architecture look like? Where can I find the related guides or information?

     Any comments are highly appreciated!

freevictor

Apr 6, 2012, 5:33:28 AM4/6/12
to storm...@googlegroups.com
I saw the same discussion here: https://groups.google.com/d/topic/storm-user/qGoP6_wFw3w/discussion
Nathan said that running 2 Nimbus instances currently won't work, since Nimbus needs local files.

But still question to Nathan and all:

Is it possible to set up a Nimbus master-slave pair with HA software and shared storage, like this:
- set up 2 Nimbus instances on 2 different machines
- put the needed local files on shared storage (like a SAN)
- use HA software like keepalived to monitor the Nimbus process
- when the master machine fails, keepalived switches to the slave machine and attaches the shared storage to the slave
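Sketching the keepalived side of that proposal, a vrrp_instance could float a virtual IP between the two Nimbus boxes and run a storage-attach script when a box becomes master. This is only an illustrative fragment, not from this thread: the interface name, virtual IP, health check, and the mount-nimbus-storage.sh script are all placeholders.

```
vrrp_script chk_nimbus {
    script "pidof java"        # crude liveness check; a real one would probe the Nimbus Thrift port
    interval 2
}

vrrp_instance NIMBUS_VIP {
    state MASTER
    interface eth0             # placeholder interface
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        10.0.0.50/24           # placeholder VIP that clients point at
    }
    track_script {
        chk_nimbus
    }
    # hypothetical script that attaches the shared storage before Nimbus starts
    notify_master "/usr/local/bin/mount-nimbus-storage.sh"
}
```

The slave would carry the same config with state BACKUP and a lower priority.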

Dan

Apr 6, 2012, 7:58:14 AM4/6/12
to storm...@googlegroups.com
Hey,

Following on from that discussion, we decided to run just a single Nimbus for now. It's important, but if it fails Storm will still run; the only things that won't work while Nimbus is down are deploying, killing, and rebalancing topologies. So with Nimbus down, if a supervisor node failed, any workers on it would not be reallocated to other nodes, which is the only big issue.

With the current Nimbus design, if we wanted HA we would use DRBD to replicate the local storage between two boxes, then Heartbeat to move over the IP and the DRBD master if one were to fail. It's a little more fiddly, but it would work.

I think what might make more sense would be to change how Nimbus currently works in Storm. I think we should use ZK to discover the Nimbus host; then we could start many instances and have master election, so failing over would be simpler.

Then we just need some shared storage support. It's only used for deploying the topology jars and config, so we could use something like S3/HDFS (which would work nicely for Ted with MapR clusters, as that'll be on hand), or just make sure the jar is deployed to all Nimbus instances before we accept the topology.
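Since Nimbus only needs to put and get topology jars and configs, the shared-storage idea could be a small pluggable interface. A minimal sketch in Python; TopologyStore and LocalDirStore are invented names for illustration, not Storm APIs, and a real backend would wrap HDFS, S3, or an NFS/MapR mount.

```python
import os
import shutil
from abc import ABC, abstractmethod

class TopologyStore(ABC):
    """Illustrative pluggable store for topology jars and configs."""

    @abstractmethod
    def put(self, key: str, local_path: str) -> None:
        """Upload a local file under the given key."""

    @abstractmethod
    def get(self, key: str, local_path: str) -> None:
        """Download the file stored under key to local_path."""

class LocalDirStore(TopologyStore):
    """Backend over a plain directory; point the root at a shared
    NFS/MapR mount and every Nimbus instance sees the same jars."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, key: str, local_path: str) -> None:
        shutil.copyfile(local_path, os.path.join(self.root, key))

    def get(self, key: str, local_path: str) -> None:
        shutil.copyfile(os.path.join(self.root, key), local_path)
```

Swapping in an S3 or HDFS backend would then be a matter of implementing the same two methods.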

If we do the same for the DRPC endpoints, moving host discovery to ZK, we won't need any hosts in the config anymore, which would make things simpler and cleaner for failover as well.

I'd be keen to discuss and help out with this work if others think it makes sense.

Ted Dunning

Apr 6, 2012, 5:42:08 PM4/6/12
to storm...@googlegroups.com
A better solution is to use a ZK leader election for detecting the down machine.

Then use shared storage, but not a SAN. Use something like NFS or MapR to store the nimbus state.  Whenever a nimbus fails, the spares will do a leader election and whichever wins can read the shared state and continue.
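The election scheme described here follows ZooKeeper's standard recipe: each candidate creates a sequential ephemeral node under a shared path, and whoever holds the lowest sequence number leads; when the leader's session dies its node disappears and the next-lowest takes over. A toy in-process simulation of just that ordering logic (no real ZooKeeper; the FakeZk class is purely illustrative):

```python
import itertools

class FakeZk:
    """Tiny in-memory stand-in for ZK sequential ephemeral nodes.
    Only enough behaviour to show the election recipe."""

    def __init__(self):
        self._seq = itertools.count()
        self._nodes = {}  # node name -> candidate that created it

    def create_sequential(self, owner: str) -> str:
        # ZK appends a monotonically increasing, zero-padded counter,
        # so lexicographic order equals creation order.
        name = "nimbus-%010d" % next(self._seq)
        self._nodes[name] = owner
        return name

    def delete(self, name: str) -> None:
        # In real ZK the ephemeral node vanishes when the session dies.
        del self._nodes[name]

    def leader(self) -> str:
        # The recipe: the candidate with the lowest sequence leads.
        return self._nodes[min(self._nodes)]

zk = FakeZk()
a = zk.create_sequential("nimbus-A")
b = zk.create_sequential("nimbus-B")
assert zk.leader() == "nimbus-A"  # lowest sequence wins
zk.delete(a)                      # A's session dies
assert zk.leader() == "nimbus-B"  # B takes over, reads shared state
```

The new leader then picks up from the shared filesystem state, exactly as described above.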

DRBD is a bad option because updates can occur out of order.  Much better to have a shared file system.

Few alternatives for failure detection are as reliable as ZK, and since running Storm implies that ZK is available, it has no marginal cost. 

freevictor

Apr 7, 2012, 9:28:46 PM4/7/12
to storm...@googlegroups.com
Many thanks Dan and Ted for the insight. Could you elaborate on which config and jar files Nimbus needs?


Nathan Marz

Apr 9, 2012, 5:03:12 PM4/9/12
to storm...@googlegroups.com
Nimbus stores the topology jar file and topology config on the local filesystem. I agree that the right modifications to make to Storm to make Nimbus highly available are:

1. Use leader election
2. Have a shared storage system of some kind
   - either make this pluggable (so you can use HDFS, MapR, NFS, etc.)
   - Bittorrent would be another interesting way to accomplish this

One design I've been thinking about is to get rid of a separate Nimbus service and instead merge that code with the supervisor. The "storm" client would be modified to discover a supervisor node to upload topologies to via ZK.
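From the client side, that merged design might work by having each supervisor register an ephemeral node under a well-known ZK path and letting the "storm" client pick one to upload to. A hypothetical sketch; the registry dict stands in for a ZK children listing (e.g. of a path like /storm/supervisors, which is an assumed layout, not Storm's actual one):

```python
import random

def pick_upload_target(registered):
    """Pick a live node to upload the topology jar to.

    `registered` mimics the children of a ZK path, mapping
    ephemeral node id -> "host:port" payload. Only live nodes
    appear, because ephemeral nodes vanish with their sessions.
    """
    if not registered:
        raise RuntimeError("no live nodes registered in ZK")
    node_id = random.choice(sorted(registered))
    return registered[node_id]
```

With that, the client needs no Nimbus host in its config at all; it only needs the ZK ensemble address.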


On Sat, Apr 7, 2012 at 9:28 PM, freevictor <freev...@gmail.com> wrote:
Many thanks Dan and Ted for the insight.   Could you elaborate what config and jar files that Nimbus needs?





--
Twitter: @nathanmarz
http://nathanmarz.com
