Hey,
Following on from that discussion, we decided to run just a single Nimbus for now. Nimbus is important, but if it fails Storm will keep running; the only things that stop working while Nimbus is down are deploying, killing, and rebalancing topologies. The one big issue is that with Nimbus down, if a supervisor node fails, any workers on it will not be reallocated to other nodes.
With the current Nimbus design, if we wanted HA we would use DRBD to replicate the local storage between two boxes, then Heartbeat to fail over the IP and the DRBD primary role if one box died. It's a little fiddly, but it would work.
I think what might make more sense is to change how Nimbus currently works in Storm. We should use ZK to discover the Nimbus host; then we could run several Nimbus instances and use leader election to fail over much more simply.
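To make the idea concrete, here's a minimal in-memory model of the ZooKeeper-style election I'm picturing: each Nimbus candidate creates an ephemeral sequential node under an election path, the lowest sequence number is the leader, and if the leader's session dies the next-lowest takes over automatically. The class and names below are hypothetical sketches for discussion, not Storm or ZooKeeper APIs:

```python
class ElectionModel:
    """Stands in for a ZK election path like /storm/nimbus-election."""

    def __init__(self):
        self._seq = 0
        self._nodes = {}  # sequence number -> candidate host:port

    def join(self, host):
        """Create an 'ephemeral sequential' node; returns its sequence number."""
        seq = self._seq
        self._seq += 1
        self._nodes[seq] = host
        return seq

    def leave(self, seq):
        """Simulate session loss: the ephemeral node disappears."""
        self._nodes.pop(seq, None)

    def leader(self):
        """The candidate holding the lowest sequence number is the leader."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]


election = ElectionModel()
a = election.join("nimbus-a:6627")
b = election.join("nimbus-b:6627")
print(election.leader())  # nimbus-a:6627 — first to register wins
election.leave(a)         # nimbus-a's ZK session dies
print(election.leader())  # nimbus-b:6627 — automatic failover
```

In a real implementation this would just be ZooKeeper's standard leader-election recipe (or a library like Curator doing it for us), with the elected Nimbus writing its host:port to a well-known znode that supervisors and clients watch.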
Then we just need some shared-storage support. The storage is only used for deploying the topology JARs and config, so we could use something like S3 or HDFS (which would work nicely for Ted with MapR clusters, as that'll be on hand), or just make sure the JAR is copied to every Nimbus before we accept the topology.
If we do the same and move the DRPC endpoints into ZK for host discovery, we won't need any hosts in the config at all, which would make things simpler and cleaner for fail-over there as well.
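The DRPC side would follow the same pattern: each DRPC server registers an ephemeral znode under something like a /storm/drpc path, and clients list that path instead of reading a static server list from the config. Another hedged in-memory sketch (again, these names are illustrative, not real Storm APIs):

```python
class DiscoveryModel:
    """Models a ZK registry path that DRPC servers register under."""

    def __init__(self):
        self._endpoints = {}  # server id -> host:port

    def register(self, server_id, hostport):
        # Ephemeral node: removed automatically when the server's session ends.
        self._endpoints[server_id] = hostport

    def unregister(self, server_id):
        # Simulate the server dying and its ephemeral node expiring.
        self._endpoints.pop(server_id, None)

    def endpoints(self):
        # What a client would see by listing the registry's children.
        return sorted(self._endpoints.values())


registry = DiscoveryModel()
registry.register("drpc-1", "drpc-a:3772")
registry.register("drpc-2", "drpc-b:3772")
print(registry.endpoints())  # both live servers, no static config needed
registry.unregister("drpc-1")
print(registry.endpoints())  # the failed server drops out on its own
```

The nice property is that dead servers disappear from the list without anyone editing a config file, which is exactly the fail-over behaviour we're after.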
I'd be keen to discuss and help out with this work if others think it makes sense.