Hi folks,

I've been experimenting with Couchbase's clustering and hit an unexpected hiccup with its behavior on node failure.

I have 2 database servers running in a simple master/standby setup, and I wanted to add Couchbase to both machines in as highly available a configuration as possible. Initially I just set up an independent Couchbase instance on each machine. Since we're mostly using CB for session storage, losing the data is bad but not horrible; it's more important to us that the app stays available, with a quick switch-over if the primary fails.

So I was hoping to do one better and cluster the 2 nodes, so that if the primary failed we could simply direct traffic at the secondary and have it continue serving while preserving our data. However, I've found it does not work that way. Even worse, I found I was actually reducing our availability: if the backup went down, it took out the primary too, and requests started coming back with messages like "failed with: SERVER_ERROR proxy write to downstream 192.168.0.16".
I've seen that it's possible to do auto-failover with 3 nodes, but even that is delayed by some 30 seconds, and we don't have a third machine available anyway.
I've also seen that we could write our own fail-over system: use the API to monitor the servers and make API calls that do the equivalent of pressing the fail-over button. Before going down that path, though, I was wondering whether this is a known use case with a standard solution that others are using. Is there a standard way to do a master/backup type setup?
Hi Aliaksey, Chad,
Thanks for the responses --
I saw the github article, definitely good food for thought.
I understand, and thanks for the extra explanations! I was initially taken aback when taking one node offline broke the cluster, as I had imagined the cluster would somehow keep working automagically. I see now that's what auto-failover is for. Do I have it right that the same would apply to a cluster of 10 nodes, e.g. if 1 node goes down, the whole cluster can no longer write for ~30 seconds until auto-failover (or an admin) removes the broken node?
I think the problem is that what I really want is a replication solution rather than a cluster solution. I'd like a primary master that replicates out to a standby secondary, analogous to our existing master/slave setup for our Postgres db, where we use Slony for that purpose. However, I don't think CB currently provides a plain replication facility?
I've been mulling over the various A/B failover methods and concluded it's probably not worth the effort compared to just adding a third node and using auto-failover.
However if we do want to just use the 2 CB nodes, we could use our load balancers as the management nodes and write a check for heartbeat to monitor the CB service on both machines, something like:
- Check A is alive, check B is alive
- If both nodes are alive or both nodes are dead, take no action
- If one node is dead and one node is alive, use the API on the live node to fail over the dead node, and also remove the dead node from the load balancer VIP
- Disable further actions, notify admins
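To make the decision step concrete, here's a minimal sketch of the controller logic as a pure function; the node names, the boolean liveness inputs, and the one-shot `already_acted` latch are my assumptions, and the actual probing and API calls are left out:

```python
# Hypothetical sketch of the 2-node failover controller's decision step,
# intended to run on a third-party host (e.g. the load balancers).
def decide(a_alive, b_alive, already_acted):
    """Return ('failover', dead_node) or None for no action.

    already_acted latches the controller after its first intervention,
    matching the "disable further actions, notify admins" step.
    """
    if already_acted:
        return None                    # act at most once, then hand off to admins
    if a_alive == b_alive:
        return None                    # both alive or both dead: do nothing
    dead_node = "B" if a_alive else "A"
    return ("failover", dead_node)     # fail over the dead node via the live one

print(decide(True, False, False))  # → ('failover', 'B')
print(decide(True, True, False))   # → None
```

The "both dead: do nothing" branch is what keeps a controller-side network partition from failing over a healthy node, and the one-shot latch prevents flapping if a node recovers mid-intervention.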
Would there be any problem with that?
Thanks!
Andrew
Aliaksey -

I was talking before about scripting a 2-node fail-over setup, with a check something like the one below. You mentioned that if the check ran on the nodes themselves we could get a split brain (definitely agree), but I was thinking about running the checks from a third-party server (or servers). We have 2 other machines we use as load balancers, so I could have them act as the controller. Would you see any problem with this setup?
- Check A is alive, check B is alive
- If both nodes are alive or both nodes are dead, take no action
- If one node is dead and one node is alive, use the API on the live node to fail over the dead node, and also remove the dead node from the load balancer VIP
- Disable further actions, notify admins
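The "use the API on the live node to fail over the dead node" step could look something like the sketch below, which only builds the HTTP request rather than sending it. I'm assuming Couchbase's admin REST endpoint `POST /controller/failOver` on port 8091 with an `otpNode` parameter and basic auth; the host names and credentials are placeholders:

```python
# Hypothetical sketch: construct the REST call a controller would use to
# fail over a dead Couchbase node via the surviving node's admin port.
import base64
import urllib.parse
import urllib.request

def build_failover_request(live_node, dead_otp_node, user, password):
    """Build (but do not send) the failover POST for the live node's admin API."""
    url = "http://%s:8091/controller/failOver" % live_node
    body = urllib.parse.urlencode({"otpNode": dead_otp_node}).encode()
    req = urllib.request.Request(url, data=body)  # data= makes this a POST
    creds = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + creds)
    return req

req = build_failover_request(
    "192.168.0.15", "ns_1@192.168.0.16", "Administrator", "password")
# urllib.request.urlopen(req) would actually trigger the failover;
# the controller would then also pull the dead node from the LB VIP.
```

Keeping the request construction separate from sending it makes the controller easy to dry-run and test before pointing it at live nodes.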