Failover support in consul

DR

Apr 19, 2016, 12:53:52 AM
to Consul
Hi experts,

I recently delved into Consul and must say it is a great way to do service registration and discovery.

Here is what I plan to achieve:
1. A DC where there are multiple clients using multiple services.
2. A central agent which is responsible for forwarding requests/responses to the correct service/client based on some routing logic.
3. All the clients and services are connected via the central agent described in #2.

My questions are as follows:
1. To keep this central agent robust, I would like to run it in HA mode (1 active + 1 standby). Does Consul provide any mechanism with which I can fail over to the backup node seamlessly if the active node fails?
2. I configured a watch in this central agent to detect services registering with or deregistering from the Consul cluster, based on which the routing logic is updated:
    consul watch -type services <watch-handler>

But I observed that the "services" watch only provides the list of services currently registered in the cluster. So, to find out which services have just been registered or deregistered, I first have to work out which service changed and then query again by service name to get the detailed info.
I think there should be a better way to do this. So my next question is: what is the best way to monitor services registering and deregistering?

With best regards,
DR

cosmo....@anchor.com.au

Apr 19, 2016, 3:16:16 AM
to Consul
Hi DR,

I'm far from an expert but I think I can help out.

On Tuesday, April 19, 2016 at 2:53:52 PM UTC+10, DR wrote:
> 1. To keep this central agent robust, I would like to run it in HA mode (1 active + 1 standby). Does Consul provide any mechanism with which I can fail over to the backup node seamlessly if the active node fails?

Is there a reason the agent needs to run in active+standby mode? Can you instead run multiple active instances of the agent and load balance traffic between them?

If so, you can run an arbitrary number of copies of the agent on different nodes, register them all in consul with a health check, then access them via consul's DNS API [0].

If not, then I believe you'll have to implement failover yourself. It would be difficult for consul to do this on its own, since it doesn't have a way of knowing if the necessary failover steps have taken place within the agent itself.

You could also do something like register the 'active' and 'standby' modes as separate services, each with a health check that will fail unless the agent is in that specific mode, but you should avoid that if at all possible. I'd strongly advise architecting the agent so that it can have multiple active instances running at the same time. That will also make scaling your agent out far easier should it ever become a bottleneck.
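
For illustration, here is a minimal sketch of what registering each copy of the agent with a health check could look like. The service name "router", port 8080 and the /health URL are made-up placeholders, not anything from this thread; the output is intended as a service definition file for the local consul agent, and the second helper just builds the DNS name that clients would look up.

```python
import json


def service_definition(name, port, health_url, interval="10s"):
    """Build a service definition with a single HTTP health check."""
    return {
        "service": {
            "name": name,
            "port": port,
            "check": {"http": health_url, "interval": interval},
        }
    }


def dns_name(name, domain="consul"):
    """DNS name under which healthy instances of the service are returned."""
    return f"{name}.service.{domain}"


# Placeholder values: every node running a copy would load the same definition.
definition = service_definition("router", 8080, "http://localhost:8080/health")
print(json.dumps(definition, indent=2))
print(dns_name("router"))
```

Instances whose check is failing are left out of the DNS answer, which is what makes the "multiple active copies" approach work without any extra failover logic.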

> But I observed that the "services" watch only provides the list of services currently registered in the cluster. So, to find out which services have just been registered or deregistered, I first have to work out which service changed and then query again by service name to get the detailed info.
> I think there should be a better way to do this. So my next question is: what is the best way to monitor services registering and deregistering?

I think you're correct about this restriction. As per the documentation on watches[1], the services watch type "maps to the /v1/catalog/services API internally". That API path only returns a list of services, not the specifics of each service[2].

The "service" watch, which maps to /v1/catalog/service/<service>, does appear to return the full catalog entry for a service. You'll probably want to set up your agent to use a "services" watch to identify when services are added or removed, plus a set of "service" watches (one per service) to catch specific changes to each service.
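
As a rough sketch of the first half of that, assuming the handler keeps the previous snapshot around somewhere: the "services" watch hands its handler a JSON map of service name to tags, so finding what was added or removed is a set difference between two snapshots. The function names and sample data below are invented for the example.

```python
import json


def diff_services(previous, current):
    """Return (added, removed) service names between two snapshots.

    Each snapshot is a mapping of service name -> list of tags, which is
    the shape returned by /v1/catalog/services.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    return added, removed


def handle_snapshot(raw_json, previous):
    """Parse one watch payload and report what changed since `previous`."""
    current = json.loads(raw_json)
    added, removed = diff_services(previous, current)
    return current, added, removed


# Made-up snapshots standing in for two consecutive watch invocations.
before = {"consul": [], "web": ["v1"]}
payload = json.dumps({"consul": [], "web": ["v1"], "redis": ["primary"]})
current, added, removed = handle_snapshot(payload, before)
print("added:", added, "removed:", removed)
```

Each name in `added` is then a candidate for its own "service" watch, and each name in `removed` is one whose watch can be torn down.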

You may also want to check out fabio[3], an HTTP router which is configured via consul. It sounds like there's some overlap with what you're trying to achieve, though note that fabio doesn't currently support raw TCP[4]. Even if it doesn't help out directly you might get some ideas from the way it deals with consul[5].

David Adams

Apr 19, 2016, 10:36:26 AM
to consu...@googlegroups.com
On Tue, Apr 19, 2016 at 2:16 AM, <cosmo....@anchor.com.au> wrote:
> Is there a reason the agent needs to run in active+standby mode? Can you instead run multiple active instances of the agent and load balance traffic between them?

> If so, you can run an arbitrary number of copies of the agent on different nodes, register them all in consul with a health check, then access them via consul's DNS API [0].

I'd like to jump in here because this hits on a use case I'd really like to see Consul support. Cosmo is correct that Consul doesn't support a nice active-passive failover case at the moment, but there's no reason it couldn't. Right now DNS results for a particular service or query are returned in random order. That's great for a fully round-robin service, but it's not so great for other scenarios. For example, we've got a service where we'd like to do a soft active-passive model, wherein we have identical database proxies running for each of our database servers in each AWS availability zone in which our application operates. Any database proxy can talk to any database server, but in the optimal case when everything is running, we prefer each group of app servers in one AZ to talk to its database server via its default proxies. But when a proxy server fails, we'd like the DNS to route traffic to the nearest (network topology-wise) working proxy. But the current model for stored queries (as far as I understand it) is not robust enough for ordering by network proximity, nor for returning a limited number of results, nor for using service tags for priority. I've poked around at the Go code for this stuff a little bit, but it's a bit beyond my ability or spare time as yet.

Anyway, the point is that while what we want is somewhat complicated and maybe relatively unique, the case of a hard active-passive DNS failover would be relatively easy to accomplish via some basic tag priority, e.g., by creating a stored query that returns an ordered list of nodes with certain services and certain tags, prioritizes a given extra tag (like "active"), and returns only one result. Then if the active service went offline, DNS would simply swap over to the failover box. The services would be responsible for performing their own failover logic, but for a lot of services, no such logic is necessary.
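
To make the idea concrete, here is a client-side sketch of that selection rule. It is not something Consul does itself; it assumes the caller has already fetched the passing instances from /v1/health/service/<service>?passing, and the function name, addresses and tags are invented for the example.

```python
def pick_instance(entries, preferred_tag="active"):
    """Return (address, port) of one instance, preferring `preferred_tag`.

    `entries` is a list shaped like the /v1/health/service/<service>
    response: each item has a "Node" and a "Service" object.
    """
    if not entries:
        return None
    # sorted() is stable, so instances without the tag keep their order
    # and are only used when no tagged instance is present.
    ranked = sorted(
        entries,
        key=lambda e: preferred_tag not in (e["Service"].get("Tags") or []),
    )
    service = ranked[0]["Service"]
    # An empty service address means "use the node's address".
    address = service.get("Address") or ranked[0]["Node"]["Address"]
    return address, service["Port"]


# Made-up passing instances: one standby, one active.
instances = [
    {"Node": {"Address": "10.0.0.2"}, "Service": {"Tags": ["standby"], "Address": "", "Port": 9000}},
    {"Node": {"Address": "10.0.0.1"}, "Service": {"Tags": ["active"], "Address": "", "Port": 9000}},
]
print(pick_instance(instances))
```

If the "active" instance fails its health check it drops out of the passing list, and the same call returns the standby.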

Sorry for taking a tangent there. I suppose I should file a feature request for this, eh?

-dave

cosmo....@anchor.com.au

Apr 19, 2016, 7:58:24 PM
to Consul
On Wednesday, April 20, 2016 at 12:36:26 AM UTC+10, David Adams wrote:
> I'd like to jump in here because this hits on a use case I'd really like to see Consul support. Cosmo is correct that Consul doesn't support a nice active-passive failover case at the moment, but there's no reason it couldn't. Right now DNS results for a particular service or query are returned in random order. That's great for a fully round-robin service, but it's not so great for other scenarios. For example, we've got a service where we'd like to do a soft active-passive model, wherein we have identical database proxies running for each of our database servers in each AWS availability zone in which our application operates. Any database proxy can talk to any database server, but in the optimal case when everything is running, we prefer each group of app servers in one AZ to talk to its database server via its default proxies. But when a proxy server fails, we'd like the DNS to route traffic to the nearest (network topology-wise) working proxy. But the current model for stored queries (as far as I understand it) is not robust enough for ordering by network proximity, nor for returning a limited number of results, nor for using service tags for priority. I've poked around at the Go code for this stuff a little bit, but it's a bit beyond my ability or spare time as yet.

Good point. Maybe what you're describing would also be a good case for some kind of latency-based ordering of results? The API can already do that via the `?near=NODE` parameter[0]. It would be nice to have something similar for DNS queries, since it isn't always feasible to have a service use consul's API directly.
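
For reference, a small sketch of what such an API lookup might look like. It assumes the agent's HTTP API is on the default local address, and the service and node names are placeholders; the helper only builds the URL and makes no request.

```python
from urllib.parse import quote, urlencode


def catalog_service_url(service, near, base="http://127.0.0.1:8500"):
    """URL that lists a service's nodes sorted by estimated distance from `near`."""
    query = urlencode({"near": near})
    return f"{base}/v1/catalog/service/{quote(service, safe='')}?{query}"


# Placeholder names: "db-proxy" is the service, "app-server-1" the reference node.
print(catalog_service_url("db-proxy", "app-server-1"))
```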

> Sorry for taking a tangent there. I suppose I should file a feature request for this, eh?

It looks like issue 1584 [1] is pretty close to the use case that you and DR are trying to handle, though there's not much discussion there as of yet.

Chris Stevens

Apr 19, 2016, 9:15:04 PM
to Consul
Does the (relatively new) prepared query templates feature achieve what you want via either DNS or API?

cosmo....@anchor.com.au

Apr 19, 2016, 10:18:06 PM
to Consul
On Wednesday, April 20, 2016 at 11:15:04 AM UTC+10, Chris Stevens wrote:
> Does the (relatively new) prepared query templates feature achieve what you want via either DNS or API?

That's a great point, Chris. However, I don't believe they'll help with failover unless the active and standby services are in separate consul DCs.

As for latency-based results, the docs note that `?near` can be used when executing a query via the API, but that's not too helpful when executing via DNS. I guess having a separate consul datacentre in each availability zone would solve that, but that's quite extreme. For David's use case it would be nice if there was a way to prepare a query such that it would return N of the nearest results, or something like that.
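
To show where both limitations come from, here is a hedged sketch of a plain (non-template) prepared query. The query and service names are made up; the body would be POSTed to /v1/query. The Failover block only kicks in across datacenters, and `near` is only available on the HTTP execute URL, not on the DNS name.

```python
import json
from urllib.parse import quote, urlencode


def failover_query(name, service, nearest_n=2):
    """Prepared query that falls back to the nearest N other datacenters."""
    return {
        "Name": name,
        "Service": {
            "Service": service,
            "OnlyPassing": True,
            "Failover": {"NearestN": nearest_n},
        },
    }


def execute_url(name, near, base="http://127.0.0.1:8500"):
    """HTTP execution of the query, sorted by distance from node `near`."""
    return f"{base}/v1/query/{quote(name, safe='')}/execute?{urlencode({'near': near})}"


def query_dns_name(name, domain="consul"):
    """DNS name for the same query; there is no place to pass `near` here."""
    return f"{name}.query.{domain}"


# Placeholder names for illustration only.
print(json.dumps(failover_query("router-failover", "router"), indent=2))
print(execute_url("router-failover", "app-server-1"))
print(query_dns_name("router-failover"))
```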

David Adams

Apr 20, 2016, 5:13:31 AM
to consu...@googlegroups.com
As Cosmo said, currently the HTTP API allows sorting by network coordinates with prepared query results at query time, but via the DNS API there's no way to do this. And honestly, if I have to look things up with the HTTP API using query-time parameters, then prepared queries are doing very little for me.

I'd also like to be able to prioritize on other factors as well. Issue #1584 that Cosmo pointed out is part of what I want, but not all of it. I am going to try to put some time into a PR or two that will add the features I'm looking for, because some of it seems relatively straightforward.


cosmo....@anchor.com.au

Apr 20, 2016, 9:33:18 PM
to Consul
On Wednesday, April 20, 2016 at 7:13:31 PM UTC+10, David Adams wrote:
> I'd also like to be able to prioritize on other factors as well. [...] I am going to try to put some time into a PR or two that will add the features I'm looking for, because some of it seems relatively straightforward.

That sounds fantastic! 