Using consul for service discovery


Brian Zambrano

Sep 26, 2014, 7:31:08 PM
to consu...@googlegroups.com
I'm investigating Consul and have been reading a ton.  While the concepts in Consul itself make sense, I'm still struggling to find the best way to actually *use* Consul for service discovery and routing.

consul-haproxy looks like a really nice way to handle connecting to services dynamically as things come and go in a cloudy environment.  Some of the scenarios I'm thinking about and trying to design for are:

- a new DB slave comes online/offline
- a new AWS AutoScaled webserver comes online/offline
- the primary database fails over and points to a new host

My question is, what is a suggested way to actually connect to these services as they come and go?  My initial thoughts are:

- run consul-haproxy on every single server and always connect to localhost:PORT.  For example, hosts could connect to localhost:3307 to be routed to 1 of n MySQL slaves.  This would mean that if a host arrived or disappeared, HAProxy would get reloaded on all of the hosts.

- run a bank of HAProxy nodes running consul-haproxy.  Any updates based on services changing would mean only these few HAProxy instances get updated.  Clients would run HAProxy on localhost but would connect to this bank of consul-haproxy instances....this would give us HA since the local HAProxys could health check the bank of consul-haproxy machines.

Thoughts or suggestions appreciated.

BZ

David Pelaez

Sep 28, 2014, 2:53:13 PM
to consu...@googlegroups.com
Brian,

This is my humble opinion, offered mostly with the intention of making your life extremely simple when using Consul. Here it is:

Something like consul-haproxy is typically a fit for externally exposed, load-balanced services with a single entry point that cannot change very often, usually resolved via DNS queries with a not-so-small TTL. These are usually websites and externally exposed APIs where a unique domain should end up pointing to all the nodes offering that service, and you cannot be certain that the consumer's resolver respects TTLs, or that there isn't caching in the middle that could take up to 72 hours to clear. HAProxy is also a must when a well-known port has to be used, since SRV records aren't very widely supported.

However, inside your cluster you can have shorter TTLs, you can know that DNS is properly respected and that there are no DNS caches, and you can use SRV records in many cases. When a service goes down, many consumer implementations (e.g. the redis gem for Ruby) will handle reconnection, and since you'd be using an FQDN (e.g. redis.service.cluster) the language or sometimes the OS would respect the DNS TTL and resolution would happen again. Consul randomly gives you a record for any of the nodes offering redis, effectively giving you a sort of internal load balancing that eliminates the need for HAProxy or similar. But do notice that the port isn't standard and has to be read from the SRV record.
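To make that last point concrete, here is a minimal sketch of pulling the port out of an SRV answer. The record data and hostname below are made up, and in practice the raw answer would come from a resolver library such as dnspython rather than a literal string:

```python
# Sketch: parse the data portion of an SRV record for a Consul service.
# An SRV record's data is "priority weight port target"; Consul puts the
# node's (possibly dynamic) port in the third field.

def parse_srv(rdata):
    """Split an SRV record's data into its four standard fields."""
    priority, weight, port, target = rdata.split()
    return {
        "priority": int(priority),
        "weight": int(weight),
        "port": int(port),
        "target": target.rstrip("."),  # drop the trailing root dot
    }

# e.g. an answer for redis.service.consul might carry data like this:
record = parse_srv("1 1 32768 node1.node.dc1.consul.")
print(record["target"], record["port"])  # node1.node.dc1.consul 32768
```

The consumer then connects to `target:port` instead of assuming the service's well-known port.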

To simplify your question: I think you need consul-haproxy mostly for publicly exposed services where the consumer cannot access the cluster's DNS (.consul domains), or where consumers cannot adapt to a change in the destination port. In any other case you should use SRV records or a tool that queries Consul's registry, and ask yourself some of these questions before you use Consul DNS:

- Will the service consumer reconnect if something fails?
- Will my implementation/language automatically resolve the domain again, respecting the TTL?
- How could I enable the previous two behaviors if they're not 'natively' available?
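One way to enable those behaviors when they're not native is a small wrapper that re-resolves and retries. This is only a sketch with hypothetical callables; a real `resolve` would query Consul DNS, and a real `connect` would open a socket or driver connection:

```python
import time

def connect_with_retry(resolve, connect, retries=3, delay=1.0):
    """Re-resolve the service address and retry the connection on failure.

    `resolve` returns a fresh (host, port) on each call, so a changed DNS
    answer is picked up on reconnect; `connect` may raise OSError.
    """
    last_error = None
    for _ in range(retries):
        host, port = resolve()  # fresh lookup each attempt
        try:
            return connect(host, port)
        except OSError as error:
            last_error = error
            time.sleep(delay)
    raise last_error

# Stand-in callables to show the flow; a real resolver would query Consul DNS.
addresses = iter([("10.0.0.5", 6379), ("10.0.0.9", 6379)])

def flaky_connect(host, port):
    if host == "10.0.0.5":  # pretend the first node just died
        raise OSError("connection refused")
    return (host, port)

print(connect_with_retry(lambda: next(addresses), flaky_connect, delay=0))
# ('10.0.0.9', 6379)
```

The key property is that resolution happens inside the retry loop, so a failed node's record being dropped from Consul DNS is enough to steer the next attempt elsewhere.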

It is true that there can be cases where it gets complicated to point to internal services without something like HAProxy, but this usually has more to do with limitations in the service consumer. If you can get away with adapting the consumers, it's way simpler for you: fewer services to deploy and manage, and a much better use of your resources. With Consul and adapted consumers you get simple and reliable load balancing very easily; if you now add HAProxy you are in some way replicating part of that functionality and introducing more failure points. If things are still very unclear, check an example (Ruby in this case, but similar solutions exist for other languages): https://github.com/WeAreFarmGeek/diplomat where a gem queries the Consul registry to get the relevant information for a PostgreSQL DB.
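If it helps, here is a rough Python analogue of what Diplomat does, sketched against Consul's HTTP health endpoint. The agent address, service name, and payload values are illustrative assumptions:

```python
import json
import random
import urllib.request

def pick_service_address(payload):
    """Pick one passing instance at random from a /v1/health/service reply."""
    entry = random.choice(payload)
    service = entry["Service"]
    # Service.Address may be empty, in which case the node address applies.
    host = service.get("Address") or entry["Node"]["Address"]
    return host, service["Port"]

def discover(name, agent="http://127.0.0.1:8500"):
    """Ask the local Consul agent for healthy instances of `name`."""
    url = "%s/v1/health/service/%s?passing" % (agent, name)
    with urllib.request.urlopen(url) as response:
        return pick_service_address(json.load(response))

# The reply's shape, abbreviated to just the fields used above:
sample = [{"Node": {"Address": "10.0.0.7"},
           "Service": {"Address": "", "Port": 5432}}]
print(pick_service_address(sample))  # ('10.0.0.7', 5432)
```

Calling `discover("postgresql")` against a running agent would return the address and port of one healthy instance.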

I understand that modifying consumers can be difficult in several cases. In that case you could consider something simpler than many HAProxy instances and run routing containers (essentially very similar, but it looks much easier to me); check this to get your head spinning a bit: https://github.com/progrium/ambassadord The tool I linked could actually run outside Docker as the compiled Go binary, but using Docker lets you run it very easily. Since ambassadord can understand SRV records directly, it takes a lot of the HAProxy config off your hands, and I think it's just simpler to run a binary with a parameter or two than HAProxy with configuration files.

Hope this helped in some way.

Cheers

Brian Zambrano

Sep 29, 2014, 3:57:36 PM
to consu...@googlegroups.com
David, thanks for your clear and thorough response...this is exactly the type of feedback I was hoping for.

I absolutely agree that running consul-haproxy in a clustered setup or as individual agents means running entirely new services which themselves need to be maintained, monitored, etc.  And yes, HAProxy replicates a lot of the functionality of Consul when used like this.  In some ways my proposed solutions didn't feel quite right to me....but the biggest question I had was how to deal with the delta between a node failing and a client discovering that failure.

Using HAProxy in front of Consul, the answer is really easy...and I still believe this is a nice feature of the system.  Since HAProxy does health checking, the number of client errors due to unhealthy backends is pretty negligible.  Consul's built-in DNS looks like an elegant way to deal with a dynamic environment.  What I was mostly having trouble understanding was how to use Consul's DNS features for the automatic load balancing while also keeping down the latency between a service failing and a client noticing.  Is setting a 5-second TTL enough, or are there other ways to handle this problem such that clients aren't routed to unhealthy services?

The other thing I was having trouble wrapping my mind around was using SRV records for discovering ports.  Adapted consumers seem like the only real way that would work.  We're a Django/Python shop, so writing a bit of code to get a port number from Consul inside a settings file would be quite trivial, as is done in the Diplomat Ruby gem you linked to.
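For what it's worth, that settings-file lookup might be sketched like this. The service name, agent address, database name, and fallback values are all assumptions, and an unreachable agent falls back to static settings:

```python
# settings.py sketch: resolve the database host/port from Consul at startup,
# falling back to static values if the local agent can't be reached.
import json
import urllib.request

def consul_service(name, default, agent="http://127.0.0.1:8500"):
    """Return (host, port) for the first healthy instance, or `default`."""
    url = "%s/v1/health/service/%s?passing" % (agent, name)
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            entries = json.load(response)
        service = entries[0]["Service"]
        host = service.get("Address") or entries[0]["Node"]["Address"]
        return host, service["Port"]
    except (OSError, IndexError, KeyError, ValueError):
        return default  # agent down or no healthy instances: use the default

DB_HOST, DB_PORT = consul_service("postgresql", ("127.0.0.1", 5432))

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "app",
        "HOST": DB_HOST,
        "PORT": str(DB_PORT),
    }
}
```

One caveat with this pattern: the lookup happens once at process start, so a long-lived worker won't see later changes without a reload.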

This is very helpful feedback indeed.  If there are any other ideas on using Consul's DNS functionality while also handling failures gracefully from within Python processes I'd love to hear them.  Even a low TTL for .consul lookups (5-10 seconds) seems too long to ask clients to tolerate connection failures.

BZ

Armon Dadgar

Sep 29, 2014, 5:24:54 PM
to consu...@googlegroups.com, Brian Zambrano
Hey Brian,

I would echo the same sentiment: using HAProxy to manage the connection pools is a nice
feature when you absolutely need the performance or cannot modify the interaction between
various services in your infrastructure.

In terms of ease of use and maintainability, using the DNS interface is the way to go. By default
there is a 0 TTL, so you don’t get any stale lookups unless you explicitly allow it.
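For anyone who does want to allow caching, it is an agent configuration option; a sketch of what that can look like in the agent's JSON config, with an illustrative 5s value (check the docs for your Consul version for the exact keys):

```json
{
  "dns_config": {
    "service_ttl": {
      "*": "5s"
    }
  }
}
```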

With respect to the SRV records, they do a good job handling the case of dynamic ports, but
in many cases you can also just use static port assignment for services in your fleet and
just use the standard name resolution with a well known port.
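In the well-known-port case the consumer needs nothing beyond ordinary name resolution. A sketch, where the .consul hostname is an assumption that only resolves on a host whose resolver forwards .consul queries to a Consul agent:

```python
import socket

def resolve_service(hostname, port):
    """Plain A-record resolution; Consul's DNS answers like any nameserver."""
    # With a fixed, well-known port there is no need for an SRV lookup:
    # the ordinary OS resolver path is enough.
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    address, resolved_port = infos[0][4]
    return address, resolved_port

# e.g. resolve_service("redis.service.consul", 6379), assuming the host's
# resolver forwards .consul queries to a Consul agent.
```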

Hope that helps!

Best Regards,
Armon Dadgar

David Pelaez

Oct 1, 2014, 12:14:06 PM
to consu...@googlegroups.com, bri...@gmail.com
Armon & Brian,

glad it helped; luckily for me, your reply in return has opened a very
concrete question. I was wondering if there's a downside to having very low
intervals on checks, to detect faster that a service isn't working properly.
Would a 1s interval affect memory consumption or strain the
cluster in some way? This seems like a very specific question that Armon
could answer :)

Armon also makes an obvious point that I was missing in a way: use a known
port and you can use DNS without any changes. I mostly ignored this because
of my internal use of the tool, where we run everything inside containers
with dynamic port allocation on the host, but we have a notable
exception, the nsq messaging daemons, where we simply use the known port, so
there's nothing but DNS involved in using Consul with those tools. Also, a TTL
of 0 for DNS is fine given that DNS resolution should be almost instant, so
the benefit of caching with a TTL can be forgone in exchange for low
round-trip times to the DNS server.

Cheers



Armon Dadgar

Oct 1, 2014, 12:41:57 PM
to David Pelaez, consu...@googlegroups.com, bri...@gmail.com
Hey David,

Because the health check is run locally, reducing the interval will only affect the load
on that particular machine. If the check is relatively cheap (checking for a 200 status code, etc.),
then there are no issues with it. As a note, a major optimization is that health check
output is not synced immediately if the state does not change (see the check_update_interval
configuration).
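A sketch of what that pairing might look like in a service definition with a tight check interval, alongside the agent-level check_update_interval; the values and the check script are illustrative, and the exact keys should be checked against the docs for your Consul version:

```json
{
  "check_update_interval": "5m",
  "service": {
    "name": "web",
    "port": 8000,
    "check": {
      "script": "curl -sf http://localhost:8000/health",
      "interval": "1s"
    }
  }
}
```

With this shape, the check runs every second on the local agent, but unchanged check output is only synced to the servers at the slower check_update_interval; state *changes* still propagate promptly.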

Best Regards,
Armon Dadgar

David Pelaez

Oct 1, 2014, 12:44:12 PM
to consu...@googlegroups.com, pela...@gmail.com, bri...@gmail.com
OK great to know. Thanks for the explanation, it makes a lot of sense and I didn't know about the check_update_interval config. Thanks.

