New to Consul, some questions and observations

Chris Miller

Nov 5, 2014, 12:29:20 PM

to consu...@googlegroups.com

I've been playing with a test cluster and am impressed with what I've seen so far. I did hit a few bumps along the way and have a couple of questions though.

Suppose I have three servers, S1, S2 and S3. Server S1 starts running a Consul server first, followed by S2 then S3.

When using -bootstrap-expect 3, using -retry-join on all servers works as I expected but -join does not. I was expecting S1 to try once to join with (the not yet existing) S2 and S3 Consul instances, then wait there for S2 and S3 to eventually join it. Instead, Consul on S1 just exits immediately with "connection refused". Is that intentional? It seems more intuitive to me that -bootstrap-expect should mean S1 stays running/waiting rather than making a quick exit.

For ease of admin and maintenance, I'd like to be able to use the exact same configuration for all three servers. The sticking point currently is that each server has to have a list of the other servers to -retry-join with BUT it isn't allowed to include itself in that list. For example, if S1 is started with -retry-join=S1 -retry-join=S2 -retry-join=S3, the following happens:
...
2014/11/05 13:52:15 [INFO] agent: Joining cluster...
2014/11/05 13:52:15 [INFO] agent: (LAN) joining: [S1, S2, S3]
2014/11/05 13:52:15 [INFO] agent: (LAN) joined: 1 Err: <nil>
2014/11/05 13:52:15 [INFO] agent: Join completed. Synced with 1 initial agents
2014/11/05 13:52:17 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
2014/11/05 13:52:41 [ERR] agent: failed to sync remote state: No cluster leader
2014/11/05 13:52:59 [ERR] agent: failed to sync remote state: No cluster leader
2014/11/05 13:53:25 [ERR] agent: failed to sync remote state: No cluster leader
...

If S1 is started with just -retry-join=S2 -retry-join=S3 however, the following (correct) behaviour is seen instead:
...
2014/11/05 13:55:08 [INFO] agent: Joining cluster...
2014/11/05 13:55:08 [INFO] agent: (LAN) joining: [S2, S3]
2014/11/05 13:55:08 [INFO] agent: (LAN) joined: 0 Err: dial tcp 10.0.0.10:8301: connection refused
2014/11/05 13:55:08 [WARN] agent: Join failed: dial tcp 10.0.0.10:8301: connection refused, retrying in 30s
2014/11/05 13:55:09 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
2014/11/05 13:55:26 [ERR] agent: failed to sync remote state: No cluster leader
2014/11/05 13:55:38 [INFO] agent: (LAN) joining: [S2, S3]
2014/11/05 13:55:38 [INFO] agent: (LAN) joined: 0 Err: dial tcp 10.0.0.10:8301: connection refused
2014/11/05 13:55:38 [WARN] agent: Join failed: dial tcp 10.0.0.10:8301: connection refused, retrying in 30s
2014/11/05 13:55:53 [ERR] agent: failed to sync remote state: No cluster leader
...

Would it be possible for each node to detect itself in the -retry-join (and -join) list and then just ignore that entry, moving on to try and join the remaining nodes instead? That would make configuration and admin much easier.

The "Introduction to Consul" docs (https://consul.io/intro/) state the following: "Every node that provides services to Consul runs a Consul agent". Does that strictly need to be true? Why can't I run a service on say S4, and have the service register itself (and set up health checks etc) with S1, S2 or S3? I haven't tried this myself yet so apologies if the reason will become obvious to me when I do. It would be helpful if the reasoning was made clear in the documentation however to save people from trying to figure it out for themselves.

I've read the "DNS Interface" docs and I saw the "Bootstrapping Consul" thread from a few weeks ago (https://groups.google.com/forum/?fromgroups#!topic/consul-tool/lyJ5jBDw1A8) which has been helpful to me, but it would be great if the Consul documentation gave a clearer overview of the various ways to bootstrap an application so it can find the cluster dynamically in the first place. For example, suppose I have a mobile device (or laptop, desktop PC etc) on the intranet with no Consul agent running on it. The device is running an app that wants to find the cluster so it can use the KV store, query for available services etc. It seems a DNS query would need to know a pre-existing node or service name (neither of which might be available), whereas an HTTP API request requires a URL (which in this case is not localhost). That seems to imply the app needs a predefined URL which points to a load balancer. The load balancer then needs to somehow figure out a (random, healthy) node to pass the request on to. I can't find it in the docs but perhaps Consul already provides a standard service that can help with this? If not, would it be possible to query the DNS or have a built-in service to get details of a (random, healthy) node that HTTP requests can then be sent to? Or am I missing something here (e.g. I know DNS wouldn't be able to provide a port number)? Any thoughts on how best to tackle this would be greatly appreciated.

This ended up a lot longer than I expected. Thanks to those who managed to read through this far, and thanks HashiCorp for creating a very useful tool!

Armon Dadgar

Nov 5, 2014, 2:02:23 PM

to consu...@googlegroups.com, Chris Miller

Hey Chris,

Answers are inlined below!

Best Regards,

Armon Dadgar

From: Chris Miller <chri...@gmail.com>
Reply: Chris Miller <chri...@gmail.com>
Date: November 5, 2014 at 9:29:21 AM
To: consu...@googlegroups.com <consu...@googlegroups.com>
Subject: New to Consul, some questions and observations

I've been playing with a test cluster and am impressed with what I've seen so far. I did hit a few bumps along the way and have a couple of questions though.

Suppose I have three servers, S1, S2 and S3. Server S1 starts running a Consul server first, followed by S2 then S3.

When using -bootstrap-expect 3, using -retry-join on all servers works as I expected but -join does not. I was expecting S1 to try once to join with (the not yet existing) S2 and S3 Consul instances, then wait there for S2 and S3 to eventually join it. Instead, Consul on S1 just exits immediately with "connection refused". Is that intentional? It seems more intuitive to me that -bootstrap-expect should mean S1 stays running/waiting rather than making a quick exit.

This is intentional. The semantics for `-join` are to attempt once, and to exit if none of the joins succeed. We added `-retry-join` to continuously attempt the join until we succeed.
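
The difference between the two flags can be sketched as a small model. This is not Consul's implementation, only an illustration of the semantics described above; the function names, the injected `try_join` callback, and the `max_attempts` parameter are hypothetical, and the 30 second interval is taken from the "retrying in 30s" log line earlier in the thread.

```python
import time
from typing import Callable, Optional, Sequence


def attempt_join(peers: Sequence[str], try_join: Callable[[str], bool]) -> int:
    """Try every peer once and return how many joins succeeded."""
    return sum(1 for peer in peers if try_join(peer))


def join_once(peers: Sequence[str], try_join: Callable[[str], bool]) -> int:
    """Model of `-join`: a single attempt; give up if no peer was reachable."""
    joined = attempt_join(peers, try_join)
    if joined == 0:
        raise ConnectionError("join failed: no peers reachable")
    return joined


def retry_join(
    peers: Sequence[str],
    try_join: Callable[[str], bool],
    interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: Optional[int] = None,  # hypothetical bound, handy for testing
) -> int:
    """Model of `-retry-join`: repeat the attempt until at least one join succeeds."""
    attempts = 0
    while True:
        attempts += 1
        joined = attempt_join(peers, try_join)
        if joined > 0:
            return joined
        if max_attempts is not None and attempts >= max_attempts:
            raise ConnectionError(f"join failed after {attempts} attempts")
        sleep(interval)
```

In this model a single reachable peer is enough to count as success, which matches the "joined: 1" log line above where only the local node answered.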

For using the same configuration on all three servers, probably the best way is to list all the nodes, including the local one, as part of a `-join`. This way, the `-join` to the local node will always succeed so the agent won’t quit, and the 2nd and 3rd nodes will connect to the nodes that are already live and join properly.
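
One way to keep a single shared server list is to generate each agent's command line from it. The sketch below is an assumption-laden illustration, not an official tool: `build_agent_args`, `self_name`, and the `/var/lib/consul` data directory are hypothetical. It keeps the local node in the list for `-join` (as suggested above) and drops it for `-retry-join`, since the logs earlier in the thread show that a successful join to itself stops the retries.

```python
from typing import List, Sequence


def build_agent_args(
    servers: Sequence[str],
    self_name: str,
    retry: bool = True,
    data_dir: str = "/var/lib/consul",  # hypothetical path
) -> List[str]:
    """Build a `consul agent` command line from one shared server list."""
    flag = "-retry-join" if retry else "-join"
    # With -retry-join, skip our own entry so a join to ourselves does not
    # count as success. With -join, keep it so the single attempt cannot fail.
    peers = [s for s in servers if not (retry and s == self_name)]
    args = [
        "consul",
        "agent",
        "-server",
        f"-bootstrap-expect={len(servers)}",
        f"-data-dir={data_dir}",
    ]
    args += [f"{flag}={peer}" for peer in peers]
    return args
```

For example, `build_agent_args(["S1", "S2", "S3"], "S1")` ends with `-retry-join=S2 -retry-join=S3`, while passing `retry=False` yields `-join=S1 -join=S2 -join=S3`. How `self_name` is determined (hostname, IP, configuration management variable) is left open here.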

The "Introduction to Consul" docs (https://consul.io/intro/) state the following: "Every node that provides services to Consul runs a Consul agent". Does that strictly need to be true? Why can't I run a service on say S4, and have the service register itself (and set up health checks etc) with S1, S2 or S3? I haven't tried this myself yet so apologies if the reason will become obvious to me when I do. It would be helpful if the reasoning was made clear in the documentation however to save people from trying to figure it out for themselves.

It isn’t strictly necessary. You *could* use the API to manage all the registration, de-registration, health checks, etc. It’s just a lot of work, and you will end up rebuilding the agent essentially.
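
As a rough illustration of the agentless route, a service on S4 could be registered directly in the catalog through a server's HTTP API. The sketch below only builds a request for the `/v1/catalog/register` endpoint on the default HTTP port 8500; the helper name and the example values (address `10.0.0.14`, service `web`, port `8080`) are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def catalog_register_request(
    server: str,
    node: str,
    address: str,
    service: str,
    port: int,
    service_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a catalog registration for an agentless node."""
    payload = {
        "Node": node,
        "Address": address,
        "Service": {
            "ID": service_id or service,  # default the ID to the service name
            "Service": service,
            "Port": port,
        },
    }
    return urllib.request.Request(
        url=f"http://{server}:8500/v1/catalog/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Example with hypothetical values; send it with urllib.request.urlopen(req).
req = catalog_register_request("S1", "S4", "10.0.0.14", "web", 8080)
```

This covers registration only. Running the health checks, updating their status, and de-registering the service when it goes away would all have to be done by the application itself, which is the work the agent otherwise does.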

I've read the "DNS Interface" docs and I saw the "Bootstrapping Consul" thread from a few weeks ago (https://groups.google.com/forum/?fromgroups#!topic/consul-tool/lyJ5jBDw1A8) which has been helpful to me, but it would be great if the Consul documentation gave a clearer overview of the various ways to bootstrap an application so it can find the cluster dynamically in the first place. For example, suppose I have a mobile device (or laptop, desktop PC etc) on the intranet with no Consul agent running on it. The device is running an app that wants to find the cluster so it can use the KV store, query for available services etc. It seems a DNS query would need to know a pre-existing node or service name (neither of which might be available), whereas an HTTP API request requires a URL (which in this case is not localhost). That seems to imply the app needs a predefined URL which points to a load balancer. The load balancer then needs to somehow figure out a (random, healthy) node to pass the request on to. I can't find it in the docs but perhaps Consul already provides a standard service that can help with this? If not, would it be possible to query the DNS or have a built-in service to get details of a (random, healthy) node that HTTP requests can then be sent to? Or am I missing something here (eg I know DNS wouldn't be able to provide a port number)? Any thoughts on how best to tackle this would be greatly appreciated.

So typically with the agent-on-every-machine approach, you always rely on the agent existing at localhost, so you don’t need to determine the DNS/HTTP address. Without a local agent this gets more complex, and it goes back to #2 of why not to write your own agent :) For the bootstrapping issue, there is always a well-known address problem. Either you need to know the IP address of an existing node, or the DNS name (which requires a well-known DNS IP address), or you use a separate bootstrapping service (well-known DNS IP+Hostname). There isn’t necessarily a great answer to this and it depends on the context of your infrastructure. In many cases, depending on 3 well-known IPs or a DNS record works well.
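
The "3 well-known IPs" option can be sketched on the client side as follows: the app ships with a short list of server URLs and picks a random one that responds. This is a minimal sketch under assumptions, not a recommended client. The function names are hypothetical, and the probe assumes that `/v1/status/leader` returns an empty JSON string when there is no leader.

```python
import random
import urllib.request
from typing import Callable, Optional, Sequence


def http_probe(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers and reports a cluster leader."""
    try:
        with urllib.request.urlopen(
            f"{base_url}/v1/status/leader", timeout=timeout
        ) as resp:
            status = resp.status
            body = resp.read().decode("utf-8").strip()
    except OSError:  # connection refused, timeout, HTTP error, ...
        return False
    # Assumption: an empty JSON string means no leader has been elected.
    return status == 200 and body not in ("", '""')


def pick_server(
    base_urls: Sequence[str],
    probe: Callable[[str], bool] = http_probe,
    rng=random,
) -> Optional[str]:
    """Return a random responsive server URL, or None if none respond."""
    candidates = list(base_urls)  # copy so the caller's list is not reordered
    rng.shuffle(candidates)
    for url in candidates:
        if probe(url):
            return url
    return None
```

A client would call something like `pick_server(["http://S1:8500", "http://S2:8500", "http://S3:8500"])` at startup and send its KV or catalog queries to the returned URL. The same loop works if the list comes from a DNS record instead of being hard-coded.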

This ended up a lot longer than I expected. Thanks to those who managed to read through this far, and thanks HashiCorp for creating a very useful tool!

Glad you find it useful!
