Consul-Agent architecture .. the node-id issue after upgrading to 0.8.1 - conceptual issue?


Eugen Mayer

May 2, 2017, 4:33:40 AM
to Consul
Hello,

I am not sure where the root of my problem actually comes from, so I will try to paint the bigger picture.

In short, the symptom: after upgrading from 0.7.3 to 0.8.1, my agents (explained below) could no longer connect to the cluster leader due to duplicate node IDs (why that probably happens is also explained below).
I could neither fix it with https://www.consul.io/docs/agent/options.html#_disable_host_node_id nor fully understand why I run into this, and that is where the bigger picture and maybe even several different questions come from.

I have the following setup:

1. I run an application stack with about 8 containers for different services (different microservices, DB types and so on).

2. I use a single Consul server per stack (yes, the Consul server runs inside the software stack; it has its reasons, because I need this to be offline-deployable and every stack lives for itself).

3. The Consul server handles the registration, service discovery and also KV/configuration.

4. Important/questionable: every container has a Consul agent started with "consul agent -config-dir /etc/consul.d", connecting to this one server. The configuration looks like this, including two other files with the encrypt token / ACL token. Do not wonder about servicename(); it is replaced by an m4 macro during image build time.

5. The clients are secured by a gossip key and ACL keys 

6. Important: All containers are on the same hardware node

7. Server configuration looks like this, if at all important. In addition, the ACLs look like this, and the ACL master and client token/gossip JSON files are in that configuration folder.
---
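To make point 4 concrete, a per-container client configuration along these lines could look roughly like this. This is a sketch only; the datacenter name, data directory, server address, and token values are placeholders, not the poster's actual configuration:

```json
{
  "datacenter": "dc1",
  "data_dir": "/var/consul",
  "retry_join": ["172.28.0.2"],
  "encrypt": "<gossip-key>",
  "acl_datacenter": "dc1",
  "acl_token": "<client-token>"
}
```

Dropped into /etc/consul.d, this is picked up by the `consul agent -config-dir /etc/consul.d` invocation from point 4, with the gossip key and ACL token kept in separate files as described.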

Sorry for the probable TL;DR above, but now the reasons behind this multi-agent setup (or agent per container):

My reasons for that:

A) I use Tiller to configure the containers, so the dimploy gem will usually try to connect to localhost:8500. To accomplish that without making the Consul configuration extraordinarily complicated, I use this local agent, which then forwards the request to the actual server and thus handles all the encryption-key/ACL negotiation stuff.

B) I use several 'consul watch' tasks on the server to trigger re-configuration; they also run against localhost:8500 without any extra configuration.

That said, the reason I run a Consul agent per container is the simplicity for local services to talk to the Consul backend without really caring about authentication, as long as they connect through 127.0.0.1:8500 (that is the level of security).
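For illustration, one of the re-configuration triggers from reason B can be expressed as a watch in the agent's config directory. The service name and handler script here are made-up placeholders, not the actual setup:

```json
{
  "watches": [
    {
      "type": "service",
      "service": "db",
      "handler": "/usr/local/bin/regenerate-config.sh"
    }
  ]
}
```

Because the watch talks to the local agent on 127.0.0.1:8500, it needs no extra address, encryption, or ACL configuration of its own, which is exactly the convenience being described.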

Final Question:
Is the multi-agent setup actually designed to be used that way? The reason I ask is that, as far as I understand, the node-ID duplication issue I now get when starting 0.8.1 comes from "the host" being the same, i.e. the hardware node being identical for all Consul agents, right?

Is my design wrong, or do I need to generate my own node IDs from now on and it's all just fine?

Thanks for even reading through this :)

James Phillips

May 3, 2017, 12:36:10 PM
to consu...@googlegroups.com
Hi Eugen,

> Is the multi-agent setup actually designed to be used that way? The reason I ask is that, as far as I understand, the node-ID duplication issue I now get when starting 0.8.1 comes from "the host" being the same, i.e. the hardware node being identical for all Consul agents, right?

Consul is designed for one agent per host - that's why we made the
deterministic host-based IDs (a Nomad agent running on the same host
will pick the same ID for itself, which makes it nice for correlating
things between the two, for example). Folks using Docker often use
--net=host or bind Consul to an address on the bridge network. This
post from the community has an interesting alternate approach to these
two - https://medium.com/zendesk-engineering/making-docker-and-consul-get-along-5fceda1d52b9.

If your architecture is working for you then it's hard to say it's
wrong :-) It doesn't map super cleanly into Consul, though, to have
multiple agents running on the same host. You have to be careful about
giving them unique node names and addresses, even though they are on
the same node, and Consul's gossip protocol means that the agents will
all probe each other to see if a host is down, so you might have peer
agents probing each other which doesn't make sense. I'd definitely
recommend trying to get to one Consul agent per host if you can.

As far as the host-based IDs are concerned,
https://www.consul.io/docs/agent/options.html#_disable_host_node_id
should be all you need. When the agent starts up it will generate its
UUID and save it off into Consul's data directory in a file called
"node-id". Is it possible that the process that makes your Consul
container starts it up so that all your containers are getting the
same ID?
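Concretely, a sketch of what that looks like on disk; the /var/consul path is an assumed data directory, not necessarily the poster's:

```shell
# Each agent persists its identity in <data-dir>/node-id.
cat /var/consul/node-id

# If the containers were built from an image with a pre-populated data
# directory, every agent starts with the same stale file. Deleting it
# lets a fresh ID be generated on the next start:
rm /var/consul/node-id

# Or skip the deterministic host-based ID entirely (0.8.1+):
consul agent -config-dir /etc/consul.d -disable-host-node-id
```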

-- James
> --
> This mailing list is governed under the HashiCorp Community Guidelines -
> https://www.hashicorp.com/community-guidelines.html. Behavior in violation
> of those guidelines may result in your removal from this mailing list.
>
> GitHub Issues: https://github.com/hashicorp/consul/issues
> IRC: #consul on Freenode
> ---
> You received this message because you are subscribed to the Google Groups
> "Consul" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to consul-tool...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/consul-tool/db422e4a-c82f-41b7-a7a6-d2b087182b6f%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Eugen Mayer

May 3, 2017, 1:04:15 PM
to Consul
Thank you James, once again.


On Wednesday, May 3, 2017 at 6:36:10 PM UTC+2, James Phillips wrote:

> Consul is designed for one agent per host - that's why we made the
> deterministic host-based IDs (a Nomad agent running on the same host
> will pick the same ID for itself, which makes it nice for correlating
> things between the two, for example). Folks using Docker often use
> --net=host or bind Consul to an address on the bridge network. This
> post from the community has an interesting alternate approach to these
> two - https://medium.com/zendesk-engineering/making-docker-and-consul-get-along-5fceda1d52b9.
Very interesting article. In the article the author writes:

 
> Installing a Consul agent per container. Consul’s architecture anticipates a single agent per host IP address; and in most environments, a Docker host has a single network-accessible IP address. Running more than one Consul agent per container would cause multiple agents to join the Consul network and claim responsibility for the host, causing major instability in the cluster.

But that does not happen if you start all the other agents in "agent" mode, not server mode? This way you have one cluster leader (the service called consul, which runs the Consul server) and all the others are just agents. I have not seen any cluster-leader issues yet, so I wonder if I or he got this thing wrong?

He also writes

> Configuring Consul to bind to the Docker bridge IP address. The routing would work properly, but: (a) typically, bridge interfaces are assigned dynamically by Docker; (b) there may be more than one bridge interface; (c) containers would have to know the IP address of the selected bridge interface; and (d) the Consul agent and dnsmasq (discussed below) would not be able to start until the Docker engine has started. We don’t want to create any unnecessary dependencies.

I solved this by creating a Docker network, and the Consul service gets a static IP there (since changing IPs for Consul is pretty bad). Every service waits for Consul to become available (wait-for-it) and then starts and registers itself through the local agent. I cannot see a lot of bad things here either.
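That static-IP network setup could be sketched in docker-compose terms like this; the subnet, address, and service names are invented for illustration:

```yaml
version: "2.1"
services:
  consul:
    image: consul:0.8.1
    networks:
      stacknet:
        ipv4_address: 172.28.0.2   # static, so agents always find the server here
  someservice:
    image: myorg/someservice
    depends_on:
      - consul                     # plus a wait-for-it style check before registering
    networks:
      - stacknet
networks:
  stacknet:
    ipam:
      config:
        - subnet: 172.28.0.0/16
```

The fixed ipv4_address sidesteps the article's objection (a) that bridge addresses are assigned dynamically, at the cost of pinning a subnet per stack.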
 


> If your architecture is working for you then it's hard to say it's
> wrong :-) It doesn't map super cleanly into Consul, though, to have
> multiple agents running on the same host. You have to be careful about
> giving them unique node names and addresses, even though they are on
> the same node, and Consul's gossip protocol means that the agents will
> all probe each other to see if a host is down, so you might have peer
> agents probing each other which doesn't make sense. I'd definitely
> recommend trying to get to one Consul agent per host if you can.

For now, running anything on the host which is not part of the docker-compose.yml is not portable and does not scale very well.
Consider starting the same stack in development: you now also have to set up the dev machine to have some Consul server up and running.

It's simply no longer pack-and-go; it depends on the host, and I see only cons in that.

I understand that if you build a multi-host cluster, you would extract the Consul server cluster to an external place, so it runs independently of the hosts. But in the "one stack on a server which also hosts other things" case, doing anything like this on the host just does not scale. It also means you cannot run anything else Consul-related on the host, and you will not be able to use anything like Swarm/Kubernetes and the like.

I understand putting Consul outside the actual host the stack starts on, with the stack connecting to that Consul; that makes sense, since that server is there before the stack starts and the stack can move around. But moving the stack around with a host dependency sounds like a bad idea to me.

--net=host is something I generally avoid completely; it basically reserves the host for this stack only, and probably creates security issues if you are not careful. Besides that, it simply fails to use Docker networking properly, which would be more portable.

For now, just looking at my case and comparing his arguments, it seems like what I have generally gives more advantages, and the only disadvantage is the node-ID topic (see below).

> As far as the host-based IDs are concerned,
> https://www.consul.io/docs/agent/options.html#_disable_host_node_id
> should be all you need. When the agent starts up it will generate its
> UUID and save it off into Consul's data directory in a file called
> "node-id". Is it possible that the process that makes your Consul
> container starts it up so that all your containers are getting the
> same ID?
That did not work, and I do not really know why. In general, a Docker container can be seen as a host in my terms; it is a full-stack OS in a container. Running Consul in every VM you host would not be in question either.
I understand that the node ID is derived from something hardware-near, and that this creates issues in the Docker world, since the hardware is not abstracted. I would rather consider generating the IDs myself, using my own metrics like service name + stack name + customer name, and putting that onto the containers at startup.

I will try it either way: generating my own IDs, or trying to disable host-based node IDs.
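The self-generated ID idea could be sketched like this, assuming only that the agent's -node-id option accepts any UUID string; the naming scheme (service + stack + customer) is the one mentioned above, and the helper name is made up:

```python
import uuid

def node_id(service: str, stack: str, customer: str) -> str:
    """Derive a stable node ID from the container's logical identity,
    independent of the hardware it runs on."""
    name = f"{service}.{stack}.{customer}"
    # uuid5 is deterministic: the same name always yields the same UUID,
    # and different names yield different UUIDs.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, name))

# The result could then be handed to the agent at container start, e.g.:
#   consul agent -config-dir /etc/consul.d -node-id "$NODE_ID"
print(node_id("db", "mystack", "acme"))
```

Because the ID is derived from the stack's own naming rather than the host, two agents on the same hardware node get distinct IDs, while restarting the same container reproduces the same one.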

--

In general, James, thank you for sharing your thoughts on this. I feel better now, understanding the potential risks and mapping them against my use case, rather than just "proving my concept works for me, because right now it just happens to work for me".

Would you say, generally speaking, that Consul is not really Docker-centric or Docker-focused at all? I was thinking Consul was more or less made for it, but I understand clusters/microservices have been around for ages, so of course it works there too. Would you rather consider picking something more of a Docker first-class citizen for the job?

Thanks for the feedback!

James Phillips

May 3, 2017, 1:18:48 PM
to consu...@googlegroups.com
> But that does not happen if you start all the other agents in "agent" mode, not server mode? This way you have one cluster leader (the service called consul, which runs the Consul server) and all the others are just agents. I have not seen any cluster-leader issues yet, so I wonder if I or he got this thing wrong?

With your setup I don't see how you'd get instability, once you solve
the node ID issue. I think the main concern in the article is that you
can have different agents on the same host with a different sense of
"this host is alive", which might cause weird issues up the stack.
It's also not clear where you'd want to run host-level checks like
disk/cpu/etc. if you have multiple agents. Running one agent on the
host makes that a lot clearer.

> That did not work, and I do not really know why.

That should definitely work, so we should be able to get to the bottom
of that. If you run with debug-level logging it will print where it's
getting the node ID from, so that should tell us what's going on.
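That debugging step might look like the following; the exact log wording is a guess and may differ by version:

```shell
# -log-level=debug makes the agent log which source its node ID came from
# (host-based, data-dir file, or freshly generated).
consul agent -config-dir /etc/consul.d -log-level=debug 2>&1 | grep -i "node id"
```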

> Would you say, generally speaking, that Consul is not really Docker-centric or Docker-focused at all? I was thinking Consul was more or less made for it, but I understand clusters/microservices have been around for ages, so of course it works there too. Would you rather consider picking something more of a Docker first-class citizen for the job?

Definitely not; we want to make Consul work well with
Docker and have been improving this over time with things like the
official image, ability to dynamically figure out bind addresses, etc.
I think the main open issue is whether Consul is an infrastructure
piece, similar to the Docker daemon, that you run on your host, or if
you deploy it as part of your app. We've obviously been leaning
towards the former because that fits better with Consul's overall
design. We will keep listening to the community about their
experiences using Consul with Docker, and working to improve that
experience.

Hope that helps and let's figure out what's going on with the node IDs.

-- James

Miguel Terrón

May 3, 2017, 9:20:22 PM
to Consul
Just for another datapoint.

In my case, if my container/VM/host needs Consul, I add it to the corresponding abstraction. So if one of my Docker containers requires an agent, I pack the agent into the container.

This way there are no special cases. I always have Consul at 127.0.0.1:8500 no matter where I am sitting: inside a container, inside a VM, or on a dedicated physical host. I find that consistency very useful; the node ID is completely irrelevant for me.

Cheers

Eugen Mayer

May 4, 2017, 3:08:55 AM
to Consul
Exactly my point too, Miguel; it's just far more portable and convenient.

Besides that, monitoring the host is the same concern as monitoring a VM. You either want to monitor the host or the application container, no matter whether it's a VM or LXC. Nevertheless, I would never consider Consul for either of those tasks; rather Datadog, Icinga, Monit, Munin... So the monitoring aspect of Consul is irrelevant for me; it's just not the right tool for that.

In the end it comes down to the aspect you named, James: does Consul want to be plainly an infrastructure element, or also application-aware?

Considering all the features consul offers:

- KV
- Services not only nodes
- Watches to generate configuration

Using Consul only as an infrastructure element would not do Consul justice; it would degrade it to much less than it actually is.

That said, Consul should see a container as its host, no matter what it runs on. All it should care about is this container/node being up, with its services, no matter whether below it is a VM and below that the physical host. It simply should not matter.

It should also not matter whether a cluster is spread across different physical hosts, across VMs on the same host, or across containers.

That is of course my POV; I am sure I do not see the whole global picture.

I really like the controversy in this discussion; it sheds some light on corners you usually do not look at.

Also, thank you a lot, James, for sharing the insight into the mission HashiCorp is currently driving behind Consul; very valuable.
