Run Consul agent in Kubernetes against cluster outside Kubernetes


Trond Hindenes

Sep 12, 2018, 10:29:23 AM
to Consul
Hi, 
We have a multi-DC Consul cluster running on regular EC2 instances, and we don't plan to change that. However, we want to run Consul agents in our Kubernetes pods as well.

The use case right now is to expose the Consul UI and REST API from Kubernetes pods via (HTTPS-secured) ingresses, so we can lock down the Consul servers themselves (we use the Consul REST API extensively).

I'm struggling to get this stable, and I suspect it's down to my lack of understanding of Consul's networking. I know that all Consul agents (including those running in "client mode") form a gossip mesh and talk to each other, and that queries ultimately get forwarded to the nodes running in "server mode".

During startup of our Consul pods, we fetch the IP of the underlying host from the EC2 metadata service and inject it into the configuration like this:
"advertise_addr": "${HOST_IP}"
However, it looks like the "mesh" part of Consul's behavior is controlled by a different set of config parameters (the serf/bind options).

It would be SO helpful to have a clear and understandable diagram of how Consul nodes (both clients and servers) communicate: which ports/IP addresses are used, and the config flags that control them.

For example, I tried adjusting the Serf bind address (the -serf-lan-bind flag), but since the agent inside the pod can't physically bind to the underlying host's IP, that failed. I'm a bit stumped as to whether this is going to work at all, and I'm looking to discuss with someone who has done a "hybrid" setup where the server-mode nodes are outside of Kubernetes.
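
From the docs, my guess at the intended pattern is to bind inside the pod and only advertise the host IP - something like this (just my guess, not a config I've gotten working):

{
  "bind_addr": "0.0.0.0",
  "advertise_addr": "${HOST_IP}"
}

Even then, I assume something (hostPort or host networking) has to forward ${HOST_IP}:8301 back into the pod for gossip to actually reach the agent.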

Right now the Consul pods are running, but they're flapping, so it's not a good situation.

Jason McIntosh

Sep 12, 2018, 10:55:42 AM
to consu...@googlegroups.com
Some of this is tied more to Kubernetes networking than to Consul. Here's roughly what happens; it depends entirely on your k8s network layer:

Flannel/Canal: pods get a private overlay IP - as a general rule you can't hit them directly from outside the cluster. You have to expose services over a Kubernetes NodePort to get direct access (a sketch follows after this list). This can be moderately tricky to implement as you have to know that port. https://kubernetes.io/docs/concepts/services-networking/service/#nodeport has docs on this.
VPC networking: each pod gets an IP from the VPC - this is one where you absolutely can advertise the IP and get access to it pretty easily. From a networking perspective this has been dead simple to work with, BUT there are lots of gotchas too.
Other network layers? Not super familiar with them, so I can't say much.
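
As a concrete (made-up) example of the NodePort route for the Consul HTTP API - the name, labels, and port choice here are purely illustrative:

apiVersion: v1
kind: Service
metadata:
  name: consul-http
spec:
  type: NodePort
  selector:
    app: consul          # must match whatever labels your Consul pods carry
  ports:
  - name: http
    port: 8500           # Service port inside the cluster
    targetPort: 8500     # Consul HTTP API port on the pod
    nodePort: 30850      # has to fall in the cluster's NodePort range (default 30000-32767)

That covers the HTTP API; gossip (8301 TCP+UDP) is awkward over NodePort because the default NodePort range doesn't include 8301 - for the gossip side you'd be looking at host networking or hostPort instead.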


The Kubernetes Service docs linked above also have information on contacting a service. Unless you're using a network that is actually routable from your pods, it's just not going to work. It gets worse if you're going across things like VPCs, where it gets really confusing.

Consul-wise:
Each Consul agent has an IP and a set of ports: gossip on one set of ports, RPC on another, etc. These are documented. Agents communicate over the gossip protocol for membership/awareness; consensus (Raft) happens only between the servers.
MOST of the communications you'll care about are RPC. Gossip, aka Serf, is used for "cluster awareness" - e.g. membership and failure detection - and is primarily UDP.
SO Consul agents and servers HAVE to be able to reach each other on the RPC/gossip ports: 8301 (Serf LAN, local cluster), 8300 (server RPC), 8302 (Serf WAN). The HTTP API is 8500, and DNS is 8600 if you want to use it. (The old CLI RPC port was 8400, but that was removed in Consul 1.0.)
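
If you ever need to move any of those off the defaults, they're all controlled from the agent's ports config block - defaults shown here purely as a reference:

{
  "ports": {
    "serf_lan": 8301,
    "serf_wan": 8302,
    "server": 8300,
    "http": 8500,
    "dns": 8600
  }
}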

Hope this helps!




Trond Hindenes

Sep 12, 2018, 11:08:57 AM
to Consul
Thanks for that, that's a really helpful diagram.

Could you say anything about the implications of not being able to do a "full mesh"? What are the consequences of not all clients being able to reach all other clients? Does that affect whether or not Consul sees the node as "down"?

Jason McIntosh

Sep 12, 2018, 11:15:55 AM
to consu...@googlegroups.com
I've not personally tested Consul agents being unable to talk to each other. The "mesh" impacts I know of are when you're dealing with multiple clusters in different datacenters; in that case I believe it's JUST the servers that have to be able to talk to each other (directly, not over an LB). Not sure, though - I haven't really tested that functionality. I do know that servers not being able to reach each other is a major issue, both within a cluster and for WAN-joined servers. Consul Enterprise solves some of this with some very nice improvements to the network stack (network segments, e.g. A->B works and B->C works without needing A->C communication). Whether Consul marks a node as "down" is, I believe, server-to-agent based, but as always - TEST IT :)


Mitchell Hashimoto

Sep 12, 2018, 12:13:30 PM
to Consul
Hey Trond,

I can help with some of this as well. Responses inline:
On Sep 12, 2018, 8:08 AM -0700, Trond Hindenes <tr...@hindenes.com>, wrote:
Thanks for that, that's a really helpful diagram.

Could you say anything about the implications of not being able to do a "full mesh"? What are the consequences of not all clients being able to reach all other clients? Does that affect whether or not Consul sees the node as "down"?

If a cluster doesn’t have full mesh capability, you’ll see constant node health flapping. Node health implies service health for services registered on those nodes, so your services will also become unavailable (for discovery) while the node is flapping. The constant flapping also makes the network much more noisy as agents attempt to diagnose and verify downness. This probably isn’t too much data, but it’ll be more.

If you can’t have all agents communicate via all the ports Jason mentioned, then there are really only two alternatives: 1.) treat each fully connected area as a different datacenter. This requires that the server agents (in all DCs) can communicate with each other via the RPC/WAN ports, but that’s a much smaller surface area. 2.) [Paid:] Consul Enterprise has support for non-fully connected networks via network segments: https://www.consul.io/docs/enterprise/network-segments/index.html
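
For option 1, only the servers need WAN configuration; roughly something like this on each server (the hostname is just a placeholder):

{
  "retry_join_wan": ["consul-server.other-dc.example.internal"]
}

Clients then only talk to their local servers over LAN gossip and RPC, and cross-DC requests are forwarded server-to-server.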

Kubernetes generally:

We’re about to announce an official Helm chart for running Consul in and across Kubernetes/non-Kubernetes environments. I believe it has some docs for your use case; you can view them at the following link, and they will be deployed to the website sometime later today. https://github.com/hashicorp/consul/commit/5943c79ed490b5e61f4fa5403138dc7985dc00bf

We run a 1-client-agent-per-node deployment (via DaemonSet). Our initial deployment requires that all pod IPs be routable throughout the cluster (internal and external). You can also technically expose all the Consul ports via `hostPort` to lower the routing requirement to the node (not the pod), but we don’t do that automatically via the Helm chart yet. So that gives you two potential options: pod routability or host routability.
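
If you want to experiment with the hostPort approach before the chart supports it, the relevant piece is just the container ports on the client DaemonSet. A rough sketch (the name, labels, and image tag are illustrative, and only a few of the Consul ports are shown):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: consul-client
spec:
  selector:
    matchLabels:
      app: consul-client
  template:
    metadata:
      labels:
        app: consul-client
    spec:
      containers:
      - name: consul
        image: consul:1.2.2
        args: ["agent", "-config-dir=/consul/config"]
        ports:
        # Serf LAN gossip has to be reachable on the host IP over both TCP and UDP.
        - containerPort: 8301
          hostPort: 8301
          protocol: TCP
        - containerPort: 8301
          hostPort: 8301
          protocol: UDP
        # HTTP API, e.g. for the UI/REST use case.
        - containerPort: 8500
          hostPort: 8500
          protocol: TCP

With the ports published on the node, the agent can keep advertising the host IP (as you're already doing) and agents outside Kubernetes only need to be able to reach the node, not the pod.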

If that isn’t possible, we recommend a multi-DC setup.

Happy to answer any further questions!

Best,
Mitchell
