I'm working on setting up a 3-node RabbitMQ cluster inside a Kubernetes cluster, but we have a requirement to use Consul as the service discovery mechanism (we already have a functioning Consul deployment). The Consul server cluster lives in the same Kubernetes namespace, and each RabbitMQ pod has a Consul agent sidecar container so that the appropriate IP addresses are reported to Consul.
Here are some details about how this is currently configured:
RabbitMQ version: 3.7.4
Consul version: 1.0.6
Kubernetes version: 1.9.2
Contents of rabbitmq.conf:
log.console.level = debug
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_formation.consul.host = localhost
cluster_formation.consul.svc = rmq
# do compute service address
cluster_formation.consul.svc_addr_auto = true
# compute service address using node name
cluster_formation.consul.svc_addr_use_nodename = true
# use long RabbitMQ node names?
cluster_formation.consul.use_longname = true
# append a suffix (e.g., node.consul) to node names retrieved from Consul
cluster_formation.consul.domain_suffix = node.consul
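For reference, my understanding is that cluster_formation.consul.use_longname only affects the names the plugin computes for peers; the node's own name is controlled by environment variables. A sketch of how the long form could be forced via the standard RABBITMQ_USE_LONGNAME and RABBITMQ_NODENAME variables (this is not in our manifests today, and the exact wiring is my assumption):

# hypothetical container startup snippet; Kubernetes sets HOSTNAME
# to the pod name, e.g. rabbitmq-1
export RABBITMQ_USE_LONGNAME=true
export RABBITMQ_NODENAME="rabbit@${HOSTNAME}.node.consul"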
After all the pods start, the peer discovery plugin successfully registers the RabbitMQ nodes as rabbitmq-0, rabbitmq-1, and rabbitmq-2 under the rmq service name (the RabbitMQ management UI and logs report the same). However, during cluster formation the RabbitMQ nodes try to reach each other using only the short names and never attempt the suffixed form (e.g., rabbitmq-0.node.consul). Here are some relevant log messages from the rabbitmq-1 node:
2018-04-05 17:51:18.830 [debug] <0.193.0> Response: {ok,{{"HTTP/1.1",200,"OK"},[{"date","Thu, 05 Apr 2018 17:51:18 GMT"},{"content-length","958"},{"content-type","application/json"},{"x-consul-index","564"},{"x-consul-knownleader","true"},{"x-consul-lastcontact","0"}],"[{\"Node\":{\"ID\":\"01912744-c26d-ee97-e571-b581dd69237d\",\"Node\":\"rabbitmq-0\",\"Address\":\"10.1.1.11\",\"Datacenter\":\"flybaby\",\"TaggedAddresses\":{\"lan\":\"10.1.1.11\",\"wan\":\"10.1.1.11\"},\"Meta\":{\"consul-network-segment\":\"\"},\"CreateIndex\":557,\"ModifyIndex\":558},\"Service\":{\"ID\":\"rmq:rabbitmq-0\",\"Service\":\"rmq\",\"Tags\":[],\"Address\":\"rabbitmq-0\",\"Port\":5672,\"EnableTagOverride\":false,\"CreateIndex\":562,\"ModifyIndex\":562},\"Checks\":[{\"Node\":\"rabbitmq-0\",\"CheckID\":\"serfHealth\",\"Name\":\"Serf Health Status\",\"Status\":\"passing\",\"Notes\":\"\",\"Output\":\"Agent alive and reachable\",\"ServiceID\":\"\",\"ServiceName\":\"\",\"ServiceTags\":[],\"Definition\":{},\"CreateIndex\":557,\"ModifyIndex\":557},{\"Node\":\"rabbitmq-0\",\"CheckID\":\"service:rmq:rabbitmq-0\",\"Name\":\"Service 'rmq' check\",\"Status\":\"passing\",\"Notes\":\"RabbitMQ Consul-based peer discovery plugin TTL check\",\"Output\":\"\",\"ServiceID\":\"rmq:rabbitmq-0\",\"ServiceName\":\"rmq\",\"ServiceTags\":[],\"Definition\":{},\"CreateIndex\":562,\"ModifyIndex\":562}]}]"}}
2018-04-05 17:51:18.830 [info] <0.193.0> All discovered existing cluster peers: rabbit@rabbitmq-0
2018-04-05 17:51:18.830 [info] <0.193.0> Peer nodes we can cluster with: rabbit@rabbitmq-0
2018-04-05 17:51:18.867 [warning] <0.193.0> Could not auto-cluster with node rabbit@rabbitmq-0: {badrpc,nodedown}
2018-04-05 17:51:18.867 [warning] <0.193.0> Could not successfully contact any node of: rabbit@rabbitmq-0 (as in Erlang distribution). Starting as a blank standalone node...
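To check which name the broker on rabbitmq-0 actually registered under (as opposed to the address the plugin advertises to Consul), the node can be asked directly with rabbitmqctl; the output below is what I would expect to see if it is running under the short name:

root@rabbitmq-0:/# rabbitmqctl eval 'node().'
rabbit@rabbitmq-0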
The logs seem to indicate that cluster_formation.consul.domain_suffix is not being used. I've verified that the relevant ports (4369, 5672, 25672) are open and that the pods are reachable from each other:
root@rabbitmq-1:/# nmap -p 5672,4369,25672 rabbitmq-0.node.consul
Nmap scan report for rabbitmq-0.node.consul (10.1.1.11)
Host is up (0.00010s latency).
PORT STATE SERVICE
4369/tcp open epmd
5672/tcp open amqp
25672/tcp open unknown
MAC Address: 0A:58:0A:01:01:0B (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.84 seconds
Since the hostnames generated by Kubernetes (e.g., rabbitmq-0, rabbitmq-1, rabbitmq-2) are not network addressable on their own, and nothing in the logs shows the .node.consul suffix being applied, it looks like cluster_formation.consul.domain_suffix is not being picked up as intended.
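As a next step, one way to isolate whether node naming is the only problem would be to try joining by the long name manually (standard rabbitmqctl commands; I haven't confirmed this succeeds against the current short-named nodes):

root@rabbitmq-1:/# rabbitmqctl stop_app
root@rabbitmq-1:/# rabbitmqctl join_cluster rabbit@rabbitmq-0.node.consul
root@rabbitmq-1:/# rabbitmqctl start_app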