Problem w/ configuration flag 'cluster_formation.consul.domain_suffix'


J Kelsey-De France

Apr 5, 2018, 4:10:13 PM
to rabbitmq-users
I'm working on setting up a 3-node RabbitMQ cluster inside a Kubernetes cluster, but we have a requirement to use Consul as the service discovery mechanism (we also have a functioning Consul). The Consul server cluster is in the same K8s namespace, and each RabbitMQ pod has a sidecar Consul container so that the appropriate IP addresses are reported to Consul.

Here are some details about how this is currently configured:

RabbitMQ version: 3.7.4
Consul version: 1.0.6
Kubernetes version: 1.9.2

Contents of rabbitmq.conf
log.console.level = debug

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_formation.consul.host = localhost
cluster_formation.consul.svc = rmq

# do compute service address
cluster_formation.consul.svc_addr_auto = true
# compute service address using node name
cluster_formation.consul.svc_addr_use_nodename = true
# use long RabbitMQ node names?
cluster_formation.consul.use_longname = true
# append a suffix (node.rabbitmq.example.local) to node names retrieved from Consul
cluster_formation.consul.domain_suffix = node.consul


After all the pods start, the peer discovery plugin successfully registers the RabbitMQ nodes as rabbitmq-0, rabbitmq-1, and rabbitmq-2 under the rmq service name (the RabbitMQ management UI and logs report the same). However, during cluster formation the RabbitMQ nodes appear to try to reach each other using the short names and never attempt to use names such as rabbitmq-0.node.consul. Here are some relevant log messages from the rabbitmq-1 node:

2018-04-05 17:51:18.828 [debug] <0.193.0> GET http://localhost:8500/v1/health/service/rmq?passing
2018-04-05 17:51:18.830 [debug] <0.193.0> Response: {ok,{{"HTTP/1.1",200,"OK"},[{"date","Thu, 05 Apr 2018 17:51:18 GMT"},{"content-length","958"},{"content-type","application/json"},{"x-consul-index","564"},{"x-consul-knownleader","true"},{"x-consul-lastcontact","0"}],"[{\"Node\":{\"ID\":\"01912744-c26d-ee97-e571-b581dd69237d\",\"Node\":\"rabbitmq-0\",\"Address\":\"10.1.1.11\",\"Datacenter\":\"flybaby\",\"TaggedAddresses\":{\"lan\":\"10.1.1.11\",\"wan\":\"10.1.1.11\"},\"Meta\":{\"consul-network-segment\":\"\"},\"CreateIndex\":557,\"ModifyIndex\":558},\"Service\":{\"ID\":\"rmq:rabbitmq-0\",\"Service\":\"rmq\",\"Tags\":[],\"Address\":\"rabbitmq-0\",\"Port\":5672,\"EnableTagOverride\":false,\"CreateIndex\":562,\"ModifyIndex\":562},\"Checks\":[{\"Node\":\"rabbitmq-0\",\"CheckID\":\"serfHealth\",\"Name\":\"Serf Health Status\",\"Status\":\"passing\",\"Notes\":\"\",\"Output\":\"Agent alive and reachable\",\"ServiceID\":\"\",\"ServiceName\":\"\",\"ServiceTags\":[],\"Definition\":{},\"CreateIndex\":557,\"ModifyIndex\":557},{\"Node\":\"rabbitmq-0\",\"CheckID\":\"service:rmq:rabbitmq-0\",\"Name\":\"Service 'rmq' check\",\"Status\":\"passing\",\"Notes\":\"RabbitMQ Consul-based peer discovery plugin TTL check\",\"Output\":\"\",\"ServiceID\":\"rmq:rabbitmq-0\",\"ServiceName\":\"rmq\",\"ServiceTags\":[],\"Definition\":{},\"CreateIndex\":562,\"ModifyIndex\":562}]}]"}}
2018-04-05 17:51:18.830 [info] <0.193.0> All discovered existing cluster peers: rabbit@rabbitmq-0
2018-04-05 17:51:18.830 [info] <0.193.0> Peer nodes we can cluster with: rabbit@rabbitmq-0
2018-04-05 17:51:18.867 [warning] <0.193.0> Could not auto-cluster with node rabbit@rabbitmq-0: {badrpc,nodedown}
2018-04-05 17:51:18.867 [warning] <0.193.0> Could not successfully contact any node of: rabbit@rabbitmq-0 (as in Erlang distribution). Starting as a blank standalone node...


Logs here seem to indicate that cluster_formation.consul.domain_suffix is not being used. I've checked that the ports are open and that the pods are reachable from each other:

root@rabbitmq-1:/# nmap -p 5672,4369,25672 rabbitmq-0.node.consul

Starting Nmap 7.40 ( https://nmap.org ) at 2018-04-05 19:55 UTC
Nmap scan report for rabbitmq-0.node.consul (10.1.1.11)
Host is up (0.00010s latency).
PORT      STATE SERVICE
4369/tcp  open  epmd
5672/tcp  open  amqp
25672/tcp open  unknown
MAC Address: 0A:58:0A:01:01:0B (Unknown)

Nmap done: 1 IP address (1 host up) scanned in 0.84 seconds

The hostnames generated by Kubernetes (e.g., rabbitmq-0, rabbitmq-1, rabbitmq-2) are not network addressable, and nothing in the logs indicates that the .node.consul domain suffix is being used, so it appears that cluster_formation.consul.domain_suffix is not being picked up as intended.
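
For anyone who wants to double-check the resolvability claim from inside a pod, here is a minimal Python sketch. The helper name resolve_all is hypothetical, and the two hostnames are the ones mentioned above; it only reports whether each name resolves and does not test the Erlang distribution ports.

```python
import socket


def resolve_all(hostnames, resolver=socket.gethostbyname):
    """Map each hostname to its IPv4 address, or None if it does not resolve.

    `resolver` is injectable so the helper can be exercised without DNS.
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


# Example call from inside a RabbitMQ pod (performs real lookups):
#   resolve_all(["rabbitmq-0", "rabbitmq-0.node.consul"])
```

If the short name maps to None while the suffixed name maps to an address, that supports the reading that only the .node.consul names are usable between pods.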

Michael Klishin

Apr 5, 2018, 6:37:34 PM
to rabbitm...@googlegroups.com
Looking at the code [1] suggests two things:

 * domain_suffix (translated to "consul_domain" in the classic config format) is only used when long node names are used, which is your case
 * It will only be appended to Node field values if Address is blank

This was a contributed feature and I'm not sure why the logic is what it is.

Nothing else stands out in either the code or your config.
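
The two conditions above can be modeled roughly as follows. This is a Python sketch of the described behavior, not the plugin's actual Erlang code: the function name peer_node_name and its parameters are hypothetical, and it assumes that "Address" means the Service.Address field of each Consul health entry. The sample entry is trimmed from the response in the logs above.

```python
def peer_node_name(entry, use_longname, domain_suffix, prefix="rabbit"):
    """Hypothetical model of deriving a peer node name from a Consul health entry."""
    # Assumption: the address consulted is Service.Address, not Node.Address.
    service_address = entry.get("Service", {}).get("Address", "")
    if service_address:
        # A non-blank address is used as-is; the suffix is never appended.
        return f"{prefix}@{service_address}"
    host = entry["Node"]["Node"]
    # The suffix only applies to the Node field, and only with long names.
    if use_longname and domain_suffix:
        host = f"{host}.{domain_suffix}"
    return f"{prefix}@{host}"


# Entry trimmed from the logged Consul response.
entry = {
    "Node": {"Node": "rabbitmq-0", "Address": "10.1.1.11"},
    "Service": {"ID": "rmq:rabbitmq-0", "Service": "rmq",
                "Address": "rabbitmq-0", "Port": 5672},
}

print(peer_node_name(entry, use_longname=True, domain_suffix="node.consul"))
# prints: rabbit@rabbitmq-0
```

Under this reading, the registered Service.Address of rabbitmq-0 is non-blank (presumably a result of svc_addr_auto and svc_addr_use_nodename), which would be consistent with the suffix never showing up in the logs.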





--
MK

Staff Software Engineer, Pivotal/RabbitMQ

louis gueye

Jul 31, 2019, 9:37:17 AM
to rabbitmq-users
Hello there,

Any progress on this issue? I'm experiencing the same problem.

Short node names are not resolved but domain_suffix does not fix the issue.