I'm working on setting up a 3-node RabbitMQ cluster inside a Kubernetes cluster, but we have a requirement to use Consul as the service discovery mechanism (we already have a functioning Consul deployment). The Consul server cluster lives in the same Kubernetes namespace, and each RabbitMQ pod has a Consul agent sidecar container so that the appropriate IP addresses are reported to Consul.
Here are some details about how this is currently configured:
RabbitMQ version: 3.7.4
Consul version: 1.0.6
Kubernetes version: 1.9.2
Contents of rabbitmq.conf:
log.console.level = debug
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_formation.consul.host = localhost
cluster_formation.consul.svc = rmq
# do compute service address
cluster_formation.consul.svc_addr_auto = true
# compute service address using node name
cluster_formation.consul.svc_addr_use_nodename = true
# use long RabbitMQ node names?
cluster_formation.consul.use_longname = true
# append a suffix (e.g., node.consul) to node names retrieved from Consul
cluster_formation.consul.domain_suffix = node.consul
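For reference, my understanding is that cluster_formation.consul.use_longname only affects the names the plugin computes for peers; the node's own name is controlled by environment variables. A sketch of how the long form could be forced via the standard RABBITMQ_USE_LONGNAME and RABBITMQ_NODENAME variables (this is not in our manifests today, and the exact wiring is my assumption):

# hypothetical container startup snippet; Kubernetes sets HOSTNAME
# to the pod name, e.g. rabbitmq-1
export RABBITMQ_USE_LONGNAME=true
export RABBITMQ_NODENAME="rabbit@${HOSTNAME}.node.consul"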
After all the pods start, the peer discovery plugin successfully registers the RabbitMQ nodes as rabbitmq-0, rabbitmq-1, and rabbitmq-2 under the rmq service name (the RabbitMQ management UI and logs report the same). However, during cluster formation the RabbitMQ nodes try to reach each other using only the short names and never attempt the suffixed form (e.g., rabbitmq-0.node.consul). Here are some relevant log messages from the rabbitmq-1 node:
2018-04-05 17:51:18.830 [debug] <0.193.0> Response: {ok,{{"HTTP/1.1",200,"OK"},[{"date","Thu, 05 Apr 2018 17:51:18 GMT"},{"content-length","958"},{"content-type","application/json"},{"x-consul-index","564"},{"x-consul-knownleader","true"},{"x-consul-lastcontact","0"}],"[{\"Node\":{\"ID\":\"01912744-c26d-ee97-e571-b581dd69237d\",\"Node\":\"rabbitmq-0\",\"Address\":\"10.1.1.11\",\"Datacenter\":\"flybaby\",\"TaggedAddresses\":{\"lan\":\"10.1.1.11\",\"wan\":\"10.1.1.11\"},\"Meta\":{\"consul-network-segment\":\"\"},\"CreateIndex\":557,\"ModifyIndex\":558},\"Service\":{\"ID\":\"rmq:rabbitmq-0\",\"Service\":\"rmq\",\"Tags\":[],\"Address\":\"rabbitmq-0\",\"Port\":5672,\"EnableTagOverride\":false,\"CreateIndex\":562,\"ModifyIndex\":562},\"Checks\":[{\"Node\":\"rabbitmq-0\",\"CheckID\":\"serfHealth\",\"Name\":\"Serf Health Status\",\"Status\":\"passing\",\"Notes\":\"\",\"Output\":\"Agent alive and reachable\",\"ServiceID\":\"\",\"ServiceName\":\"\",\"ServiceTags\":[],\"Definition\":{},\"CreateIndex\":557,\"ModifyIndex\":557},{\"Node\":\"rabbitmq-0\",\"CheckID\":\"service:rmq:rabbitmq-0\",\"Name\":\"Service 'rmq' check\",\"Status\":\"passing\",\"Notes\":\"RabbitMQ Consul-based peer discovery plugin TTL check\",\"Output\":\"\",\"ServiceID\":\"rmq:rabbitmq-0\",\"ServiceName\":\"rmq\",\"ServiceTags\":[],\"Definition\":{},\"CreateIndex\":562,\"ModifyIndex\":562}]}]"}}
2018-04-05 17:51:18.830 [info] <0.193.0> All discovered existing cluster peers: rabbit@rabbitmq-0
2018-04-05 17:51:18.830 [info] <0.193.0> Peer nodes we can cluster with: rabbit@rabbitmq-0
2018-04-05 17:51:18.867 [warning] <0.193.0> Could not auto-cluster with node rabbit@rabbitmq-0: {badrpc,nodedown}
2018-04-05 17:51:18.867 [warning] <0.193.0> Could not successfully contact any node of: rabbit@rabbitmq-0 (as in Erlang distribution). Starting as a blank standalone node...
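To check which name the broker on rabbitmq-0 actually registered under (as opposed to the address the plugin advertises to Consul), the node can be asked directly with rabbitmqctl; the output below is what I would expect to see if it is running under the short name:

root@rabbitmq-0:/# rabbitmqctl eval 'node().'
rabbit@rabbitmq-0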
The logs seem to indicate that cluster_formation.consul.domain_suffix is not being used. I've verified that the relevant ports (4369, 5672, 25672) are open and that the pods are reachable from each other:
root@rabbitmq-1:/# nmap -p 5672,4369,25672 rabbitmq-0.node.consul
Nmap scan report for rabbitmq-0.node.consul (10.1.1.11)
Host is up (0.00010s latency).
PORT STATE SERVICE
4369/tcp open epmd
5672/tcp open amqp
25672/tcp open unknown
MAC Address: 0A:58:0A:01:01:0B (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.84 seconds
Since the hostnames generated by Kubernetes (e.g., rabbitmq-0, rabbitmq-1, rabbitmq-2) are not network addressable on their own, and nothing in the logs shows the .node.consul suffix being applied, it looks like cluster_formation.consul.domain_suffix is not being picked up as intended.
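As a next step, one way to isolate whether node naming is the only problem would be to try joining by the long name manually (standard rabbitmqctl commands; I haven't confirmed this succeeds against the current short-named nodes):

root@rabbitmq-1:/# rabbitmqctl stop_app
root@rabbitmq-1:/# rabbitmqctl join_cluster rabbit@rabbitmq-0.node.consul
root@rabbitmq-1:/# rabbitmqctl start_app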