AWS Peer Discovery Error


Zachary Smith

Jun 29, 2021, 1:07:40 PM
to rabbitmq-users
I am not sure what is going on. This was working, and now it's not. I am trying to set up a RabbitMQ cluster on EC2 using the AWS peer discovery plugin. My settings are as follows:

rabbitmq.conf

cluster_formation.peer_discovery_backend = aws

# the backend can also be specified using its module name
# cluster_formation.peer_discovery_backend = rabbit_peer_discovery_aws

cluster_formation.aws.region = {{ ec2_region }}
cluster_formation.aws.access_key_id =  {{ aws_access_key }}
cluster_formation.aws.secret_key = {{ aws_secret_key }}

cluster_formation.aws.instance_tags.RabbitRegion = {{ ec2_region }}
cluster_formation.aws.instance_tags.RabbitEnv = {{ aws_env }}
cluster_formation.aws.instance_tags.RabbitGroup = {{ group_names[0] }}

cluster_formation.aws.use_private_ip = true

log.file.level = debug


rabbitmq-env.conf
USE_LONGNAME=true

I push these settings out to 3 servers, run all the necessary steps to enable the plugin, and I am seeing this error:


2021-06-29 16:51:34.919 [info] <0.273.0> Will try to lock with peer discovery backend rabbit_peer_discovery_aws
2021-06-29 16:51:34.920 [debug] <0.273.0> Will use AWS access key of '###############'
2021-06-29 16:51:34.920 [debug] <0.273.0> Setting AWS region to "us-east-1"
2021-06-29 16:51:34.920 [debug] <0.273.0> Setting AWS credentials, access key: '###############'
2021-06-29 16:51:34.920 [debug] <0.273.0> Invoking AWS request {Service: "ec2"; Path: "/?Action=DescribeInstances&Filter.1.Name=tag%3ARabbitEnv&Filter.1.Value.1=prep-r&Filter.2.Name=tag%3ARabbitGroup&Filter.2.Value.1=prep_rabbitmq&Filter.3.Name=tag%3ARabbitRegion&Filter.3.Value.1=us-east-1&Version=2015-10-01"}...
2021-06-29 16:51:34.920 [debug] <0.273.0> Making sure AWS credentials are available and still valid.
2021-06-29 16:51:34.984 [debug] <0.273.0> AWS request: GETS ALL THE INFORMATION

2021-06-29 16:52:40.257 [error] <0.273.0> Failed to lock with peer discovery backend rabbit_peer_discovery_aws: "Local node rabbit@MYHOSTNAME is not part of discovered nodes ['rab...@10.32.34.125','rab...@10.32.33.63','rab...@10.32.32.45']"
2021-06-29 16:52:40.258 [debug] <0.273.0> rabbit_peer_discovery:lock returned {error,"Local node  rabbit@MYHOSTNAME   is not part of discovered nodes ['rab...@10.32.34.125','rab...@10.32.33.63','rab...@10.32.32.45']"}


I can telnet to port 4369 without any issues:

for i in 10.32.34.125 10.32.33.63 10.32.32.45; do echo 'quit' | telnet $i 4369; done
Trying 10.32.34.125...
Connected to 10.32.34.125.
Escape character is '^]'.
Connection closed by foreign host.
Trying 10.32.33.63...
Connected to 10.32.33.63.
Escape character is '^]'.
Connection closed by foreign host.
Trying 10.32.32.45...
Connected to 10.32.32.45.
Escape character is '^]'.
Connection closed by foreign host.
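A successful connect on 4369 only shows that epmd is reachable; inter-node (clustering) traffic also uses the Erlang distribution port, 25672 by default. A small sketch that checks both ports, using bash's /dev/tcp pseudo-device -- the `check_port` helper here is illustrative, not a RabbitMQ tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a TCP port accepts connections.
# Uses bash's /dev/tcp pseudo-device; 'timeout' bounds slow connects.
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c ": < /dev/tcp/$host/$port" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Check both epmd (4369) and the distribution port (25672) on each peer.
for i in 10.32.34.125 10.32.33.63 10.32.32.45; do
  for p in 4369 25672; do
    echo "$i:$p $(check_port "$i" "$p")"
  done
done
```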

Has anyone ever seen this error? I have looked for several hours and cannot find a reference.




Zachary Smith

Jun 29, 2021, 1:09:02 PM
to rabbitmq-users
Version: rabbitmq-server-3.8.18-1.el8.noarch
OS: Red Hat Enterprise Linux release 8.4 (Ootpa)

M K

Jun 30, 2021, 1:44:17 AM
to rabbitmq-users
Hi Zachary,

Using private IPs instead of hostnames is a rare thing to do. Try using private hostnames while we try to reproduce. Most likely the most recent revision of the plugin assumes that hostnames are used everywhere: its test suite and every manual test used hostnames as well.

M K

Jun 30, 2021, 1:48:46 AM
to rabbitmq-users
Indeed, we found a function that asserts that our own node is on the list of discovered peers, but it assumes that discovered nodes use hostnames and not IP addresses.

Switching to using private DNS entries (which is the default for the AWS peer discovery plugin) should help.

M K

Jun 30, 2021, 2:15:27 AM
to rabbitmq-users
So, to make it extra clear: either drop

cluster_formation.aws.use_private_ip

from rabbitmq.conf or set it to `false` (which is the default), and the check should succeed.

An IP address-aware version will ship in 3.8.19.
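As a minimal illustration of that fix, the cluster formation section of rabbitmq.conf would then look like this (a sketch reusing the tag names from the original post; the `{{ }}` values are template placeholders):

```
cluster_formation.peer_discovery_backend = aws

cluster_formation.aws.region = {{ ec2_region }}
cluster_formation.aws.access_key_id = {{ aws_access_key }}
cluster_formation.aws.secret_key = {{ aws_secret_key }}

cluster_formation.aws.instance_tags.RabbitRegion = {{ ec2_region }}
cluster_formation.aws.instance_tags.RabbitEnv = {{ aws_env }}
cluster_formation.aws.instance_tags.RabbitGroup = {{ group_names[0] }}

# use_private_ip removed: the default (false) uses private DNS names,
# which is what the peer discovery lock check expects in 3.8.18
```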

Zachary Smith

Jun 30, 2021, 10:24:07 AM
to rabbitm...@googlegroups.com
So I did a couple of things to correct this: I removed the private IP option, and I also downgraded to version 3.8.16, as that is the version running in our other environments. I will update you shortly. Thanks for the feedback.

Regards,
Zachary


david....@gmx.de

Jun 30, 2021, 11:04:38 AM
to rabbitmq-users
Hi Zachary,

We could reproduce this issue on 3.8.18.

However, I also tested it on 3.8.17 where nodes boot successfully but cluster formation still fails (i.e. every node starts as standalone):

rabbitmqctl cluster_status prints a single node (instead of 3), and the startup logs show:
** Cannot get connection id for node 'rabbit@ip-172-xx-xx-xx.<region>.compute.internal'
2021-06-30 12:56:22.291 [warning] <0.273.0> Could not auto-cluster with node rab...@172.xx-xx-xx: {badrpc,nodedown}
2021-06-30 12:56:22.293 [warning] <0.273.0> Could not auto-cluster with node rab...@172.xx-xx-xx: {badrpc,nodedown}
2021-06-30 12:56:22.295 [warning] <0.273.0> Could not auto-cluster with node rab...@172.xx-xx-xx: {badrpc,nodedown}
2021-06-30 12:56:22.295 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 0 retries left...
2021-06-30 12:56:22.796 [warning] <0.273.0> Could not successfully contact any node of: rab...@172.xx-xx-xx,rab...@172.xx-xx-xx,rab...@172.xx-xx-xx (as in Erlang distribution). Starting as a blank standalone node...

My questions are:
1. What is your use case to use cluster_formation.aws.use_private_ip = true?
2. Do you set RABBITMQ_NODENAME? If so, to what value: FQDN or IP address?
3. In 3.8.16: What is the output of rabbitmqctl cluster_status and what do the startup logs show?

Zachary Smith

Jun 30, 2021, 12:05:56 PM
to rabbitmq-users
David,

I am seeing the same error as well:

2021-06-30 15:54:50.924 [warning] <0.273.0> Could not auto-cluster with node rab...@ip-xx-xx-32-53.xx.xxxxxx-1.xxxxx.xxxxxx: {badrpc,nodedown}
2021-06-30 15:54:50.928 [warning] <0.273.0> Could not auto-cluster with node rab...@ip-xx-xx-33-64.xx.xxxxxx-1.xxxxx.xxxxxx: {badrpc,nodedown}
2021-06-30 15:54:50.933 [warning] <0.273.0> Could not auto-cluster with node rab...@ip-xx-xx-34-62.xx.xxxxxx-1.xxxxx.xxxxxx: {badrpc,nodedown}
2021-06-30 15:54:50.933 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 0 retries left...
2021-06-30 15:54:51.434 [warning] <0.273.0> Could not successfully contact any node of: rab...@ip-xx-xx-32-53.xx.xxxxxx-1.xxxxx.xxxxxx,rab...@ip-xx-xx-33-64.xx.xxxxxx-1.xxxxx.xxxxxx,rab...@ip-xx-xx-34-62.xx.xxxxxx-1.xxxxx.xxxxxx (as in Erlang distribution). Starting as a blank standalone node...

As far as your questions:
1. What is your use case to use cluster_formation.aws.use_private_ip = true?

I don't have any specific reason; I think this was to get around name resolution of AWS-assigned DNS names (this has since been corrected).

2. Do you set RABBITMQ_NODENAME? If so, to what value: FQDN or IP address?

We do not. I do pass the option:

rabbitmq-env.conf
USE_LONGNAME=true

This is more for visibility, as we have hosts with similar names running in various data centers.


3. In 3.8.16: What is the output of rabbitmqctl cluster_status and what do the startup logs show?

[root@{{ hostname }}01 ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@{{ hostname }}01.xx.xxxxxx.xxx ...
Basics

Cluster name: rabbit@{{ hostname }}01.xx.xxxxxx.xxx

Disk Nodes

rabbit@{{ hostname }}01.xx.xxxxxx.xxx
rabbit@{{ hostname }}02.xx.xxxxxx.xxx
rabbit@{{ hostname }}03.xx.xxxxxx.xxx

Running Nodes

rabbit@{{ hostname }}01.xx.xxxxxx.xxx
rabbit@{{ hostname }}02.xx.xxxxxx.xxx
rabbit@{{ hostname }}03.xx.xxxxxx.xxx

Versions

rabbit@{{ hostname }}01.xx.xxxxxx.xxx: RabbitMQ 3.8.16 on Erlang 24.0.3
rabbit@{{ hostname }}02.xx.xxxxxx.xxx: RabbitMQ 3.8.16 on Erlang 24.0.3
rabbit@{{ hostname }}03.xx.xxxxxx.xxx: RabbitMQ 3.8.16 on Erlang 24.0.3

Maintenance status

Node: rabbit@{{ hostname }}01.xx.xxxxxx.xxx, status: not under maintenance
Node: rabbit@{{ hostname }}02.xx.xxxxxx.xxx, status: not under maintenance
Node: rabbit@{{ hostname }}03.xx.xxxxxx.xxx, status: not under maintenance

Alarms

(none)

Network Partitions

(none)

Listeners

Node: rabbit@{{ hostname }}01.xx.xxxxxx.xxx, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@{{ hostname }}01.xx.xxxxxx.xxx, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@{{ hostname }}01.xx.xxxxxx.xxx, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@{{ hostname }}02.xx.xxxxxx.xxx, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@{{ hostname }}02.xx.xxxxxx.xxx, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@{{ hostname }}02.xx.xxxxxx.xxx, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@{{ hostname }}03.xx.xxxxxx.xxx, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@{{ hostname }}03.xx.xxxxxx.xxx, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@{{ hostname }}03.xx.xxxxxx.xxx, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0

Feature flags

Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled


I am not sure if I need a new thread for this, but I am getting some strange behavior when offloading SSL on an AWS ALB. I am redirecting port 443 to 15672 on the back end. The targets are healthy. Additionally, I sometimes get this when going directly to the instances. Logs are running at debug, and I can't seem to see any requests.
rabbitmq-alb-error.jpg
rabbitmq-screen.jpg

david....@gmx.de

Jun 30, 2021, 1:16:29 PM
to rabbitmq-users
Hey Zachary,

Yes, please open a new thread for the AWS ALB SSL issue; otherwise it gets confusing here.

Focusing on the cluster_formation.aws.use_private_ip = true issue:
Thanks for providing the answers.
It seems you don't have to set cluster_formation.aws.use_private_ip = true at all, which is good!

But I'm still interested in how that "feature" ever worked prior to 3.8.18.

2021-06-30 15:54:50.924 [warning] <0.273.0> Could not auto-cluster with node rab...@ip-xx-xx-32-53.xx.xxxxxx-1.xxxxx.xxxxxx: {badrpc,nodedown}
2021-06-30 15:54:50.928 [warning] <0.273.0> Could not auto-cluster with node rab...@ip-xx-xx-33-64.xx.xxxxxx-1.xxxxx.xxxxxx: {badrpc,nodedown}
2021-06-30 15:54:50.933 [warning] <0.273.0> Could not auto-cluster with node rab...@ip-xx-xx-34-62.xx.xxxxxx-1.xxxxx.xxxxxx: {badrpc,nodedown}
2021-06-30 15:54:50.933 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 0 retries left...
2021-06-30 15:54:51.434 [warning] <0.273.0> Could not successfully contact any node of: rab...@ip-xx-xx-32-53.xx.xxxxxx-1.xxxxx.xxxxxx,rab...@ip-xx-xx-33-64.xx.xxxxxx-1.xxxxx.xxxxxx,rab...@ip-xx-xx-34-62.xx.xxxxxx-1.xxxxx.xxxxxx (as in Erlang distribution). Starting as a blank standalone node..

Actually, the output you pasted above is expected on the seed node (i.e. the node which first forms the cluster). It is not an error, even though the log level falsely suggests so (we changed that log level in 3.8.18). There should be only one node logging "Starting as a blank standalone node". The cluster status you pasted also shows that the nodes clustered correctly (since 3 nodes are listed).
When I tried it out, my issue was that this message is logged on all three nodes, which results in each node starting a standalone cluster instead of clustering with the others.
Also, in my case the log
2021-06-30 14:59:53.670 [warning] <0.273.0> Could not auto-cluster with node rabbit@172.11.11.1: {badrpc,nodedown}
is output, which shows that the node tries to contact its peers by IP address in the node name (matching my expectation for the cluster_formation.aws.use_private_ip = true "feature").
In your output, however, it contacts peers by host name in the node name, as shown by the "ip-" prefix in
2021-06-30 15:54:50.924 [warning] <0.273.0> Could not auto-cluster with node rab...@ip-xx-xx-32-53.xx.xxxxxx-1.xxxxx.xxxxxx: {badrpc,nodedown}

david....@gmx.de

Jul 1, 2021, 4:42:33 AM
to rabbitmq-users
I re-tested with 3.8.16.

Clustering with cluster_formation.aws.use_private_ip = true works only if:
1. long node names are used (i.e. USE_LONGNAME=true), and
2. on all nodes, the host name part of the node name is the private IP address (instead of the private DNS name)

Example:

ubuntu@ip-172-31-1-1:/etc/rabbitmq$ cat rabbitmq-env.conf
USE_LONGNAME=true

ubuntu@ip-172-31-1-1:/etc/rabbitmq$ cat rabbitmq.conf
log.file.level = debug

cluster_formation.peer_discovery_backend = aws
cluster_formation.aws.region = <region>
cluster_formation.aws.access_key_id = <access key>
cluster_formation.aws.secret_key = <secret key>
cluster_formation.aws.use_autoscaling_group = false
cluster_formation.aws.instance_tags.mykey = myvalue
cluster_formation.aws.use_private_ip = true


The same configuration also works in 3.8.18.

What changes in 3.8.18 compared to prior RabbitMQ versions is this: if you don't specify the NODENAME (meaning the host name part of the node name becomes the private DNS name, for example rabbit@ip-172-31-1-1.<region>.compute.internal instead of rab...@172.31.1.1), then in 3.8.18 node boot fails (with the error message in your first message), whereas in prior RabbitMQ versions nodes still boot "successfully" but fail to cluster, resulting in standalone clusters. In that regard, I prefer the behaviour in 3.8.18: make node boot fail if it can't cluster rather than boot but silently fail to cluster. So, there are no breaking changes introduced in 3.8.18.
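For completeness, if you do want IP-based node names together with use_private_ip = true, the corresponding rabbitmq-env.conf might look like the following sketch. The address shown is a placeholder: NODENAME must match the instance's actual private IP, so in practice it would be filled in per host (e.g. from instance metadata by your provisioning tool).

```
# rabbitmq-env.conf -- illustrative sketch, one file per host
USE_LONGNAME=true
# Pin the host part of the node name to this instance's private IP,
# so it matches what the AWS backend discovers with use_private_ip = true.
NODENAME=rabbit@172.31.1.1
```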