With autocluster plugin enabled, RabbitMQ server takes about 4-5 mins to start

Harish Vishwanath

unread,

May 21, 2017, 11:17:38 PM5/21/17

to rabbitmq-users

Hello all

I am trying to setup RMQ cluster with autocluster plugin. I have tried both with autocluster v0.6.1 and v0.7.0. The issue I am seeing is that rmq server takes about 4-5 minutes to come up.

Below are the details:

I am running rmq 3.6.9.

List of plugins enabled:

root@rabbitmq-0:/usr/lib/rabbitmq/bin# rabbitmq-plugins list

Configured: E = explicitly enabled; e = implicitly enabled

| Status: [failed to contact rab...@10.168.172.232 - status not shown]

|/

[e ] amqp_client 3.6.9

[E ] autocluster 0.6.1

[e ] cowboy 1.0.4

[e ] cowlib 1.0.2

[ ] rabbitmq_amqp1_0 3.6.9

[ ] rabbitmq_auth_backend_ldap 3.6.9

[ ] rabbitmq_auth_mechanism_ssl 3.6.9

[e ] rabbitmq_aws 0.1.2

[ ] rabbitmq_consistent_hash_exchange 3.6.9

[ ] rabbitmq_event_exchange 3.6.9

[E ] rabbitmq_federation 3.6.9

[E ] rabbitmq_federation_management 3.6.9

[ ] rabbitmq_jms_topic_exchange 3.6.9

[E ] rabbitmq_management 3.6.9

[e ] rabbitmq_management_agent 3.6.9

[ ] rabbitmq_management_visualiser 3.6.9

[ ] rabbitmq_mqtt 3.6.9

[ ] rabbitmq_recent_history_exchange 3.6.9

[ ] rabbitmq_sharding 3.6.9

[E ] rabbitmq_shovel 3.6.9

[E ] rabbitmq_shovel_management 3.6.9

[ ] rabbitmq_stomp 3.6.9

[ ] rabbitmq_top 3.6.9

[ ] rabbitmq_tracing 3.6.9

[ ] rabbitmq_trust_store 3.6.9

[e ] rabbitmq_web_dispatch 3.6.9

[ ] rabbitmq_web_mqtt 3.6.9

[ ] rabbitmq_web_mqtt_examples 3.6.9

[ ] rabbitmq_web_stomp 3.6.9

[ ] rabbitmq_web_stomp_examples 3.6.9

[ ] sockjs 0.3.4

This is my env and config:

# config

root@rabbitmq-0:/usr/lib/rabbitmq/bin# cat /etc/rabbitmq/rabbitmq.config

[

{rabbit, [

{tcp_listen_options, [

{backlog, 128},

{nodelay, true},

{linger, {true,0}},

{exit_on_close, false},

{sndbuf, 12000},

{recbuf, 12000}

]},

{log_levels, [{autocluster, debug}, {connection, info}]},

{loopback_users, []}

]},

{autocluster, [

{dummy_param_without_comma, true},

{autocluster_log_level, debug},

{backend, etcd},

{autocluster_failure, stop},

{cleanup_interval, 30},

{cluster_cleanup, true},

{cleanup_warn_only, false},

{etcd_ttl, 30},

{etcd_scheme, http},

{etcd_host, "etcd.kube-system.svc.cluster.local"},

{etcd_port, 2379}

]}

].

# Env

root@rabbitmq-0:/usr/lib/rabbitmq/bin# cat /etc/rabbitmq/rabbitmq-env.conf

USE_LONGNAME=true

RABBITMQ_USE_LONGNAME=true

# k8s will inject $POD_IP

NODENAME=rabbit@$POD_IP

Starting the server essentially hangs (the broker doesn't come up for 4-5 minutes):

root@rabbitmq-0:/usr/lib/rabbitmq/bin# ./rabbitmq-server start

root@rabbitmq-0:/usr/lib/rabbitmq/bin# rabbitmqctl status

Status of node 'rab...@10.168.172.232' ...

[{pid,9882},

{running_applications,[{ssl,"Erlang/OTP SSL application","7.3"},

{public_key,"Public key infrastructure","1.1.1"},

{cowlib,"Support library for manipulating Web protocols.",

"1.0.2"},

{crypto,"CRYPTO","3.6.3"},

{mnesia,"MNESIA CXC 138 12","4.13.3"},

{asn1,"The Erlang ASN1 compiler version 4.0.2",

"4.0.2"},

{inets,"INETS CXC 138 49","6.2"},

{xmerl,"XML parser","1.3.10"},

{sasl,"SASL CXC 138 11","2.7"},

{stdlib,"ERTS CXC 138 10","2.8"},

{kernel,"ERTS CXC 138 10","4.2"}]},

{os,{unix,linux}},

{erlang_version,"Erlang/OTP 18 [erts-7.3] [source] [64-bit] [smp:8:8] [async-threads:128] [kernel-poll:true]\n"},

{memory,[{total,130689648},

{connection_readers,0},

{connection_writers,0},

{connection_channels,0},

{connection_other,0},

{queue_procs,0},

{queue_slave_procs,0},

{plugins,0},

{other_proc,10493288},

{mnesia,59296},

{metrics,0},

{mgmt_db,0},

{msg_index,0},

{other_ets,1714320},

{binary,94528},

{code,13998139},

{atom,553569},

{other_system,103776508}]},

{alarms,[]},

{listeners,[]},

{processes,[{limit,262144},{used,82}]},

{run_queue,0},

{uptime,16},

{kernel,{net_ticktime,60}}]

As you can see, the broker app along with the plugins are not up yet. I am suspecting the reason is due to rabbitmq_aws plugin making calls to AWS EC2 instance metadata server, even though the configured backend is etcd.

Here is the tcpdump output, it hangs here till retrys and timeouts are exceeded:

root@rabbitmq-0:/# tcpdump -i eth0 dst 169.254.169.254

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes

03:01:19.720263 IP rabbitmq-0.rabbitmq.maglev-system.svc.cluster.local.38890 > 169.254.169.254.80: Flags [S], seq 4159351748, win 29200, options [mss 1460,sackOK,TS val 222882878 ecr 0,nop,wscale 7], length 0

03:01:19.720947 IP rabbitmq-0.rabbitmq.maglev-system.svc.cluster.local.38890 > 169.254.169.254.80: Flags [.], ack 285812580, win 229, options [nop,nop,TS val 222882878 ecr 873474382], length 0

03:01:19.728853 IP rabbitmq-0.rabbitmq.maglev-system.svc.cluster.local.38890 > 169.254.169.254.80: Flags [P.], seq 0:134, ack 1, win 229, options [nop,nop,TS val 222882880 ecr 873474382], length 134: HTTP: GET /latest/meta-data/placement/availability-zone HTTP/1.1

After this, once the broker starts, autocluster registeres with etcd and cluster comes up, but I am stuck with this initial delay.

I see that there is an issue on rabbitmq_aws repo pertaining to this, but it is closed: https://github.com/gmr/rabbitmq-aws/issues/2.

Appreciate any help!

--

Harish

Michael Klishin

unread,

May 22, 2017, 6:41:37 AM5/22/17

to rabbitm...@googlegroups.com

I already recommended several hypotheses to investigate in

https://github.com/rabbitmq/rabbitmq-autocluster/issues/27#issuecomment-302769520.

I'll copy them here.

if you want to get to the bottom of this, you need to collect as much information as possible, and debug log in this plugin is essential.

There are three reasons that are easy to imagine:

* The startup delay setting is high

* The plugin tries to contact an external service which times out

* Resolution of `etcd.kube-system.svc.cluster.local` takes a while to time out via DNS (and then other resolution mechanisms are tried)

The former is proved wrong by the following log entry:

```

=INFO REPORT==== 17-May-2017::13:39:00 ===

autocluster: Delaying startup for 3453ms.

```

It's just about 3.5 seconds, comparable to ~ 3.7 seconds with 0.6.1.

The 3rd one can be triaged by adding an entry for `etcd.kube-system.svc.cluster.local` to `/etc/hosts` (and using `dig` to resolve that name).

By far the most efficient way to investigate the 2nd hypothesis is to take a traffic capture and inspect it in Wireshark (see e.g. https://www.rabbitmq.com/amqp-wireshark.html).

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

unread,

May 22, 2017, 6:42:52 AM5/22/17

to rabbitm...@googlegroups.com

So according to your findings, a node hangs on

03:01:19.728853 IP rabbitmq-0.rabbitmq.maglev-system.svc.cluster.local.38890 > 169.254.169.254.80: Flags [P.], seq 0:134, ack 1, win 229, options [nop,nop,TS val 222882880 ecr 873474382], length 134: HTTP: GET /latest/meta-data/placement/availability-zone HTTP/1.1

which seems to be an AWS-specific operation:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html

thanks, we will take a look at why that may be happening to a Consul backend
user.

To post to this group, send email to rabbitm...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Diana Corbacho

unread,

May 22, 2017, 7:21:40 AM5/22/17

to rabbitmq-users

The `rabbitmq_aws` application requests the AWS credentials on startup, independently of the backend being selected or not. This was previously reported in https://github.com/gmr/rabbitmq-aws/issues/2

I produced a patch that should delay the request until the AWS plugin is required: https://github.com/rabbitmq/rabbitmq-aws/pull/3

It still needs to be tested on AWS, it will be merged once that is done.

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.

To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

unread,

May 22, 2017, 9:46:15 AM5/22/17

to rabbitm...@googlegroups.com

Hi Harish,

We believe this was addressed in https://github.com/rabbitmq/rabbitmq-autocluster/releases/tag/0.8.0-milestone2.

Can you please give that version a try? We expect a 0.8.0 RC fairly soon (in the next week or two)

but it should be reasonably stable already.

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Alexey Lebedeff

unread,

May 22, 2017, 11:08:56 AM5/22/17

to rabbitm...@googlegroups.com

BTW, https://github.com/aweber/rabbitmq-autocluster/pull/105 was
intended to fix this by removing AWS dependency from the base
'autocluster' plugin and introducing a new 'autocluster_aws' plugin.

>>>> | Status: [failed to contact rab...@10.168.172.232 <javascript:> -

>>>> Status of node 'rab...@10.168.172.232 <javascript:>' ...

>>>> an email to rabbitmq-user...@googlegroups.com <javascript:>.

>>>> To post to this group, send email to rabbitm...@googlegroups.com

>>>> <javascript:>.

Harish Vishwanath

unread,

May 22, 2017, 12:40:13 PM5/22/17

to rabbitmq-users

Thanks all.

yes - the 0.8.0 milestone 2 release works fine.

Reply all

Reply to author

Forward