RabbitMQ gets stuck on start up


Victor Gualdras de la Cruz

Sep 2, 2020, 3:34:18 AM
to rabbitmq-users
Hi Everyone,

When provisioning a 3-node cluster in Kubernetes with RabbitMQ 3.8.7 and Erlang 23.0.3, sometimes 1 of the 3 nodes gets stuck during start up. When I try to run any command using rabbitmqctl, I get a message saying that the application is stopped and that I should start it; however, when I start it, nothing really happens. Restarting the server does work, but it is breaking our automation. The same process worked before with RabbitMQ 3.8.3 and Erlang 22. Attached are the logs, which I have scrubbed a bit, but these are the last few log lines before it gets stuck:

2020-09-01 14:26:38.764 [debug] <0.284.0> Feature flags: registry refresh needed: yes, list of feature flags differs
2020-09-01 14:26:38.764 [debug] <0.284.0> Feature flags: (re)initialize registry (<0.284.0>)
2020-09-01 14:26:38.764 [info] <0.284.0> Feature flags: list of feature flags found:
2020-09-01 14:26:38.764 [info] <0.284.0> Feature flags: [ ] drop_unroutable_metric
2020-09-01 14:26:38.765 [info] <0.284.0> Feature flags: [ ] empty_basic_get_metric
2020-09-01 14:26:38.765 [info] <0.284.0> Feature flags: [ ] implicit_default_bindings
2020-09-01 14:26:38.765 [info] <0.284.0> Feature flags: [ ] quorum_queue
2020-09-01 14:26:38.765 [info] <0.284.0> Feature flags: [ ] virtual_host_metadata
2020-09-01 14:26:38.765 [info] <0.284.0> Feature flags: feature flag states written to disk: yes
2020-09-01 14:26:38.774 [debug] <0.284.0> Feature flags: registry module ready, loading it (<0.284.0>)...


It seems to get stuck when loading the feature flags registry module, but I'm not sure whether it is an error on my side or an issue in the RabbitMQ code, and I don't know where to look. Let me know if you need any specific configuration value.
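For automation that trips over a node stuck like this, one possible stopgap is to poll the node with a deadline and let the caller decide to restart it when the deadline passes. The sketch below is only an illustration, not something proposed in this thread: it assumes the `rabbitmq-diagnostics check_running` health check is available on the node, and the function names, timeouts and intervals are hypothetical.

```python
import subprocess
import time
from typing import Callable


def node_is_running(timeout_s: float = 10.0) -> bool:
    """Return True if `rabbitmq-diagnostics check_running` exits with 0.

    A probe that hangs or cannot be executed is treated as "not running".
    """
    try:
        result = subprocess.run(
            ["rabbitmq-diagnostics", "-q", "check_running"],
            capture_output=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def wait_for_boot(
    probe: Callable[[], bool],
    deadline_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or `deadline_s` seconds have passed.

    Returns True if the node came up in time, False if it looks stuck.
    The deadline and interval are illustrative values, not recommendations.
    """
    start = time.monotonic()
    while True:
        if probe():
            return True
        if time.monotonic() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A provisioning script could call `wait_for_boot(node_is_running, deadline_s=300)` and restart the node only when it returns `False`. This works around the symptom and does not address the cause.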

Regards,
Victor.
rabbit.log

Luke Bakken

Sep 2, 2020, 8:37:28 AM
to rabbitmq-users
Hi Victor,

This might be the issue I fixed here - https://github.com/rabbitmq/rabbitmq-server/issues/2437

Which plugins do you have enabled?

Thanks -
Luke

Victor Gualdras de la Cruz

Sep 2, 2020, 9:49:04 AM
to rabbitmq-users
Hi Luke,

This is the list. However, I must say there are no messages in the queues, as this happens the first time we provision a cluster and is fixed by a server restart.

rabbitmq_delayed_message_exchange 3.8.0
[E*] rabbitmq_jms_topic_exchange       3.8.7
[E*] rabbitmq_management               3.8.7
[e*] rabbitmq_management_agent         3.8.7
[E*] rabbitmq_mqtt                     3.8.7
[e*] rabbitmq_peer_discovery_common    3.8.7
[E*] rabbitmq_peer_discovery_k8s       3.8.7
[E*] rabbitmq_prometheus               3.8.7
[E*] rabbitmq_shovel                   3.8.7
[E*] rabbitmq_shovel_management        3.8.7
[E*] rabbitmq_stomp                    3.8.7
[e*] rabbitmq_web_dispatch             3.8.7

Regards,
Victor.

Luke Bakken

Sep 2, 2020, 11:06:46 AM
to rabbitmq-users
Hi Victor,

I'm not certain that messages are required to reproduce this issue. They may only have been required because of the speed of my local environment. Version 3.8.8 will include the fix, so if you could test that version we would appreciate it.

Thanks,
Luke

Victor Gualdras de la Cruz

Sep 3, 2020, 3:43:06 AM
to rabbitmq-users
Hi Luke,

I don't have the capacity at the moment to build the source code and deploy it in the same environment, but I will be able to test it once it is available on packagecloud. Could I also ask whether this is a regression introduced at some point? We provision this in the same way and in the same environment as we do for RabbitMQ 3.8.3 and Erlang 22, where we see no issues; here, however, around 1/3 of the new cluster provisions hit this error.

Regards,
Victor. 

Michael Klishin

Sep 3, 2020, 1:29:57 PM
to rabbitmq-users
3.8.4 changed a lot of things around boot, and 3.8.6 moved plugin activation to the very end, which coincides
with queue contents recovery. For some environments this can result in temporary resource contention.

The fix for the issue Luke linked introduces a timeout for an internal management plugin operation where there previously was none.

Victor Gualdras de la Cruz

Sep 7, 2020, 8:06:44 AM
to rabbitmq-users
Hi Michael,

Thanks for the insight. I've done further testing to try to figure out a way forward, and it seems the problem is somehow related to Erlang 23 rather than to the RabbitMQ version. I ran automated tests provisioning clusters with 3.8.7 and Erlang 23.0.3, and 4 out of 30 provisions got stuck. Doing the same with Erlang 22.3.4, we didn't see any problem in 50 provisions. I hope this helps you understand whether the issue is related or whether it could be something else.
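As a rough check on whether 50 clean provisions could be luck, the figures above can be plugged into a simple calculation. This is a minimal sketch that assumes each provision gets stuck independently with a fixed probability equal to the rate observed on Erlang 23.0.3; the variable names are illustrative.

```python
# Figures reported above.
stuck_erlang_23 = 4   # stuck provisions with RabbitMQ 3.8.7 / Erlang 23.0.3
runs_erlang_23 = 30   # total provisions with Erlang 23.0.3
runs_erlang_22 = 50   # provisions with Erlang 22.3.4, none of them stuck

# Observed per-provision stuck rate on Erlang 23.0.3.
stuck_rate = stuck_erlang_23 / runs_erlang_23

# Assumption: provisions are independent and share that same rate.
# Probability that all 50 Erlang 22.3.4 provisions would then be clean.
p_all_clean = (1 - stuck_rate) ** runs_erlang_22

print(f"observed stuck rate on Erlang 23.0.3: {stuck_rate:.1%}")
print(f"chance of {runs_erlang_22} clean provisions at that rate: {p_all_clean:.2%}")
```

Under that assumption the chance of 50 clean runs comes out well under 1%, so the difference between the two Erlang versions is unlikely to be noise.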

I was trying to move to Erlang 23 because Luke said in a different thread that there was an improvement in Erlang distributed communication, but I will hold off for now based on this. I'll try again with Erlang 23 once 3.8.8 is out.

Regards,
Victor.

Gerhard Lazu

Sep 7, 2020, 10:25:17 AM
to rabbitmq-users
Erlang 23 now respects CPU quotas (a.k.a. "container friendly" features): http://blog.erlang.org/OTP-23-Highlights/#container-friendly-features

What CPU limits are your RabbitMQ nodes assigned?
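To make the question concrete, here is a minimal sketch of how a container CPU quota maps to a usable CPU count. It assumes a cgroup v2 host, where the quota is exposed in `/sys/fs/cgroup/cpu.max` as `<quota> <period>` or `max <period>`. The helper names are hypothetical, and rounding the quota up to a whole CPU is an assumption made for illustration, not a statement of how the Erlang runtime computes its scheduler count.

```python
import math
from typing import Optional


def effective_cpus(cpu_max: str, host_cpus: int) -> int:
    """Translate the content of a cgroup v2 `cpu.max` file into a CPU count.

    `cpu_max` looks like "50000 100000" (quota and period in microseconds)
    or "max 100000" when no quota is set.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return host_cpus  # no quota: all host CPUs are usable
    # Assumption: round the fractional CPU allowance up, with a floor of 1.
    limit = math.ceil(int(quota) / int(period))
    return max(1, min(host_cpus, limit))


def read_cpu_max(path: str = "/sys/fs/cgroup/cpu.max") -> Optional[str]:
    """Return the raw `cpu.max` content, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None
```

For example, a Kubernetes CPU limit of `500m` typically shows up as `50000 100000`, which this sketch maps to a single CPU even on a host with many cores. A runtime that respects the quota would then run with far fewer schedulers than one that only looks at the host CPU count.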
