"Cold start" procedure for broken RMQ cluster


Albert Meyer

Oct 8, 2019, 8:20:54 PM
to rabbitmq-users
At a previous employer we used a "cold start" procedure that we would run when RMQ got hosed. It would delete queues and erlang cookies, or something, and then RMQ would start fresh. Where can I find a document that describes this "cold start" process?

Wesley Peng

Oct 8, 2019, 9:09:53 PM
to rabbitm...@googlegroups.com
Hi,
Could you use Docker to host and run RabbitMQ? That might work better.

Regards.

Michael Klishin

Oct 8, 2019, 10:45:23 PM
to rabbitmq-users
Instead of looking for a "cold start procedure" I'd try to understand what happened, including by posting logs to this list.

To reset a node that is running, use `rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl start_app`. That would wipe out the node's
database and "reload" it.

If the node is not running, move its data directory and restart it. That's it. You don't need to delete the Erlang cookie
unless there are reasons to reset it specifically.
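For instance, on a Debian/Ubuntu package install the default data directory is usually /var/lib/rabbitmq/mnesia (an assumption; confirm the location in your logs or configuration before touching anything):

# running node: wipe and re-initialize its database
rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl start_app

# node that won't start: move the data directory aside and start fresh
service rabbitmq-server stop
mv /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia.bak
service rabbitmq-server start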



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Albert Meyer

Oct 9, 2019, 2:12:41 PM
to rabbitmq-users
Hi Michael,

Thank you for your advice! The trouble started with the notifications.info queue filling up and messages not being consumed. I tried to clear the queue but I think I may have deleted it because I don't see it in the web interface anymore. Do I need to re-create it?

The error in neutron-server.log is "MessageDeliveryFailure: Unable to connect to AMQP server on us01odc-dev1-ctrl1.internal.synopsys.com:5672 after None tries: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications.info' in vhost '/' due to timeout"

After stop, reset and start I seem to have no users. I created an admin user, and also the openstack user that I see in neutron.conf, and then restarted neutron services. Now I see this in the RabbitMQ log (10 lines per second)

Channel error on connection <0.30102.61> (10.195.116.12:50264 -> 10.195.116.10:5672, vhost: '/', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_058bfcfb64a24474916d7523c708d112' in vhost '/'





Michael Klishin

Oct 9, 2019, 2:28:11 PM
to rabbitmq-users
If it does not exist according to the management UI and `rabbitmqctl list_queues`, you have to declare it.
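For example, with the management plugin's `rabbitmqadmin` tool the declaration could look something like this (a sketch: the durability and any optional arguments are assumptions and must match whatever the publishing application declares, or its own declaration will fail with a PRECONDITION_FAILED):

rabbitmqadmin -u admin -p <password> declare queue name=notifications.info durable=false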

The exception you are seeing has a reasonably descriptive message.
I'm not sure what expects the exchange to be there, but most likely another application is trying to process
a backlog of messages and attempts to publish a response while the original counterparty is no longer there.
See [1][2]; [3][4] can help you reduce the probability of a backlog accumulating in memory (that's my guess as to what happened).
[5][6] provide relevant monitoring information.



Albert Meyer

Oct 9, 2019, 4:37:59 PM
to rabbitmq-users
Maybe I'm not very good at setting up RMQ because I see it failing frequently. I have a different RMQ issue in my QA cluster now. I want to dig into these issues and fix them as time permits, but for now I would like to just reset the queues so that openstack can start over. Does anyone have a link to the "cold start" process where you delete files and erlang cookies and start over?

Michael Klishin

Oct 9, 2019, 4:46:45 PM
to rabbitmq-users
We cannot comment on what is going on in your cluster: you have not shared any logs, configuration, or even the RabbitMQ
version used. Applications that use RabbitMQ bear a certain degree of responsibility for what they do and how they react to and recover from various failures.
Please help others help you.

As for "the procedure", see my earlier response. There is no need to delete files unless the node cannot start. Just use `rabbitmqctl` as recommended.
If you do have to move the node's data directory, its location can be found in the log files (around boot time) [1] or in the docs (assuming it hasn't been overridden) [2].
Just move it and start the node; it will initialize a new, blank database. All users, virtual hosts, permissions and so on would have to be set up from scratch.
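For example, on 3.6.x the effective data directory can be found with either of the following (the log path is the Debian/Ubuntu package default and is only a guess for your systems):

grep 'database dir' /var/log/rabbitmq/rabbit@<node>.log
rabbitmqctl eval 'rabbit_mnesia:dir().'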

There is NO reason to delete the Erlang cookie unless the cookie has been compromised or has changed on other nodes.



Albert Meyer

Oct 9, 2019, 5:11:32 PM
to rabbitmq-users
I'm running RabbitMQ 3.6.10 with 3 nodes. I don't have a config file; I think the config comes from Openstack. Here's what I get for status:

root@us01odc-qa-ctrl1:/var/log/rabbitmq# rabbitmqctl cluster_status
Cluster status of node 'rabbit@us01odc-qa-ctrl1'
[{nodes,[{disc,['rabbit@us01odc-qa-ctrl1']},
         {ram,['rabbit@us01odc-qa-ctrl3','rabbit@us01odc-qa-ctrl2']}]},
 {running_nodes,['rabbit@us01odc-qa-ctrl2','rabbit@us01odc-qa-ctrl3',
                 'rabbit@us01odc-qa-ctrl1']},
 {cluster_name,<<"rab...@us01odc-qa-ctrl1.internal.synopsys.com">>},
 {partitions,[]},
 {alarms,[{'rabbit@us01odc-qa-ctrl2',[]},
          {'rabbit@us01odc-qa-ctrl3',[]},
          {'rabbit@us01odc-qa-ctrl1',[]}]}]

root@us01odc-dev1-ctrl2:/var/log/rabbitmq# rabbitmqctl cluster_status
Cluster status of node 'rabbit@us01odc-dev1-ctrl2'
[{nodes,[{disc,['rabbit@us01odc-dev1-ctrl2']}]},
 {running_nodes,['rabbit@us01odc-dev1-ctrl2']},
 {cluster_name,<<"rab...@us01odc-dev1-ctrl2.internal.synopsys.com">>},
 {partitions,[]},
 {alarms,[{'rabbit@us01odc-dev1-ctrl2',[]}]}]

root@us01odc-dev1-ctrl3:/var/log/rabbitmq# rabbitmqctl cluster_status
Cluster status of node 'rabbit@us01odc-dev1-ctrl3'
[{nodes,[{disc,['rabbit@us01odc-dev1-ctrl3']}]},
 {running_nodes,['rabbit@us01odc-dev1-ctrl3']},
 {cluster_name,<<"rab...@us01odc-dev1-ctrl3.internal.synopsys.com">>},
 {partitions,[]},
 {alarms,[{'rabbit@us01odc-dev1-ctrl3',[]}]}]

ctrl1 isn't logging anything for RMQ but the service is running. Here are the logs for 2 and 3 - I get about 10 per second of this:

ctrl2: =ERROR REPORT==== 9-Oct-2019::14:07:21 ===
Channel error on connection <0.1461.0> (10.195.116.11:37366 -> 10.195.116.11:5672, vhost: '/', user: 'openstack'), channel 1:

operation basic.publish caused a channel exception not_found: no exchange 'reply_058bfcfb64a24474916d7523c708d112' in vhost '/'

ctrl3: =ERROR REPORT==== 9-Oct-2019::14:06:38 ===
Channel error on connection <0.4624.0> (10.195.116.11:48202 -> 10.195.116.12:5672, vhost: '/', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_5d64f5d5fb6e4a2a9c133e5a29f7b5e4' in vhost '/'

Albert Meyer

Oct 9, 2019, 5:18:15 PM
to rabbitmq-users
It appears that nodes 2 and 3 have lost their policies. Do I need to re-create them?

root@us01odc-qa-ctrl1:/var/log/rabbitmq# rabbitmqctl list_policies
Listing policies
/       ha-all  all     ^(?!amq\\.).*   {"ha-mode":"all"}       0

root@us01odc-dev1-ctrl2:/var/log/rabbitmq# rabbitmqctl list_policies
Listing policies
root@us01odc-dev1-ctrl2:/var/log/rabbitmq#

root@us01odc-dev1-ctrl3:/var/log/rabbitmq# rabbitmqctl list_policies
Listing policies
root@us01odc-dev1-ctrl3:/var/log/rabbitmq#
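For reference, the policy shown on ctrl1 above would have been created with something along these lines (my reconstruction from the list_policies output, not necessarily the exact original command):

rabbitmqctl set_policy -p / ha-all '^(?!amq\.).*' '{"ha-mode":"all"}'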


Albert Meyer

Oct 9, 2019, 7:54:29 PM
to rabbitmq-users
I joined nodes 2 and 3 back into the cluster, and added the HA policy. Now I don't see any errors; the logs have entries like:

=INFO REPORT==== 9-Oct-2019::16:21:58 ===
connection <0.11321.175> (10.195.92.13:39828 -> 10.195.92.12:5672 - neutron-server:2467701:0d3c3852-3c02-4138-9c55-40ed2b403325): user 'openstack' authenticated and granted access to vhost '/'
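(Roughly, on each of nodes 2 and 3 the rejoin was the standard sequence below; the exact commands I ran may have differed slightly:)

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@<master node>
rabbitmqctl start_app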

The admin display shows the number of "Ready" messages gradually increasing in the notifications.info queue. In syslog I see neutron-linuxbridge-agent.service stopping and starting:

Oct  9 16:45:57 us01odc-qa-ctrl3 systemd[1]: neutron-linuxbridge-agent.service: Scheduled restart job, restart counter is at 464.
Oct  9 16:45:57 us01odc-qa-ctrl3 systemd[1]: Stopped Openstack Neutron Linux Bridge Agent.
Oct  9 16:45:57 us01odc-qa-ctrl3 systemd[1]: Starting Openstack Neutron Linux Bridge Agent...
Oct  9 16:45:57 us01odc-qa-ctrl3 systemd[1]: Started Openstack Neutron Linux Bridge Agent.
Oct  9 16:46:00 us01odc-qa-ctrl3 systemd[1]: neutron-linuxbridge-agent.service: Main process exited, code=exited, status=1/FAILURE
Oct  9 16:46:00 us01odc-qa-ctrl3 systemd[1]: neutron-linuxbridge-agent.service: Failed with result 'exit-code'.
Oct  9 16:46:00 us01odc-qa-ctrl3 systemd[1]: neutron-linuxbridge-agent.service: Service hold-off time over, scheduling restart.

If I stop/start neutron services then the error goes away for a while and then comes back. Do I still have a problem with RMQ, or is the queue just growing because Neutron is failing to consume it? It seems like, once RMQ gets broken, I can never fix it.

If anyone has a link to that "cold start" procedure, please shout it out!

Michael Klishin

Oct 10, 2019, 12:30:30 AM
to rabbitmq-users
Almost certainly your cluster uses the "ignore" partition handling strategy [1][2], so nodes never tried to recover
from a partitioned state. [2] has a section that explains how to verify the effective configuration.
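For example, on 3.6.x the effective value can be checked with the first command below and changed in /etc/rabbitmq/rabbitmq.config (classic Erlang-term format; whether pause_minority or autoheal is appropriate depends on your deployment):

rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'

%% in /etc/rabbitmq/rabbitmq.config:
[{rabbit, [{cluster_partition_handling, pause_minority}]}].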

On an unrelated note: you are running an unsupported version of RabbitMQ [3]. It's been out of any kind of support for about 17 months now.



Albert Meyer

Oct 10, 2019, 2:14:11 PM
to rabbitmq-users
Hi Michael,

It looks like joining the nodes back into the cluster, creating the users and adding the HA policy fixed RMQ; after that I had unrelated neutron issues. Thanks for your help!

Albert Meyer

Oct 10, 2019, 2:36:37 PM
to rabbitmq-users
My Openstack cluster is working now, but I see >1000 "Ready" messages, and the number is gradually increasing. Before RMQ got messed up, this "Ready" number was always 0 or 1. Is this a problem? Do I need to clear out the "Ready" messages to avoid future trouble?

Albert Meyer

Oct 10, 2019, 3:19:09 PM
to rabbitmq-users
Also in the admin display I see +0 +2 next to the notifications.info queue, indicating no synchronized mirrors, and 2 unsynchronized mirrors. How can I synchronize this queue?

notifications.info us01odc-qa-ctrl1 +0 +2 ha-all running 1,529 0 1,529 0.00/s
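(From the mirrored-queue documentation it looks like synchronisation can be triggered manually with something like the command below; I haven't tried it yet, so treat it as a guess:)

rabbitmqctl sync_queue -p / notifications.info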

Michael Klishin

Oct 10, 2019, 4:13:03 PM
to rabbitmq-users
Apologies for my bluntness, but you need to learn the basics of RabbitMQ monitoring or these "mess up" events will continue. Not understanding
the root cause while reaching for quick fixes is a very risky approach to managing infrastructure.

Your idea of "clearing ready messages" to "fix something" is like asking: we have these 10K rows in this MySQL table, something is off, should we drop the table?
The answer is: I don't know. Messages pile up because they are not consumed. They can also be consumed but not acknowledged,
which leads to problems even faster since there is a memory cost for unacknowledged messages.

[1][2][3] should provide an overview of the basics. The queue page has a metric for the number of consumers on that queue. If it's zero, perhaps that's something to investigate.
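The same numbers are available from the command line, e.g.:

rabbitmqctl list_queues name consumers messages_ready messages_unacknowledged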

Queue length can be limited for one, a group of, or even all queues using a policy [4]. In the short term you can use that as a circuit breaker.
In the long term you must understand why some applications no longer consume messages when they previously did (the Ready metric was hovering near zero).
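As a sketch, a length limit on the notification queues could look something like the following (the pattern and limit are placeholders; note that only one policy applies to a given queue, which is why the ha-mode setting from your existing policy is folded into the same definition):

rabbitmqctl set_policy --apply-to queues notif-limit '^notifications\.' '{"max-length":10000,"ha-mode":"all"}'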



Albert Meyer

Oct 10, 2019, 5:34:39 PM
to rabbitmq-users
The notifications.info queue has no consumers, but I didn't set up the consumers; I just installed OpenStack and RMQ was included. How can I find out why the queue has no consumers?

Albert Meyer

Oct 10, 2019, 5:42:48 PM
to rabbitmq-users
I tried stopping and starting the RabbitMQ service and it failed to start on node 1. This is the error:

=ERROR REPORT==== 10-Oct-2019::14:40:06 ===
Mnesia('rabbit@us01odc-dev1-ctrl1'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@us01odc-dev1-ctrl1_1570_743606_270473")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec: 'rabbit@us01odc-dev1-ctrl1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@us01odc-dev1-ctrl2','rabbit@us01odc-dev1-ctrl3','rabbit@us01odc-dev1-ctrl1'],[],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1569272071822073672,-576460752303418430,1},'rabbit@us01odc-dev1-ctrl1'},{{4,0},{'rabbit@us01odc-dev1-ctrl2',{1569,272089,126714}}}}, 'rabbit@us01odc-dev1-ctrl2' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@us01odc-dev1-ctrl3','rabbit@us01odc-dev1-ctrl2','rabbit@us01odc-dev1-ctrl1'],[],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1570642920543671197,-576460752303411647,1},'rabbit@us01odc-dev1-ctrl1'},{{4,0},{'rabbit@us01odc-dev1-ctrl3',{1570,657303,396912}}}}

Albert Meyer

Oct 10, 2019, 6:18:52 PM
to rabbitmq-users
This is the process I used to "cold start" my dev cluster. After this, RMQ is working properly and the notifications.info queue doesn't exist. I haven't tried this in QA yet; I'd prefer to solve the underlying issue. How can I find out why this notifications.info queue is being created with no consumers?

On master:
service rabbitmq-server stop
ps auxw | grep rabbit
(kill any rabbit processes)
rm -rf /var/lib/rabbitmq/mnesia/*
service rabbitmq-server start
rabbitmqctl add_user admin <password>
rabbitmqctl set_user_tags admin administrator
rabbitmqctl set_permissions -p / admin ".*" ".*" ".*"
rabbitmqctl add_user openstack <password>
rabbitmqctl set_permissions -p / openstack ".*" ".*" ".*"
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}'
rabbitmqctl list_policies

On slaves:
rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl start_app
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@<master>
rabbitmqctl start_app

Albert Meyer

Oct 10, 2019, 6:32:58 PM
to rabbitmq-users
The "cold start" was only a temporary solution. After a few minutes the notifications.info queue is created again, with no consumers, and it starts to fill up. When I look at the messages, I see that they are openstack operations. I need to figure out why nothing is consuming these messages:

{"oslo.message": "{\"_context_domain\": null, \"_context_request_id\": \"req-48a63d06-6652-4535-838e-10851421b8db\", \"_context_global_request_id\": \"req-5b7627ff-ca68-4716-a140-0dd5cf410ebe\", \"_context_roles\": [\"admin\", \"member\", \"reader\"], \"event_type\": \"port.create.start\", \"_context_tenant_name\": \"itcloud\", \"timestamp\": \"2019-10-10 22:22:40.344744\", \"_context_user\": \"2cb6757679d54a69803a5b6e317b3a93\", \"_unique_id\": \"d58be5d9f1c24861bf37d77fba0036c5\", \"_context_resource_uuid\": null, \"_context_tenant_id\": \"474ae347d8ad426f8118e55eee47dcfd\", \"_context_is_admin_project\": true, \"_context_user_id\": \"2cb6757679d54a69803a5b6e317b3a93\", \"payload\": {\"port\": {\"network_id\": \"90dee5fa-5f58-4b2d-967e-14dbc7d3748a\", \"tenant_id\": \"474ae347d8ad426f8118e55eee47dcfd\", \"device_id\": \"8ede3499-e70b-48c6-b143-be51881b80e5\", \"admin_state_up\": true}}, \"_context_project_name\": \"itcloud\", \"_context_system_scope\": null, \"_context_user_identity\": \"2cb6757679d54a69803a5b6e317b3a93 474ae347d8ad426f8118e55eee47dcfd - default 7d3a4deab35b434bba403100a6729c81\", \"_context_auth_token\": \"<deleted>\", \"_context_show_deleted\": false, \"_context_tenant\": \"474ae347d8ad426f8118e55eee47dcfd\", \"priority\": \"INFO\", \"_context_read_only\": false, \"_context_is_admin\": true, \"_context_project_id\": \"474ae347d8ad426f8118e55eee47dcfd\", \"_context_project_domain\": \"7d3a4deab35b434bba403100a6729c81\", \"_context_timestamp\": \"2019-10-10 22:22:40.296736\", \"_context_user_domain\": \"default\", \"_context_user_name\": \"admin\", \"publisher_id\": \"network.us01odc-dev1-ctrl1\", \"message_id\": \"f5abf42d-ab61-47ca-ad35-07f5af57db76\", \"_context_project\": \"474ae347d8ad426f8118e55eee47dcfd\"}", "oslo.version": "2.0"}

Michael Klishin

Oct 10, 2019, 7:25:26 PM
to rabbitmq-users
My only guess is that some component(s) that is supposed to consume these messages is not running / was not started.


Michael Klishin

Oct 10, 2019, 7:27:31 PM
to rabbitmq-users
Queues in AMQP 0-9-1 (or topics in MQTT and STOMP, etc.) are not declared by RabbitMQ. Applications declare them [1].
One application can declare a queue but not consume anything from it, just to make sure that the queue is there so that messages
can be published to it.

One way to find out what the apps really do is to take a traffic capture with tcpdump and inspect it with Wireshark.
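For example (interface, port and file path are placeholders to adapt):

tcpdump -i any -s 0 -w /tmp/amqp.pcap port 5672
# then open /tmp/amqp.pcap in Wireshark, which has an AMQP 0-9-1 dissector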



Michael Klishin

Oct 10, 2019, 7:29:51 PM
to rabbitmq-users
This node was [re-]joined with a cluster that has a different "identity" ("cookie"). This is NOT related to the cookie that is the shared inter-node authentication secret [1].

This is possible e.g. after a node was reset but its cluster members were not. [2] has a lot of information about how clusters are formed,
how nodes behave when they are restarted, how to remove a cluster node permanently, and so on.
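If that is what happened here, the usual way out is along these lines (a sketch only; the node names are taken from your earlier error output, and the clustering guide [2] should be followed rather than these commands verbatim):

# on one of the healthy, running nodes:
rabbitmqctl forget_cluster_node rabbit@us01odc-dev1-ctrl1
# on the failed node: stop it, move its data directory aside, start it, then rejoin:
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@us01odc-dev1-ctrl2
rabbitmqctl start_app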

