How to safely reboot the Master queue node

Cristi Marasescu

Oct 15, 2018, 4:33:37 AM
to rabbitmq-users
Hello team,

I have a 3-node RabbitMQ cluster with an HA policy enabled that is used by Sensu monitoring: each Sensu client sends its events to the Sensu server via RabbitMQ, and a queue is created for each Sensu client.
In my environment, one of the nodes is the Master queue node. Each time the Master queue node is restarted, the Sensu server stops working and we see the following errors on the RabbitMQ nodes:


Discarding message {'$gen_call',{<0.16596.1>,#Ref<0.0.2.48068>},stat} from <0.16596.1> to <0.6006.0> in an old incarnation (1) of this node (3)
Discarding message {'$gen_call',{<0.11897.0>,#Ref<0.0.2.3095>},stat} from <0.11897.0> to <0.499.0> in an old incarnation (1) of this node (3)
Channel error on connection <0.2687.5> xxx.xxx.xxx.xxx:51852 -> 172.18.0.2:5672, vhost: 'sensu', user: 'sensu'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'QUEUE_NAME' in vhost 'sensu' due to timeout

How can I prevent this behavior? Is there a safe way to restart the Master queue node? Or how can I ensure that one of the remaining nodes takes over the queues while the Master queue node is down?

Thank you!

Michael Klishin

Oct 15, 2018, 4:50:12 AM
to rabbitm...@googlegroups.com
There are no master nodes in RabbitMQ. Queues have masters; nodes are equal peers [1].
So what you are after is migrating queue masters to a different node [2][4].

According to the message, the queue is durable and not mirrored [3]. So the first step is to make the queues mirrored [5]
and then migrate their masters using a temporary policy [2].
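
As an illustration of the temporary-policy approach (an editorial sketch, not part of the original reply): the helpers below only print the `rabbitmqctl` commands so they can be reviewed before running. The function names, the policy name `migrate-masters`, the priority `100`, the `.*` pattern and the target node name are all assumptions for this example. The idea is that a higher-priority policy with `ha-mode: nodes` listing only the target node should make RabbitMQ move each queue master there once a mirror on that node has synchronised; clearing the policy afterwards lets the original HA policy apply again.

```shell
#!/bin/sh
# Sketch only: print (do not run) the commands for a temporary
# master-migration policy. All names here are hypothetical.

# Print a set_policy command that pins matching queues to one target node.
#   $1 = vhost (e.g. sensu)
#   $2 = target node (e.g. rabbit@target-node, an assumed name)
build_migrate_policy() {
    vhost="$1"
    target="$2"
    printf '%s\n' "rabbitmqctl set_policy -p $vhost --priority 100 --apply-to queues migrate-masters '.*' '{\"ha-mode\":\"nodes\",\"ha-params\":[\"$target\"]}'"
}

# Print the command that removes the temporary policy again.
#   $1 = vhost
build_clear_policy() {
    printf '%s\n' "rabbitmqctl clear_policy -p $1 migrate-masters"
}
```

A possible sequence: run the printed `set_policy` command, check with `rabbitmqctl list_queues -p sensu name pid` that the masters have moved, reboot the old node, then run the printed `clear_policy` command. While the temporary policy is in effect the queues are mirrored only to the listed node, so it should stay in place no longer than needed.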


--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Cristi Marasescu

Oct 15, 2018, 8:00:49 AM
to rabbitm...@googlegroups.com
Hello Team,

Thank you, Michael Klishin, for your fast response.

All queues are mirrored based on an HA policy (ha-mode:all, ha-sync-mode:automatic) and have the auto-delete option set to true.

[root@sensu01 ~]# docker exec -it sensu-int-rabbit-1 rabbitmqctl --timeout 120 list_queues -p sensu name policy pid slave_pids
Timeout: 120.0 seconds ...
Listing queues for vhost sensu ...
vehweb04.f-1.5.0-1538545665 hapolicy <rab...@sensu-int-rabbit-1.1.4032.0> [<rab...@sensu-int-rabbit-2.1.1249.0>, <rab...@sensu-int-rabbit-3.2.428.0>]
db01-1.3.3-1535137885 hapolicy <rab...@sensu-int-rabbit-1.1.3859.0> [<rab...@sensu-int-rabbit-2.1.1253.0>, <rab...@sensu-int-rabbit-3.2.432.0>]
docker02-1.5.0-1538460109 hapolicy <rab...@sensu-int-rabbit-1.1.3792.0> [<rab...@sensu-int-rabbit-2.1.1257.0>, <rab...@sensu-int-rabbit-3.2.436.0>]
keepalives hapolicy <rab...@sensu-int-rabbit-1.1.3095.0> [<rab...@sensu-int-rabbit-2.1.1265.0>, <rab...@sensu-int-rabbit-3.2.444.0>]
gw01-1.5.0-1538551206 hapolicy <rab...@sensu-int-rabbit-1.1.3602.0> [<rab...@sensu-int-rabbit-2.1.1269.0>, <rab...@sensu-int-rabbit-3.2.448.0>]
gw02-1.5.0-1537971972 hapolicy <rab...@sensu-int-rabbit-1.1.3474.0> [<rab...@sensu-int-rabbit-2.1.1273.0>, <rab...@sensu-int-rabbit-3.2.452.0>]

The problem is that while the node that is master for most of the queues is rebooting, no other master is elected for those queues. After the original queue master node comes back online, it cannot redeclare the queues because they already exist on the other 2 nodes (each queue was replicated at some point, but the remaining nodes probably have an unsynchronised copy).

So back to my original question: is there a way to safely reboot the nodes in my RabbitMQ cluster?
Should I use the policy "ha-promote-on-shutdown"?

Note: my RabbitMQ cluster nodes run inside Docker containers, RabbitMQ version 3.7.8.

Michael Klishin

Oct 16, 2018, 9:48:38 PM
to rabbitm...@googlegroups.com
You can move queue masters to a different node using a temporary policy as I suggested above.

You can also tell RabbitMQ to promote even out-of-sync mirrors if you value
availability more than data consistency here, but that was not my recommendation.
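
For readers who do choose availability over consistency, here is an editorial sketch of what that could look like using the `ha-promote-on-shutdown` key mentioned earlier in the thread. It prints a `set_policy` command that re-declares the existing policy with the extra key. The policy name `hapolicy` and the `ha-mode`/`ha-sync-mode` values come from the thread; the `.*` pattern and the function name are assumptions, since the original policy pattern is not shown.

```shell
#!/bin/sh
# Sketch only: print (do not run) a set_policy command that adds
# ha-promote-on-shutdown=always to the HA policy described in the thread.
# Assumed: pattern '.*' (the real pattern is not shown in the thread).
#   $1 = vhost (e.g. sensu)
build_promote_policy() {
    vhost="$1"
    definition='{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-shutdown":"always"}'
    printf '%s\n' "rabbitmqctl set_policy -p $vhost --apply-to queues hapolicy '.*' '$definition'"
}
```

With this setting an unsynchronised mirror may be promoted when the master node is shut down, so messages that had not yet been replicated to it can be lost.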

If the queues are mirrored, the error suggests that nodes might take a while to notice their peer's
disappearance [1].

Lastly you can migrate to an entirely new cluster [2].

My suggestions are still moving queue masters to a dedicated node or [2]. [3] then can be used to
redistribute queue masters after all nodes are back.
