Stopping nodes for partition recovery

Paul Ruan

Apr 2, 2015, 10:29:35 PM
to rabbitm...@googlegroups.com
Hi all,

I'm trying to figure out what the recommended way is for recovering from a partition.
From https://www.rabbitmq.com/partitions.html, it sounds like a reasonable way is to stop all the nodes and then start them up again:
"It may be simpler to stop the whole cluster and start it again; if so make sure that the first node you start is from the trusted partition."

For stopping each node, is it enough to do a rabbitmqctl stop_app? Or do we have to do a rabbitmqctl stop? Is it not necessary to reset the nodes in the untrusted partitions?
Also, is it not recommended to stop nodes using SIGTERM?

Thanks,
Paul

Michael Klishin

Apr 3, 2015, 2:22:03 AM
to Paul Ruan, rabbitm...@googlegroups.com
On 3 April 2015 at 05:29:38, Paul Ruan (paul...@dropbox.com) wrote:
> I'm trying to figure out what the recommended way is for recovering
> from a partition.
> From https://www.rabbitmq.com/partitions.html, it sounds
> like a reasonable way is to stop all the nodes and then start them
> up again:
> "It may be simpler to stop the whole cluster and start it again;
> if so make sure that the first node you start is from the trusted
> partition."

Note that in any partition there are two sides: it should be fine to only stop nodes
in the minority. If you don't know which side that is and can afford to stop all
of them, that's fine.

> For stopping each node, is it enough to do a rabbitmqctl stop_app?
> Or do we have to do a rabbitmqctl stop? Is it not necessary to reset
> the nodes in the untrusted partitions?
> Also, is it not recommended to stop nodes using SIGTERM?

When nodes on the minority side stop themselves, they do what is effectively stop_app.
However, as far as manual interventions go, `rabbitmqctl stop ${PID_FILE}` is also
fine — that's what our Debian init script uses, for example.

Note that if you stop all nodes, the last node to stop must be the first one to start
to act as a seed for other nodes. 
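
A minimal sketch of that ordering rule, not an official tool: given the order in which nodes were stopped, it returns a start order whose first entry is the last node stopped. The helper name `restart_order` and the node names are hypothetical. Only the first position is required by the rule above; reversing the rest of the list is just a convenient choice.

```python
def restart_order(stop_order):
    """Return a start order in which the last node stopped is started first."""
    if not stop_order:
        raise ValueError("stop_order must contain at least one node")
    if len(set(stop_order)) != len(stop_order):
        raise ValueError("each node should appear only once in stop_order")
    # The last node stopped acts as the seed, so it goes first.
    return list(reversed(stop_order))


# Hypothetical node names, for illustration only.
stopped = ["rabbit@node1", "rabbit@node2", "rabbit@node3"]
print(restart_order(stopped))
```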
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Paul Ruan

Apr 3, 2015, 1:45:44 PM
to rabbitm...@googlegroups.com, paul...@dropbox.com
Thanks for the quick response! 

Sounds like manually running stop_app should be enough.

Must the last node to stop also be a disk node?

Michael Klishin

Apr 3, 2015, 2:07:26 PM
to Paul Ruan, rabbitm...@googlegroups.com
On 3 April 2015 at 20:45:48, Paul Ruan (paul...@dropbox.com) wrote:
> Must the last node to stop also be a disk node?

Yes. RabbitMQ will refuse to start if it notices that the only node online is a RAM one.

Probably something obvious but I should mention it: unless you have a high rate of queue churn (e.g. 100s or 1000s of queues created and deleted per second), there really aren't any reasons to use RAM nodes.
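
As a rough illustration of the disk-node requirement (a sketch under assumptions, not something RabbitMQ ships): before shutting a cluster down, one could check that the node planned to stop last is a disk node. The function `check_seed_node`, the `"disk"`/`"ram"` labels, and the node names are all hypothetical.

```python
def check_seed_node(stop_order, node_types):
    """Return the node to start first, or raise if it is not a disk node.

    stop_order: node names in the order they will be stopped.
    node_types: mapping of node name to "disk" or "ram" (hypothetical labels).
    """
    if not stop_order:
        raise ValueError("stop_order must contain at least one node")
    seed = stop_order[-1]  # the last node stopped is the first to start
    if node_types.get(seed) != "disk":
        raise ValueError(f"{seed} would be stopped last but is not a disk node")
    return seed


# Hypothetical cluster layout, for illustration only.
types = {"rabbit@node1": "ram", "rabbit@node2": "disk"}
print(check_seed_node(["rabbit@node1", "rabbit@node2"], types))
```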

Paul Ruan

Apr 3, 2015, 2:12:12 PM
to rabbitm...@googlegroups.com, paul...@dropbox.com
We don't have a high rate of queue churn, but we do have a high rate of binding churn, and we've found that (at least in a previous version of RabbitMQ) binds/unbinds with RAM nodes are much quicker.

Michael Klishin

Apr 3, 2015, 2:22:35 PM
to Paul Ruan, rabbitm...@googlegroups.com
On 3 April 2015 at 21:12:14, Paul Ruan (paul...@dropbox.com) wrote:
> We don't have a high rate of queue churn, but we do have a high rate
> of binding churn, and we've found that (at least in a previous
> version of RabbitMQ) binds/unbinds with RAM nodes are much quicker.

Yes, binding churn is another reason (exchange churn probably isn't common).