mirrored cluster crashes after node failure


Matt Wheeler

Sep 30, 2013, 2:34:40 PM
to rabbitmq...@googlegroups.com
We have a 3-node RabbitMQ cluster consisting of 2 disk nodes and one memory (RAM) node. (The disk nodes are rabbitmq-00 and rabbitmq-01; the memory node is core-01.)

Queues are durable and mirrored (they show +2 in the management UI, etc.) and show as synchronised:

# rabbitmqctl list_queues name slave_pids synchronised_slave_pids
Listing queues ...
...
SVC_mailbox_lookup [<'rabbit@rabbitmq-01'.2.301.0>, <'rabbit@core-01'.1.268.0>] [<'rabbit@core-01'.1.268.0>, <'rabbit@rabbitmq-01'.2.301.0>]
...

# rabbitmqctl list_policies
Listing policies ...
/ ha-all ^SVC_ {"ha-mode":"all"} 0
...done.
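For reference, a policy like the one listed above would be declared with `rabbitmqctl set_policy`. A sketch, assuming the 3.1.x positional syntax and using the policy name and pattern from the listing (default vhost '/'):

```shell
# Mirror every queue whose name starts with "SVC_" onto all nodes
# in the cluster (policy name and pattern as in the listing above).
rabbitmqctl set_policy ha-all "^SVC_" '{"ha-mode":"all"}'
```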


We put in SSDs mounted at '/var/lib/rabbitmq' to host the Mnesia database on rabbitmq-00/01. We only used a single drive per node, figuring that if the disk failed, the node would crash and the others in the HA cluster would take over; all clients have been coded for failover.

The SSD on rabbitmq-00 failed. I don't have logs of that event from rabbitmq-00's point of view; for some reason it didn't write out anything.

I do have logs from rabbitmq-01's side:

=INFO REPORT==== 26-Sep-2013::16:07:20 ===
Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Slave <'rabbit@rabbitmq-01'.3.785.0> saw deaths of mirrors <'rabbit@rabbitmq-00'.3.1415.0>

=INFO REPORT==== 26-Sep-2013::16:07:20 ===
Mirrored-queue (queue 'SVC_mailbox_lookup' in vhost '/'): Promoting slave <'rabbit@rabbitmq-01'.3.785.0> to master

but then:

=ERROR REPORT==== 26-Sep-2013::16:17:17 ===
connection <0.487.0>, channel 1 - soft error:
{amqp_error,not_found,
"home node 'rabbit@core-01' of durable queue 'SVC_mailbox_lookup' in vhost '/' is down or inaccessible",
'queue.declare'}


This is repeated for each queue. 

It looks like rabbitmq-01 took over as master, but then the nodes became non-responsive because they couldn't write to disk on core-01 (the memory node).

We shut down whatever was still running on rabbitmq-00, and everything was still unavailable. We then shut down core-01 and lastly rabbitmq-01, then restarted rabbitmq-01, but it came up with NO queues.
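For anyone hitting the same situation, a hedged sketch of one cleanup path, assuming RabbitMQ 3.0+ (where `forget_cluster_node` exists) and the node names above; this illustrates the general approach, not a verified fix for this specific incident:

```shell
# Sketch: from a surviving node, eject the dead node from the cluster
# so the remaining nodes stop waiting on it, then verify membership.
rabbitmqctl -n rabbit@rabbitmq-01 forget_cluster_node rabbit@rabbitmq-00
rabbitmqctl -n rabbit@rabbitmq-01 cluster_status
```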

Is this an error with the way the HA cluster is handling failover, or an error in our configuration? Should we not mix memory and disk nodes in an HA cluster?

I'm trying to figure this out because we want to be sure that if any node in the cluster fails, the others take over seamlessly. Our code does that; we just need the clusters to soldier on and to ensure that no records are lost.

Thanks.  

Matt Wheeler

Sep 30, 2013, 2:40:56 PM
to rabbitmq...@googlegroups.com
I should have included:  

We are using RabbitMQ 3.1.5, Erlang R14B04 on CentOS 6.

Yongsheng Ma

Aug 8, 2019, 11:57:20 PM
to rabbitmq-discuss
Hi Matt Wheeler,

Have you figured out why?