Restarting a persistent RabbitMQ cluster in Kubernetes after all nodes were killed


Andrei Kochetygov

May 4, 2018, 8:38:25 AM5/4/18
to rabbitmq-users
Hi,

I have a 3.7.2 RabbitMQ cluster set up in Kubernetes using the rabbitmq Docker image, and I want it to be able to recover after all of its nodes are killed at the same time. The cluster works fine with persistent storage, but it can't automatically recover after all nodes were killed, because of the following error:
Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
Waiting for Mnesia tables for 30000 ms, 0 retries left
CRASH REPORT
Process <0.229.0> with 0 neighbours exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}} in application_master:init/4 line 134
Application rabbit exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_r


Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done


As I understand it, the following part of the RabbitMQ clustering docs addresses this problem (https://www.rabbitmq.com/clustering.html):

If all cluster nodes stop in a simultaneous and uncontrolled manner (for example with a power cut) you can be left with a situation in which all nodes think that some other node stopped after them. In this case you can use the force_boot command on one node to make it bootable again - consult the rabbitmqctl manpage for more information.
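For context, the manual recovery the docs describe looks roughly like this (a sketch of the documented rabbitmqctl sequence, run on the one node that should boot without waiting for its peers):

```
# Run on a single node whose data directory is intact:
rabbitmqctl stop_app     # stop the RabbitMQ application (the Erlang node stays up)
rabbitmqctl force_boot   # mark this node as safe to boot without waiting for peers
rabbitmqctl start_app    # boot; the remaining nodes can then rejoin it
```

The catch in my setup is the first step, as described below.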

But Docker automatically starts RabbitMQ, so I can't successfully call rabbitmqctl stop_app / rabbitmqctl force_boot before the container fails (from what I see, this happens because rabbitmqctl stop_app waits for RabbitMQ to actually start, and RabbitMQ can't start because of the error above).

I also tried using the RABBITMQ_NODE_ONLY variable (I started the pod with this variable set, then called rabbitmqctl force_boot, then rabbitmqctl start_app). But for some reason, in this case RabbitMQ ignores some of the parameters set in the env or config files.

So I have the following questions:
1) Is there some kind of "force" option for the rabbitmqctl stop_app command (so I could stop RabbitMQ before the pod fails)?
2) Questions about the RABBITMQ_NODE_ONLY variable (for some reason I couldn't find any docs about it): is RabbitMQ supposed to ignore parameters set in env or config files when this variable is set? Is it OK to use rabbitmqctl force_boot, then rabbitmqctl start_app when this variable is set?
3) Is there a better way to recover a persistent cluster in Kubernetes after all of its nodes were stopped?

Thanks!

Michael Klishin

May 4, 2018, 10:43:19 PM5/4/18
to rabbitm...@googlegroups.com
Please be more specific about the scenario you are testing. The message in question has been discussed
many times before on this list. It means that a node started, tried to contact its known peers, and after a certain
point gave up (IIRC we do 10 retries with a 30-second timeout each by default).
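For reference, on 3.7.x those retry settings are configurable. Assuming the new-style rabbitmq.conf format, the relevant keys are (defaults shown; verify against the docs for your version):

```
# How long to wait for Mnesia tables on each attempt (in ms),
# and how many times to retry before giving up.
mnesia_table_loading_retry_timeout = 30000
mnesia_table_loading_retry_limit = 10
```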

You need to start all nodes or at least a subset that considers each other "last seen" members
in that time window, then they will rejoin each other in a chain.

force_boot is for scenarios where you cannot recover some nodes (including the very last one to shut down, when *all* of them are down)
and at least one of them must be forced to boot, since its known peer is never coming back.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Andrei Kochetygov

May 5, 2018, 4:09:03 AM5/5/18
to rabbitmq-users
Thanks for your answer.

I am testing a scenario in Kubernetes (using the rabbitmq Docker image) where all nodes go down at the same time. As I understand it, in this case it is possible that all nodes think that some other node stopped after them, which is why I am getting this error. To avoid it, I am trying to start one of the nodes forcefully.
But I have problems doing this. rabbitmqctl can't finish the stop_app command (and because of that I can't use the force_boot command) while the node is trying to contact the other nodes. And since I am using the rabbitmq Docker image, I can't really use the stop_app command before or after that (because the Docker container hasn't started yet, or has already exited).
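One workaround I am considering for this chicken-and-egg problem (a sketch only, not verified): rabbitmqctl force_boot reportedly just creates a force_load marker file in the node's Mnesia directory, which the node checks at boot. If so, an entrypoint wrapper could create that file before the server starts, avoiding the need to run stop_app at all. The Mnesia path below is an assumption for the official Docker image; check RABBITMQ_MNESIA_DIR in your pod:

```
# Sketch: pre-create the marker that `rabbitmqctl force_boot` would write,
# so the node boots without waiting for peers, then exec the normal entrypoint.
touch "/var/lib/rabbitmq/mnesia/rabbit@$(hostname)/force_load"
exec docker-entrypoint.sh rabbitmq-server
```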

I also tried using the RABBITMQ_NODE_ONLY variable, and it seems that in this case the RabbitMQ node doesn't start, but the Docker container doesn't exit and I can use rabbitmqctl. But with this variable set, when I start the RabbitMQ node with the rabbitmqctl start_app command, RabbitMQ ignores some of the parameters set in the env or config files and doesn't re-form the cluster.

On Saturday, May 5, 2018 at 6:43:19 UTC+4, Michael Klishin wrote:

Michael Klishin

May 5, 2018, 10:11:00 PM5/5/18
to rabbitm...@googlegroups.com
You will be getting this error if all nodes, or the subset that was stopped last, do not come online in the aforementioned period of time.
I don't have much to add other than to reiterate that you probably don't need to use force_boot unless
one of the nodes is never coming back.


Michael Klishin

May 5, 2018, 10:50:43 PM5/5/18
to rabbitm...@googlegroups.com
The docs were updated to clarify how things work post-3.6.7:

Andrei Kochetygov

May 6, 2018, 3:45:36 AM5/6/18
to rabbitmq-users
Thanks!

I was confused by the docs a little; it seems that I'll have to try bringing the whole cluster back up in this case.