vhost_supervisor_not_running errors after upgrading to RabbitMQ 3.7.4 and Erlang OTP 20.3.2


Kit Sirota

May 4, 2018, 2:56:58 PM
to rabbitmq-users
Hi folks. 

We recently upgraded from Erlang OTP 19.3.6.2 / RabbitMQ 3.6.12 to Erlang OTP 20.3.2 / RabbitMQ 3.7.4, and since the upgrade we've been having quite a bit of trouble with some of the vhosts on this cluster.

In the management GUI, we're seeing errors like the following across some, but not all, of the vhosts. The errors also seem to come and go:
Virtual host example experienced an error on node rabbit@b228377ffe6d115f302499578585da44 and may be inaccessible

On the RMQ nodes, the logs are full of `vhost_supervisor_not_running` errors, repeated over and over:
2018-05-04 17:45:51.521 [error] <0.2846.24> Error on AMQP connection <0.2846.24> (192.168.72.248:53250 -> 192.168.77.101:5672, vhost: 'example', user: 'example__places', state: running), channel 1:
 operation queue.declare caused a connection exception internal_error: "Cannot declare a queue 'queue 'example-fab0e9fd-6ace-4d61-875d-c4cf9426c022' in vhost 'example'' on node 'rabbit@99e02850a6df93d16471990ac52b6f82': {vhost_supervisor_not_running,<<\"example\">>}"

The only lead I've got so far is this reference in the code itself: https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_vhost_sup_sup.erl#L149

Has anyone run into anything like this?  


Thanks!

Michael Klishin

May 4, 2018, 10:47:52 PM
to rabbitm...@googlegroups.com
In every case we have seen so far this happens in roughly the following scenario:

 * A piece of code declares a virtual host
 * and on the very next line assumes it can be used

That's not the case in 3.7.x. Each virtual host has its own message store and other state
that used to be global. That stuff takes time to start — a fraction of a second, perhaps, but
nonetheless not instant.
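
To illustrate (a rough sketch only, not something from the docs; it assumes the management plugin on localhost:15672, guest credentials, and Python with the requests library — the retry count and sleep are arbitrary), one way to avoid the race is to poll the new vhost's aliveness after creating it instead of using it on the very next line:

    import time
    import requests

    MGMT = "http://localhost:15672/api"
    AUTH = ("guest", "guest")
    VHOST = "example"
    USER = "guest"

    # create the vhost (PUT is idempotent)
    requests.put(f"{MGMT}/vhosts/{VHOST}", auth=AUTH).raise_for_status()

    # grant the user permissions on it so the aliveness check below can declare its test queue
    perms = {"configure": ".*", "write": ".*", "read": ".*"}
    requests.put(f"{MGMT}/permissions/{VHOST}/{USER}", json=perms, auth=AUTH).raise_for_status()

    # instead of declaring queues immediately, wait until the vhost's processes are up
    for _ in range(20):
        r = requests.get(f"{MGMT}/aliveness-test/{VHOST}", auth=AUTH)
        if r.ok and r.json().get("status") == "ok":
            break
        time.sleep(0.5)
    else:
        raise RuntimeError(f"vhost {VHOST!r} did not become available in time")

    # now it should be safe to open connections and declare queues on VHOST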

See server logs for more clues. We don't guess on this list.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

May 4, 2018, 10:53:27 PM
to rabbitm...@googlegroups.com
I don't recall which doc guide mentions this (or maybe none do, though I somewhat doubt that), but
when *all* nodes are restarted, there is a certain restart order dependency.

It used to be strict: the last node to stop had to be the first to start. Since 3.6.7 it is a lot less strict:
as long as the last known peer for a node comes online within 5 minutes (10 attempts x 30 seconds of waiting per attempt),
all is well.

For a complete cluster restart (which is what this experiment effectively is), this means that all nodes
must eventually be up within that time window. The window can be extended, e.g. by increasing the number of attempts.
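
For example (a sketch, assuming the new-style rabbitmq.conf format; the key names below are from memory, so double-check them against the clustering guide for your version):

    # number of attempts a booting node makes while waiting for its known peers
    mnesia_table_loading_retry_limit = 20

    # how long each attempt waits, in milliseconds (default 30000)
    mnesia_table_loading_retry_timeout = 30000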

The very last node to shut down has no online peers it knows about at the time it goes down, so it will start
up on its own.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ