Vhosts sometimes fail to start on all nodes in HA cluster

Nick Barham

Mar 24, 2021, 5:57:30 AM
to rabbitmq-users

Hi,

We are running an HA cluster of RabbitMQ in Kubernetes (5 nodes in a StatefulSet, with a policy of 3 mirrors for classic queues - ha-params: 3) and keep hitting an issue where newly created vhosts sometimes fail to start properly.

Specifically, the management API call (rabbitmqadmin declare vhost vhost=<name>) to create the vhost times out after 60 seconds (the HTTP call just ends with "no content"), and the vhost is visible in the management portal but with a status of "stopped" on some or all of the nodes. Depending on how many nodes have started it, the vhost may be usable, or may fail to allow any resources to be created. Even when they can be created, mirrored classic queues frequently end up with fewer mirrors than the policy says they should have.
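
For reference, the per-node state that the management UI shows can also be pulled from the management HTTP API, which is how we spot affected vhosts. This is only a rough sketch: the hostname and credentials are placeholders, and I'm assuming the "cluster_state" field in the response is what the UI's running/stopped column reflects.

# Placeholder host/credentials; VHOST is one of our pipeline vhosts.
VHOST="pipeline-id-76806"
MGMT="http://rabbitmq.rabbitmq.svc.cluster.local:15672"
# Show the vhost, including which nodes report it as running vs stopped.
curl -s -u admin:admin "$MGMT/api/vhosts/$VHOST" | python3 -m json.tool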

This issue only occurs during vhost creation. Once a vhost is fully up on all nodes, we have no further problems and everything seems to work correctly. A sequential reboot of the nodes fixes the issue temporarily (and also fixes vhosts that currently exhibit it), but eventually we start seeing it again: a vhost fails to start on at least one of the nodes, and sometimes on all five.

Because this server is used to test environments created by CI pipelines, we create and delete a fair number of vhosts on a regular basis, so we hit this issue frequently but inconsistently.

This issue has persisted across at least the following RabbitMQ versions: 3.8.2, 3.8.11 and 3.8.14, using the Docker Community images for RabbitMQ (https://hub.docker.com/_/rabbitmq).

----

Three things stand out from the logs and debugging.

1) We constantly see a large number of "Discarding message ... in an old incarnation" errors coming from the nodes (on the order of 400 per minute), like this:

2021-03-17 11:04:04.828 [error] <0.547.0> Discarding message {'$gen_call',{<0.547.0>,#Ref<0.2527798144.1772355585.119646>},{info,[state]}} from <0.547.0> to <0.22046.103> in an old incarnation (1615212109) of this node (1615971788)

2021-03-17 11:04:04.828 [error] emulator Discarding message {'$gen_call',{<0.547.0>,#Ref<0.2527798144.1772355585.119646>},{info,[state]}} from <0.547.0> to <0.22046.103> in an old incarnation (1615212109) of this node (1615971788)

2) When the majority of nodes fail to start the vhost, we see occasional warnings in the logs in this form:

2021-03-23 12:49:32.305 [warning] <0.11689.511> Mirrored queue 'cabinevents.incoming' in vhost 'pipeline-id-76806': Unable to start queue mirror on node ''rab...@rabbitmq-0.rabbitmq.rabbitmq.svc.cluster.local''. Target virtual host is not running: {vhost_supervisor_not_running,<<"pipeline-id-76806">>}

3) Running rabbitmq-diagnostics maybe_stuck on nodes where a vhost failed to start shows a number of "suspicious processes" that we don't see on nodes that haven't failed to start a vhost. (Specifically, we always see the first two processes, "erts_code_purger" and a call into "dbg.erl", on all our nodes, but the following 20 processes seem to be related to this issue.)

(See attached file for example output from maybe_stuck)
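
For completeness, this is roughly how we gather that output across the StatefulSet (pod and namespace names are specific to our environment):

# Run maybe_stuck on every pod in the StatefulSet and label the output.
for i in 0 1 2 3 4; do
  echo "== rabbitmq-$i =="
  kubectl -n rabbitmq exec "rabbitmq-$i" -- rabbitmq-diagnostics maybe_stuck
done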

---

We are currently trying to replicate the issue on a completely separate Kubernetes cluster so that we can post more logs/debug information without worrying about accidentally sharing confidential data in the logs from our dev environment.



rabbitmq_logs.txt

Nick Barham

Mar 29, 2021, 10:33:45 AM
to rabbitmq-users
Additionally, we have started seeing the following types of errors from the client side on affected vhosts:

Connection closed: 541 (INTERNAL-ERROR) with message "INTERNAL_ERROR - Cannot declare a queue 'queue 'testQueue' in vhost 'pipeline-id-77026'' on node 'rab...@rabbitmq-2.rabbitmq.rabbitmq.svc.cluster.local': {vhost_supervisor_not_running,<<"pipeline-id-77026">>}"

Michal Kuratczyk

Mar 29, 2021, 11:16:47 AM
to rabbitm...@googlegroups.com
Hi,

1. Can you please enable debug logs (rabbitmqctl set_log_level debug) and share logs from all nodes when the problem occurred (with the name of the vhost to look for)?

2. Is RabbitMQ under heavy load when this happens?

3. Are you able to reproduce this issue when deploying RabbitMQ using the RabbitMQ Cluster Operator?

Thanks,



--
Michał
RabbitMQ team

Nick Barham

Mar 29, 2021, 11:59:07 AM
to rabbitmq-users
Thanks for responding.

1) I have enabled debug logs and will gather logs the next time this happens - the timing is very inconsistent, so it could be a couple of days before I get relevant new logs, but I will post here once I have them.

2) Over the past few months of having this issue, there hasn't seemed to be any correlation between load levels and the issue occurring.

3) I may not easily be able to set up another RabbitMQ cluster using the RabbitMQ Cluster Operator and load it the same way, due to how our Kubernetes clusters are managed.
(We have been trying to replicate this on a different cluster, but so far haven't seen the issue there. We did, however, completely tear down our existing cluster, wipe all persistent storage and recreate the cluster, and continued to have the issue.)

Many thanks,
Nik.

Nick Barham

Apr 1, 2021, 7:02:30 AM
to rabbitmq-users
One complete rebuild of our RabbitMQ cluster due to unrelated failures later (when it rains, it pours), I have a set of debug logs from a vhost failing to start on all 5 nodes. The affected vhost is "pipeline-id-77294".

Apologies that the logs are fairly noisy due to a poorly behaving client creating a lot of connections on a different vhost...

The logs cover the period from when the vhost was set up through the following few minutes.

Please let me know if you need any other information. Unfortunately, I'm going to have to resolve the current issues by systematically rebooting the RabbitMQ instances, as we need these CI pipelines to work.

Many thanks in advance,
Nik
rabbitmq.zip

Michal Kuratczyk

Apr 2, 2021, 6:53:26 AM
to rabbitm...@googlegroups.com
Hi,

I'm afraid I don't see what I expected. Any chance there is something in the logs you didn't include? I'd look for "shutdown" and "crash" in particular.
The logs from node 0 don't even cover this vhost being created (there is no mention of "pipeline-id-77294" in rabbit_0.txt).

For now, I can only suggest `rabbitmqctl restart_vhost` as a potentially less disruptive way to restore the vhost.



--
Michał
RabbitMQ team

Nick Barham

Apr 6, 2021, 4:34:56 AM
to rabbitmq-users
I will try `rabbitmqctl restart_vhost`, though I know that triggering a vhost restart from the management GUI exhibits the same issue (i.e. it times out without successfully executing the action).

Also, the issue seems to coincide with the number of processes reported by the `rabbitmq-diagnostics maybe_stuck` command. I can provide you with the output from this command the next time the issue occurs, if that helps?

Nik.

M K

Apr 6, 2021, 5:29:24 AM
to rabbitmq-users
We need to know exactly how to reproduce this. A virtual host has to be started on every cluster node. We have previously seen scripts create a virtual host and then immediately assume it can be used on any node. These days `rabbitmqctl add_vhost` won't return until all nodes report success, and that seems to have helped.

I don't remember seeing this in the context of Kubernetes, where such scripted automation must be rare.
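
If the provisioning goes through the HTTP API rather than `rabbitmqctl add_vhost`, the script can apply the same kind of guard itself. A rough sketch, with placeholder names and credentials, and assuming the "cluster_state" field of GET /api/vhosts/<name> is the right thing to poll:

# Create the vhost, then wait until no node reports it as "stopped"
# before creating anything in it.
VHOST="pipeline-id-12345"
MGMT="http://localhost:15672"
curl -fsS -u admin:admin -H 'content-type: application/json' -X PUT "$MGMT/api/vhosts/$VHOST"
while curl -fsS -u admin:admin "$MGMT/api/vhosts/$VHOST" | grep -q '"stopped"'; do
  sleep 2
done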

M K

Apr 6, 2021, 5:34:58 AM
to rabbitmq-users
There are no errors of any kind besides "operation X failed because virtual host is down" in any of these logs.

There is evidence of TCP connections that send no data (likely TCP port health checks).
There is evidence of a new connection being created every fraction of a second. That's not how the messaging protocols
RabbitMQ supports are supposed to be used [1].

Finally, there are some authentication failures:
> PLAIN login refused: user 'vjt-ccg-grm' - invalid credentials

The logs do not seem to be complete as they do not mention the startup banner and so on.
There are no messages that would indicate when a virtual host was added, or definitions imported, or anything like that.

How do you provision this cluster, and what do your applications do to it? And why do they work that way?

Michal Kuratczyk

Apr 6, 2021, 6:20:39 AM
to rabbitm...@googlegroups.com
I noticed that the definitions are imported to all vhosts. I guess you have the definitions in a ConfigMap mounted into all pods, and each node imports them.
I'd suggest trying to import the definitions only once. There is no trivial way to do that in Kubernetes, but at least for testing you could either import the definitions manually or use an init container that only runs the import if the pod index is 0, or something like that.
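
For example, the init container's command could be something along these lines (just a sketch; the paths and credentials are placeholders, and it assumes the standard StatefulSet naming where the pod ordinal is the suffix of the pod name):

# Only the pod with ordinal 0 imports the definitions.
ORDINAL="${HOSTNAME##*-}"
if [ "$ORDINAL" = "0" ]; then
  rabbitmqadmin -u "$RABBITMQ_USER" -p "$RABBITMQ_PASS" import /config/definitions.json
fi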

Not sure if you are using the Operator, but we do have an open issue about better definition import support: https://github.com/rabbitmq/cluster-operator/issues/631

Best,



--
Michał
RabbitMQ team

Nick Barham

Apr 6, 2021, 6:32:50 AM
to rabbitmq-users
In answer to your first question, we use the rabbitmqadmin CLI to create the vhost (`rabbitmqadmin declare vhost`). We have a startup script (init container) for our k8s environments that provisions a new vhost for each environment by creating the vhost, a user, a set of policies and a handful of default queues and exchanges. If the vhost creation command returns, everything after that works fine; when we see this issue, the HTTP call that rabbitmqadmin makes under the hood times out.
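
Roughly, the script runs a sequence like this (simplified; the vhost, user and queue names here are placeholders, and the exact rabbitmqadmin argument names are from memory, so may be slightly off):

# Simplified sketch of our per-environment provisioning.
VHOST="pipeline-id-NNNNN"
rabbitmqadmin declare vhost name="$VHOST"        # this is the call that times out
rabbitmqadmin declare user name=pipeline-user password="$PASS" tags=
rabbitmqadmin declare permission vhost="$VHOST" user=pipeline-user configure='.*' write='.*' read='.*'
rabbitmqadmin -V "$VHOST" declare policy name=ha pattern='.*' definition='{"ha-mode":"exactly","ha-params":3}'
rabbitmqadmin -V "$VHOST" declare queue name=example-queue durable=true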

The logs I provided only cover the time around when the issue occurred. Unfortunately, at debug level so much log output is produced that our Kubernetes environment rotates the logs away, so by the time the issue recurs we no longer have the startup logs - and in any case it would be literally gigabytes of logs (based on the amount of data that gets pushed to our external Kibana when RabbitMQ is in debug mode in our production environment).

I suspect that the repeated connections are from the misbehaving client I mentioned in previous messages. The client is from a release line that is code-frozen and still being tested by UA - unfortunately, even though we have a fix and have declared it critical, the higher-ups have decided that we can't deploy the fix to that release line. Yay for big-corporation red tape and management layers :(

In regards to definitions, we have a base definition that just sets up an admin account (plus a queue and policy on vhost '/' as a basic test to ensure rabbit has started up correctly) - see attached file.

Then, for each vhost we set up, we push a set of management definition files via the `rabbitmqadmin import` CLI command. We do this so that we can arrange any changes to the RabbitMQ vhost as a set of "migration files", which allows us to upgrade production systems. So far we haven't had any issues with this aspect: when vhost setup fails it doesn't even get to this step, and when the vhost does come up, these definition files load perfectly every time.

Also, I tried the restart_vhost command (`rabbitmqctl -p pipeline-id-77453 restart_vhost`), but it merely output
"Trying to restart vhost 'pipeline-id-77453' on node 'rab...@rabbitmq-4.rabbitmq.rabbitmq.svc.cluster.local' ..."
and sat there for over an hour with no further output, and with nothing in any of the logs to indicate anything was happening.

Nik Barham

Apr 6, 2021, 6:41:18 AM
to rabbitm...@googlegroups.com
For some reason I suddenly can't post a message in reply on Google Groups. Hopefully doing it via email will fix it >_>

I forgot to attach the definitions.yaml to the previous message. Attaching here.

Nik Barham

rabbitmq-definitions.yml

Nick Barham

Apr 26, 2021, 5:44:03 AM
to rabbitmq-users
Hi, 

Some more updates from monitoring the issues above on our cluster - we just had an occurrence that highlighted a new side of the issue. A vhost failed to start on all five nodes earlier, so I went and ran `rabbitmq-diagnostics maybe_stuck` on each node. Four of the nodes only listed the two processes I always see ("erts_code_purger" and the call into "dbg.erl"), and for those four nodes I was able to click the "restart" button on the vhost page in the management plugin and successfully (and instantly) start the affected vhost on those nodes.

For the final node, the `maybe_stuck` command returned a whopping 724 "suspicious processes". Of note was the following one:
2021-04-26 09:27:56 [{pid,<11328.600.0>},
{registered_name,rabbit_vhost_sup_sup},
{current_stacktrace,
    [{supervisor2,shutdown,2,
         [{file,"src/supervisor2.erl"},{line,1038}]},
     {supervisor2,do_terminate,2,
         [{file,"src/supervisor2.erl"},{line,1000}]},
     {supervisor2,handle_call,3,
         [{file,"src/supervisor2.erl"},{line,487}]},
     {gen_server,try_handle_call,4,
         [{file,"gen_server.erl"},{line,715}]},
     {gen_server,handle_msg,6,
         [{file,"gen_server.erl"},{line,744}]},
     {proc_lib,init_p_do_apply,3,
         [{file,"proc_lib.erl"},{line,226}]}]},
{initial_call,{proc_lib,init_p,5}},
{message_queue_len,719},
{links,
[<11328.18675.0>,<11328.8422.1>,<11328.5613.795>,
<11328.17293.3>,<11328.2653.3>,<11328.22708.4>,
<11328.3345.2169>,<11328.21006.1416>,<11328.601.0>,
<11328.4413.4>,<11328.4046.0>,<11328.347.0>]},
{monitors,[{process,<11328.10540.2167>}]},
{monitored_by,
[<11328.24779.2656>,<11328.10834.2656>,
...356 more lines with two process ids per line...
<11328.29669.2169>,<11328.29059.2169>]},
{heap_size,987}]


As well as that, there were ~700 entries that looked identical to this one, except for the process ID:
2021-04-26 09:27:56 [{pid,<11328.748.2497>},
{registered_name,[]},
{current_stacktrace,
    [{gen,do_call,4,[{file,"gen.erl"},{line,208}]},
     {gen_server,call,3,
         [{file,"gen_server.erl"},{line,242}]},
     {rabbit_vhost_sup_sup,stop_and_delete_vhost,1,
         [{file,"src/rabbit_vhost_sup_sup.erl"},
          {line,84}]}]},
{initial_call,{erlang,apply,2}},
{message_queue_len,0},
{links,[]},
{monitors,[{process,<11328.600.0>}]},
{monitored_by,[]},
{heap_size,610}]


(I've attached a zip of the full output.)

I'm hoping this helps - it definitely appears to be something hanging while trying to shut down vhost processes :/
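
In case it's useful to anyone else hitting this, a quicker way to peek at that supervisor without the full maybe_stuck scan seems to be the following (a sketch; I'm assuming it's reasonable to run eval here - process_info itself doesn't block even when the target process does):

# Check the backlog and current stack of rabbit_vhost_sup_sup on one node.
rabbitmqctl eval 'erlang:process_info(whereis(rabbit_vhost_sup_sup), [message_queue_len, current_stacktrace]).'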

Nik.
rabbitmq_maybestuck.zip