Hi,
We are running an HA RabbitMQ cluster in Kubernetes (5 nodes in a StatefulSet, with a policy of 3 mirrors for classic queues, i.e. ha-params: 3) and keep hitting an issue where newly created vhosts sometimes fail to start properly.
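For reference, the mirroring policy is equivalent to something like the following (a sketch; the policy name and catch-all pattern here are illustrative, ours differ):

    rabbitmqctl set_policy -p <vhost> ha-three "^" '{"ha-mode":"exactly","ha-params":3}' --apply-to queues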
Specifically, the management API call to create the vhost (rabbitmqadmin declare vhost name=<name>) times out after 60 seconds (the HTTP call just ends with "no content"), and the vhost is then visible in the management portal, but with a status of "stopped" on some or all of the nodes. Depending on how many nodes have started it, the vhost may be usable, or may refuse to let any resources be created. Even when queues can be created, mirrored classic queues frequently end up with fewer mirrors than the policy specifies.
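The same per-node status is visible outside the portal via the management HTTP API; a minimal check looks like this (assuming the default management port 15672 and a user with monitoring permissions; jq is only for readability, and we believe the cluster_state field mirrors what the portal shows per node):

    curl -s -u <user>:<pass> http://<host>:15672/api/vhosts/<name> | jq .cluster_state
    # e.g. {"rabbit@rabbitmq-0": "running", "rabbit@rabbitmq-1": "stopped", ...}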
This issue only occurs during vhost creation. Once a vhost is fully up on all nodes, we see no further problems and everything works correctly. A sequential reboot of the nodes fixes the issue temporarily (and also repairs vhosts currently exhibiting it), but eventually the issue returns: a vhost fails to start on at least one of the nodes, and sometimes on all five.
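For clarity, the "sequential reboot" is just a rolling restart of the StatefulSet (assuming it is named rabbitmq):

    kubectl rollout restart statefulset/rabbitmq
    kubectl rollout status statefulset/rabbitmq   # waits while the pods come back one by one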
Because this cluster backs test environments created by CI pipelines, we create and delete a fair number of vhosts on a regular basis, so we hit this issue frequently but inconsistently.
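The per-pipeline vhost lifecycle is roughly the following (names are illustrative):

    rabbitmqadmin declare vhost name=pipeline-id-$BUILD_ID
    # ... the pipeline runs its tests against the new vhost ...
    rabbitmqadmin delete vhost name=pipeline-id-$BUILD_ID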
This issue has persisted across at least the following RabbitMQ versions: 3.8.2, 3.8.11, and 3.8.14, using the Docker community images for RabbitMQ (https://hub.docker.com/_/rabbitmq).
----
Three things stand out from the logs and diagnostics.
1) We constantly see a large number of "Discarding message ... in an old incarnation" errors from the nodes (on the order of 400 per minute), like this:
2021-03-17 11:04:04.828 [error] <0.547.0> Discarding message {'$gen_call',{<0.547.0>,#Ref<0.2527798144.1772355585.119646>},{info,[state]}} from <0.547.0> to <0.22046.103> in an old incarnation (1615212109) of this node (1615971788)
2021-03-17 11:04:04.828 [error] emulator Discarding message {'$gen_call',{<0.547.0>,#Ref<0.2527798144.1772355585.119646>},{info,[state]}} from <0.547.0> to <0.22046.103> in an old incarnation (1615212109) of this node (1615971788)
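If it helps, the two incarnation numbers appear to be Unix timestamps of node start times (our assumption), roughly nine days apart, with the newer one matching the node's last boot:

    date -u -d @1615212109   # Mon Mar  8 14:01:49 UTC 2021 -- old incarnation
    date -u -d @1615971788   # Wed Mar 17 09:03:08 UTC 2021 -- current incarnation

So some process still appears to be holding pids from before the node was last restarted.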
2) When the majority of nodes fail to start the vhost, we occasionally see warnings of this form:
2021-03-23 12:49:32.305 [warning] <0.11689.511> Mirrored queue 'cabinevents.incoming' in vhost 'pipeline-id-76806': Unable to start queue mirror on node ''rab...@rabbitmq-0.rabbitmq.rabbitmq.svc.cluster.local''. Target virtual host is not running: {vhost_supervisor_not_running,<<"pipeline-id-76806">>}
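We have not yet tested whether restarting only the affected vhost on such a node (rather than rebooting the whole pod) clears this state; if we read the CLI docs correctly, that would be something like:

    rabbitmqctl restart_vhost -p <name>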
3) Running rabbitmq-diagnostics maybe_stuck on nodes where a vhost failed to start shows a number of "suspicious processes" that we don't see on healthy nodes. (Specifically, we always see the first two processes, "erts_code_purger" and "dbg.erl", on all of our nodes, but the following 20 processes appear to be related to this issue.)
(See attached file for example output from maybe_stuck)
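For completeness, we collect that output per pod like this (assuming pod names rabbitmq-0 through rabbitmq-4):

    for i in 0 1 2 3 4; do
      kubectl exec rabbitmq-$i -- rabbitmq-diagnostics maybe_stuck
    done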
---
We are currently trying to replicate the issue on a completely separate Kubernetes cluster so that we can post more logs and debug information without worrying about accidentally sharing confidential data from our dev environment.