Non-Promotable Followers in Quorum Queues


sai datta

Feb 3, 2026, 10:40:50 AM
to rabbitmq-users
Hi Team,

We are seeing a weird situation in some of our RabbitMQ clusters running version 4.1.5. Whenever we scale up a cluster and then run the rabbitmq-queues grow command on all the queues, some of the queues become unavailable, with the following errors in the logs:

Error on AMQP connection <0.753291.0> (100.82.135.187:40546 -> 100.82.179.86:5672, vhost: 'haq', user: 'guest', state: closed), channel 37:
operation basic.consume caused a connection exception internal_error: "timed out consuming from quorum queue 'emailNotifications' in vhost 'haq': {haq_emailNotifications,
'rab...@rabbitmq-ha-2.rabbitmq-ha-headless.svc.cluster.local'}"


Running rabbitmq-queues quorum_status emailNotifications --vhost=haq shows the following state of the queue:


┌─────────────────────────────────────────────────────────────┬────────────┬────────────┬────────────────┬──────────────┬──────────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name                                                   │ Raft State │ Membership │ Last Log Index │ Last Written │ Last Applied │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├─────────────────────────────────────────────────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rab...@rabbitmq-ha-0.rabbitmq-ha-headless.svc.cluster.local │ pre_vote   │ voter      │ 37             │ 37           │ 37           │ 37           │ -1             │ 2    │ 7               │
│ rab...@rabbitmq-ha-1.rabbitmq-ha-headless.svc.cluster.local │ follower   │ promotable │ 37             │ 37           │ 37           │ 37           │ -1             │ 2    │ 7               │
│ rab...@rabbitmq-ha-2.rabbitmq-ha-headless.svc.cluster.local │ follower   │ promotable │ 37             │ 37           │ 37           │ 37           │ -1             │ 2    │ 7               │
└─────────────────────────────────────────────────────────────┴────────────┴────────────┴────────────────┴──────────────┴──────────────┴──────────────┴────────────────┴──────┴─────────────────┘

I understand this probably means that the first node is trying to call an election, but it cannot win one because the other members are still promotable (non-voters) and will not cast a vote, so the queue stays stuck in this state. We have tried several ways to force an election, such as shrinking and growing the queue members, restarting the RabbitMQ nodes, and the low-level Erlang commands that trigger an election, but none of them worked.
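
Roughly, the shrink/grow attempt looked like this (a sketch only, assuming the default rabbit@ node name prefix; the low-level Erlang election trigger is left out):

    # remove the stuck member from every queue on that node, then add it back
    rabbitmq-queues shrink rabbit@rabbitmq-ha-2.rabbitmq-ha-headless.svc.cluster.local
    rabbitmq-queues grow rabbit@rabbitmq-ha-2.rabbitmq-ha-headless.svc.cluster.local all

    # check whether the Membership column changed from promotable to voter
    rabbitmq-queues quorum_status emailNotifications --vhost=haq

None of this changed the membership reported by quorum_status.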

With debug logs enabled, we can see this message repeating for all the affected queues:

queue 'emailNotifications' in vhost 'haq': follower ignored pre_vote_rpc, non-voter: promotable


My question is: why is this suddenly happening in this specific version? It never happened before when we used RabbitMQ 3.13.7 and scaled up the cluster. Also, how do we recover from this without losing data?

Thanks in advance for looking.

Michal Kuratczyk

Feb 4, 2026, 3:24:58 AM
to rabbitm...@googlegroups.com
> My question is: why is this suddenly happening in this specific version? It never happened before when we used RabbitMQ 3.13.7 and scaled up the cluster.

Because we made changes to RabbitMQ? I'm not sure what else there is to say. We obviously try to make RabbitMQ better,
but sometimes there are unintended consequences of the changes we make.

Have you used any versions between 3.13.7 and 4.1.5?

> Also, how do we recover from this without losing data?

Judging by the output you shared, it looks like this queue is pretty much empty already. In fact, with an index of just 37,
this queue is not only (almost) empty, it hasn't even processed many messages in the past (if you publish and consume a lot of messages,
even an empty queue will have a higher applied index). So it looks like it might be too late to recover the data.

Please provide a step-by-step description of how you scale the cluster down and up: not only the rabbitmq-queues commands,
but also exactly what you do with the RabbitMQ nodes themselves (forget_cluster_node, etc.).

By the way, community support for RabbitMQ 4.1.x ended a few days ago (https://www.rabbitmq.com/release-information).
If we fix any issues based on this conversation, those fixes will only be available in 4.1.x for customers with long-term support contracts.
Assuming you have no such contract, you'll need to upgrade to 4.2 to get those fixes. I'd highly recommend
testing your workflow on the latest 4.2 as soon as you can.

--
Michal
RabbitMQ Team

sai datta

Feb 5, 2026, 9:48:29 PM
to rabbitmq-users
Hi Michal,

First of all, thanks for looking into this. Answers to your questions:

> Have you used any versions between 3.13.7 and 4.1.5?

No.

> Judging by the output you shared, it looks like this queue is pretty much empty already

Yes, that is because the example I shared was from an environment that is not used much. However, we do have other environments that we plan to upgrade to 4.1.5. If we run into this situation there, what would you recommend we do?


> Please provide a step-by-step description of how you scale the cluster down and up: not only the rabbitmq-queues commands,
> but also exactly what you do with the RabbitMQ nodes themselves (forget_cluster_node, etc.).


So, we use the Bitnami RabbitMQ Helm chart (15.4.1) with an overridden image tag, and we scale up simply by changing the replica count. These environments were running a single-node (non-HA) setup of RabbitMQ 3.13.7, which was upgraded to 4.1.5 first and then scaled up to 3 replicas.

Afterwards, all the queues were replicated to the new nodes using the rabbitmq-queues grow command, as previously stated. We never scale down our environments.
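
Concretely, the sequence looks roughly like this (a sketch only; the release name, chart values and the rabbit@ node name prefix are assumptions, not our exact commands):

    # step 1: upgrade the single node by overriding the image tag
    helm upgrade rabbitmq-ha bitnami/rabbitmq --version 15.4.1 --reuse-values --set image.tag=4.1.5

    # step 2: scale up to 3 replicas
    helm upgrade rabbitmq-ha bitnami/rabbitmq --version 15.4.1 --reuse-values --set replicaCount=3

    # step 3: once the new nodes have joined the cluster, replicate every queue onto them
    rabbitmq-queues grow rabbit@rabbitmq-ha-1.rabbitmq-ha-headless.svc.cluster.local all
    rabbitmq-queues grow rabbit@rabbitmq-ha-2.rabbitmq-ha-headless.svc.cluster.local all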

Michal Kuratczyk

Feb 6, 2026, 4:31:06 AM
to rabbitm...@googlegroups.com
I'd suggest you intentionally try to reproduce this situation so that we can fix it.

If you still have logs from when this happened, please share the logs from all nodes.

Either way, automate this process (with debug logs on) and keep running it in a test env until the issue occurs. Then we'll be able to focus on debugging it.
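
Something along these lines would do (a sketch only; the provisioning scripts, queue/vhost names and the rabbit@ node name prefix are placeholders to adapt to your setup):

    # keep recreating a test cluster and rerunning the upgrade/scale-up/grow steps
    # until a queue ends up with a stuck promotable follower, then keep the debug logs
    while true; do
      ./recreate-single-node-3.13.7-env.sh   # placeholder for your provisioning
      ./upgrade-to-4.1.5-and-scale-to-3.sh   # placeholder for the helm steps you described

      rabbitmq-queues grow rabbit@rabbitmq-ha-1.rabbitmq-ha-headless.svc.cluster.local all
      rabbitmq-queues grow rabbit@rabbitmq-ha-2.rabbitmq-ha-headless.svc.cluster.local all

      sleep 60  # give freshly added members time to catch up and be promoted to voters
      if rabbitmq-queues quorum_status emailNotifications --vhost=haq | grep -q promotable; then
        echo "reproduced: follower(s) stuck as promotable"
        break
      fi
    done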

I've just tried a similar set of steps (although using the Operator, not the chart) a few times and it worked without issues for me,
but it could depend on the exact steps and/or timing of those steps and/or other factors. Or perhaps even the same steps will only occasionally lead to such a problem.




--
Michal
RabbitMQ Team