Messages lost during upgrade


Magnus Bäck

Oct 17, 2025, 10:11:29 AM
to rabbitmq-users
I recently upgraded our three-node RabbitMQ cluster from 4.0.9 to 4.1.4, and a previously observed problem with message loss resurfaced. I can't figure out what's going on.

I started the upgrade process with nodes 01 and 03, leaving the node 02 upgrade until the end. All clients were connected to that node.

One of the queues had 39 unacked messages right before node 02 went down for the upgrade. The sole consumer on the queue has a prefetch of 1000, and each message takes about 20 seconds to process, so clearing the queue should've taken around 13 minutes (39 × 20 s, ignoring any new messages that might arrive). After node 02 came back up from the upgrade, the queue was empty and the reconnected consumer was idle.

This was the same behavior as in the preceding 3.13.7 to 4.0.9 upgrade, which followed a migration from classic mirrored queues to quorum queues. The service with the consumer has been running for years, undergoing several RabbitMQ upgrades and restarts triggered by configuration changes, and I've never seen this behavior before.

We collect metrics for all queues and send them to Datadog, and I note that the queue length as reported by the still-online nodes 01 and 03 plummeted from 39 to 0 as soon as the consumer disconnected following "rabbitmq-upgrade drain" on node 02. I only have 10-second resolution for the queue metrics, so the exact timing may be off. From what I can tell, node 02 was the leader of the queue before it was shut down.

What's clear is that there are a number of queued messages that were never processed by the consumer, i.e. it can't just be incorrect queue length metrics. What's going on here? I'd be happy to provide additional details like logs, configuration, or a general timeline of the cluster upgrade.

Thanks,
Magnus

Michal Kuratczyk

Oct 20, 2025, 2:22:16 AM
to rabbitm...@googlegroups.com
Hard to tell exactly given the high-level description, but I'd start by investigating the delivery limit:

By default, a quorum queue will try to deliver a given message 20 times. If it gets rejected 20 times,
the message will be dead lettered or dropped if dead lettering is not configured. You can:
1. check the following Prometheus metric to see if the delivery limit was reached for any messages:
rabbitmq_global_messages_dead_lettered_delivery_limit_total
2. disable the delivery limit (set it to -1)
3. configure dead-lettering (most likely you want it regardless of this specific issue)
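
For example, something along these lines (a rough sketch; the host, port, and the exchange/policy names are placeholders for your environment):

# 1. check the counter on each node's Prometheus endpoint (default port 15692,
#    requires the rabbitmq_prometheus plugin):
curl -s http://localhost:15692/metrics | grep rabbitmq_global_messages_dead_lettered_delivery_limit_total

# 3. a policy that adds a dead-letter exchange for quorum queues; the "my-dlx" exchange
#    and a queue bound to it must exist for dead-lettered messages to be kept
#    (only one regular policy applies per queue, so you may want to merge this into an existing policy):
rabbitmqctl set_policy --apply-to quorum_queues dead-letter '.*' '{"dead-letter-exchange": "my-dlx"}'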

Best,

--
Michal
RabbitMQ Team

Magnus Bäck

Oct 20, 2025, 9:27:49 AM
to rabbitmq-users
On Monday, October 20, 2025 at 8:22:16 AM UTC+2 Michal Kuratczyk wrote:
Hard to tell exactly given the high-level description,

Yeah, I was short on details to avoid information overload.
 
but I'd start by investigating the delivery limit:

The delivery limit is set to -1 via a policy:

$ curl -nsS https://<hostname>:15671/api/queues/%2F/<queue> | jq .effective_policy_definition
{
  "consumer-timeout": 43200000,
  "delivery-limit": -1,
  "expires": 864000000,
  "max-length": 200000
}

The queue reports no redeliveries anywhere near the time of the lost messages. The queue had been empty on multiple occasions prior to the incident, so the 39 messages were fresh and had been sent to the consumer for their first delivery attempt.

By default, a quorum queue will try to deliver a given message 20 times. If it gets rejected 20 times,
the message will be dead lettered or dropped if dead lettering is not configured. You can:
1. check the following Prometheus metric to see if the delivery limit was reached for any messages:
rabbitmq_global_messages_dead_lettered_delivery_limit_total

The metric currently reports 98 messages for dead_letter_strategy="disabled" on node 02, 22 on node 01, and 141 on node 03. Unfortunately I'm not yet collecting Prometheus metrics, only what the management API provides (I know I have work to do there), so I don't know when the deadlettering occurred. It's a bit curious to see non-zero numbers there since all queues (a few hundred) should be covered by the disabled delivery limit. The metrics counters are reset upon node restart, I suppose? Not sure it matters since the message count metric appears to have dropped before node 02 was restarted. Here's a timeline of the restart:

12:42:10 The queue had 39 messages according to nodes 01 and 03.
12:42:11.808 Node 02 was put in maintenance mode via rabbitmq-upgrade drain.
12:42:20 The queue had 0 messages according to node 01.
12:42:25.962 Node 02 reports getting SIGTERM.
12:42:30 The queue had 0 messages according to node 03.
12:42:32.868 Docker container with old version stops on node 02.
12:42:33.174 Docker container with new version starts on node 02.

The metric timestamps only have 1 s resolution and are quantized to 10 s. Looking in the HTTP logs of node 01, the queue endpoint was scraped at 12:42:26, and I'm guessing that's what ended up being recorded for 12:42:20. If not, that reading must've come from the prior scrape at 12:42:11.

2. disable the delivery limit (set it to -1)
3. configure dead-lettering (most likely you want it regardless of this specific issue)

Perhaps. If nothing else, it would help definitively confirm or refute the hypothesis that my messages were lost to dead-lettering.

Thanks,
Magnus

Michal Kuratczyk

Oct 22, 2025, 2:20:36 AM
to rabbitm...@googlegroups.com
I don't know what's going on. There's a chance that your application does something unexpected when it reconnects
or perhaps your docker setup is incorrect. Try to build a test case that shows the problem, ideally with https://perftest.rabbitmq.com/.

Something like this:
perf-test -qq -u qq -L 20000000 -c 1 -C 1234 -D 1234 -qa x-delivery-limit=-1,expires=864000000 -q 1000

gives you a QQ with 1234 messages and a consumer with 20s "processing" time per message and a prefetch of 1000.
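
A rough sketch of a restart step on top of that workload (node and container names are placeholders; the restart command depends on how you run the nodes):

# on the node that currently hosts the queue leader:
rabbitmq-upgrade drain
# restart the node, e.g. for a Docker-based setup:
docker restart <container>
# take the node out of maintenance mode if it doesn't come back on its own:
rabbitmq-upgrade revive
# then compare the queue depth before and after:
rabbitmqctl list_queues name messages messages_unacknowledged
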
If you can provide the steps where a restart of a node causes messages to be lost, we'll be all over it.

Best,

--
Michal
RabbitMQ Team

Magnus Bäck

Oct 22, 2025, 12:06:03 PM
to rabbitmq-users
I've deployed a test cluster that's as close to the production cluster as I can make it, but I can't reproduce the problem with PerfTest (at least not with a simple node restart; I haven't tried upgrading from one version to another). However, when looking at the queue details I noticed a suspicious difference between the prod and test clusters' queues.

Prod cluster:
$ curl -nsS https://rabbitmq-prod.example.com:15671/api/queues/%2F/queuename | jq .delivery_limit
-1

Test cluster:
$ curl -nsS http://172.27.140.131:15672/api/queues/%2F/queuename | jq .delivery_limit
"unlimited"

Thanks,
Magnus

Michal Kuratczyk

Oct 22, 2025, 12:39:24 PM
to rabbitm...@googlegroups.com
The "unlimited" is expected. What version is the production cluster on? How do you set that limit?



--
Michal
RabbitMQ Team

Magnus Bäck

Oct 22, 2025, 1:10:20 PM
to rabbitmq-users
Both clusters run 4.1.4. The delivery limit is set through a policy in both cases.

Prod cluster:
$ curl -nsS https://rabbitmq-prod.example.com:15671/api/queues/%2F/queuename | jq '{ policy, effective_policy_definition, delivery_limit }'
{
  "policy": "quorum-default",
  "effective_policy_definition": {

    "consumer-timeout": 43200000,
    "delivery-limit": -1,
    "expires": 864000000,
    "max-length": 200000
  },
  "delivery_limit": -1
}
$ curl -nsS https://rabbitmq-prod.example.com:15671/api/policies/%2F/quorum-default | jq .
{
  "vhost": "/",
  "name": "quorum-default",
  "pattern": ".*",
  "apply-to": "quorum_queues",
  "definition": {
    "consumer-timeout": 43200000,
    "delivery-limit": -1
  },
  "priority": 1
}

Test cluster:
$ curl -nsS http://172.27.140.131:15672/api/queues/%2F/queuename | jq '{ policy, effective_policy_definition, delivery_limit }'
{
  "policy": "quorum-default",
  "effective_policy_definition": {

    "consumer-timeout": 43200000,
    "delivery-limit": -1,
    "expires": 864000000,
    "max-length": 200000
  },
  "delivery_limit": "unlimited"
}
$ curl -nsS http://172.27.140.131:15672/api/policies/%2F/quorum-default | jq .
{
  "vhost": "/",
  "name": "quorum-default",
  "pattern": ".*",
  "apply-to": "quorum_queues",
  "definition": {
    "consumer-timeout": 43200000,
    "delivery-limit": -1
  },
  "priority": 1
}

The test cluster was set up from scratch today while the prod cluster has been upgraded step by step over the years from 3.7. The quorum-default policy was created while we were running 3.12 but didn't include the delivery limit until we'd upgraded to 4.0.9.

Since the migration to quorum queues I've also had issues with queue expiration; despite an operator policy with expires=864000000 it looks like abandoned queues aren't being deleted anymore. I'd planned to start a different thread about that problem, but now I'm wondering if they might have the same root cause.

Thanks,
Magnus

Morten Ottestad

Oct 22, 2025, 3:42:27 PM
to rabbitmq-users
4.1.4 contains a bug that affects quorum queues when the delivery limit is set to unlimited: such queues will not expire. It will be fixed in 4.1.5 and 4.2.0.

Magnus Bäck

Oct 22, 2025, 4:27:27 PM
to rabbitmq-users
On Wednesday, October 22, 2025 at 9:42:27 PM UTC+2 Morten Ottestad wrote:
4.1.4 contains a bug that affects quorum queues when the delivery limit is set to unlimited: such queues will not expire. It will be fixed in 4.1.5 and 4.2.0.

Thanks! Then we can at least close the books on that one.

Cheers,
Magnus

Magnus Bäck

Oct 23, 2025, 7:03:15 AM
to rabbitmq-users
On Wednesday, October 22, 2025 at 6:06:03 PM UTC+2 Magnus Bäck wrote:
I've deployed a test cluster that's as close to the production cluster as I can make it, but I can't reproduce the problem with PerfTest (at least not with a simple node restart; I haven't tried upgrading from one version to another). However, when looking at the queue details I noticed a suspicious difference between the prod and test clusters' queues.

Prod cluster:
$ curl -nsS https://rabbitmq-prod.example.com:15671/api/queues/%2F/queuename | jq .delivery_limit
-1

Test cluster:
$ curl -nsS http://172.27.140.131:15672/api/queues/%2F/queuename | jq .delivery_limit
"unlimited"

If I clone the policy in question in the production cluster, queues using the new policy get the expected "unlimited" delivery limit and not -1.

Could this difference be the root cause of the lost messages? If so, can I recreate the policy to get the desired behavior and not risk further loss, or do you want evidence to be preserved for debugging?

Thanks,
Magnus

Michal Kuratczyk

Oct 23, 2025, 9:05:00 AM
to rabbitm...@googlegroups.com
Turns out the -1 value cannot be set via a policy. I wasn't aware of that restriction but will mention it in the docs now that I do know.
So unfortunately, even though your effective policy says -1, it is not unlimited. For the time being, you can only set -1 when declaring
a queue (x-delivery-limit=-1 queue argument). This limitation should be lifted in 4.3.

If you can't redeclare the queues, you can set the redelivery limit to a very high value through a policy for now.
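
For example, something like this (a sketch only; the policy name, values, host, and queue name mirror the quorum-default policy and queue shown earlier in the thread, adjust as needed):

# raise the limit to a very high value via the existing policy:
rabbitmqctl set_policy --apply-to quorum_queues --priority 1 quorum-default '.*' '{"consumer-timeout": 43200000, "delivery-limit": 10000000}'

# or, when (re)declaring a queue, set the limit via a queue argument (management HTTP API;
# note that arguments can't be changed on an existing queue, so the queue would have to be
# deleted and declared again):
curl -nsS -X PUT https://rabbitmq-prod.example.com:15671/api/queues/%2F/queuename -H 'content-type: application/json' -d '{"durable": true, "arguments": {"x-queue-type": "quorum", "x-delivery-limit": -1}}'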

Best,

--
Michal
RabbitMQ Team

Michal Kuratczyk

Oct 23, 2025, 10:03:16 AM
to rabbitm...@googlegroups.com
Just a quick follow-up that what I said above is incorrect, at least in some cases. It seems like different
versions may behave differently, and I'm not even 100% sure at this point whether a single version
will always behave the same way (there might be a bug where the order of operations affects the outcome).
Using a queue argument is the best bet if you can do that. In 4.3 things will be easier. As for 4.1 and 4.2,
I will keep you posted.
--
Michal
RabbitMQ Team

Magnus Bäck

Oct 23, 2025, 10:42:03 AM
to rabbitmq-users
On Thursday, October 23, 2025 at 4:03:16 PM UTC+2 Michal Kuratczyk wrote:
Just a quick follow-up that what I said above is incorrect, at least in some cases. It seems like different
versions may behave differently, and I'm not even 100% sure at this point whether a single version
will always behave the same way (there might be a bug where the order of operations affects the outcome).

That might explain why my old policy gets -1 but the new one gets "unlimited". I'll update the old policy to use a really large integer.
 
Using a queue argument is the best bet if you can do that. In 4.3 things will be easier. As for 4.1 and 4.2,
I will keep you posted.

Thanks for all the help!

Cheers,
Magnus