RabbitMQ publisher/producer losing messages on broker restart

Linus9000

Jun 21, 2024, 4:02:03 AM
to rabbitmq-users
Hello!

I wrote a post on StackOverflow this week about my Spring Boot application. I noticed that the application, which acts as a publisher/producer, has problems with message delivery when the broker is restarted, and unfortunately this does not surface as any catchable exception in my application.


The gist of it is: we are using the RabbitMQ chart by Bitnami, which, at the time I noticed the problem, ships RabbitMQ 3.13.2 / Erlang 26.2.4. While testing the "new" quorum queues as a replacement for our currently used mirrored classic queues, I noticed that when I restart the broker while publishing messages, for example with kubectl rollout restart sts/rabbitmq, some of those messages get lost. The broker runs with three replicas, all of which are members of my quorum queue. We connect through a Kubernetes Service of type LoadBalancer; the implementation is MetalLB, as we are on-premises with Rancher as the management platform.

At first I was losing tens of thousands of messages out of a million because I was not using all of the Spring properties that should guarantee successful delivery to the broker. I gradually enabled those and got down to about a dozen messages that are simply lost.

Artem Bilan has kindly verified this (https://stackoverflow.com/questions/78636146/handling-broker-restarts-as-publisher-producer-without-losing-messages#comment138653995_78636146) using TestContainers, which basically rules out several things as possible causes:

1. It doesn't use Kubernetes
2. It doesn't use any plugins (we are using delayed message exchange)
3. Assumption: it also only uses one container. Due to organizational policy I cannot use TestContainers to verify this, but I might be able to do so on my home machine.

He also states that this happens "without quorum" as well, which is consistent with my observation: mirrored classic queues suffer from the same problem.

The thing is: to me it seems like RabbitMQ sends a publisher confirm, or else Spring would have attempted redelivery. But some messages are still not in my destination queue.

So I wonder: Is there anything in RabbitMQ I could configure to alleviate this problem?

For reference you may also find my simple test project on GitHub: https://github.com/Linus9000/spring-amqp-demo

For those not entirely familiar with Spring: most properties are set in src/main/resources/application.yaml; in my case this includes publisher-confirm-type: correlated, template.mandatory: true and retry.enabled: true with some ridiculously generous values, so that redelivery is attempted often enough to cover all of my testing scenarios. I also set the delivery mode to persistent in src/main/java/com/example/rabbitmqdemo/RabbitController.java.
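
In case it helps picture it, the same setup expressed programmatically looks roughly like this. It is only a condensed sketch with placeholder names ("demo-queue", "msg-1"), not the actual code from the repository, which does all of this through application.yaml:

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.connection.CorrelationData;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class ConfirmWiringSketch {
    public static void main(String[] args) {
        CachingConnectionFactory cf = new CachingConnectionFactory("localhost");
        cf.setPublisherConfirmType(CachingConnectionFactory.ConfirmType.CORRELATED); // publisher-confirm-type: correlated
        cf.setPublisherReturns(true); // required together with mandatory to get returned messages

        RabbitTemplate template = new RabbitTemplate(cf);
        template.setMandatory(true); // template.mandatory: true
        template.setConfirmCallback((correlation, ack, cause) -> {
            // ack == false means the broker did not take responsibility for the message
            System.out.println("confirm: ack=" + ack + ", cause=" + cause);
        });
        template.setReturnsCallback(returned ->
                // unroutable messages come back here instead of being dropped silently
                System.err.println("returned: " + returned.getReplyText()));

        template.convertAndSend("", "demo-queue", "hello", new CorrelationData("msg-1"));
    }
}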

Thank you for your time!

Kind regards,
Linus

Luke Bakken

Jun 21, 2024, 10:16:36 AM
to rabbitmq-users
Hi Linus,

Please add a README.md file to your test project explaining exactly how to use it to reproduce the issue.

Thanks,
Luke

Arnaud Cogoluègnes

Jun 24, 2024, 4:08:53 AM
to rabbitmq-users
There is indeed not enough information to reproduce the issue; we need a detailed procedure.

Please also provide a project that uses the RabbitMQ AMQP Java client. The fewer layers, the easier it is for us to investigate.

Linus9000

Jun 24, 2024, 12:16:04 PM
to rabbitmq-users
Hello,

I just updated the project. It now includes a README with the exact steps I've taken to reproduce this (even without Kubernetes). I hope I included all the necessary information.

As noted in the README, I feel obliged to add: sometimes it takes a few tries to reproduce the problem, which is annoying, to say the least. Locally I had at least a few runs where everything seemed to work fine, only to break on the next run.

I also pushed a branch "raw-amqp" where, so far, I have not reproduced the same problem. Instead, I sometimes end up with too many messages in the queue, which makes verifying the rest basically impossible. Given that I have no prior experience with the official client, I assume I made a mistake somewhere; if you can spot it, I will gladly fix it and run a few more tests. I have also observed that my way of publishing messages with the official client is a lot slower than what I see with Spring.

What confuses me even more is that, although I use waitForConfirmsOrDie(30_000), I'm getting way more exceptions than I expected. I can see that those are AlreadyClosedException, most likely because the broker shuts down cleanly rather than failing with a network error, so maybe I need additional handling for that. What that handling should look like is not clear to me right now, though.
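
For reference, the relevant part of my publish loop currently looks roughly like this (condensed, with placeholder names and imports omitted); the catch block for AlreadyClosedException is exactly where I am unsure what the correct reaction would be:

channel.confirmSelect(); // once per channel, before the loop

for (byte[] body : messages) {
    try {
        channel.basicPublish("", QUEUE, MessageProperties.PERSISTENT_BASIC, body);
        channel.waitForConfirmsOrDie(30_000);
        sent++;
    } catch (AlreadyClosedException e) {
        // The broker (or the connection) went away before the confirm arrived.
        // The fate of the message is unknown: it may or may not be in the queue,
        // so blindly republishing could create a duplicate.
        failed++;
    } catch (TimeoutException e) {
        // No confirm within 30 seconds; waitForConfirmsOrDie closes the channel in that case.
        failed++;
    } catch (IOException | InterruptedException e) {
        // connection-level failure while publishing or waiting
        failed++;
    }
}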

Thanks again for your time.

Kind regards,
Linus

Arnaud Cogoluègnes

Jun 25, 2024, 2:08:44 AM
to rabbitmq-users
Blocking to wait for publish confirms is likely to decrease throughput. See the publish confirm tutorial [1].
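
As a middle ground, the tutorial publishes in batches and blocks once per batch rather than once per message. Roughly (queue name and batch size here are arbitrary, and the surrounding method has to declare the client's checked exceptions):

channel.confirmSelect();
int outstanding = 0;
for (int i = 0; i < messageCount; i++) {
    channel.basicPublish("", "demo-queue", MessageProperties.PERSISTENT_BASIC,
            String.valueOf(i).getBytes());
    if (++outstanding == 100) {          // wait once every 100 messages
        channel.waitForConfirmsOrDie(5_000);
        outstanding = 0;
    }
}
if (outstanding > 0) {
    channel.waitForConfirmsOrDie(5_000); // confirm the last, partial batch
}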

Arnaud Cogoluègnes

Jun 25, 2024, 3:05:38 AM
to rabbitmq-users
Note that a message can be considered lost only if it has been confirmed by the broker but is not in the queue. I don't see any evidence of that so far.

You should read the publish confirms tutorial [1] and adapt your code accordingly.

Linus9000

Jun 26, 2024, 12:00:51 PM
to rabbitmq-users
Hi,

I updated the raw-amqp branch according to the tutorial, using strategy 3 (handling the confirms asynchronously).

What I am seeing is still somewhat confusing to me. Yes, the performance improved by two orders of magnitude, but when testing with two million messages I get the following log output: Sent 317382 messages in total. Failed count: 1682618. Outstanding confirms: 1931200

But I only have 69,200 messages in my queue.

So, I got a quarter of the messages I expected in my queue. Small takeaway so far: Exceptions thrown by the basicPublish method are absolutely meaningless to me and I have to rely on publisher confirms.

Now, we know that 317,382 sent messages plus 1,682,618 failed messages is two million messages total. 1,931,200 outstanding confirms plus 69,200 messages in the queue equates to 2,000,400.

I have no idea where the 400 extra confirms come from. Maybe RabbitMQ forgot it sent those and sent them again after restarting? At least I can see it reacting properly to the SIGTERM it received and shutting down cleanly, as far as I can tell, so all I can do is guess.

2024-06-26 15:19:18.889636+00:00 [notice] <0.64.0> SIGTERM received - shutting down
2024-06-26 15:19:18.889636+00:00 [notice] <0.64.0>
2024-06-26 15:19:18.893599+00:00 [warning] <0.549.0> HTTP listener registry could not find context rabbitmq_prometheus_tls
2024-06-26 15:19:18.901318+00:00 [warning] <0.549.0> HTTP listener registry could not find context rabbitmq_management_tls
2024-06-26 15:19:18.909398+00:00 [info] <0.687.0> stopped TCP listener on 0.0.0.0:5672
2024-06-26 15:19:18.910113+00:00 [info] <0.472.0> Virtual host '/' is stopping
2024-06-26 15:19:18.910221+00:00 [info] <0.1440.0> Closing all connections in vhost '/' on node 'rabbit@my-rabbit' because the vhost is stopping
2024-06-26 15:19:18.910394+00:00 [info] <0.485.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@my-rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2024-06-26 15:19:18.913177+00:00 [info] <0.485.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@my-rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2024-06-26 15:19:18.913288+00:00 [info] <0.481.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@my-rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2024-06-26 15:19:18.915618+00:00 [info] <0.481.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@my-rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped
2024-06-26 15:19:18.948936+00:00 [notice] <0.86.0>     alarm_handler: {clear,system_memory_high_watermark}
2024-06-26 15:19:18.949046+00:00 [notice] <0.86.0>     alarm_handler: {clear,{disk_almost_full,"/"}}
2024-06-26 15:19:18.949096+00:00 [notice] <0.86.0>     alarm_handler: {clear,{disk_almost_full,"/etc/hosts"}}

Alright, next run:

401,109 successful + 1,598,891 failed = 2,000,000
1,866,800 outstanding confirms + 133,600 in queue = 2,000,400

Next:
500,044 + 1,499,956 = 2,000,000
1,776,980 + 223,400 = 2,000,380

So, pretty consistent across multiple runs, but not totally. Just by looking at the numbers I'd be inclined to assume that some messages are duplicated (i.e. they're in the queue, but the confirm didn't make it back to me). Analysing whether all messages for which I got a successful confirm are actually in the queue will take further testing.

For the record, a run without a broker restart tells me there are 247022 outstanding confirms. WTF, I have all 2,000,000 messages in my queue, how did this happen? Is the code updating the Map wrong? Did some confirms just not arrive? Is the connection closed too early when my controller method exits?

Second takeaway (as of right now): Having outstanding confirms does not mean my messages were not delivered to the queue.

Both takeaways combined would mean that I cannot rely on anything to tell me how many, much less which, messages made it into the queue.

One more takeaway: while the performance of waitForConfirmsOrDie might still be acceptable for some of my use cases, I'd have to resort to handling those confirms asynchronously for the most important workloads. That makes concerns like message ordering when retrying a publish, telling a user "hey, we queued all your datasets for processing", handling duplicate messages, etc. harder than they should be, imho. This is not at all a dig against RabbitMQ or the official library, but I'd rather avoid those things if at all possible. My hope was (and still is) that spring-amqp can help with that.

Which raises the question: Can anyone at least reproduce this with spring-amqp and give further guidance in that direction?

Kind regards,
Linus

Luke Bakken

Jun 26, 2024, 3:56:04 PM
to rabbitmq-users
Hi Linus,

> So, I got a quarter of the messages I expected in my queue. Small takeaway so far: Exceptions thrown by the basicPublish method are absolutely meaningless to me and I have to rely on publisher confirms.

That is correct. Publisher confirmations are the only way to guarantee that messages have been enqueued correctly.
 
> Now, we know that 317,382 sent messages plus 1,682,618 failed messages is two million messages total. 1,931,200 outstanding confirms plus 69,200 messages in the queue equates to 2,000,400.
>
> I have no idea where the 400 extra confirms come from. Maybe RabbitMQ forgot it sent those and sent them again after restarting? At least I can see it reacting properly to the SIGTERM it received and shutting down cleanly, as far as I can tell, so all I can do is guess.

Those confirms were probably in-flight within RabbitMQ when you restarted it.
 
> 401,109 successful + 1,598,891 failed = 2,000,000
> 1,866,800 outstanding confirms + 133,600 in queue = 2,000,400
>
> Next:
> 500,044 + 1,499,956 = 2,000,000
> 1,776,980 + 223,400 = 2,000,380

> So, pretty consistent across multiple runs, but not totally. Just by looking at the numbers I'd be inclined to assume that some messages are duplicated (i.e. they're in the queue, but the confirm didn't make it back to me). Analysing whether all messages for which I got a successful confirm are actually in the queue will take further testing.
>
> For the record, a run without a broker restart tells me there are 247022 outstanding confirms. WTF, I have all 2,000,000 messages in my queue, how did this happen? Is the code updating the Map wrong? Did some confirms just not arrive? Is the connection closed too early when my controller method exits?

I think you're not correctly handling the case of multiple messages being confirmed with the same ack. Be sure to check the value of the multiple flag when your application receives an ack for a message. RabbitMQ may be confirming multiple publishes at once.


> The broker may also set the multiple field in basic.ack to indicate that all messages up to and including the one with the sequence number have been handled.
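
In code, assuming the outstanding publishes are tracked in a ConcurrentSkipListMap keyed by publish sequence number (as in the tutorial), the handling looks roughly like this; adapt it to whatever bookkeeping your branch actually uses:

ConfirmCallback cleanOutstanding = (deliveryTag, multiple) -> {
    if (multiple) {
        // one ack covers every outstanding publish up to and including this tag
        outstandingConfirms.headMap(deliveryTag, true).clear();
    } else {
        outstandingConfirms.remove(deliveryTag);
    }
};
channel.addConfirmListener(cleanOutstanding, (deliveryTag, multiple) -> {
    // nack: same bookkeeping, but these messages have to be republished
    cleanOutstanding.handle(deliveryTag, multiple);
});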

> One more takeaway: while the performance of waitForConfirmsOrDie might still be acceptable for some of my use cases, I'd have to resort to handling those confirms asynchronously for the most important workloads. That makes concerns like message ordering when retrying a publish, telling a user "hey, we queued all your datasets for processing", handling duplicate messages, etc. harder than they should be, imho. This is not at all a dig against RabbitMQ or the official library, but I'd rather avoid those things if at all possible. My hope was (and still is) that spring-amqp can help with that.
>
> Which raises the question: Can anyone at least reproduce this with spring-amqp and give further guidance in that direction?

I suggest asking in a spring-amqp-specific forum.

I have some time today to look at your code, finally.
Thanks,
Luke

Luke Bakken

Jun 26, 2024, 4:53:33 PM
to rabbitmq-users
Hi Linus,

Thanks again for providing code. I am barely familiar with Java, but have found the source of the error in your application. Your try-with-resources statement was closing the channel and connection before all confirmations could be processed. Here's a PR to address that:

https://github.com/Linus9000/spring-amqp-demo/pull/1
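
The general shape of the fix (this is not the exact diff in the PR) is to let the outstanding-confirms map drain before the try-with-resources block is allowed to close the channel and connection:

// at the end of the try-with-resources block, before the channel is closed
long deadline = System.currentTimeMillis() + 60_000;
while (!outstandingConfirms.isEmpty() && System.currentTimeMillis() < deadline) {
    Thread.sleep(100); // give in-flight confirms time to arrive
}
if (!outstandingConfirms.isEmpty()) {
    System.err.println(outstandingConfirms.size() + " confirms still outstanding at shutdown");
}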

Luke Bakken

Jun 26, 2024, 5:16:37 PM
to rabbitmq-users
I took a second to check your Spring AMQP code, and read some docs. My best guess is that your Spring AMQP code has the same issue as the "raw-amqp" branch, but I'm not sure.

Arnaud Cogoluègnes

Jun 27, 2024, 5:26:17 AM
to rabbitmq-users
I wrote a simple program to compare confirmed messages to the content of the queue [1].

You can run the program and restart the broker in the middle of publishing. I wrote it quickly and it can be improved, but it was good enough to compare what ends up in the queue after a restart in the middle of publishing.

I ran it several times: no confirmed messages failed to make it to the queue, there were no duplicates (this could happen, especially if the program retries publishing), and some unconfirmed messages did make it to the queue.

I suggest you experiment with this code to learn more about the failure and recovery semantics.
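
The core of the comparison is roughly the following (not the exact code from the program; it assumes each message body is simply its publish index and that the set of confirmed indexes was recorded while publishing):

Set<Long> inQueue = new HashSet<>();
GetResponse response;
while ((response = channel.basicGet("demo-queue", true)) != null) {
    inQueue.add(Long.parseLong(new String(response.getBody(), StandardCharsets.UTF_8)));
}

Set<Long> confirmedButMissing = new HashSet<>(confirmed);
confirmedButMissing.removeAll(inQueue);      // the only case that counts as message loss
Set<Long> presentButUnconfirmed = new HashSet<>(inQueue);
presentButUnconfirmed.removeAll(confirmed);  // expected after a restart, not a loss

System.out.println("confirmed but not in queue: " + confirmedButMissing.size());
System.out.println("in queue but not confirmed: " + presentButUnconfirmed.size());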
