Hi,
I updated the raw-amqp branch according to the tutorial by using strategy 3.
What I am seeing still confuses me somewhat. Yes, the performance improved by two orders of magnitude, but when testing with two million messages I get the following log output: Sent 317382 messages in total. Failed count: 1682618. Outstanding confirms: 1931200
But I only have 69,200 messages in my queue.
So, I got roughly a quarter of the messages I expected in my queue. Small takeaway so far: exceptions thrown by the basicPublish method are absolutely meaningless to me, and I have to rely on publisher confirms.
Now, we know that 317,382 sent messages plus 1,682,618 failed messages is two million messages total. 1,931,200 outstanding confirms plus 69,200 messages in the queue equates to 2,000,400.
I have no idea where the 400 extra confirms come from. Maybe RabbitMQ forgot it had sent them and sent them again after restarting? At least I see it reacting properly to the SIGTERM it received and shutting down cleanly, as far as I can tell, so all I can do is guess.
2024-06-26 15:19:18.889636+00:00 [notice] <0.64.0> SIGTERM received - shutting down
2024-06-26 15:19:18.893599+00:00 [warning] <0.549.0> HTTP listener registry could not find context rabbitmq_prometheus_tls
2024-06-26 15:19:18.901318+00:00 [warning] <0.549.0> HTTP listener registry could not find context rabbitmq_management_tls
2024-06-26 15:19:18.909398+00:00 [info] <0.687.0> stopped TCP listener on 0.0.0.0:5672
2024-06-26 15:19:18.910113+00:00 [info] <0.472.0> Virtual host '/' is stopping
2024-06-26 15:19:18.910221+00:00 [info] <0.1440.0> Closing all connections in vhost '/' on node 'rabbit@my-rabbit' because the vhost is stopping
2024-06-26 15:19:18.910394+00:00 [info] <0.485.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@my-rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2024-06-26 15:19:18.913177+00:00 [info] <0.485.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@my-rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2024-06-26 15:19:18.913288+00:00 [info] <0.481.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@my-rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2024-06-26 15:19:18.915618+00:00 [info] <0.481.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@my-rabbit/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped
2024-06-26 15:19:18.948936+00:00 [notice] <0.86.0> alarm_handler: {clear,system_memory_high_watermark}
2024-06-26 15:19:18.949046+00:00 [notice] <0.86.0> alarm_handler: {clear,{disk_almost_full,"/"}}
2024-06-26 15:19:18.949096+00:00 [notice] <0.86.0> alarm_handler: {clear,{disk_almost_full,"/etc/hosts"}}
Alright, next run:
401,109 successful + 1,598,891 failed = 2,000,000
1,866,800 outstanding confirms + 133,600 in queue = 2,000,400
Next:
500,044 + 1,499,956 = 2,000,000
1,776,980 + 223,400 = 2,000,380
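For what it's worth, the arithmetic can be checked mechanically. A quick sketch (numbers copied from the three runs above) that computes the per-run excess of "outstanding confirms + messages in queue" over the 2,000,000 published:

```java
public class ConfirmArithmetic {
    public static void main(String[] args) {
        long total = 2_000_000L;
        long[][] runs = {
            //    sent,     failed, outstanding,   queued
            { 317_382L, 1_682_618L, 1_931_200L,  69_200L },
            { 401_109L, 1_598_891L, 1_866_800L, 133_600L },
            { 500_044L, 1_499_956L, 1_776_980L, 223_400L },
        };
        for (long[] r : runs) {
            long published = r[0] + r[1]; // always exactly 2,000,000
            long accounted = r[2] + r[3]; // outstanding confirms + messages in queue
            // excess comes out as 400, 400, 380
            System.out.printf("published=%d accounted=%d excess=%d%n",
                    published, accounted, accounted - total);
        }
    }
}
```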
So, pretty consistent across multiple runs, but not perfectly. Just looking at the numbers, I'd be inclined to assume that some messages are duplicated (i.e. they are in the queue, but the confirm didn't make it back to me). Verifying whether every message that got a successful confirm is actually in the queue will take further testing.
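A sketch of what that further testing could look like, assuming each message is stamped with a unique messageId at publish time (the ids and both helper names below are hypothetical, not from the post): compare the ids that received a positive confirm against the ids drained back out of the queue.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ConfirmAudit {
    // Ids delivered more than once when draining the queue.
    static Set<String> duplicates(List<String> drained) {
        Map<String, Integer> counts = new HashMap<>();
        drained.forEach(id -> counts.merge(id, 1, Integer::sum));
        counts.values().removeIf(n -> n == 1);
        return counts.keySet();
    }

    // Ids that were positively confirmed but never showed up in the queue.
    static Set<String> missing(Set<String> confirmed, List<String> drained) {
        Set<String> missing = new TreeSet<>(confirmed);
        drained.forEach(missing::remove);
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder data; in a real test these would come from the confirm
        // callback and from draining the queue, respectively.
        Set<String> confirmed = Set.of("a", "b", "c");
        List<String> drained = List.of("b", "c", "c", "d");

        System.out.println("duplicates: " + duplicates(drained));                    // prints [c]
        System.out.println("confirmed but missing: " + missing(confirmed, drained)); // prints [a]
    }
}
```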
For the record, a run without a broker restart tells me there are 247,022 outstanding confirms. WTF, I have all 2,000,000 messages in my queue, so how did this happen? Is the code updating the Map wrong? Did some confirms just not arrive? Is the connection closed too early when my controller method exits?
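One thing that might be worth double-checking on the "is the Map updated wrong?" question: the broker often acks in batches with multiple=true, and a callback that only removes the single deliveryTag will leave all earlier entries in the map, inflating the outstanding count even though those messages did reach the queue. A minimal sketch of the tutorial-style bookkeeping (strategy 3 keys a ConcurrentNavigableMap by the publish sequence number), with the broker simulated by direct calls:

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class OutstandingConfirms {
    private final ConcurrentNavigableMap<Long, String> outstanding =
            new ConcurrentSkipListMap<>();

    // Called before basicPublish, with the channel's next publish sequence number.
    void track(long seqNo, String body) {
        outstanding.put(seqNo, body);
    }

    // Mirrors the ConfirmCallback signature: (deliveryTag, multiple).
    void handleAck(long deliveryTag, boolean multiple) {
        if (multiple) {
            // Everything up to and including deliveryTag is confirmed.
            outstanding.headMap(deliveryTag, true).clear();
        } else {
            outstanding.remove(deliveryTag);
        }
    }

    int outstandingCount() {
        return outstanding.size();
    }

    public static void main(String[] args) {
        OutstandingConfirms c = new OutstandingConfirms();
        for (long i = 1; i <= 5; i++) c.track(i, "msg-" + i);
        c.handleAck(4, true);   // batch ack: clears 1..4
        c.handleAck(5, false);  // single ack
        System.out.println(c.outstandingCount()); // prints 0
    }
}
```

If the `multiple` branch were missing, the same simulation would end with four entries still "outstanding", which is the shape of the discrepancy described above.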
Second takeaway (as of right now): Having outstanding confirms does not mean my messages were not delivered to the queue.
Both takeaways combined mean that I cannot rely on anything telling me how many, much less which, messages made it into the queue.
One more takeaway: while the performance of waitForConfirmsOrDie might still be acceptable for some of my use cases, I'd have to process those confirms asynchronously for the most important workloads. That makes concerns like message ordering when retrying a publish, telling a user "hey, we queued all your datasets for processing", handling duplicate messages, etc. harder than they should be, imho. This is not at all a dig against RabbitMQ or the official library, but I'd rather avoid those things if at all possible. My hope was (and still is) that spring-amqp can help with that.
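For the "we queued all your datasets" concern, one possible pattern (a single-threaded sketch, not spring-amqp and not from the tutorial; the class and method names are made up) is to hand each publish sequence number a future that the ack/nack callback completes, and answer the user only once every future has resolved positively:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class BatchOutcome {
    private final ConcurrentNavigableMap<Long, CompletableFuture<Boolean>> pending =
            new ConcurrentSkipListMap<>();
    private final List<CompletableFuture<Boolean>> all = new ArrayList<>();

    // Called before each basicPublish with the next publish sequence number.
    void onPublish(long seqNo) {
        CompletableFuture<Boolean> f = new CompletableFuture<>();
        pending.put(seqNo, f);
        all.add(f);
    }

    // Wired to both the ack and nack callbacks; ack is true for a positive confirm.
    void onConfirm(long deliveryTag, boolean multiple, boolean ack) {
        if (multiple) {
            var confirmed = pending.headMap(deliveryTag, true);
            confirmed.values().forEach(f -> f.complete(ack));
            confirmed.clear();
        } else {
            var f = pending.remove(deliveryTag);
            if (f != null) f.complete(ack);
        }
    }

    // true only if every message in the batch was positively confirmed so far.
    boolean allQueued() {
        return all.stream().allMatch(f -> f.getNow(false));
    }

    public static void main(String[] args) {
        BatchOutcome b = new BatchOutcome();
        for (long i = 1; i <= 3; i++) b.onPublish(i);
        b.onConfirm(2, true, true);   // batch ack for 1 and 2
        b.onConfirm(3, false, true);  // single ack for 3
        System.out.println(b.allQueued()); // prints true
    }
}
```

This still leaves retry ordering and deduplication open, but it at least turns "outstanding confirms" into a per-message answer instead of a bare counter.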
Which raises the question: Can anyone at least reproduce this with spring-amqp and give further guidance in that direction?
Kind regards,
Linus