Hello,
We are migrating from a single-instance RabbitMQ to a cluster in Kubernetes.
To do so, we use the operator v1.8.2 with rabbitmq-server 3.8.19.
We are running some tests to make sure we do not lose any messages, especially during a rolling upgrade or a Kubernetes node upgrade.
To run these tests, I publish at a fixed rate to an exchange and restart the Kubernetes StatefulSet.
I initially observed message loss between our clients and the cluster, but I fixed it by enabling publisher confirms.
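A minimal sketch of the kind of loss check such a test relies on, assuming each published message carries a unique sequence id. The function name and the id scheme are illustrative assumptions, not the actual test code. Duplicates are reported as well, since publisher confirms only give at-least-once delivery.

```python
from collections import Counter


def compare_runs(confirmed_ids, consumed_ids):
    """Compare ids confirmed by the broker with ids seen by the consumer.

    Returns (lost, duplicated):
      lost       - ids that were confirmed but never consumed
      duplicated - ids that were consumed more than once
    """
    consumed = Counter(consumed_ids)
    lost = sorted(set(confirmed_ids) - set(consumed))
    duplicated = sorted(i for i, n in consumed.items() if n > 1)
    return lost, duplicated


if __name__ == "__main__":
    # Hypothetical run: id 2 never arrived, id 1 arrived twice.
    lost, duplicated = compare_runs(range(5), [0, 1, 1, 3, 4])
    print("lost:", lost, "duplicated:", duplicated)
```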
Now I am facing another kind of message loss, which occurs inside the cluster.
This loss happens with the "delayed message" pattern[1]: a queue with no consumer but with a TTL and a dead-letter target, where the dead-letter target is the queue that is actually consumed.
Delayed queue (classic mirrored):
name: dummy.event.queue.3.delayed
x-dead-letter-exchange:
x-dead-letter-routing-key: dummy.event.queue.3
x-message-ttl: 10000
durable: true
ha-mode: all
ha-sync-mode: automatic
Dead-letter queue (quorum):
name: dummy.event.queue.3
durable: true
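For readers who want to reproduce this, the two queues above could be declared roughly as follows. This is only a sketch under assumptions: the empty x-dead-letter-exchange is taken to mean the default exchange (""), the channel is assumed to be a pika-style channel exposing queue_declare(), and the helper names are made up for illustration. ha-mode and ha-sync-mode are policy keys rather than queue arguments, so they are shown as a policy definition that could be passed to rabbitmqctl set_policy.

```python
import json

TARGET_QUEUE = "dummy.event.queue.3"
DELAY_QUEUE = TARGET_QUEUE + ".delayed"


def delay_queue_arguments(target_queue, ttl_ms):
    """Arguments for the consumer-less queue that holds messages until the TTL."""
    return {
        # Assumption: empty value in the listing means the default exchange.
        "x-dead-letter-exchange": "",
        # The default exchange routes by queue name.
        "x-dead-letter-routing-key": target_queue,
        "x-message-ttl": ttl_ms,
    }


def mirroring_policy_json():
    """Policy definition for the classic mirrored queue, as a JSON string."""
    return json.dumps({"ha-mode": "all", "ha-sync-mode": "automatic"})


def declare_queues(channel):
    """Declare both queues on a pika-style channel (assumed interface)."""
    channel.queue_declare(
        queue=DELAY_QUEUE,
        durable=True,
        arguments=delay_queue_arguments(TARGET_QUEUE, 10000),
    )
    channel.queue_declare(
        queue=TARGET_QUEUE,
        durable=True,
        arguments={"x-queue-type": "quorum"},
    )
```

A message published to dummy.event.queue.3.delayed then stays there for 10 seconds and is dead-lettered to dummy.event.queue.3, which is the hop where the loss is observed.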
The loss seems to occur when the TTL expires at the same time as the node hosting the dead-letter queue is put into maintenance mode.
A classic mirrored queue or a quorum queue that is consumed directly (or not consumed at all) does not lose any messages.
I have read in the rabbitmq.com documentation that dead-lettered messages are re-published without publisher confirms turned on, and that dead-lettering is therefore not guaranteed to be safe in a clustered environment[2].
However, on rabbitmq.docs.pivotal.io, the exact opposite is written[3]: "Dead-lettered messages are re-published with publisher confirms turned on internally so, the dead-letter queues the messages eventually land on must confirm the messages before they are removed from the original queue."
As far as I understand, Pivotal is talking about its commercial version. Is the commercial version somehow different from the open-source version on this point?
I also tried the other way to produce delayed messages, the rabbitmq-delayed-message-exchange plugin, with the exact same results: messages are lost during the routing that occurs at the end of the x-delay.
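For context, a plugin-based setup generally looks like the sketch below. The exchange name is hypothetical and the helpers are illustrative only; the exchange type x-delayed-message, the x-delayed-type argument and the x-delay header are the ones the plugin documents. The returned dicts are meant to be passed to a pika-style exchange_declare() and to the message headers respectively.

```python
def delayed_exchange_declaration(name, routed_as="direct"):
    """Keyword arguments for declaring an exchange handled by the plugin.

    routed_as is the exchange type the plugin emulates once the delay ends.
    """
    return {
        "exchange": name,
        "exchange_type": "x-delayed-message",
        "durable": True,
        "arguments": {"x-delayed-type": routed_as},
    }


def delay_headers(delay_ms):
    """Message headers asking the plugin to hold the message for delay_ms."""
    if delay_ms < 0:
        raise ValueError("delay must not be negative")
    return {"x-delay": delay_ms}


if __name__ == "__main__":
    # "dummy.event.delayed" is a hypothetical exchange name.
    print(delayed_exchange_declaration("dummy.event.delayed"))
    print(delay_headers(10000))
```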
Is there any way to achieve message delay with delivery guarantees in a clustered environment?
Please find attached a file containing the logs of the three nodes.
In this test, a few messages disappeared when the TTL was reached, around 15:25:23.
Thanks a lot for your help.
Regards,
Julien
[1] https://www.cloudamqp.com/docs/delayed-messages.html
[2] https://www.rabbitmq.com/dlx.html#safety
[3] https://rabbitmq.docs.pivotal.io/36/rabbit-web-docs/dlx.html