Deadletter: message loss in cluster configuration


Julien Tard

Sep 7, 2021, 11:11:35 AM
to rabbitmq-users

Hello,

We are migrating from a mono-instance RabbitMQ to a cluster in Kubernetes.
To do so we use the operator v1.8.2, with rabbitmq-server 3.8.19.

We are running some tests to make sure we do not lose any messages, especially during a rolling upgrade or a Kubernetes node upgrade.

To run these tests, I publish at a fixed rate to an exchange and restart the Kubernetes StatefulSet.

I initially observed message loss between our clients and the cluster, but I fixed it with the publisher confirms feature.
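One way to check a test run like this for loss is to give every message an ID and compare what the broker confirmed against what the consumer received. The post does not describe the actual test harness, so this is only a minimal sketch; the function name `compare_runs` and the use of integer IDs are assumptions made for illustration.

```python
from collections import Counter


def compare_runs(confirmed_ids, consumed_ids):
    """Return (lost, duplicated) message IDs for one test run.

    confirmed_ids: IDs the publisher got a confirm for (hypothetical bookkeeping).
    consumed_ids:  IDs the consumer actually received, in arrival order.
    """
    seen = Counter(consumed_ids)
    # Confirmed by the broker but never delivered: these are the real losses.
    lost = [msg_id for msg_id in confirmed_ids if seen[msg_id] == 0]
    # Delivered more than once, e.g. after a publisher retried a message.
    duplicated = sorted(msg_id for msg_id, count in seen.items() if count > 1)
    return lost, duplicated
```

Only confirmed IDs are counted as published, because a message the broker never confirmed may legitimately be absent. Duplicates are reported separately since retrying after a missing confirm can deliver the same message twice.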

Now I am facing another message loss, which occurs inside the cluster.
This loss happens with the "delayed message" pattern: messages are published to a queue that has no consumer but has a TTL and a dead-letter configuration, and once the TTL expires they are dead-lettered to the queue that is actually consumed[1].

Queue delayed (classic mirrored):
name: dummy.event.queue.3.delayed
x-dead-letter-exchange:    
x-dead-letter-routing-key: dummy.event.queue.3
x-message-ttl:    10000
durable:    true
ha-mode:    all
ha-sync-mode:    automatic

Deadletter queue (quorum):
name: dummy.event.queue.3
durable:    true
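For reference, the two queues described above could be declared roughly as follows. This is a sketch, not the configuration actually used in the test: the helper names are hypothetical, and the `channel` is assumed to be a pika-style channel (any object with a `queue_declare(queue=..., durable=..., arguments=...)` method). The queue names, TTL, and dead-letter settings are taken from the listing above.

```python
# Queue names and values come from the post; helper names are hypothetical.
DELAYED_QUEUE = "dummy.event.queue.3.delayed"
TARGET_QUEUE = "dummy.event.queue.3"


def delayed_queue_arguments(target_queue, ttl_ms):
    """Arguments for a consumer-less queue that dead-letters expired messages."""
    return {
        # The empty string selects the default exchange, which routes by queue name.
        "x-dead-letter-exchange": "",
        "x-dead-letter-routing-key": target_queue,
        "x-message-ttl": ttl_ms,
    }


def declare_delay_topology(channel, ttl_ms=10000):
    """Declare both queues on a pika-style channel."""
    # Declare the consumed queue first so the dead-letter target exists
    # before any message in the delayed queue can expire.
    channel.queue_declare(
        queue=TARGET_QUEUE,
        durable=True,
        arguments={"x-queue-type": "quorum"},
    )
    channel.queue_declare(
        queue=DELAYED_QUEUE,
        durable=True,
        arguments=delayed_queue_arguments(TARGET_QUEUE, ttl_ms),
    )
```

Note that ha-mode and ha-sync-mode are not queue arguments: classic queue mirroring is configured through a policy matching the queue name, so those two keys do not appear in the declaration.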

The loss seems to occur when the TTL is reached at the same time as the node hosting the dead-letter queue is put into maintenance mode.

A classic mirrored queue or a quorum queue that is consumed directly (or not consumed at all) does not lose any messages.

I have read in the rabbitmq.com documentation that dead-lettered messages are re-published without publisher confirms, and that dead-lettering is therefore not guaranteed to be safe in a cluster environment[2].
However, on rabbitmq.docs.pivotal.io, the exact opposite is written[3]: "Dead-lettered messages are re-published with publisher confirms turned on internally so, the dead-letter queues the messages eventually land on must confirm the messages before they are removed from the original queue."

As far as I understand, Pivotal is describing its commercial version. Does the commercial version differ from the open source version on this point?

I tried the other way to produce delayed messages, with the rabbitmq-delayed-message-exchange plugin, and got the exact same results: messages are lost during the routing that occurs at the end of the x-delay.

Is there any way to achieve message delay with guarantees in a cluster environment?

Please find attached a file containing the logs of the three nodes.
In this test, a few messages disappeared when the TTL was reached around 15:25:23.

Thanks a lot for your help.

Regards,

Julien

[1] https://www.cloudamqp.com/docs/delayed-messages.html
[2] https://www.rabbitmq.com/dlx.html#safety
[3] https://rabbitmq.docs.pivotal.io/36/rabbit-web-docs/dlx.html

extract-2021-09-07_17-09-13.csv

Michal Kuratczyk

Sep 7, 2021, 11:28:36 AM
to rabbitm...@googlegroups.com
Hi,

This is unfortunate but documented and, with the current design, unavoidable behaviour. The RabbitMQ docs are correct: https://www.rabbitmq.com/dlx.html#safety
Docs on the Pivotal website must have been copied before the PR that corrected the docs: https://github.com/rabbitmq/rabbitmq-website/pull/1088
I'll ask for the docs on the Pivotal website to be updated.

We do have ideas and plans to implement a safe DLX feature but it's not currently being worked on.

Regarding a commercial version, Pivotal was acquired by VMware almost 2 years ago. The commercial version includes a handful of additional features (but nothing related to DLX).
You can learn more here: https://www.rabbitmq.com/tanzu/

Thank you for reporting the incorrect docs.

Best,

--
Michał
RabbitMQ team

Julien Tard

Sep 13, 2021, 8:08:58 AM
to rabbitmq-users
Hello Michał,

Thank you for your precise and quick answer.

I have been running more tests, and replacing the quorum dead-letter queue with a classic mirrored queue avoids the message loss.
This is probably due to the fact that a quorum queue, according to the documentation, is unavailable during the election of a new leader.
Apparently this is different for a classic mirrored queue.

I keep in mind that it is still unsafe, of course, even if my tests show no message loss.