Hi, I believe we have found a bug in rabbitmq-server. It seems related to the dead-letter exchange and classic mirrored queues.
What's strange is that the mirror nodes can use all available RAM (32 GB) even though there are only 3 GB of data in the queues.
It seems to be reproducible with the following scenario:
Setup:
* 3-node RabbitMQ cluster (Amazon MQ mq.m5.2xlarge instances: 8 vCPUs, 32 GB RAM; also reproduces on k8s and on bare metal)
* ha-mode: all policy
* 2 classic durable queues (declared as sketched below):
* "input" queue:
classic, durable, lazy,
x-dead-letter-exchange: "",
x-dead-letter-routing-key: "output"
* "output" queue:
classic, durable, lazy
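
For reference, a minimal aio-pika sketch of the queue declarations above (the connection URL is a placeholder; the ha-mode: all policy is applied separately on the server via a policy and is not shown here):

import asyncio

import aio_pika

AMQP_URL = "amqp://guest:guest@localhost/"  # placeholder, adjust to your cluster


async def declare_queues() -> None:
    connection = await aio_pika.connect_robust(AMQP_URL)
    async with connection:
        channel = await connection.channel()

        # "output" queue: classic, durable, lazy
        await channel.declare_queue(
            "output",
            durable=True,
            arguments={"x-queue-mode": "lazy"},
        )

        # "input" queue: classic, durable, lazy, dead-lettering via the
        # default exchange ("") with routing key "output"
        await channel.declare_queue(
            "input",
            durable=True,
            arguments={
                "x-queue-mode": "lazy",
                "x-dead-letter-exchange": "",
                "x-dead-letter-routing-key": "output",
            },
        )


if __name__ == "__main__":
    asyncio.run(declare_queues())
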
Steps to reproduce:
1. Inject 3M messages of 1 KB each into the "input" queue, for a total of 3 GB of data.
2. Start 10 consumers that reject (requeue=false) every message, so the messages get dead-lettered to the "output" queue (see the sketch after these steps).
3. After a while, you can observe high memory usage on 1 or 2 mirror nodes for the "output" queue, which eventually results in the node(s) being OOM-killed. The memory is used by the mirror queue process and by binaries, and is much larger than the total amount of data in the queues.
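
To make the scenario concrete, here is a rough sketch of steps 1 and 2 with aio-pika (message count, payload size, and consumer count come from the steps above; the connection URL and prefetch count are placeholders):

import asyncio

import aio_pika

AMQP_URL = "amqp://guest:guest@localhost/"  # placeholder
MESSAGE_COUNT = 3_000_000   # 3M messages
PAYLOAD = b"x" * 1024       # 1 KB each, ~3 GB total
CONSUMER_COUNT = 10


async def publish() -> None:
    # Step 1: inject 3M persistent 1 KB messages into the "input" queue.
    connection = await aio_pika.connect_robust(AMQP_URL)
    async with connection:
        channel = await connection.channel()
        for _ in range(MESSAGE_COUNT):
            await channel.default_exchange.publish(
                aio_pika.Message(
                    body=PAYLOAD,
                    delivery_mode=aio_pika.DeliveryMode.PERSISTENT,
                ),
                routing_key="input",
            )


async def reject_all() -> None:
    # Step 2: reject every message with requeue=False so the broker
    # dead-letters it to the "output" queue.
    connection = await aio_pika.connect_robust(AMQP_URL)
    async with connection:
        channel = await connection.channel()
        await channel.set_qos(prefetch_count=100)  # placeholder prefetch
        queue = await channel.get_queue("input")
        async with queue.iterator() as messages:
            async for message in messages:
                await message.reject(requeue=False)


async def main() -> None:
    await publish()
    await asyncio.gather(*(reject_all() for _ in range(CONSUMER_COUNT)))


if __name__ == "__main__":
    asyncio.run(main())
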
This scenario is an isolated edge case from a larger application that uses DLX and nack(requeue=false) to move messages back to a different queue on shutdown. The original application uses the AMQP-CPP C++ library, while this example uses Python aio-pika; since the problem reproduces with both clients, I believe the bug is in rabbitmq-server itself.


Does anyone have an idea what could be causing this and how to fix it?
PS: I have scripts to reproduce and more screenshots on Imgur, but this mailing list seems to remove my post if I add links here...