High memory usage in replica nodes for classic queues


Alessio Casco

Aug 31, 2022, 6:36:22 AM
to rabbitmq-users
Hello folks,

We are writing here before opening an issue on GitHub, to confirm whether what we see is a bug: a weird memory usage pattern is happening on our replica nodes for two specific queues.

Some basic info:
We use RabbitMQ 3.10.7, running on dedicated nodes in Kubernetes with gp3 volumes, 16 GB of memory available on each node, and the high memory watermark set to 12 GB.
We use the latest version of the RabbitMQ client library to publish and consume messages.
The queues are classic, durable, and have the following policy:

ha-mode: exactly
ha-params: 2
ha-sync-mode: automatic
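For reference, a policy like this is normally applied with rabbitmqctl; the policy name and queue pattern below are placeholders, not taken from the thread:

```shell
# Mirror matching queues to exactly 2 nodes with automatic sync,
# in the mq-events vhost. "ha-two" and "^event-" are placeholders.
rabbitmqctl set_policy -p mq-events ha-two "^event-" \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' \
  --apply-to queues
```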

These two queues have a message rate that ranges between ~20/s and ~50/s on average, and messages don't sit on the queue: they are consumed as soon as possible.

[screenshot attachment]

The problem:
What we see is that the replica nodes for these queues show quite a strange memory pattern. As you can see in these two screenshots, the memory on the replica node for one of the queues fluctuates between 727 MB and 1.16 GB, back and forth, increasing over time up to almost 6 GB.

[screenshot attachments]

rabbitmq-diagnostics list_queues -p mq-events name pid synchronised_slave_pids

event-view-demo.mq-event-store        <rab...@rabbitmq-2.rabbitmq-headless.rabbitmq.svc.cluster.local.1661438942.823.0>        [<rab...@rabbitmq-1.rabbitmq-headless.rabbitmq.svc.cluster.local.1661439220.17337.0>]
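To dig further into where that memory goes, the standard diagnostics commands below can help; the node name is a placeholder, since the actual node names in the thread are elided:

```shell
# Per-queue memory footprint and depth in the mq-events vhost
rabbitmq-diagnostics list_queues -p mq-events name memory messages

# Per-node memory breakdown; run it against the replica node
# (replace <replica-node> with the actual node name)
rabbitmq-diagnostics memory_breakdown -n rabbit@<replica-node>
```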

Here you see the memory usage from the last 3 days, and you can notice a correlation between the free disk space and the available memory. It’s also evident that only the replica has this memory issue, not the master.

[screenshot attachments]
Do you know what can cause this weird behaviour where a lot of memory is used by a replica node for queues that have basically nothing on them?

Thanks
Alessio

kjnilsson

Sep 1, 2022, 6:24:18 AM
to rabbitmq-users
Hi,

That does not sound like expected behaviour. As you may be aware, classic mirrored queues have many technical issues, and we have deprecated them for removal in RabbitMQ 4.0.

There are, however, some things you could try, the first being enabling classic queues v2 (CQv2), which have a much more reliable memory profile.

The other is migrating your system to quorum queues, which are our recommended replicated, highly available queue type.
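For the record, sketches of both suggestions are below; the vhost, policy names, patterns, and queue name are placeholders. CQv2 can be enabled per-queue via the queue-version policy key (RabbitMQ 3.10+), while a quorum queue must be declared fresh with the x-queue-type argument, since an existing classic queue cannot be converted in place:

```shell
# Switch matching classic queues to the v2 implementation
rabbitmqctl set_policy -p mq-events cqv2 "^event-" \
  '{"queue-version":2}' --apply-to queues

# Declare a quorum queue, e.g. with rabbitmqadmin; contents of the
# old classic queue have to be migrated separately
rabbitmqadmin -V mq-events declare queue name=my-quorum-queue \
  durable=true arguments='{"x-queue-type":"quorum"}'
```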



Cheers
Karl

Alessio Casco

Sep 1, 2022, 7:55:13 AM
to rabbitmq-users
Hello

Thanks for your reply.

I forgot to mention that we already tried moving the queue to CQv2, with no real difference in memory usage. We'll try to move these queues to quorum and let you know here if something improves.
Do you think it's worth opening a GitHub issue in the meantime, since the memory consumption is indeed strange here?

Thanks
Alessio

Luke Bakken

Sep 1, 2022, 9:15:21 AM
to rabbitmq-users
> I forgot to mention that we already tried moving the queue to CQv2, with no real difference in memory usage. We'll try to move these queues to quorum and let you know here if something improves.
> Do you think it's worth opening a GitHub issue in the meantime, since the memory consumption is indeed strange here?

Thanks for offering to open an issue. We would need a reliable way to reproduce the issue to spend time investigating it. Some questions -
  • I'm assuming there are more than just these two problematic queues in your system. Is that true? Can you export your definitions and provide them?
  • If you remove or stop publishing to these two queues, does the issue resolve itself?
  • If you have other mirrored classic queues that do not exhibit this issue, what is different about the applications publishing and consuming from them?
  • And, as Karl asked, does moving to quorum queues resolve the issue?
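For anyone following along, definitions can be exported either with rabbitmqctl or over the management HTTP API; the output path, host, and credentials below are placeholders:

```shell
# Export all definitions (exchanges, queues, bindings, policies, users)
rabbitmqctl export_definitions /tmp/definitions.json

# Or via the management HTTP API (adjust host and credentials)
curl -u guest:guest http://localhost:15672/api/definitions > definitions.json
```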
Thanks -
Luke

Alessio Casco

Sep 2, 2022, 5:47:34 AM
to rabbitmq-users
Hi!

I'll reply inline down here

On Thursday, September 1, 2022 at 2:15:21 PM UTC+1 luker...@gmail.com wrote:
> Thanks for offering to open an issue. We would need a reliable way to reproduce the issue to spend time investigating it. Some questions -
>   • I'm assuming there are more than just these two problematic queues in your system. Is that true? Can you export your definitions and provide them?
Only those two queues have the problem, and we think it's related to the message rate, since those two have the highest rate in the cluster.
>   • If you remove or stop publishing to these two queues, does the issue resolve itself?
If we disable HA or set ha-params to 1, the problem disappears immediately. We never tried to stop publishing, as this is happening in prod, but we'll try to see if we can test something and let you know the findings.
>   • If you have other mirrored classic queues that do not exhibit this issue, what is different about the applications publishing and consuming from them?
All our queues are classic, and almost all of them have ha-mode: all and ha-sync-mode: automatic. The problem appears only on those two queues; the use case is very basic and shared with other queues too, which is why we tend to think the issue is related to the rate more than anything else.
>   • And, as Karl asked, does moving to quorum queues resolve the issue?
Moving to quorum will take some time since it requires changes in the app code, so it's not something we can do very quickly, but we'll try to test this and report back the results.
Thanks
Alessio

Luke Bakken

Sep 4, 2022, 4:35:46 PM
to rabbitmq-users
Hi Alessio,

> Thanks for offering to open an issue. We would need a reliable way to reproduce the issue to spend time investigating it. Some questions -
>   • I'm assuming there are more than just these two problematic queues in your system. Is that true? Can you export your definitions and provide them?
> Only those two queues have the problem, and we think it's related to the message rate, since those two have the highest rate in the cluster.

What would be helpful is for you to export your definitions and provide them, as well as typical message rates and message sizes for your queues. That way I can try to reproduce what you report using PerfTest (https://rabbitmq.github.io/rabbitmq-perf-test/stable/htmlsingle/).
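As a rough illustration, a PerfTest run approximating the rates described earlier (~20-50 msg/s) might look like the sketch below; the message size, URI, and queue name are assumptions, not values from this thread:

```shell
# One publisher and one consumer on a durable queue, capped at 50 msg/s
# with 1 KiB persistent messages; adjust --uri to point at your broker.
bin/runjava com.rabbitmq.perf.PerfTest \
  --uri amqp://localhost \
  --producers 1 --consumers 1 \
  --rate 50 --size 1024 \
  --queue perf-test-events \
  --flag persistent --auto-delete false
```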

If you don't wish to publish your definitions publicly, use the "Reply to author" feature in the Google Groups web interface, or email them to me at lu...@bakken.io
>   • If you remove or stop publishing to these two queues, does the issue resolve itself?
> If we disable HA or set ha-params to 1, the problem disappears immediately. We never tried to stop publishing, as this is happening in prod, but we'll try to see if we can test something and let you know the findings.

When you disable HA or set ha-params to 1, does it only change the policy for the queues that cause issues, or does that policy change for more than just those two queues?

Thanks
Luke 

Luke Bakken

Sep 7, 2022, 1:49:28 PM
to rabbitmq-users
Hi again Alessio,


I'm pretty sure switching to quorum queues will address this issue.

I'll keep an eye out for a response to the information I requested below. No big rush, I have plenty of other things to work on.

Luke

Alessio Casco

Oct 21, 2022, 4:43:03 AM
to rabbitmq-users
Hello Luke

Sorry for my late reply on this.
Quorum can be a solution, I know, but we currently can't move over due to some constraints related to our desire to move to the AWS RabbitMQ offering (which doesn't support quorum queues yet), and also due to other engineering priorities that popped up in the last months.

> When you disable HA or set ha-params to 1, does it only change the policy for the queues that cause issues, or does that policy change for more than just those two queues?
When we remove HA or set ha-params to 1, it only affects the two queues that are impacted by this behaviour.

I'll check with the team and send you all the information.

Thanks
Alessio

Luke Bakken

Oct 21, 2022, 9:57:24 AM
to rabbitmq-users
Hi Alessio,

What a coincidence, I was just investigating the same issue yesterday - https://github.com/rabbitmq/rabbitmq-server/issues/5312

The short of it is that we're not going to spend a lot of time fixing classic mirrored queue issues, since quorum queues will be the only form of queue mirroring available in RabbitMQ 4 and beyond.

Since I can reproduce issue 5312, I may spend some time seeing if there is an obvious fix.

Thanks,
Luke

Alessio Casco

Nov 3, 2022, 11:19:43 AM
to rabbitmq-users
Hello Luke.

Many thanks for your time investigating this.
I'll send you an export of the definitions directly as well, so you can test with it.

Let me know if you need anything from us, or if there is something we can do to help

Thanks
Alessio

Luke Bakken

Nov 15, 2022, 11:37:51 AM
to rabbitmq-users
Hello,


The real fix is to use quorum queues, but improvements are being made.
