Restart Issue with over 18k queues on 3.13.0 and 3.12.x

Lance Powell

Mar 6, 2024, 6:27:10 PM

to rabbitmq-users

We've run this cluster with around 20k queues for about 10 years and it has been very stable, great work!

We run the Ubuntu-based Docker image on 3 hosts, all running Rocky 9. Host networking vs. port forwarding didn't make a difference.

Starting with 3.12 (I think; it was fine on 3.11.8 at least), when we restart one of the nodes it will not rejoin the cluster. I initially thought our Consul peer discovery was causing it to think that nodes weren't started, so I moved to DNS discovery, but unfortunately that doesn't fix it. The cluster runs fine once it's up, and the initial creation of the 3.13 cluster went fine; we basically did a full cluster shutdown and upgrade to get it in.

Logs show the error while waiting for Mnesia tables. I increased the timeout from 30 seconds to 60/120/300 etc. and it didn't help. The only way I'm able to get nodes back into the cluster is to remove them from the cluster, delete the mnesia folder, and start them up again. Even then it sometimes doesn't work, and the debug log just stops right after starting the plugins. Very odd.

We were hopeful 3.13 would help; these are all classic queues on v2. All feature flags enabled on 3.13.

Looking forward to hearing any ideas. Here are our docker start command and our config.

sudo docker run -d \
--hostname "$(hostname)" \
--name rabbitmq \
--restart unless-stopped \
-p 5672:5672 \
-p 15672:15672 \
-p 15671:15671 \
-p 15691-15692:15691-15692 \
-p 4369:4369 \
-p 5671:5671 \
-p 25672:25672 \
--ulimit nofile=500000:500000 \
-v /etc/rabbitmq:/etc/rabbitmq:rw \
-v /var/lib/rabbitmq:/var/lib/rabbitmq:rw \
rabbitmq:${rabbitmq_version}-management

cluster_formation.peer_discovery_backend = dns
cluster_formation.dns.hostname = ${cluster_dns_name}

collect_statistics = coarse
collect_statistics_interval = 5000
delegate_count = 30
vm_memory_high_watermark.relative = 0.75
vm_memory_high_watermark_paging_ratio = 0.75
cluster_partition_handling = autoheal
channel_max = 2048
consumer_timeout = 10800000
log.default.level = critical
classic_queue.default_version = 2

[rabbitmq_federation,rabbitmq_federation_management,rabbitmq_management,rabbitmq_peer_discovery_common,rabbitmq_peer_discovery_consul,rabbitmq_shovel,rabbitmq_shovel_management,rabbitmq_prometheus].

[
  {mnesia, [
    {dump_log_write_threshold, 50000}
  ]},
  {rabbit, [
    {queue_index_max_journal_entries, 262144}
  ]}
].

jo...@cloudamqp.com

Mar 9, 2024, 11:46:10 PM

to rabbitmq-users

~20k queues shouldn't cause Mnesia tables to get stuck (unless there are a million exchanges and bindings). Are you sure that networking between the nodes works as it should? And that no readiness probes are messing things up [1]?

[1] https://www.rabbitmq.com/docs/clustering#restarting-readiness-probes
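
One quick way to act on the networking question is to test, from each host, whether the other nodes accept TCP connections on the clustering ports. The sketch below is only an illustration and not part of RabbitMQ's tooling: it assumes the two ports published in the docker command above (4369 for epmd and 25672 for inter-node traffic), and the helper name `check_ports` is made up for this example. It only shows whether a connection can be opened; it says nothing about hostname resolution inside the container or the Erlang cookie.

```python
import socket
import sys

# Ports taken from the docker command in this thread:
# 4369 is epmd, 25672 is the default inter-node communication port.
CLUSTER_PORTS = (4369, 25672)


def check_ports(host, ports=CLUSTER_PORTS, timeout=3.0):
    """Return {port: bool}, True if a TCP connection to host:port could be opened."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures alike.
            results[port] = False
    return results


if __name__ == "__main__":
    # Peer hostnames are passed on the command line.
    for peer in sys.argv[1:]:
        for port, ok in check_ports(peer).items():
            state = "reachable" if ok else "NOT reachable"
            print(f"{peer}:{port} {state}")
```

Run it on each host with the other two node hostnames as arguments. A port that is unreachable in only one direction would point toward a local firewall rule rather than RabbitMQ itself.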

Lance Powell

Mar 10, 2024, 1:18:34 PM

to rabbitm...@googlegroups.com

I figured it was a Consul readiness probe, but moving off of it didn’t help. I really did think that was it. It must be some network issue or a local firewall or something there that I can start tracking down. Thanks!
