We've run this cluster with around 20k queues for about 10 years and it has been very stable. Great work!
We run the Ubuntu-based Docker image on 3 hosts, all running Rocky Linux 9. Host networking vs. port forwarding didn't make a difference.
Starting with 3.12 (I think; this was fine on 3.11.8 at least), when we restart one of the nodes it will not rejoin the cluster. I initially thought our Consul peer discovery was causing it to think that the other nodes weren't started, so I moved to DNS discovery, but unfortunately that doesn't fix it. The cluster runs fine once it's up, and the initial creation of the 3.13 cluster went fine; we basically did a full cluster shutdown and upgrade to get onto it.
Logs show an error while waiting for Mnesia tables. I increased the timeout from 30 seconds to 60/120/300 etc. and it didn't help. The only way I'm able to get a node back into the cluster is to remove it from the cluster, delete its mnesia folder, and start it up again. Even then it sometimes doesn't work, and the debug log just stops right after starting the plugins. Very odd.
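For reference, the manual recovery described above looks roughly like the sketch below. It assumes the container name (rabbitmq) and data path (/var/lib/rabbitmq) from our docker command, and that "removing" a node means running rabbitmqctl forget_cluster_node from a healthy node. The node name rabbit@node2, the helper function names, and the DRY_RUN switch are placeholders for illustration only.

```shell
#!/usr/bin/env bash
# Sketch of the manual recovery steps; defines functions only, runs nothing.
# Assumptions: container is named "rabbitmq" and the data directory is
# mounted from /var/lib/rabbitmq on the host.

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1 (on a healthy node): remove the stuck node from the cluster.
# $1 is the Erlang node name, e.g. rabbit@node2 (placeholder).
forget_node() {
  run sudo docker exec rabbitmq rabbitmqctl forget_cluster_node "$1"
}

# Step 2 (on the stuck node): stop the container, wipe the Mnesia
# directory, and start the container so it joins as a blank node.
reset_local_node() {
  run sudo docker stop rabbitmq &&
    run sudo rm -rf /var/lib/rabbitmq/mnesia &&
    run sudo docker start rabbitmq
}
```

Running `DRY_RUN=1 forget_node rabbit@node2` or `DRY_RUN=1 reset_local_node` prints the commands without executing them, which is a safe way to check them before wiping anything.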
We were hopeful 3.13 would help; these are all classic queues on v2. All feature flags enabled on 3.13.
Looking forward to hearing any ideas. Here is our docker start command, followed by our config (rabbitmq.conf, enabled_plugins, and advanced.config, in that order).
sudo docker run -d \
--hostname "$(hostname)" \
--name rabbitmq \
--restart unless-stopped \
-p 5672:5672 \
-p 15672:15672 \
-p 15671:15671 \
-p 15691-15692:15691-15692 \
-p 4369:4369 \
-p 5671:5671 \
-p 25672:25672 \
--ulimit nofile=500000:500000 \
-v /etc/rabbitmq:/etc/rabbitmq:rw \
-v /var/lib/rabbitmq:/var/lib/rabbitmq:rw \
rabbitmq:${rabbitmq_version}-management
cluster_formation.peer_discovery_backend = dns
cluster_formation.dns.hostname = ${cluster_dns_name}
collect_statistics = coarse
collect_statistics_interval = 5000
delegate_count = 30
vm_memory_high_watermark.relative = 0.75
vm_memory_high_watermark_paging_ratio = 0.75
cluster_partition_handling = autoheal
channel_max = 2048
consumer_timeout = 10800000
log.default.level = critical
classic_queue.default_version = 2
[rabbitmq_federation,rabbitmq_federation_management,rabbitmq_management,rabbitmq_peer_discovery_common,rabbitmq_peer_discovery_consul,rabbitmq_shovel,rabbitmq_shovel_management,rabbitmq_prometheus].
[
  {mnesia, [
    {dump_log_write_threshold, 50000}
  ]},
  {rabbit, [
    {queue_index_max_journal_entries, 262144}
  ]}
].