Hi All,
I have seen some interesting behaviour with RabbitMQ version 3.11.16 over the weekend and was hoping someone may be able to give some advice or thoughts.
We run a 3 node HA cluster on Ubuntu 20.04 with classic queues, on Saturday (02/06/23) 2 of the 3 nodes the RabbitMQ service stopped and couldn't be restarted due to the "classic_queue_type_delivery_support" feature flag being disabled (We never enabled this on any of the 3 nodes).
We managed to get the cluster working again on Sunday by rolling the VMs back to a backup from 01/06 and then enabled the feature flag to prevent this happening again.
On Monday at around 05:00 the 2 nodes yet again failed and reverted to a position where the feature flag was disabled, and we couldn't boot RMQ due to the missing feature flag. We added an environment variable to enable the feature flag prior to booting as we couldn't enable the feature flag without RMQ actually running (Fun little catch 22)
I understand that feature flags can't be disabled after being enabled so we aren't sure how we got into this position on Monday.
We don't seem to have any reason why the 2 nodes failed, and we are confused why one node didn't fall over despite also not having the feature flag enabled.
I'm working on getting the RMQ logs but from what we saw whilst troubleshooting on Monday there was nothing in the logs beyond the failure to boot due to the missing feature flag, so we still don't understand why the nodes terminated the RMQ service, nor why one node was perfectly fine.
Thanks,
Thomas