Hi all,
I'll start with something of a disclaimer: I've been using RabbitMQ for about 6 years now, in a variety of contexts and use cases across different companies, as a programmer, as an architect, and so on. Different operating systems (Windows, Linux, even Solaris), very large scale and very small scale. The point of this disclaimer is that I'll describe just the symptoms here, without too much technical detail, and follow up with questions. As you can guess from the disclaimer, we've already tried a lot; I don't think everything is relevant here, so I'll start "light".
In my current company we use RabbitMQ pretty heavily: about 10 clusters, geo-distributed and interconnected with federations. All clusters run on Linux, RabbitMQ 3.6.9+, with 3 nodes each.
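For context, the federation links between the clusters are set up the standard way, via runtime parameters and a policy - roughly like this (the upstream name, URI, and policy pattern below are illustrative, not our real values):

```shell
# Declare an upstream pointing at a remote cluster
rabbitmqctl set_parameter federation-upstream my-upstream \
  '{"uri":"amqp://user:pass@remote-host:5672","expires":3600000}'

# Federate all exchanges whose name matches the pattern
rabbitmqctl set_policy federate-exchanges "^federated\." \
  '{"federation-upstream-set":"all"}' --apply-to exchanges
```

Nothing exotic there; I mention it only so you know the links are plain AMQP federation links with heartbeats.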
Recently, on 2 of them, we started to see the following issue: the cluster goes into some kind of "zombie" mode, meaning:
- Connections are not closed, but clients start seeing issues (timeouts).
- If there are federations, they start to die (due to heartbeat timeouts, much like the clients, which makes sense). However, one of the affected clusters has no upstream or downstream federations at all.
- In general it looks like RabbitMQ went into suspended mode (described here: https://www.rabbitmq.com/partitions.html#suspend). We suffered from this a while ago when VMware had automatic DRS enabled and migrated VMs, which of course made the Erlang VM misbehave.
- The nodes in the cluster are set to pause-minority, but pausing doesn't actually take place (just like in suspended mode): a node that has issues doesn't "understand" that it is in the minority, so it is not paused, so it is not removed from the load balancer, and so on.
- The relevant ports stay in LISTEN state, which is not the case when a minority node is actually paused.
- No CPU or memory leaks/spikes or any other resource issues during this time, before, or after.
- Just to emphasize the point: the misbehaving clusters barely have any load - 20-30 messages per second. We have heavily loaded clusters with no such problems at all.
- Currently the only way out of this state is restarting the affected RabbitMQ node.
- In very rare cases we can't even do that (RabbitMQ doesn't respond to start, stop, or restart commands), so we restart the VM itself.
- In some cases, after the unhealthy node is restarted, another node goes into the same mode after 5-10 minutes (this may be related to the same underlying issue rather than to the restart itself).
No special logs found anywhere - not in the RabbitMQ logs, not in the SASL logs - except for the timeouts that start appearing after a node goes into this state.
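In case it helps frame suggestions: here is roughly what we plan to capture from a stuck node the next time this happens. These are stock rabbitmqctl commands; the `maybe_stuck` eval is the one I've seen recommended for diagnosing hung nodes (`<node-host>` is a placeholder for the affected node):

```shell
# Basic liveness and cluster view, as seen from the affected node
rabbitmqctl status
rabbitmqctl cluster_status

# Ask the Erlang VM for processes that appear stuck
# (rabbit_diagnostics:maybe_stuck/0 ships with RabbitMQ 3.6.x)
rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

# Verify the listener actually completes a TCP handshake,
# rather than just sitting in LISTEN state
nc -zv <node-host> 5672
```

If there is anything else worth running while a node is in this state, I'd be glad to hear it.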
Any kind of hint, help, or question pointing in the right direction would be appreciated.
Thanks a lot - any technical details can be provided on request.