RabbitMQ nodes in a cluster go into a kind of "zombie" mode


Alik Krasner

Feb 9, 2018, 10:34:10 AM
to rabbitmq-users
Hi all,
I'll start with a disclaimer of sorts: I've been using RabbitMQ for about 6 years now, in a variety of contexts and use cases across different companies, both as a programmer and as an architect. Different operating systems (Windows, Linux, even Solaris), very large scale and very small scale. The point of the disclaimer is that I will describe only the symptoms here, without too many technical details, and then follow up on your questions. As you can guess, we have tried a lot, and I don't think all of it is relevant here, so I will start "light".

In my current company we use RabbitMQ pretty heavily: about 10 clusters, geo-distributed and interconnected with federation. All of the clusters are Linux-based, running RabbitMQ 3.6.9+ with 3 nodes each.
Recently two of them started to show the following issue: the cluster appears to go into some kind of "zombie" mode, which means:
  1. Connections are not closed, but clients start to have issues (timeouts).
  2. If there are federation links, they start to die (due to heartbeat timeouts, much the same as the clients, which makes sense). Note, though, that one of the affected clusters has no upstream or downstream federation at all.
  3. In general it looks as if RabbitMQ went into suspended mode (described here: https://www.rabbitmq.com/partitions.html#suspend). We suffered from this a while ago when VMware automatic DRS migrated VMs, which of course caused the Erlang VM to misbehave.
    1. The nodes in the cluster are set to pause-minority, which does not actually take effect (just like in suspended mode): the node that has issues doesn't really "understand" that it is in the minority, so it is not paused, and therefore not removed from the load balancer, and so on.
    2. The relevant ports stay in the listening state, which is not the case when a minority is paused.
  4. No CPU or memory leak/spike, or any other such issue, during this time, before, or after.
  5. To emphasize: the misbehaving clusters barely have any load, 20-30 messages a second. We have heavily loaded clusters with no such problems at all.
  6. Currently the only way out of this state is to restart the RabbitMQ node.
    1. In very rare cases we can't do that (RabbitMQ doesn't respond to start, stop or restart commands), so we restart the VM itself.
    2. In some cases, after the unhealthy node is restarted, another one goes into the same mode 5-10 minutes later (this may be related to the same underlying issue rather than to the restart itself).
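
Since the ports keep listening while the node stops responding, a plain "is the port open" check will not catch this state. Below is a minimal sketch of a probe that tries to tell the two apart, assuming the node speaks AMQP 0-9-1 on the default port 5672. The function name `probe_amqp` and the three result labels are made up for illustration; this is not a RabbitMQ tool, just one possible check a load balancer or monitor could run.

```python
import socket

# Protocol header an AMQP 0-9-1 client sends right after connecting.
AMQP_091_HEADER = b"AMQP\x00\x00\x09\x01"


def probe_amqp(host, port=5672, timeout=5.0):
    """Classify a node as 'ok', 'timeout', 'down' or 'unexpected'.

    Hypothetical helper: the labels are illustrative, not a RabbitMQ API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(AMQP_091_HEADER)
            # A responsive broker answers with a method frame
            # (connection.start), whose first byte is frame type 1.
            first = sock.recv(1)
    except socket.timeout:
        # Port accepted the connection (or hung) but nothing came back:
        # this matches the "zombie" symptom described above.
        return "timeout"
    except OSError:
        # Connection refused, host unreachable, and similar errors.
        return "down"
    return "ok" if first == b"\x01" else "unexpected"
```

A node that returns `"timeout"` here accepts TCP connections but never starts the AMQP handshake, which is the case a listening-port check would report as healthy.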

No special logs were found anywhere, neither in the RabbitMQ logs nor in the SASL logs, except for timeouts that start happening after a node goes into this state.
Any hint, help, or question pointing in the right direction would be appreciated.

Thanks a lot; any technical details can be provided on request.

Luke Bakken

Feb 9, 2018, 11:22:20 AM
to rabbitmq-users
Hi Alik -

What Erlang version are you using?

Thanks,
Luke

Alik Krasner

Feb 9, 2018, 11:53:54 AM
to rabbitmq-users
Two different clusters:
  1. Rabbit broker: 3.6.15, Erlang: 18.3 (this cluster is also lightly loaded but has some federation links, if that's related at all).
  2. Rabbit broker: 3.6.10, Erlang: 18.3 (this cluster has a very small load, 20-30 messages per second, without any federation).
One of the very heavily loaded clusters (which has no such issues at all) also runs Erlang 18.3.

Michael Klishin

Feb 9, 2018, 11:57:26 AM
to rabbitm...@googlegroups.com
You are hitting known bugs in Erlang 18.x that cause nodes to stop accepting connections and fail to shut down.

This is mentioned in http://www.rabbitmq.com/which-erlang.html. Please upgrade to 19.3.6.4 or later ASAP.
Those issues have been discussed on this list multiple times before, e.g. in https://groups.google.com/d/msg/rabbitmq-users/kXkI-f3pgEw/p7WESA4cAAAJ.

RabbitMQ 3.7 requires 19.3.6 in part for that reason.

Erlang version upgrades are covered in the Upgrades guide:
http://www.rabbitmq.com/upgrade.html.
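
For anyone auditing several clusters, here is a rough sketch of how the versions reported by each node could be compared against the 19.3.6.4 minimum mentioned above. The helper names are hypothetical, and the sketch assumes purely numeric dotted version strings such as "18.3" or "19.3.6.4"; anything with a suffix would need extra handling.

```python
def parse_otp_version(version):
    """Turn a dotted version string like '19.3.6.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Minimum suggested in this thread; adjust if the guidance changes.
MIN_RECOMMENDED = parse_otp_version("19.3.6.4")


def meets_minimum(version, minimum=MIN_RECOMMENDED):
    """True if `version` is at least `minimum` (tuple comparison)."""
    return parse_otp_version(version) >= minimum


# The versions mentioned in this thread:
for v in ("18.3", "19.3.6.4"):
    print(v, "ok" if meets_minimum(v) else "needs upgrade")
```

Tuples compare element by element, so a shorter version such as "19.3.6" sorts below "19.3.6.4", which is the behavior wanted here.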

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Alik Krasner

Feb 12, 2018, 4:17:10 AM
to rabbitmq-users
Hi Michael,
I didn't reply earlier, but we acted on it quickly and have been testing the rolling upgrade on a test cluster. We will also move the PROD clusters to the suggested Erlang version and report back here.