RabbitMQ 3.6.0 - "node down" detection


aros...@gmail.com

Feb 1, 2016, 7:06:33 PM
to rabbitmq-users
Hi there,

I've been trying to determine why nodes in a partition think other nodes are down; apparently later versions introduced logging that gives a purported reason why node A thinks node B is down (net ticktime, TCP connection, etc.). I'm not seeing anything like that on 3.6 - does one need to increase the log level to get those messages?

Michael Klishin

Feb 1, 2016, 7:50:31 PM
to rabbitm...@googlegroups.com, aros...@gmail.com
The truth is, there is no reliable way of knowing why a node is unreachable. Don’t rely on it.

Timeout-based detection mechanisms (the setting may be called net_ticktime or something else)
are used all over the place in distributed systems these days.
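
For reference, a sketch of how that timeout is typically raised on the RabbitMQ side, assuming the classic Erlang-terms rabbitmq.config format used by 3.6 (120 is just an illustrative value; the same value should be set on every node):

    %% rabbitmq.config -- the kernel section configures the Erlang
    %% distribution layer, which is where net_ticktime lives.
    [
      {kernel, [
        {net_ticktime, 120}
      ]}
    ].
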
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


aros...@gmail.com

Feb 1, 2016, 8:08:57 PM
to rabbitmq-users, aros...@gmail.com
Fair enough. So unless RabbitMQ tells me expressly that the TCP connection was dropped, I should not assume this... correct?

However... it seems intuitive (and perhaps very naive, but this is my starting point) that a given node has *decided* another node is unreachable based on some criterion: either it timed out on net ticks, received an OS-level connection drop, or failed to deliver a message. I'm really just looking for why a node *thinks* another is unreachable, not necessarily what actually happened.

May I just ask this - presumably a net tick timeout is an obvious one? That is, if the Rabbit logs don't expressly attribute the "node down" to a net tick timeout, may I rule it out?

Michael Klishin

Feb 2, 2016, 6:59:46 AM
to rabbitm...@googlegroups.com, aros...@gmail.com
On 2 February 2016 at 04:09:00, aros...@gmail.com wrote:
> Fair enough. So unless RabbitMQ tells me expressly that the
> TCP connection was dropped, I should not assume this... correct?
>
> However... it seems intuitive (and perhaps very naive, but this
> is my starting point) that a given node has *decided* a node is
> unreachable based on some criteria, either it timed out on net
> ticks or received an OS-level connection drop or failed to deliver
> a message. I'm really just looking for why a node *thinks* another
> is unreachable, not necessarily what actually happened.

But why would you need this information, given that it cannot really be 100% reliable?

> May I just ask this - presumably a net tick timeout is an obvious
> one? In that if the Rabbit logs don't expressly attribute the
> "node down" to a net tick timeout, I may rule that out?

No. In practice, a timeout (configured as kernel.net_ticktime in Erlang) is a very common means of
unavailability detection. Yes, in some cases the OS can tell RabbitMQ (or rather, the runtime) that
a peer socket was closed, or the peer node may have decided to disconnect explicitly, but I'm not sure how
frequent that is.
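
If it helps, you can check which value a running node is actually using; net_kernel:get_net_ticktime/0 is a standard OTP call, so something along these lines (e.g. via rabbitmqctl eval) should work:

    %% Run via `rabbitmqctl eval '...'` or in a remote shell on the node;
    %% the result is the tick time in seconds (the kernel default is 60).
    net_kernel:get_net_ticktime().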

I don't remember which log messages specifically might have had a reason specified. While adding
it back is not a lot of work, I just don't see how it would be useful. Even when investigating an operational
issue, there can be no 100% certainty about the real cause, so the only realistic scenario I see
is debugging RabbitMQ itself. In 3.7.0 we will have much more verbose debug logging, including runtime-level
events.

aros...@gmail.com

Feb 2, 2016, 10:30:19 AM
to rabbitmq-users, aros...@gmail.com
Ah - yes, it was said debugging I was asking about. For some reason I believed, from an earlier posting to this group, that such debug logging was already available (http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2014-February/033690.html):

> Future releases of RabbitMQ (3.3.0, currently in the nightly builds)
> will log the reason why one node decided another was down. Unfortunately
> that's not available in 3.2.x, so I am afraid RabbitMQ is not going to
> help you determine what caused the partition.

Is it the case that said logging was removed again from later versions on the grounds you have specified here?

However, I'm curious about what you mean when you characterize the information as not 100% reliable. Presumably the control flow either hit the log message or it didn't, and having it log "net tick timeout" gives you useful information about the events surrounding the partition: either a tick didn't arrive within that window or the clock jumped; in either case I have further grounds for investigation.

Michael Klishin

Feb 2, 2016, 12:48:35 PM
to rabbitm...@googlegroups.com, aros...@gmail.com
On 2 February 2016 at 18:30:23, aros...@gmail.com wrote:
> Is it the case that said logging was removed again from later
> versions on the grounds you have specified here?

RabbitMQ is thousands of lines of code; I don't know what specific cases are covered.

Nodes being down is something we've been logging for a long time.

> However, I'm curious about what you mean when you characterize
> the information as not 100% reliable. Presumably the control
> flow either hit the log message or it didn't. And having it log
> "net tick timeout" gives you useful information about events
> surrounding the partition; either a tick didn't arrive in that
> time or the clock jumped; in either case I have further grounds
> for investigation.

We cannot know why a peer failed to send a (runtime-level, not AMQP 0-9-1) heartbeat. It could
be a network communication issue, or the process was swapped out by the OS for some time,
or a tool such as vMotion interfered for a moment, or the VM was stopped, …

We cannot log *that* kind of information, which means we can only log TCP socket issues
at best, provided that the runtime even lets us distinguish those cases: in Erlang, inter-node communication
is really transparent, which has awesome and not-so-awesome aspects to it.
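
For what it's worth, the runtime does hand a node a coarse reason (values such as net_tick_timeout, connection_closed or disconnect) if you ask net_kernel:monitor_nodes/2 for it, but as above it cannot say *why* the ticks stopped arriving. A minimal sketch, runnable in an Erlang shell attached to a node:

    %% Ask the runtime to include a coarse reason with nodedown events.
    ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
    receive
        {nodedown, Node, Info} ->
            %% Reason is e.g. net_tick_timeout, connection_closed or
            %% disconnect; it still says nothing about what happened on the wire.
            Reason = proplists:get_value(nodedown_reason, Info),
            io:format("~p reported down: ~p~n", [Node, Reason])
    end.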

aros...@gmail.com

Feb 2, 2016, 5:18:15 PM
to rabbitmq-users, aros...@gmail.com
OK - I didn't realize that was the sum total of information Erlang gives you. By runtime-level heartbeat I assume you mean the net tick (please disabuse me of this assumption if necessary!). That does seem a not-so-awesome aspect.

Oh well, for now we've just de-clustered and run Erlang/RabbitMQ on a single server. We're far from max connections and can scale the VM vertically for some time if necessary. I'm not sure we'll ever diagnose this, as I don't have access to the ESXi hosts.

Thanks for all your help on this.

Michael Klishin

Feb 2, 2016, 5:22:14 PM
to rabbitm...@googlegroups.com, aros...@gmail.com
On 3 February 2016 at 01:18:17, aros...@gmail.com wrote:
> I didn't realize the that was the sum total of information Erlang
> was giving you. By runtime-level heartbeat I assume you mean
> the net tick (please disabuse me of this assumption if necessary!).
> That does seem a not-so-awesome aspect.

Your understanding is correct. The Erlang process that contacts other nodes
to notify them of “this” node's availability is called “heart”, so the messages
are often called “heartbeats”, even though that may or may not be the actual name
used by the OTP team.

Also, are you talking about tools such as vMotion resulting in nodes being
observed as down? We’d definitely be interested in getting to the bottom of this,
at least for vMotion (but unfortunately we are understaffed to start a full-blown investigation).

aros...@gmail.com

Feb 2, 2016, 8:00:12 PM
to rabbitmq-users, aros...@gmail.com
While I don't have access to the ESXi hosts to check, my Wintel team tells me that our VMs are configured to be tied to specific hosts and as such are never vMotion-migrated automatically, and since October only one VM was migrated (after hours) manually. In general the host cluster is fairly underutilized, at least for the time being, and during any of our partitioning events, no resources (CPU/RAM/NIC) appear to be over-taxed.

I, too, would very much like to get to the bottom of it; unfortunately, since we only see this in our production environment and have been forced to opt for stability there by moving to a single-node broker, I'm not sure I'll even be able to reproduce it now. If I can be of any assistance by way of providing ancillary information about our setup, please let me know. I had posted here originally hoping that someone with similar experiences would notice and share anything they'd come across; if I can catalogue here anything useful about the particulars of Rabbit on vSphere/ESXi, then I will.