On 2 February 2016 at 04:09:00, aros...@gmail.com wrote:
> Fair enough. So unless RabbitMQ tells me expressly that the
> TCP connection was dropped, I should not assume this... correct?
>
> However... it seems intuitive (and perhaps very naive, but this
> is my starting point) that a given node has *decided* a node is
> unreachable based on some criteria, either it timed out on net
> ticks or received an OS-level connection drop or failed to deliver
> a message. I'm really just looking for why a node *thinks* another
> is unreachable, not necessarily what actually happened.
But why would you need this information, given that it cannot really be 100% reliable?
> May I just ask this - presumably a net tick timeout is an obvious
> one? In that if the Rabbit logs don't expressly attribute the
> "node down" to a net tick timeout, I may rule that out?
No. In practice, a timeout (configured as kernel.net_ticktime in Erlang) is a very common way of
detecting unavailability. Yes, in some cases the OS can tell RabbitMQ (or rather, the runtime) that
a peer socket was closed, or the peer node might have decided to disconnect explicitly. But I'm not sure how
frequent that is.
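To illustrate why a timeout alone says little about the underlying cause, here is a minimal Python sketch
of timeout-based detection. It is not RabbitMQ or Erlang runtime code: the TickMonitor class, its methods,
and the node name rabbit@b are hypothetical, and the only value borrowed from Erlang is the 60 second
default of net_ticktime. The real runtime ticks its peers periodically rather than comparing timestamps
on demand, so treat this purely as a model.

```python
from dataclasses import dataclass, field


@dataclass
class TickMonitor:
    """Toy timeout-based failure detector (hypothetical, not RabbitMQ code)."""

    ticktime: float = 60.0  # seconds; assumed to mirror the net_ticktime default
    last_seen: dict = field(default_factory=dict)

    def record_traffic(self, peer: str, now: float) -> None:
        """Remember when we last received anything from the peer."""
        self.last_seen[peer] = now

    def is_down(self, peer: str, now: float) -> bool:
        """Declare the peer down once it has been silent longer than ticktime."""
        return now - self.last_seen[peer] > self.ticktime


monitor = TickMonitor()
monitor.record_traffic("rabbit@b", now=0.0)

# Whether rabbit@b crashed, froze, or was cut off by the network at t=0,
# the monitor observes exactly the same thing: silence.
print(monitor.is_down("rabbit@b", now=30.0))  # False
print(monitor.is_down("rabbit@b", now=61.0))  # True
```

The detector only ever sees the absence of traffic, so the most it could log is "timed out", which
does not reveal what actually happened on the other side.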
I don't remember which log messages specifically might have included a reason. Adding it back
is not a lot of work, I just don't see how it would be useful. Even when investigating an operational
issue, there can be no 100% certainty about the real cause, so the only realistic scenario I see
is debugging RabbitMQ. In 3.7.0 we will have much more verbose debug logging, including runtime-level
events.