Rabbit 3.4.0 - Partial Partition Bug? (Bug 26474)


Ahmed Alani

Dec 18, 2014, 9:55:38 AM
to rabbitm...@googlegroups.com
All,

We are running a cluster of RabbitMQ nodes, version 3.4.0 with Erlang 17.3 64-bit, on Windows Server 2012 VMs. We have tracert'd between the nodes and they are all one hop (~1 ms) away from each other. We are sporadically seeing partial partitions occur between nodes. Because we are set to pause_minority, these nodes will take themselves out of the cluster in some cases.

I noticed bug #26474 in the latest release regarding false positives with partial partitions. Does anyone know under what conditions this occurs? Does anyone have any recommendations as to how we can diagnose these errors?

Thanks,
Ahmed

Michael Klishin

Dec 18, 2014, 9:59:13 AM
to Ahmed Alani, rabbitm...@googlegroups.com
 On 18 December 2014 at 17:55:44, Ahmed Alani (ahmed....@gmail.com) wrote:
> I noticed the bug #26474 in the latest release regarding false
> positives with partial partitions. Does anyone know under what
> conditions does this occurs? Does anyone have any recommendations
> as to how we can diagnose these errors?

See https://groups.google.com/forum/#!topic/rabbitmq-users/06OQkYtLJd8 where it was
originally reported.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Ahmed Alani

Dec 18, 2014, 1:12:40 PM
to rabbitm...@googlegroups.com, ahmed....@gmail.com
Hey Michael,

Thanks for the reply. I think we suffered an outage related to this bug yesterday, in a small 30-second window, but the root cause is baffling. Our scenario is below. I have attached logs. Do you have any idea what the crash report means?

We have a 4-node cluster:
  • Node 2 had a crash of some sort occur yesterday at 10:27. See the logs/SASL logs attached. I can't make heads or tails of it. After the crash reports, it seems to have continued operating.
  • [Node 1, Node 3] saw connection_close() at the same time from Node 2, but decided they were in a minority because Node 4 could still see Node 2. Both shut down for good.
  • Node 4 gave no indication there was an issue. It saw Nodes 1 and 3 go down, promoted some mirrors, and continued accepting connections.
NODE001Logs.txt
NODE002Logs.txt
NODE002-SASL-Logs.txt
NODE003Logs.txt
NODE004Logs.txt
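
For readers following the scenario above: with pause_minority, a node pauses itself when the nodes it can still reach (itself included) make up half or fewer of the cluster. Below is a minimal Python sketch of that arithmetic. The function name is hypothetical and this is not RabbitMQ's code; it also assumes Nodes 1 and 3 ended up able to reach only each other, which the logs would need to confirm.

```python
def should_pause(reachable_nodes, cluster_nodes):
    """Return True if a node that can reach `reachable_nodes` (itself
    included) counts as a minority under the rule sketched here."""
    members = set(cluster_nodes)
    reachable = set(reachable_nodes) & members
    # A majority is strictly more than half of all cluster members,
    # so exactly half (2 of 4) still counts as a minority.
    return 2 * len(reachable) <= len(members)


cluster = ["node1", "node2", "node3", "node4"]

# Assumed view of Node 1: only itself and Node 3 are reachable.
print(should_pause(["node1", "node3"], cluster))

# If Node 4 were still reachable as well, Node 1 would keep running.
print(should_pause(["node1", "node3", "node4"], cluster))
```

In an even-sized cluster a clean 2/2 split pauses both halves, which is one reason odd cluster sizes are usually preferred with this mode.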

Michael Klishin

Dec 18, 2014, 3:24:24 PM
to Ahmed Alani, rabbitm...@googlegroups.com
On 18 December 2014 at 21:12:43, Ahmed Alani (ahmed....@gmail.com) wrote:
> Thanks for the reply. I think we suffered from an outage related
> to this bug yesterday in a small 30 second window, but the root
> cause is baffling. Our scenario is below. I have attached logs.
> Do you have any idea what the crash report means?

This is a known issue which is partially resolved in 3.4.x releases. Bug 26474 may be related.

Ahmed Alani

Dec 19, 2014, 10:30:15 AM
to rabbitm...@googlegroups.com, ahmed....@gmail.com
Thanks for your help Michael. We'll plan on upgrading.

Paul Ruan

Mar 5, 2015, 5:54:12 PM
to rabbitm...@googlegroups.com, ahmed....@gmail.com
Hi Michael, 

I was wondering if I can get a clarification on what you meant by "partially" resolved. We're running a cluster on 3.4.4 and came across a partition recently after restarting a node (should've been sent a SIGTERM). I'm wondering if it is related to this bug. 

From the logs (below), it looks to me like a falsely detected partition:
nodeX restarted
nodeA and nodeB log nodeX as being down
nodeA and nodeB log nodeX as being up
nodeA and nodeB find that the other can talk to nodeX so they disconnect from each other.

I haven't found any mention of partitions in the logs for nodeX at that time, and there were three other nodes in the cluster.

Is it possible that there's a bug with partition detection on fast restarts?

Thanks,
Paul

On nodeX:
=INFO REPORT==== 28-Feb-2015::21:41:19 ===   Setting permissions...
=INFO REPORT==== 28-Feb-2015::21:41:26 ===   Starting RabbitMQ 3.4.4 on Erlang R16B   Copyright (C) 2007-2014 GoPivotal, Inc.   Licensed under the MPL.  See http://www.rabbitmq.com/
=INFO REPORT==== 28-Feb-2015::21:41:26 === ...
=INFO REPORT==== 28-Feb-2015::21:41:26 ===   Limiting to approx 99900 file handles (89908 sockets)
=INFO REPORT==== 28-Feb-2015::21:41:29 ===   Memory limit set to 72471MB of 96628MB total.
=INFO REPORT==== 28-Feb-2015::21:41:29 ===   Disk free limit set to 50MB

On nodeA:
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   node 'rabbit@nodeX' down: connection_closed
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   node 'rabbit@nodeX' up
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   Mirrored queue 'queueA' in vhost '/': Master <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   Mirrored queue 'queueB' in vhost '/': Slave <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   Mirrored queue 'queueC' in vhost '/': Master <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   Mirrored queue 'queueD' in vhost '/': Slave <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   Mirrored queue 'queueD' in vhost '/': Promoting slave <rabbit@nodeA> to master
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   Mirrored queue 'queueE' in vhost '/': Slave <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>
=ERROR REPORT==== 28-Feb-2015::21:41:28 ===   Partial partition detected:
   * We saw DOWN from rabbit@nodeX
   * We can still see rabbit@nodeB which can see rabbit@nodeX
  We will therefore intentionally disconnect from rabbit@nodeB

On nodeB:
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   node 'rabbit@nodeX' down: connection_closed
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   node 'rabbit@nodeX' up
=ERROR REPORT==== 28-Feb-2015::21:41:28 ===   Partial partition detected:
   * We saw DOWN from rabbit@nodeX
   * We can still see rabbit@nodeA which can see rabbit@nodeX
  We will therefore intentionally disconnect from rabbit@nodeA
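
One way to look for this pattern across a larger log is to scan for a node that is reported both down and up under the same report timestamp. The Python sketch below does that. The function name and regular expressions are assumptions based only on the log lines quoted above, and a down/up pair within the same second is a hint worth investigating, not proof of a false partition.

```python
import re
from collections import defaultdict

# Report header, e.g. "=INFO REPORT==== 28-Feb-2015::21:41:28 ==="
HEADER = re.compile(r"=(?:INFO|WARNING|ERROR) REPORT==== (\S+) ===")
# Node monitor message, e.g. "node 'rabbit@nodeX' down: connection_closed"
EVENT = re.compile(r"node '?([^\s']+)'? (down|up)\b")


def find_flaps(lines):
    """Return (timestamp, node) pairs where a node was logged both down
    and up under the same report timestamp (one-second resolution)."""
    seen = defaultdict(set)
    stamp = None
    for line in lines:
        header = HEADER.search(line)
        if header:
            # Remember the timestamp; the message may follow on a later line.
            stamp = header.group(1)
        event = EVENT.search(line)
        if event and stamp is not None:
            seen[(stamp, event.group(1))].add(event.group(2))
    # Sorted only to make the output stable, not chronological.
    return sorted(key for key, kinds in seen.items() if kinds == {"down", "up"})


sample = [
    "=INFO REPORT==== 28-Feb-2015::21:41:28 ===   node 'rabbit@nodeX' down: connection_closed",
    "=INFO REPORT==== 28-Feb-2015::21:41:28 ===   node 'rabbit@nodeX' up",
    "=INFO REPORT==== 28-Feb-2015::21:41:29 ===   Disk free limit set to 50MB",
]
print(find_flaps(sample))
```

To run it against a real file, pass the open file object (for example `find_flaps(open(path))`, where `path` is your own log location).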

Michael Klishin

Mar 5, 2015, 6:20:06 PM
to Paul Ruan, rabbitm...@googlegroups.com, ahmed....@gmail.com
On 6 March 2015 at 01:54:15, Paul Ruan (paul...@dropbox.com) wrote:
> I was wondering if I can get a clarification on what you meant
> by "partially" resolved

Resolved for some cases but not all.


=INFO REPORT==== 28-Feb-2015::21:41:28 ===   node 'rabbit@nodeX' down: connection_closed
=INFO REPORT==== 28-Feb-2015::21:41:28 ===   node 'rabbit@nodeX' up

in your log suggest that the partition between B and X was very short, which leads to interesting
edge cases and race conditions.

We have recently identified more edge cases in partition handling;
some will be fixed in 3.5.0, some in 3.5.x.
3.5.0 should be out next week.
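
To illustrate how a very short disconnect could trip the check, here is a minimal Python sketch of the rule as worded in the "Partial partition detected" log message. The function and data structures are hypothetical, not RabbitMQ's implementation, and the timing (nodeX already reconnected to both peers when each one runs the check) is an assumption.

```python
def peers_to_disconnect(down_node, my_peers, peer_views):
    """Apply the rule from the log message: after seeing DOWN from
    `down_node`, disconnect from every reachable peer that reports it
    can still see `down_node`."""
    return sorted(
        peer
        for peer in my_peers
        if peer != down_node and down_node in peer_views.get(peer, set())
    )


# Assumed views at the moment each node runs the check. nodeX has already
# come back ("down" then "up" in the same second), so both peers answer
# "yes, I can see nodeX".
views = {
    "nodeA": {"nodeB", "nodeX"},
    "nodeB": {"nodeA", "nodeX"},
}

print(peers_to_disconnect("nodeX", ["nodeB"], views))  # nodeA's decision
print(peers_to_disconnect("nodeX", ["nodeA"], views))  # nodeB's decision
```

Under these assumed views each node picks the other, which matches the symmetric disconnects in the nodeA and nodeB logs.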