Rabbit 3.4.0 - Partial Partition Bug? (Bug 26474)

Ahmed Alani

Dec 18, 2014, 9:55:38 AM

to rabbitm...@googlegroups.com

All,

We are running a cluster of RabbitMQ nodes, version 3.4.0 with Erlang 17.3 64-bit on Windows Server 2012 VMs. We have tracert'd between the nodes and they are all one hop away (~1 ms) from each other. We are sporadically seeing partial partitions occur between nodes. Because we are set to pause_minority, these nodes will take themselves out of the cluster in some cases.

I noticed bug #26474 in the latest release regarding false positives with partial partitions. Does anyone know under what conditions this occurs? Does anyone have any recommendations as to how we can diagnose these errors?

Thanks,

Ahmed

Michael Klishin

Dec 18, 2014, 9:59:13 AM

to Ahmed Alani, rabbitm...@googlegroups.com

On 18 December 2014 at 17:55:44, Ahmed Alani (ahmed....@gmail.com) wrote:
> I noticed the bug #26474 in the latest release regarding false
> positives with partial partitions. Does anyone know under what
> conditions this occurs? Does anyone have any recommendations
> as to how we can diagnose these errors?

See https://groups.google.com/forum/#!topic/rabbitmq-users/06OQkYtLJd8 where it was
originally reported.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Ahmed Alani

Dec 18, 2014, 1:12:40 PM

to rabbitm...@googlegroups.com, ahmed....@gmail.com

Hey Michael,

Thanks for the reply. I think we suffered from an outage related to this bug yesterday in a small 30 second window, but the root cause is baffling. Our scenario is below. I have attached logs. Do you have any idea what the crash report means?

We have a 4-node cluster:

Node 2 had a crash of some sort occur yesterday at 10:27. See the logs/SASL logs attached. I can't make heads or tails of it. After the crash reports, it seems to have continued operating.
[Node 1, Node 3] saw connection_close() at the same time from Node 2, but decided they were in a minority because Node 4 could still see Node 2. Both shut down for good.
Node 4 gave no indication that there was an issue. It saw Nodes 1 and 3 go down, promoted some mirrors, and continued accepting connections.
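
As a rough illustration of why Nodes 1 and 3 would pause under pause_minority, here is a minimal Python sketch of the majority check. It assumes the documented behaviour that a node pauses when it and the peers it can still see make up half or fewer of the cluster, and it assumes Nodes 1 and 3 ended up seeing only each other. The function name `in_majority` is hypothetical; this is not RabbitMQ's actual code.

```python
def in_majority(visible_peers: int, cluster_size: int) -> bool:
    """True if this node plus the peers it still sees form a strict majority.

    Hypothetical helper for illustration only; it models the rule
    "half or fewer counts as a minority".
    """
    return (visible_peers + 1) * 2 > cluster_size


# Four-node cluster: a node that sees only one other peer is 2 of 4.
print(in_majority(visible_peers=1, cluster_size=4))  # False
# A node that still sees two peers is 3 of 4.
print(in_majority(visible_peers=2, cluster_size=4))  # True
```

With an even cluster size, an exact half does not count as a majority, which is consistent with both nodes pausing in the scenario above.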

NODE001Logs.txt

NODE002Logs.txt

NODE002-SASL-Logs.txt

NODE003Logs.txt

NODE004Logs.txt

Michael Klishin

Dec 18, 2014, 3:24:24 PM

to Ahmed Alani, rabbitm...@googlegroups.com

On 18 December 2014 at 21:12:43, Ahmed Alani (ahmed....@gmail.com) wrote:
> Thanks for the reply. I think we suffered from an outage related
> to this bug yesterday in a small 30 second window, but the root
> cause is baffling. Our scenario is below. I have attached logs.
> Do you have any idea what the crash report means?

This is a known issue which is partially resolved in 3.4.x releases. Bug 26474 may be related.

Ahmed Alani

Dec 19, 2014, 10:30:15 AM

to rabbitm...@googlegroups.com, ahmed....@gmail.com

Thanks for your help Michael. We'll plan on upgrading.

Paul Ruan

Mar 5, 2015, 5:54:12 PM

to rabbitm...@googlegroups.com, ahmed....@gmail.com

Hi Michael,

I was wondering if I could get a clarification on what you meant by "partially" resolved. We're running a cluster on 3.4.4 and recently came across a partition after restarting a node (it should have been sent a SIGTERM). I'm wondering if it is related to this bug.

From the logs (below), it looks to me like a falsely detected partition:

nodeX restarted

nodeA and nodeB log nodeX as being down

nodeA and nodeB log nodeX as being up

nodeA and nodeB find that the other can talk to nodeX so they disconnect from each other.
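
The sequence above can be modelled with a short Python sketch. It is only an approximation of the rule implied by the "Partial partition detected" log message: after seeing DOWN from a node, disconnect from any peer that is still visible and that reports it can see that node. The function name `peers_to_disconnect` and its arguments are hypothetical, and this is not how RabbitMQ implements the check internally.

```python
def peers_to_disconnect(down_node, visible_peers, peer_views):
    """Peers we can still see that say they can see the node we saw go DOWN.

    down_node:     name of the node we received DOWN for
    visible_peers: set of peers this node is still connected to
    peer_views:    mapping of peer name -> set of nodes that peer reports seeing
    """
    return sorted(
        peer for peer in visible_peers
        if down_node in peer_views.get(peer, set())
    )


# nodeA's point of view right after nodeX bounced:
print(peers_to_disconnect(
    "rabbit@nodeX",
    {"rabbit@nodeB"},
    {"rabbit@nodeB": {"rabbit@nodeA", "rabbit@nodeX"}},
))  # ['rabbit@nodeB']
```

If nodeX comes back before the views are compared, each of nodeA and nodeB can conclude that the other still sees nodeX, which would match the symmetric disconnects in the logs.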

I haven't found any mention of partitions in the logs for nodeX at that time, and there were three other nodes.

Is it possible that there's a bug with partition detection on fast restarts?

Thanks,

Paul

On nodeX:

=INFO REPORT==== 28-Feb-2015::21:41:19 === Setting permissions...

=INFO REPORT==== 28-Feb-2015::21:41:26 === ...

=INFO REPORT==== 28-Feb-2015::21:41:26 === Limiting to approx 99900 file handles (89908 sockets)

=INFO REPORT==== 28-Feb-2015::21:41:29 === Memory limit set to 72471MB of 96628MB total.

=INFO REPORT==== 28-Feb-2015::21:41:29 === Disk free limit set to 50MB

On nodeA:

=INFO REPORT==== 28-Feb-2015::21:41:28 === node 'rabbit@nodeX' down: connection_closed

=INFO REPORT==== 28-Feb-2015::21:41:28 === node 'rabbit@nodeX' up

=INFO REPORT==== 28-Feb-2015::21:41:28 === Mirrored queue 'queueA' in vhost '/': Master <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>

=INFO REPORT==== 28-Feb-2015::21:41:28 === Mirrored queue 'queueB' in vhost '/': Slave <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>

=INFO REPORT==== 28-Feb-2015::21:41:28 === Mirrored queue 'queueC' in vhost '/': Master <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>

=INFO REPORT==== 28-Feb-2015::21:41:28 === Mirrored queue 'queueD' in vhost '/': Slave <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>

=INFO REPORT==== 28-Feb-2015::21:41:28 === Mirrored queue 'queueD' in vhost '/': Promoting slave <rabbit@nodeA> to master

=INFO REPORT==== 28-Feb-2015::21:41:28 === Mirrored queue 'queueE' in vhost '/': Slave <rabbit@nodeA> saw deaths of mirrors <rabbit@nodeX>

=ERROR REPORT==== 28-Feb-2015::21:41:28 === Partial partition detected:
 * We saw DOWN from rabbit@nodeX
 * We can still see rabbit@nodeB which can see rabbit@nodeX
We will therefore intentionally disconnect from rabbit@nodeB

On nodeB:

=INFO REPORT==== 28-Feb-2015::21:41:28 === node 'rabbit@nodeX' down: connection_closed

=INFO REPORT==== 28-Feb-2015::21:41:28 === node 'rabbit@nodeX' up

=ERROR REPORT==== 28-Feb-2015::21:41:28 === Partial partition detected:
 * We saw DOWN from rabbit@nodeX
 * We can still see rabbit@nodeA which can see rabbit@nodeX
We will therefore intentionally disconnect from rabbit@nodeA

Michael Klishin

Mar 5, 2015, 6:20:06 PM

to Paul Ruan, rabbitm...@googlegroups.com, ahmed....@gmail.com

On 6 March 2015 at 01:54:15, Paul Ruan (paul...@dropbox.com) wrote:
> I was wondering if I can get a clarification on what you meant
> by "partially" resolved

Resolved for some cases but not all.

=INFO REPORT==== 28-Feb-2015::21:41:28 === node 'rabbit@nodeX' down: connection_closed
=INFO REPORT==== 28-Feb-2015::21:41:28 === node 'rabbit@nodeX' up

in your log suggest that the partition between B and X was very short, which leads to interesting
edge cases and race conditions.
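
For anyone who wants to look for such short down/up pairs in their own logs, here is a minimal Python sketch. It assumes log lines in the single-line form quoted in this thread (report header and message on one line) and English month abbreviations. The names `EVENT` and `find_flaps` and the two-second window are hypothetical choices for illustration, not part of any RabbitMQ tooling.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# =INFO REPORT==== 28-Feb-2015::21:41:28 === node 'rabbit@nodeX' down: connection_closed
EVENT = re.compile(
    r"=INFO REPORT==== (?P<ts>\d{1,2}-[A-Za-z]{3}-\d{4}::\d{2}:\d{2}:\d{2}) === "
    r"node '(?P<node>[^']+)' (?P<state>down|up)\b"
)


def find_flaps(lines, window=timedelta(seconds=2)):
    """Return (node, down_time, up_time) for each 'up' that follows a 'down'
    for the same node within `window`."""
    last_down = {}
    flaps = []
    for line in lines:
        m = EVENT.search(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%d-%b-%Y::%H:%M:%S")
        node = m["node"]
        if m["state"] == "down":
            last_down[node] = ts
        elif node in last_down and ts - last_down[node] <= window:
            flaps.append((node, last_down.pop(node), ts))
    return flaps


sample = [
    "=INFO REPORT==== 28-Feb-2015::21:41:28 === node 'rabbit@nodeX' down: connection_closed",
    "=INFO REPORT==== 28-Feb-2015::21:41:28 === node 'rabbit@nodeX' up",
]
for node, down_at, up_at in find_flaps(sample):
    print(node, "was down for", up_at - down_at)
```

Logs with one-second timestamps cannot show the real gap, so a reported duration of zero only means the down and up events landed in the same second.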

We have recently identified more edge cases in partition handling;
some will be fixed in 3.5.0, some in 3.5.x.
3.5.0 should be out next week.
