MERGE failed - conflict between MERGE3 and FLUSH


Liviu Ioan

Apr 15, 2024, 8:23:33 AM
to jgroups-dev

Hello,

 

We are using JGroups 3.6.20 over UDP in a cluster of fully connected machines with 7 permanent members (all members are started at the beginning of the scenario).

We have issues when split views need to be merged.

The following issues are easily reproduced at every test.

 

Scenario no. 1

FLUSH protocol enabled in jgroups config

Start A, B, C, D, E, F, G nodes

A is the coordinator

Break network connection between nodes A and B

A and B are now coordinators

Observe that A sees a different view from B, C, D, E, F, G (B, C, D, E, F, G all share the same view, which differs from A's)

Reconnect A to B

Views are still different on A and B, C, D, E, F, G; logs indicate the MERGE has failed

Scenario no. 2

FLUSH protocol enabled in jgroups config

Start A, B, C, D, E, F, G nodes

A is the coordinator

Break network connection between nodes A and B

A and B are now coordinators

Observe that A sees a different view from B, C, D, E, F, G (B, C, D, E, F, G all share the same view, which differs from A's)

Reconnect A to B

Views are the same on A, B, C, D, E, F, G; logs indicate the MERGE was successful

 


Details on scenario no. 1

If a member (B) loses its connection to the coordinator (A), that member creates a new view and gets the members of the previous view that it can still reach to join it. Those members (B, C, D, E, F, G) then install the new view.

The coordinator (A) is not notified that the other members' views have changed, and keeps using the old view, even though none of the other members (B, C, D, E, F, G) have the old view anymore.

 

So, for example, initially, we have those members sharing this view:

A, B, C, D, E, F, G = (A | 1) (A, B, C, D, E, F, G)

 

2024-101T12:39:23.709Z DEBUG A [ViewHandler,A-48683]  FLUSH:192  - A-48683: installing view [A-48683|4] (7) [A-48683, B-23606, C-55586, D-37549, E-62102, F-24460, G-18865]

 

Then, after breaking the connection between A and B, there is a period of time where B suspects A to have crashed, and then B decides to make a new view. It communicates with the other reachable members, creates the new view and shares it with the members. We end up having these views:

A = (A | 1) (A, B, C, D, E, F, G)

B, C, D, E, F, G = (B | 2) (B, C, D, E, F, G)

 

2024-101T15:40:12.368Z DEBUG B [ViewHandler,B-23606]  FLUSH:192  - B-23606: installing view [B-23606|5] (6) [B-23606, C-55586, D-37549, E-62102, F-24460, G-18865]
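The split described above can be sketched with a small toy model. This is only an illustration of the (creator | id) (members) notation used in this post: `ToyView`, `without` and `merge` are hypothetical names, not JGroups classes or methods, and the rule "first member of the union becomes coordinator of the merged view" is an assumption made for the example, not a statement about how GMS picks the coordinator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: NOT the JGroups View class, just the notation from this post.
final class ToyView {
    final String creator;       // member that created the view (its coordinator)
    final long id;              // view id, increased for every new view
    final List<String> members; // ordered membership

    ToyView(String creator, long id, List<String> members) {
        this.creator = creator;
        this.id = id;
        this.members = Collections.unmodifiableList(new ArrayList<>(members));
    }

    // View installed by the remaining members after one member is suspected.
    // Assumption: the first remaining member becomes the new coordinator.
    ToyView without(String suspected) {
        List<String> rest = new ArrayList<>(members);
        rest.remove(suspected);
        return new ToyView(rest.get(0), id + 1, rest);
    }

    // Expected result of a successful merge: the union of both memberships,
    // with an id higher than the ids of both partitions (assumed rule).
    static ToyView merge(ToyView a, ToyView b) {
        Set<String> union = new LinkedHashSet<>(a.members);
        union.addAll(b.members);
        List<String> all = new ArrayList<>(union);
        return new ToyView(all.get(0), Math.max(a.id, b.id) + 1, all);
    }

    @Override
    public String toString() {
        return "[" + creator + "|" + id + "] (" + members.size() + ") " + members;
    }
}

class SplitDemo {
    public static void main(String[] args) {
        ToyView onA = new ToyView("A", 1, Arrays.asList("A", "B", "C", "D", "E", "F", "G"));
        ToyView onB = onA.without("A");           // B suspects A and excludes it
        System.out.println("A still has:   " + onA);
        System.out.println("B..G now have: " + onB);
        System.out.println("merge target:  " + ToyView.merge(onA, onB));
    }
}
```

In this model A keeps (A | 1) with all 7 members while B to G move to (B | 2) with 6 members, which is the state the cluster is in when the merge is attempted.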

 

Then, we resume the connection between A and B.

After the connection between A and B is resumed, a view merge is attempted.

The merge fails because the flush fails and the block/unblock phases time out. The merge is then retried over and over again, without success.

2024-101T12:41:22.186Z WARN  A [INT-5,A-48683]  FLUSH:242  - A-48683: waiting for UNBLOCK timed out after 2000 ms

2024-101T12:41:22.186Z WARN  A [INT-5,A-48683]  GMS:242  - A-48683: GMS flush by coordinator failed

2024-101T12:41:22.186Z ERROR A [INT-5,A-48683]  GMS:282  - A-48683: failure handling the merge request: flush failed

2024-101T12:42:44.126Z TRACE A [INT-4,A-48683]  UDP:172  - A-48683: received [dst: A-48683, src: B-23606 (3 headers), size=222 bytes, flags=OOB|INTERNAL], headers are GMS: GmsHeader[MERGE_RSP]merge_id=A-48683::3, UNICAST3: DATA, seqno=2, conn_id=2, TP: [cluster_name=MCM]

2024-101T12:42:45.542Z TRACE A [MergeTask,A-48683]  GMS:172  - A-48683: collected 1 merge response(s) in 5002 ms

2024-101T12:42:45.542Z DEBUG A [MergeTask,A-48683]  GMS:202  - A-48683: merge leader A-48683 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge

2024-101T12:42:45.542Z WARN  A [MergeTask,A-48683]  GMS:252  - A-48683: merge is cancelled: merge leader rejected merge request

2024-101T12:43:01.823Z WARN  A [INT-3,A-48683]  FLUSH:242  - A-48683: waiting for UNBLOCK timed out after 2000 ms

2024-101T12:43:01.823Z WARN  A [INT-3,A-48683]  GMS:242  - A-48683: GMS flush by coordinator failed

2024-101T12:43:01.824Z ERROR A [INT-3,A-48683]  GMS:282  - A-48683: failure handling the merge request: flush failed
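To see how often each protocol reports a problem across the retries, the log excerpt above can be tallied with a short stdlib-only helper. This is a minimal sketch: the regular expression is an assumption derived from the lines shown in this post (timestamp, level, node, [thread], PROTOCOL:number, " - ", message) and would need adjusting for a different log pattern. `MergeLogScan` and `countProblems` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MergeLogScan {
    // Assumed layout, taken from the excerpt in this post:
    // timestamp LEVEL node [thread]  PROTOCOL:number  - message
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+)\\s+(TRACE|DEBUG|INFO|WARN|ERROR)\\s+(\\S+)\\s+\\[([^\\]]+)\\]"
        + "\\s+(\\w+):(\\d+)\\s+-\\s+(.*)$");

    // Counts WARN and ERROR lines per protocol name (e.g. FLUSH, GMS).
    // Lines that do not match the assumed layout are skipped.
    static Map<String, Integer> countProblems(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches()) {
                continue;
            }
            String level = m.group(2);
            if (level.equals("WARN") || level.equals("ERROR")) {
                counts.merge(m.group(5), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = new ArrayList<>();
        sample.add("2024-101T12:41:22.186Z WARN  A [INT-5,A-48683]  FLUSH:242  - "
            + "A-48683: waiting for UNBLOCK timed out after 2000 ms");
        sample.add("2024-101T12:41:22.186Z WARN  A [INT-5,A-48683]  GMS:242  - "
            + "A-48683: GMS flush by coordinator failed");
        sample.add("2024-101T12:41:22.186Z ERROR A [INT-5,A-48683]  GMS:282  - "
            + "A-48683: failure handling the merge request: flush failed");
        System.out.println(countProblems(sample));
    }
}
```

Running it over the full log of A would show whether every failed merge attempt is preceded by a FLUSH timeout, as in the excerpt.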

 

Q1: Why did the MERGE fail when using FLUSH?

Q2: Is this a known issue?

Q3: Is this issue solved in a newer version?

 

Thank you.

Liviu

Liviu Ioan

Apr 15, 2024, 8:28:55 AM
to jgroups-dev
Hello again,

Sorry, I made a mistake in the 2nd scenario. The corrected version is:

Scenario no. 2

FLUSH protocol disabled in jgroups config

Start A, B, C, D, E, F, G nodes

A is the coordinator

Break network connection between nodes A and B

A and B are now coordinators

Observe that A sees a different view from B, C, D, E, F, G (B, C, D, E, F, G all share the same view, which differs from A's)

Reconnect A to B

Views are the same on A, B, C, D, E, F, G; logs indicate the MERGE was successful


Thanks.
Liviu

Bela Ban

Apr 15, 2024, 2:37:00 PM
to jgrou...@googlegroups.com
Hi Liviu

I'm afraid I won't be able to help, as fixing FLUSH is a throwaway effort, because it will be dropped in 5.4.
Sorry,

-- 
Bela Ban | http://www.jgroups.org
