Hello,
We are using JGroups 3.6.20 over UDP in a cluster of interconnected machines with 7 permanent members (all members are started at the beginning of the scenario).
We have issues when split views need to be merged.
The following issues are easily reproduced in every test run.
Scenario no. 1
FLUSH protocol enabled in jgroups config
Start A, B, C, D, E, F, G nodes
A is the coordinator
Break network connection between nodes A and B
A and B are now coordinators
Observe that A now sees a different view from B, C, D, E, F, G (B, C, D, E, F, G share the same view, which differs from A's)
Reconnect A to B
A's view still differs from the view on B, C, D, E, F, G; the logs indicate that the MERGE has failed
Scenario no. 2
FLUSH protocol disabled in jgroups config
Start A, B, C, D, E, F, G nodes
A is the coordinator
Break network connection between nodes A and B
A and B are now coordinators
Observe that A now sees a different view from B, C, D, E, F, G (B, C, D, E, F, G share the same view, which differs from A's)
Reconnect A to B
Views are the same on A, B, C, D, E, F, G; the logs indicate that the MERGE has succeeded
Details on scenario no. 1
If a member (B) loses its connection to the coordinator (A), that member creates a new view and brings the members of the previous view that it can still reach into it. Those members (B, C, D, E, F, G) then install the new view.
The coordinator (A) is not notified that the other members' views have changed and keeps using the old view, even though none of those members (B, C, D, E, F, G) has the old view anymore.
For example, initially all members share this view:
A, B, C, D, E, F, G = (A | 1) (A, B, C, D, E, F, G)
2024-101T12:39:23.709Z DEBUG A [ViewHandler,A-48683] FLUSH:192 - A-48683: installing view [A-48683|4] (7) [A-48683, B-23606, C-55586, D-37549, E-62102, F-24460, G-18865]
Then, after the connection between A and B is broken, B eventually suspects that A has crashed and decides to create a new view. It communicates with the other reachable members, creates the new view and shares it with them. We end up with these views:
A = (A | 1) (A, B, C, D, E, F, G)
B, C, D, E, F, G = (B | 2) (B, C, D, E, F, G)
2024-101T15:40:12.368Z DEBUG B [ViewHandler,B-23606] FLUSH:192 - B-23606: installing view [B-23606|5] (6) [B-23606, C-55586, D-37549, E-62102, F-24460, G-18865]
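To make the view notation above concrete, here is a minimal, self-contained Java sketch of the split. It does not use the JGroups API: `ToyView` and `diverged` are hypothetical names chosen for illustration, and the view ids follow the schematic (A | 1) / (B | 2) notation rather than the ids shown in the log lines.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a cluster view written as "(coordinator | id) [members]".
// This is NOT the JGroups View class; every name here is illustrative.
final class ToyView {
    final String coordinator;
    final long id;
    final List<String> members;

    ToyView(String coordinator, long id, String... members) {
        this.coordinator = coordinator;
        this.id = id;
        this.members = Arrays.asList(members);
    }

    // Two nodes disagree if their views differ in coordinator, id, or membership.
    static boolean diverged(ToyView x, ToyView y) {
        return !x.coordinator.equals(y.coordinator)
            || x.id != y.id
            || !x.members.equals(y.members);
    }

    @Override
    public String toString() {
        return "(" + coordinator + " | " + id + ") " + members;
    }

    public static void main(String[] args) {
        // State after the A-B link is cut, as described above.
        ToyView seenByA = new ToyView("A", 1, "A", "B", "C", "D", "E", "F", "G");
        ToyView seenByOthers = new ToyView("B", 2, "B", "C", "D", "E", "F", "G");
        System.out.println("A sees:    " + seenByA);
        System.out.println("B..G see:  " + seenByOthers);
        System.out.println("diverged:  " + diverged(seenByA, seenByOthers));
    }
}
```

In this model, A still lists B through G as members while none of them lists A, which is the asymmetric state that the merge is later expected to repair.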
Then, we resume the connection between A and B.
After the connection between A and B is resumed, a view merge is attempted.
The merge fails because of flush failures and BLOCK/UNBLOCK timeouts. The merge is then retried over and over again, without success.
2024-101T12:41:22.186Z WARN A [INT-5,A-48683] FLUSH:242 - A-48683: waiting for UNBLOCK timed out after 2000 ms
2024-101T12:41:22.186Z WARN A [INT-5,A-48683] GMS:242 - A-48683: GMS flush by coordinator failed
2024-101T12:41:22.186Z ERROR A [INT-5,A-48683] GMS:282 - A-48683: failure handling the merge request: flush failed
2024-101T12:42:44.126Z TRACE A [INT-4,A-48683] UDP:172 - A-48683: received [dst: A-48683, src: B-23606 (3 headers), size=222 bytes, flags=OOB|INTERNAL], headers are GMS: GmsHeader[MERGE_RSP]merge_id=A-48683::3, UNICAST3: DATA, seqno=2, conn_id=2, TP: [cluster_name=MCM]
2024-101T12:42:45.542Z TRACE A [MergeTask,A-48683] GMS:172 - A-48683: collected 1 merge response(s) in 5002 ms
2024-101T12:42:45.542Z DEBUG A [MergeTask,A-48683] GMS:202 - A-48683: merge leader A-48683 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
2024-101T12:42:45.542Z WARN A [MergeTask,A-48683] GMS:252 - A-48683: merge is cancelled: merge leader rejected merge request
2024-101T12:43:01.823Z WARN A [INT-3,A-48683] FLUSH:242 - A-48683: waiting for UNBLOCK timed out after 2000 ms
2024-101T12:43:01.823Z WARN A [INT-3,A-48683] GMS:242 - A-48683: GMS flush by coordinator failed
2024-101T12:43:01.824Z ERROR A [INT-3,A-48683] GMS:282 - A-48683: failure handling the merge request: flush failed
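For anyone who wants to summarize longer logs of this kind, the following is a small sketch that counts WARN and ERROR entries per logger name (FLUSH, GMS, UDP, ...). It only assumes the line layout visible in the excerpts above (timestamp, level, node, bracketed thread name, `NAME:number`, dash, message); `MergeLogScan` and `problemsByProtocol` are hypothetical names, and the number after the colon is captured but not interpreted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts WARN/ERROR log entries per logger name, based on the layout
// of the log excerpts in this post. Lines that do not match are skipped.
final class MergeLogScan {
    // timestamp LEVEL node [thread] NAME:number - message
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+)\\s+(TRACE|DEBUG|INFO|WARN|ERROR)\\s+(\\S+)\\s+"
        + "\\[([^\\]]+)\\]\\s+(\\w+):(\\d+)\\s+-\\s+(.*)$");

    static Map<String, Integer> problemsByProtocol(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // not in the expected layout
            }
            String level = m.group(2);
            if (level.equals("WARN") || level.equals("ERROR")) {
                counts.merge(m.group(5), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2024-101T12:41:22.186Z WARN A [INT-5,A-48683] FLUSH:242 - A-48683: waiting for UNBLOCK timed out after 2000 ms",
            "2024-101T12:41:22.186Z WARN A [INT-5,A-48683] GMS:242 - A-48683: GMS flush by coordinator failed",
            "2024-101T12:41:22.186Z ERROR A [INT-5,A-48683] GMS:282 - A-48683: failure handling the merge request: flush failed");
        System.out.println(problemsByProtocol(lines)); // prints {FLUSH=1, GMS=2}
    }
}
```

Running this over the full log of node A would show how often the flush timeout recurs relative to the GMS merge failures.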
Q1: Why did the MERGE fail when using FLUSH?
Q2: Is this a known issue?
Q3: Is this issue solved in a newer version?
Thank you.
Liviu
Scenario no. 2
FLUSH protocol disabled in jgroups config
Start A, B, C, D, E, F, G nodes
A is the coordinator
Break network connection between nodes A and B
A and B are now coordinators
Observe that A now sees a different view from B, C, D, E, F, G (B, C, D, E, F, G share the same view, which differs from A's)
Reconnect A to B
Views are the same on A, B, C, D, E, F, G; the logs indicate that the MERGE has succeeded
-- Bela Ban | http://www.jgroups.org