node not rejoining the group after a restart


Thomas Houtekier

Sep 26, 2018, 2:34:58 PM
to jgroups-dev
I have a problem with the following setup on WildFly 9.0.2.Final, which includes JGroups 3.6.4.Final.
When restarting one of the nodes, it disappears from the group and never re-joins, even though the logs indicate that we receive messages from that node.
The setup is supposed to have 40+ nodes, but because it is nowhere near stable it is currently downscaled to 6. The same problems happen at this scale.

These are the logs from node 172.25.14.2 when 172.25.22.2 is restarting:
Before the shutdown, we see a group of 6 members

Sep:26,18:06:59,255 INFO  (:Incoming-11,ee-interserver,172.25.14.2:) [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel interserver: [172.25.14.2|13] (6) [172.25.14.2, 172.25.11.2, 172.25.5.2, 172.25.3.2, 172.25.22.2, 172.25.1.2]

node 172.25.22.2 is shutting down and gracefully leaving the cluster:
Sep:26,18:12:27,087 FINER (:TransferQueueBundler,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending 1 msgs (124 bytes (0.40 of max_bundle_size) to 1 dests(s): [ee-interserver:172.25.22.2]
Sep:26,18:12:27,112 FINER (:OOB-10,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: received [dst: 172.25.14.2, src: 172.25.22.2 (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[LEAVE_REQ]: mbr=172.25.22.2, UNICAST3: DATA, seqno=10, conn_id=11, UDP: [cluster_name=ee-interserver]
Sep:26,18:12:27,112 FINER (:OOB-10,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UNICAST3] 172.25.14.2 <-- DATA(172.25.22.2: #10, conn_id=11)
Sep:26,18:12:27,112 FINER (:OOB-10,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UNICAST3] 172.25.14.2: delivering 172.25.22.2#10
Sep:26,18:12:27,162 FINER (:ViewHandler,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.GMS] 172.25.14.2: joiners=[], suspected=[], leaving=[172.25.22.2], new view: [172.25.14.2|14] (5) [172.25.14.2, 172.25.11.2, 172.25.5.2, 172.25.3.2, 172.25.1.2]
Sep:26,18:12:27,162 FINER (:ViewHandler,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.GMS] 172.25.14.2: sending LEAVE response to 172.25.22.2
Sep:26,18:12:27,162 FINER (:ViewHandler,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending msg to 172.25.22.2, src=172.25.14.2, headers are GMS: GmsHeader[LEAVE_RSP], UDP: [cluster_name=ee-interserver]
Sep:26,18:12:27,162 FINER (:ViewHandler,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.GMS] 172.25.14.2: mcasting view [172.25.14.2|14] (5) [172.25.14.2, 172.25.11.2, 172.25.5.2, 172.25.3.2, 172.25.1.2] (5 mbrs)


Sep:26,18:12:27,164 FINER (:Incoming-18,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.GMS] 172.25.14.2: received delta view [172.25.14.2|14], ref-view=[172.25.14.2|13], left=[172.25.22.2]
Sep:26,18:12:27,164 FINE  (:Incoming-18,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.GMS] 172.25.14.2: installing view [172.25.14.2|14] (5) [172.25.14.2, 172.25.11.2, 172.25.5.2, 172.25.3.2, 172.25.1.2]
Sep:26,18:12:27,164 FINER (:Incoming-18,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UNICAST3] 172.25.14.2: closing connections of non members [172.25.22.2]
Sep:26,18:12:27,164 FINE  (:Incoming-18,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.NAKACK2] 172.25.14.2: removed 172.25.22.2 from xmit_table (not member anymore)


Sep:26,18:12:27,165 FINER (:Incoming-18,ee-interserver,172.25.14.2:) [org.jgroups.protocols.FRAG2] 172.25.14.2: removed 172.25.22.2 from fragmentation table



So far so good.
Some time later, 22.2 starts again, but is never added to the group, even though I do see messages like the following, which indicate communication between 14.2 and 22.2:
Sep:26,18:12:27,180 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending msg to 172.25.22.2, src=172.25.14.2, headers are UNICAST3: ACK, seqno=10, conn_id=11, ts=22, UDP: [cluster_name=ee-interserver]
Sep:26,18:12:27,180 FINER (:TransferQueueBundler,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending 1 msgs (70 bytes (0.23 of max_bundle_size) to 1 dests(s): [ee-interserver:172.25.22.2]
(...)
Sep:26,18:12:31,691 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UNICAST3] 172.25.14.2 --> XMIT(172.25.22.2: #3)
Sep:26,18:12:31,691 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending msg to 172.25.22.2, src=172.25.14.2, headers are RequestCorrelator: id=200, type=RSP, id=12, rsp_expected=false, FORK: ee-interserver:interserver, UNICAST3: DATA, seqno=3, conn_id=6, UDP: [cluster_name=ee-interserver]



I'm assuming the following message is the reason why the new node is not added to the group, but I don't know how to interpret it.
As you can see, the new node is present in this STABLE message, but since the message is discarded, it is not processed into the digest.

Sep:26,18:12:32,132 FINER (:INT-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.STABLE] 172.25.14.2: discarded STABILITY message with different view-id [172.25.16.2|0] (my view-id=[172.25.14.2|14] (5) [172.25.14.2, 172.25.11.2, 172.25.5.2, 172.25.3.2, 172.25.1.2])
Sep:26,18:12:32,192 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UNICAST3] 172.25.14.2 --> XMIT(172.25.22.2: #3)
Sep:26,18:12:32,192 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending msg to 172.25.22.2, src=172.25.14.2, headers are RequestCorrelator: id=200, type=RSP, id=12, rsp_expected=false, FORK: ee-interserver:interserver, UNICAST3: DATA, seqno=3, conn_id=6, UDP: [cluster_name=ee-interserver]
Sep:26,18:12:32,192 FINER (:TransferQueueBundler,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending 1 msgs (124 bytes (0.40 of max_bundle_size) to 1 dests(s): [ee-interserver:172.25.22.2]
Sep:26,18:12:32,275 FINER (:INT-1,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: received [dst: <null>, src: 172.25.1.2 (2 headers), size=0 bytes, flags=INTERNAL], headers are MERGE3: INFO: view_id=[172.25.14.2|14], logical_name=172.25.1.2, physical_addr=172.25.1.2:55201, UDP: [cluster_name=ee-interserver]
Sep:26,18:12:32,571 FINER (:INT-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: received [dst: 172.25.14.2, src: 172.25.3.2 (2 headers), size=103 bytes, flags=OOB|NO_RELIABILITY|INTERNAL], headers are STABLE: [STABLE_GOSSIP] view-id= [172.25.14.2|14], UDP: [cluster_name=ee-interserver]
Sep:26,18:12:32,571 FINER (:INT-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.STABLE] 172.25.14.2: handling digest from 172.25.3.2:
mine:   172.25.14.2: [23], 172.25.11.2: [1], 172.25.5.2: [28], 172.25.3.2: [1], 172.25.1.2: [6]
other:  172.25.14.2: [23], 172.25.11.2: [1], 172.25.5.2: [28], 172.25.3.2: [1], 172.25.1.2: [6]
result: 172.25.14.2: [23], 172.25.11.2: [1], 172.25.5.2: [28], 172.25.3.2: [1], 172.25.1.2: [6]



I found the code where this message is printed (https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/pbcast/STABLE.java) but I can't figure out what the problem is.
It is printed multiple times and it seems to be the reason why that new node never joins the group.


The logs from the node that fails to join the group (the other side, so to speak):
At startup:

Sep:26,18:12:48,707 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] 172.25.22.2: stable task started
Sep:26,18:12:48,756 FINE  (:MSC service thread 1-5:) [org.jgroups.protocols.UDP] sockets will use interface 172.25.22.2
Sep:26,18:12:48,762 FINE  (:MSC service thread 1-5:) [org.jgroups.protocols.UDP] socket information:
mcast_addr=239.249.0.82:45691, bind_addr=/172.25.22.2, ttl=24
sock: bound to 172.25.22.2:55201, receive buffer size=20000000, send buffer size=640000
mcast_sock: bound to 172.25.22.2:45691, send buffer size=640000, receive buffer size=25000000
Sep:26,18:12:48,775 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.UDP] 172.25.22.2: sending msg to null, src=172.25.22.2, headers are PING: [type=GET_MBRS_REQ, cluster=ee-interserver], UDP: [cluster_name=ee-interserver]
Sep:26,18:12:48,937 INFO  (:ServerService Thread Pool -- 8:) [org.jboss.as.jpa] WFLYJPA0010: Starting Persistence Unit (phase 2 of 2) Service 'com.barco.nms.server.ear#firebird'
Sep:26,18:12:48,978 FINER (:TransferQueueBundler,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: sending 1 msgs (111 bytes (0.36 of max_bundle_size) to 1 dests(s): [ee-interserver]
(...)
Sep:26,18:12:51,421 FINER (:INT-1,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: 172.25.14.2 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, UDP: [cluster_name=ee-interserver]
Sep:26,18:12:51,425 FINER (:INT-2,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: 0591b4e3-74b3-7655-1d43-74e949444c2a (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, UDP: [cluster_name=ee-interserver]
Sep:26,18:12:51,568 FINER (:INT-1,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: eb6746f9-ad4f-d89f-7522-95e2e9a90070 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, UDP: [cluster_name=ee-interserver]
Sep:26,18:12:51,777 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.GMS] 172.25.22.2: no members discovered after 3007 ms: creating cluster as first member
Sep:26,18:12:51,781 FINE  (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.NAKACK2]
[172.25.22.2 setDigest()]
existing digest:  []
new digest:       172.25.22.2: [0 (0)]
resulting digest: 172.25.22.2: [0 (0)]
Sep:26,18:12:51,781 FINE  (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.GMS] 172.25.22.2: installing view [172.25.22.2|0] (1) [172.25.22.2]
Sep:26,18:12:51,783 FINE  (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
Sep:26,18:12:51,784 FINE  (:MSC service thread 1-5:) [org.jgroups.protocols.FD_SOCK] 172.25.22.2: VIEW_CHANGE received: [172.25.22.2]
Sep:26,18:12:51,788 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] 172.25.22.2: reset digest to 172.25.22.2: [-1]
Sep:26,18:12:51,789 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.UFC] new membership: [172.25.22.2]
Sep:26,18:12:51,790 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.MFC] new membership: [172.25.22.2]
Sep:26,18:12:51,791 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] 172.25.22.2: reset digest to 172.25.22.2: [-1]
Sep:26,18:12:51,791 FINE  (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
Sep:26,18:12:51,792 FINE  (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.GMS] 172.25.22.2: created cluster (first member). My view is [172.25.22.2|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl



First, an FD_ALL heartbeat from 14.2 (which shows that messages from 14.2 are being received), and then the same 'discarded' message:
Sep:26,18:12:54,422 FINER (:INT-1,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: 172.25.14.2 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, UDP: [cluster_name=ee-interserver]
Sep:26,18:12:55,093 FINER (:INT-2,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: 172.25.14.2 (2 headers), size=103 bytes, flags=OOB|NO_RELIABILITY|INTERNAL], headers are STABLE: [STABILITY] view-id= [172.25.14.2|14], UDP: [cluster_name=ee-interserver]
Sep:26,18:12:55,094 FINER (:INT-2,ee-interserver,172.25.22.2:) [org.jgroups.protocols.pbcast.STABLE] 172.25.22.2: discarded STABILITY message with different view-id [172.25.14.2|14] (my view-id=[172.25.22.2|0] (1) [172.25.22.2])



I don't find any errors or anything obviously wrong in this, but the result is that the restarted node (22.2) doesn't rejoin the group.
Both ends are using the same config (as I said, this is on WildFly):
<stack name="udp-interserver">
    <transport type="UDP" socket-binding="jgroups-interserver-udp">
        <property name="ip_ttl">24</property>
        <property name="log_discard_msgs">false</property>
    </transport>
    <protocol type="PING"/>
    <protocol type="MERGE3"/>
    <protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>
    <protocol type="FD_ALL"/>
    <protocol type="VERIFY_SUSPECT"/>
    <protocol type="pbcast.NAKACK2"/>
    <protocol type="UNICAST3"/>
    <protocol type="pbcast.STABLE"/>
    <protocol type="pbcast.GMS"/>
    <protocol type="UFC"/>
    <protocol type="MFC"/>
    <protocol type="FRAG2"/>
    <protocol type="RSVP"/>
</stack>



Am I missing something obvious here? Any information that can help me understand this problem will be much appreciated.


Thomas Houtekier

Sep 27, 2018, 6:38:19 AM
to jgroups-dev
It may be interesting to add that this happens on one particular setup.
We have been trying to reproduce this situation on another setup but have never seen this problem, even with far more nodes.
On a setup with 35 nodes we ran a test that constantly and randomly restarts nodes and verifies whether the group is complete again after the restart. It ran fine for multiple days.
Initially I suspected a network issue, but that has been ruled out: multicast is correctly configured and Wireshark shows that messages are received from all nodes. The logs actually show that too.

The problem (the restarted node not rejoining the group) becomes apparent when you start adding things to the replicated Infinispan cache, which is backed by the JGroups group.
This causes long blocking calls and eventually replication-timeout exceptions, which is why this failure to rejoin is a problem: the replicated cache becomes unusable.


On Wednesday, September 26, 2018 at 20:34:58 UTC+2, Thomas Houtekier wrote:

Bela Ban

Sep 27, 2018, 7:30:54 AM
to jgrou...@googlegroups.com
I'm going to be on PTO next week, so I won't be able to look into this
right away, but meanwhile I suggest you come up with a simple
*JGroups-only* test which continually has 1 member leave and rejoin, and
run this in your environment.

This would be to exclude/include the possibility that Infinispan is
causing this.
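
A minimal sketch of such a JGroups-only leave/rejoin loop (the class name, cluster name, sleep intervals and the udp.xml path below are placeholders, not code from this thread) could look like this:

import org.jgroups.JChannel;

// Repeatedly join, stay a while, then leave gracefully and rejoin,
// printing the view after each connect so missing members are visible.
public class LeaveRejoinTest {
    public static void main(String[] args) throws Exception {
        for (int i = 0; ; i++) {
            JChannel ch = new JChannel("udp.xml");
            ch.connect("rejoin-test");
            System.out.printf("iteration %d, view: %s%n", i, ch.getView());
            Thread.sleep(5000); // stay in the cluster for a while
            ch.close();         // leave gracefully
            Thread.sleep(2000); // simulate the restart gap
        }
    }
}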

I promise to look into this when I get back.
Cheers

--
Bela Ban | http://www.jgroups.org

Thomas Houtekier

Sep 30, 2018, 11:30:00 PM
to jgroups-dev
Thanks, I'll get back here when (if) I can reproduce this with a JGroups-only application.
In the meantime, I have this question: where do I find info on bugs #943480 and #938584?

The message I'm concerned about ('discarded STABILITY message with different view-id') is printed in STABLE.java, and has this comment on it:

            // we won't handle the stable_digest, if its members don't match the membership in my own digest,
            // this is part of the fix for the NAKACK problem (bugs #943480 and #938584)
            if(!view_id.equals(this.view.getViewId())) {
                log.trace("%s: discarded STABILITY message with different view-id %s (my view-id=%s)",
                          local_addr, view_id, view);
                // resetDigest();
                return;
            }
            log.trace("%s: received stability msg from %s: %s", local_addr, sender, printDigest(stable_digest));

            num_stability_msgs_received++;
            resetDigest();

I would like to know the context of this. Is it, for instance, an option to remove this check and call resetDigest() anyway?
It's difficult to know what to expect if I don't know the background of this condition. Where can I find info on those two bugs, #943480 and #938584?
Thanks

On Thursday, September 27, 2018 at 21:00:54 UTC+9:30, Bela Ban wrote:

Thomas Houtekier

Oct 2, 2018, 3:55:26 AM
to jgroups-dev
As you suggested, I've made a JGroups-only app (3.6.4.Final), and with it I can reproduce the same problem I described before: members are not joining the cluster and I see the same message generated by STABLE.java, such as:
FINER: 10.0.75.1: discarded STABILITY message with different view-id [10.202.86.37|0] (my view-id=MergeView::[10.0.75.1|5] (2) [10.0.75.1, 10.202.86.33], 1 subgroups: [10.202.86.33|1] (2) [10.202.86.33, 10.0.75.1])

In the above example, 10.202.86.37, 10.202.86.33 and 10.0.75.1 should form a cluster.
All of them have multiple network interfaces; 10.0.75.1 uses one on the other network for its name, but they are all connected to the same network with one connection.

The test app is as simple as it gets; it looks like this:
public TestAppJgroups() throws Exception {
    channel = new JChannel("udp.xml");
}

public static void main(String[] args) {
    System.setProperty("java.net.preferIPv4Stack", "true");
    if (args == null || args.length == 0) {
        throw new RuntimeException("please specify the port for the webserver");
    }
    logger.log(Level.WARNING, "starting ");
    int port = Integer.parseInt(args[0]);
    try {
        TestAppJgroups app = new TestAppJgroups();
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        // I add a webserver here so that I can manually stop the app (System.exit) and request the view of the channel
        server.start();
        app.start();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void start() throws Exception {
    channel.setReceiver(this);
    channel.setName(Inet4Address.getLocalHost().getHostAddress());
    channel.connect("jgroups-test-cluster");
    eventLoop(); // as in http://www.jgroups.org/tutorial/
    channel.close();
}



This is the config:
<config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
    <UDP mcast_addr="${jboss.partition.udpGroup:239.249.0.99}"
         mcast_port="${jboss.hapartition.mcast_port:45566}"
         tos="8"
         ucast_recv_buf_size="20000000"
         ucast_send_buf_size="640000"
         mcast_recv_buf_size="25000000"
         mcast_send_buf_size="640000"
         loopback="false"
         discard_incompatible_packets="true"
         enable_bundling="false"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         ip_ttl="${jgroups.udp.ip_ttl:10}"/>
    <PING/>
    <MERGE3/>
    <FD_SOCK/>
    <FD_ALL/>
    <VERIFY_SUSPECT/>
    <pbcast.NAKACK2/>
    <UNICAST3/>
    <pbcast.STABLE/>
    <pbcast.GMS/>
    <UFC/>
    <MFC/>
    <FRAG2/>
    <RSVP/>
</config>

 
As an experiment, I changed STABLE.java to not discard those messages in that condition, and to continue with updateLocalDigest() or resetDigest().
It didn't make it work; it results in the following logs (note the updated FINER message about not discarding the message).
Notable is the '-1' in the message 'reset digest to 10.202.86.33: [-1]'.

Oct 02, 2018 2:26:43 PM org.jgroups.protocols.pbcast.STABLE handleStabilityMessage
FINER: 10.202.86.33: PATCH: NOT discarded STABILITY message with different view-id [10.202.86.37|0] (my view-id=[10.202.86.33|0])
Oct 02, 2018 2:26:43 PM org.jgroups.protocols.pbcast.STABLE handleStabilityMessage
FINER: 10.202.86.33: received stability msg from 10.202.86.37:
Oct 02, 2018 2:26:43 PM org.jgroups.protocols.pbcast.STABLE resetDigest
FINER: 10.202.86.33: reset digest to 10.202.86.33: [-1]
Oct 02, 2018 2:26:43 PM org.jgroups.protocols.pbcast.NAKACK2 stable
FINER: 10.202.86.33: received stable digest 10.202.86.37: [0 (0)]
Oct 02, 2018 2:26:47 PM org.jgroups.protocols.TP passMessageUp
FINER: 10.202.86.33: received [dst: <null>, src: 10.202.86.37 (2 headers), size=0 bytes, flags=INTERNAL], headers are MERGE3: INFO: view_id=[10.202.86.37|0], logical_name=10.202.86.37, physical_addr=172.25.8.2:62444, UDP: [cluster_name=interOR-test-cluster]
Oct 02, 2018 2:26:52 PM org.jgroups.protocols.TP passMessageUp
FINER: 10.202.86.33: received [dst: <null>, src: 10.202.87.31 (2 headers), size=0 bytes, flags=INTERNAL], headers are MERGE3: INFO: view_id=[10.202.87.31|22], logical_name=10.202.87.31, physical_addr=172.25.20.2:64271, UDP: [cluster_name=interOR-test-cluster]
Oct 02, 2018 2:26:52 PM org.jgroups.protocols.pbcast.STABLE sendStableMessage
FINER: 10.202.86.33: sending stable msg to 10.202.86.33: 10.202.86.33: [0]
Oct 02, 2018 2:26:52 PM org.jgroups.protocols.TP down
FINER: 10.202.86.33: sending msg to 10.202.86.33, src=10.202.86.33, headers are STABLE: [STABLE_GOSSIP] view-id= [10.202.86.33|0], UDP: [cluster_name=interOR-test-cluster]
Oct 02, 2018 2:26:52 PM org.jgroups.protocols.TP loopback
FINER: 10.202.86.33: looping back message [dst: 10.202.86.33, src: 10.202.86.33 (2 headers), size=21 bytes, flags=OOB|NO_RELIABILITY|INTERNAL]
Oct 02, 2018 2:26:52 PM org.jgroups.protocols.TP passMessageUp
FINER: 10.202.86.33: received [dst: 10.202.86.33, src: 10.202.86.33 (2 headers), size=21 bytes, flags=OOB|NO_RELIABILITY|INTERNAL], headers are STABLE: [STABLE_GOSSIP] view-id= [10.202.86.33|0], UDP: [cluster_name=interOR-test-cluster]
Oct 02, 2018 2:26:52 PM org.jgroups.protocols.pbcast.STABLE updateLocalDigest
FINER: 10.202.86.33: handling digest from 10.202.86.33:
mine:   10.202.86.33: [-1]
other:  10.202.86.33: [0]
result: 10.202.86.33: [0]

Oct 02, 2018 2:26:52 PM org.jgroups.protocols.pbcast.STABLE resetDigest
FINER: 10.202.86.33: reset digest to 10.202.86.33: [-1]
Oct 02, 2018 2:26:52 PM org.jgroups.protocols.pbcast.NAKACK2 stable
FINER: 10.202.86.33: received stable digest 10.202.86.33: [0 (0)]
Oct 02, 2018 2:26:52 PM org.jgroups.protocols.pbcast.NAKACK2 stable
FINER: 10.202.86.33: deleting msgs <= 0 from 10.202.86.33



To be able to make sense of this, I at least need to know the context of this comment in STABLE.java: what are bugs #943480 and #938584?
From STABLE.java:
// we won't handle the stable_digest, if its members don't match the membership in my own digest,
// this is part of the fix for the NAKACK problem (bugs #943480 and #938584)


On Thursday, September 27, 2018 at 21:00:54 UTC+9:30, Bela Ban wrote:
I'm going to be on PTO next week, so I won't be able to look into this

Thomas Houtekier

Oct 4, 2018, 4:59:07 AM
to jgroups-dev
This is where I am now:
This test application reproduces the problem I described, but if I switch to jgroups-4.0.14.Final, I can't reproduce the problem anymore.
I still see those 'discarded STABILITY message with different view-id' messages, but eventually the situation resolves and the restarted node joins the cluster.
If I use the latest 3.6 release (3.6.16), the problem is still there.

As I said in my initial post, I'm on wildfly-9.0.2: upgrading JGroups to 4.0.14 is not possible, as it is not compatible with org.jboss.as.clustering.jgroups.JChannelFactory.
It crashes with:
Caused by: java.lang.VerifyError: Bad return type
Exception Details:
  Location:
    org/jboss/as/clustering/jgroups/JChannelFactory.createChannel(Ljava/lang/String;)Lorg/jgroups/Channel; @541: areturn
  Reason:
    Type 'org/jgroups/JChannel' (current frame, stack[0]) is not assignable to 'org/jgroups/Channel' (from method signature)



I'm currently investigating whether I can get away with a wrapper or something to keep JChannelFactory callers happy, unless you know a better way to integrate jgroups-4.0.14 into wildfly-9.

Thomas

On Tuesday, October 2, 2018 at 17:25:26 UTC+9:30, Thomas Houtekier wrote:

Thomas Houtekier

Oct 5, 2018, 7:53:03 AM
to jgroups-dev
During tests with 3.6.4 I noticed the following, which is 100% reproducible:
  • a node which is restarted never rejoins the cluster and becomes a singleton, unless that node happens to be the merge leader.
  • when restarting the merge leader, the cluster forms completely and all nodes have a consistent view, including the ones that were singletons before that restart.
I would really like to understand what is going on here. On the same setup, with the same config but jgroups-4.0.14, everything always works fine. It has nothing to do with the secondary network, which I also suspected for a while.
I included log files of such a scenario with 3.6.4 and 3 nodes; check the 'discarded STABILITY message with different view-id' prints. At the end of the run, node 14 was the leader and was restarted, and the complete cluster was formed. I hope this gives enough information to figure this out.
This only happens on a certain setup. I don't have direct access to it anymore, but if there are any questions or other checks that could be done to further narrow this down, I would be happy to hear them.
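
(As an aside, a rough way to check from a test app whether the local node is the coordinator, and therefore the likely merge leader, is to look at the first member of the current view; this snippet is only an illustration, not code from the thread:)

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.View;

// The coordinator of a view is its first member; during a merge it is
// typically one of the coordinators that acts as merge leader.
public final class CoordinatorCheck {
    public static boolean isCoordinator(JChannel channel) {
        View view = channel.getView();
        Address first = view.getMembers().get(0);
        return first.equals(channel.getAddress());
    }
}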

kind regards and thanks for any input
Thomas




On Thursday, October 4, 2018 at 18:29:07 UTC+9:30, Thomas Houtekier wrote:
DiscardedMessages.zip

Bela Ban

Oct 12, 2018, 11:24:51 AM
to jgrou...@googlegroups.com
OK, so if you can reproduce this with 3.6.16:
* What's the config you've been using with 3.6.16?
* What exactly do you do with your test app? Start multiple apps on the
same node, leave one of them and restart?
* In your initial log, the time between leaving and rejoining was a
couple of hundred milliseconds, so this cannot be done manually!
* How many nodes do you start?
* What's the diff between your app and the tutorial (SampleChat)? Could
I just use SampleChat with your config?



Bela Ban

Oct 12, 2018, 11:27:38 AM
to jgrou...@googlegroups.com


On 05/10/18 13:53, Thomas Houtekier wrote:
> During tests with 3.6.4 I noticed the following, which is 100% reproducable:

I'm not interested in 3.6.4, as this is from 2015! However, if you can
reproduce this on 3.6.16, then I'm *very* interested...

> * a node which is restarted never rejoins the cluster and becomes a
> singleton, unless that node happens to be the mergeleader.
> * when restarting the mergeleader, the cluster forms completely and
> all nodes have a consistent view. Also the ones that were a
> singleton before that restart.
>
> I would really want to understand what is going on here. On the same
> setup, same config but jgroups-4.0.14 everyhing works fine, always. It
> has nothing to do with the secondary network, which I also suspected for
> a while.

OK, so the network doesn't seem to be the issue then...

Thomas Houtekier

Oct 12, 2018, 12:26:45 PM
to jgroups-dev
Hi Bela,

Thanks for getting back to me, much appreciated.
The config used with 3.6.16 was exactly the same as with 3.6.4 and 4.0.14; it didn't make a difference. Only changing the JGroups version made a difference.
The initial config has everything at its defaults, as I posted in my initial message.
I also used the following config, which is the one the demo app comes with, but with all timeouts and buffers increased, to try and rule out that it has something to do with timeouts/buffer sizes:
<config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
    <UDP mcast_addr="${jboss.partition.udpGroup:239.249.0.99}"
         mcast_port="${jboss.hapartition.mcast_port:45566}"
         tos="8"
         ucast_recv_buf_size="50M"
         ucast_send_buf_size="50M"
         mcast_recv_buf_size="50M"
         mcast_send_buf_size="50M"
         max_bundle_size="64K"
         ip_ttl="${jgroups.udp.ip_ttl:10}"/>
    <PING/>
    <MERGE3 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD_ALL/>
    <VERIFY_SUSPECT timeout="5000"/>
    <pbcast.NAKACK2 xmit_interval="500"
                    xmit_table_num_rows="100"
                    xmit_table_msgs_per_row="2000"
                    xmit_table_max_compaction_time="30000"
                    use_mcast_xmit="false"
                    discard_delivered_msgs="true"/>
    <UNICAST3 xmit_interval="500"
              xmit_table_num_rows="100"
              xmit_table_msgs_per_row="2000"
              xmit_table_max_compaction_time="60000"
              conn_expiry_timeout="0"/>
    <pbcast.STABLE desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="2000"/>
    <UFC max_credits="2M"
         min_threshold="0.4"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"/>
    <RSVP resend_interval="2000" timeout="10000"/>
</config>


 
The problem occurred with this configuration on 3.6.16, but unfortunately I don't have logs of that.
I gave up on 3.6.16 the moment I saw the exact same behaviour with it: the cluster not forming and the same "discarded STABILITY message" print in the logs, so I assumed it is the same problem.
The only reason to try 3.6.16 was that it would (probably) integrate more easily into wildfly-9. I already knew by that time that 4.0.14 didn't show the failures.
The same config was used with 4.0.14.

I didn't use the SampleChat app, but a very simple application; I posted the code in my message of 10/02.
Testing was always a single instance per physical machine, each machine in a different VLAN.
The test app has an embedded webserver (com.sun.net.httpserver.HttpServer) which allows me to stop it remotely and read its view remotely.
Together with a sort-of watchdog batch script (which monitors the JGroups app) I could remotely restart random nodes in the cluster and check whether they rejoined: a restart was effectively stopping the Java application and letting the watchdog script start it again.
I can mail you all of this if you want; it's pretty straightforward actually.
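
(Not the actual code from my app, just a minimal sketch of what such a view endpoint could look like with com.sun.net.httpserver; the path and names are placeholders:)

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import org.jgroups.JChannel;

// Expose the channel's current view over HTTP so an external watchdog
// script can poll it and check whether the node has rejoined the cluster.
public class ViewEndpoint {
    public static HttpServer start(JChannel channel, int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/view", exchange -> {
            byte[] body = String.valueOf(channel.getView()).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}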
The problem occurred with even 2 nodes. With 4.0.14 I haven't seen it fail with up to 23 nodes, which was the maximum number of machines we had to test with.
We were able to rule out almost everything:
* network/network config, such as routing tables
* JRE version
* OS version (Windows in this case: 7 or 10)
* network infrastructure, cables
Everything checks out. Changing only the version of JGroups makes the difference between working and not working.
The only logs generated by JGroups are very low-level, with no errors or warnings, yet the cluster is not formed.
In the failure case, the 'discarded STABILITY message' print is always there, repeatedly, and the cluster is never formed.
With 4.0.14 I see those prints too, but it seems to resolve in the end: the cluster is always formed.

Thomas


On Friday, October 12, 2018 at 17:27:38 UTC+2, Bela Ban wrote:
Bela Ban

unread,
Oct 15, 2018, 6:32:32 AM10/15/18
to jgrou...@googlegroups.com


On 27/09/18 12:38 PM, Thomas Houtekier wrote:
> Maybe interesting to add is that this happens on one particular setup.

So what's the difference between the 2 setups? Same code base, same
config, same network?

You're saying that this works on one setup, but not on another one?

> We have been trying to reproduce this situation on another setup but
> have never seen this problem, even with far more nodes.
If everything is the same, but the network/OS(?) is different, I'm
inclined to believe that this is a network issue...

> On a setup with 35 nodes we ran a test that constantly randomly
> restarts nodes and verifies if the group is complete again (after the
> restart). It ran fine for multiple days.

Was this with 3.6.4, or with 4.0.x?

> Initially I suspected a network-issue, but that has been ruled out:
> multicast is correctly configured and wireshark shows that messages
> are received from all nodes. The logs show that too actually.
>
> The problem (the restarted node not rejoining the group) becomes
> apparent when you start adding things to the replicated
> infinispan-cache, which is backed by the jgroups-group.
> This causes long blocking calls and eventually replication-timeout
> exceptions which is the reason why this not-joining is a problem: the
> replicated cache becomes unusable.
>
>
> On Wednesday 26 September 2018 at 20:34:58 UTC+2, Thomas Houtekier wrote:
>
> I have a problem with the following setup on wildfly (9.0.2.Final)
> which includes Jgroups-3.6.4.Final.
> When restarting 1 of the nodes, it disappears from the group and
> it never re-joins, eventhough the logs indicate that we receive
> messages from that node.
> The setup is supposed to have +40 nodes but because it is nowhere
> near stable it is currenlty downscaled to 6. But the same problems
> happen at this scale.
>
> These are the logs from node 172.25.14.2 when 172.25.22.2 is
> restarting:
> Before the shutdown, we see a group of 6 members
>
> |
> Sep:26,18:06:59,255INFO
> (:Incoming-11,ee-interserver,172.25.14.2:)[org.infinispan.remoting.transport.jgroups.JGroupsTransport]ISPN000094:Receivednewcluster
> view forchannel
> interserver:[172.25.14.2|13](6)[172.25.14.2,172.25.11.2,172.25.5.2,172.25.3.2,172.25.22.2,172.25.1.2]
> |
>
> node 172.25.22.2 is shutting down and gracefully leaving the cluster:
> |
> Sep:26,18:12:27,087FINER
> (:TransferQueueBundler,ee-interserver,172.25.14.2:)[org.jgroups.protocols.UDP]172.25.14.2:sending
> 1msgs (124bytes (0.40of max_bundle_size)to
> 1dests(s):[ee-interserver:172.25.22.2]
> Sep:26,18:12:27,112FINER
> (:OOB-10,ee-interserver,172.25.14.2:)[org.jgroups.protocols.UDP]172.25.14.2:received
> [dst:172.25.14.2,src:172.25.22.2(3headers),size=0bytes,flags=OOB],headers
> are
> GMS:GmsHeader[LEAVE_REQ]:mbr=172.25.22.2,UNICAST3:DATA,seqno=10,conn_id=11,UDP:[cluster_name=ee-interserver]
> Sep:26,18:12:27,112FINER
> (:OOB-10,ee-interserver,172.25.14.2:)[org.jgroups.protocols.UNICAST3]172.25.14.2<--DATA(172.25.22.2:#10,
> conn_id=11)
> Sep:26,18:12:27,112FINER
> (:OOB-10,ee-interserver,172.25.14.2:)[org.jgroups.protocols.UNICAST3]172.25.14.2:delivering
> 172.25.22.2#10
> Sep:26,18:12:27,162FINER
> (:ViewHandler,ee-interserver,172.25.14.2:)[org.jgroups.protocols.pbcast.GMS]172.25.14.2:joiners=[],suspected=[],leaving=[172.25.22.2],newview:[172.25.14.2|14](5)[172.25.14.2,172.25.11.2,172.25.5.2,172.25.3.2,172.25.1.2]
> Sep:26,18:12:27,162FINER
> (:ViewHandler,ee-interserver,172.25.14.2:)[org.jgroups.protocols.pbcast.GMS]172.25.14.2:sending
> LEAVE response to 172.25.22.2
> Sep:26,18:12:27,162FINER
> (:ViewHandler,ee-interserver,172.25.14.2:)[org.jgroups.protocols.UDP]172.25.14.2:sending
> msg to 172.25.22.2,src=172.25.14.2,headers are
> GMS:GmsHeader[LEAVE_RSP],UDP:[cluster_name=ee-interserver]
> Sep:26,18:12:27,162FINER
> (:ViewHandler,ee-interserver,172.25.14.2:)[org.jgroups.protocols.pbcast.GMS]172.25.14.2:mcasting
> view
> [172.25.14.2|14](5)[172.25.14.2,172.25.11.2,172.25.5.2,172.25.3.2,172.25.1.2](5mbrs)
>
>
> Sep:26,18:12:27,164FINER
> (:Incoming-18,ee-interserver,172.25.14.2:)[org.jgroups.protocols.pbcast.GMS]172.25.14.2:received
> delta view
> [172.25.14.2|14],ref-view=[172.25.14.2|13],left=[172.25.22.2]
> Sep:26,18:12:27,164FINE
> (:Incoming-18,ee-interserver,172.25.14.2:)[org.jgroups.protocols.pbcast.GMS]172.25.14.2:installing
> view
> [172.25.14.2|14](5)[172.25.14.2,172.25.11.2,172.25.5.2,172.25.3.2,172.25.1.2]
> Sep:26,18:12:27,164FINER
> (:Incoming-18,ee-interserver,172.25.14.2:)[org.jgroups.protocols.UNICAST3]172.25.14.2:closing
> connections of non members [172.25.22.2]
> Sep:26,18:12:27,164FINE
> (:Incoming-18,ee-interserver,172.25.14.2:)[org.jgroups.protocols.pbcast.NAKACK2]172.25.14.2:removed
> 172.25.22.2fromxmit_table (notmember anymore)
>
>
> Sep:26,18:12:27,165FINER
> (:Incoming-18,ee-interserver,172.25.14.2:)[org.jgroups.protocols.FRAG2]172.25.14.2:removed
> 172.25.22.2fromfragmentation table
>
> |
>
>
> So far so good.
> Some time later, 22.2 starts again, but is never added to the
> group, even though I do see messages like this which indicate
> communication between 14.2 and 22.2.
> |
> Sep:26,18:12:27,180 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending msg to 172.25.22.2, src=172.25.14.2, headers are UNICAST3: ACK, seqno=10, conn_id=11, ts=22, UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:27,180 FINER (:TransferQueueBundler,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending 1 msgs (70 bytes (0.23 of max_bundle_size) to 1 dests(s): [ee-interserver:172.25.22.2]
> (...)
> Sep:26,18:12:31,691 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UNICAST3] 172.25.14.2 --> XMIT(172.25.22.2: #3)
> Sep:26,18:12:31,691 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending msg to 172.25.22.2, src=172.25.14.2, headers are RequestCorrelator: id=200, type=RSP, id=12, rsp_expected=false, FORK: ee-interserver:interserver, UNICAST3: DATA, seqno=3, conn_id=6, UDP: [cluster_name=ee-interserver]
> |
>
>
> I'm assuming the following message is the reason why the new node
> is not added to the group, but I don't know how to interpret this.
> As you can see the new node is present in this STABLE message. But
> since it is discarded, it is not processed in the digest.
>
> |
> Sep:26,18:12:32,132 FINER (:INT-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.STABLE] *_172.25.14.2: discarded STABILITY message with different view-id_* [172.25.16.2|0] (my view-id=[172.25.14.2|14] (5) [172.25.14.2, 172.25.11.2, 172.25.5.2, 172.25.3.2, 172.25.1.2])
> Sep:26,18:12:32,192 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UNICAST3] 172.25.14.2 --> XMIT(172.25.22.2: #3)
> Sep:26,18:12:32,192 FINER (:Timer-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending msg to 172.25.22.2, src=172.25.14.2, headers are RequestCorrelator: id=200, type=RSP, id=12, rsp_expected=false, FORK: ee-interserver:interserver, UNICAST3: DATA, seqno=3, conn_id=6, UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:32,192 FINER (:TransferQueueBundler,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: sending 1 msgs (124 bytes (0.40 of max_bundle_size) to 1 dests(s): [ee-interserver:172.25.22.2]
> Sep:26,18:12:32,275 FINER (:INT-1,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: received [dst: <null>, src: 172.25.1.2 (2 headers), size=0 bytes, flags=INTERNAL], headers are MERGE3: INFO: view_id=[172.25.14.2|14], logical_name=172.25.1.2, physical_addr=172.25.1.2:55201, UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:32,571 FINER (:INT-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.UDP] 172.25.14.2: received [dst: 172.25.14.2, src: 172.25.3.2 (2 headers), size=103 bytes, flags=OOB|NO_RELIABILITY|INTERNAL], headers are STABLE: [STABLE_GOSSIP] view-id= [172.25.14.2|14], UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:32,571 FINER (:INT-2,ee-interserver,172.25.14.2:) [org.jgroups.protocols.pbcast.STABLE] 172.25.14.2: handling digest from 172.25.3.2:
> mine:   172.25.14.2: [23], 172.25.11.2: [1], 172.25.5.2: [28], 172.25.3.2: [1], 172.25.1.2: [6]
> other:  172.25.14.2: [23], 172.25.11.2: [1], 172.25.5.2: [28], 172.25.3.2: [1], 172.25.1.2: [6]
> result: 172.25.14.2: [23], 172.25.11.2: [1], 172.25.5.2: [28], 172.25.3.2: [1], 172.25.1.2: [6]
> |
>
> I found the code where this message is printed
> (https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/pbcast/STABLE.java)
> but I can't figure out what the problem is.
> It is printed multiple times and it seems to be the reason why
> that new node never joins the group.
>
>
> The logs from the node that fails to join the group (the other
> side, so to speak):
> At startup:
>
> |
> Sep:26,18:12:48,707 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] 172.25.22.2: stable task started
> Sep:26,18:12:48,756 FINE (:MSC service thread 1-5:) [org.jgroups.protocols.UDP] sockets will use interface 172.25.22.2
> Sep:26,18:12:48,762 FINE (:MSC service thread 1-5:) [org.jgroups.protocols.UDP] socket information:
> mcast_addr=239.249.0.82:45691, bind_addr=/172.25.22.2, ttl=24
> sock: bound to 172.25.22.2:55201, receive buffer size=20000000, send buffer size=640000
> mcast_sock: bound to 172.25.22.2:45691, send buffer size=640000, receive buffer size=25000000
> Sep:26,18:12:48,775 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.UDP] 172.25.22.2: sending msg to null, src=172.25.22.2, headers are PING: [type=GET_MBRS_REQ, cluster=ee-interserver], UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:48,937 INFO (:ServerServiceThreadPool--8:) [org.jboss.as.jpa] WFLYJPA0010: Starting Persistence Unit (phase 2 of 2) Service 'com.barco.nms.server.ear#firebird'
> Sep:26,18:12:48,978 FINER (:TransferQueueBundler,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: sending 1 msgs (111 bytes (0.36 of max_bundle_size) to 1 dests(s): [ee-interserver]
> (...)
> Sep:26,18:12:51,421 FINER (:INT-1,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: 172.25.14.2 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:51,425 FINER (:INT-2,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: 0591b4e3-74b3-7655-1d43-74e949444c2a (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:51,568 FINER (:INT-1,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: eb6746f9-ad4f-d89f-7522-95e2e9a90070 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:51,777 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.GMS] 172.25.22.2: no members discovered after 3007 ms: creating cluster as first member
> Sep:26,18:12:51,781 FINE (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.NAKACK2]
> [172.25.22.2 setDigest()]
> existing digest:  []
> new digest:       172.25.22.2: [0 (0)]
> resulting digest: 172.25.22.2: [0 (0)]
> Sep:26,18:12:51,781 FINE (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.GMS] 172.25.22.2: installing view [172.25.22.2|0] (1) [172.25.22.2]
> Sep:26,18:12:51,783 FINE (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
> Sep:26,18:12:51,784 FINE (:MSC service thread 1-5:) [org.jgroups.protocols.FD_SOCK] 172.25.22.2: VIEW_CHANGE received: [172.25.22.2]
> Sep:26,18:12:51,788 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] 172.25.22.2: reset digest to 172.25.22.2: [-1]
> Sep:26,18:12:51,789 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.UFC] new membership: [172.25.22.2]
> Sep:26,18:12:51,790 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.MFC] new membership: [172.25.22.2]
> Sep:26,18:12:51,791 FINER (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] 172.25.22.2: reset digest to 172.25.22.2: [-1]
> Sep:26,18:12:51,791 FINE (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
> Sep:26,18:12:51,792 FINE (:MSC service thread 1-5:) [org.jgroups.protocols.pbcast.GMS] 172.25.22.2: created cluster (first member). My view is [172.25.22.2|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
> |
>
>
> First the FD_ALL heartbeat from 14.2 (which shows that messages
> are being received by 22.2), and then the same 'discarded' message:
> |
> Sep:26,18:12:54,422 FINER (:INT-1,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: 172.25.14.2 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:55,093 FINER (:INT-2,ee-interserver,172.25.22.2:) [org.jgroups.protocols.UDP] 172.25.22.2: received [dst: <null>, src: 172.25.14.2 (2 headers), size=103 bytes, flags=OOB|NO_RELIABILITY|INTERNAL], headers are STABLE: [STABILITY] view-id=[172.25.14.2|14], UDP: [cluster_name=ee-interserver]
> Sep:26,18:12:55,094 FINER (:INT-2,ee-interserver,172.25.22.2:) [org.jgroups.protocols.pbcast.STABLE] 172.25.22.2: discarded STABILITY message with different view-id [172.25.14.2|14] (my view-id=[172.25.22.2|0] (1) [172.25.22.2])
> |
>
>
> I don't find any errors or obviously wrong things in this, but the
> result is that the restarted node (22.2) doesn't rejoin the group.
> Both ends are using the same config (as I said, it is on wildfly):
> |
> <stack name="udp-interserver">
>     <transport type="UDP" socket-binding="jgroups-interserver-udp">
>         <property name="ip_ttl">24</property>
>         <property name="log_discard_msgs">false</property>
>     </transport>
>     <protocol type="PING"/>
>     <protocol type="MERGE3"/>
>     <protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>
>     <protocol type="FD_ALL"/>
>     <protocol type="VERIFY_SUSPECT"/>
>     <protocol type="pbcast.NAKACK2"/>
>     <protocol type="UNICAST3"/>
>     <protocol type="pbcast.STABLE"/>
>     <protocol type="pbcast.GMS"/>
>     <protocol type="UFC"/>
>     <protocol type="MFC"/>
>     <protocol type="FRAG2"/>
>     <protocol type="RSVP"/>
> </stack>
> |
>
>
> Am I missing something obvious here? Any information that can help
> me understand this problem will be much appreciated.
>
>

--
Bela Ban, JGroups lead (http://www.jgroups.org)

Bela Ban

unread,
Oct 15, 2018, 6:34:50 AM10/15/18
to jgrou...@googlegroups.com


On 01/10/18 5:30 AM, Thomas Houtekier wrote:
> Thanks, I'll get back here when (if) I can reproduce this with a
> jgroups-only application.
> In the meantime, I have this question: Where do I find info on bugs
> #943480 and #938584?

These were old bugs tracked by a SourceForge tracking system, long
before I switched to JIRA.

>
> The message I'm concerned about ('discarded STABILITY message with
> different view-id') is printed in STABLE.java, and has this comment on it:
>
> |
> // we won't handle the stable_digest, if its members don't match the membership in my own digest,
> // this is part of the fix for the NAKACK problem (bugs #943480 and #938584)
> if(!view_id.equals(this.view.getViewId())) {
>     log.trace("%s: discarded STABILITY message with different view-id %s (my view-id=%s)",
>               local_addr, view_id, view);
>     // resetDigest();
>     return;
> }
> log.trace("%s: received stability msg from %s: %s", local_addr, sender, printDigest(stable_digest));
>
> num_stability_msgs_received++;
> resetDigest();
> |
>
> I would like to know the context of this. Is it for instance an option
> to remove this check and call resetDigest() anyway?
> It's difficult to know what to expect if I don't know the background
> of this condition. Where can I find info on those 2 bugs #943480 and
> #938584?

I don't think this is the root cause. This is just a delayed reception
of a STABILITY message, and will likely succeed the next time around.

--

Thomas Houtekier

unread,
Oct 15, 2018, 7:31:16 AM10/15/18
to jgroups-dev


On Monday 15 October 2018 at 12:32:32 UTC+2, Bela Ban wrote:


On 27/09/18 12:38 PM, Thomas Houtekier wrote:
> Maybe interesting to add is that this happens on one particular setup.

So what's the difference between the 2 setups? Same code base, same
config, same network?

You're saying that this works on one setup, but not on another one?
 
Yes, that is exactly what I'm saying: it works on setup A but not on setup B.
There are quite some differences. In order to rule out most of them, we replicated as much as possible from setup B, but failed to reproduce the same thing.
To narrow down the cause as much as possible, I created the test application I mentioned before, with only jgroups. With that application, the problem could be reproduced on setup B, but not on setup A.
On setup B, the issue disappeared when switching to jgroups-4.0.14 (without making any other changes).
I realize this sounds very much like a problem with that particular setup, but I can't think of anything anymore, so I'm really doubting this hypothesis. I literally traveled halfway across the globe to have physical access to that setup to try and pinpoint the root cause. The only conclusion that still holds after all the experiments is that it works with 4.0.14, but not with 3.6.4 (or 3.6.16 for that matter).
Even if in the end it turns out to be caused by something special about that setup, I still need to know exactly what causes this.
 

> We have been trying to reproduce this situation on another setup but
> have never seen this problem, even with far more nodes.
If everything is the same, but the network/OS(?) is different, I'm
inclined to believe that this is a network issue...

> On a setup with 35 nodes we ran a test that constantly randomly
> restarts nodes and verifies if the group is complete again (after the
> restart). It ran fine for multiple days.

Was this with 3.6.4, or with 4.0.x?
This test was with the complete application stack (on wildfly-9), so with jgroups-3.6.4.
 

Thomas Houtekier

unread,
Oct 15, 2018, 7:36:07 AM10/15/18
to jgroups-dev


On Monday 15 October 2018 at 12:34:50 UTC+2, Bela Ban wrote:
Ok, but it is interesting nevertheless: there is a very important correlation between this print in the logs and the issue. Jgroups keeps on printing this 'discarded STABILITY' message (the discarded message contains the name of the node that never joins the cluster).
You say that it will likely succeed the next time.
What can cause the merging to never happen on 3.6.4?
On 4.0.14 I see the print too, and there indeed, as I mentioned before, the merging happens in the end.


 

Bela Ban

unread,
Oct 15, 2018, 8:01:40 AM10/15/18
to jgrou...@googlegroups.com


On 15/10/18 1:31 PM, Thomas Houtekier wrote:
>
>
> Op maandag 15 oktober 2018 12:32:32 UTC+2 schreef Bela Ban:
>
>
>
> On 27/09/18 12:38 PM, Thomas Houtekier wrote:
> > Maybe interesting to add is that this happens on one particular
> setup.
>
> So what's the difference between the 2 setups? Same code base, same
> config, same network?
>
> You're saying that this works on one setup, but not on another one?
>
> Yes, that is exactly what I'm saying: it works on setup A but not on
> setup B.
> There are quite some difference. In order to rule out most of them, we
> replicated as much as possible from setup B, but failed to reproduce
> the same thing.
> To narrow down the cause as much as possible, I created the
> test-application I mentioned before, with only jgroups. With that
> application, the problem could be reproduced on setup on setup B, but
> not on setup A.

So what's setup A and setup B? Different networks?

Can you re-create the setup on which this fails? If I can recreate it,
and reproduce the issue, then I can fix it.
Note that I'll only look at 3.6.16, but *not* 3.6.4. I don't want to
spend time fixing something that I've (possibly) fixed a long time ago.

So, using the code and config you posted, how do I reproduce this? You
mentioned a watchdog process which kills members? That's then *not* a
graceful leave, but a kill-leave...


> On setup B, the issue disappeared when switching to jgroups-4.0.14
> (without doing any other changes).
> I realize this sounds very much as a problem with that particular
> setup, but I can't think of anything anymore, so I'm really doubting
> this hypothesis. I literally traveled half way across the globe to
> have physical access to that setup to try and pinpoint the rootcause.
> The only conclusion that still holds after all the experiments is that
> it works with 4.0.14, but not with 3.6.4 (or 3.6.16 for that matter)
> Even if in the end it turns out to be caused by something special
> about that setup, I still need to know exactly what causes this.
>
>
> > We have been trying to reproduce this situation on another setup
> but
> > have never seen this problem, even with far more nodes.
> If everything is the same, but the network/OS(?) is different, I'm
> inclined to believe that this is a network issue...
>
> > On a setup with 35 nodes we ran a test that constantly randomly
> > restarts nodes and verifies if the group is complete again
> (after the
> > restart). It ran fine for multiple days.
>
> Was this with 3.6.4, or with 4.0.x?
>
> This tests was with the complete application-stack (on wildfly-9), so
> with jgroups-3.6.4.
>

Thomas Houtekier

unread,
Oct 15, 2018, 10:34:10 AM10/15/18
to jgroups-dev


On Monday 15 October 2018 at 14:01:40 UTC+2, Bela Ban wrote:
A and B are just names to easily identify the setups where it was working and where it wasn't. B is the setup where I have the problem; A is basically every other setup where I tried to reproduce it.
On setup B I can reproduce the problem easily with a reproduction-rate of 100%. Unfortunately I don't have direct access to it (any more).
So no, I don't think you can easily reproduce it on your end. Unless the logs I provided earlier in attachment give you more insight as to why it behaves so peculiarly. If so, I can try to mimic those circumstances.

I understand you consider 3.6.4 obsolete, but as far as this problem goes, I saw 3.6.16 behave in the exact same way.  
Unfortunately I don't have detailed logging of this problem with 3.6.16, but if you really find it necessary, I can contact someone on-site and instruct him how to reproduce the problem and collect logs with jgroups 3.6.16.
In the meantime, I can only give you the logs I attached earlier (last week). I'm afraid that's the closest thing to reproducing I can provide you with.
The testApplication I used does try to do an actual graceful leave (by means of channel.close()) before calling System.exit. This graceful exit is triggered by the embedded webserver, as I explained before.
I didn't mention a watchdog script that kills members: the accompanying 'watchdog' script is there to make sure that the testApplication is started again after it has stopped itself (every instance has such a watchdog). This allows observing whether the member rejoins the group after a restart.
Even if it had been a kill of the process, it should eventually rejoin when restarted too, shouldn't it?

I repeat, I realize it sounds pretty weird, but I really wouldn't bother you if I hadn't already tried everything I could come up with. This is the furthest I have narrowed it down: it works with 4.0.14 and it doesn't with 3.6.4 (or 3.6.16). I have trace logging of the 3.6.4 run, which looks very similar to what I saw with 3.6.16. I was hoping that (based on those logs) you could have some ideas on what could explain this. I can run other tests too on that setup, if you think that is valuable. However, having physical access to that setup is a bit of a problem and not really possible.

Thomas Houtekier

unread,
Oct 15, 2018, 4:14:17 PM10/15/18
to jgroups-dev
In an attempt to make things a bit clearer, I combined the logfiles I posted earlier.
This is a test with 3 nodes. They fail to form a cluster.
These are the times when the nodes are started:

node13:
Line 1:    Oct 05, 2018 12:13:30 PM testinteror.jgroups.TestAppJgroups main WARNING: starting
Line 2160: Oct 05, 2018 12:32:06 PM testinteror.jgroups.TestAppJgroups main WARNING: starting

node8:
Line 1:    Oct 05, 2018 12:17:36 PM testinteror.jgroups.TestAppJgroups main WARNING: starting
Line 1333: Oct 05, 2018 12:26:12 PM testinteror.jgroups.TestAppJgroups main WARNING: starting
Line 1608: Oct 05, 2018 12:28:38 PM testinteror.jgroups.TestAppJgroups main WARNING: starting

node14.log:
Line 1:    Oct 05, 2018 12:22:29 PM testinteror.jgroups.TestAppJgroups main WARNING: starting
Line 685:  Oct 05, 2018 12:30:45 PM testinteror.jgroups.TestAppJgroups main WARNING: starting
Line 1813: Oct 05, 2018 12:34:01 PM testinteror.jgroups.TestAppJgroups main WARNING: starting



As you can see, by the time node14 starts the first time, 8 and 13 are already started.
These are the views from 8 and 13 by the time 14 starts:
Node13:
Oct 05, 2018 12:18:06 PM org.jgroups.protocols.FlowControl handleViewChange
FINER: new membership: [172.25.8.2, 172.25.13.2]

node8:
Oct 05, 2018 12:18:27 PM org.jgroups.protocols.pbcast.Merger$MergeTask consolidateMergeData
FINER: 172.25.8.2: consolidated view=MergeView::[172.25.8.2|1] (2) [172.25.8.2, 172.25.13.2], 2 subgroups: [172.25.8.2|0] (1) [172.25.8.2], [172.25.13.2|0] (1) [172.25.13.2]


So far so good.
When 14 starts (at Oct 05, 2018 12:22:29), things start to go wrong.
First, both 13 and 8 receive a GET_MBRS_REQ within 20 s. That's still fine, I suppose.
node13:
FINER: 172.25.13.2: received GET_MBRS_REQ from 172.25.14.2, sending response 172.25.13.2, name=172.25.13.2, addr=172.25.13.2:65413, server

Oct 05, 2018 12:22:30 PM org.jgroups.protocols.TP down
FINER: 172.25.13.2: sending msg to 172.25.14.2, src=172.25.13.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]

Oct 05, 2018 12:22:30 PM org.jgroups.protocols.TP$BaseBundler sendBundledMessages

node8:
Oct 05, 2018 12:22:51 PM org.jgroups.protocols.Discovery sendDiscoveryResponse
FINER: 172.25.8.2: received GET_MBRS_REQ from 172.25.14.2, sending response 172.25.8.2, name=172.25.8.2, addr=172.25.8.2:60314, coord

Oct 05, 2018 12:22:51 PM org.jgroups.protocols.TP down
FINER: 172.25.8.2: sending msg to 172.25.14.2, src=172.25.8.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]

There are other messages too: 
(on node8)
Oct 05, 2018 12:23:20 PM org.jgroups.protocols.TP passMessageUp
FINER: 172.25.8.2: received [dst: <null>, src: 172.25.14.2 (2 headers), size=0 bytes, flags=INTERNAL], headers are MERGE3: INFO: view_id=[172.25.14.2|0], logical_name=172.25.14.2, physical_addr=172.25.14.2:53184, UDP: [cluster_name=interOR-test-cluster]

Oct 05, 2018 12:23:20 PM org.jgroups.protocols.TP passMessageUp
FINER: 172.25.8.2: received [dst: <null>, src: 172.25.14.2 (2 headers), size=0 bytes, flags=INTERNAL], headers are MERGE3: INFO: view_id=[172.25.14.2|0], logical_name=172.25.14.2, physical_addr=172.25.14.2:53184, UDP: [cluster_name=interOR-test-cluster]

Oct 05, 2018 12:23:23 PM org.jgroups.protocols.TP down


but then: node8 decides it is the merge leader, with 2 participants, but for some reason node13 is not considered here: only 14 and 8.
Oct 05, 2018 12:24:03 PM org.jgroups.protocols.MERGE3$ViewConsistencyChecker _run
FINE: I (172.25.8.2) will be the merge leader

Oct 05, 2018 12:24:03 PM org.jgroups.protocols.MERGE3$ViewConsistencyChecker _run
FINER: merge participants are [172.25.8.2, 172.25.14.2]

To me that is already very strange, but I don't know if this is really *the* problem.
The test restarts a number of nodes and in the end (after a restart of node14), they seem to form a cluster.
This is something I observed every time: when restarting the node that is the merge leader at that time, everything resolves: the complete cluster is formed.
You can see this happening at the end of the logs, which are in attachment.

I know this is on the obsolete 3.6.4, but can you please give some feedback on this last part: node8 has a view of [172.25.8.2, 172.25.13.2], but when becoming merge leader, it only considers [172.25.8.2, 172.25.14.2]. Is this to be expected?
It would also be very interesting to know how come the cluster is completely formed in the end, after restarting node14.

node13:
Oct 05, 2018 12:34:52 PM org.jgroups.protocols.pbcast.GMS up
FINER: 172.25.13.2: received full view: MergeView::[172.25.14.2|6] (3) [172.25.14.2, 172.25.8.2, 172.25.13.2], 3 subgroups: [172.25.13.2|0] (1) [172.25.13.2], [172.25.8.2|5] (1) [172.25.8.2], [172.25.14.2|0] (1) [172.25.14.2]

node14:
Oct 05, 2018 12:34:52 PM org.jgroups.protocols.pbcast.GMS up
FINER: 172.25.14.2: received full view: MergeView::[172.25.14.2|6] (3) [172.25.14.2, 172.25.8.2, 172.25.13.2], 3 subgroups: [172.25.13.2|0] (1) [172.25.13.2], [172.25.8.2|5] (1) [172.25.8.2], [172.25.14.2|0] (1) [172.25.14.2]

node8:
Oct 05, 2018 12:35:13 PM org.jgroups.protocols.pbcast.GMS up
FINER: 172.25.8.2: received full view: MergeView::[172.25.14.2|6] (3) [172.25.14.2, 172.25.8.2, 172.25.13.2], 3 subgroups: [172.25.13.2|0] (1) [172.25.13.2], [172.25.8.2|5] (1) [172.25.8.2], [172.25.14.2|0] (1) [172.25.14.2]


The test application can be found here (together with the watchdog script, so it is clear what I'm using): https://github.com/thomashoutekier/jgroupsTestApplication

Thanks


On Monday 15 October 2018 at 16:34:10 UTC+2, Thomas Houtekier wrote:
3Nodes.zip

Bela Ban

unread,
Oct 16, 2018, 4:21:10 AM10/16/18
to jgrou...@googlegroups.com
So initial discovery *never* works; if you grep for GET_MBRS_RSP (see
below), you can see that nobody gets a response from anybody else (only
from itself)!

Only later do the members merge. Suspicions:

- IP multicasting is not working correctly

- Your thread pool discards discovery requests because it is full
(unlikely)


grep GET_MBRS_RSP *.log
node13.log:FINER: 172.25.13.2: sending msg to 172.25.8.2, src=172.25.13.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node13.log:FINER: 172.25.13.2: sending msg to 172.25.14.2, src=172.25.13.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node13.log:FINER: 172.25.13.2: sending msg to 172.25.8.2, src=172.25.13.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node13.log:FINER: 172.25.13.2: sending msg to 172.25.8.2, src=172.25.13.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node13.log:FINER: 172.25.13.2: sending msg to 172.25.14.2, src=172.25.13.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node13.log:FINER: 172.25.13.2: sending msg to 172.25.14.2, src=172.25.13.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node14.log:FINER: 172.25.14.2: sending msg to 172.25.8.2, src=172.25.14.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node14.log:FINER: 172.25.14.2: sending msg to 172.25.8.2, src=172.25.14.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node14.log:FINER: 172.25.14.2: sending msg to 172.25.13.2, src=172.25.14.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node8.log:FINER: 172.25.8.2: sending msg to 172.25.14.2, src=172.25.8.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node8.log:FINER: 172.25.8.2: sending msg to 172.25.14.2, src=172.25.8.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node8.log:FINER: 172.25.8.2: sending msg to 172.25.13.2, src=172.25.8.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
node8.log:FINER: 172.25.8.2: sending msg to 172.25.14.2, src=172.25.8.2, headers are PING: [type=GET_MBRS_RSP], UDP: [cluster_name=interOR-test-cluster]
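To rule the IP-multicast suspicion in or out independently of discovery, run a bare multicast receiver on one box and send to the same group/port from another (JGroups also ships small test programs for this, e.g. org.jgroups.tests.McastReceiverTest / McastSenderTest). Below is a minimal JDK-only sketch; the group and port are example values taken from the earlier logs, adjust them to whatever your config actually uses:

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

// Receiver side: start this on one machine, then send datagrams to the same
// group/port from another machine; if nothing arrives here, IP multicast
// (or a firewall in between) is the problem, not JGroups.
public class McastCheck {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName("239.249.0.82"); // example group from the logs
        try (MulticastSocket sock = new MulticastSocket(45691)) {  // example port from the logs
            sock.joinGroup(group);
            byte[] buf = new byte[1024];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                sock.receive(packet);
                System.out.println("received " + packet.getLength() + " bytes from " + packet.getAddress());
            }
        }
    }
}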


On 15/10/18 10:14 PM, Thomas Houtekier wrote:
> [previous message quoted in full, trimmed]

Thomas Houtekier

unread,
Oct 16, 2018, 4:53:32 AM10/16/18
to jgroups-dev
Great, thanks.

I see that indeed. I don't know if both GET_MBRS_REQ and GET_MBRS_RSP are always printed (on both send and receive), but it certainly looks as if everyone is sending the GET_MBRS_RSP, but no one receives any.
If it is a multicast problem (that was actually the initial hypothesis, but then I would expect it to always fail):
* how come it works with 4.0.14 (same config)?
* how come it sometimes works: by the end of the test (after restarting node14), all 3 see each other
* what's the role of the merge leader in this: whenever the merge leader is restarted, the problem resolves.

As far as multicast is concerned, I always assumed jgroups uses the same address and port for all its protocols, so that if there is a problem with multicast, nothing works at all.
The very first thing I checked on that setup was subscribing to that multicast group with Wireshark, to be sure that the UDP packets are actually there.
Is PING maybe using another port or even another socket? Could the firewall be involved in this? But then again, how come the cluster is formed in the end?

How can I prove/disprove that it has anything to do with the thread pool? Is it configurable, for instance?
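From the transport documentation the pools do look tunable, with attributes along these lines on the UDP element; this is only a sketch in the plain jgroups-config style of my test app, with purely illustrative values (I don't know yet whether the pool is actually involved here):

<UDP ip_ttl="24"
     thread_pool.min_threads="2"
     thread_pool.max_threads="30"
     thread_pool.queue_max_size="10000"
     thread_pool.rejection_policy="run"/>
<!-- illustrative values: a larger regular pool, and rejected messages run in the
     caller's thread instead of being silently discarded when the queue is full;
     all other transport attributes are left at their defaults here -->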

Thanks

On Tuesday 16 October 2018 at 10:21:10 UTC+2, Bela Ban wrote:

Thomas Houtekier

unread,
Nov 20, 2018, 3:58:14 AM11/20/18
to jgroups-dev

We found out that a firewall rule was wrongly configured, which caused the server to never receive the GET_MBRS_RSP message: there was a firewall rule, but on the wrong UDP port.

Then it was still not working. We found a similar problem related to the FD_SOCK protocol.

For that protocol, only a server port was configured (and enabled in the firewall), but the client_bind_port was not configured. It isn't configured in the default config of wildfly either. The result was that it used a random TCP port to connect to its neighbour, which was not allowed by the firewall. The outcome was (quite similar to the UDP config problem) that a connection (for FD_SOCK) could be created, but a packet sent from the server to the client was never received by jgroups because of the firewall.
By specifying the client_bind_port for FD_SOCK and configuring a rule in the firewall for it, we got it working.
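For reference, the change amounts to something like this in the stack definition (the port number here is only an example; use whatever port the firewall rule is opened for):

<protocol type="FD_SOCK" socket-binding="jgroups-udp-fd">
    <!-- example value: pin the client-side socket to a fixed port so a firewall
         rule can be created for it (the default of 0 picks a random ephemeral port) -->
    <property name="client_bind_port">57800</property>
</protocol>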

Thomas

On Tuesday 16 October 2018 at 10:53:32 UTC+2, Thomas Houtekier wrote: