JChannel.connect hangs when one of the nodes in the cluster is shut down and started again


Varada

Aug 5, 2018, 1:41:31 PM
to jgroups-dev
Hi,


I am seeing a strange problem that occurs intermittently. I have a cluster of 10 nodes. If one of the nodes is brought down and started again while the other nodes are up,
the call to JChannel.connect does not return. The process hangs and the thread dump shows it is stuck in JChannel.connect.

JGroups version used: jgroups-3.4.0.Alpha2.jar
Environment: VM running RHEL 7
Transport protocol: TCP
The properties used are below:

TCP(bind_addr=HOST_A;bind_port=37062):TCPPING(initial_hosts=HOST_A[37062],HOST_B[37087],HOST_C[37091],HOST_D[37095],HOST_D[37062],HOST_E[37087],HOST_F[37091],HOST_G[37095],HOST_H[37099];port_range=0;timeout=15000;num_initial_members=2):MERGE2(min_interval=3000;max_interval=5000):FD_ALL(interval=15000;timeout=20000):FD(timeout=15000;max_tries=48;level=ERROR):VERIFY_SUSPECT(timeout=15000):pbcast.NAKACK(retransmit_timeout=100,200,300,600,1200,2400,4800;discard_delivered_msgs=true):pbcast.STABLE(stability_delay=1000;desired_avg_gossip=20000;max_bytes=0):pbcast.GMS(print_local_addr=true;join_timeout=15000)

With the TCPPING timeout of 15000, the stack trace showed:


3XMTHREADINFO      "FelixStartLevel" J9VMThread:0x0000000001FA4800, j9thread_t:0x00007F4D08099930, java/lang/Thread:0x00000000E11706D8, state:P, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1F, isDaemon:true)
3XMTHREADINFO1            (native thread ID:0xCAA, native priority:0x5, native policy:UNKNOWN, vmstate:P, vm thread flags:0x000a0001)
3XMTHREADINFO2            (native stack address range from:0x00007F4D757FB000, to:0x00007F4D7583C000, size:0x41000)
3XMCPUTIME               CPU usage total: 9.776676050 secs, current category="Application"
3XMTHREADBLOCK     Parked on: java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject@0x00000000F8F3AEB0 Owned by: <unknown>
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at sun/misc/Unsafe.park(Native Method)
4XESTACKTRACE                at java/util/concurrent/locks/LockSupport.parkNanos(LockSupport.java:226(Compiled Code))
4XESTACKTRACE                at java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2174)
4XESTACKTRACE                at org/jgroups/protocols/Discovery$Responses.get(Discovery.java:727)
4XESTACKTRACE                at org/jgroups/protocols/Discovery.findMembers(Discovery.java:241)
4XESTACKTRACE                at org/jgroups/protocols/Discovery.findInitialMembers(Discovery.java:208)
4XESTACKTRACE                at org/jgroups/protocols/Discovery.down(Discovery.java:551)
4XESTACKTRACE                at org/jgroups/protocols/TCPPING.down(TCPPING.java:108)
4XESTACKTRACE                at org/jgroups/protocols/MERGE2.down(MERGE2.java:185)
4XESTACKTRACE                at org/jgroups/protocols/FD_ALL.down(FD_ALL.java:217)
4XESTACKTRACE                at org/jgroups/protocols/FD.down(FD.java:307)
4XESTACKTRACE                at org/jgroups/protocols/VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:84)
4XESTACKTRACE                at org/jgroups/protocols/pbcast/NAKACK.down(NAKACK.java:569(Compiled Code))
4XESTACKTRACE                at org/jgroups/protocols/pbcast/STABLE.down(STABLE.java:365)
4XESTACKTRACE                at org/jgroups/protocols/pbcast/ClientGmsImpl.findInitialMembers(ClientGmsImpl.java:199)
4XESTACKTRACE                at org/jgroups/protocols/pbcast/ClientGmsImpl.joinInternal(ClientGmsImpl.java:73)
4XESTACKTRACE                at org/jgroups/protocols/pbcast/ClientGmsImpl.join(ClientGmsImpl.java:37)
4XESTACKTRACE                at org/jgroups/protocols/pbcast/GMS.down(GMS.java:1013)
4XESTACKTRACE                at org/jgroups/stack/ProtocolStack.down(ProtocolStack.java:1025)
4XESTACKTRACE                at org/jgroups/JChannel.down(JChannel.java:766)
4XESTACKTRACE                at org/jgroups/JChannel._connect(JChannel.java:543)
4XESTACKTRACE                at org/jgroups/JChannel.connect(JChannel.java:290)
5XESTACKTRACE                   (entered lock: org/jgroups/JChannel@0x00000000F858D5E0, entry count: 2)
4XESTACKTRACE                at org/jgroups/JChannel.connect(JChannel.java:275)
5XESTACKTRACE                   (entered lock: org/jgroups/JChannel@0x00000000F858D5E0, entry count: 1)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/jgroups/NodeInfoNotificationBus.start(NodeInfoNotificationBus.java:77)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/jgroups/NodeInfoNotificationBus.start(NodeInfoNotificationBus.java:55)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/JGroupsNodeCommunication.<init>(JGroupsNodeCommunication.java:91)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/ClusterScheduler.initialize(ClusterScheduler.java:164)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/FairShareSchedulingPolicy.ConfigureCluster(FairShareSchedulingPolicy.java:174)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/WorkFlowQueueSender.ConfigureCluster(WorkFlowQueueSender.java:205)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/osgi/bundles/container/AdapterJVMActivator.start(AdapterJVMActivator.java:222)
4XESTACKTRACE                at org/apache/felix/framework/util/SecureAction.startActivator(SecureAction.java:589)
4XESTACKTRACE                at org/apache/felix/framework/Felix.startBundle(Felix.java:1458)
4XESTACKTRACE                at org/apache/felix/framework/Felix.setActiveStartLevel(Felix.java:984)
4XESTACKTRACE                at org/apache/felix/framework/StartLevelImpl.run(StartLevelImpl.java:263)
4XESTACKTRACE                at java/lang/Thread.run(Thread.java:785)
3XMTHREADINFO3           Native callstack:
4XENATIVESTACK               (0x00007F4D7B26E882 [libj9prt28.so+0x2f882])
4XENATIVESTACK               (0x00007F4D7B27DD05 [libj9prt28.so+0x3ed05])
4XENATIVESTACK               (0x00007F4D7B26E3FC [libj9prt28.so+0x2f3fc])
4XENATIVESTACK               (0x00007F4D7B26E4FE [libj9prt28.so+0x2f4fe])
4XENATIVESTACK               (0x00007F4D7B27DD05 [libj9prt28.so+0x3ed05])
4XENATIVESTACK               (0x00007F4D7B26DFDF [libj9prt28.so+0x2efdf])
4XENATIVESTACK               (0x00007F4D7B267677 [libj9prt28.so+0x28677])
4XENATIVESTACK               (0x00007F4D80F7A390 [libpthread.so.0+0x11390])
4XENATIVESTACK               pthread_cond_timedwait+0x129 (0x00007F4D80F76709 [libpthread.so.0+0xd709])
4XENATIVESTACK               j9thread_park+0x1b6 (0x00007F4D7B6B9F66 [libj9thr28.so+0x5f66])
4XENATIVESTACK               (0x00007F4D7B97B2E8 [libj9vm28.so+0xb22e8])
4XENATIVESTACK               (0x00007F4D56B7A2B2 [<unknown>+0x0])


Then I changed the timeout value to 20000 and saw that the hang is less frequent than before, but it still occurs. However, the thread dump showed a different stack trace:

3XMTHREADINFO      "FelixStartLevel" J9VMThread:0x0000000002AC4900, j9thread_t:0x00007F44A811C180, java/lang/Thread:0x00000000E11C89D8, state:P, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1F, isDaemon:true)
3XMTHREADINFO1            (native thread ID:0xC84, native priority:0x5, native policy:UNKNOWN, vmstate:P, vm thread flags:0x000a0001)
3XMTHREADINFO2            (native stack address range from:0x00007F4518EFB000, to:0x00007F4518F3C000, size:0x41000)
3XMCPUTIME               CPU usage total: 9.356933747 secs, current category="Application"
3XMTHREADBLOCK     Parked on: java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject@0x00000000F5856490 Owned by: <unknown>
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at sun/misc/Unsafe.park(Native Method)
4XESTACKTRACE                at java/util/concurrent/locks/LockSupport.parkNanos(LockSupport.java:226(Compiled Code))
4XESTACKTRACE                at java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2174)
4XESTACKTRACE                at org/jgroups/util/Promise._getResultWithTimeout(Promise.java:141)
4XESTACKTRACE                at org/jgroups/util/Promise.getResultWithTimeout(Promise.java:40)
4XESTACKTRACE                at org/jgroups/util/Promise.getResult(Promise.java:64)
4XESTACKTRACE                at org/jgroups/protocols/pbcast/ClientGmsImpl.joinInternal(ClientGmsImpl.java:138)
4XESTACKTRACE                at org/jgroups/protocols/pbcast/ClientGmsImpl.join(ClientGmsImpl.java:37)
4XESTACKTRACE                at org/jgroups/protocols/pbcast/GMS.down(GMS.java:1013)
4XESTACKTRACE                at org/jgroups/stack/ProtocolStack.down(ProtocolStack.java:1025)
4XESTACKTRACE                at org/jgroups/JChannel.down(JChannel.java:766)
4XESTACKTRACE                at org/jgroups/JChannel._connect(JChannel.java:543)
4XESTACKTRACE                at org/jgroups/JChannel.connect(JChannel.java:290)
5XESTACKTRACE                   (entered lock: org/jgroups/JChannel@0x00000000F550E2F8, entry count: 2)
4XESTACKTRACE                at org/jgroups/JChannel.connect(JChannel.java:275)
5XESTACKTRACE                   (entered lock: org/jgroups/JChannel@0x00000000F550E2F8, entry count: 1)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/jgroups/NodeInfoNotificationBus.start(NodeInfoNotificationBus.java:77)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/jgroups/NodeInfoNotificationBus.start(NodeInfoNotificationBus.java:55)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/JGroupsNodeCommunication.<init>(JGroupsNodeCommunication.java:91)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/ClusterScheduler.initialize(ClusterScheduler.java:164)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/FairShareSchedulingPolicy.ConfigureCluster(FairShareSchedulingPolicy.java:174)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/workflow/queue/WorkFlowQueueSender.ConfigureCluster(WorkFlowQueueSender.java:205)
4XESTACKTRACE                at com/sterlingcommerce/woodstock/osgi/bundles/container/AdapterJVMActivator.start(AdapterJVMActivator.java:222)
4XESTACKTRACE                at org/apache/felix/framework/util/SecureAction.startActivator(SecureAction.java:589)
4XESTACKTRACE                at org/apache/felix/framework/Felix.startBundle(Felix.java:1458)
4XESTACKTRACE                at org/apache/felix/framework/Felix.setActiveStartLevel(Felix.java:984)
4XESTACKTRACE                at org/apache/felix/framework/StartLevelImpl.run(StartLevelImpl.java:263)
4XESTACKTRACE                at java/lang/Thread.run(Thread.java:785)


It hangs permanently. What is the cause? Is there a bug in the 3.4 version of the library? Shouldn't it time out while finding the members and just connect to the cluster?
What are the recommended properties?
Should all the members of the cluster be present in the initial_hosts setting? And does it try to contact each of the nodes? What if 6 out of 8 nodes are down?
Our initial_hosts entry does not contain all the nodes in the cluster, and nodes are added dynamically. Say A, B and C are the nodes in the cluster. On startup, A will not know that B and C will be part of the cluster.
B will not know that C will be part of the cluster. However, C knows that A and B are part of the cluster (this is in terms of the initial_hosts entry).

However, since they all join the same group, eventually each will know about the others. Will it work like this?

Thanks,
Varada
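
While the root cause is being investigated, one way to keep a startup thread from blocking indefinitely is to run the blocking call on a worker thread and bound the wait. The sketch below uses only java.util.concurrent. `ConnectGuard` and `runWithTimeout` are hypothetical names, and the `Callable` is a stand-in for a call such as `channel.connect(...)`. It is an assumption that the blocked call reacts to interruption; if it does not, the worker stays parked, but the caller at least regains control.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ConnectGuard {

    /**
     * Runs a blocking task on a daemon worker thread and waits at most timeoutMs.
     * Returns true if the task finished in time, false if the wait timed out.
     */
    public static boolean runWithTimeout(Callable<Void> blockingCall, long timeoutMs)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "connect-guard");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<Void> future = executor.submit(blockingCall);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the blocked call honors interrupts.
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for a blocking connect call; here they only sleep.
        boolean fast = runWithTimeout(() -> { Thread.sleep(10); return null; }, 1000);
        boolean slow = runWithTimeout(() -> { Thread.sleep(5000); return null; }, 100);
        System.out.println("fast call completed: " + fast);
        System.out.println("slow call completed: " + slow);
    }
}
```

This does not fix the hang itself; it only lets the application log the failure, close the channel, and retry or abort instead of waiting forever.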


Bela Ban

Aug 6, 2018, 12:07:31 PM
to jgrou...@googlegroups.com


On 05/08/18 19:41, Varada wrote:
> Hi,
>
>
> I am seeing this strange problem which occurs intermittently. I have a
> cluster of 10 nodes. If other nodes are up and one of the node is
> brought down and started again,
> the call to JChannel.connect does not return. The process hangs and the
> threaddump shows it is stuck in JChannel.connect.
>
> Jgroup version used : jgroups-3.4.0.Alpha2.jar


I don't support such an old version (from 2013!), please try with 3.6.x
(latest stable)!
Your config is also very dated and is missing UNICAST!
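
A quick way to check a plain-text stack for a missing protocol is to split it into protocol names. The sketch below is a plain-Java illustration, not the JGroups configurator: `StackCheck`, `protocolNames` and `hasProtocol` are hypothetical helpers, and the code assumes protocols are separated by `:` outside parentheses, which matches the strings posted in this thread.

```java
import java.util.ArrayList;
import java.util.List;

public class StackCheck {

    /** Splits a plain-text stack on ':' outside parentheses and returns the protocol names. */
    public static List<String> protocolNames(String props) {
        List<String> names = new ArrayList<>();
        int depth = 0;
        StringBuilder current = new StringBuilder();
        for (char c : props.toCharArray()) {
            if (c == '(') depth++;
            else if (c == ')') depth--;
            if (c == ':' && depth == 0) {
                addName(names, current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addName(names, current.toString());
        return names;
    }

    private static void addName(List<String> names, String protocol) {
        String p = protocol.trim();
        if (p.isEmpty()) return;
        int paren = p.indexOf('(');
        String name = paren < 0 ? p : p.substring(0, paren);
        // Drop a package prefix such as "pbcast."
        names.add(name.substring(name.lastIndexOf('.') + 1));
    }

    /** True if any protocol name starts with the given prefix, e.g. "UNICAST" matches UNICAST3. */
    public static boolean hasProtocol(List<String> names, String prefix) {
        return names.stream().anyMatch(n -> n.startsWith(prefix));
    }

    public static void main(String[] args) {
        // Shortened form of the original stack from this thread (hosts and most properties omitted).
        String original = "TCP(bind_port=37062):TCPPING(timeout=15000):MERGE2:FD_ALL:FD:"
                + "VERIFY_SUSPECT:pbcast.NAKACK(discard_delivered_msgs=true):"
                + "pbcast.STABLE:pbcast.GMS(join_timeout=15000)";
        List<String> names = protocolNames(original);
        System.out.println(names);
        System.out.println("has UNICAST*: " + hasProtocol(names, "UNICAST"));
    }
}
```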
> --
> You received this message because you are subscribed to the Google
> Groups "jgroups-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to jgroups-dev...@googlegroups.com
> <mailto:jgroups-dev...@googlegroups.com>.
> To post to this group, send email to jgrou...@googlegroups.com
> <mailto:jgrou...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/jgroups-dev/5eff5e20-5b37-460f-9763-cc8769db1d7f%40googlegroups.com
> <https://groups.google.com/d/msgid/jgroups-dev/5eff5e20-5b37-460f-9763-cc8769db1d7f%40googlegroups.com?utm_medium=email&utm_source=footer>.
> For more options, visit https://groups.google.com/d/optout.

--
Bela Ban | http://www.jgroups.org

Varada

Aug 7, 2018, 2:09:08 PM
to jgroups-dev
Hi Bela Ban,

Thanks for the reply and for pointing out that UNICAST is missing from the configuration.
I understand that such an old version is not supported; we will upgrade to a newer version in the next FP release of the product.

I tried a few property changes and the issue is no longer observed. What I want to know is how the changes will affect my network traffic and performance. Will adding FD_SOCK and UNICAST, and using NAKACK2 instead of NAKACK, affect my application's performance?

These are the properties I changed to:

TCP(bind_addr=HOST_A;bind_port=37062;):TCPPING(initial_hosts=HOST_A[37062],HOST_B[37087],HOST_C[37091],HOST_D[37095];port_range=0;timeout=15000;num_initial_members=2):MERGE2(min_interval=3000;max_interval=5000):FD_ALL(interval=15000;timeout=20000):FD_SOCK:FD(timeout=15000;max_tries=48;level=ERROR):VERIFY_SUSPECT(timeout=15000):pbcast.NAKACK2(use_mcast_xmit=false;discard_delivered_msgs=true):UNICAST3:pbcast.STABLE(stability_delay=1000;desired_avg_gossip=20000;max_bytes=0):pbcast.GMS(print_local_addr=true;join_timeout=15000)

Thanks,
Varada

Bela Ban

Aug 9, 2018, 8:29:51 AM
to jgrou...@googlegroups.com


On 07/08/18 20:09, Varada wrote:
> Hi Bela Ban,
>
> Thanks for the reply and pointing out that the UNICAST is missing in
> configuration.
> I can understand that such a old version is not supported and will
> upgrade to next version in the next FP release of the product.

OK

> I tried few properties changes and the issue is not observed any more.

Note that the plain-text property style you use is not recommended;
I suggest using XML files instead. You can find sample configs in the
JGroups JAR. I always recommend starting out with an example, e.g.
tcp.xml, then changing TCP.bind_addr/bind_port and the discovery protocol
(e.g. TCPPING).
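
To see how a plain-text string maps onto the XML style, each `PROTOCOL(key=value;...)` entry becomes one element with attributes. The sketch below is a hypothetical converter in plain Java (`PropsToXml` is not a JGroups tool). The `<config xmlns="urn:org:jgroups">` root element is assumed to match the sample files shipped in the JAR, so compare the output against tcp.xml before using it.

```java
import java.util.ArrayList;
import java.util.List;

public class PropsToXml {

    /** Splits on a separator character, ignoring separators inside parentheses. */
    static List<String> splitTopLevel(String s, char sep) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') depth++;
            else if (c == ')') depth--;
            else if (c == sep && depth == 0) {
                parts.add(s.substring(start, i).trim());
                start = i + 1;
            }
        }
        parts.add(s.substring(start).trim());
        return parts;
    }

    /** Converts "A(k=v;k2=v2):B" into one XML element per protocol. Values are not XML-escaped. */
    public static String toXml(String props) {
        StringBuilder xml = new StringBuilder("<config xmlns=\"urn:org:jgroups\">\n");
        for (String protocol : splitTopLevel(props, ':')) {
            if (protocol.isEmpty()) continue;
            int open = protocol.indexOf('(');
            int close = protocol.lastIndexOf(')');
            String name = open < 0 ? protocol : protocol.substring(0, open);
            xml.append("    <").append(name);
            if (open >= 0 && close > open) {
                for (String pair : protocol.substring(open + 1, close).split(";")) {
                    int eq = pair.indexOf('=');
                    if (eq < 0) continue; // skips empty fragments, e.g. after a trailing ';'
                    xml.append(' ').append(pair.substring(0, eq).trim())
                       .append("=\"").append(pair.substring(eq + 1).trim()).append('"');
                }
            }
            xml.append("/>\n");
        }
        return xml.append("</config>\n").toString();
    }

    public static void main(String[] args) {
        // Shortened form of the updated stack from this thread.
        System.out.print(toXml("TCP(bind_addr=HOST_A;bind_port=37062;):FD_SOCK:"
                + "pbcast.NAKACK2(use_mcast_xmit=false;discard_delivered_msgs=true):UNICAST3"));
    }
}
```

Starting from the shipped tcp.xml remains the safer route, since it also carries current defaults that a mechanical conversion of an old string would not.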

> What i want to know is how will it add to my network traffic or
> performance? adding FD_SOCK, UNICAST and NAKACK2 (instead of NAKACK)
> will it affect my application in terms of performance?

UNICAST3 and NAKACK2 brought performance gains and lower memory use over
their predecessors.

> Following are the properties I change to
>
> TCP(bind_addr=HOST_A;bind_port=37062;):TCPPING(initial_hosts=HOST_A[37062],HOST_B[37087],HOST_C[37091],HOST_D[37095];port_range=0;timeout=15000;num_initial_members=2):MERGE2(min_interval=3000;max_interval=5000):FD_ALL(interval=15000;timeout=20000):FD_SOCK:FD(timeout=15000;max_tries=48;level=ERROR):VERIFY_SUSPECT(timeout=15000):pbcast.NAKACK2(use_mcast_xmit=false;discard_delivered_msgs=true):UNICAST3:pbcast.STABLE(stability_delay=1000;desired_avg_gossip=20000;max_bytes=0):pbcast.GMS(print_local_addr=true;join_timeout=15000)

Again, I suggest copying tcp.xml and making the necessary changes.