Hi,
I am seeing a strange problem that occurs intermittently. I have a cluster of 10 nodes. If the other nodes are up and one node is brought down and started again,
the call to JChannel.connect does not return. The process hangs, and the thread dump shows it is stuck in JChannel.connect.
JGroups version used: jgroups-3.4.0.Alpha2.jar
It is a VM running RHEL 7.
The transport protocol is TCP,
and below are the properties used:
TCP(bind_addr=HOST_A;bind_port=37062):
TCPPING(initial_hosts=HOST_A[37062],HOST_B[37087],HOST_C[37091],HOST_D[37095],HOST_D[37062],HOST_E[37087],HOST_F[37091],HOST_G[37095],HOST_H[37099];port_range=0;timeout=15000;num_initial_members=2):
MERGE2(min_interval=3000;max_interval=5000):
FD_ALL(interval=15000;timeout=20000):
FD(timeout=15000;max_tries=48;level=ERROR):
VERIFY_SUSPECT(timeout=15000):
pbcast.NAKACK(retransmit_timeout=100,200,300,600,1200,2400,4800;discard_delivered_msgs=true):
pbcast.STABLE(stability_delay=1000;desired_avg_gossip=20000;max_bytes=0):
pbcast.GMS(print_local_addr=true;join_timeout=15000)
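For reference, this is roughly how we create and connect the channel (the class and cluster name below are placeholders; the real call site is NodeInfoNotificationBus.start, as the stack traces that follow show):

import org.jgroups.JChannel;

public class ConnectRepro {
    public static void main(String[] args) throws Exception {
        // props = the full protocol stack string listed above
        String props = "TCP(bind_addr=HOST_A;bind_port=37062):...";
        JChannel channel = new JChannel(props);
        channel.connect("node-info-cluster");   // <-- this is the call that hangs intermittently
    }
}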
With the TCPPING timeout of 15000, the stack trace showed:
3XMTHREADINFO "FelixStartLevel" J9VMThread:0x0000000001FA4800, j9thread_t:0x00007F4D08099930, java/lang/Thread:0x00000000E11706D8, state:P, prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x1F, isDaemon:true)
3XMTHREADINFO1 (native thread ID:0xCAA, native priority:0x5, native policy:UNKNOWN, vmstate:P, vm thread flags:0x000a0001)
3XMTHREADINFO2 (native stack address range from:0x00007F4D757FB000, to:0x00007F4D7583C000, size:0x41000)
3XMCPUTIME CPU usage total: 9.776676050 secs, current category="Application"
3XMTHREADBLOCK Parked on: java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject@0x00000000F8F3AEB0 Owned by: <unknown>
3XMHEAPALLOC Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3 Java callstack:
4XESTACKTRACE at sun/misc/Unsafe.park(Native Method)
4XESTACKTRACE at java/util/concurrent/locks/LockSupport.parkNanos(LockSupport.java:226(Compiled Code))
4XESTACKTRACE at java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2174)
4XESTACKTRACE at org/jgroups/protocols/Discovery$Responses.get(Discovery.java:727)
4XESTACKTRACE at org/jgroups/protocols/Discovery.findMembers(Discovery.java:241)
4XESTACKTRACE at org/jgroups/protocols/Discovery.findInitialMembers(Discovery.java:208)
4XESTACKTRACE at org/jgroups/protocols/Discovery.down(Discovery.java:551)
4XESTACKTRACE at org/jgroups/protocols/TCPPING.down(TCPPING.java:108)
4XESTACKTRACE at org/jgroups/protocols/MERGE2.down(MERGE2.java:185)
4XESTACKTRACE at org/jgroups/protocols/FD_ALL.down(FD_ALL.java:217)
4XESTACKTRACE at org/jgroups/protocols/FD.down(FD.java:307)
4XESTACKTRACE at org/jgroups/protocols/VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:84)
4XESTACKTRACE at org/jgroups/protocols/pbcast/NAKACK.down(NAKACK.java:569(Compiled Code))
4XESTACKTRACE at org/jgroups/protocols/pbcast/STABLE.down(STABLE.java:365)
4XESTACKTRACE at org/jgroups/protocols/pbcast/ClientGmsImpl.findInitialMembers(ClientGmsImpl.java:199)
4XESTACKTRACE at org/jgroups/protocols/pbcast/ClientGmsImpl.joinInternal(ClientGmsImpl.java:73)
4XESTACKTRACE at org/jgroups/protocols/pbcast/ClientGmsImpl.join(ClientGmsImpl.java:37)
4XESTACKTRACE at org/jgroups/protocols/pbcast/GMS.down(GMS.java:1013)
4XESTACKTRACE at org/jgroups/stack/ProtocolStack.down(ProtocolStack.java:1025)
4XESTACKTRACE at org/jgroups/JChannel.down(JChannel.java:766)
4XESTACKTRACE at org/jgroups/JChannel._connect(JChannel.java:543)
4XESTACKTRACE at org/jgroups/JChannel.connect(JChannel.java:290)
5XESTACKTRACE (entered lock: org/jgroups/JChannel@0x00000000F858D5E0, entry count: 2)
4XESTACKTRACE at org/jgroups/JChannel.connect(JChannel.java:275)
5XESTACKTRACE (entered lock: org/jgroups/JChannel@0x00000000F858D5E0, entry count: 1)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/jgroups/NodeInfoNotificationBus.start(NodeInfoNotificationBus.java:77)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/jgroups/NodeInfoNotificationBus.start(NodeInfoNotificationBus.java:55)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/JGroupsNodeCommunication.<init>(JGroupsNodeCommunication.java:91)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/ClusterScheduler.initialize(ClusterScheduler.java:164)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/FairShareSchedulingPolicy.ConfigureCluster(FairShareSchedulingPolicy.java:174)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/WorkFlowQueueSender.ConfigureCluster(WorkFlowQueueSender.java:205)
4XESTACKTRACE at com/sterlingcommerce/woodstock/osgi/bundles/container/AdapterJVMActivator.start(AdapterJVMActivator.java:222)
4XESTACKTRACE at org/apache/felix/framework/util/SecureAction.startActivator(SecureAction.java:589)
4XESTACKTRACE at org/apache/felix/framework/Felix.startBundle(Felix.java:1458)
4XESTACKTRACE at org/apache/felix/framework/Felix.setActiveStartLevel(Felix.java:984)
4XESTACKTRACE at org/apache/felix/framework/StartLevelImpl.run(StartLevelImpl.java:263)
4XESTACKTRACE at java/lang/Thread.run(Thread.java:785)
3XMTHREADINFO3 Native callstack:
4XENATIVESTACK (0x00007F4D7B26E882 [libj9prt28.so+0x2f882])
4XENATIVESTACK (0x00007F4D7B27DD05 [libj9prt28.so+0x3ed05])
4XENATIVESTACK (0x00007F4D7B26E3FC [libj9prt28.so+0x2f3fc])
4XENATIVESTACK (0x00007F4D7B26E4FE [libj9prt28.so+0x2f4fe])
4XENATIVESTACK (0x00007F4D7B27DD05 [libj9prt28.so+0x3ed05])
4XENATIVESTACK (0x00007F4D7B26DFDF [libj9prt28.so+0x2efdf])
4XENATIVESTACK (0x00007F4D7B267677 [libj9prt28.so+0x28677])
4XENATIVESTACK (0x00007F4D80F7A390 [libpthread.so.0+0x11390])
4XENATIVESTACK pthread_cond_timedwait+0x129 (0x00007F4D80F76709 [libpthread.so.0+0xd709])
4XENATIVESTACK j9thread_park+0x1b6 (0x00007F4D7B6B9F66 [libj9thr28.so+0x5f66])
4XENATIVESTACK (0x00007F4D7B97B2E8 [libj9vm28.so+0xb22e8])
4XENATIVESTACK (0x00007F4D56B7A2B2 [<unknown>+0x0])
Then I changed the timeout value to 20000 and saw that the hang is less frequent than before, but it still occurs. However, the thread dump showed a different stack trace:
3XMTHREADINFO "FelixStartLevel" J9VMThread:0x0000000002AC4900, j9thread_t:0x00007F44A811C180, java/lang/Thread:0x00000000E11C89D8, state:P, prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x1F, isDaemon:true)
3XMTHREADINFO1 (native thread ID:0xC84, native priority:0x5, native policy:UNKNOWN, vmstate:P, vm thread flags:0x000a0001)
3XMTHREADINFO2 (native stack address range from:0x00007F4518EFB000, to:0x00007F4518F3C000, size:0x41000)
3XMCPUTIME CPU usage total: 9.356933747 secs, current category="Application"
3XMTHREADBLOCK Parked on: java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject@0x00000000F5856490 Owned by: <unknown>
3XMHEAPALLOC Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3 Java callstack:
4XESTACKTRACE at sun/misc/Unsafe.park(Native Method)
4XESTACKTRACE at java/util/concurrent/locks/LockSupport.parkNanos(LockSupport.java:226(Compiled Code))
4XESTACKTRACE at java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2174)
4XESTACKTRACE at org/jgroups/util/Promise._getResultWithTimeout(Promise.java:141)
4XESTACKTRACE at org/jgroups/util/Promise.getResultWithTimeout(Promise.java:40)
4XESTACKTRACE at org/jgroups/util/Promise.getResult(Promise.java:64)
4XESTACKTRACE at org/jgroups/protocols/pbcast/ClientGmsImpl.joinInternal(ClientGmsImpl.java:138)
4XESTACKTRACE at org/jgroups/protocols/pbcast/ClientGmsImpl.join(ClientGmsImpl.java:37)
4XESTACKTRACE at org/jgroups/protocols/pbcast/GMS.down(GMS.java:1013)
4XESTACKTRACE at org/jgroups/stack/ProtocolStack.down(ProtocolStack.java:1025)
4XESTACKTRACE at org/jgroups/JChannel.down(JChannel.java:766)
4XESTACKTRACE at org/jgroups/JChannel._connect(JChannel.java:543)
4XESTACKTRACE at org/jgroups/JChannel.connect(JChannel.java:290)
5XESTACKTRACE (entered lock: org/jgroups/JChannel@0x00000000F550E2F8, entry count: 2)
4XESTACKTRACE at org/jgroups/JChannel.connect(JChannel.java:275)
5XESTACKTRACE (entered lock: org/jgroups/JChannel@0x00000000F550E2F8, entry count: 1)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/jgroups/NodeInfoNotificationBus.start(NodeInfoNotificationBus.java:77)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/jgroups/NodeInfoNotificationBus.start(NodeInfoNotificationBus.java:55)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/JGroupsNodeCommunication.<init>(JGroupsNodeCommunication.java:91)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/ClusterScheduler.initialize(ClusterScheduler.java:164)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/FairShareSchedulingPolicy.ConfigureCluster(FairShareSchedulingPolicy.java:174)
4XESTACKTRACE at com/sterlingcommerce/woodstock/workflow/queue/WorkFlowQueueSender.ConfigureCluster(WorkFlowQueueSender.java:205)
4XESTACKTRACE at com/sterlingcommerce/woodstock/osgi/bundles/container/AdapterJVMActivator.start(AdapterJVMActivator.java:222)
4XESTACKTRACE at org/apache/felix/framework/util/SecureAction.startActivator(SecureAction.java:589)
4XESTACKTRACE at org/apache/felix/framework/Felix.startBundle(Felix.java:1458)
4XESTACKTRACE at org/apache/felix/framework/Felix.setActiveStartLevel(Felix.java:984)
4XESTACKTRACE at org/apache/felix/framework/StartLevelImpl.run(StartLevelImpl.java:263)
4XESTACKTRACE at java/lang/Thread.run(Thread.java:785)
It hangs permanently. What is the cause? Is there a bug in the 3.4 version of the library? Shouldn't it time out while finding the members and just connect to the cluster?
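In the meantime, as a workaround, we are considering bounding the join ourselves by running connect() on a separate thread. A rough sketch, assuming that closing the channel from another thread unblocks the parked connect (the cluster name and wait limit are placeholders for our actual values):

import java.util.concurrent.*;
import org.jgroups.JChannel;

public class GuardedConnect {
    // Give connect() a hard upper bound instead of letting it park forever.
    static boolean connectWithTimeout(final JChannel channel, final String cluster,
                                      long maxWaitMs) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<Void> join = exec.submit(new Callable<Void>() {
            public Void call() throws Exception {
                channel.connect(cluster);   // the call that currently never returns
                return null;
            }
        });
        try {
            join.get(maxWaitMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            channel.close();                // abandon this attempt; caller retries with a fresh channel
            return false;
        } finally {
            exec.shutdownNow();
        }
    }
}

Is something like this safe, or is there a supported way to bound the join?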
What are the recommended properties?
Should all the members of the cluster be present in the initial_hosts setting? Does TCPPING try to contact each of the nodes? What happens if 6 out of 8 nodes are down?
Our initial_hosts entry does not contain all the nodes in the cluster, and nodes are added dynamically. Say A, B, and C are the nodes in the cluster. On startup, A does not know that B and C will be part of the cluster,
and B does not know that C will be part of the cluster; however, C knows that A and B are part of the cluster (in terms of the initial_hosts entry).
Since they all join the same group, eventually each should learn about the others. Will it work like this?
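To check that on our side, we are adding a view logger on each node. A minimal sketch (ViewLogger is our class, not a JGroups one):

import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class ViewLogger extends ReceiverAdapter {
    @Override
    public void viewAccepted(View view) {
        // Prints e.g. "[A|3] (3) [A, B, C]" once membership has converged
        System.out.println("new view on this node: " + view);
    }
}

We register it with channel.setReceiver(new ViewLogger()) before calling connect(), and expect the last logged view on every node to eventually contain all members. Is that the right expectation?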
Thanks,
Varada