All threads stuck in JChannel.send


Hemal Pandya

Jul 27, 2020, 9:43:02 PM
to jgroups-dev

We use JGroups for inter-node communication.

Under heavy usage, all threads are stuck with the following stack trace. We do see network errors in the customer's logs, but I am wondering whether this is a JGroups/JDK bug and whether there is any way to work around the situation.

JGroups Version 4
SAP Java Server VM 8.1.032 25.51-b02

"Runtime Worker 140390781294336 Thread" #13356330 prio=5 os_prio=0 cpu=10.00 [reset 10.00] ms elapsed=3.84 [reset 3.84] s allocated=25856 B (25.25 KB) [reset 25856 B (25.25 KB)] defined_classes=0
io= file i/o: 64/0 B, net i/o: 0/0 B, files opened:0, socks opened:0  [reset file i/o: 64/0 B, net i/o: 0/0 B, files opened:0, socks opened:0 ]
tid=0x00007fb25c7c8000 nid=0x29af3 / 170739  pthread-id=140390781294336 waiting on condition  [_thread_blocked (_at_safepoint), stack(0x00007faf46262000,0x00007faf46a62000)] [0x00007faf46a5d000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(ZJ)V(Native Method)
    - parking to wait for  <0x00007fb8b190c490> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()V(AbstractQueuedSynchronizer.java:2039)
    at org.jgroups.util.SizeBoundedQueue.add(Ljava/lang/Object;I)V(SizeBoundedQueue.java:51)
    at org.jgroups.util.NonBlockingCreditMap.addToQueue(Lorg/jgroups/Message;I)Z(NonBlockingCreditMap.java:100)
    at org.jgroups.util.NonBlockingCreditMap.decrement(Lorg/jgroups/Message;IJ)Z(NonBlockingCreditMap.java:52)
    at org.jgroups.protocols.MFC_NB.handleDownMessage(Lorg/jgroups/Message;)Ljava/lang/Object;(MFC_NB.java:86)
    at org.jgroups.protocols.FlowControl.down(Lorg/jgroups/Message;)Ljava/lang/Object;(FlowControl.java:319)
    at org.jgroups.protocols.FRAG2.down(Lorg/jgroups/Message;)Ljava/lang/Object;(FRAG2.java:151)
    at org.jgroups.stack.Protocol.down(Lorg/jgroups/Message;)Ljava/lang/Object;(Protocol.java:318)
    at org.jgroups.stack.ProtocolStack.down(Lorg/jgroups/Message;)Ljava/lang/Object;(ProtocolStack.java:907)
    at org.jgroups.JChannel.down(Lorg/jgroups/Message;)Ljava/lang/Object;(JChannel.java:677)
    at org.jgroups.JChannel.send(Lorg/jgroups/Message;)Lorg/jgroups/JChannel;(JChannel.java:516)

Bela Ban

Jul 31, 2020, 3:05:04 AM
to jgrou...@googlegroups.com
This means your senders aren't getting any credits from the receivers, or are slow at getting credits. We need a full stack trace to see what the receivers are doing. This can happen if
* the receiver threads are stuck in some code and the thread pool is full, and/or
* the receiver threads are slow in processing requests
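For readers unfamiliar with credit-based flow control: each sender holds a credit balance per receiver, decremented by the message size on every send; when the balance hits zero, the sender blocks (as in the stack trace above) until the receiver returns credits. Below is a deliberately simplified, self-contained sketch of the idea using a semaphore as the credit pool. It is *not* JGroups' actual implementation (MFC_NB uses per-destination credit maps and a bounded queue), just an illustration of why a stalled receiver parks sender threads:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model of credit-based flow control: the sender acquires "credits"
// (bytes) before sending; the receiver replenishes credits after it has
// processed a message. If the receiver stalls, the sender blocks in send().
public class CreditFlowDemo {
    private final Semaphore credits;

    public CreditFlowDemo(int initialCredits) {
        this.credits = new Semaphore(initialCredits);
    }

    // Blocks until 'size' credits are available (up to timeoutMs) --
    // loosely analogous to the sender threads parked under MFC_NB
    // in the stack trace above.
    public boolean send(int size, long timeoutMs) {
        try {
            return credits.tryAcquire(size, timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Called on behalf of the receiver once it has processed 'size' bytes.
    public void replenish(int size) {
        credits.release(size);
    }

    public static void main(String[] args) {
        CreditFlowDemo fc = new CreditFlowDemo(1000);
        System.out.println(fc.send(800, 100)); // credits available -> true
        System.out.println(fc.send(800, 100)); // only 200 left -> blocks, times out -> false
        fc.replenish(800);                     // receiver catches up
        System.out.println(fc.send(800, 100)); // true again
    }
}
```

If the receiver never calls `replenish` (stuck thread, dead node), every sender eventually parks forever, which is the failure mode discussed in this thread.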

--
Bela Ban | http://www.jgroups.org

Hemal Pandya

Aug 12, 2020, 12:16:53 AM
to jgroups-dev

Hello Bela,

Thanks a lot for replying, and sorry for the delay in following up; I somehow missed the notification.

I have pasted the complete stack trace below. I have masked some of the names from our commercial software, but that should not change the analysis.

You wrote that you need to see the full stack trace to see what the receivers are doing. But this is the stack trace of the thread trying to send a message. Did I misunderstand what information I need to provide?

In the thread dump I also see a lot of threads at org.jgroups.util.CondVar.waitFor. Could those be the threads that are stuck trying to read? I have pasted that stack trace as well.

How can I increase the thread pool?

Stacktrace of sending thread:
"XYZ Runtime Worker 140397179041536 Thread" #13305216 prio=5 os_prio=0 cpu=0.00 [reset 0.00] ms elapsed=4827.75 [reset 4827.75] s allocated=25904 B (25.30 KB) [reset 25904 B (25.30 KB)] defined_classes=0
io= file i/o: 128/0 B, net i/o: 0/0 B, files opened:0, socks opened:0  [reset file i/o: 128/0 B, net i/o: 0/0 B, files opened:0, socks opened:0 ]
tid=0x00007fb240049800 nid=0x17a4c / 96844  pthread-id=140397179041536 waiting on condition  [_thread_blocked (_at_safepoint), stack(0x00007fb0c37c0000,0x00007fb0c3fc0000)] [0x00007fb0c3fbb000]

   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(ZJ)V(Native Method)
    - parking to wait for  <0x00007fb8b190c490> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()V(AbstractQueuedSynchronizer.java:2039)
    at org.jgroups.util.SizeBoundedQueue.add(Ljava/lang/Object;I)V(SizeBoundedQueue.java:51)
    at org.jgroups.util.NonBlockingCreditMap.addToQueue(Lorg/jgroups/Message;I)Z(NonBlockingCreditMap.java:100)
    at org.jgroups.util.NonBlockingCreditMap.decrement(Lorg/jgroups/Message;IJ)Z(NonBlockingCreditMap.java:52)
    at org.jgroups.protocols.MFC_NB.handleDownMessage(Lorg/jgroups/Message;)Ljava/lang/Object;(MFC_NB.java:86)
    at org.jgroups.protocols.FlowControl.down(Lorg/jgroups/Message;)Ljava/lang/Object;(FlowControl.java:319)
    at org.jgroups.protocols.FRAG2.down(Lorg/jgroups/Message;)Ljava/lang/Object;(FRAG2.java:151)
    at org.jgroups.stack.Protocol.down(Lorg/jgroups/Message;)Ljava/lang/Object;(Protocol.java:318)
    at org.jgroups.stack.ProtocolStack.down(Lorg/jgroups/Message;)Ljava/lang/Object;(ProtocolStack.java:907)
    at org.jgroups.JChannel.down(Lorg/jgroups/Message;)Ljava/lang/Object;(JChannel.java:677)
    at org.jgroups.JChannel.send(Lorg/jgroups/Message;)Lorg/jgroups/JChannel;(JChannel.java:516)
    at XYZ.CommunicationsJGroupImpl.sendMessage([B)V(ClusterCommunicationsJGroupImpl.java:104)
    at XYZ.ServiceImpl.sendMessage(Ljava/lang/String;[B)V(ClusterServiceImpl.java:150)
    at sun.reflect.GeneratedMethodAccessor261.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Method.java:498)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;(AopUtils.java:317)
    at org.eclipse.gemini.blueprint.service.importer.support.internal.aop.ServiceInvoker.doInvoke(Ljava/lang/Object;Lorg/aopalliance/intercept/MethodInvocation;)Ljava/lang/Object;(ServiceInvoker.java:56)
    at org.eclipse.gemini.blueprint.service.importer.support.internal.aop.ServiceInvoker.invoke(Lorg/aopalliance/intercept/MethodInvocation;)Ljava/lang/Object;(ServiceInvoker.java:60)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()Ljava/lang/Object;(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed(Lorg/aopalliance/intercept/MethodInvocation;)Ljava/lang/Object;(DelegatingIntroductionInterceptor.java:132)
    at org.springframework.aop.support.DelegatingIntroductionInterceptor.invoke(Lorg/aopalliance/intercept/MethodInvocation;)Ljava/lang/Object;(DelegatingIntroductionInterceptor.java:120)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()Ljava/lang/Object;(ReflectiveMethodInvocation.java:172)
    at org.eclipse.gemini.blueprint.service.util.internal.aop.ServiceTCCLInterceptor.invokeUnprivileged(Lorg/aopalliance/intercept/MethodInvocation;)Ljava/lang/Object;(ServiceTCCLInterceptor.java:70)
    at org.eclipse.gemini.blueprint.service.util.internal.aop.ServiceTCCLInterceptor.invoke(Lorg/aopalliance/intercept/MethodInvocation;)Ljava/lang/Object;(ServiceTCCLInterceptor.java:53)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()Ljava/lang/Object;(ReflectiveMethodInvocation.java:172)
    at org.eclipse.gemini.blueprint.service.importer.support.LocalBundleContextAdvice.invoke(Lorg/aopalliance/intercept/MethodInvocation;)Ljava/lang/Object;(LocalBundleContextAdvice.java:57)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()Ljava/lang/Object;(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed(Lorg/aopalliance/intercept/MethodInvocation;)Ljava/lang/Object;(DelegatingIntroductionInterceptor.java:132)
    at org.springframework.aop.support.DelegatingIntroductionInterceptor.invoke(Lorg/aopalliance/intercept/MethodInvocation;)Ljava/lang/Object;(DelegatingIntroductionInterceptor.java:120)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()Ljava/lang/Object;(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;(JdkDynamicAopProxy.java:204)
    at com.sun.proxy.$Proxy451.sendMessage(Ljava/lang/String;[B)V(Unknown Source)
    at XYZ.MessagingService.sendMessage(Ljava/lang/String;[B)V(SMPMessagingService.java:28)
    at XYZ..messaging.Reducer.reduce(Ljava/io/Serializable;Lcom/sap/agentry/messaging/Accumulator;Z)Ljava/io/Serializable;(Reducer.java:102)
    at XYZ.messaging.IsUserLoggedInReducer.reduce(Ljava/lang/String;Ljava/lang/String;Z)Ljava/lang/Boolean;(IsUserLoggedInReducer.java:48)
    at XYZ.UserManager.isUserLoggedOnDifferentDevice(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Z(SMPAgentryClusterUserManager.java:171)


Receiving thread:
"NioConnection.Reader [10.141.32.29:23316]-169,bc0ef3e8-81fd-4f89-99ef-ef256b53bbaa:com.XYZ.updateApplicationMailbox,XYZ-26682" #7619822 daemon prio=5 os_prio=0 cpu=2000.00 [reset 2000.00] ms elapsed=56588.64 [reset 56588.64] s allocated=24405064 B (23.27 MB) [reset 24405064 B (23.27 MB)] defined_classes=0
io= file i/o: 0/0 B, net i/o: 6589899/0 B, files opened:0, socks opened:0  [reset file i/o: 0/0 B, net i/o: 6589899/0 B, files opened:0, socks opened:0 ]
tid=0x00007fb6e56f7800 nid=0x23b26 / 146214  pthread-id=140399888488192 waiting on condition  [_thread_blocked (_at_safepoint), stack(0x00007fb1656ad000,0x00007fb1657ae000)] [0x00007fb1657ac000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(ZJ)V(Native Method)
    - parking to wait for  <0x00007fb904e40400> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(Ljava/lang/Object;J)V(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(J)J(AbstractQueuedSynchronizer.java:2078)
    at org.jgroups.util.CondVar.waitFor(Lorg/jgroups/util/Condition;JLjava/util/concurrent/TimeUnit;)Z(CondVar.java:63)
    at org.jgroups.blocks.cs.NioConnection$Reader._run()V(NioConnection.java:476)
    at org.jgroups.blocks.cs.NioConnection$Reader.run()V(NioConnection.java:449)
    at java.lang.Thread.run()V(Thread.java:808)

   Locked ownable synchronizers:
    - None

Bela Ban

Aug 12, 2020, 5:30:42 AM
to jgrou...@googlegroups.com


On 12.08.20 06:16, Hemal Pandya wrote:
>
> You wrote that you need to see the full stacktrace to see what the
> receivers are doing. But this is the stacktrace of the thread trying to
> send a message. Did I misunderstand what information I need to provide?

When a sender thread is blocked, it doesn't get any credits from a (or
all) receiver(s), therefore you need to look at the stack dumps of *all*
threads in the cluster, to see which one is stuck.

If none of them is blocked (which I assume), then another possibility is that your senders continually send messages faster than the receivers can process them. In that case, you will see many senders blocked in JChannel.send(), but that should be a temporary condition.

> In the thread-dump I also see a lot of threads at
> org.jgroups.util.CondVar.waitFor. Could that be the threads that are
> stuck trying to read?

No, these are simply receiver threads waiting for data.

Hemal Pandya

Aug 13, 2020, 10:59:36 PM
to jgroups-dev
(Sorry for the bad formatting; I'm not quite sure how quoting works, so I have copied your response manually.)

Thanks again.

> When a sender thread is blocked, it doesn't get any credits from a (or
all) receiver(s), therefore you need to look at the stack dumps of *all*
threads in the cluster, to see which one is stuck.

I will try to get successive stack dumps from all nodes in the cluster and compare them to see if I can identify which receiver threads are stuck.

> If none of them is blocked (which I assume), then another possibility is
that your senders continually send messages faster than the receivers
can process them. If this case, you will see many senders blocked on
Channel.send(), but that should be a temporary condition.

There is indeed very high volume. But the condition is not temporary: successive stack dumps show all the sender threads in the same state. Another relevant point is that the logs show a lot of transient network drops (host-not-found messages). Can this result in a condition in which the sending threads are permanently blocked?

Bela Ban

Aug 14, 2020, 2:05:31 AM
to jgrou...@googlegroups.com


On 14.08.20 04:59, Hemal Pandya wrote:
>
> I will try to get successive stackdumps from all nodes in cluster. I
> will try to compare them to see if I can identify which receiver threads
> are stuck.
>
> There is indeed very high volume. But the condition is not temporary,
> successive stackdumps show all the sender threads in the same state.

Could be that they block on almost every message, then unblock to send, then block again. If they are blocked most of the time, that would match what you see in the stack dumps.

You could look at this with probe.sh, e.g. probe.sh jmx=MFC_NB and probe.sh jmx=UFC_NB, and then look at the number of blockings and the total/average time blocked.

> Another relevant point is that logs show a lot of transient network
> drops (host not found messages). Can this result in some condition so
> that the sending threads are permanently blocked?

Yes. If you have {A,B,C,D,E} and there is an issue with D (e.g. the route to it dropped, or its NIC went bad), then multicasts will work until there are no credits left for D. Everybody will block until (1) D sends credits (which won't happen) or (2) D is suspected and excluded from the group.

(2) depends largely on your failure detection protocols and configuration.
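To make (2) concrete: failure detection is configured in the protocol stack, and the timeouts there bound how long a dead member can starve senders of credits before being excluded. The fragment below is illustrative only; the protocol names (FD_SOCK, FD_ALL, VERIFY_SUSPECT) are standard JGroups protocols, but the timeout values are example numbers, not a recommendation, and must be tuned for the actual network:

```xml
<!-- Illustrative failure-detection section of a JGroups stack
     (example values only; tune for your environment). -->
<FD_SOCK/>
<!-- Suspect a member after 12s without traffic or heartbeats from it. -->
<FD_ALL timeout="12000" interval="3000"/>
<!-- Double-check a suspicion before the member is excluded. -->
<VERIFY_SUSPECT timeout="1500"/>
```

With settings along these lines, a member whose route drops would be suspected and excluded on the order of the FD_ALL timeout, after which blocked senders stop waiting for its credits.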