On 14.08.20 04:59, Hemal Pandya wrote:
> (sorry for bad formatting, not quite sure how quoting works so I have
> copied your response manually)
>
> Thanks again.
>
> > When a sender thread is blocked, it doesn't get any credits from a (or
> > all) receiver(s), therefore you need to look at the stack dumps of *all*
> > threads in the cluster, to see which one is stuck.
>
> I will try to get successive stackdumps from all nodes in cluster. I
> will try to compare them to see if I can identify which receiver threads
> are stuck.
>
> > If none of them is blocked (which I assume), then another possibility is
> > that your senders continually send messages faster than the receivers
> > can process them. In this case, you will see many senders blocked on
> > Channel.send(), but that should be a temporary condition.
>
> There is indeed very high volume. But the condition is not temporary:
> successive stack dumps show all the sender threads in the same state.
It could be that they block on almost every message, unblock to send,
then block again. If they are blocked most of the time, that is what you
would see in successive stack dumps.
You could check this with probe.sh, e.g. probe.sh jmx=MFC_NB or
probe.sh jmx=UFC_NB, and then look at the number of blockings and the
total and average time blocked.
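
To put a number on "blocked most of the time", here is a rough sketch (not part of JGroups; the class, method and parameter names are made up for illustration). It takes two readings of the total blocked time, a known interval apart, and estimates the fraction of that interval an average sender spent blocked. It assumes the total is summed over all sender threads and reported in milliseconds, which you should verify against your version.

```java
public class BlockedFraction {
    /**
     * Estimates the fraction of the sampling interval that an average sender
     * thread spent blocked. All inputs are assumptions about what you read
     * off the stats: two "total time blocked" readings in milliseconds,
     * the time between the readings, and the number of sender threads.
     */
    static double blockedFraction(long totalBlockedMsStart, long totalBlockedMsEnd,
                                  long intervalMs, int senderThreads) {
        if (intervalMs <= 0 || senderThreads <= 0)
            throw new IllegalArgumentException("interval and thread count must be positive");
        long delta = totalBlockedMsEnd - totalBlockedMsStart;
        double fraction = (double) delta / ((double) intervalMs * senderThreads);
        // Clamp, since the two readings are not taken atomically
        return Math.max(0.0, Math.min(1.0, fraction));
    }

    public static void main(String[] args) {
        // Hypothetical readings: 10 senders, 10 s apart, total blocked time grew by 95 s
        double f = blockedFraction(120_000, 215_000, 10_000, 10);
        System.out.println("blocked fraction per sender: " + f);
    }
}
```

A value close to 1.0 would explain why every stack dump catches the senders inside send(), even though they do make progress in between.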
> Another relevant point is that logs show a lot of transient network
> drops (host not found messages). Can this result in some condition so
> that the sending threads are permanently blocked?
Yes. If you have {A,B,C,D,E}, and there is an issue with D (e.g. route
to it dropped, or NIC went bad), then multicasts will work until there
are no credits left for D. Everybody will block until (1) D sends
credits (which won't happen) or (2) D is suspected and excluded from the
group.
(2) depends largely on your failure detection protocols and configuration.
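
To illustrate the scenario above, here is a toy model of per-receiver credits. It is only a sketch under simplifying assumptions, not the actual MFC code: the class and method names are invented, credits are plain byte counters, and "blocking" is modelled as trySend() returning false.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of per-receiver credits for multicasts; not the JGroups implementation. */
public class CreditSketch {
    private final Map<String, Long> credits = new LinkedHashMap<>();

    CreditSketch(long initialCredits, String... receivers) {
        for (String r : receivers)
            credits.put(r, initialCredits);
    }

    /** A multicast of 'size' bytes goes out only if every receiver has enough credits. */
    boolean trySend(long size) {
        for (long c : credits.values())
            if (c < size)
                return false; // a real sender thread would block here
        credits.replaceAll((member, c) -> c - size);
        return true;
    }

    /** A healthy receiver has processed messages and sent credits back. */
    void replenish(String receiver, long amount) {
        credits.computeIfPresent(receiver, (member, c) -> c + amount);
    }

    /** Failure detection suspected the member and a new view excluded it. */
    void exclude(String receiver) {
        credits.remove(receiver);
    }

    public static void main(String[] args) {
        // Sender A multicasts to B, C, D, E; D is unreachable and never returns credits
        CreditSketch sender = new CreditSketch(3_000, "B", "C", "D", "E");
        int sent = 0;
        while (sender.trySend(1_000)) {
            sent++;
            for (String healthy : new String[] {"B", "C", "E"})
                sender.replenish(healthy, 1_000);
        }
        System.out.println("multicasts sent before blocking: " + sent);
        sender.exclude("D");
        System.out.println("can send after D is excluded: " + sender.trySend(1_000));
    }
}
```

With these made-up numbers the sender gets three multicasts out, then stalls on D alone even though B, C and E have full credits, and resumes as soon as D is removed from the membership.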