I am currently experiencing an unusual issue in my JGroups-based cluster, which consists of three nodes: Node1 (acting as the leader) and Node2/Node3 (acting as followers). All three nodes host my application’s UI, allowing me to access cluster functionality from any node. This requires seamless communication between the nodes.
Issue Description
When I log in to Node1, it performs remote calls to Node2 and Node3 to retrieve their statuses.
Most of the time, this works as expected.
Occasionally, however, Node1 fails to retrieve the status from Node2 (while still successfully retrieving it from Node3). This makes it appear as though Node2 is unresponsive.
If I log in to Node3, I can successfully retrieve the status from all nodes (including Node2).
If I attempt to log in to Node2 directly, the login fails entirely.
The status retrieval is implemented using callRemoteFunctionWithFuture().
The cluster also uses JGroups-Raft for consensus and data replication. During the initial login, user preferences are fetched via Raft.
While I suspect the issue might be related to inter-node communication rather than Raft itself, I cannot rule out any potential interactions.
The behavior suggests that Node2 is not responding to messages from Node1, even though it remains accessible from Node3.
This inconsistency leads me to believe there may be an underlying issue with JGroups communication between Node1 and Node2.
I would greatly appreciate any insights or suggestions regarding:
Potential causes for this selective communication failure.
Debugging steps to identify whether the issue lies in JGroups, Raft, or network-level problems.
Best practices for diagnosing and resolving such issues in a JGroups cluster.
Thank you in advance for your help!
Hey Martin,
Thanks for the details.
> If I attempt to log in to Node2 directly, the login fails entirely.
Do you mean even SSH into Node2 fails completely?
Some potential causes and hints to debug this:
* A firewall blocking a port used by the cluster, e.g. the socket used for failure detection. You could list the ports in use on Node2 and compare which processes are bound to them against the ports your JGroups stack expects.
* You could run netcat as a listener on Node2 and connect to it with telnet from the other nodes to check whether the connection succeeds.
* You can use `probe` [1] to investigate at the JGroups level. It lets you inspect the stack, message drops, views, etc.
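As a programmatic variant of the netcat/telnet check, a small stand-alone class like the sketch below could verify TCP reachability of the relevant ports from each node. The hostname and port numbers are placeholders, not your actual stack settings; substitute Node2's address and whatever ports your transport and failure-detection protocols actually bind to.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    // Attempts a TCP connect with a timeout; returns true if the port accepted.
    static boolean reachable(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder host and ports: replace with Node2's address and the
        // ports your transport and failure-detection protocols really use.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int[] ports = {7800, 57600};
        for (int port : ports) {
            System.out.println(host + ":" + port + " -> "
                + (reachable(host, port, 2000) ? "reachable" : "UNREACHABLE"));
        }
    }
}
```

Running this from Node1 and Node3 against Node2 would quickly show whether the asymmetry is visible at the plain TCP level.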
Hope this helps,
Cheers
--
You received this message because you are subscribed to the Google Groups "jgroups-raft" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jgroups-raft...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/jgroups-raft/4a27d42f-eb72-439f-80a5-868705568917n%40googlegroups.com.
Hi,
Thank you for your insights and suggestions. I’ve made some progress in narrowing down the issue.
Observations and Suspicions
Remote Call Behavior:
The issue seems related to a remote call using dispatcher.callRemoteMethodWithFuture(). When connecting to Node1, this node performs remote calls to Node2 and Node3 to retrieve their statuses. While this usually works, occasional failures occur when the call returns large amounts of data.
Potential Blocking:
I suspect that large data transfers might be causing Node2 to block or become unresponsive. This aligns with the observation that Node2 appears unresponsive to Node1 but remains accessible from Node3.
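One way to keep a slow or blocked member from hanging the caller is to bound the wait on each future. The sketch below uses plain `java.util.concurrent` types to illustrate the pattern; in the real code the future would come from `dispatcher.callRemoteMethodWithFuture()`, and the timeout value here is purely illustrative.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedStatusCall {
    // Waits at most timeoutMs for a status reply; maps a timeout to a
    // placeholder value instead of blocking the caller indefinitely.
    static String statusOrTimeout(CompletableFuture<String> future, long timeoutMs) {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "NO_REPLY";   // the node looked unresponsive within the budget
        } catch (Exception e) {
            return "ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Simulates a blocked node: a future that never completes.
        CompletableFuture<String> never = new CompletableFuture<>();
        System.out.println(statusOrTimeout(never, 200));                          // prints NO_REPLY
        System.out.println(statusOrTimeout(
            CompletableFuture.completedFuture("OK"), 200));                       // prints OK
    }
}
```

A bounded wait doesn't fix the underlying blockage, but it turns "the UI hangs" into "Node2 reported no reply", which is much easier to correlate with logs on the affected node.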
JGroups Configuration:
I’ve been reading about max_credits and the FRAG protocol in JGroups. While the FRAG protocol is designed to handle large messages by fragmenting them, I’m still unclear about how max_credits might impact communication. Could misconfigured max_credits or fragmentation settings lead to blocked communication when large messages are sent?
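For reference, these knobs live in the flow-control and fragmentation protocols of the stack. In XML form (a programmatic setup would call the corresponding setters on the protocol objects) the relevant settings look roughly like this; the values shown are illustrative, not a recommendation:

```xml
<!-- Flow control: a sender may have at most max_credits bytes in flight per
     receiver; once credits are exhausted the sender blocks until the receiver
     replenishes them. A receiver that stops replenishing can thus stall senders. -->
<UFC max_credits="2M" min_threshold="0.4"/>
<MFC max_credits="2M" min_threshold="0.4"/>

<!-- Fragmentation: messages larger than frag_size are split and reassembled. -->
<FRAG2 frag_size="60K"/>
```

So a plausible mechanism for your symptom is flow-control credit exhaustion on the Node1→Node2 path during a large transfer, while the Node3→Node2 path still has credits; checking the flow-control stats with `probe` when the hang occurs would confirm or rule this out.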
Workaround:
I modified the application logic to avoid large remote calls. I still need to test it, but I hope this addresses the issue.
I’ll continue investigating the max_credits and FRAG settings to see if adjusting them resolves the issue. If you have any specific recommendations or debugging tips for these parameters, I’d greatly appreciate it.
Thank you again for your help!
Kind regards,
Martin
-- Bela Ban | http://www.jgroups.org
Good Morning Bela,
(1.a) I am using one channel for both JGroups and jgroups-raft.
(1.b) I invoke callMethodWithFuture() regularly. I have one method that is responsible for all the remote (JGroups) calls. I am not sure what you are looking for. Yes, it can happen that two calls are running at the same time.
(1.c) It is very difficult to reproduce the problem, because the system can run for days without any problem. I am trying to get closer to the cause.
(2) OK. Now I wonder whether the multiple callMethodWithFuture() invocations could cause the issue, and whether making them respond faster just made it less likely that two run at the same time.
(3.a) OK, I did not see a warning.
(3.b) I am using JGroups 5.4.5.Final and jgroups-raft 1.0.14.Final.
Since I am configuring the stack via Java and not via XML, it is not so easy to share the configuration. Here are some code snippets: