onTimeout errors, cluster frozen, members left (does use_nio=true help?)


kumar bava

Feb 9, 2026, 8:07:53 AM
to jgroups-dev

We started to see a remote command timeout problem in one of our clusters after we moved to RHEL9 (from RHEL7).

Setup:
=====
proc1 and proc2 processes run on one set of VMs (4 VMs, i.e., 4 instances of each process).
proc3 processes run on another set of VMs (4 VMs, i.e., 4 instances). A firewall sits between these two sets of VMs.

cluster1: proc1, proc2 are connected over UDP (multicast). Total 8 members
cluster2: proc2, proc3 are connected through TUNNEL (via GossipRouter). Total 8 members

cluster1 config:
================
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd"
        >
    <UDP
         bind_addr="match-interface:ens256"
         mcast_port="${jgroups.udp.mcast_port:45588}"
         mcast_addr="224.3.3.3"
         diag.enabled="false"
         thread_pool.min_threads="0"
         thread_pool.max_threads="200"
         thread_pool.keep_alive_time="30000"/>

    <PING num_discovery_runs="5"/>
    <MERGE3 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD_ALL3/>
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 xmit_interval="500"/>
    <UNICAST3 xmit_interval="500" />
    <pbcast.STABLE desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="5000"/>
    <UFC max_credits="10M"
         min_threshold="0.4"/>
    <MFC max_credits="10M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <pbcast.STATE_TRANSFER />
    <CENTRAL_LOCK />
</config>


cluster2 config:
================
<?xml version="1.0"?>
<config
    xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd"
    xmlns="urn:org:jgroups"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <TUNNEL bind_addr="match-interface:ens256"
        bind_port="0"
        use_nio="false"
        gossip_router_hosts="<ip1>[7800],<ip2>[7800],<ip3>[7800],<ip4>[7800]"
        port_range="0" />
    <PING num_discovery_runs="3"/>
    <MERGE3 max_interval="30000" min_interval="10000"/>
    <FD_ALL3 />
    <VERIFY_SUSPECT />
    <pbcast.NAKACK2 use_mcast_xmit="false" />
    <UNICAST3 />
    <pbcast.STABLE desired_avg_gossip="50000" max_bytes="4M" />
    <pbcast.GMS print_local_addr="true" join_timeout="12000"/>
    <UFC />
    <FRAG2 />
    <pbcast.STATE_TRANSFER />
    <CENTRAL_LOCK />
</config>

gossiprouter:
============
GossipRouter is started as a systemd unit on all 4 VMs (where proc1 and proc2 are running):
/usr/bin/java .... org.jgroups.stack.GossipRouter -bindaddress <ip> -port 7800


The problem we are noticing is on cluster2 only.

Observations when the problem occurred:
(i) member1 shows the status below

>> ss -tnp | egrep '7800|java'
Thu Feb  5 04:27:46 PM UTC 2026
ESTAB      0       0      [::ffff:172.16.136.16]:58187 [::ffff:172.16.136.16]:46695 users:(("java",pid=881471,fd=56))
ESTAB      0       0      [::ffff:172.16.136.16]:46695 [::ffff:172.16.136.17]:56305 users:(("java",pid=881563,fd=110))
ESTAB      0       0      [::ffff:172.16.136.16]:7800  [::ffff:172.16.136.19]:55627 users:(("java",pid=881447,fd=47))
ESTAB      4026691 0      [::ffff:172.16.110.16]:40310 [::ffff:10.117.107.21]:443   users:(("java",pid=881471,fd=65))
ESTAB      0       0      [::ffff:172.16.136.16]:7800  [::ffff:172.16.130.25]:54247 users:(("java",pid=881447,fd=49))
ESTAB      0       0      [::ffff:172.16.136.16]:7800  [::ffff:172.16.136.17]:45767 users:(("java",pid=881447,fd=45))
CLOSE-WAIT 1207978 0      [::ffff:172.16.110.16]:40264 [::ffff:10.117.107.21]:443   users:(("java",pid=881471,fd=67))
ESTAB      0       0      [::ffff:172.16.136.16]:7800  [::ffff:172.16.130.35]:53693 users:(("java",pid=881447,fd=50))
ESTAB      3507747 0      [::ffff:172.16.110.16]:40238 [::ffff:10.117.107.21]:443   users:(("java",pid=881471,fd=71))
ESTAB      0       145237 [::ffff:172.16.136.16]:7800  [::ffff:172.16.136.16]:59511 users:(("java",pid=881447,fd=44))
ESTAB      59311   0      [::ffff:172.16.136.16]:43245 [::ffff:172.16.136.18]:7800  users:(("java",pid=881471,fd=54))
ESTAB      50592   0      [::ffff:172.16.136.16]:39291 [::ffff:172.16.136.19]:7800  users:(("java",pid=881471,fd=55))
ESTAB      108261  0      [::ffff:172.16.136.16]:59511 [::ffff:172.16.136.16]:7800  users:(("java",pid=881471,fd=53))
ESTAB      0       0      [::ffff:172.16.136.16]:46695 [::ffff:172.16.136.16]:58187 users:(("java",pid=881563,fd=100))
ESTAB      3633542 0      [::ffff:172.16.110.16]:40278 [::ffff:10.117.107.21]:443   users:(("java",pid=881471,fd=61))
ESTAB      0       0      [::ffff:172.16.136.16]:7800  [::ffff:172.16.130.15]:56341 users:(("java",pid=881447,fd=51))
ESTAB      0       0      [::ffff:172.16.136.16]:7800  [::ffff:172.16.130.45]:56693 users:(("java",pid=881447,fd=48))
ESTAB      57408   0      [::ffff:172.16.136.16]:50427 [::ffff:172.16.136.17]:7800  users:(("java",pid=881471,fd=58))
ESTAB      0       0      [::ffff:172.16.136.16]:7800  [::ffff:172.16.136.18]:50909 users:(("java",pid=881447,fd=46))
ESTAB      0       0      [::ffff:172.16.136.16]:41721 [::ffff:172.16.136.17]:45435 users:(("java",pid=881563,fd=101))
ESTAB      5268046 0      [::ffff:172.16.110.16]:40292 [::ffff:10.117.107.21]:443   users:(("java",pid=881471,fd=63))

(ii) One or more members start reporting messages like the one below:
 ERROR InvocationContextInterceptor.java:130: ISPN000136: Error executing command ReadWriteKeyCommand on Cache 'org.infinispan.LOCKS', writing keys
 [ClusteredLockKey{name=<lockName>}]
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 7012 from member-1 after 15 seconds
        at org.infinispan.remoting.transport.impl.SingleTargetRequest.onTimeout(SingleTargetRequest.java:93)

(iii) After a few minutes, member1 leaves the cluster.
Similar ERRORs to (ii) are then reported among the other members (each complaining about the others), and all the other members start to leave one after the other.
The "ss" command shows output similar to the above (data stuck in Recv-Q / Send-Q).

From some reading, it seemed that setting "use_nio=true" might help avoid this situation, particularly on RHEL8/9.

So I tried that: I updated the cluster2 config with use_nio="true" and also added diag.enabled="false" to it.
I also added the argument "-nio true" to the GossipRouter service.
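For reference, a sketch of the cluster2 TUNNEL element after these changes (assuming diag.enabled belongs on the transport element, as it does on UDP in the cluster1 config; all other attributes and the &lt;ipN&gt; placeholders are unchanged from the original):

```xml
<!-- cluster2 TUNNEL after the change: use_nio flipped to true,
     diag.enabled added; everything else as before -->
<TUNNEL bind_addr="match-interface:ens256"
    bind_port="0"
    use_nio="true"
    diag.enabled="false"
    gossip_router_hosts="<ip1>[7800],<ip2>[7800],<ip3>[7800],<ip4>[7800]"
    port_range="0" />
```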

The problem became harder to reproduce with this updated config. It did occur once more, but this time only the problematic member left and the rest of the cluster continued
working as-is.

I would like to understand in detail what could be wrong with our current config, and whether the config change (use_nio=true) has really improved things, or am I just (un)lucky?

These are Infinispan clusters (using JGroups underneath).
jgroups version: 5.2.22


Bela Ban

Feb 10, 2026, 10:26:01 AM
to jgrou...@googlegroups.com
1. If you didn't see any issues when running on RHEL 7, and you see issues now on RHEL 9, and you use the same version/configuration of JGroups/Infinispan, then I suspect RHEL 9 is the culprit.

2. You have a lot of unread/unsent data in the recv/send queues (as shown by ss). Can you disable the firewall / SELinux and see whether this helps?

3. You have 4 GossipRouters running on VMs separated by a firewall; this means that every client (running the TUNNEL stack) will connect to all 4 GossipRouters, and every time a message is sent, a GossipRouter will be picked randomly. This causes TCP connections between proc2 and proc3 processes across the firewall. Make sure you have all relevant ports open.
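One way to verify the ports point: a reachability check from each client host toward every GossipRouter. This is my own sketch (the host list is a placeholder, and it uses bash's /dev/tcp redirection, which needs no extra tools):

```shell
# Hypothetical check: can this host open a TCP connection to <host>:<port>?
# Returns 0 on success, non-zero on refusal or a 3-second timeout.
check_port() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# e.g. run from a proc3 VM against all four GossipRouters:
#   for h in ip1 ip2 ip3 ip4; do
#     check_port "$h" 7800 && echo "$h:7800 open" || echo "$h:7800 BLOCKED"
#   done
```

If any of the four comes back blocked, messages routed through that GossipRouter would stall even though the others work, which matches the intermittent nature of the problem.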


-- 
Bela Ban | http://www.jgroups.org
