We started seeing a remote command timeout problem in one of our clusters after we moved from RHEL7 to RHEL9.
Setup:
=====
proc1 and proc2 run on one set of VMs (4 VMs, i.e., 4 instances of each process).
proc3 runs on another set of VMs (4 VMs, i.e., 4 instances). There is a firewall between these two sets of VMs.
cluster1: proc1 and proc2, connected over UDP (multicast); 8 members in total.
cluster2: proc2 and proc3, connected through TUNNEL (via GossipRouter); 8 members in total.
cluster1 config:
================
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
<UDP
bind_addr="match-interface:ens256"
mcast_port="${jgroups.udp.mcast_port:45588}"
mcast_addr="224.3.3.3"
diag.enabled="false"
thread_pool.min_threads="0"
thread_pool.max_threads="200"
thread_pool.keep_alive_time="30000"/>
<PING num_discovery_runs="5"/>
<MERGE3 max_interval="30000"
min_interval="10000"/>
<FD_SOCK/>
<FD_ALL3/>
<VERIFY_SUSPECT timeout="1500" />
<BARRIER />
<pbcast.NAKACK2 xmit_interval="500"/>
<UNICAST3 xmit_interval="500" />
<pbcast.STABLE desired_avg_gossip="50000"
max_bytes="4M"/>
<pbcast.GMS print_local_addr="true" join_timeout="5000"/>
<UFC max_credits="10M"
min_threshold="0.4"/>
<MFC max_credits="10M"
min_threshold="0.4"/>
<FRAG2 frag_size="60K" />
<pbcast.STATE_TRANSFER />
<CENTRAL_LOCK />
</config>
cluster2 config:
================
<?xml version="1.0"?>
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
<TUNNEL bind_addr="match-interface:ens256"
bind_port="0"
use_nio="false"
gossip_router_hosts="<ip1>[7800],<ip2>[7800],<ip3>[7800],<ip4>[7800]"
port_range="0" />
<PING num_discovery_runs="3"/>
<MERGE3 max_interval="30000" min_interval="10000"/>
<FD_ALL3 />
<VERIFY_SUSPECT />
<pbcast.NAKACK2 use_mcast_xmit="false" />
<UNICAST3 />
<pbcast.STABLE desired_avg_gossip="50000" max_bytes="4M" />
<pbcast.GMS print_local_addr="true" join_timeout="12000"/>
<UFC />
<FRAG2 />
<pbcast.STATE_TRANSFER />
<CENTRAL_LOCK />
</config>
gossiprouter:
============
GossipRouter is started as a systemd unit on all 4 VMs (where proc1 and proc2 are running):
/usr/bin/java .... org.jgroups.stack.GossipRouter -bindaddress <ip> -port 7800
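For reference, the unit file is essentially as follows (a minimal sketch; the unit name, the <jgroups-classpath> placeholder and the Restart policy are illustrative, not our exact file):
[Unit]
Description=JGroups GossipRouter
After=network-online.target

[Service]
# <jgroups-classpath> stands in for the classpath elided in the command above
ExecStart=/usr/bin/java -cp <jgroups-classpath> org.jgroups.stack.GossipRouter -bindaddress <ip> -port 7800
Restart=on-failure

[Install]
WantedBy=multi-user.target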
The problem we are noticing is on cluster2 only.
Observations when the problem occurred:
(i) member1 shows the status below:
>> ss -tnp | egrep '7800|java'
Thu Feb 5 04:27:46 PM UTC 2026
ESTAB 0 0 [::ffff:172.16.136.16]:58187 [::ffff:172.16.136.16]:46695 users:(("java",pid=881471,fd=56))
ESTAB 0 0 [::ffff:172.16.136.16]:46695 [::ffff:172.16.136.17]:56305 users:(("java",pid=881563,fd=110))
ESTAB 0 0 [::ffff:172.16.136.16]:7800 [::ffff:172.16.136.19]:55627 users:(("java",pid=881447,fd=47))
ESTAB 4026691 0 [::ffff:172.16.110.16]:40310 [::ffff:10.117.107.21]:443 users:(("java",pid=881471,fd=65))
ESTAB 0 0 [::ffff:172.16.136.16]:7800 [::ffff:172.16.130.25]:54247 users:(("java",pid=881447,fd=49))
ESTAB 0 0 [::ffff:172.16.136.16]:7800 [::ffff:172.16.136.17]:45767 users:(("java",pid=881447,fd=45))
CLOSE-WAIT 1207978 0 [::ffff:172.16.110.16]:40264 [::ffff:10.117.107.21]:443 users:(("java",pid=881471,fd=67))
ESTAB 0 0 [::ffff:172.16.136.16]:7800 [::ffff:172.16.130.35]:53693 users:(("java",pid=881447,fd=50))
ESTAB 3507747 0 [::ffff:172.16.110.16]:40238 [::ffff:10.117.107.21]:443 users:(("java",pid=881471,fd=71))
ESTAB 0 145237 [::ffff:172.16.136.16]:7800 [::ffff:172.16.136.16]:59511 users:(("java",pid=881447,fd=44))
ESTAB 59311 0 [::ffff:172.16.136.16]:43245 [::ffff:172.16.136.18]:7800 users:(("java",pid=881471,fd=54))
ESTAB 50592 0 [::ffff:172.16.136.16]:39291 [::ffff:172.16.136.19]:7800 users:(("java",pid=881471,fd=55))
ESTAB 108261 0 [::ffff:172.16.136.16]:59511 [::ffff:172.16.136.16]:7800 users:(("java",pid=881471,fd=53))
ESTAB 0 0 [::ffff:172.16.136.16]:46695 [::ffff:172.16.136.16]:58187 users:(("java",pid=881563,fd=100))
ESTAB 3633542 0 [::ffff:172.16.110.16]:40278 [::ffff:10.117.107.21]:443 users:(("java",pid=881471,fd=61))
ESTAB 0 0 [::ffff:172.16.136.16]:7800 [::ffff:172.16.130.15]:56341 users:(("java",pid=881447,fd=51))
ESTAB 0 0 [::ffff:172.16.136.16]:7800 [::ffff:172.16.130.45]:56693 users:(("java",pid=881447,fd=48))
ESTAB 57408 0 [::ffff:172.16.136.16]:50427 [::ffff:172.16.136.17]:7800 users:(("java",pid=881471,fd=58))
ESTAB 0 0 [::ffff:172.16.136.16]:7800 [::ffff:172.16.136.18]:50909 users:(("java",pid=881447,fd=46))
ESTAB 0 0 [::ffff:172.16.136.16]:41721 [::ffff:172.16.136.17]:45435 users:(("java",pid=881563,fd=101))
ESTAB 5268046 0 [::ffff:172.16.110.16]:40292 [::ffff:10.117.107.21]:443 users:(("java",pid=881471,fd=63))
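To check whether these counters ever drain or stay stuck, the same snapshot can be sampled repeatedly, e.g.:
>> watch -n 5 "ss -tnop | egrep '7800|java'"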
(ii) One or more members start reporting messages like the one below:
ERROR InvocationContextInterceptor.java:130: ISPN000136: Error executing command ReadWriteKeyCommand on Cache 'org.infinispan.LOCKS', writing keys
[ClusteredLockKey{name=<lockName>}]
org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 7012 from member-1 after 15 seconds
at org.infinispan.remoting.transport.impl.SingleTargetRequest.onTimeout(SingleTargetRequest.java:93)
(iii) After a few minutes, member1 leaves the cluster.
Similar ERRORs as in (ii) are then reported among the other members (each complaining about the others), and all the other members start to leave one after another.
The "ss" command output looks similar to the above (bytes stuck in Recv-Q/Send-Q).
From some reading, it appears that setting use_nio="true" might help avoid this situation, particularly on RHEL8/9.
So I updated the cluster2 config with use_nio="true" and also added diag.enabled="false" to it.
I also added the argument "-nio true" to the GossipRouter service.
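For clarity, the updated TUNNEL stanza now looks like this (a sketch: only use_nio and diag.enabled differ from the config shown above):
<TUNNEL bind_addr="match-interface:ens256"
        bind_port="0"
        use_nio="true"
        diag.enabled="false"
        gossip_router_hosts="<ip1>[7800],<ip2>[7800],<ip3>[7800],<ip4>[7800]"
        port_range="0" />
and the GossipRouter command line gained the flag:
/usr/bin/java .... org.jgroups.stack.GossipRouter -bindaddress <ip> -port 7800 -nio true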
With this updated config, the problem became harder to reproduce. It was reproduced once; however, this time only the problematic member left, and the rest of the cluster continued working as-is.
I would like to understand in detail what could be wrong with our current config, and whether the change (use_nio=true) has really improved things or whether I have just been (un)lucky.
These are Infinispan clusters (with JGroups underneath).
JGroups version: 5.2.22