JGroups cluster hangs for 15 minutes due to TCP retransmission


Paul Luk

Jan 5, 2021, 11:16:20 PM
to jgroups-dev
Hi Bela,

  I am using the JGroups (4.1.10) TUNNEL protocol with a gossip router.

  I hit an issue in an OpenShift environment: due to an unknown network hiccup or OpenShift app node problem, TCP responses from the gossip router are being dropped for the TCP stream.

  As a result, the sender keeps retransmitting over TCP until, after about 15 minutes, the retransmission finally fails. JGroups then discards the TCP connection and opens a new one to the gossip router, which succeeds, and the cluster resumes.

  Having multiple gossip routers doesn't seem to help in this situation.

  I think we could lower '/proc/sys/net/ipv4/tcp_retries2' at the OS level, but that would affect all applications running in the same OpenShift cluster. Do you think this could be handled in JGroups instead (say, set a timeout of 1 minute, then discard the existing TCP connection and reconnect)? I was not able to find a related setting in the documentation.
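
  For context, the 15-minute figure is consistent with the Linux default of tcp_retries2 = 15. Below is a rough sketch of the arithmetic, assuming the retransmission timer starts at the Linux minimum of 200 ms, doubles after each unacknowledged retransmission, and is capped at 120 s. The real timer starts from the measured round-trip time, so actual numbers will vary, and the function name is made up for illustration.

```python
def hypothetical_timeout_ms(retries, rto_min_ms=200, rto_max_ms=120_000):
    """Estimate how long TCP waits before giving up on a connection.

    Assumptions (not taken from the thread): the retransmission timeout
    (RTO) starts at rto_min_ms, doubles after every unacknowledged
    retransmission, and is capped at rto_max_ms. The connection is
    dropped after the wait that follows the last allowed retransmission,
    so there are retries + 1 waits in total.
    """
    total_ms = 0
    rto_ms = rto_min_ms
    for _ in range(retries + 1):
        total_ms += rto_ms
        rto_ms = min(rto_ms * 2, rto_max_ms)
    return total_ms


if __name__ == "__main__":
    for retries in (5, 8, 15):
        seconds = hypothetical_timeout_ms(retries) / 1000
        print(f"tcp_retries2={retries}: ~{seconds:.1f} s (~{seconds / 60:.1f} min)")
```

  Under these assumptions the default of 15 gives 924.6 s, roughly 15.4 minutes, while a value of 8 gives about 102 s and a value of 5 about 12.6 s.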

  Thank you.

Below is the configuration I use.
-------------------------------------
<stack name="tunnelStack">
  <transport type="TUNNEL" socket-binding="jgroups-tcp">
    <property name="gossip_router_hosts">host1[port],host2[port]</property>
    <property name="reconnect_interval">3000</property>
    <property name="port_range">0</property>
    <property name="ergonomics">false</property>
  </transport>
  <protocol type="PING">
    <property name="async_discovery">true</property>
    <property name="ergonomics">false</property>
  </protocol>
  <protocol type="MERGE3">
    <property name="max_interval">10000</property>
    <property name="min_interval">3000</property>
    <property name="ergonomics">false</property>
  </protocol>
  <protocol type="FD_ALL">
    <property name="timeout">9000</property>
    <property name="timeout_check_interval">2000</property>
    <property name="interval">3000</property>
    <property name="ergonomics">false</property>
  </protocol>
  <protocol type="VERIFY_SUSPECT"/>

  <protocol type="pbcast.NAKACK2">
    <property name="use_mcast_xmit">false</property>
  </protocol>
  <protocol type="UNICAST3"/>
  <protocol type="pbcast.STABLE"/>

  <protocol type="pbcast.GMS"/>
  <protocol type="MFC"/>
  <protocol type="FRAG3"/>
</stack>

Bela Ban

Jan 7, 2021, 10:10:08 AM
to jgrou...@googlegroups.com
So you're saying that the underlying TCP connection is not able to
retransmit an IP packet? Then that's an issue in OpenShift that needs to
be fixed, not in JGroups. Perhaps a firewall rule kicking in?

So obviously, the TCP connection is not closed, neither on the server
nor client side. In this case, TCP should eventually succeed with the
retransmission.

--
Bela Ban, JGroups lead (http://www.jgroups.org)
