Retransmission storms

Dmitry

Oct 16, 2024, 3:41:43 AM
to jgroups-dev
We're running a JGroups 4.2.30 cluster of approximately 50 nodes on a fairly complicated (multiple switches, media converters) 1 Gb network. Here is the stack configuration:

    <UDP
         mcast_port="45588"
         ip_ttl="4"
         tos="8"
         ucast_recv_buf_size="5M"
         ucast_send_buf_size="5M"
         mcast_recv_buf_size="5M"
         mcast_send_buf_size="5M"
         max_bundle_size="64K"
         enable_diagnostics="true"
         thread_naming_pattern="jgroup"
         logical_addr_cache_max_size="200"
         thread_pool.min_threads="0"
         thread_pool.max_threads="500"
         thread_pool.keep_alive_time="6000"/>
    <PING />
    <MERGE3 max_interval="2000"
            min_interval="1000"
            check_interval="3000"/>
    <FD_SOCK/>
    <FD_ALL
        interval="1000"
        timeout="3000"
        timeout_check_interval="1500"
    />
    <VERIFY_SUSPECT timeout="100" />
    <pbcast.NAKACK2 xmit_interval="500"
                    xmit_table_num_rows="100"
                    xmit_table_msgs_per_row="2000"
                    xmit_table_max_compaction_time="30000"
                    use_mcast_xmit="true"
                    discard_delivered_msgs="true"/>
    <UNICAST3 xmit_interval="500"
              xmit_table_num_rows="100"
              xmit_table_msgs_per_row="2000"
              xmit_table_max_compaction_time="60000"
              conn_expiry_timeout="0"/>
    <pbcast.STABLE desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS
        join_timeout="5000"/>
    <FRAG2 frag_size="60K"  />
    <RSVP resend_interval="2000" timeout="10000"/>

Every few days, we see UDP packet loss that results in message retransmissions by JGroups. Once this happens, the volume of retransmissions quickly increases, reaching levels the network cannot handle. At this point the cluster goes belly-up and does not recover, even when normal (non-retransmission) traffic becomes very low.

Looking through the logs, we found two issues.

We saw a lot of "JGRP000032: no physical address for ..., dropping message" errors, which prevented unicast messages that are part of the cluster merge procedure from being sent.

The second problem is that multiple nodes repeatedly request retransmissions of the same messages, overwhelming the network. We tried to address this by setting use_mcast_xmit=true. This didn't help at all, since NAKACK2 was still retransmitting in response to every request from every node. We then added code to NAKACK2 that ignores a retransmission request if a request for the same set of messages was processed less than (xmit_interval/2) ago. This helped a lot in preventing retransmission storms, but didn't eliminate them entirely.
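
In simplified form, the idea is something like the sketch below. This is illustrative only, not our actual patch; the class, method names, and keying scheme are made up, and the real change hooks into NAKACK2's handling of xmit requests:

    // Illustrative sketch of the duplicate-xmit-request suppression idea.
    // Not actual NAKACK2 code; names and structure are invented for clarity.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class XmitRequestSuppressor {
        private final long suppressWindowMs;                  // e.g. xmit_interval / 2
        private final Map<String,Long> lastServed = new ConcurrentHashMap<>();

        public XmitRequestSuppressor(long suppressWindowMs) {
            this.suppressWindowMs = suppressWindowMs;
        }

        /** Returns false if the same seqno range from the same original sender
         *  was already retransmitted within the suppression window. */
        public boolean shouldHandle(Object originalSender, long firstSeqno, long lastSeqno) {
            String key = originalSender + ":" + firstSeqno + "-" + lastSeqno;
            long now = System.currentTimeMillis();
            Long prev = lastServed.get(key);
            if (prev != null && now - prev < suppressWindowMs)
                return false;                                 // duplicate request, drop it
            lastServed.put(key, now);
            // (a real implementation would also purge stale entries from the map)
            return true;
        }
    }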

We would appreciate advice, particularly on how to prevent retransmissions from growing indefinitely and eventually swamping the network. Does our change to NAKACK2 look reasonable, or is there a better way to suppress duplicate retransmissions in response to requests from multiple nodes? Is upgrading to JGroups 5 likely to buy us anything?

bel...@gmail.com

Oct 16, 2024, 3:58:13 AM
to jgroups-dev
The physical-address-not-found issue was a regression in 5.3/5.4 (it works OK in 5.2) and has been fixed [1]. I don't know about 4.x.
I see that you don't have any flow control protocols in your config; is this intentional? This is not a standard config, and if you have fast senders, receivers might run out of memory.
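For reference, in the stock udp.xml the flow control protocols sit between GMS and FRAG2; with your existing values, that part of the stack would look roughly like this (the max_credits/min_threshold values are just the shipped defaults and would need tuning for your traffic):

    <pbcast.GMS join_timeout="5000"/>
    <UFC max_credits="2M" min_threshold="0.4"/>
    <MFC max_credits="2M" min_threshold="0.4"/>
    <FRAG2 frag_size="60K"/>
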
Note that 5.4 (which will be released soon) will have NAKACK4, which, contrary to NAKACK2, uses a fixed-size buffer for sending messages, so you can calculate the maximum amount of memory used for storing messages. In addition, this allows MFC and STABLE to be removed.

Dmitry

Oct 17, 2024, 2:39:17 AM
to jgroups-dev
Thanks. In our case, the physical-address-not-found issue seems to be caused by a race between the thread that adds the address to the cache and the thread that tries to retrieve it.
We did intentionally remove the flow control protocols, as they were causing occasional long (~30 s) pauses. Perhaps we should have tried fine-tuning them instead. We also shortened the failure detection timeouts in an effort to prevent a single misbehaving node from bringing down the whole cluster. Otherwise, I thought our config was fairly standard.

Bela Ban

Oct 17, 2024, 2:48:16 AM
to jgrou...@googlegroups.com
Removing the flow control protocols is not recommended. It would have been better to investigate what's causing those 30 s pauses...
Is this reproducible? With 5.x?

-- 
Bela Ban | http://www.jgroups.org

Dmitry

Oct 19, 2024, 10:19:12 AM
to jgroups-dev
The problem with flow control was encountered before 5.x was officially released. My recollection is that it was not reliably reproducible but happened on a regular basis. Removing flow control and shortening the failure detection timeouts solved the problem then, and we had been using JGroups without any significant issues until we recently moved the system from a single-switch cluster to a more complicated network. Yesterday we managed to find a networking problem that caused occasional loss of UDP packets, so hopefully we'll see more stable operation from now on. It's still a worry that temporary, moderate packet loss can cause a retransmission storm that permanently breaks the cluster (the only way to restore it was to restart all nodes that publish a significant amount of data).

We're considering switching to 5.x at some point. Thanks for your advice.

Bela Ban

Oct 21, 2024, 12:42:27 PM
to jgrou...@googlegroups.com
Increasing xmit_interval in NAKACK2 and UNICAST3 will certainly also help. Note that I created [1] today, so that we won't need MFC/UFC anymore if NAKACK4 [2] and UNICAST5 [1] are in the stack. This will prevent retransmission storms.
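For example (2000 ms is only an illustrative value; keep the other attributes from your current config and tune the interval against your actual loss pattern):

    <!-- illustrative: raise xmit_interval from 500 to e.g. 2000 ms -->
    <pbcast.NAKACK2 xmit_interval="2000" use_mcast_xmit="true" discard_delivered_msgs="true"/>
    <UNICAST3 xmit_interval="2000" conn_expiry_timeout="0"/>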


[1] https://issues.redhat.com/browse/JGRP-2843
[2] https://issues.redhat.com/browse/JGRP-2780