Ack Aggregation during Recovery

Alex

Jul 22, 2024, 9:33:03 AM
to BBR Development
Hello,

Noticed a weird quirk in bbr_update_ack_aggregation and hoping someone can shed some light on this.

During normal transmission in the Open congestion state, some ACKs are going to be delayed or compressed, which is what bbr_update_ack_aggregation is there to handle. It estimates aggregation by comparing how many bytes were actually acked against how many bytes should have been acked over the same interval; the excess is our ack aggregation.
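
For reference, my rough mental model of that calculation (a paraphrase, not the actual kernel code; names and units are simplified to bytes and microseconds) is something like:

    #include <stdint.h>

    /* Toy model of the extra_acked computation in bbr_update_ack_aggregation(). */
    struct aggr_state {
        uint64_t ack_epoch_start_us; /* when the current aggregation epoch began */
        uint64_t ack_epoch_acked;    /* bytes (cumulative + SACKed) acked in this epoch */
    };

    /* Bytes acked beyond what the current max_bw estimate predicts for this epoch. */
    uint64_t extra_acked(struct aggr_state *st, uint64_t now_us,
                         uint64_t newly_acked, uint64_t bw_bytes_per_us)
    {
        uint64_t epoch_us = now_us - st->ack_epoch_start_us;
        uint64_t expected = bw_bytes_per_us * epoch_us; /* what the bw estimate predicts */

        st->ack_epoch_acked += newly_acked;             /* what actually arrived */
        if (st->ack_epoch_acked <= expected)
            return 0;                                   /* (the real code also resets the epoch here) */
        return st->ack_epoch_acked - expected;          /* feeds a windowed max that inflates cwnd */
    }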

If my understanding is correct, then I have a question.

During recovery, SACKs are not going to be delayed; they'll be sent immediately, potentially SACKing quite a lot of bytes very quickly, much faster than the sending rate and faster than the data would normally be acked. However, the actual max_bw is not going to increase (much), since it is bounded by min(send_rate, ack_rate).

To illustrate: after losing several packets, BBR could receive multiple back-to-back SACKs, each carrying newly SACKed data. However, delivered_mstamp is only going to advance by a very small increment between each SACK, and hence expected_acked is also not going to grow very much. Consequently, much of the SACKed bytes are going to be counted as extra_acked (since bbr_bw has not changed).
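
As a concrete example (with entirely made-up numbers): at 10 Gbps, max_bw is roughly 1.25 KB per microsecond. If the aggregation epoch started just before the burst, and a run of SACKs covering ~150 KB (about 100 full-sized packets) arrives within ~50 us, then expected_acked only grows by about 1.25 KB/us * 50 us = ~62 KB over that window, so extra_acked would jump by roughly 87 KB (close to 60 full-sized packets), even though nothing about the path's actual capacity changed.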

Yes, eventually expected_acked is going to catch up and extra_acked is going to drop back down, but in the meantime we have several round trips with a new maximum extra_acked, and we're increasing cwnd quickly precisely when we don't want to (during recovery).

Is this expected and desired behaviour? Or am I completely wrong?
I looked at the RFC and couldn't find anything about this. Thanks for reading!

Regards,
Alex

Neal Cardwell

Jul 22, 2024, 10:26:40 AM
to Alex, BBR Development
Hi Alex,

Thanks for the post!

I think the key part of your post is this one:

> During recovery sacks are not going to be delayed, they'll be sent instantly,
> potentially sacking quite a lot of bytes very quickly, much faster
> than the sending rate and faster than the data would normally be acked. 

Note that to the extent that a network round-trip path, including the receiver, acknowledges data (cumulatively or selectively) ASAP, there will be no aggregation effects (all data will be smoothly acknowledged at exactly the rate at which it was transmitted).

Due to speed-of-light limits, it is not possible for a receiver to SACK data "much faster than the sending rate" unless at some earlier point some part of the network path delayed the transmission or generation of packets (or unless the receiver is misbehaving and violating the protocol spec by ACKing data that has not yet arrived).

Aggregation effects happen when something in the path first (a) delays the generation, transmission, or processing of either data packets or ACK packets, and then (b) initiates a burst of delayed packet/ACK generation/transmission/processing by pulling from some queue of delayed packets (at a rate that is potentially faster than the long-term data rate).

To the extent that many OSes will, when receiving out-of-order data, disable delayed ACKs and expedite the generation of ACKs with SACK blocks, this will reduce the degree of aggregation, because there will be less delay in handling of packets, and thus less opportunity for delayed work to happen in a burst.

Our experience in testing and trace analysis of real public Internet and datacenter traffic suggests that the biggest causes of aggregation are effects below the TCP layer: 

+ L2/link-layer mechanisms, where some link-layer technologies must queue packets while they wait for their turn to transmit on some kind of shared medium that is multiplexed in time between different senders: cellular, wifi, and DOCSIS links are like this

+ offload mechanisms, where hardware (TSO/LRO) or software (GSO, GRO, driver) offload mechanisms build batches of packets and release them in a burst

Those mechanisms are mostly agnostic to the sequence numbers of the TCP packets (since they happen at layers below TCP), and so they are agnostic to whether data is arriving out of sequence order, or whether ACKs contain SACK blocks, and thus will not change their aggregation behavior in scenarios with SACKs. For the mechanisms that are not agnostic to sequence numbers, such as LRO/GRO receiver aggregation or driver/qdisc ACK decimation algorithms, out-of-order packets tend to trigger immediate action, which tends to reduce opportunities for the kind of delay/burst pairing that is involved in aggregation effects.

The one widely-deployed mechanism that I'm aware of that potentially increases the degree of aggregation in recovery scenarios is the Linux TCP SACK compression mechanism added by Eric Dumazet in 2018.
However, note that even that mechanism limits the degree of aggregation to 1ms, which is the typical degree of aggregation from Linux TCP TSO/GSO anyway, and so typically does not actually increase the overall degree of aggregation during recovery, but instead makes the generation of ACKs with SACKs more closely reflect the typical degree of aggregation.

As a result of all these considerations, in our experience in testing and trace analysis, we have generally not seen higher degrees of aggregation in fast recovery scenarios.

Are you seeing something different in testing or trace analysis? Or was your concern mainly theoretical?

Thanks,
neal




Alex

Jul 22, 2024, 1:29:29 PM
to BBR Development
Hi Neal,

Thanks for the reply!

I was mostly guessing, trying to interpret results I was seeing. Sometimes during recovery I'd see the extra_acked field shoot up: if it was about 100 packets during steady state, it might go to 140 or 150 during recovery (this is on a 10 Gbps local link).
I understand that disabling delayed ACKs *should* reduce the degree of ack aggregation, but it seemed that on a short time scale (like a burst of SACKs), where epoch_us doesn't increase very much, the excess could go into extra_acked instead.
For the same reason that the rate sampler is bounded by the send interval and not just the ack interval, I thought the same thing could happen in the ack aggregation; after all, extra_acked is calculated by looking at the excess over what was expected.

But maybe what I was seeing was the aggregation epoch getting reset and all the SACKed bytes going into ack_epoch_acked.
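
(For what it's worth, my reading of the reset logic is roughly the following; the names are from memory and may not match the actual code:

    /* Start a new aggregation epoch if ACKs fall behind what the bw estimate
       predicts, or if a very large amount has already been acked in this epoch. */
    if (ack_epoch_acked <= expected_acked ||
        ack_epoch_acked + newly_acked >= ACK_EPOCH_RESET_THRESH) {
        ack_epoch_acked = 0;
        ack_epoch_start = delivered_mstamp;
        expected_acked  = 0;
    }

If that's right, then immediately after a reset, the next ACK's newly SACKed bytes all land in ack_epoch_acked against a near-zero expected_acked.)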

Regards,
Alex.

Neal Cardwell

Jul 22, 2024, 1:54:09 PM
to Alex, BBR Development
Hi Alex,

What's the exact receiver OS and kernel version here?

You mention "disabling delayed acks", but keep in mind that if this is a recent Linux kernel (since 2018-ish) then it will have the SACK compression mechanism enabled, in which case during recovery delayed ACKs will not be disabled, and instead ACKs may be delayed more than is the typical case.

Also, keep in mind that extra_acked increasing from around 100 to around 150 is not an unexpected increase at 10Gbps, if your MTU is 1500 Bytes (is it?). With a 1500B MTU, a full-sized 64KByte TSO burst is around 45 packets. So that increase from 100 to 150 is in the neighborhood of an increase from around 2 full-sized TSO bursts (90 packets) to around 3 full-sized TSO bursts (135 packets). That could easily happen due to any number of underlying causes for timing perturbation.

If you are curious about the dynamics, I would suggest grabbing a headers-only tcpdump packet trace and analysing with tcptrace/xplot.
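
For example (adjust the interface and port for your setup; these values are just illustrative):

    tcpdump -i eth0 -s 96 -w bbr-recovery.pcap 'tcp port 5201'

Then running "tcptrace -G bbr-recovery.pcap" will generate time-sequence graphs that you can view with xplot; the TSO burst structure and SACK timing are usually easy to see there.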

Best regards,
neal




Alex

Jul 22, 2024, 8:35:37 PM
to BBR Development
Hi Neal,

The receiver is macOS. The senders are two separate Linux machines running 6.9.9 kernels with BBRv3. Of note: I did turn off packet timestamps, because otherwise the receiver's NIC would not do LRO; this did cause the minimum RTT and cwnd values to increase a fair bit (which makes sense).

Are the packets inside a TSO burst paced, or are they sent as one (effectively at line rate)?
I think you're right and what I'm seeing is the TSO bursts, but I'll do a packet trace as suggested and update if I see anything interesting.

Kind Regards,
Alex

Neal Cardwell

Jul 22, 2024, 8:43:06 PM
to Alex, BBR Development
On Mon, Jul 22, 2024 at 8:35 PM Alex <sale...@gmail.com> wrote:
> Hi Neal,
>
> The receiver is macOS. The senders are two separate Linux machines running 6.9.9 kernels with BBRv3. Of note: I did turn off packet timestamps, because otherwise the receiver's NIC would not do LRO; this did cause the minimum RTT and cwnd values to increase a fair bit (which makes sense).

OK, thanks for the details!
 
> Are the packets inside a TSO burst paced, or are they sent as one (effectively at line rate)?

Packets inside a TSO burst are sent at line rate. 

> I think you're right and what I'm seeing is the TSO bursts, but I'll do a packet trace as suggested and update if I see anything interesting.

OK, sounds good. It would be interesting to hear what you find out.

Best regards,
neal

 