BBR and ACK suppression/decimation/acceleration

Greg White

unread,

Nov 17, 2016, 3:02:25 AM11/17/16

to BBR Development

Yuchung, Neal, et al.,

In cases where a middlebox (e.g. a cable modem) performs TCP ACK acceleration, you may be able to calculate a more accurate BDP than using your existing approach.

In cable modems, ACK acceleration is a proprietary feature implemented by the SOC vendors, so there is limited information on how it works. But, my understanding is that it works more-or-less as follows: the cable modem maintains a separate queue for TCP ACKs; upon arrival of a new ACK, it examines this queue and if it finds an older ACK for the same connection, it replaces the older ACK with the new one, so only the newest (highest seq no) ACK for each connection is kept. When a transmission opportunity arrives, the cable modem preferentially dequeues and transmits the ACKs. I believe that the algorithms have more sophisticated handling of special cases (SACK, DupACKs, etc.).

Each cable modem gets discontinuous access to the upstream channel, mediated by a request/grant protocol. The CM has occasional (e.g. every 2ms or so) opportunities to send a request to the CMTS, and it will get a corresponding grant (tx op) typically 4-6 ms later.

So, there really are three behaviors taking place. One is that the modem has discontinuous access to the channel, maybe a grant every 2-8 ms. Another is the prioritized queuing of ACKs, such that they are a) not delayed by buffer bloat, and b) not (always) delayed by the request-grant loop. The last is that only the latest ACK is sent, as opposed to sending a burst of (possibly back-to-back) ACKs for the connection.

Now, suppose that a prescient version of BBR were to be pacing its packets at exactly the bottleneck bandwidth (with no queue built up), and simply monitoring (but not acting on) the in-flight count. That pBBR would see the in-flight count ramp up to a max value, and then, when a big "stretch" ACK arrives (ACKing many segments), immediately drop down to a min value, then ramp again to the max, and the sawtooth would repeat. If this pBBR were then to start implementing a cwnd limit, we would want it to use the top of the sawtooth (rather than the bottom) as the cwnd, otherwise it would underutilize the link.

The logic in BBR is currently using the bottom of the sawtooth as the cwnd.

In your paper, you mention using a multiplicative factor "cwnd_gain" of 2 (effectively doubling the calculated BDP) in order to attempt to correct for this. I suppose that you also have some heuristic that tells BBR when to employ this gain factor and when not to.

An alternative which (I think) would eliminate the error precisely (and thus avoid potentially introducing buffering latency or link underutilization), and which (I think) would not need to be enabled/disabled based on a heuristic, would be to simply calculate the RTT based on the oldest of the ACKed segments rather than the newest. This RTT value would then be input into your RTpropFilter as you normally do.

Perhaps you've thought of this already and it fails for a reason that I haven't anticipated. But if not, I would suggest testing this approach.

-Greg

Neal Cardwell

unread,

Nov 17, 2016, 8:53:33 PM11/17/16

to Greg White, BBR Development

Hi Greg,

Thank you so much for your suggestion. Handling ACK

suppression/decimation/acceleration is indeed one of the challenges

faced by BBR, and we appreciate your very nicely explained and

carefully reasoned suggestion.

Based on my understanding of your proposal ("calculate the RTT based

on the oldest of the ACKed segments rather than the newest") it sounds

like this would, when it is applicable, increase the RTT samples. But

since the RTprop (min_rtt) filter is a min-filter, the question would

be whether these increased RTT samples would "survive" the min-filter

and be chosen as the estimated RTprop/min_rtt. The RTprop/min_rtt

filter window is pretty long (currently 10 secs). So my guess would be

that (unless the DOCSIS link was severely congested) there would

typically be some RTT sample during the 10 sec window that did not

suffer ACK suppression/decimation/acceleration and was low enough to

be chosen as the RTpop/min_rtt estimate. So I am not sure such RTT

sample selection logic would have a significant effect, if I'm

understanding it and anticipating its effects correctly.

However, I like the direction of this suggestion a lot. In the same

vein as your suggestion, BBR could track the maximum number of packets

(or bytes) recently ACKed in a single stretch ACK, and essentially add

that amount (or some small multiple thereof) to the cwnd.

If we're going to do something like this, it would be nice to also

have the mechanism handle cases with "compressed" ACKs as well; quite

often we see a sequence of ACKs arrive in a burst, perhaps because

they were granted a slot of air time on a wifi/cellular link

together. So perhaps we can improve behavior on a wider class of these

links if we generalize this mechanism to track the number of packets

(or bytes) recently ACKed in a "time slot" equal to the amount of time

in which we'd expect a TSO burst to be ACKed.

For example, if a BBR sender is sending TSO bursts of 4 packets in

every every 1ms time slot (48Mbps), and yet sometimes sees 20 packets

ACKed in a single 1ms time slot, then perhaps it should inflate its

cwnd by at least 20 packets, to allow itself (if needed) to keep

sending while those 20 ACKs are being held up.

In fact, since September we have been experimenting in the lab with

this approach for Linux TCP BBR. Our hope was that such an approach

help in many scenarios: ACK suppression/decimation/acceleration, ACK

compression, LRO, GRO, or TCP delayed ACKs. In any of these cases BBR

can see a big gap in the ACK stream followed by a big burst of packets

being ACKed. The conjecture is that in such cases BBR will want to

allow more packets in flight.

Does it sound to you like a mechanism like this might help BBR

operate over DOCSIS links with ACK decimation?

As to your question about when BBR applies cwnd_gain: BBR always

applies cwnd_gain and a few other provisions to allow cwnd to be

bigger than the estimated BDP (see bbr_target_cwnd()). There's no

heuristic to decide when to apply those and when not too. But a key

part of the design of BBR is that it tries (and usually succeeds) in

mostly running limited by its pacing rate, rather than cwnd. So the

fact that cwnd is bigger than BDP largely only comes into play if the

ACK rate temporarily falls below the estimated bandwidth (eg with ACK

decimation or compression, or delayed ACKs).

cheers,

neal

--
You received this message because you are subscribed to the Google Groups "BBR Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bbr-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Greg White

unread,

Nov 20, 2016, 1:07:50 AM11/20/16

to BBR Development, whit...@gmail.com

You may be right that there is a chance that the updated RTT value (done as I suggested) would get lost in the min filter. I think it depends on the details of how both the upstream scheduling and the ACK prioritization are implemented, but I can certainly imagine implementations where that would be the case.

The alternative you suggested sounds pretty good, and would seem to eliminate the possibility of that issue occurring.

Regarding the cwnd coming into play, isn't the cwnd limit what is preventing the buffering latency from growing unbounded when the actual bandwidth falls below the pacing rate? If so, it would seem that it would be good to make this limit as low as possible without significantly under-running the free capacity.

-Greg

To unsubscribe from this group and stop receiving emails from it, send an email to bbr-dev+u...@googlegroups.com.

Yuchung Cheng

unread,

Nov 20, 2016, 1:20:24 AM11/20/16

to Greg White, BBR Development

On Sat, Nov 19, 2016 at 10:07 PM, Greg White <whit...@gmail.com> wrote:

You may be right that there is a chance that the updated RTT value (done as I suggested) would get lost in the min filter. I think it depends on the details of how both the upstream scheduling and the ACK prioritization are implemented, but I can certainly imagine implementations where that would be the case.

The alternative you suggested sounds pretty good, and would seem to eliminate the possibility of that issue occurring.

Regarding the cwnd coming into play, isn't the cwnd limit what is preventing the buffering latency from growing unbounded when the actual bandwidth falls below the pacing rate? If so, it would seem that it would be good to make this limit as low as possible without significantly under-running the free capacity.

Right. cwnd=2 * max_filtered_bw * min_filtered_rtt in this case could drive the queue fairly high and buffer overrun. so we are considering it dynamically reduce cwnd_gain or even replace a BDP multiplier with a linear variable.

One thing makes the design choice complicated: mobile (wifi/cellular) likes fair amount of backlogs to retain a good link rate. Our discussions with some cellular and wifi experts indicate often tens msec are needed.

To unsubscribe from this group and stop receiving emails from it, send an email to bbr-dev+unsubscribe@googlegroups.com.

carl_k...@comcast.com

unread,

Nov 21, 2016, 12:03:32 PM11/21/16

to BBR Development

In this scenario that Greg identifies with the cable modem middlebox performing TCP ACK suppression / acceleration, coupled with the current implementation of TCP BBR on the sending side, would you expected TCP BBR to still perform better than if the sender was running TCP CUBIC, just maybe not as much improvement as other scenarios? Or would it not perform as good TCP CUBIC?

Regards,

Carl

Yuchung Cheng

unread,

Nov 23, 2016, 8:35:13 PM11/23/16

to carl_k...@comcast.com, BBR Development

On Mon, Nov 21, 2016 at 9:03 AM, <carl_k...@comcast.com> wrote:

In this scenario that Greg identifies with the cable modem middlebox performing TCP ACK suppression / acceleration, coupled with the current implementation of TCP BBR on the sending side, would you expected TCP BBR to still perform better than if the sender was running TCP CUBIC, just maybe not as much improvement as other scenarios? Or would it not perform as good TCP CUBIC?

Our experience on production is that current technique (cwnd_gain in Neal's email) works well. When ACK feedbacks are delayed, a rate-based congestion control like BBR has the choice to keep sending up to certain inflight cap to control the queue. it is a design parameter in the system. In contrast, But it is not a parameter but rather a goal to fill any available buffer space for Cubic.

Let's illustrate with an example: the graph attached is the time-sequence graph of a YouTube playback using BBR on Comcast. The blue/green/yellow lines represent the ACK, data, and ACK+rwin sequences. The vertical stairs on the ACK line are big stretched ACKs: when ACK is delayed, BBR continues to send up to another min-RTT worth of time. But sometimes the stretched ACK can be delayed by multiple RTTs to cause data gaps (green line gaps) b/c the current parameter is set conservatively. Greg's idea to dynamically tune it can help potentially, at the expense of a higher transient queue. But again is a parameter to tune.

Regards,
Carl

On Thursday, November 17, 2016 at 3:02:25 AM UTC-5, Greg White wrote:
Yuchung, Neal, et al.,

In cases where a middlebox (e.g. a cable modem) performs TCP ACK acceleration, you may be able to calculate a more accurate BDP than using your existing approach.

In cable modems, ACK acceleration is a proprietary feature implemented by the SOC vendors, so there is limited information on how it works. But, my understanding is that it works more-or-less as follows: the cable modem maintains a separate queue for TCP ACKs; upon arrival of a new ACK, it examines this queue and if it finds an older ACK for the same connection, it replaces the older ACK with the new one, so only the newest (highest seq no) ACK for each connection is kept. When a transmission opportunity arrives, the cable modem preferentially dequeues and transmits the ACKs. I believe that the algorithms have more sophisticated handling of special cases (SACK, DupACKs, etc.).

Each cable modem gets discontinuous access to the upstream channel, mediated by a request/grant protocol. The CM has occasional (e.g. every 2ms or so) opportunities to send a request to the CMTS, and it will get a corresponding grant (tx op) typically 4-6 ms later.

So, there really are three behaviors taking place. One is that the modem has discontinuous access to the channel, maybe a grant every 2-8 ms. Another is the prioritized queuing of ACKs, such that they are a) not delayed by buffer bloat, and b) not (always) delayed by the request-grant loop. The last is that only the latest ACK is sent, as opposed to sending a burst of (possibly back-to-back) ACKs for the connection.

Now, suppose that a prescient version of BBR were to be pacing its packets at exactly the bottleneck bandwidth (with no queue built up), and simply monitoring (but not acting on) the in-flight count. That pBBR would see the in-flight count ramp up to a max value, and then, when a big "stretch" ACK arrives (ACKing many segments), immediately drop down to a min value, then ramp again to the max, and the sawtooth would repeat. If this pBBR were then to start implementing a cwnd limit, we would want it to use the top of the sawtooth (rather than the bottom) as the cwnd, otherwise it would underutilize the link.

The logic in BBR is currently using the bottom of the sawtooth as the cwnd.

In your paper, you mention using a multiplicative factor "cwnd_gain" of 2 (effectively doubling the calculated BDP) in order to attempt to correct for this. I suppose that you also have some heuristic that tells BBR when to employ this gain factor and when not to.

An alternative which (I think) would eliminate the error precisely (and thus avoid potentially introducing buffering latency or link underutilization), and which (I think) would not need to be enabled/disabled based on a heuristic, would be to simply calculate the RTT based on the oldest of the ACKed segments rather than the newest. This RTT value would then be input into your RTpropFilter as you normally do.

Perhaps you've thought of this already and it fails for a reason that I haven't anticipated. But if not, I would suggest testing this approach.

-Greg

yt-comcast.png

leewei...@gmail.com

unread,

Dec 30, 2016, 3:57:48 AM12/30/16

to BBR Development, whit...@gmail.com

Hi Nick,

In cases of ACK suppression/decimation/acceleration, as #pkts_delivered may increase a lot, so bw may be over estimated, am I right? And with bw_max filter, we would use this bw for at least 10 RTT.

* send_rate = #pkts_delivered/(last_snd_time - first_snd_time)

* ack_rate = #pkts_delivered/(last_ack_time - first_ack_time)

* bw = min(send_rate, ack_rate)

If bbr already over estimate cwnd = (bw*minRtt*cwnd_gain), why cwnd_gain is 2?

在 2016年11月18日星期五 UTC+8上午9:53:33，Neal Cardwell写道：

To unsubscribe from this group and stop receiving emails from it, send an email to bbr-dev+u...@googlegroups.com.

Jonathan Morton

unread,

Dec 30, 2016, 4:37:40 AM12/30/16

to leewei...@gmail.com, BBR Development, whit...@gmail.com

> On 30 Dec, 2016, at 10:57, leewei...@gmail.com wrote:
>
> In cases of ACK suppression/decimation/acceleration, as #pkts_delivered may increase a lot, so bw may be over estimated, am I right?

While the number of packets delivered per ack received goes up with decimation, so does the average time interval between acks. With a suitable smoothing filter (over one RTT, for example), you can still make a reasonably close estimate of the true delivery rate.

> And with bw_max filter, we would use this bw for at least 10 RTT.
> * send_rate = #pkts_delivered/(last_snd_time - first_snd_time)
> * ack_rate = #pkts_delivered/(last_ack_time - first_ack_time)
> * bw = min(send_rate, ack_rate)

There are two different things being measured here, which seems to be a deliberate mechanism to avoid the exact problem you mention.

1: The send rate is measured directly, not using acks.

2: The ack rate is the delivery rate, measured remotely using acks.

The third line takes the *minimum* of the two rates, effectively acting as a sanity-check on the delivery rate, which obviously can’t be higher than the send rate. This helps with many patterns of ack queueing behaviour.

> If bbr already over estimate cwnd = (bw*minRtt*cwnd_gain), why cwnd_gain is 2?

As has been mentioned several times recently, BBR is designed to operate in pacing mode under normal conditions, so cwnd is set to a higher value to keep it out of the way.

If there is a sudden change in path conditions, such that the delivery rate suddenly goes down a lot, the cwnd limitation will eventually put BBR in ack-clocked mode like a normal TCP, until the BW estimate converges to the new path capacity.

The above notes should clarify that the BDP is not grossly overestimated in practice, so a doubling of BDP to determine cwnd is sensible in this context.

- Jonathan Morton

Message has been deleted

Neal Cardwell

unread,

Feb 2, 2021, 9:10:26 AM2/2/21

to mingkun bian, BBR Development

On Tue, Feb 2, 2021 at 2:36 AM mingkun bian <bianm...@gmail.com> wrote:

Hi Neal，

I am sorry to reply such a long post, maybe it can make us to learn bbr more clearly.

I have a question about rtt(calculate the RTT based on the oldest of the ACKed segments rather than the newest), what' meaning about oldest of the ACKed segments and newest.
1. bbr alwayls use min_rtt as factor calculated by BDP(min_rtt*bw)， so why saying bbr use the newest of the ACKed segments， and what‘s the meaning of oldest？

The distinction about the oldest of the ACKed segments vs the newest of the ACKed segments is about how to obtain an RTT sample when an ACK covers multiple segments transmitted at different times. Consider the following scenario, from the perspective of a transport sender sending 3 data segments, P1, P2, and P3, at different times:

t= 0ms: send P1

t=10ms: send P2

t=20ms: send P3

t=40ms: receive cumulative ACK for P1, P2, and P3

Here the oldest of the ACKed segments is P1, and the newest of the ACKed segments is P3.

RTT from the oldest of the ACKed segments = 40ms - 0ms = 40ms

RTT from the newest of the ACKed segments = 40ms - 20ms = 20ms

To make this more concrete, you can take a look at tcp_clean_rtx_queue() in the Linux TCP code base, where ca_rtt_us computes an RTT sample from the newest of ACKed segments (for congestion control), and seq_rtt_us computes an RTT sample from the oldest of ACKed segments (for RTO calculation).

2. What is difference of "suppression/decimation/acceleration",
a. suppression is due to delay ack、GRO...

Yes, I think it's fair to say that delayed ACKs, GRO, and LRO implement forms of ACK suppression.

b. acceleration is clearly described， due to the cable modem‘s ack aggregation

Agreed, Greg's description is very clear.

c. decimation, what scene is this？

"ACK decimation" refers to dropping or omitting some fraction of ACKs. Effective versions of ACK decimation would probably be pretty similar to what Greg describes as "ACK acceleration".

In the end, which techniques are called "suppression" vs "decimation" vs "acceleration" are somewhat subjective, and different people use different terms. I used "suppression/decimation/acceleration" to try to encompass all of them, mostly because they present similar challenges to a congestion control algorithm, so probably need to be addressed by the same mechanism(s).

best,

neal

Thanks

To unsubscribe from this group and stop receiving emails from it, send an email to bbr-dev+u...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "BBR Development" group.

To unsubscribe from this group and stop receiving emails from it, send an email to bbr-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bbr-dev/c59023c5-5e18-4f41-97f4-d2a0257dd30fn%40googlegroups.com.

mingkun bian

unread,

Feb 3, 2021, 2:29:59 AM2/3/21

to BBR Development

Hi Neal,

Thanks for your detailed answer.

Best

Reply all

Reply to author

Forward