Latency and pacing rate while application-limited

Wesley Rosenblum

Oct 24, 2022, 8:01:37 PM
to BBR Development
Consider an endpoint sending on a high-bandwidth path at a low rate, say 10KB every RTT, where the RTT is 100ms.

The BBR.bw value in this case would be 10KB/100ms, as 10KB are being delivered every 100ms (at least initially). 

The pacing rate (in Startup) would be calculated as BBRStartupPacingGain (2.77) X BBR.bw (10KB/100ms) = 27.7KB/100ms. 

Sending 10KB at a 27.7KB/100ms pacing rate would take about 36ms (BBR.send_quantum would impact this, but I'm putting it aside for now).

So the latency to send and acknowledge a 10KB file is now 36ms + 100ms = 136ms.
If pacing were disabled, the latency would be ~100ms.
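
As a back-of-the-envelope check, here is a minimal sketch of the arithmetic above (values are from this example only; BBR.send_quantum is ignored, as noted):

# Back-of-the-envelope model of the pacing-induced latency described
# above. Values are from this example, not from any implementation.
RTT_MS = 100.0
BURST_BYTES = 10 * 1024          # 10KB sent once per RTT
STARTUP_PACING_GAIN = 2.77

bw = BURST_BYTES / RTT_MS                     # BBR.bw estimate: ~10KB per 100ms
pacing_rate = STARTUP_PACING_GAIN * bw        # ~27.7KB per 100ms
pacing_delay_ms = BURST_BYTES / pacing_rate   # time to pace out the 10KB burst
total_ms = pacing_delay_ms + RTT_MS           # last byte sent + one RTT for the ACK
print(f"pacing delay: {pacing_delay_ms:.1f} ms, send+ACK: {total_ms:.1f} ms")
# -> pacing delay: 36.1 ms, send+ACK: 136.1 ms (vs ~100 ms unpaced)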

Given that the path can likely support a much higher pacing rate, how do we prevent a pacing rate derived from application-limited bandwidth estimates from unnecessarily increasing latency?

Has any thought been given to probing the pacing rate while application-limited to determine if it can be increased without impacting RTT and loss?

Thanks,
Wesley Rosenblum

Neal Cardwell

Oct 24, 2022, 10:16:36 PM
to Wesley Rosenblum, BBR Development
Hi Wesley,

Thanks for raising this point for discussion!

Our team is aware of the issue; that's why the fq qdisc has the initial_quantum parameter, and that's why the first 10 packets from Linux TCP-layer pacing are currently unpaced. But of course those are incomplete mechanisms rather than general solutions for the latency cost of pacing for warming-up and/or application-limited flows.

I agree it seems worth considering the kind of approach you outline. However, that is a fairly different approach than the core of the current BBRv1 / BBRv2 control loops, so it would probably need careful consideration. Some questions that come to mind:

+ Is there some simple way to add a mechanism like that to BBR, or would that add too much complexity to make it worthwhile? (In which case perhaps it would make sense to consider this mechanism in the context of some new/different algorithm?...)

+ How would the mechanism avoid being fooled by the high delay variations in wifi, cellular, DOCSIS, and datacenter Ethernet paths?

best regards,
neal


Wesley Rosenblum

Oct 25, 2022, 2:47:29 PM
to Neal Cardwell, BBR Development
Thanks for the quick response Neal. 

From our experience, the connections that experience this pacing penalty most severely typically never exit the Startup state: the application-limited rate they are sending at is too low to trigger consistent loss above BBRLossThresh, and the delivery rate samples being marked application-limited prevent the bandwidth-plateau Startup exit criterion from being met as well.

All this to say that a pacing-probing solution limited to the Startup state should, in my view, generally be sufficient to address the majority of problematic connections. Limiting any solution to Startup might also reduce the amount of added complexity.
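
For reference on that exit criterion, here is a minimal Python sketch, loosely following the full-bandwidth plateau check in the BBR Internet-Drafts (simplified, not an exact transcription):

# Startup "bandwidth plateau" exit check, loosely following the BBR
# Internet-Drafts. Note the early return on app-limited samples: a flow
# that is always app-limited never trips this exit condition.
class StartupPlateauCheck:
    def __init__(self):
        self.full_bw = 0.0        # best bandwidth estimate seen so far
        self.full_bw_count = 0    # consecutive rounds without ~25% growth
        self.filled_pipe = False  # True once Startup should be exited

    def on_round(self, max_bw, is_app_limited):
        if self.filled_pipe or is_app_limited:
            return  # app-limited rounds cannot confirm a plateau
        if max_bw >= self.full_bw * 1.25:
            self.full_bw = max_bw      # still growing; reset the counter
            self.full_bw_count = 0
            return
        self.full_bw_count += 1
        if self.full_bw_count >= 3:    # three flat rounds: pipe is full
            self.filled_pipe = True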

> How would the mechanism avoid being fooled by the high delay variations in wifi, cellular, DOCSIS, and datacenter Ethernet paths?

Maybe something similar to Hybrid Slow Start that samples RTT over a number of rounds? If the delay increase remains below some threshold, increase the "application-limited pacing gain". The ProbeRTT state could also act as a secondary check to make sure we haven't gone too far in ramping up the gain.
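
To make that concrete, a purely hypothetical sketch of such a probe; all names and thresholds here (ROUNDS_PER_STEP, DELAY_THRESH, GAIN_STEP, MAX_GAIN) are illustrative, not from any BBR draft or implementation:

# Hypothetical Hybrid-Slow-Start-style probe for an "application-limited
# pacing gain": ramp the gain while min RTT stays near baseline, retreat
# when delay creeps up.
ROUNDS_PER_STEP = 4    # RTT rounds sampled before each gain decision
DELAY_THRESH = 1.25    # back off if min RTT exceeds 1.25x the baseline
GAIN_STEP = 1.25       # multiplicative step for the probe gain
MAX_GAIN = 8.0         # cap on the app-limited pacing gain

class AppLimitedPacingProbe:
    def __init__(self, base_rtt_ms):
        self.base_rtt_ms = base_rtt_ms  # e.g. the BBR.min_rtt estimate
        self.gain = 1.0                 # extra gain applied while app-limited
        self.round_min_rtts = []

    def on_round_end(self, min_rtt_ms):
        self.round_min_rtts.append(min_rtt_ms)
        if len(self.round_min_rtts) < ROUNDS_PER_STEP:
            return
        probe_min = min(self.round_min_rtts)
        self.round_min_rtts.clear()
        if probe_min <= self.base_rtt_ms * DELAY_THRESH:
            # no meaningful delay increase: ramp the gain up one step
            self.gain = min(self.gain * GAIN_STEP, MAX_GAIN)
        else:
            # delay crept up: retreat one step
            self.gain = max(self.gain / GAIN_STEP, 1.0)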

Thanks,
Wesley Rosenblum


Bob McMahon

Oct 26, 2022, 3:23:37 PM
to Wesley Rosenblum, Neal Cardwell, BBR Development
A bit off topic, but I'm not sure how a CCA's sampling of RTTs takes TCP ACK delays into account. Below is an iperf 2 bounceback test with a hold time delaying the bounceback write. Notice the RTT differences based solely on the hold being 39700 usecs vs 39800 usecs. In the former, the TCP ACK waits and rides with the data, as seen by the ~40ms RTT that matches the app-level bounceback time; in the latter, the RTT is around 200 usecs and is independent of the app-level times.

Does a CCA's RTT sampling typically account for such behavior?

[root@fedora iperf2-code]# iperf -c 10.19.85.169 --bounceback -i 1 --bounceback-hold 39.7 -e --bounceback-no-quickack  -t 30
------------------------------------------------------------
Client connecting to 10.19.85.169, TCP port 5001 with pid 11022 (1 flows)
Write buffer size:  100 Byte
Bursting:  100 Byte writes 10 times every 1.00 second(s)
Bounce-back test (size= 100 Byte) (server hold req=39700 usecs)
TOS set to 0x0 and nodelay (Nagle off)
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.19.85.206%eno1 port 34164 connected with 10.19.85.169 port 5001 (bb len/hold=100/39700) (sock=3) (icwnd/mss/irtt=14/1448/271) (ct=0.31 ms) on 2022-10-26 11:56:57 (PDT)
[ ID] Interval        Transfer    Bandwidth         BB cnt=avg/min/max/stdev         Rtry  Cwnd/RTT    RPS
[  1] 0.00-1.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.312/40.126/41.069/0.270 ms    0   14K/214 us    25 rps
[  1] 1.00-2.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.243/40.206/40.277/0.027 ms    0   14K/201 us    25 rps
[  1] 2.00-3.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.248/40.202/40.288/0.032 ms    0   14K/205 us    25 rps
[  1] 3.00-4.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.248/40.180/40.303/0.034 ms    0   14K/201 us    25 rps
[  1] 4.00-5.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.244/40.214/40.269/0.022 ms    0   14K/5197 us    25 rps
[  1] 5.00-6.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.213/40.108/40.296/0.054 ms    0   14K/30967 us    25 rps
[  1] 6.00-7.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.209/40.086/40.285/0.058 ms    0   14K/37741 us    25 rps
[  1] 7.00-8.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.194/40.099/40.272/0.057 ms    0   14K/39517 us    25 rps
[  1] 8.00-9.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.210/40.135/40.286/0.049 ms    0   14K/39997 us    25 rps
[  1] 9.00-10.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.203/40.108/40.299/0.058 ms    0   14K/40117 us    25 rps
[  1] 10.00-11.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.212/40.091/40.290/0.060 ms    0   14K/40156 us    25 rps
[  1] 11.00-12.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.202/40.102/40.287/0.056 ms    0   14K/40160 us    25 rps
[  1] 12.00-13.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.204/40.142/40.268/0.045 ms    0   14K/40164 us    25 rps
[  1] 13.00-14.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.214/40.124/40.288/0.053 ms    0   14K/40171 us    25 rps
[  1] 14.00-15.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.215/40.138/40.289/0.050 ms    0   14K/40170 us    25 rps
[  1] 15.00-16.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.214/40.109/40.281/0.060 ms    0   14K/40177 us    25 rps
[  1] 16.00-17.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.196/40.109/40.285/0.052 ms    0   14K/40159 us    25 rps
[  1] 17.00-18.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.194/40.093/40.268/0.055 ms    0   14K/40156 us    25 rps
[  1] 18.00-19.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.172/40.086/40.268/0.063 ms    0   14K/40141 us    25 rps
[  1] 19.00-20.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.194/40.109/40.260/0.057 ms    0   14K/40152 us    25 rps
[  1] 20.00-21.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.183/40.049/40.253/0.068 ms    0   14K/40151 us    25 rps
[  1] 21.00-22.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.197/40.117/40.288/0.051 ms    0   14K/40153 us    25 rps
[  1] 22.00-23.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.196/40.091/40.290/0.062 ms    0   14K/40155 us    25 rps
[  1] 23.00-24.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.180/40.105/40.273/0.060 ms    0   14K/40145 us    25 rps
[  1] 24.00-25.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.215/40.141/40.276/0.057 ms    0   14K/40165 us    25 rps
[  1] 25.00-26.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.213/40.127/40.263/0.048 ms    0   14K/40171 us    25 rps
[  1] 26.00-27.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.205/40.122/40.280/0.060 ms    0   14K/40169 us    25 rps
[  1] 27.00-28.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.215/40.133/40.275/0.046 ms    0   14K/40173 us    25 rps
[  1] 28.00-29.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.220/40.137/40.285/0.042 ms    0   14K/40178 us    25 rps
[  1] 29.00-30.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.212/40.152/40.272/0.044 ms    0   14K/40169 us    25 rps
[  1] 0.00-30.04 sec  58.8 KBytes  16.0 Kbits/sec    301=40.213/40.049/41.069/0.072 ms    0   14K/35285 us    25 rps
[  1] 0.00-30.04 sec BB8(f)-PDF: bin(w=100us):cnt(301)=401:8,402:108,403:183,404:1,411:1 (5.00/95.00/99.7%=402/403/411,Outliers=1,obl/obu=0/0)

[root@fedora iperf2-code]# iperf -c 10.19.85.169 --bounceback -i 1 --bounceback-hold 39.8 -e --bounceback-no-quickack  -t 30
------------------------------------------------------------
Client connecting to 10.19.85.169, TCP port 5001 with pid 11025 (1 flows)
Write buffer size:  100 Byte
Bursting:  100 Byte writes 10 times every 1.00 second(s)
Bounce-back test (size= 100 Byte) (server hold req=39800 usecs)
TOS set to 0x0 and nodelay (Nagle off)
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.19.85.206%eno1 port 34166 connected with 10.19.85.169 port 5001 (bb len/hold=100/39800) (sock=3) (icwnd/mss/irtt=14/1448/252) (ct=0.30 ms) on 2022-10-26 11:57:33 (PDT)
[ ID] Interval        Transfer    Bandwidth         BB cnt=avg/min/max/stdev         Rtry  Cwnd/RTT    RPS
[  1] 0.00-1.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.376/40.280/40.926/0.195 ms    0   14K/200 us    25 rps
[  1] 1.00-2.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.340/40.272/40.386/0.045 ms    0   14K/194 us    25 rps
[  1] 2.00-3.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.358/40.305/40.428/0.043 ms    0   14K/210 us    25 rps
[  1] 3.00-4.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.351/40.267/40.400/0.041 ms    0   14K/195 us    25 rps
[  1] 4.00-5.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.331/40.283/40.374/0.033 ms    0   14K/201 us    25 rps
[  1] 5.00-6.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.344/40.302/40.396/0.032 ms    0   14K/202 us    25 rps
[  1] 6.00-7.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.337/40.267/40.381/0.039 ms    0   14K/203 us    25 rps
[  1] 7.00-8.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.323/40.290/40.386/0.033 ms    0   14K/195 us    25 rps
[  1] 8.00-9.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.323/40.237/40.414/0.054 ms    0   14K/190 us    25 rps
[  1] 9.00-10.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.356/40.284/40.410/0.047 ms    0   14K/198 us    25 rps
[  1] 10.00-11.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.328/40.269/40.380/0.045 ms    0   14K/199 us    25 rps
[  1] 11.00-12.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.329/40.278/40.378/0.035 ms    0   14K/201 us    25 rps
[  1] 12.00-13.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.351/40.312/40.437/0.036 ms    0   14K/203 us    25 rps
[  1] 13.00-14.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.355/40.311/40.405/0.029 ms    0   14K/205 us    25 rps
[  1] 14.00-15.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.333/40.279/40.378/0.034 ms    0   14K/201 us    25 rps
[  1] 15.00-16.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.356/40.281/40.402/0.037 ms    0   14K/204 us    25 rps
[  1] 16.00-17.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.342/40.282/40.381/0.035 ms    0   14K/207 us    25 rps
[  1] 17.00-18.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.339/40.288/40.385/0.035 ms    0   14K/206 us    25 rps
[  1] 18.00-19.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.320/40.265/40.382/0.047 ms    0   14K/206 us    25 rps
[  1] 19.00-20.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.330/40.264/40.382/0.040 ms    0   14K/203 us    25 rps
[  1] 20.00-21.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.329/40.282/40.377/0.030 ms    0   14K/204 us    25 rps
[  1] 21.00-22.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.330/40.288/40.374/0.031 ms    0   14K/203 us    25 rps
[  1] 22.00-23.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.348/40.294/40.392/0.029 ms    0   14K/208 us    25 rps
[  1] 23.00-24.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.350/40.296/40.385/0.031 ms    0   14K/206 us    25 rps
[  1] 24.00-25.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.352/40.284/40.389/0.041 ms    0   14K/207 us    25 rps
[  1] 25.00-26.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.325/40.289/40.377/0.027 ms    0   14K/195 us    25 rps
[  1] 26.00-27.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.320/40.278/40.364/0.031 ms    0   14K/191 us    25 rps
[  1] 27.00-28.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.317/40.263/40.365/0.037 ms    0   14K/194 us    25 rps
[  1] 28.00-29.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.323/40.246/40.373/0.039 ms    0   14K/196 us    25 rps
[  1] 29.00-30.00 sec  1.95 KBytes  16.0 Kbits/sec    10=40.315/40.270/40.379/0.039 ms    0   14K/186 us    25 rps
[  1] 0.00-30.04 sec  58.8 KBytes  16.0 Kbits/sec    301=40.338/40.237/40.926/0.051 ms    0   14K/474 us    25 rps
[  1] 0.00-30.04 sec BB8(f)-PDF: bin(w=100us):cnt(301)=403:66,404:226,405:8,410:1 (5.00/95.00/99.7%=403/404/410,Outliers=0,obl/obu=0/0)

Bob



Neal Cardwell

Oct 26, 2022, 3:50:24 PM
to Wesley Rosenblum, BBR Development
On Tue, Oct 25, 2022 at 2:47 PM Wesley Rosenblum <wes...@gmail.com> wrote:
> Thanks for the quick response Neal.
>
> From our experience, the connections that experience this pacing penalty most severely typically never exit the Startup state: the application-limited rate they are sending at is too low to trigger consistent loss above BBRLossThresh, and the delivery rate samples being marked application-limited prevent the bandwidth-plateau Startup exit criterion from being met as well.
>
> All this to say that a pacing-probing solution limited to the Startup state should, in my view, generally be sufficient to address the majority of problematic connections. Limiting any solution to Startup might also reduce the amount of added complexity.

IMHO it would be simpler and more valuable to go with a more general solution that applies such a mechanism at any time, not just in Startup. Since available bandwidth can increase at any time, it could be beneficial to increase the pacing rate later in a connection's lifetime as well.
 
> > How would the mechanism avoid being fooled by the high delay variations in wifi, cellular, DOCSIS, and datacenter Ethernet paths?
>
> Maybe something similar to Hybrid Slow Start that samples RTT over a number of rounds? If the delay increase remains below some threshold, increase the "application-limited pacing gain". The ProbeRTT state could also act as a secondary check to make sure we haven't gone too far in ramping up the gain.

Yeah, I agree that some sort of min-filtered RTT sample seems like the way to go for a mechanism like that. But min-filtering across multiple rounds seems tricky, since application-limited traffic might alternate between single-packet and multi-packet rounds, where the RTT samples from single-packet rounds might give a false sense that there is no impact on the RTT from the increased pacing rate.
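
One purely hypothetical way to express that guard (the 2-packet threshold is illustrative, not from any implementation):

# Hypothetical guard for the pitfall above: only rounds that actually had
# multiple packets in flight feed the probe's min-RTT filter, since a
# single-packet round cannot reveal pacing-induced queueing.
def round_rtt_sample(round_min_rtt_ms, max_packets_in_flight):
    """Return an RTT sample for the probe filter, or None to skip."""
    if max_packets_in_flight < 2:
        return None  # single-packet round: RTT says nothing about pacing rate
    return round_min_rtt_ms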

thanks,
neal

Neal Cardwell

Oct 26, 2022, 3:54:35 PM
to Bob McMahon, Wesley Rosenblum, BBR Development
On Wed, Oct 26, 2022 at 3:23 PM Bob McMahon <bob.m...@broadcom.com> wrote:
> A bit off topic, but I'm not sure how a CCA's sampling of RTTs takes TCP ACK delays into account. Below is an iperf 2 bounceback test with a hold time delaying the bounceback write. Notice the RTT differences based solely on the hold being 39700 usecs vs 39800 usecs. In the former, the TCP ACK waits and rides with the data, as seen by the ~40ms RTT that matches the app-level bounceback time; in the latter, the RTT is around 200 usecs and is independent of the app-level times.
>
> Does a CCA's RTT sampling typically account for such behavior?

I can't speak to CCAs in general, but BBRv1 and BBRv2 do account for such behavior: both estimate the path's two-way propagation delay by min-filtering RTT samples over the last 10 seconds, which gives a good chance (though not 100%, obviously) of catching an RTT sample not distorted by delayed ACKs. Similarly, BBR's bandwidth filtering uses a max filter to filter out effects that include delayed ACKs.
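
For illustration, a minimal sketch of such a windowed min-filter (the 10-second window comes from the description above; this mirrors the idea, not Linux's implementation):

# Sliding-window min-filter: the propagation-delay estimate is the minimum
# RTT sample seen over the last 10 seconds, so occasional delayed-ACK-inflated
# samples are ignored as long as one clean sample lands in the window.
from collections import deque

MIN_RTT_WIN_SEC = 10.0

class MinRttFilter:
    def __init__(self):
        self.samples = deque()  # (timestamp_sec, rtt_ms), kept min-monotonic

    def update(self, now_sec, rtt_ms):
        # drop samples that have fallen out of the 10-second window
        while self.samples and now_sec - self.samples[0][0] > MIN_RTT_WIN_SEC:
            self.samples.popleft()
        # drop older samples that the new one dominates (monotonic deque)
        while self.samples and self.samples[-1][1] >= rtt_ms:
            self.samples.pop()
        self.samples.append((now_sec, rtt_ms))
        return self.samples[0][1]  # current min-RTT estimate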

thanks,
neal