BBR updates at the IETF 102 ICCRG session


Neal Cardwell

Jul 20, 2018, 5:38:20 PM
to BBR Development
Hi all,

The Google BBR team presented some updates at the IETF 102 ICCRG session yesterday:

+ BBR Congestion Control Work at Google: IETF 102 Update [YouTube] [slides]

+ BBR Congestion Control: IETF 102 Update: BBR Startup [YouTube] [slides]  

cheers,
neal

Dave Taht

Jul 20, 2018, 11:08:59 PM
to Neal Cardwell, BBR Development
"If ecn_mark_rate > target_ecn_mark_rate (50%) then - Enter DRAIN and
drain in-flight to estimated BDP"

So you are telling me that ECN in BBR2, currently, is an ambiguous
signal of congestion, even to get out of startup mode?



--

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

Neal Cardwell

Jul 21, 2018, 9:05:41 AM
to Dave Taht, BBR Development
The current thinking with the ECN response for the BBR v2 design prototype is that it will target DCTCP/L4S-style ECN, with a shallow marking threshold and packet-granularity ECN marking feedback (the slides should have mentioned this; sorry for the oversight!). With that kind of ECN feedback, a single ECN mark tells the sender that there was at least a small queue at some instant in time, and so to estimate the persistence of queuing the sender needs some kind of filtering operating over some kind of time window (like the DCTCP alpha).
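To make the filtering idea concrete, here is a minimal sketch of a DCTCP-style "alpha" estimator, assuming an EWMA over the per-round-trip CE-marked fraction with gain 1/16; the names and fixed-point scaling are illustrative, not taken from the BBR v2 code:

#include <stdint.h>
#include <stdio.h>

/* Illustrative DCTCP-style ECN filter: alpha is an EWMA (scaled by
 * ALPHA_UNIT) of the fraction of packets CE-marked per round trip,
 * updated with gain g = 1/16. Names and scaling are assumptions for
 * this sketch, not the BBR v2 implementation. */
#define ALPHA_UNIT    1024u  /* fixed-point scale for alpha */
#define ALPHA_G_SHIFT 4u     /* gain g = 1/16 */

static uint32_t update_alpha(uint32_t alpha, uint32_t acked, uint32_t ce_marked)
{
    /* Fraction of this round's packets that carried CE, in ALPHA_UNIT units. */
    uint32_t frac = acked ? (ce_marked * ALPHA_UNIT) / acked : 0;

    /* alpha <- (1 - g) * alpha + g * frac */
    return alpha - (alpha >> ALPHA_G_SHIFT) + (frac >> ALPHA_G_SHIFT);
}

int main(void)
{
    uint32_t alpha = 0;

    /* Three round trips with 50%, 0%, and 100% of packets CE-marked:
     * alpha tracks the recent marking fraction rather than reacting
     * fully to any single round. */
    alpha = update_alpha(alpha, 100, 50);
    alpha = update_alpha(alpha, 100, 0);
    alpha = update_alpha(alpha, 100, 100);
    printf("alpha = %u / %u\n", alpha, ALPHA_UNIT);
    return 0;
}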

And for high-RTT WAN flows sharing a bottleneck with some low-RTT (e.g. intra-datacenter) flows, then the WAN flow can see, over the course of its round trip, many transitions between high-ECN-marking phases and no-ECN-marking phases, as the low-RTT flows go through cycles of probing and backing off. So from the perspective of a high-RTT WAN flow, there can be low levels of such ECN marking even when the bottleneck is not completely full over that flow's RTT time scale.

The particular ECN threshold listed there is just a parameter in the current iteration of the code. The details of the mechanism and the value of the parameter are likely to evolve a bit more as we experiment. We are happy to hear suggestions or pointers to experiment results or research from other folks on this design question.

thanks,
neal

Jonathan Morton

Jul 21, 2018, 9:47:53 AM
to Neal Cardwell, Dave Taht, BBR Development
> On 21 Jul, 2018, at 4:05 pm, 'Neal Cardwell' via BBR Development <bbr...@googlegroups.com> wrote:
>
> And for high-RTT WAN flows sharing a bottleneck with some low-RTT (e.g. intra-datacenter) flows, then the WAN flow can see, over the course of its round trip, many transitions between high-ECN-marking phases and no-ECN-marking phases, as the low-RTT flows go through cycles of probing and backing off. So from the perspective of a high-RTT WAN flow, there can be low levels of such ECN marking even when the bottleneck is not completely full over that flow's RTT time scale.

This implies, of course, that the AQM on this bottleneck is running without flow-isolation, and that the low-RTT flows are transiently saturating their path, just on much shorter timescales than the long RTT. I don't know how often that happens in Google's datacentres, but it would be decidedly rare on my own LAN *even if* I wasn't running with flow isolation.

The unfortunate effect of BBR2's behaviour in this respect is that the response to genuine saturation of the WAN bottleneck will be severely delayed - and this is a general concern I've had about DCTCP from the start, which makes it fundamentally incompatible with the accepted definition of ECN used elsewhere on the Internet (and assumed by Codel, which begins by sending one CE mark per pre-estimated RTT). This definition is: a *single* ECE signal reaching the sender means "we are saturated, back off NOW".

In particular, it would take Codel a very long time to reach 50% CE marking at high packet rates. This means Codel cannot reliably influence BBR2 (in its current form) with ECN.
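To put a rough number on that: assuming Codel's default 100ms interval and its interval/sqrt(count) marking schedule, a quick back-of-the-envelope calculation (illustrative numbers only, not from the slides) looks like this:

#include <math.h>
#include <stdio.h>

/* In its marking state, Codel emits its n-th consecutive CE mark
 * interval/sqrt(n) after the previous one. This estimates how long it
 * takes the instantaneous marking rate to reach 50% of packets at a
 * given packet rate. All numbers are assumptions for this sketch. */
int main(void)
{
    const double interval = 0.100;    /* Codel's default interval, seconds */
    const double pkt_rate = 100000.0; /* packets/sec (~1.2 Gbit/s of MTU-size packets) */

    double t = interval; /* first mark, after sojourn exceeds target for one interval */
    double n = 1.0;      /* consecutive marks so far */

    /* Marking rate after n marks is roughly sqrt(n)/interval marks per second. */
    while (sqrt(n) / interval < 0.5 * pkt_rate) {
        n += 1.0;
        t += interval / sqrt(n);
    }
    printf("~%.0f marks and ~%.0f seconds to reach 50%% marking\n", n, t);
    return 0;
}

At 100k packets/sec that works out to on the order of a thousand seconds under this model, which is why I say Codel cannot reliably push BBR2 over a 50% threshold.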

At minimum, I want to see the first ECE trigger an exit from STARTUP into one of the standard probing phases. This is the single most important application for ECN with a delay-sensitive TCP, since it gives information to that TCP more quickly than a delay signal. With NewReno, the default cwnd-halving response to ECE exactly matches and compensates for the doubling in cwnd per RTT that occurred since the provoking packet was sent; a similar argument can be stated for CUBIC.

I'd also like the CE marking threshold (currently 50%) to match the packet loss threshold (currently 1%), as that is another ECN RFC requirement - that TCP senders should react to ECE in the same way as to packet loss with respect to congestion control. This will give Codel a much better chance of successfully notifying BBR2 about adverse changes in path capacity. BBR2 can then sense the new path capacity using the most recent bandwidth estimate, as described.

- Jonathan Morton

Neal Cardwell

Jul 22, 2018, 4:40:12 PM
to Jonathan Morton, Dave Taht, BBR Development
On Sat, Jul 21, 2018 at 9:47 AM Jonathan Morton <chrom...@gmail.com> wrote:
> On 21 Jul, 2018, at 4:05 pm, 'Neal Cardwell' via BBR Development <bbr...@googlegroups.com> wrote:
>
> And for high-RTT WAN flows sharing a bottleneck with some low-RTT (e.g. intra-datacenter) flows, then the WAN flow can see, over the course of its round trip, many transitions between high-ECN-marking phases and no-ECN-marking phases, as the low-RTT flows go through cycles of probing and backing off. So from the perspective of a high-RTT WAN flow, there can be low levels of such ECN marking even when the bottleneck is not completely full over that flow's RTT time scale.

> This implies, of course, that the AQM on this bottleneck is running without flow-isolation, and that the low-RTT flows are transiently saturating their path, just on much shorter timescales than the long RTT.  I don't know how often that happens in Google's datacentres, but it would be decidedly rare on my own LAN *even if* I wasn't running with flow isolation.

Yes, this was imagining a bottleneck running without flow isolation. I agree flow isolation is nice, but I am not aware of large deployed installations of flow-isolating AQMs. AFAIK the trend seems to be single-queue AQMs like PIE in DOCSIS cable modem upstream links, largely due to concerns about AQMs not really being able to see flows in important cases like VPN traffic.
 
> The unfortunate effect of BBR2's behaviour in this respect is that the response to genuine saturation of the WAN bottleneck will be severely delayed - and this is a general concern I've had about DCTCP from the start, which makes it fundamentally incompatible with the accepted definition of ECN used elsewhere on the Internet (and assumed by Codel, which begins by sending one CE mark per pre-estimated RTT).  This definition is: a *single* ECE signal reaching the sender means "we are saturated, back off NOW".
>
> In particular, it would take Codel a very long time to reach 50% CE marking at high packet rates.  This means Codel cannot reliably influence BBR2 (in its current form) with ECN.
>
> At minimum, I want to see the first ECE trigger an exit from STARTUP into one of the standard probing phases.  This is the single most important application for ECN with a delay-sensitive TCP, since it gives information to that TCP more quickly than a delay signal.  With NewReno, the default cwnd-halving response to ECE exactly matches and compensates for the doubling in cwnd per RTT that occurred since the provoking packet was sent; a similar argument can be stated for CUBIC.
>
> I'd also like the CE marking threshold (currently 50%) to match the packet loss threshold (currently 1%), as that is another ECN RFC requirement - that TCP senders should react to ECE in the same way as to packet loss with respect to congestion control.  This will give Codel a much better chance of successfully notifying BBR2 about adverse changes in path capacity.  BBR2 can then sense the new path capacity using the most recent bandwidth estimate, as described.

Sounds like you are advocating for RFC 3168 ECN here. The direction we are currently thinking about for BBR v2 is using DCTCP/L4S-style ECN signals.

From my perspective, here are a couple quick thoughts about RFC 3168 ECN:

o RFC 3168 is a 17-year-old standard, and yet I am not aware of any large scale deployments of bottlenecks marking with RFC 3168 ECN. For example (slide 12):
The momentum and planned large-scale deployments that I'm aware of seem to be with DCTCP/L4S-style ECN. Though perhaps there are other data points that I'm not yet aware of. 

o RFC 3168 puts the congestion control decision in the hands of devices in the middle of the network. With the final decision in the middle of the network and mandated responses at the receiver and sender, algorithm evolution might be slower; bugs and/or malicious RFC 3168 CC code in routers would, I imagine, be slower to be fixed than in approaches where the congestion control decision is at the sender.

o There is concern that the RFC 3168 ECN response is not scalable enough, e.g. as discussed in:

cheers,
neal

Jonathan Morton

Jul 22, 2018, 4:56:36 PM
to Neal Cardwell, Dave Taht, BBR Development
> On 22 Jul, 2018, at 11:39 pm, Neal Cardwell <ncar...@google.com> wrote:
>
> o RFC 3168 is a 17-year-old standard, and yet I am not aware of any large scale deployments of bottlenecks marking with RFC 3168 ECN. [...] Though perhaps there are other data points that I'm not yet aware of.

Those results are mainly due to the general absence of AQM *in general* at bottlenecks, and the general non-use of ECN in endpoint hosts, until very recently. Frankly, it's taken a criminally long time to get ECN deployed and into actual use, mostly because of severe inertia among the many middlebox vendors and operators.

The default qdisc in several major Linux distributions is now fq_codel. The sqm_scripts package is one of the most popular components of OpenWRT, and now includes Cake support - all Codel based. The make-wifi-fast patches are all based around a variant of fq_codel, and any router running a recent enough version of OpenWRT and with ath9k/10k hardware will be using it.

Existing endpoint hosts (running Windows, OSX, iOS, Linux, Android) support RFC3168-style ECN in their TCP stacks. It might not actually be turned on by default, but when it is, RFC 3168 behaviour is what is supported. The necessary TCP extensions for Accurate ECN reporting, which is required for DCTCP, are generally not supported. Codel is designed and carefully tuned to work with these TCP stacks.

So from a consumer point of view, Codel is *the* dominant deployment of AQM with ECN support, while anything implementing L4S/DCTCP is invisible and irrelevant. Hence my concern about BBR2's inability to benefit from Codel's signalling behaviour. These concerns would be substantially reduced if:

- A single ECE was sufficient cause to exit STARTUP and enter a normal probing mode.

- The ECN response threshold was reduced to 1%, the same as the packet loss threshold.

Anything less would be incompatible with existing deployments of ECN on the consumer-facing Internet.

- Jonathan Morton

Dave Taht

Jul 23, 2018, 12:20:04 AM
to Jonathan Morton, Neal Cardwell, BBR Development
I'd write more but it started turning into a rant. Staying constructive:

On Sun, Jul 22, 2018 at 1:56 PM Jonathan Morton <chrom...@gmail.com> wrote:
>
> > On 22 Jul, 2018, at 11:39 pm, Neal Cardwell <ncar...@google.com> wrote:
> >
> > o RFC 3168 is a 17-year-old standard, and yet I am not aware of any large scale deployments of bottlenecks marking with RFC 3168 ECN. [...] Though perhaps there are other data points that I'm not yet aware of.
>
> Those results are mainly due to the general absence of AQM *in general* at bottlenecks, and the general non-use of ECN in endpoint hosts, until very recently. Frankly, it's taken a criminally long time to get ECN deployed and into actual use, mostly because of severe inertia among the many middlebox vendors and operators.
>
> The default qdisc in several major Linux distributions is now fq_codel. The sqm_scripts package is one of the most popular components of OpenWRT, and now includes Cake support - all Codel based. The make-wifi-fast patches are all based around a variant of fq_codel, and any router running a recent enough version of OpenWRT and with ath9k/10k hardware will be using it.
>
> Existing endpoint hosts (running Windows, OSX, iOS, Linux, Android) support RFC3168-style ECN in their TCP stacks. It might not be actually turned on by default, but if it is, that's what is supported. The necessary TCP extensions for Accurate ECN reporting, which is required for DCTCP, are generally not supported. Codel is designed and carefully tuned to work with these TCP stacks.

Apple enabled classic ecn universally over a year ago. Systemd last month.

You could get some data on the deployment from enabling classic ecn on
those google servers that still run cubic.

> So from a consumer point of view, Codel is *the* dominant deployment of AQM with ECN support, while anything implementing L4S/DCTCP is invisible and irrelevant. Hence my concern about BBR2's inability to benefit from Codel's signalling behaviour. These concerns would be substantially reduced if:
>
> - A single ECE was sufficient cause to exit STARTUP and enter a normal probing mode.

PLEASE. The real world consists of hundreds of short flows entering slow
start every minute; theorist tests tend to consist of two long-duration flows.

> - The ECN response threshold was reduced to 1%, the same as the packet loss threshold.

Um, I'd prefer something even more drastic. A single CE response could
be min(lowest observed delay in the last 100ms, whatever cubic would
do) and, further, entering probertt in circumstances where the
observed rtt has not budged downward even that much would reduce the
latecomer problem bbr has. [1]

Wouldn't it be great if bbr found the right rate in 100ms rather than 10 sec?

Anyway, with patches of any sort for bbr2 in hand + some ecn response
I can deploy something bbr-like to my flent-wherever.bufferbloat.net
testbeds around the world... would ease my cake testing a lot, because
currently I get bbr + ecn being an unresponsive sender which is truly
ugly. (engages fq_codel's bulk dropper, and (theoretically), cake's
BLUE)

I'm hoping we really nailed the ack-filter, as random ack drop and
dctcp don't mix.

[1] I know full well this is the wrong thing for single-queued RED.
But being ultra gentle as a starting point for research would be cool,
and it instantly moves the flow into the fast fq_codel queue, giving
the closest thing to the actual rtt I can think of. Our usual
fq_codeled uplink also starts falling into the fast queue for acks.

> Anything less would be incompatible with existing deployments of ECN on the consumer-facing Internet.

+10. There are some gradual steps like using ect(1) as a dctcp
indicator, which we could leverage ce_threshold for.

>
> - Jonathan Morton

Kevin Bowling

Jul 23, 2018, 3:53:22 AM
to Dave Taht, Jonathan Morton, Neal Cardwell, BBR Development
Trying to establish signal from neckbearding, it seems second hop WAP
AQMs would conceivably be able to handle flow state but it seems
completely unrealistic in any short term across the greater internet.
If you bias toward these second hops, it would be imperative for
carriers to remain filtering ECN on the border of their networks.
What is the greater challenge, campus congestion or everything else?

Regards,

Kevin Bowling, Principal Engineer
EXPERIENCE FIRST.
+1 602 850 5036 | +1 480 227 1233
www.limelight.com

Jonathan Morton

Jul 23, 2018, 4:08:10 AM
to Kevin Bowling, Dave Taht, Neal Cardwell, BBR Development
> On 23 Jul, 2018, at 10:53 am, Kevin Bowling <kbow...@llnw.com> wrote:
>
> Trying to establish signal from neckbearding, it seems second hop WAP
> AQMs would conceivably be able to handle flow state but it seems
> completely unrealistic in any short term across the greater internet.

Fair, and acknowledged. BBR should remain capable of handling cases where it hears AQM signals primarily intended for other flows. But I don't think my suggestions are incompatible with that.

BBR2 appears to be designed to take "ECN signal over threshold" as a signal that network conditions have recently changed and need to be re-measured. I and Dave are just pointing out that the threshold needs to be a lot lower.

> If you bias toward these second hops, it would be imperative for
> carriers to remain filtering ECN on the border of their networks.

Why? If the second hop is the bottleneck - as it often is in practice - then it's imperative for the ECN signals it produces to get through!

> What is the greater challenge, campus congestion or everything else?

LAN, last-mile, backhaul, core, and datacentre. All very different environments from each other. It currently sounds as if the BBR folks have focused too much on the datacentre for their traffic statistics. I'll admit to a personal bias towards the last mile.

In terms of *deployment* of existing, proven AQM technology, the last mile and ISP backhaul are the biggest problems, because hardware vendors are blithely ignoring the problem, and ISPs aren't on their case about it either. As Dave mentioned, the end-host problem is finally starting to improve.

- Jonathan Morton

Kevin Bowling

Jul 23, 2018, 4:48:26 AM
to Jonathan Morton, Dave Taht, Neal Cardwell, BBR Development
On Mon, Jul 23, 2018 at 1:08 AM, Jonathan Morton <chrom...@gmail.com> wrote:
>> On 23 Jul, 2018, at 10:53 am, Kevin Bowling <kbow...@llnw.com> wrote:
>>
>> Trying to establish signal from neckbearding, it seems second hop WAP
>> AQMs would conceivably be able to handle flow state but it seems
>> completely unrealistic in any short term across the greater internet.
>
> Fair, and acknowledged. BBR should remain capable of handling cases where it hears AQM signals primarily intended for other flows. But I don't think my suggestions are incompatible with that.
>
> BBR2 appears to be designed to take "ECN signal over threshold" as a signal that network conditions have recently changed and need to be re-measured. I and Dave are just pointing out that the threshold needs to be a lot lower.
>
>> If you bias toward these second hops, it would be imperative for
>> carriers to remain filtering ECN on the border of their networks.
>
> Why? If the second hop is the bottleneck - as it often is in practice - then it's imperative for the ECN signals it produces to get through!

I don't have anything other than conjecture but the issue to me is
that the ECN bit quickly degrades to entropy when moving between
autonomous systems. For instance, the smart people that read this
list are debating L4S and RFC 3168. Now imagine the billion dollar
hardware/firmware companies that can't be bothered to read this list
and have their own subtle meanings. It's easy to imagine validity of
the bit between a client and a WAP, or between hops in a homogeneous
AS, but as soon as you move between two or more the meaning becomes
random.

>> What is the greater challenge, campus congestion or everything else?
>
> LAN, last-mile, backhaul, core, and datacentre. All very different environments from each other. It currently sounds as if the BBR folks have focused too much on the datacentre for their traffic statistics. I'll admit to a personal bias towards the last mile.
>
> In terms of *deployment* of existing, proven AQM technology, the last mile and ISP backhaul are the biggest problems, because hardware vendors are blithely ignoring the problem, and ISPs aren't on their case about it either. As Dave mentioned, the end-host problem is finally starting to improve.

Full ACK here, campus is a very difficult problem but also seems
tractable with projects like the ones you two are working on.

One of the most interesting things about BBR is that it can distinguish
between channel loss and congestion loss, and it also has heuristics for
detecting policing. So, two strong signals and one weaker one. How do you
reconcile the different ECN meanings with the data available as it is,
and with the end-to-end principle? Eeek.

> - Jonathan Morton
>

Jonathan Morton

Jul 23, 2018, 5:07:05 AM
to Kevin Bowling, Dave Taht, Neal Cardwell, BBR Development
> On 23 Jul, 2018, at 11:48 am, Kevin Bowling <kbow...@llnw.com> wrote:
>
>>> If you bias toward these second hops, it would be imperative for
>>> carriers to remain filtering ECN on the border of their networks.
>>
>> Why? If the second hop is the bottleneck - as it often is in practice - then it's imperative for the ECN signals it produces to get through!
>
> I don't have anything other than conjecture but the issue to me is
> that the ECN bit quickly degrades to entropy when moving between
> autonomous systems.

But ECN does not, in fact, "degrade" as you suggest. Packets that are marked *stay* marked - any middlebox that unmarks them is RFC-ignorant (we call them "ECN blackholes", and they have fortunately disappeared for all practical purposes). Most middleboxes aren't even aware of the mark's existence, and just pass the packet along unchanged.

There is only a difference in the marking behaviours of different AQM algorithms, and a difference between the DCTCP interpretation/response and the classical interpretation/response to each marked packet. The latter is the focus of our current discussion. Standard TCP stacks respond appropriately to any currently-deployed AQM; the presently described behaviour of BBR2 does not.

- Jonathan Morton

Kevin Bowling

Jul 23, 2018, 5:20:00 AM
to Jonathan Morton, Dave Taht, Neal Cardwell, BBR Development
On Mon, Jul 23, 2018 at 2:07 AM, Jonathan Morton <chrom...@gmail.com> wrote:
>> On 23 Jul, 2018, at 11:48 am, Kevin Bowling <kbow...@llnw.com> wrote:
>>
>>>> If you bias toward these second hops, it would be imperative for
>>>> carriers to remain filtering ECN on the border of their networks.
>>>
>>> Why? If the second hop is the bottleneck - as it often is in practice - then it's imperative for the ECN signals it produces to get through!
>>
>> I don't have anything other than conjecture but the issue to me is
>> that the ECN bit quickly degrades to entropy when moving between
>> autonomous systems.
>
> But ECN does not, in fact, "degrade" as you suggest. Packets that are marked *stay* marked - any middlebox that unmarks them is RFC-ignorant (we call them "ECN blackholes", and they have fortunately disappeared for all practical purposes). Most middleboxes aren't even aware of the mark's existence, and just pass the packet along unchanged.

Yes, good point: the behavior I would typically expect is a return to
zero, not random. I am surprised you see high negotiation rates; the
most comprehensive study I recall was disclosed by Apple at IETF 98.

Neal Cardwell

Jul 23, 2018, 11:53:34 AM
to Jonathan Morton, Dave Taht, BBR Development
On Sun, Jul 22, 2018 at 4:56 PM Jonathan Morton <chrom...@gmail.com> wrote:
> On 22 Jul, 2018, at 11:39 pm, Neal Cardwell <ncar...@google.com> wrote:
>
> o RFC 3168 is a 17-year-old standard, and yet I am not aware of any large scale deployments of bottlenecks marking with RFC 3168 ECN.  [...]  Though perhaps there are other data points that I'm not yet aware of.

> Those results are mainly due to the general absence of AQM *in general* at bottlenecks, and the general non-use of ECN in endpoint hosts, until very recently.  Frankly, it's taken a criminally long time to get ECN deployed and into actual use, mostly because of severe inertia among the many middlebox vendors and operators.
>
> The default qdisc in several major Linux distributions is now fq_codel.  The sqm_scripts package is one of the most popular components of OpenWRT, and now includes Cake support - all Codel based.  The make-wifi-fast patches are all based around a variant of fq_codel, and any router running a recent enough version of OpenWRT and with ath9k/10k hardware will be using it.
>
> Existing endpoint hosts (running Windows, OSX, iOS, Linux, Android) support RFC3168-style ECN in their TCP stacks.  It might not be actually turned on by default, but if it is, that's what is supported.  The necessary TCP extensions for Accurate ECN reporting, which is required for DCTCP, are generally not supported.  Codel is designed and carefully tuned to work with these TCP stacks.
>
> So from a consumer point of view, Codel is *the* dominant deployment of AQM with ECN support, while anything implementing L4S/DCTCP is invisible and irrelevant.

I understand that the end systems can support RFC3168-style ECN in their TCP stacks, and actually do enable it in many cases.

My questions are more about the reality of the support in the Internet:

(1) Are there major ISPs that have actually deployed AQMs at their CMTS/DSLAM/etc that use RFC3168-ECN marking?

(2) Are there home router/AP boxes that are widely deployed that have code that's actually enabled by default that moves the downstream bottleneck to the home router and uses AQMs with RFC3168-ECN marking?
 
neal

Jonathan Morton

Jul 23, 2018, 12:04:01 PM
to Neal Cardwell, Dave Taht, BBR Development
> On 23 Jul, 2018, at 6:53 pm, Neal Cardwell <ncar...@google.com> wrote:
>
> (1) Are there major ISPs that have actually deployed AQMs at their CMTS/DSLAM/etc that use RFC3168-ECN marking?

I think free.fr might do this. If so, it's almost unique in the world. As I said, the hardware vendors don't make it easy to do so. ISPs also have a perverse incentive here because they can always upsell.

> (2) Are there home router/AP boxes that are widely deployed that have code that's actually enabled by default that moves the downstream bottleneck to the home router and uses AQMs with RFC3168-ECN marking?

There's the IQrouter; this is in fact its raison d'être. I don't know what its sales figures are like, but it's a real commercial product that's selling enough to have a professional support team.

- Jonathan Morton

Dave Taht

Jul 23, 2018, 12:37:02 PM
to Jonathan Morton, Neal Cardwell, BBR Development
On Mon, Jul 23, 2018 at 9:03 AM Jonathan Morton <chrom...@gmail.com> wrote:
>
> > On 23 Jul, 2018, at 6:53 pm, Neal Cardwell <ncar...@google.com> wrote:
> >
> > (1) Are there major ISPs that have actually deployed AQMs at their CMTS/DSLAM/etc that use RFC3168-ECN marking?

We don't know. Turn it on, go measure. I think quite a few dsl isps
use fq. I certainly see wred, but don't know if they bothered to
enable ecn.

Predominantly, clued home gamers/small business owners, desperate for
usable bandwidth, are the ones bypassing the "badwidth" they get from
the isps.

I think ce stats *after* the isp's link (on local wifi) are going to
start showing up soon in bulk data on traces > 20mbit, against Apple's
devices and backends, if they haven't already. If no google services are
using ecn, that would explain you not seeing classic ecn. :)

> I think free.fr might do this. If so, it's almost unique in the world.

And they are usually number 1 here both up and down
http://www.dslreports.com/speedtest/results/bufferbloat
http://www.dslreports.com/speedtest/results/bufferbloat?up

they wrote their own dsl driver, too. I think however they are mostly
number 1 because of fixing their now fully fq_codel'd return path, and
having sane fifo sizes on their dslams' forward path.

given your vantage point you could run a test just against free.fr's
BGP space and see what happens. Or do a survey of the top 20, off
those dslreports pages

> As I said, the hardware vendors don't make it easy to do so. ISPs also have a perverse incentive here because they can always upsell.
>
> > (2) Are there home router/AP boxes that are widely deployed that have code that's actually enabled by default that moves the downstream bottleneck to the home router and uses AQMs with RFC3168-ECN marking?
>
> There's the IQrouter; this is in fact its raison d'être. I don't know what its sales figures are like, but it's a real
> commercial product that's selling enough to have a professional support team.

edgerouters have shipped sqm for several years now. eero, recently
too. all the third party firmwares, led by lede/openwrt, started in
2012... some variants of streamboost are fq_codel based, things like
asus's adaptive qos are based on a fq_codel backport to linux
2.6.something, pretty much all the gaming routers I know of have
something fq_codel based in 'em.

So far as I know, none turn ecn off except at low bandwidths.

uptake in the fq_codel wifi part of the universe is going well, but as
for "ecn" in third-party reimplementations of the fq_codel wifi code,
like meraki's (they do sfq-ish at a low level and push the aqm
component to click), or qualcomm's 802.11ac proprietary firmware, I
don't know. I'll go ask.

I am painfully aware that 'round here, in our debloated world, everybody
shapes outbound and nearly everybody shapes inbound.


>
> - Jonathan Morton

Alexey Ivanov

Jul 23, 2018, 5:19:21 PM
to Neal Cardwell, Jonathan Morton, Dave Taht, BBR Development
sorry to barge in, but here are my 2 cents:

> (1) Are there major ISPs that have actually deployed AQMs at their CMTS/DSLAM/etc that use RFC3168-ECN marking?
>
> (2) Are there home router/AP boxes that are widely deployed that have code that's actually enabled by default that moves the downstream bottleneck to the home router and uses AQMs with RFC3168-ECN marking?

I think for both of these cases Google is in an ideal position to be able to measure ECN usage across the Internet.

--
t: @SaveTheRbtz

Dave Taht

Jul 23, 2018, 9:07:59 PM
to Jonathan Morton, Neal Cardwell, BBR Development
fq_codel (and fq_pie) landed in freebsd as of 10.3 and 11, with ecn
enabled by default, in june 2016. Didn't look like there was a way to
turn it off, either...

openbsd got fq_codel (don't know about ecn) in v6.2.

success story and the hfsc + fq_codel simple pf.conf config file:
https://pauladamsmith.com/blog/2018/07/fixing-bufferbloat-on-your-home-network-with-openbsd-6.2-or-newer.html

As a side note, I went looking over tcp_bbr.c yesterday and saw that
probertt lasted up to 200ms, whereas the typical comcast cmts buffering
at 100mbit is over 600ms on the downlink side alone.

While we would certainly like more ISPs on their head-ends to
correctly set their buffer sizes relative to bdp, and better yet do
sfq, or drr at that more correct buffer size, instead of fifos... or
go the full monty for fq_codel or fq_pie or cake... so far... well...
we wait, pensively.

Along the edge, thus, we shape outbound and inbound. It's probably easier
for ISPs to fix their supplied CPE than it is to fix their head-ends.

Dave Taht

Jul 25, 2018, 10:24:44 PM
to Kevin Bowling, Jonathan Morton, Neal Cardwell, BBR Development
I missed a portion of this thread, going back a ways:

Disabling ecn neg on bbr (for now) was a rejected patch to the linux
kernel a while back. What dave miller had to say
was worth reading. https://patchwork.ozlabs.org/patch/861983/

Another thing that's not obvious (at least about the current linux ect
marking mechanism) is that it automagically reverts to drop: if a
second aqm on a link further away encounters a CE-marked packet that
it wants to mark, it will drop it rather than pass it along as CE. This is
some protection against malicious senders, but not a lot.

One thing I proposed ages ago (and was in cake until recently) was
"drop and mark" in cases of extreme congestion,
so a tcp that got both signals could back off even harder. One thing I
didn't realize fully when suggesting it is that my thinking was
colored by the deterministic drop scheduling fq_codel does, and not
the random mechanisms used elsewhere, still....

I have, of course, a great hope that fq_codel based fair queuing + aqm
schedulers will sweep the edge of the internet and (especially) wifi
and other places where needed.

One of the tools I'd been using to measure one-way delay and things
like ecn behaviors has been irtt (https://github.com/heistp/irtt).
You can send an isochronous stream, ect(0) marked, and observe if
anything touched it. Example:

irtt client -o test.log -i 10ms -l 1472 --dscp=0x02
flent-fremont.bufferbloat.net # -q if you want it quiet

To see CE get dropped - or passed preferentially, use 0x03.

(If you want to send a stream at a higher rate than one packet per 10ms,
set up your own irtt server! And I recently found out my test cablemodem
was seriously flawed on udp traffic, so I've had to discard a lot of
results. I give up.)

Anyway, all that said, "ECN is the wet paint of the congestion control
universe", and yet my hopes remain high that a sane, conservative and
quickly reactive version appears for BBR along the edge... but

I lack sufficient time, energy, money and cigarettes to chase this
chimera at the moment.

Kevin Bowling

Jul 31, 2018, 7:05:14 AM
to Dave Taht, Jonathan Morton, Neal Cardwell, BBR Development
ISTM people are putting the cart before the horse. We happen to have
this available field, ECN, that is supposed to signal congestion events,
and we are expected to use it in congestion controls, but it has been
multiply defined as well as used in site-local ways, and at this point
it is basically entropy in the context of Internet traffic. If it can be
rigorously defined, or negotiated, all my concerns would go away.

Regards,

Kevin Bowling, Principal Engineer
EXPERIENCE FIRST.
+1 602 850 5036 | +1 480 227 1233
www.limelight.com



Jonathan Morton

unread,
Jul 31, 2018, 7:59:29 AM7/31/18
to Kevin Bowling, Dave Taht, Neal Cardwell, BBR Development
> On 31 Jul, 2018, at 2:05 pm, Kevin Bowling <kbow...@llnw.com> wrote:
>
> If it can be rigorously defined, or negotiated, all my concerns would go away.

I place the blame squarely at the feet of DCTCP. Everyone else is at least attempting to maintain some sort of compatibility with the ECN RFC. DCTCP implementations, by contrast, explicitly fail to respond to RFC-compliant ECN signals from the network (specifically those from Codel) in an RFC-compliant manner.

As you might gather from the above, ECN *is* rigorously defined, in the RFC.

There is however one element of unused codespace, namely the distinction between ECT(0) and ECT(1). RFC-compliant TCP stacks treat these as synonymous on reception, and emit only ECT(0). There was originally a second RFC which used the distinction as a guard against certain types of ECN blackholes, but nobody implemented it, so ECT(1) is effectively a spare codepoint.

It has occurred to me, for several years now, that ECT(1) could be used as a "softer" congestion signal than CE. Assuming there is a way to convey this information back to the sender (an assumption made by DCTCP in any case), you could examine a three-packet window of arriving ECT(x) codepoints, and adjust the cwnd according to the ratio of ECT(0) to ECT(1) observed - and, of course, still respond to CE according to the original ECN RFC. This would allow conventional TCP stacks to be dynamically instructed to restrict their increase rate to "Reno-style linear" at most, or to hold cwnd steady, or to decrease it linearly - permitting much finer control of congestion than is presently possible.
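As a toy sketch of that idea (purely illustrative, not an existing implementation; the specific cwnd adjustments are placeholders):

#include <stdio.h>

enum ecn_codepoint { ECT0, ECT1, CE };

/* Toy sketch of the ECT(x)-ratio idea: over a three-packet window, the
 * ratio of ECT(0) to ECT(1) selects the cwnd trend, while a CE mark
 * still gets the classic RFC 3168 response. The exact increments are
 * assumptions for illustration. */
static double adjust_cwnd(double cwnd, const enum ecn_codepoint win[3])
{
    int ect1 = 0;

    for (int i = 0; i < 3; i++) {
        if (win[i] == CE)
            return cwnd / 2.0; /* classic ECN response, per the original RFC */
        if (win[i] == ECT1)
            ect1++;
    }
    switch (ect1) {
    case 0:  return cwnd + 1.0; /* no soft signal: at most Reno-style linear growth */
    case 1:  return cwnd + 0.5; /* mild signal: slow the increase */
    case 2:  return cwnd;       /* hold cwnd steady */
    default: return cwnd - 1.0; /* all ECT(1): decrease linearly */
    }
}

int main(void)
{
    enum ecn_codepoint win[3] = { ECT0, ECT1, ECT1 };

    printf("cwnd 10 -> %.1f\n", adjust_cwnd(10.0, win));
    return 0;
}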

It should even be possible to change existing DCTCP implementations to observe the signal on ECT(x) ratio instead of on CE-mark ratio. This would allow them to be brought into compliance with the original RFC.

- Jonathan Morton

Kevin Bowling

Jul 31, 2018, 8:06:39 AM
to Jonathan Morton, Dave Taht, Neal Cardwell, BBR Development
ABE seems to be this "softer signal" approach --
https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn-05

I am unhappy we can't come up with a simple state machine to
rigorously define compliant use and severity but I am not surprised by
the current casual state of affairs.

Regards,

Kevin Bowling, Principal Engineer
EXPERIENCE FIRST.
+1 602 850 5036 | +1 480 227 1233
www.limelight.com



Jonathan Morton

Jul 31, 2018, 8:45:33 AM
to Kevin Bowling, Dave Taht, Neal Cardwell, BBR Development
> On 31 Jul, 2018, at 3:06 pm, Kevin Bowling <kbow...@llnw.com> wrote:
>
> ABE seems to be this "softer signal" approach --
> https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn-05

No, it is not. It only reduces the factor of Multiplicative Decrease when a CE mark is received, relative to that in response to a lost packet, and it makes no use of ECT(1). This does not solve any fundamental problems.

- Jonathan Morton

Dave Taht

Jul 31, 2018, 12:09:40 PM
to Jonathan Morton, Kevin Bowling, Neal Cardwell, BBR Development
I actually did an experiment last night using flent's tcp_ndown test
with 128 flows.

flent -H server --te=download_streams=128 tcp_ndown
--te=tcp_cong_control=X # There's a ton more parameters worth
capturing

Topology:

sch_fq server -> gbit-link -> box with cake bandwidth 100mbit -> client

using (dctcp, cubic, reno) (with and without ecn), and bbr (without ecn).
cake was configured for a codel target of 500us and interval of 10ms (lan mode).

In other words, the path has room for ~3 packets outstanding (130us each).

What does your intuition say for what happens?

To the throughput?
To the buffering on the server?
the apparent RTT observed by the flows?
the progression of the congestion window?
the observed buffering on the 100mbit host?, and the
impact on competing flows trying to start up while things are in the full monty?

(I need to redo the test to capture most of that data)

Dave Taht

Aug 2, 2018, 11:00:18 PM
to Jonathan Morton, Kevin Bowling, Neal Cardwell, BBR Development
No takers? I have not had time to redo the test.

anyway, if you'd like to comment on systemd's current approach to
turning on ecn universally for tcp, a relevant bug is here:

https://github.com/systemd/systemd/issues/9748

Bless, Roland (TM)

Aug 13, 2018, 8:40:01 AM
to Neal Cardwell, BBR Development, ic...@irtf.org
Hi Neal,

thanks for the update @IETF102.
While reviewing the presented slides some clarification questions
came up:

* When is inflight_lo set?

* Is the max delivery_rate filter still in place?
Slide 15 mentions: "Bandwidth estimator filter window
  now simply covers last 2 PROBE_BW cycles". Is "bandwidth
estimator filter window" the same as the max delivery_rate filter?
If so this window period could be as long as 10s?

* estimated_bdp still corresponds to the current bottleneck
share of the flow and is calculated by (filtered estimates
of) max_delivery_rate * RTT_min?

* Is BBR v2 now window-based, or, is it still rate-based
with a cwnd cap as in version 1? (e.g., Slide 5 mentions
"start probing at 1 extra packet": per what time period?).

* How is the ACK aggregation compensation (cf. ICCRG@IETF101
presentation) integrated? Will this extra amount be added to
inflight values?

* "hard ceiling" means "hit inflight_hi loss or ECN ceiling"
or something else?

* What is meant by "flow balance"?

* There is no explicit decrease action except for the
DRAIN and PROBE_BW DOWN phases (slide 30 mentions
multiplicative decrease)? How low is the mentioned
pacing rate at PROBE_BW DOWN (slide 30)?

Regards,
Roland

Neal Cardwell

Aug 13, 2018, 12:33:59 PM
to Bless, Roland (TM), BBR Development, ic...@irtf.org
On Mon, Aug 13, 2018 at 8:39 AM Bless, Roland (TM) <roland...@kit.edu> wrote:
> Hi Neal,
>
> thanks for the update @IETF102.
> While reviewing the presented slides some clarification questions
> came up:

Hi Roland,

Thanks for the excellent questions. Responses in-line.

Any page numbers I mention will be in reference to the IETF 102 slides under discussion:

> * When is inflight_lo set?

The inflight_lo parameter is set/updated (a) in fast recovery, or (b) when receiving ECN signals (page 10).
 
> * Is the max delivery_rate filter still in place?
>   Slide 15 mentions: "Bandwidth estimator filter window
>   now simply covers last 2 PROBE_BW cycles". Is "bandwidth
>   estimator filter window" the same as the max delivery_rate filter?
>   If so this window period could be as long as 10s?

Yes, the max delivery rate filter is still in place. Yes, the "bandwidth estimator filter window" is the same as the max delivery_rate filter (aka BtlBw or bbr_bw(sk) from the Linux TCP BBR code).

As you note, in the version of the design described in the slides, yes, the filter window could stretch to 2 * 5secs, or 10secs. In the current prototype design, the max bandwidth probing interval is 3 seconds, so it could stretch to 2 * 3sec = 6 secs. Note that it would only be that long for WAN flows with long RTTs. For flows with shorter RTTs the Reno probing time scale would usually be much shorter, so the BBR v2 bw probing time scale would be governed by the Reno calculation (eg for 10G datacenter environments the bandwidth probing time scale would tend to be less than 150 * 100usec ~= 15msec).
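As a rough sketch of those two regimes (my own illustration of the description above; the WAN round count is a made-up example, and this is not the actual BBR v2 code):

#include <stdio.h>

/* Sketch: the probe interval is roughly min(Reno time scale, wall-clock cap),
 * and the bandwidth filter window covers two such intervals. The 150 rounds
 * at 100us is the example from the text; the WAN numbers are hypothetical. */
int main(void)
{
    const double cap_secs = 3.0; /* max bandwidth-probing interval in the current prototype */
    struct { const char *name; double reno_rounds; double rtt_secs; } flows[] = {
        { "long-RTT WAN flow",   10000.0, 0.100    },
        { "10G datacenter flow",   150.0, 0.000100 },
    };

    for (int i = 0; i < 2; i++) {
        double reno_time = flows[i].reno_rounds * flows[i].rtt_secs;
        double probe_interval = reno_time < cap_secs ? reno_time : cap_secs;

        printf("%-22s probe interval ~%.3fs, bw filter window ~%.3fs\n",
               flows[i].name, probe_interval, 2.0 * probe_interval);
    }
    return 0;
}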

The 6 sec bandwidth filter timescale for WAN flows is longer than ideal, but seemed like the simplest way to meet the design constraints:

  (1) bandwidth probing can cause packet loss, so the time in between bandwidth probes should be adapted to longer time-scales for high-RTT flows for better coexistence with Reno/CUBIC

  (2) to maintain a robust bandwidth estimate, the bandwidth estimator filter window needs to cover at least the most recent bandwidth probing period, and also needs to cover some amount of history before that if the most recent bandwidth probe was very recent

If anyone sees a better way to harmonize these design constraints, please let us know.
 
> * estimated_bdp still corresponds to the current bottleneck
>   share of the flow and is calculated by (filtered estimates
>   of) max_delivery_rate * RTT_min?

Yes, exactly.
 
> * Is BBR v2 now window-based, or, is it still rate-based
>   with a cwnd cap as in version 1?

The latter. It is still mostly rate-based but with a cwnd cap, as in version 1. As with BBR v1, the BBR v2 code tries to smoothly control inflight with the pacing rate in the common case, but there is still a cwnd that places a hard upper bound on the amount of inflight data, and the cwnd becomes the limiting factor if (a) ACKs are delayed beyond the expected degree, or (b) the BBR sender cuts inflight to adapt to signals like packet loss (or ECN in v2).
 
> (e.g., Slide 5 mentions
>   "start probing at 1 extra packet": per what time period?).

In the current BBR v2 prototype code, the time scale for increasing the magnitude of probing is per round trip. This is described in more detail in slide 11: "inflight_probe grows exponentially per round trip: - 1, 2, 4, 8... packets". I think there are interesting trade-offs to be explored between making this probing time scale (a) per round trip (to allow rapid utilization of free bandwidth), or (b) per fixed wall clock interval (to improve RTT fairness). The current prototype opts for rapid utilization over RTT fairness. Partly this is based on the history of users/applications just opening more connections if TCP does not do a good job of utilizing free bandwidth. The hope is to make the congestion control efficient enough at quickly utilizing free bandwidth that users/applications do not have to resort to opening multiple connections just to make good use of newly-available bandwidth.
 
> * How is the ACK aggregation compensation (cf. ICCRG@IETF101
>   presentation) integrated? Will this extra amount be added to
>   inflight values?

The aggregation compensation discussed in the IETF 101 slides is still implemented using the same code posted in the patches in this bbr-dev thread, early on in the computation of the target cwnd. Then in BBR v2, as the final step of the bbr_set_cwnd() function the BBR v2 code makes sure that the cwnd is at or below any upper bound imposed by the BBR v2 model of what is an appropriate inflight level to maintain at this point in the state machine (based on inflight_lo and inflight_hi). So the inflight_lo/inflight_hi take precedence over the aggregation estimator. In particular this means that if BBR tries allowing extra data in flight to compensate for aggregation and then packet loss or ECN signals provide evidence that this higher level inflight data is causing excessive queues, then BBR adapts by trimming cwnd and inflight.
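In rough sketch form (an assumed structure to illustrate the precedence described above; this is not the actual bbr_set_cwnd() code, and the helper names are mine):

#include <stdint.h>
#include <stdio.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Assumed sketch: aggregation headroom is added to the target cwnd first,
 * then the BBR v2 model's inflight bounds (inflight_lo/inflight_hi, when
 * valid) take precedence as a final clamp. */
static uint32_t set_cwnd_sketch(uint32_t target_cwnd, uint32_t aggregation_extra,
                                uint32_t inflight_lo, uint32_t inflight_hi)
{
    uint32_t cwnd = target_cwnd + aggregation_extra;          /* compensate for ACK aggregation */
    uint32_t model_bound = min_u32(inflight_lo, inflight_hi); /* model's current upper bound */

    return min_u32(cwnd, model_bound);                        /* the model's bound wins */
}

int main(void)
{
    /* Example: aggregation compensation wants 120 packets in flight,
     * but loss/ECN evidence has capped the model's bound at 100. */
    printf("cwnd = %u\n", set_cwnd_sketch(100, 20, 100, 100));
    return 0;
}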
 
* "hard ceiling" means "hit inflight_hi loss or ECN ceiling"
  or something else?

Yes, the "hard ceiling" phrase on slide 10 means the case where inflight_hi was set due to loss or ECN signals. 
 
> * What is meant by "flow balance"?

By "flow balance" the slides mean a condition where the sender's current sending rate matches the current delivery rate for the flow.
 
> * There is no explicit decrease action except for the
>   DRAIN and PROBE_BW DOWN phases (slide 30 mentions
>   multiplicative decrease)?

There are two other explicit decrease actions in addition to the two you note (DRAIN and PROBE_BW DOWN). These two new mechanisms are at the core of the improvements in BBR v2. The slides try to outline these decrease actions on slide 10:

First, if the last bandwidth probe saw a "hard ceiling" where loss or ECN signals suggested there was an excessive queue, this is reflected in inflight_hi. After this happens, we try to maintain an inflight level that leaves unutilized headroom below that inflight level, by maintaining an inflight level respecting the constraint:

       inflight <= (1 - headroom) * inflight_hi     (where headroom=0.15)

Note that the 0.85 factor here is the same constant used by CUBIC's somewhat analogous "fast convergence" mechanism for having bigger flows make way for smaller/new flows.
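A minimal sketch of that headroom cap, in packets (illustrative only; the helper name is mine):

#include <stdint.h>
#include <stdio.h>

/* Cap the target inflight at (1 - headroom) * inflight_hi, with
 * headroom = 0.15, i.e. 85% of inflight_hi. Illustrative sketch only. */
static uint32_t cap_inflight_with_headroom(uint32_t target_inflight,
                                           uint32_t inflight_hi)
{
    uint32_t cap = inflight_hi - (inflight_hi * 15u) / 100u;

    return target_inflight < cap ? target_inflight : cap;
}

int main(void)
{
    /* If the last probe found a hard ceiling at 200 packets, steady-state
     * inflight is held at or below 170 packets until the next probe. */
    printf("inflight cap = %u\n", cap_inflight_with_headroom(400, 200));
    return 0;
}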

Second, upon loss or ECN signals the sender cuts inflight_lo to ensure that the sender does not increase inflight beyond the current level (until the next time it probes for bandwidth). This is somewhat similar to the way Linux TCP PRR respects packet conservation in the first round trip of fast recovery.
 
> How low is the mentioned
>   pacing rate at PROBE_BW DOWN (slide 30)?

The pacing rate in PROBE_BW DOWN is still 0.75 * estimated_bandwidth (as in BBR v1).

I hope that helps clarify some of these points.

best,
Neal

Rajeev Kumar

Sep 11, 2018, 5:17:25 PM
to BBR Development
Hi All,

I was wondering whether the IETF 101 and/or IETF 102 changes are implemented in the latest Linux kernels? Are the new changes only available for QUIC?

Regards,
Rajeev

Neal Cardwell

Sep 11, 2018, 7:58:58 PM
to rk2...@nyu.edu, BBR Development
On Tue, Sep 11, 2018 at 5:17 PM Rajeev Kumar <rk2...@nyu.edu> wrote:
> Hi All,
>
> I was wondering whether the IETF 101 and/or IETF 102 changes are implemented in the latest Linux kernels? Are the new changes only available for QUIC?
>
> Regards,
> Rajeev

For Linux TCP BBR, the changes described at IETF 101 are posted here:
For QUIC BBR, those IETF 101 changes are in the Chromium QUIC code.

For IETF 102 there were two presentations:

(1)  The changes we described at IETF 102 as "BBR v2":
These are still under testing, and we will post those when they are ready for wider testing/review.

(2) Ian's presentation:
This was discussing QUIC BBR, and those IETF 102 changes are in the Chromium QUIC code.

regards,
neal 


 
