duplicating the BBRv2 tests at iccrg in flent?


Dave Taht

Apr 5, 2019, 3:42:37 AM
to ECN-Sane, BBR Development, flent-users
I see from the iccrg preso at 7 minutes 55 s in, that there is a test
described as:

20 BBRv2 flows
starting each 100ms, 1G, 1ms
Linux codel with ECN ce_threshold at 242us sojourn time.

I interpret this as

20 flows, starting 100ms apart
on a 1G link
with a 1ms transit time
and linux codel with ce_threshold 242us

0) This is iperf? There is no crypto?

1) "sojourn time" not as as setting the codel target to 242us?

I tend to mentally tie the concept of sojourn time to the target
variable, not ce_threshold

2) In our current SCE work we have repurposed ce_threshold to do sce
instead (to save on cpu and also to make it possible to fiddle without
making a userspace api change). Should we instead create a separate
sce_threshold option to allow for backward compatible usage?

3) Transit time on your typical 1G link is actually 13us for a big
packet, why 1ms?

is that 1ms from netem?

4) What is the topology here?

host -> qdisc -> wire -> host?

host -> qdisc -> wire -> router -> host?

5) What was the result with fq_codel instead?


--

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

Neal Cardwell

Apr 5, 2019, 11:11:08 AM
to Dave Taht, ECN-Sane, BBR Development, flent-users
On Fri, Apr 5, 2019 at 3:42 AM Dave Taht <dave...@gmail.com> wrote:
I see from the iccrg preso at 7 minutes 55 s in, that there is a test
described as:

20 BBRv2 flows
starting each 100ms, 1G, 1ms
Linux codel with ECN ce_threshold at 242us sojourn time.

Hi, Dave! Thanks for your e-mail.
 
I interpret this as

20 flows, starting 100ms apart
on a 1G link
with a 1ms transit time
and linux codel with ce_threshold 242us

Yes, except the 1ms is end-to-end two-way propagation time.
 
0) This is iperf? There is no crypto?

Each flow is a netperf TCP stream, with no crypto.
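
For anyone trying to reproduce this by hand (or in flent), a minimal sketch of the traffic generation might look like the following; the hostname, test duration, and exact invocation are my assumptions, not details of the actual harness:

  # start 20 bulk TCP flows, 100ms apart (hostname and duration made up)
  for i in $(seq 1 20); do
      netperf -H receiver.example -t TCP_STREAM -l 60 &
      sleep 0.1
  done
  wait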
 

1) "sojourn time" not as as setting the codel target to 242us?

I tend to mentally tie the concept of sojourn time to the target
variable, not ce_threshold

Right. I didn't mean setting the codel target to 242us. Where the slide says "Linux codel with ECN ce_threshold at 242us sojourn time" I literally mean a Linux machine with a codel qdisc configured as:

  codel ce_threshold 242us

This is using the ce_threshold feature added in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=80ba92fa1a92dea1

... for which the commit message says:

"A DCTCP enabled egress port simply have a queue occupancy threshold
above which ECT packets get CE mark. In codel language this translates to a sojourn time, so that one doesn't have to worry about bytes or bandwidth but delays."

The 242us comes from the serialization delay for 20 packets at 1Gbps.
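
As a quick sanity check of that figure (a sketch assuming ~1514-byte Ethernet frames; the exact frame size behind the 242us number isn't stated here):

  awk 'BEGIN { printf "%.0f us\n", 20 * 1514 * 8 / 1e9 * 1e6 }'   # -> 242 us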

2) In our current SCE work we have repurposed ce_threshold to do sce
instead (to save on cpu and also to make it possible to fiddle without
making a userspace api change). Should we instead create a separate
sce_threshold option to allow for backward compatible usage?

Yes, you would need to maintain the semantics of ce_threshold for backwards compatibility for users who are relying on the current semantics. IMHO your suggestion to use a separate sce_threshold sounds like the way to go, if adding SCE to qdiscs in Linux.
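
For illustration, that would mean existing deployments keep the current knob unchanged while SCE experiments add a new one, along these lines (sce_threshold here is the proposed, not-yet-existing option, shown only as a hypothetical):

  tc qdisc replace dev eth0 root fq_codel ce_threshold 242us    # existing semantics, unchanged
  tc qdisc replace dev eth0 root fq_codel sce_threshold 242us   # hypothetical new SCE knob (alternative to the line above)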
 
3) Transit time on your typical 1G link is actually 13us for a big
packet, why 1ms?

The 1ms is the path two-way propagation delay ("min RTT"). We run a range of RTTs in our tests, and the graph happens to be for an RTT of 1ms.
 
is that 1ms from netem?

Yes.
 
4) What is the topology here?

host -> qdisc -> wire -> host?

host -> qdisc -> wire -> router -> host?

Those two won't work with Linux TCP, because putting the qdisc on the sender pulls the qdisc delays inside the TSQ control loop, giving a behavior very different from reality (even CUBIC won't bloat if the network emulation qdiscs are on the sender host). 

What we use for our testing is:

  host -> wire -> qdiscs -> host

Where "qdiscs" includes netem and whatever AQM is in use, if any.
 
5) What was the result with fq_codel instead?

With fq_codel and the same ECN marking threshold (fq_codel ce_threshold 242us), we see slightly smoother fairness properties (not surprising) but with slightly higher latency.

The basic summary:

retransmits: 0
flow throughput: [46.77 .. 51.48] Mbps
RTT samples at various percentiles:
  %   | RTT (ms)
------+---------
   0    1.009
  50    1.334
  60    1.416
  70    1.493
  80    1.569
  90    1.655
  95    1.725
  99    1.902
  99.9  2.328
 100    6.414

Bandwidth share graphs are attached. (Hopefully the graphs will make it through various lists; if not, you can check the bbr-dev group thread.)

best,
neal

bbr-v2-ecn-fq_codel-bw-individual.png
bbr-v2-ecn-fq_codel-bw-cum.png

Dave Taht

Apr 5, 2019, 11:51:16 AM
to Neal Cardwell, ECN-Sane, BBR Development, flent-users
Thanks!

On Fri, Apr 5, 2019 at 5:11 PM Neal Cardwell <ncar...@google.com> wrote:
>
> On Fri, Apr 5, 2019 at 3:42 AM Dave Taht <dave...@gmail.com> wrote:
>>
>> I see from the iccrg preso at 7 minutes 55 s in, that there is a test
>> described as:
>>
>> 20 BBRv2 flows
>> starting each 100ms, 1G, 1ms
>> Linux codel with ECN ce_threshold at 242us sojourn time.
>
>
> Hi, Dave! Thanks for your e-mail.

I have added you to ecn-sane's allowed sender filters.

>
>>
>> I interpret this as
>>
>> 20 flows, starting 100ms apart
>> on a 1G link
>> with a 1ms transit time
>> and linux codel with ce_threshold 242us
>
>
> Yes, except the 1ms is end-to-end two-way propagation time.
>
>>
>> 0) This is iperf? There is no crypto?
>
>
> Each flow is a netperf TCP stream, with no crypto.

OK. I do wish netperf had a TLS mode.

>
>>
>>
>> 1) "sojourn time" not as as setting the codel target to 242us?
>>
>> I tend to mentally tie the concept of sojourn time to the target
>> variable, not ce_threshold
>
>
> Right. I didn't mean setting the codel target to 242us. Where the slide says "Linux codel with ECN ce_threshold at 242us sojourn time" I literally mean a Linux machine with a codel qdisc configured as:
>
> codel ce_threshold 242us
>
> This is using the ce_threshold feature added in:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=80ba92fa1a92dea1
>
> ... for which the commit message says:
>
> "A DCTCP enabled egress port simply have a queue occupancy threshold
> above which ECT packets get CE mark. In codel language this translates to a sojourn time, so that one doesn't have to worry about bytes or bandwidth but delays."

I had attempted to discuss deprecating this option back in August on
the codel list:

https://lists.bufferbloat.net/pipermail/codel/2018-August/002367.html

As well as changing a few other core features. I put most of what I
discussed there into https://github.com/dtaht/fq_codel_fast which I
was using for comparison to the upcoming cake paper, and that is now
where the first cut at the sce work resides also.

> The 242us comes from the serialization delay for 20 packets at 1Gbps.

I thought it was more because of how hard it is to get an accurate
measurement below about ~500us. In our early work on virtualization,
things like Xen would frequently jitter scheduling by 10-20ms or
more. While that situation has gotten much better, I still tend to
prefer "bare metal" when working on this stuff - and often "weak"
bare metal, like the MIPS processors we mostly use in the CeroWrt
project.

Even then I get nervous below 500us unless it's a real-time kernel.

I used irtt to profile this underlying packet + scheduling jitter on
various virtual machine fabrics from 2ms to 10us a while back (Google
Cloud, AWS, Linode) but never got around to publishing the work. I
guess I should go pull those numbers out...

>
>> 2) In our current SCE work we have repurposed ce_threshold to do sce
>> instead (to save on cpu and also to make it possible to fiddle without
>> making a userspace api change). Should we instead create a separate
>> sce_threshold option to allow for backward compatible usage?
>
>
> Yes, you would need to maintain the semantics of ce_threshold for backwards compatibility for users who are relying on the current semantics. IMHO your suggestion to use a separate sce_threshold sounds like the way to go, if adding SCE to qdiscs in Linux.
>
>>
>> 3) Transit time on your typical 1G link is actually 13us for a big
>> packet, why 1ms?
>
>
> The 1ms is the path two-way propagation delay ("min RTT"). We run a range of RTTs in our tests, and the graph happens to be for an RTT of 1ms.
>
OK.

>>
>> is that 1ms from netem?
>
>
> Yes.
>
>>
>> 4) What is the topology here?
>>
>> host -> qdisc -> wire -> host?
>>
>> host -> qdisc -> wire -> router -> host?
>
>
> Those two won't work with Linux TCP, because putting the qdisc on the sender pulls the qdisc delays inside the TSQ control loop, giving a behavior very different from reality (even CUBIC won't bloat if the network emulation qdiscs are on the sender host).
>
> What we use for our testing is:
>
> host -> wire -> qdiscs -> host
>
> Where "qdiscs" includes netem and whatever AQM is in use, if any.

Normally how I do the "qdiscs" is I call it a "router" :) and then the
qdiscs usually look like this:

eth0 -> netem -> aqm_alg -> eth1
eth0 <- aqm_alg <- netem <- eth1

using ifb for the inbound management.
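
A hedged sketch of that pattern (interface names are placeholders; this is the generic ifb recipe rather than the exact scripts used here):

  modprobe ifb numifbs=1
  ip link set dev ifb0 up
  # emulated propagation delay on the egress side
  tc qdisc replace dev eth1 root netem delay 1ms
  # inbound management: redirect eth1 ingress traffic through ifb0 and AQM it there
  tc qdisc add dev eth1 handle ffff: ingress
  tc filter add dev eth1 parent ffff: protocol all matchall \
      action mirred egress redirect dev ifb0
  tc qdisc add dev ifb0 root fq_codel ce_threshold 242us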

I didn't get to where I trusted netem to do this right until about a
year ago; up until that point I had always used a separate "delay"
box.

Was GRO/GSO enabled on the router? host? server?

>
>>
>> 5) What was the result with fq_codel instead?
>
>
> With fq_codel and the same ECN marking threshold (fq_codel ce_threshold 242us), we see slightly smoother fairness properties (not surprising) but with slightly higher latency.
>
> The basic summary:
>
> retransmits: 0
> flow throughput: [46.77 .. 51.48]
> RTT samples at various percentiles:
>   %   | RTT (ms)
> ------+---------
>    0    1.009
>   50    1.334
>   60    1.416
>   70    1.493
>   80    1.569
>   90    1.655
>   95    1.725
>   99    1.902
>   99.9  2.328
>  100    6.414

This is lovely. Is there an open-source tool you are using to generate
this from the packet capture? From Wireshark? Or is this from sampling
the TCP_INFO parameter of netperf?

>
> Bandwidth share graphs are attached. (Hopefully the graphs will make it through various lists; if not, you can check the bbr-dev group thread.)
>
> best,
> neal
>


Jonathan Morton

Apr 5, 2019, 12:20:03 PM
to Neal Cardwell, Dave Taht, ECN-Sane, BBR Development, flent-users
> On 5 Apr, 2019, at 6:10 pm, 'Neal Cardwell' via BBR Development <bbr...@googlegroups.com> wrote:
>
> Right. I didn't mean setting the codel target to 242us. Where the slide says "Linux codel with ECN ce_threshold at 242us sojourn time" I literally mean a Linux machine with a codel qdisc configured as:
>
> codel ce_threshold 242us

I infer from this that BBR's new ECN support won't work properly with standard CE marking behaviour, only with the sort of signal that DCTCP requires. Is that accurate?

SCE allows providing that sort of high-fidelity congestion signal without losing interoperability with RFC-3168 compliant flows.

- Jonathan Morton

Neal Cardwell

Apr 5, 2019, 12:58:24 PM
to Dave Taht, ECN-Sane, BBR Development, flent-users
On Fri, Apr 5, 2019 at 11:51 AM Dave Taht <dave...@gmail.com> wrote:

Was GRO/GSO enabled on the router? host? server?

In this particular invocation of this particular test, the sender, receiver, and router functionality were all running on the same machine, using network namespaces and veth devices; TSO and GRO were enabled.
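
For readers unfamiliar with that pattern, a hedged sketch of a single-machine sender/router/receiver topology built from namespaces and veth pairs (all names and the exact layout are illustrative assumptions, not the actual harness):

  ip netns add snd; ip netns add rtr; ip netns add rcv
  ip link add s0 type veth peer name r0      # sender <-> router link
  ip link add r1 type veth peer name c0      # router <-> receiver link
  ip link set s0 netns snd; ip link set r0 netns rtr
  ip link set r1 netns rtr; ip link set c0 netns rcv
  # ...then assign addresses, enable forwarding in "rtr", and attach the
  # netem/AQM qdiscs to the router-side veths as in the tc examples above.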
 
>
>>
>> 5) What was the result with fq_codel instead?
>
>
> With fq_codel and the same ECN marking threshold (fq_codel ce_threshold 242us), we see slightly smoother fairness properties (not surprising) but with slightly higher latency.
>
> The basic summary:
>
> retransmits: 0
> flow throughput: [46.77 .. 51.48]
> RTT samples at various percentiles:
>   %   | RTT (ms)
> ------+---------
>    0    1.009
>   50    1.334
>   60    1.416
>   70    1.493
>   80    1.569
>   90    1.655
>   95    1.725
>   99    1.902
>   99.9  2.328
>  100    6.414

This is lovely. Is there an open-source tool you are using to generate
this from the packet capture? From Wireshark? Or is this from sampling
the TCP_INFO parameter of netperf?

Thanks. The results and bandwidth graphs are from an internal test orchestration/evaluation/visualization tool written a few years ago by our BBR team member, Soheil Hassas Yeganeh, and further enhanced by others on our team over the years. We are trying to find the time to open-source it, but haven't yet. It can generate the graphs either from pcap files or "ss" output. This one was from "ss" output.

neal

Dave Taht

Apr 6, 2019, 7:49:56 AM
to Neal Cardwell, ECN-Sane, BBR Development, flent-users
> With fq_codel and the same ECN marking threshold (fq_codel ce_threshold 242us), we see slightly smoother fairness properties (not surprising) but with slightly higher latency.
>
> The basic summary:
>
> retransmits: 0
> flow throughput: [46.77 .. 51.48]
> RTT samples at various percentiles:
>   %   | RTT (ms)
> ------+---------
>    0    1.009
>   50    1.334
>   60    1.416
>   70    1.493
>   80    1.569
>   90    1.655
>   95    1.725
>   99    1.902
>   99.9  2.328
>  100    6.414

I am still trying to decode the output of this tool. Perhaps you can
check my math?

At 0, I figure this is the bare minimum RTT measurement from the
first flow started, basically a SYN/SYN-ACK pair, coming in at 9us
above the 1ms propagation delay. That gives a tiny baseline for the
to/from wire delay, the overhead of a TCP connection getting going,
and the routing overheads (if you are using a route table?) of (last
I looked) ~25ns per virtual hop (so *8) for IPv4, and something much
larger if you are using IPv6. This is using IPv4?

A perfect level of interleave of 20 flows, 2 large packets each, would
yield an RTT measurement of around 532 extra us, but you are only
seeing that at the 80th percentile...

At 100: the almost-worst-case basic scenario is that 20 flows of 64k
GRO/GSO bytes each are served in order. That's 13us * 42 * 20 =
10.92ms. It's potentially worse than that, since by the laws of
probability one flow could get scheduled more than once.

1) What is the qdisc on the server and client side? sch_fq? pfifo_fast?

2) What happens when you run the same test with gso/gro disabled?

Neal Cardwell

Apr 6, 2019, 7:56:24 AM
to Jonathan Morton, Dave Taht, ECN-Sane, BBR Development, flent-users
On Fri, Apr 5, 2019 at 12:20 PM Jonathan Morton <chrom...@gmail.com> wrote:
> On 5 Apr, 2019, at 6:10 pm, 'Neal Cardwell' via BBR Development <bbr...@googlegroups.com> wrote:
>
> Right. I didn't mean setting the codel target to 242us. Where the slide says "Linux codel with ECN ce_threshold at 242us sojourn time" I literally mean a Linux machine with a codel qdisc configured as:
>
>   codel ce_threshold 242us

I infer from this that BBR's new ECN support won't work properly with standard CE marking behaviour, only with the sort of signal that DCTCP requires.  Is that accurate?

Yes, that's correct. Thus far BBR v2 is targeting only DCTCP/L4S-style ECN.
 
SCE allows providing that sort of high-fidelity congestion signal without losing interoperability with RFC-3168 compliant flows.

Noted, thanks.

neal

Neal Cardwell

Apr 6, 2019, 8:32:18 AM
to Dave Taht, ECN-Sane, BBR Development, flent-users
This is using IPv6.
 
A perfect level of interleave of 20 flows, 2 large packets each, would
yield an RTT measurement of around 532 extra us, but you are only
seeing that at the 80th percentile...

The flows would only get ~500us extra delay above the two-way propagation delay if all of those packets ended up in the queue at the same time. But the BDP here is 1Gbps*1ms = 82 packets, and there are 20 flows, so for periods where the flows keep their steady-state inflight around 4 or smaller, their aggregate inflight will be around 4*20 = 80, which is below the BDP, so there is no queue. In a steady-state test like this the ECN signal allows the flows to usually keep their inflight close to this level, so that the higher queues and queuing delays only happen when some or all of the flows are pushing up their inflight to probe for bandwidth around the same time. For a scenario like this, those dynamics are not unique to BBRv2; in a scenario like this, at a high level similar reasoning would apply to DCTCP or TCP Prague.
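
A quick check of the arithmetic in that paragraph (a ~1514-byte frame size is my assumption):

  awk 'BEGIN { bdp = 1e9 * 0.001 / 8;                      # bytes in flight at 1Gbps, 1ms RTT
               printf "BDP ~= %.1f packets\n", bdp/1514;   # -> ~82.6
               printf "20 flows x 4 pkts = %d\n", 20*4 }'  # -> 80, just under the BDP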
 
At 100: the almost-worst-case basic scenario is that 20 flows of 64k
GRO/GSO bytes each are served in order. That's 13us * 42 * 20 =
10.92ms. It's potentially worse than that, since by the laws of
probability one flow could get scheduled more than once.

Keep in mind that there's no reason why the flows would use maximally-sized 64KByte GSO packets in this test. The typical pacing rate for the flows is close to the throughput of 50Mbps, and the TCP/BBR TSO autosizing code will tend to choose GSO skbs with around 1ms of data (in the 24-540Mbps range), and at 50Mbps this 1ms allotment is about 4*MSS.
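
A rough check of that estimate (an MSS of ~1448 bytes is assumed here; the test ran over IPv6, so the real MSS is a bit smaller, which does not change the conclusion):

  awk 'BEGIN { bytes_per_ms = 50e6 * 0.001 / 8;
               printf "%.0f bytes/ms ~= %.1f MSS\n", bytes_per_ms, bytes_per_ms/1448 }'
  # -> 6250 bytes/ms ~= 4.3 MSS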
 

1) What is the qdisc on the server and client side? sch_fq? pfifo_fast?

The sender and receiver qdiscs are pfifo_fast.
 
2) What happens when you run the same test with gso/gro disabled?

Disabling GSO and GRO is not a realistic config, and we have limited developer resources on the BBR project, so I'm not going to have time to run that kind of test (sorry!). Perhaps you can run that test after we open-source BBRv2.

best,
neal

Neal Cardwell

Apr 8, 2019, 9:33:31 PM
to Sebastian Moeller, flent-users, Jonathan Morton, BBR Development, ECN-Sane


On Sat, Apr 6, 2019 at 10:38 AM Sebastian Moeller <moel...@gmx.de> wrote:
Hi Neal,



On April 6, 2019 1:56:06 PM GMT+02:00, Neal Cardwell <ncar...@google.com> wrote:
>On Fri, Apr 5, 2019 at 12:20 PM Jonathan Morton <chrom...@gmail.com>
>wrote:
>
>> > On 5 Apr, 2019, at 6:10 pm, 'Neal Cardwell' via BBR Development <
>> bbr...@googlegroups.com> wrote:
>> >
>> > Right. I didn't mean setting the codel target to 242us. Where the
>slide
>> says "Linux codel with ECN ce_threshold at 242us sojourn time" I
>literally
>> mean a Linux machine with a codel qdisc configured as:
>> >
>> >   codel ce_threshold 242us
>>
>> I infer from this that BBR's new ECN support won't work properly with
>> standard CE marking behaviour, only with the sort of signal that
>DCTCP
>> requires.  Is that accurate?
>>
>
>Yes, that's correct. Thus far BBR v2 is targeting only DCTCP/L4S-style
>ECN.

        Out of curiosity, given that BBR intentionally interprets lost packets as a sign of a lossy path instead of a signal sent by an AQM to slow down, why do you think that DCTCP-style ECN is a good fit? In classic ECN the CE mark is exactly the signal BBR should get to have higher confidence that ignoring lost packets is acceptable; in DCTCP it will take a while to convey the same signal, no? I wonder, if one is willing to change ECN semantics already by making CE lighter weight than a packet drop, why not also use an explicit signal for the emergency brake? I can't help but notice that both DCTCP and TCP Prague face the same problem, but at least they seem to be willing to take a packet drop at face value...

I think the L4S team has done a nice job of outlining the problems with RFC-3168-style ECN, so I will just quote their explanation:


5. Rationale

5.1. Why These Primary Components?

Explicit congestion signalling (protocol): Explicit congestion signalling is a key part of the L4S approach. In contrast, use of drop as a congestion signal creates a tension because drop is both a useful signal (more would reduce delay) and an impairment (less would reduce delay). Explicit congestion signals can be used many times per round trip, to keep tight control, without any impairment. Under heavy load, even more explicit signals can be applied so the queue can be kept short whatever the load. Whereas state-of-the-art AQMs have to introduce very high packet drop at high load to keep the queue short. Further, TCP's sawtooth reduction can be smaller, and therefore return to the operating point more often, without worrying that this causes more signals (one at the top of each smaller sawtooth). The consequent smaller amplitude sawteeth fit between a very shallow marking threshold and an empty queue, so delay variation can be very low, without risk of under-utilization. All the above makes it clear that explicit congestion signalling is only advantageous for latency if it does not have to be considered 'the same as' drop (as required with Classic ECN [RFC3168]). ...

best,
neal

Jonathan Morton

Apr 8, 2019, 10:09:32 PM
to Neal Cardwell, Sebastian Moeller, flent-users, BBR Development, ECN-Sane
> I wonder, if one is willing to change ECN semantics already by making CE lighter weight than a packet drop, why not also use an explicit signal for the emergency brake?

This is the principle I proposed with SCE. There, CE remains a broadly drop-equivalent signal (the "emergency brake"), while ECT(1) becomes SCE, a softer and higher-precision signal which is produced in the way DCTCP expects.

As of a couple of hours ago, I have three machines in my bedroom which are running SCE-aware Linux kernels, including a "rehabilitated" version of DCTCP which responds appropriately to drops, CE and SCE and is therefore compatible with use on the general Internet.

Now I just need to blow the cobwebs off the test harnesses which were used to refine Cake, to ensure that the assertion I just made above is actually true.

- Jonathan Morton

Neal Cardwell

Apr 9, 2019, 10:34:14 AM
to Sebastian Moeller, flent-users, Jonathan Morton, BBR Development, ECN-Sane
On Tue, Apr 9, 2019 at 2:31 AM Sebastian Moeller <moel...@gmx.de> wrote:

I guess I was too subtle... The following is in no way incompatible with what I wrote. The point I wanted to make is that redefining CE without also introducing an equivalent of 'stop hard, ASAP' is an incomplete solution. Once you introduce the missing signal, the SCE proposal is a better fit than L4S.
Also, BBR and L4S both aim at basically ignoring drops as immediate signals, for good reasons like better throughput on links with spurious drops and some reordering tolerance.
IMHO it is wonderfully absurd, in that light, to try to shoehorn a DCTCP-style CE marker into the internet, one which cannot carry a stop-hard signal as quickly as TCP-friendly ECN does today. To repeat: the argument is not against finer-grained load information from the bottleneck, but rather against doing only half the job and falling short of solving the problem.
The rationale below would maybe make sense if all the internet's bottlenecks talked DCTCP-style ECN, but until then the rationale falls apart.

OK, I think I now understand what you are suggesting. I can see the potential value of having both a DCTCP/L4S/SCE-style signal and an RFC3168-style signal. In your previous e-mail I thought you were arguing for a pure RFC3168-style approach; but if the proposal is to have both styles, and that's what gets deployed, that sounds usable AFAICT.

neal
