BBR dominates our WAN links


Daire Byrne

Jun 6, 2017, 11:15:53 AM
to BBR Development
Hi,

I have been doing some tests of BBR between our sites (e.g. London - Vancouver) across our VPN tunnels, which traverse various fixed routers and firewalls (ours + ISP). The tests showed that BBR was much better than CUBIC at maintaining consistent and steady high throughput with only ~10 threads of iperf (yay!). But it also seems to adversely affect any other traffic traversing the WAN tunnels once we start to approach the capacity of the link (~700mbit). This shows up as increased ping times and packet loss between various hosts.

My (limited) understanding is that BBR should also see this increased latency (+loss) and back off appropriately to try to ensure that it remains consistent over time. So shouldn't the other (CUBIC) streams outside of BBR's view also continue to experience low latency and loss?

So say that host1 & host2 are on one side of the WAN link and host3 & host4 the other, then the normal (reproducible) ping/loss looks like:
host1 # mtr host3 --report --report-cycles 30
HOST: host1                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                       0.0%    30    0.2   0.1   0.1   0.2   0.0
  2. x.x.x.x                       0.0%    30    0.2   0.1   0.1   0.2   0.0
  3. x.x.x.x                       0.0%    30  174.0 173.9 173.6 174.3   0.2
  4. x.x.x.x                       0.0%    30  173.9 174.1 173.8 174.3   0.2
  5. host3                         0.0%    30  174.0 174.0 173.7 174.7   0.2

But once we run an iperf using BBR between host2 & host4 (same WAN link, different client/server):
host2 # iperf -t 60 -xC -c host4 -p 5001 -P16 & (~600mbit)
host1 # mtr host3 --report --report-cycles 30
HOST: host1                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                       0.0%    30    0.1   0.1   0.1   0.2   0.0
  2. x.x.x.x                       0.0%    30    0.1   0.1   0.1   0.2   0.0
  3. x.x.x.x                       3.3%    30  329.4 334.3 315.3 387.8  16.4
  4. x.x.x.x                       6.7%    30  326.9 321.6 315.4 327.5   4.7
  5. host3                        13.3%    30  315.6 337.9 315.5 439.3  26.1

The increased loss+latency is seen between the iperf hosts (host2 & host4) too while the iperf is running. We can also fill the capacity of the link with cubic streams (we require many more iperf streams though) and we don't see any adverse effects on the independent flows between other hosts. Also, I can only achieve around 60mbit with a single BBR stream over our WAN and need ~16 streams to saturate the ~700mbit capacity.

host2 just has a simple default fq configuration:
host2 # tc qdisc
qdisc fq 8004: dev eth0 root refcnt 3 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140

host2 # sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = bbr

I also noticed that there were quite a lot of retransmits (~8%) when saturating the WAN link with BBR streams. Adding a maxrate to the fq qdisc seemed to help reduce both the retransmits and the loss+latency effects, but it did not entirely fix them.
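For reference, this is the sort of tweak I mean (the rate here is just illustrative; note that fq's maxrate caps each flow's pacing rate rather than the aggregate):
host2 # tc qdisc replace dev eth0 root fq maxrate 45mbit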

Maybe I am seeing something similar to what Lawrence Brakmo reported in his "BBR Report" topic (shallow-buffered paths)? I tried the patch from that group discussion but I didn't see any noticeable improvement in latency or loss with the iperf test over our infrastructure.

Regards,

Daire

Yuchung Cheng

Jun 6, 2017, 1:09:03 PM
to Daire Byrne, BBR Development
On Tue, Jun 6, 2017 at 8:15 AM, Daire Byrne <daire...@gmail.com> wrote:
Hi,

I have been doing some tests of BBR between our sites (e.g. London - Vancouver) across our VPN tunnels, which traverse various fixed routers and firewalls (ours + ISP). The tests showed that BBR was much better than CUBIC at maintaining consistent and steady high throughput with only ~10 threads of iperf (yay!). But it also seems to adversely affect any other traffic traversing the WAN tunnels once we start to approach the capacity of the link (~700mbit). This shows up as increased ping times and packet loss between various hosts.

My (limited) understanding is that BBR should also see this increased latency (+loss) and back off appropriately to try to ensure that it remains consistent over time. So shouldn't the other (CUBIC) streams outside of BBR's view also continue to experience low latency and loss?

So say that host1 & host2 are on one side of the WAN link and host3 & host4 the other, then the normal (reproducible) ping/loss looks like:
host1 # mtr host3 --report --report-cycles 30
HOST: host1                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                       0.0%    30    0.2   0.1   0.1   0.2   0.0
  2. x.x.x.x                       0.0%    30    0.2   0.1   0.1   0.2   0.0
  3. x.x.x.x                       0.0%    30  174.0 173.9 173.6 174.3   0.2
  4. x.x.x.x                       0.0%    30  173.9 174.1 173.8 174.3   0.2
  5. host3                         0.0%    30  174.0 174.0 173.7 174.7   0.2

To double check: host1 uses Cubic?
 
But once we run an iperf using BBR between host2 & host4 (same WAN link, different client/server):
host2 # iperf -t 60 -xC -c host4 -p 5001 -P16 & (~600mbit)
host1 # mtr host3 --report --report-cycles 30
HOST: host1                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                       0.0%    30    0.1   0.1   0.1   0.2   0.0
  2. x.x.x.x                       0.0%    30    0.1   0.1   0.1   0.2   0.0
  3. x.x.x.x                       3.3%    30  329.4 334.3 315.3 387.8  16.4
  4. x.x.x.x                       6.7%    30  326.9 321.6 315.4 327.5   4.7
  5. host3                        13.3%    30  315.6 337.9 315.5 439.3  26.1

The increased loss+latency is seen between the iperf hosts (host2 & host4) too while the iperf is running. We can also fill the capacity of the link with cubic streams (we require many more iperf streams though) and we don't see any adverse effects on the independent flows between other hosts. Also, I can only achieve around 60mbit with a single BBR stream over our WAN and need ~16 streams to saturate the ~700mbit capacity.
Regarding the single-flow limit of 60mbit: the culprit could be an insufficient receive window or send buffer. What is wmem[2] on the senders (host1, host2) and rmem[2] on the receivers (host3, host4)? sysctl net.ipv4.tcp_{w|r}mem should return the answer.

If you are running a 4.9+ kernel and a recent iproute2/ss utility, ss -i directly reports how often a flow has been receive-window-limited (rwnd_limited).
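For example (illustrative commands, assuming host2 is the BBR sender and host4 the receiver):
host2 # sysctl net.ipv4.tcp_wmem net.core.wmem_max     (send-buffer limits on the sender)
host4 # sysctl net.ipv4.tcp_rmem net.core.rmem_max     (receive-buffer limits on the receiver)
host2 # ss -ti 'dst host4'                             (look for rwnd_limited / sndbuf_limited on the live flow)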
 

host2 just has a simple default fq configuration:
host2 # tc qdisc
qdisc fq 8004: dev eth0 root refcnt 3 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140

host2 # sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = bbr

I also noticed that there were quite a lot of retransmits (~8%) when saturating the WAN link with BBR streams. Adding a maxrate to the fq qdisc seemed to help reduce both the retransmits and the loss+latency effects, but it did not entirely fix them.

Maybe I am seeing something similar to what Lawrence Brakmo reported in his "BBR Report" topic (shallow-buffered paths)? I tried the patch from that group discussion but I didn't see any noticeable improvement in latency or loss with the iperf test over our infrastructure.

Regards,

Daire


Daire Byrne

Jun 7, 2017, 1:44:02 PM
to BBR Development, daire...@gmail.com
Yuchung,


On Tuesday, June 6, 2017 at 6:09:03 PM UTC+1, Yuchung Cheng wrote:

On Tue, Jun 6, 2017 at 8:15 AM, Daire Byrne <daire...@gmail.com> wrote:
So say that host1 & host2 are on one side of the WAN link and host3 & host4 the other, then the normal (reproducible) ping/loss looks like:
host1 # mtr host3 --report --report-cycles 30
HOST: host1                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                       0.0%    30    0.2   0.1   0.1   0.2   0.0
  2. x.x.x.x                       0.0%    30    0.2   0.1   0.1   0.2   0.0
  3. x.x.x.x                       0.0%    30  174.0 173.9 173.6 174.3   0.2
  4. x.x.x.x                       0.0%    30  173.9 174.1 173.8 174.3   0.2
  5. host3                         0.0%    30  174.0 174.0 173.7 174.7   0.2

To double check: host1 uses Cubic?

Yes, sorry I should have been clearer. I had also been using host1 -> host3 to send similar streams of cubic iperf while sending host2 -> host4 iperf using BBR. But on seeing that the cubic iperf performance was suffering more than the BBR streams, I just started running mtr pings instead to show the loss/latency being caused by the BBR streams on their own. The fact that BBR is so good with small amounts of loss (Figure 10) compared to cubic is what makes it so interesting for us. Our WAN links often experience some low loss which is completely out of our control.
  
But once we run an iperf using BBR between host2 & host4 (same WAN link, different client/server):
host2 # iperf -t 60 -xC -c host4 -p 5001 -P16 & (~600mbit)
host1 # mtr host3 --report --report-cycles 30
HOST: host1                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. x.x.x.x                       0.0%    30    0.1   0.1   0.1   0.2   0.0
  2. x.x.x.x                       0.0%    30    0.1   0.1   0.1   0.2   0.0
  3. x.x.x.x                       3.3%    30  329.4 334.3 315.3 387.8  16.4
  4. x.x.x.x                       6.7%    30  326.9 321.6 315.4 327.5   4.7
  5. host3                        13.3%    30  315.6 337.9 315.5 439.3  26.1

The increased loss+latency is seen between the iperf hosts (host2 & host4) too while the iperf is running. We can also fill the capacity of the link with cubic streams (we require many more iperf streams though) and we don't see any adverse effects on the independent flows between other hosts. Also, I can only achieve around 60mbit with a single BBR stream over our WAN and need ~16 streams to saturate the ~700mbit capacity.
Regarding the single-flow limit of 60mbit: the culprit could be an insufficient receive window or send buffer. What is wmem[2] on the senders (host1, host2) and rmem[2] on the receivers (host3, host4)? sysctl net.ipv4.tcp_{w|r}mem should return the answer.

Sorry, I misrepresented the 60mbit value - that was a single-stream example out of the 16 simultaneous iperf streams that were running. I actually get around 250mbit for a single BBR stream between host2 & host4 (cubic=150mbit). We already have net.core.{w|r}mem_max=16777216 & net.ipv4.tcp_rmem='4096 87380 16777216' so I don't think we'll improve much more on that single-stream speed.
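As a rough sanity check (assuming ~700mbit of capacity and the ~174ms RTT from the mtr reports above), the bandwidth-delay product is about 15MB:
host2 # echo $(( 700 * 1000 * 1000 / 8 * 174 / 1000 ))     (prints 15225000 bytes)
so the 16MB maximum window should just about cover a single full-rate flow.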

The main reason I'm interested in using multiple streams of BBR is that if we use something like ftp/rsync etc. to transfer random files, we will need multiple overlapping streams to allow for the variation in file sizes and metadata seek times. So multiple transfer streams will be required to consistently utilise all our bandwidth. But we also don't want those multiple BBR streams to degrade the WAN link so badly that users can't browse web sites (for example).


If you are running a 4.9+ kernel and a recent iproute2/ss utility, ss -i directly reports how often a flow has been receive-window-limited (rwnd_limited).

 
I'm running kernel 4.11.3 but was running an old iproute2 version (EL7). I've upgraded it now (4.11.0) and there are lots of new interesting statistics for me to play with (thanks for the reminder!)... Here's an example of a single stream of BBR iperf (host2 -> host4):
host2 # iperf -t 60 -xC -c host4 -p 5001 -P1 &
host2 # ss --info 'dst host4'
Netid  State      Recv-Q Send-Q                                                                 Local Address:Port                                                                                  Peer Address:Port 
tcp    ESTAB      0      16374544                                                                  host2:54914                                                                                  host4:commplex-link        
         bbr wscale:9,9 rto:377 rtt:176.461/1.619 mss:1372 rcvmss:536 advmss:1448 cwnd:10903 bytes_acked:1716973245 segs_out:1276058 segs_in:480885 data_segs_out:1276056 bbr:(bw:341.8Mbps,mrtt:173.574,pacing_gain:1,cwnd_gain:2) send 678.2Mbps lastsnd:1 lastrcv:53609 lastack:4 pacing_rate 354.7Mbps delivery_rate 68.9Mbps busy:53609ms rwnd_limited:6285ms(11.7%) unacked:3365 retrans:0/21202 reordering:108 rcv_space:29200 notsent:11758040 minrtt:173

Based on the reported 11.7% rwnd_limited, I increased the receiver's rmem a little more (sysctl -w net.ipv4.tcp_rmem='4096 87380 25165824') and got rwnd_limited down to around 2% and the throughput increased slightly from 250->290mbit. rwnd_limited still starts quite high initially (30%) but after 30 seconds it has dropped below 10% so I assume that this is normal.

Here's the example output of one single BBR stream out of a total of 16 simultaneous iperf streams:
host2 # iperf -t 60 -xC -c host4 -p 5001 -P16 &
host2 # ss --info 'dst host4'
Netid  State      Recv-Q Send-Q                                                                 Local Address:Port                                                                                  Peer Address:Port                
tcp    ESTAB      0      5125792                                                                  host2:54962                                                                                  host4:commplex-link        
         bbr wscale:9,9 rto:378 rtt:177.979/0.737 mss:1372 rcvmss:536 advmss:1448 cwnd:660 bytes_acked:112737277 segs_out:95073 segs_in:45137 data_segs_out:95071 bbr:(bw:20.6Mbps,mrtt:173.777,pacing_gain:1,cwnd_gain:2) send 40.7Mbps lastsnd:1 lastrcv:58721 lastack:2 pacing_rate 21.4Mbps delivery_rate 20.4Mbps busy:58721ms rwnd_limited:2546ms(4.3%) unacked:331 retrans:0/12558 rcv_space:29200 notsent:4671660 minrtt:173

And contrasting with the result of a single stream using cubic from a repeat of the 16 stream iperf test above:
host1 # iperf -t 60 -xC -c host3 -p 5001 -P16 &
host1 # ss --info 'dst host3'
Netid  State      Recv-Q Send-Q                                                                 Local Address:Port                                                                                  Peer Address:Port                
tcp    ESTAB      0      1267728                                                                  host1:55648                                                                                  host3:commplex-link        
         cubic wscale:10,9 rto:377 rtt:176.268/0.951 mss:1372 rcvmss:536 advmss:1448 cwnd:235 ssthresh:199 bytes_acked:118352873 segs_out:86564 segs_in:32679 data_segs_out:86562 send 14.6Mbps lastsnd:1 lastrcv:58209 lastack:1 pacing_rate 17.6Mbps delivery_rate 14.5Mbps busy:58209ms rwnd_limited:550ms(0.9%) unacked:234 retrans:0/60 rcv_space:29200 notsent:946680 minrtt:173

Regardless of any further tweaking of the total BBR throughput (it's already great!), we still see a lot of loss and latency as soon as we run around 2+ simultaneous BBR streams down our WAN link. The more streams we run, the worse it gets. Maybe our firewalls (IPsec) are struggling to deal with the BBR streams?

Regards,

Daire

Neal Cardwell

Jun 9, 2017, 12:18:54 AM
to Daire Byrne, BBR Development
Hi Daire,

Thanks for the report, and all the additional details.

To help dig into what exactly is going on, would you be able to post some sender-side tcpdump pcap traces of the iperf test traffic? Headers only should be fine (e.g. -s 100).  Ideally it would be nice to see sender-side traces of all the CUBIC and BBR iperf flows, so we can start to understand the dynamics of the mix you are seeing. We'd probably want at least 30 seconds of traces, after the flows have settled into steady-state behavior. Would that be feasible?
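Something like the following on each sending host should be sufficient (interface name, filenames, and the iperf port here are illustrative):
host2 # tcpdump -i eth0 -s 100 -w bbr-iperf.pcap host host4 and tcp port 5001
host1 # tcpdump -i eth0 -s 100 -w cubic-iperf.pcap host host3 and tcp port 5001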

thanks,
neal




Daire Byrne

Jun 13, 2017, 6:18:40 AM
to BBR Development, daire...@gmail.com
Sorry for the delay, it took me a little while to go through all my other testing.


All traces start ~30 seconds into a 60 second iperf test (iperf -t 60 -xC -c host4 -p 5002 -P16).

bbr-host2-host4-only.dump - this is a trace of just host2->host4 using BBR with no other iperf tests running. It achieved ~700mbit
cubic-host1-host3-only.dump - this is host1->host3 using CUBIC with no other iperf tests running. It achieved ~380mbit.
cubic-host1-host3.dump + bbr-host2-host4.dump - these are traces captured when both host1 & host2 are doing iperfs simultaneously across the WAN. Again ~700mbit for BBR but now only 3mbit for the CUBIC host (and pings+latency suffer).

I have also done some tests with HTB to constrain the rather dominant BBR streams within a rate limit, e.g.:
host2 # tc qdisc replace dev eth0 handle 1: root htb default 1
host2 # tc class add dev eth0 parent 1: classid 1:1 htb rate 450mbit
host2 # tc qdisc add dev eth0 parent 1:1 handle 10: fq

In our case, this seems to alleviate the ping+latency problems we have when many BBR streams dominate, and as an added bonus the individual iperf streams all share fairly and get equal bandwidth. Without the HTB limit, when the BBR fq streams are probing the maximum of our WAN link, the individual iperf streams vary wildly from 15-40mbit (within 60 seconds). It also means that a single BBR stream gives around the same total transfer bandwidth as hundreds of BBR streams, which is handy in our case as we have little control over the number of transfer streams. We want however many streams are active (from 1 to 200) to maximise our WAN link without killing everything else.

Perhaps this HTB result further implicates our firewalls (two different vendors, same result): the HTB helps to buffer and pace the streams locally, whereas without it our firewalls' buffers are perhaps filling and dropping somewhat randomly, faster than BBR can adapt to it?
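One way I could try to confirm that would be to compare the drop counters at the local shaper with the retransmits the flows themselves report, e.g. (illustrative commands):
host2 # tc -s qdisc show dev eth0                      (dropped/overlimits counters at the htb/fq)
host2 # ss -ti 'dst host4' | grep -o 'retrans:[^ ]*'   (retransmits seen per flow)
If the local qdisc shows few drops while the retransmit counts stay high, the loss is happening further along the path.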

Regards,

Daire


Neal Cardwell

Jun 16, 2017, 9:04:05 PM
to Daire Byrne, BBR Development
Hi Daire,

Thank you so much for these detailed traces. These are great! We are analyzing those as time permits, and will get back to the list with some thoughts as soon as we have a chance.

Thanks!

neal
 


Neal Cardwell

Jun 19, 2017, 11:33:52 AM
to Daire Byrne, BBR Development
Hi Daire,

Thanks again for the traces. These were very helpful. Studying these traces, my assessment agrees with yours: what we seem to have here is the known behavior of bulk BBR flows in shallow-buffered WAN links. Our team is working on improving BBR's behavior in these scenarios (among others), and we will be sure to update the list when we have another round of patches for testing.

Thanks!
neal
 

Daire Byrne

Dec 14, 2018, 8:16:28 AM
to BBR Development
Hey, I thought I'd update this old thread with our experiences and observations over the last year. Firstly, we have been happily using BBR v1 on our WAN/VPN links to help keep our transfer speeds consistent. In the end, we worked around the shallow buffers in our routes & hardware by containing the BBR streams within an HTB limit to stop them short of maxing out our ISPs and producing loss (still eagerly awaiting BBR v2!).

But there are a couple of quirks of our ISPs (and the internet) that we have never really been able to work around. The most striking example is a VPN tunnel between two ISPs that shows great single-stream BBR speeds in one direction but not the other. All our hardware and configurations are identical at both ends. Different tunnels and destinations using the same ISPs & hosts do not suffer from this problem, so it must be something particular to the (asymmetric?) route between them.

For the direction that shows much lower single-stream performance, we can still max out the link bandwidth by using multiple BBR streams: each individual stream is held at the single-stream throughput, and the aggregate scales as (per-stream throughput × number of streams).

It almost seems like the advertised receive window is being artificially capped by some piece of hardware in between? And I would have thought a policer would act on total bandwidth rather than on single streams. We can also send UDP data (without loss) in the "bad" direction and max out the bandwidth, so it just seems like the TCP congestion control is being fooled into thinking the bandwidth or window size is much lower.

Here is a snapshot of the "good" direction:
tcp    ESTAB      0      18906426 10.27.20.36:47074                10.21.20.41:commplex-link        
         bbr wscale:9,9 rto:353 rtt:152.429/1.302 mss:1386 rcvmss:536 advmss:1448 cwnd:1814 bytes_acked:63111547 segs_out:52781 segs_in:11070 data_segs_out:52778 bbr:(bw:134.8Mbps,mrtt:142.717,pacing_gain:1,cwnd_gain:2) send 132.0Mbps lastsnd:1 lastrcv:4869 lastack:2 pacing_rate 139.8Mbps delivery_rate 128.2Mbps busy:4869ms rwnd_limited:500ms(10.3%) unacked:2831 retrans:52/4409 lost:52 sacked:1020 reordering:300 rcv_space:29200 notsent:14982660 minrtt:142
And the same hosts in the opposite "bad" direction:
tcp    ESTAB      0      1765764 10.21.20.41:43976                10.27.20.36:commplex-link        
         bbr wscale:9,9 rto:361 rtt:160.793/2.839 mss:1386 rcvmss:536 advmss:1448 cwnd:268 bytes_acked:2795599 segs_out:2061 segs_in:1021 data_segs_out:2059 bbr:(bw:10.1Mbps,mrtt:142.324,pacing_gain:1.25,cwnd_gain:2) send 18.5Mbps lastsnd:7 lastrcv:4963 lastack:6 pacing_rate 13.2Mbps delivery_rate 2.6Mbps busy:4963ms unacked:41 rcv_space:29200 notsent:1708938 minrtt:142.324

The cwnd just never scales and the pacing rate remains low. In case it is of any interest, I have included some tcpdump/tcptrace/xplot data of an iperf in each direction between the two identically configured BBR hosts over the same VPN tunnel & ISPs.

Daire

Dave Taht

Dec 14, 2018, 12:28:36 PM
to daire...@gmail.com, BBR Development
we have done an awful lot of testing of bidirectional bandwidth with flent's "rrul" tests, and in general, we found that making sure both bottlenecks were FQ_codeled, rather than FIFO'd, led to enormous (as in 60% or more) improvements in bidirectional throughput. So if you were to apply fq_codel, or cake as an underlying qdisc to your htb setup, you'll win, big time. 

I am under the impression that at least older versions of sch_fq did not do the right thing with acks in the opposite direction, and in no case did the right thing with flows not originating directly at the sending host, reverting to a large FIFO in that case. Try swapping in fq_codel.

Actually, cake is our latest attempt here, and includes an integral shaper,  so your

host2 # tc qdisc replace dev eth0 handle 1: root htb default 1
host2 # tc class add dev eth0 parent 1: classid 1:1 htb rate 450mbit
host2 # tc qdisc add dev eth0 parent 1:1 handle 10: fq_codel

would become

tc qdisc add dev eth0 root cake bandwidth 450mbit ack-filter # and nat or gso-split or other params to tune up the connection

The ack-filter thing is new; we see big improvements on asymmetric links with it ( http://blog.cerowrt.org/post/ack_filtering/ ) - not clear if your link is asymmetric or not.

cake entered the mainline in 4.19, and is available as backports as far back as 3.10. 
https://github.com/dtaht/sch_cake for the kernel module, and just grab a current iproute2 to be able to configure. 

So I'd love to know if it helps any in your situation. I find it works great with BBR on all sorts of tests (well I don't care for how long it takes for BBRv1 to find the right rate, but once it does, it's way better than cubic)

Daire Byrne

Dec 14, 2018, 1:07:39 PM
to BBR Development
Thanks Dave, I have tried both fq_codel and cake very quickly and it didn't make much difference to the single stream "slow" direction case - something else must be going on. But I am inspired to look more closely at fq_codel again after your good experiences with bidirectional throughput.

In this case, sending in one direction at a time is slow and sending in the other (not bidirectional) is fast. The links are not asymmetric in the sense that the ISPs at both ends have the same bandwidth, but they are probably not routing the flows across the same devices in each direction (depending on the flow origin). I would imagine the acks for a flow do pass back along the same route but I can't verify that atm. The ack-filtering of cake might be interesting to look at too.
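To at least compare the forward hops in each direction, I can run the same style of mtr report from both ends and compare the hop lists (addresses taken from the ss snapshots above; commands illustrative):
# mtr 10.21.20.41 --report --report-cycles 30     (from the 10.27.20.36 end)
# mtr 10.27.20.36 --report --report-cycles 30     (from the 10.21.20.41 end)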

I think the thing that I don't quite understand is why the multi-stream BBR case seems to scale the aggregate bandwidth fine but the single stream case is always locked at around 15mbit. I'll keep digging... thanks for the pointers.

Daire

Dave Taht

Dec 14, 2018, 3:06:18 PM
to Daire Byrne, BBR Development
On Fri, Dec 14, 2018 at 10:07 AM Daire Byrne <daire...@gmail.com> wrote:
>
> Thanks Dave, I have tried both fq_codel and cake very quickly and it didn't make much difference to the single stream "slow" direction case - something else must be going on. But I am inspired to look more closely at fq_codel again after your good experiences with bidirectional throughput.

Is your limiter actually at the bottleneck link? Is your minrtt:142ms
the actual, or induced RTT? That's quite a long distance for any
single flow of tcp to manage.

>
> In this case, sending in one direction at a time is slow and sending in the other (not bidirectional) is fast. The links are not asymmetric in the sense that the ISPs at both ends have the same bandwidth, but they are probably not routing the flows across the same devices in each direction (depending on the flow origin). I would imagine the acks for a flow do pass back along the same route but I can't verify that atm. The ack-filtering of cake might be interesting to look at too.

Weird.

Always looking for more testers of the ack-filter stuff. I don't think
we broke the internet with it, but you never know...

Cake keeps *really exhaustive* statistics of what's going on, which
you can observe via tc -s qdisc show

Flent (which just needs netperf) lets you do some nice long-term plots
and track tcp statistics also, with the --socket-stats option. So a
test in your case in one direction or another could look like

flent -H wherever -t 'whatever new idea you are trying to test'
--socket-stats --te=upload_streams=1 tcp_nup # for one flow

You can't capture ndown tcp stats in this way, but you can fire off
two tests at the same time from the two hosts, and then link them
together in the flent-gui if you have time sync'd properly. I
generally find that looking at plots of window evolution over time is
oft productive.

I (sigh) - have something like 97 blog entries in draft and haven't
got around to writing up a ton of bbr-related results. It seems likely
BBRv2 will appear before I finish writing anything up, anyway.

>
> I think the thing that I don't quite understand is why the multi-stream BBR case seems to scale the aggregate bandwidth fine but the single stream case is always locked at around 15mbit. I'll keep digging... thanks for the pointers.

I'd tend to suspect a send or recv window clamp on the hosts or....
Some firewalls attempt to clamp the window... if your RTT is actually
that long I'd suspect you are running out of window.

I don't have time this month to look at any packet captures, sorry.

> Daire
>
> On Friday, December 14, 2018 at 5:28:36 PM UTC, Dave Taht wrote:
>>
>> we have done an awful lot of testing of bidirectional bandwidth with flent's "rrul" tests, and in general, we found that making sure both bottlenecks were FQ_codeled, rather than FIFO'd, led to enormous (as in 60% or more) improvements in bidirectional throughput. So if you were to apply fq_codel, or cake as an underlying qdisc to your htb setup, you'll win, big time.
>>
>> I am under the impression that at least older versions of sch_fq did not do the right thing with acks in the opposite direction, and in no case did the right thing with flows not originating directly at the sending host, reverting to a large FIFO in that case. Try swapping in fq_codel.
>>
>> Actually, cake is our latest attempt here, and includes an integral shaper, so your
>>
>> host2 # tc qdisc replace dev eth0 handle 1: root htb default 1
>> host2 # tc class add dev eth0 parent 1: classid 1:1 htb rate 450mbit
>> host2 # tc qdisc add dev eth0 parent 1:1 handle 10: fq_codel
>>
>> would become
>>
>> tc qdisc add dev eth0 root cake bandwidth 450mbit ack-filter # and nat or gso-split or other params to tune up the connection
>>
>> The ack-filter thing is new; we see big improvements on asymmetric links with it ( http://blog.cerowrt.org/post/ack_filtering/ ) - not clear if your link is asymmetric or not.
>>
>> cake entered the mainline in 4.19, and is available as backports as far back as 3.10.
>> https://github.com/dtaht/sch_cake for the kernel module, and just grab a current iproute2 to be able to configure.
>>
>> So I'd love to know if it helps any in your situation. I find it works great with BBR on all sorts of tests (well I don't care for how long it takes for BBRv1 to find the right rate, but once it does, it's way better than cubic)
>



--

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

Daire Byrne

Dec 17, 2018, 8:23:45 AM
to BBR Development
Well, I figured out the poor performance problem: we were hitting a bad rule on our firewall which had DPI enabled. When DPI is enabled on a SonicWall, it imposes a 256KB TCP window limit by default. At our ~142ms RTT, a 256KB window works out to roughly 15mbit, which matches the single-stream ceiling we were seeing. Silly firewalls...

Now that the mystery is solved and performance is good, I will do some tests with fq_codel and cake for a comparison.

Daire

Eric Dumazet

Dec 17, 2018, 9:03:48 AM
to daire...@gmail.com, BBR Development
Note that BBR was only fully tested with FQ, not FQ_CODEL.

To use FQ_CODEL and BBR, you need a fairly recent Linux kernel (BBR relies on pacing, which before 4.13 only the fq qdisc provided).

And if you use HTB + FQ or HTB + FQ_CODEL, I highly recommend you switch to the upcoming 4.20.

Another possibility is to not use HTB at all, but instead use the SO_MAX_PACING_RATE socket option if you need to rate-limit a TCP flow.
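For testing, a recent iperf3 can set that option per stream via its --fq-rate flag; a rough sketch (host name and rate value are just illustrative):
iperf3 -c host4 -t 60 -P 16 --fq-rate 40M     (caps each stream's pacing rate via SO_MAX_PACING_RATE)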