BBRv1 unfairness after PROBE_RTT


Mihai Mazilu

Oct 7, 2024, 12:14:14 PM
to BBR Development
Hello,

I am coming across a phenomenon that leads to transient unfairness with BBRv1 when the RTT is lower than ~25 ms and the queue size is large (4 BDP or more). Here is a Mininet iperf3 experiment that shows this with two flows, the second joining at 5 seconds: a 50 Mbps bottleneck, 20 ms RTT, and a queue size of 4 BDP.


20msrtt.png
We have seen similar results in ns-3 (a lot more drastic, but we believe the same thing is happening):

nsa3.png

Our current understanding is that the problem occurs when testing multiple flows with RTTs around 20 ms (it is most prominent with 20 ms flows). With two flows the problem is quite clear. For example, with two 20 ms flows competing at a 50 Mbps bottleneck, after each PROBE_RTT one flow will take more bandwidth than the other when they both leave PROBE_RTT. This may be due to the bandwidth filter length being 10 rounds, which equals 10 × 20 ms = 200 ms, the length of a PROBE_RTT episode. One flow for some reason obtains a larger max bandwidth filter value than the other, leading to the unfairness.
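The suspected timing coincidence can be checked with quick arithmetic (a sketch; the 10-round filter length and 200 ms PROBE_RTT floor are the BBRv1 defaults, `bbr_bw_rtts` and `bbr_probe_rtt_mode_ms` in Linux's tcp_bbr.c):

```python
# Back-of-the-envelope check: at 20 ms RTT, BBRv1's max-bandwidth
# windowed filter (10 packet-timed rounds) spans roughly the same
# wall-clock time as a PROBE_RTT episode (at least 200 ms at minimal
# cwnd), so pre-PROBE_RTT samples can age out of the filter.

BW_FILTER_ROUNDS = 10        # bbr_bw_rtts in tcp_bbr.c
PROBE_RTT_MODE_MS = 200      # bbr_probe_rtt_mode_ms in tcp_bbr.c

def bw_filter_span_ms(rtt_ms):
    """Approximate wall-clock span of the max-bw filter window."""
    return BW_FILTER_ROUNDS * rtt_ms

for rtt in (10, 20, 40):
    span = bw_filter_span_ms(rtt)
    print(rtt, span, span <= PROBE_RTT_MODE_MS)
```

For RTTs of ~20 ms or less the filter window fits inside the PROBE_RTT floor, which is consistent with the report that the effect is most prominent around 20 ms.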

Is our current understanding correct, and has this been observed in anyone else's experiments?

Happy to run more experiments

Thanks, Mihai

Neal Cardwell

Oct 7, 2024, 5:55:07 PM
to Mihai Mazilu, BBR Development
Hi,

Thanks for the report! I don't recall seeing this phenomenon in tests before. And I'm not seeing it now when I try reproducing this with two Linux TCP BBRv1 flows, with similar parameters, both with RTT=20ms and RTT=10ms.

+ Can you please share the OS distribution and exact kernel version you are using?

+ Do you have time to share tcpdump binary .pcap files showing this? From something like:

   tcpdump -w ./trace.pcap -s 120 -c 100000000 port $PORT &

Thanks!

neal




Mihai Mazilu

Oct 8, 2024, 9:56:25 AM
to BBR Development
Hey,

Here is my uname -a: Linux mihai-Virtual-Machine 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux. I am using a Hyper-V virtual machine running Ubuntu 22.04.

Here is a link to a Google Drive folder with pcap files for c1, c2, x1 and x2, which are the senders and receivers respectively of my dumbbell topology: https://drive.google.com/drive/folders/1JbcrHw168gEZ2KhrrZLtyzfa-XAbFH7n?usp=sharing

Thank you for your help,
Mihai

Neal Cardwell

Oct 9, 2024, 12:23:28 PM
to Mihai Mazilu, BBR Development
Thanks for the traces!

I've attached some screenshots of tcptrace / xplot time-sequence diagrams of one PROBE_RTT scenario from c1_c1 and c2_c2 (see here for a recipe for how to use those).

It looks like there are a few contributing factors. AFAICT here are the main dynamics in the scenario:

+ flow B (the one that has a lower throughput after PROBE_RTT) starts PROBE_RTT about 39 ms later than flow A, due to the flows having differing measurements of exactly when the min_rtt over the past 10 secs happened

+ thus flow A, which started PROBE_RTT earlier, exits PROBE_RTT earlier than flow B, and starts ramping up its sending rate; because flow B is still in PROBE_RTT with a low cwnd there is considerable available bandwidth, and flow A can ramp up quite a bit

+ flow A happens to get into a positive feedback loop: increasing delivery rates, increasing the BDP estimate, causing bbr_is_next_cycle_phase() to prolong probing because it is not yet triggering the inflight >= bbr_inflight(sk, bw, bbr->pacing_gain) check, causing increasing pacing rates, rinse, repeat, ...

+ so flow A achieves a 32 Mbit/sec delivery rate and flow B achieves 15.15 Mbit/sec, leading to temporary unfairness, until the flows reconverge
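The positive feedback loop in the third point can be sketched with a simplified model of the `inflight >= bbr_inflight(sk, bw, bbr->pacing_gain)` exit condition in `bbr_is_next_cycle_phase()` (this is not the kernel code; the kernel check also requires a full round and considers losses, and the units here are illustrative):

```python
# Toy model of why a rising bandwidth estimate prolongs the 1.25x
# probing phase: the exit condition compares inflight against a target
# derived from the *current* max-bw estimate, so each new, higher
# delivery-rate sample raises the exit threshold.

def bbr_inflight(bw_pkts_per_ms, min_rtt_ms, gain):
    """Target inflight: gain * estimated BDP (in packets)."""
    return gain * bw_pkts_per_ms * min_rtt_ms

def probing_phase_done(inflight, bw, min_rtt_ms, pacing_gain=1.25):
    # Only the inflight condition under discussion is modeled here.
    return inflight >= bbr_inflight(bw, min_rtt_ms, pacing_gain)

min_rtt = 20.0     # ms
bw = 0.2           # packets/ms (~2.4 Mbit/s at 1500-byte packets)
inflight = 4.0     # packets in flight

# While flow B idles in PROBE_RTT, flow A's delivery-rate samples keep
# rising, so the exit threshold moves up faster than inflight does:
for _ in range(3):
    done = probing_phase_done(inflight, bw, min_rtt)
    print(f"bw={bw:.2f} target={bbr_inflight(bw, min_rtt, 1.25):.1f} done={done}")
    bw *= 1.25        # a larger max-bw sample enters the filter
    inflight *= 1.1   # inflight grows more slowly
```

Each iteration the exit check stays false, which is the "rinse, repeat" dynamic described above.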

In BBRv2 this should be less of an issue because flows only cut cwnd to 1/2 the estimated BDP in PROBE_RTT, so there is less available bandwidth for flows to grab if they exit PROBE_RTT considerably earlier. I have ideas for improving that further in future versions of BBR, allowing cwnd to stay even higher in PROBE_RTT in common cases, and thus improving the fairness further.
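A rough comparison of how much bandwidth each version vacates during PROBE_RTT, for this experiment's parameters (a sketch; BBRv1's fixed 4-packet floor is `bbr_cwnd_min_target` in tcp_bbr.c, the 1500-byte MSS is an assumption):

```python
# How much of the pipe a flow gives up in PROBE_RTT, i.e. the headroom
# a non-PROBE_RTT competitor can grab: BBRv1 cuts cwnd to a fixed
# 4-packet floor, BBRv2 cuts to ~1/2 of the estimated BDP.

MSS = 1500  # bytes, assumed

def bdp_pkts(bw_mbps, rtt_ms):
    """Bandwidth-delay product in packets."""
    return bw_mbps * 1e6 / 8 * rtt_ms / 1e3 / MSS

bdp = bdp_pkts(50, 20)                # the experiment's 50 Mbit/s, 20 ms
v1_vacated = (bdp - 4) / bdp          # fraction of BDP given up in v1
v2_vacated = (bdp - bdp / 2) / bdp    # fraction given up in v2
print(f"BDP = {bdp:.0f} pkts; v1 vacates {v1_vacated:.0%}, v2 vacates {v2_vacated:.0%}")
```

At this BDP (~83 packets), BBRv1 vacates roughly 95% of the pipe versus 50% for BBRv2, which is why the early-exiting flow has so much more to grab in v1.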

Another option would be to cap the maximum time spent in the 1.25x phase to one packet-timed round trip. That might be a reasonable solution, though it has some downsides.

If folks have other ideas for improving this, then that's also great. :-)

Thanks!
neal


03 flow-B-starts-PROBE_RTT-roughly-39ms-later.png
02 scenario-zoomed-in.png
04 flow-A-ramps-up-quickly-while-flow-B-is-still-restarting.png
05 flow-A-32Mbps.png
01 scenario-zoomed-out.png
06 flow-B-15.5Mbps.png

Mihai Mazilu

Oct 9, 2024, 12:42:46 PM
to BBR Development
Thanks for the insight, Neal. Why do you think there is such a big gap between the PROBE_RTT phases of the flows?

Neal Cardwell

Oct 10, 2024, 11:24:56 AM
to Mihai Mazilu, BBR Development
On Wed, Oct 9, 2024 at 12:42 PM Mihai Mazilu <mihaif...@gmail.com> wrote:
Thanks for the insight Neal. Why do you think there is such a big gap between the probe_rtt's pf the flows ?

Skimming the time-sequence plots to look at previous episodes, the flows seem to have a fairly persistent phase offset where flow A starts PROBE_RTT a little bit earlier than flow B.

My sense is that the dynamics are:

(1) Different flows see slightly different RTT patterns over time, due to random timing variation/jitter in network queuing, interrupt processing, scheduling, etc.

(2) So when a new flow enters, initially it can have a slightly different perspective of when the min_rtt happened.

(3) Once the different flows converge and align their PROBE_RTT phases closely enough that they "agree" that the min_rtt values are happening during their PROBE_RTT phases, their PROBE_RTT phases stabilize at a fairly stable offset. That's because bbr_check_probe_rtt_done() sets bbr->min_rtt_stamp to the time at which PROBE_RTT ends. So as long as the min_rtt values are measured during the PROBE_RTT phase, the timing of the next PROBE_RTT phase is based on the end of the previous PROBE_RTT phase, rather than the exact wall clock time at which the min_rtt was measured.

This behavior in (3) was something that, in my experience, improved the reliability/stability of PROBE_RTT's best-effort distributed synchronization. Without (3), tiny differences in measured RTT could cause the PROBE_RTT phases of the various flows to shift around a lot. But if folks have experimental evidence of some other approach working better, that would be interesting/useful to hear. :-)
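The scheduling consequence of (3) can be shown with a toy model (a sketch, assuming BBRv1's 10-second min_rtt window, `bbr_min_rtt_win_sec`, and a ~200 ms PROBE_RTT episode):

```python
# Toy model of PROBE_RTT scheduling under (3): a flow enters PROBE_RTT
# when its min_rtt sample is 10 s old, and bbr_check_probe_rtt_done()
# resets min_rtt_stamp to the PROBE_RTT *exit* time. So each episode is
# scheduled relative to the previous exit, and an initial offset
# between two flows persists unchanged instead of drifting.

MIN_RTT_WIN_S = 10.0   # bbr_min_rtt_win_sec in tcp_bbr.c
PROBE_RTT_S = 0.2      # ~200 ms spent in PROBE_RTT, assumed

def probe_rtt_entries(first_stamp_s, episodes):
    """Times at which a flow enters PROBE_RTT, given its first
    min_rtt_stamp, under the stamp-reset-at-exit rule."""
    stamp, entries = first_stamp_s, []
    for _ in range(episodes):
        enter = stamp + MIN_RTT_WIN_S
        entries.append(enter)
        stamp = enter + PROBE_RTT_S   # stamp reset at PROBE_RTT exit
    return entries

a = probe_rtt_entries(0.0, 3)
b = probe_rtt_entries(0.039, 3)   # flow B's stamp is ~39 ms later
print([round(tb - ta, 3) for ta, tb in zip(a, b)])  # offset per episode
```

The ~39 ms offset carries forward unchanged from episode to episode, matching the "fairly persistent phase offset" visible in the time-sequence plots.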

Thanks!
neal