Hey everyone,
I am trying to understand BBR's operation in the parking lot topology, and whether the parking lot topology matters in Internet, private WAN, and/or datacenter settings. I wanted to know if folks here have any insights into some of the questions below, or any thoughts on my exploration.
Summary:
In the parking lot topology (described in more detail below), different flows traverse different numbers of congested hops, which causes them to get different throughputs (unfairness). According to my simplistic analytical calculations (described below), BBR should see (arbitrarily bad) starvation in this scenario. I ran BBRv1 and BBRv3 on emulated links to verify this (methodology/graphs below). In the empirical experiments, I do see drastic unfairness (relative to proportional fairness or max-min fairness).
Questions:
Does this drastic unfairness matter in practice? i.e.:
Do parking-lot-style situations occur in practice?
If they do, how frequently do they occur? How long do they last? How many hops and flows are involved?
Are flows application-limited, so that such effects don't show up? Does this generalize to developing countries (with lower bandwidths)?
Has such unfairness been noticed in practice? Or might this unfairness be occurring unnoticed because the ground truth is not known?
Is there interest in fixing this unfairness?
The drastic unfairness is not as bad as the arbitrarily bad starvation predicted by my simple analytical model. This could just be due to the simplicity of the model, but are there features in the BBR implementations that ensure starvation freedom in the parking lot topology?
Parking lot topology:
Flow 1 traverses two congested hops (links A and B), while flows 2 and 3 each traverse only one.
Si = Sender of flow i
Ri = Receiver of flow i
If the link rate of the two links is 10 Mbit/s, then proportional fairness would allocate 3.33, 6.67, and 6.67 Mbit/s to flows 1, 2, and 3, respectively. Max-min fairness would allocate 5, 5, 5. BBR seems to allocate very little bandwidth to flow 1 (see below).
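For reference, the proportional-fair numbers above can be recovered numerically. This is a hypothetical sketch (not part of any experiment): it maximizes log x1 + log x2 + log x3 subject to x1 + x2 <= C (link A) and x1 + x3 <= C (link B); since both constraints are tight at the optimum, x2 = x3 = C - x1 and a 1-D search suffices.

```python
from math import log

# Proportional fairness for the 2-link parking lot: maximize
# log(x1) + log(x2) + log(x3) with x1 + x2 <= C and x1 + x3 <= C.
# Both links are saturated at the optimum, so x2 = x3 = C - x1.
C = 10.0
best_x1 = max((i * 0.001 for i in range(1, 10000)),
              key=lambda x1: log(x1) + 2 * log(C - x1))
print(best_x1, C - best_x1)  # ~3.333 and ~6.667, i.e. C/3 and 2C/3
```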
Analytical calculations:
In the dumbbell topology, BBR's fairness properties are governed by the growth in the bandwidth estimate during probes (similar to https://github.com/google/bbr/blob/master/Documentation/bbr_bandwidth_based_convergence.pdf). Smaller flows (with lower bandwidth estimates) steal bandwidth from larger flows.
In steady state, flows ping-pong, stealing roughly equal amounts of bandwidth from each other during probes.
Intuition for starvation:
If we apply the above model to the parking lot topology, then flow 1 sees both link A and link B. When flow 1 probes, the probe seen by link B is flow 1's rate after traversing link A (its ACK rate). This ACK rate is lower than 1.25x flow 1's bandwidth estimate. As a result, flow 1 is not able to steal as much bandwidth from flow 3 on link B (compared to if it sent traffic directly on link B). Whereas flow 3 still steals bandwidth from flow 1 (by sending directly on link B) until flow 1 starves. Since flow 1 loses to flow 3 on link B, it sends at a low rate, and flow 2 also gets a high rate on link A.
Math:
xi = bandwidth estimate of flow i = throughput (or ACK rate) of flow i.
si = sending rate of flow i
C = link rate
Computation of bandwidth estimates when any flow probes:
# Flows send at rates s1, s2, s3
# On link A (shared by flows 1 and 2)
tmp = C * s1 / (s1 + s2)  # rate of flow 1's traffic sent on to link B
x2 = C * s2 / (s1 + s2)
# On link B (shared by flow 1's arriving traffic and flow 3)
x1 = C * tmp / (tmp + s3)
x3 = C * s3 / (tmp + s3)
# x1, x2, x3 are throughputs. For simplicity, I assume these are also the bandwidth estimates (in reality, due to the max ACK rate filter, the bandwidth estimates would be slightly higher for the non-probing flows).
When flow 1 probes, s1 is set to 1.25 * x1, s2 to x2, and s3 to x3. Using the above formulas, x1, x2, x3 are updated. A similar update is done when flow 2 probes (s1 = x1, s2 = 1.25 * x2, s3 = x3) and when flow 3 probes. I plot below what happens to x1, x2, x3 over time as different flows probe. I assume only one flow probes at a time, and that the bandwidth estimates converge to the throughputs before the next probe (this is a simplification of reality).
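The iteration above can be written as a short, runnable sketch. My assumptions here: round-robin probing, bandwidth estimate = most recent throughput (effectively a max filter of length 1), and links that share capacity in proportion to arrival rates when oversubscribed.

```python
# Minimal fluid-model sketch of the probe dynamics described above.

def link_share(arrivals, C):
    """Per-flow departure rates at a link of capacity C."""
    total = sum(arrivals)
    if total <= C:
        return list(arrivals)  # link not congested: traffic passes through
    return [C * a / total for a in arrivals]

def probe(x, prober, C, gain=1.25):
    """One probe: `prober` sends at gain * its estimate, others at theirs."""
    s = list(x)
    s[prober] *= gain
    f1_out, x2 = link_share([s[0], s[1]], C)  # link A: flows 1 and 2
    x1, x3 = link_share([f1_out, s[2]], C)    # link B: flow 1's residue and flow 3
    return [x1, x2, x3]

C = 10.0
x = [C / 3, 2 * C / 3, 2 * C / 3]  # start at the proportional-fair allocation
for _ in range(100):               # 100 rounds of probes by flows 1, 2, 3
    for prober in range(3):
        x = probe(x, prober, C)
print(x)  # flow 1's estimate collapses toward zero; flows 2 and 3 approach C
```

Even starting from the proportional-fair point, flow 1's estimate shrinks every cycle (roughly geometrically), which is the starvation the model predicts.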
Empirical results:
I used mahimahi to emulate links and set up the parking lot topology; this allows using the real kernel implementations of BBR. I used BBRv1 from Linux 5.15.0-100-generic, and BBRv3 from commit 6e321d1c986a ("net-tcp_bbr: v3: add a README.md for TCP BBR v3 release"). The table shows the throughput (Mbit/s) of the 3 flows with varying bottleneck bandwidths. All flows have the same propagation delay of 100 ms. The plots below show (coarse-grained) time series of flow throughput.
Table:
BBRv1
C   x1     x2     x3
24  3.18   19.27  19.35
48  6.03   38.85  38.97
96  10.60  78.95  79.30
BBRv3
C   x1     x2     x3
24  3.90   18.64  18.63
48  13.90  31.82  31.32
96  14.64  75.05  75.34
Graphs:
BBRv1:
BBRv3:
Thanks,
Anup
https://www.cs.cmu.edu/~anupa/
--
You received this message because you are subscribed to the Google Groups "BBR Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bbr-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bbr-dev/e5efff17-1ad7-4cf9-963e-56dedf4aef1bn%40googlegroups.com.
Thanks for your detailed response and for sharing your analytical simulator, Neal!
>>> Good questions. I'm not aware of any work to quantify how often real traffic encounters multiple congested hops. Perhaps someone else on this list is aware of such work. I would be interested to see the results of a study about that question.
Yes, it would be good to know if someone has measurements for this.
>>> I suspect there may be some issues in the code you used to generate graphs from your analytical model: I get qualitatively very different results. :-) In my analytical simulations I don't see starvation, but rather convergence to what IMHO is a reasonable share: flow 1 has a 29.6% share of bandwidth, not far from the 33% share that is the ideal share from proportional fairness (please see below).
>>> Do you mind comparing your code to mine (attached) to see why your code has bandwidth driving to zero while mine shows bandwidth converging to about 2.962 Mbit/sec?
I had not modeled the max bandwidth filter, so my code was effectively running your code with `#define BW_FILTER_LEN 1` instead of `#define BW_FILTER_LEN 10`. I verified that setting BW_FILTER_LEN to 1 produces behavior similar to my sim.
When BW_FILTER_LEN > 1, the sum of the bandwidth estimates of the flows sharing a link (e.g., flow 1 + flow 3, and flow 1 + flow 2) exceeds the bottleneck bandwidth. From my understanding, there is no specific feature of BBR that drains queues when the bandwidth estimate is an overestimate. As a result, BBR would likely switch to cwnd-limited mode, which is likely what is also happening in the empirical experiments. This might explain why the slowest flow gets 1/8 of the fastest flow's throughput in the empirical experiments even though the analytical simulator predicts a 29.6% share for the slowest flow.
Such overestimation should also occur in the dumbbell topology. Is it fair to say BBR is always in cwnd-limited mode when there are multiple BBR flows competing? If so, we perhaps only need to model the cwnd-limited mode for the purpose of steady-state fairness properties.
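To make the overestimation concrete, here is a small sketch using the same fluid model as before. The operating point x = (3, 7, 7) on C = 10 is an assumed steady state with both links saturated, not a measured value. During its own probe, a flow's delivery rate exceeds its steady throughput, so a max filter long enough to span a whole probe cycle reports estimates whose per-link sums exceed the link capacity.

```python
def share(arrivals, C):
    """Proportional sharing at a link of capacity C."""
    total = sum(arrivals)
    return list(arrivals) if total <= C else [C * a / total for a in arrivals]

def throughputs(s1, s2, s3, C):
    f1_out, x2 = share([s1, s2], C)  # link A: flows 1 and 2
    x1, x3 = share([f1_out, s3], C)  # link B: flow 1's residue and flow 3
    return x1, x2, x3

C, gain = 10.0, 1.25
x1, x2, x3 = 3.0, 7.0, 7.0                   # assumed steady throughputs
est1 = throughputs(gain * x1, x2, x3, C)[0]  # flow 1's probe-time sample
est2 = throughputs(x1, gain * x2, x3, C)[1]  # flow 2's probe-time sample
est3 = throughputs(x1, x2, gain * x3, C)[2]  # flow 3's probe-time sample
print(est1 + est2 > C, est1 + est3 > C)  # True, True: both links overestimated
```

Each probe-time sample exceeds the flow's steady throughput, so a windowed max filter retains it and the per-link sums of estimates exceed C.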
Regardless, I experimented a bit with your simulator with different initial conditions and parameters (drain gain and probe gain).
(scripts in https://github.com/108anup/bbr/tree/master/Documentation/simulation/topologies/parking_lot)
(outputs in https://github.com/108anup/bbr/tree/master/Documentation/simulation/topologies/parking_lot/runs/auto)
I found that if we set a drain gain <= 0.8 instead of 0.9, the final bandwidth estimates (at convergence) depend on the initial bandwidth estimates (i.e., the controller is not globally asymptotically stable). The figure below shows an extreme case with drain gain = 0.1: it plots the converged bandwidth estimates of the three flows against the initial bandwidth-estimate ratio (a ratio of 10 means f3/f1 = f2/f1 = 10). The final bandwidth estimates stay close to the initial estimates.
Note:
1. In all of these experiments, the sum of the bandwidth estimates exceeds the bottleneck bandwidth, so these results don't really mean much for deployed BBR, which would likely be in cwnd-limited mode. But they do seem intellectually interesting for designing a BBR variant that stays in rate-based mode with multiple flows (for instance, if we add a feature for draining queues caused by overestimation).
2. This is interesting because in Reno/AIMD, the steady-state share in the parking lot is independent of the choice of the additive-increase and multiplicative-decrease constants; the constants only affect convergence time. For rate-based BBR, it seems the constants play a role in determining the share. For Reno, throughput = k / (RTT * sqrt(loss event rate)), where k is a constant depending on the AI/MD constants; k cancels out when we take the ratio of the flows' throughputs.
3. I verified that at the end of the simulation all flows have indeed converged (the bandwidth estimates stopped changing).
4. A BW_FILTER_LEN of 1 does ensure that the sum of the bandwidth estimates does not exceed the bandwidth, but that is not what BBR does, and it does lead to arbitrary starvation in both our simulators.
5. I also added `f1.receive_bw = min(f1.receive_bw, f1.sending_bw);` so that the ACK rate is <= the sending rate. This was being violated during drain phases. My simulator did not have a drain phase, so I did not mention this before.
>>> Do you have ideas about how to improve this issue with BBR throughput shares in scenarios with multiple congested hops? Or do you think this is fundamental to CC algorithms that use throughput as a primary signal?
For the first question, I am guessing the 29.6% share (from the analytical sim) is okay, as it is close to proportional fairness. The bad case is the 1/8 throughput ratio in the empirical experiments. Given the above reasoning (sum of bandwidth estimates > bandwidth), BBR is likely in cwnd-limited mode. A simple model of BBR's cwnd-limited mode is in https://dl.acm.org/doi/10.1145/3544216.3544223 (Section 5.2). According to that model, I'd expect cwnd-limited mode to get close to proportional fairness, so it does not explain the 1/8 ratio in the empirical experiments. It could be due to interaction between rate-based probes and cwnd limits, or some other patch in BBR. If we figure out the cause of the 1/8 ratio, we'd be closer to fixing it. For now, I am not sure of the cause.
For the second question, I earlier thought there may be limits to the fairness properties of throughput-based CCAs, but the investigation with different probing constants and BW_FILTER_LEN suggests I need to think more about throughput-based CC.
>>> Another question: what results do you get in your mahimahi tests if you use CUBIC?
These are the empirical throughputs (Mbit/s) with CUBIC/Reno:
CUBIC (buffer = infinite):
C   x1     x2     x3     x3/x1
24  7.70   15.55  15.37  2.02
48  14.11  32.17  32.04  2.28
96  31.42  60.91  60.97  1.94
CUBIC (buffer = BDP):
C   x1     x2     x3     x3/x1
24  4.64   18.44  18.49  3.98
48  11.61  34.51  34.69  2.99
96  19.91  72.25  72.44  3.64
Reno (buffer = BDP):
C   x1     x2     x3     x3/x1
24  6.55   16.36  16.55  2.53
48  13.79  31.75  32.49  2.36
96  24.70  65.33  67.77  2.74
For the BDP-buffer runs, I set the buffer size (at both links of the parking lot) to 1 BDP. Note that for the BBR empirical experiments, the buffer was infinite. I can rerun those with a 1-BDP buffer; I am guessing this will trigger more features that might change the throughput ratios (e.g., the loss-based inflight cap in BBRv3).
>>> My back-of-the-envelope application of the Mathis model suggests something like a 59% share for the one-hop flows and 41% for the two-hop flow.
Could you please share the details of your Mathis model calculations? My back-of-the-envelope calculations seem to suggest different outcomes. My understanding is that, in general, the ratio will change with rtprop, buffer size, and bandwidth.
Intuition:
Mathis model: throughput = k / (RTT * sqrt(loss event rate)).
Or Throughput ratio = 1/(RTT ratio * sqrt(loss ratio)).
The average RTT of flow 1 would be higher; the ratio of the flows' RTTs depends on the rtprop and the buffer size.
For the loss event rate: say the loss event rates seen by the three flows are l1, l2, l3, respectively. Assuming losses at the two links are independent, l1 = 1 - (1-l2)*(1-l3); the loss rate is 1 minus the survival rate, and flow 1 must survive both links.
Now, assuming l2 = l3 at convergence, l1 = 1 - (1-l3)^2. l3 changes with the bandwidth (throughput), and so does the loss ratio l1/l3.
Back-of-the-envelope calculation:
When the bandwidth is large enough, l3 << 1. In this case, l1 is roughly 2 * l3, so the loss ratio l1/l3 = 2.
Assume buffer = BDP = C * R, and that each queue is half full on average. Then the average RTT of flow 3 = R + R/2 (queueing at link B) = 3R/2, and for flow 1 = R + R/2 (link A) + R/2 (link B) = 2R. RTT ratio = RTT1/RTT3 = 2R / (3R/2) = 4/3.
Throughput ratio = 1/(RTT ratio * sqrt(loss ratio)) = 1/(4/3 * sqrt(2)) = 0.53 = 1/1.88, i.e., flows 1 and 3 have a 0.53:1 throughput ratio. In terms of bandwidth share, flow 1 gets 0.53/(0.53 + 1) = 34.6%, and flows 2 and 3 each get 65.4% of their bottleneck link.
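As a sanity check on the arithmetic (the inputs here are the assumed loss ratio of 2 and RTT ratio of 4/3 from above):

```python
from math import sqrt

# Back-of-the-envelope Mathis-model ratios for the parking lot.
loss_ratio = 2.0       # l1/l3 in the l3 << 1 regime
rtt_ratio = 4.0 / 3.0  # RTT1/RTT3 = 2R / (3R/2)
tput_ratio = 1.0 / (rtt_ratio * sqrt(loss_ratio))  # flow 1 : flow 3
share1 = tput_ratio / (tput_ratio + 1.0)           # flow 1's bandwidth share
print(tput_ratio, share1)  # roughly 0.53 and 0.35
```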
Note: I label flow 1, flow 2, and flow 3 as in the parking lot figure.
>>> I would imagine that the practical long-term throughput would have a floor of something like (4 packets)/(average RTT).
4 pkts/RTT of throughput is independent of the bottleneck bandwidth, so utilization = throughput/bandwidth goes to 0 as the bandwidth goes to infinity. This is not what happens in the empirical experiments: there, throughput increases with increasing bandwidth, so the throughput lower bound is higher than 4 pkts/RTT. FWIW, I would call 4 pkts/RTT starvation, since utilization goes to 0 with increasing bandwidth.