It looks like there are a few contributing factors. AFAICT here are the main dynamics in the scenario:
+ flow B (the one that has a lower throughput after PROBE_RTT) starts PROBE_RTT about 39ms later than flow A, due to the flows having differing measurements of exactly when the min_rtt over the past 10 secs happened
+ thus flow A, which started PROBE_RTT earlier, exits PROBE_RTT earlier than flow B, and starts ramping up its sending rate; because flow B is still in PROBE_RTT with a low cwnd there is considerable available bandwidth, and flow A can ramp up quite a bit
+ flow A happens to get into a positive feedback loop: increasing delivery rates raise the BDP estimate, which raises the inflight target, so bbr_is_next_cycle_phase() prolongs probing because the inflight >= bbr_inflight(sk, bw, bbr->pacing_gain) check is not yet satisfied, which keeps the pacing rates increasing, rinse, repeat, ...
+ so flow A achieves a 32 Mbit/sec delivery rate and flow B achieves 15.15 Mbit/sec, leading to temporary unfairness, until the flows reconverge
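To make the feedback loop in the bullets above concrete, here is a toy sketch in C. It is not the kernel code: the helper names (target_inflight, probe_up_done, rounds_in_probe_up), the units (packets per ms), and the one-round-per-min_rtt bottleneck model are all assumptions made for illustration. It only mirrors the shape of the exit condition for the 1.25x phase: leave once a full min_rtt has elapsed and either there were losses or inflight reached 1.25x the estimated BDP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: gain * bw * min_rtt, with gain given in percent. */
static uint32_t target_inflight(uint32_t bw_pkts_per_ms, uint32_t min_rtt_ms,
                                uint32_t gain_pct)
{
    assert(gain_pct > 0);
    return (uint32_t)(((uint64_t)bw_pkts_per_ms * min_rtt_ms * gain_pct) / 100);
}

/* Simplified form of the probe-up exit test described above. */
static bool probe_up_done(bool full_length, bool losses, uint32_t inflight,
                          uint32_t bw_pkts_per_ms, uint32_t min_rtt_ms)
{
    return full_length &&
           (losses ||
            inflight >= target_inflight(bw_pkts_per_ms, min_rtt_ms, 125));
}

/*
 * Toy model (an assumption, not a measurement): each loop iteration is one
 * min_rtt. The flow paces at 1.25x its bw estimate; the bottleneck delivers
 * at most `capacity`; any excess accumulates in a queue. The bw estimate is
 * the max delivery rate seen so far. Returns the number of rounds spent in
 * the 1.25x phase, or -1 if it never exits within max_rounds.
 */
static int rounds_in_probe_up(uint32_t bw, uint32_t capacity,
                              uint32_t min_rtt_ms, int max_rounds)
{
    uint32_t queue = 0; /* packets queued at the bottleneck */

    for (int round = 1; round <= max_rounds; round++) {
        uint32_t send_rate = bw * 125 / 100;
        uint32_t delivered = send_rate < capacity ? send_rate : capacity;
        uint32_t inflight;

        queue += (send_rate - delivered) * min_rtt_ms;
        inflight = delivered * min_rtt_ms + queue;

        if (delivered > bw)
            bw = delivered; /* estimate grows, so the target grows too */

        if (probe_up_done(true, false, inflight, bw, min_rtt_ms))
            return round;
    }
    return -1;
}
```

In this model, a flow whose estimate already matches the link, rounds_in_probe_up(32, 32, 10, 100), exits after 1 round, while a flow that starts at 10 with 32 available, rounds_in_probe_up(10, 32, 10, 100), stays in the 1.25x phase for 7 rounds, because each round's higher delivery rate moves the inflight target further out.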
In BBRv2 this should be less of an issue because flows only cut cwnd to 1/2 the estimated BDP in PROBE_RTT, so there is less available bandwidth for flows to grab if they exit PROBE_RTT considerably earlier. I have ideas for improving that further in future versions of BBR, allowing cwnd to stay even higher in PROBE_RTT in common cases, and thus improving the fairness further.
Another option would be to cap the maximum time spent in the 1.25x phase to one packet-timed round trip. That might be a reasonable solution, though it has some downsides.
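A rough sketch of what such a cap could look like, again in C. This is purely illustrative: probe_up_done_capped and its rounds_in_phase / max_rounds parameters are hypothetical names, not anything in the existing code, and how the round count would actually be tracked is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical variant of the probe-up exit test: in addition to the
 * usual condition (full-length phase, and losses or inflight reaching
 * the target), force an exit once the phase has lasted max_rounds
 * packet-timed round trips.
 */
static bool probe_up_done_capped(bool full_length, bool losses,
                                 uint32_t inflight, uint32_t target,
                                 uint32_t rounds_in_phase, uint32_t max_rounds)
{
    assert(max_rounds > 0);

    if (rounds_in_phase >= max_rounds)
        return true; /* cap reached: stop probing regardless of inflight */

    return full_length && (losses || inflight >= target);
}
```

With max_rounds set to 1, a flow whose inflight is still below the target would leave the 1.25x phase after one round instead of continuing to chase a growing target.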
If folks have other ideas for improving this, then that's also great. :-)
Thanks!
neal