Hey all,
TL;DR: p99 GET latency regresses by ~25% when the server's congestion control algorithm switches from Cubic to BBR.
RTT ~60ms.
Any ideas on what could be the root cause of this?
Long version: I work on a storage service (similar to Amazon S3).
I have two different proxies (think L7 load balancers) sitting in front of our storage service.
I'm migrating clients from Proxy-1 to Proxy-2.
The clients and the proxies are in different regions (~60ms RTT, over a private backbone).
During the migration, we have experienced a ~25% p99 latency regression for GET requests.
p50 latency didn't change.
GET sizes vary from 100KB to a few MBs.
There is a connection pool and the reuse ratio is 100% (all connections are created during startup).
One connection is used for one request at a time (no pipelining / no HTTP/2 streams).
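(Back-of-envelope, using the ~500Mbit/sec figure from the iPerf runs below: BDP ≈ 500 Mbit/sec × 60 ms ≈ 3.75 MB, so a 100KB GET is a small fraction of one window, while the multi-MB GETs are on the order of one BDP, i.e. only a few RTTs once the window is open.)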
After investigating quite a few possible root causes, I found that the congestion control algorithm difference between Proxy-1 and Proxy-2 was the culprit.
Proxy-1 uses Cubic.
Proxy-2 uses BBR.
When I change Proxy-2 to use Cubic and restart the client (to create new connections), the p99 latency regression disappears. When I revert to BBR, it reappears.
I repeated this multiple times because I was having a hard time believing this was the root cause, and I consistently got the same results.
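If anyone wants to reproduce the toggle per connection without flipping the system-wide sysctl and restarting, something along these lines should work on Linux (a minimal sketch, not our production code; assumes the algorithm's kernel module, e.g. tcp_bbr, is loaded on the sending host):

```python
import socket

def connect_with_cc(host, port, cc=b"cubic"):
    """Open a TCP connection pinned to a specific congestion control algo.

    Note: the algorithm that matters for GET latency is the one on the
    sending side (the proxy), so this only helps where you control the
    sender's sockets.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Per-socket override of net.ipv4.tcp_congestion_control (Linux only).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, cc)
    sock.connect((host, port))
    # Read it back to confirm the kernel accepted the algorithm.
    actual = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    print("congestion control:", actual.split(b"\x00", 1)[0].decode())
    return sock
```

You can also confirm what the existing pooled connections are using with `ss -ti` on the proxy, which prints the congestion control algorithm per socket.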
iPerf:
I repeated the same test with iPerf (with -R to generate reverse traffic, simulating GETs).
The average throughput achieved with BBR and with Cubic is the same (~500Mbit/sec).
However, with BBR I see the per-second throughput occasionally drop to the ~350Mbit/sec level and jump back up to the ~500Mbit/sec level the next second.
Cubic doesn't have that behaviour.
Attached is a sorted per-second throughput comparison with p50/p5/p1 values (ignoring the first and last seconds of the iperf results). I don't know if this gives more of a clue.
![bbr_cubic.png](https://groups.google.com/group/bbr-dev/attach/2a77e85f38df8/bbr_cubic.png?part=0.1&view=1)
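For reference, a sketch of how the p50/p5/p1 numbers can be computed from the per-second samples, assuming iperf3's JSON output ("run.json" is just a placeholder name):

```python
import json
import statistics

# Sketch: p50/p5/p1 of per-second throughput from iperf3 JSON output,
# e.g. from: iperf3 -c <proxy> -R -t 60 -i 1 -J > run.json
with open("run.json") as f:
    run = json.load(f)

# Per-interval throughput in Mbit/sec; drop the first and last seconds
# as in the attached comparison.
mbps = [iv["sum"]["bits_per_second"] / 1e6 for iv in run["intervals"]][1:-1]

# quantiles(n=100) returns the 1st..99th percentile cut points.
cuts = statistics.quantiles(mbps, n=100)
for pct in (50, 5, 1):
    print(f"p{pct}: {cuts[pct - 1]:.0f} Mbit/sec")
```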