BBR Parameter Tuning


Wayne

unread,
Apr 19, 2019, 3:28:32 AM4/19/19
to BBR Development
Hi,

We're experimenting with BBR (v1) in a user-facing HTTP API scenario. The HTTP payload per API call over a connection is less than 20KB, and multiple streams share the same TLS connection. In general, the API calls are triggered by user behavior. We observed that BBR's bandwidth utilization is quite low: it entered the full-bandwidth state with a small measured bandwidth (250KB/s), so the cwnd decreased rapidly to a small target_window. Can anyone suggest some parameter tuning for my use case? Any suggestion would be appreciated.

Thanks,
Wayne

Neal Cardwell

unread,
Apr 19, 2019, 11:00:40 AM4/19/19
to Wayne, BBR Development
Hi Wayne,

Thanks for your report! From the description of the symptoms, it
sounds like perhaps the connections are exiting the Startup phase
because of being marked as non-application-limited, when they are
really app-limited. We have some patches (under preparation for
upstreaming) that fix some cases where this can happen in some loss
recovery scenarios; those may help, if these connections are seeing
packet loss.

A couple quick questions:

(1) I presume this is Linux TCP BBR v1? What's the exact Linux kernel
version you are using (e.g. output of "uname -a")?

(2) Do you know if the connections that encounter this problem had
retransmissions (e.g. non-zero "retrans" count in "ss" output)?

(3) Would you be able to share some anonymized headers-only packet
traces that capture some connections that are manifesting this
problem? That would help us in analyzing, reproducing, and fixing any
issues you may be running into.

(4) If we post a patch with a proposed fix, would it be possible in
your environment to build a kernel with the patch, and test it?

thanks,
neal

Wei Sun

unread,
Apr 19, 2019, 1:12:31 PM4/19/19
to Neal Cardwell, BBR Development
Hi Neal,

Thank you for the quick response. We're experimenting with QUIC BBR v1. From my observation, the cwnd decreased rapidly when exiting the Startup phase. After a quick look at the problem, it turned out that the congestion window was being computed from target_window because `is_at_full_bandwidth_` was true. However, the measured bandwidth was small and minRTT = 5ms, so the congestion window decreased from around 50KB to ~3KB (PROBE_BW state).

I also tried downloading a large file, and the bandwidth was around 2MB/s. Is it possible that the bandwidth measurement doesn't work well because the API payloads are small? If so, can I tune some parameters to prevent the cwnd from decreasing so rapidly? Thanks a lot. Here is the relevant code:
if (is_at_full_bandwidth_) {
  congestion_window_ =
      std::min(target_window, congestion_window_ + bytes_acked);  // <---- runs into here
} else if (add_bytes_acked &&
           (congestion_window_ < target_window ||
            sampler_.total_bytes_acked() < initial_congestion_window_)) {
  // If the connection is not yet out of startup phase, do not decrease the
  // window.
  congestion_window_ = congestion_window_ + bytes_acked;
}

Thanks,
Wayne
Neal Cardwell <ncar...@google.com> wrote on Fri, Apr 19, 2019 at 11:00 PM:

> Hi Wayne,
>
> Thanks for your report! From the description of the symptoms, it
> sounds like perhaps the connections are exiting the Startup phase
> because of being marked as non-application-limited, when they are
> really app-limited. We have some patches (under preparation for
> upstreaming) that fix some cases where this can happen in some loss
> recovery scenarios; those may help, if these connections are seeing
> packet loss.
>
> A couple quick questions:
>
> (1) I presume this is Linux TCP BBR v1? What's the exact Linux kernel
> version you are using (e.g. output of "uname -a")?

It's QUIC BBR v1.

> (2) Do you know if the connections that encounter this problem had
> retransmissions (e.g. non-zero "retrans" count in "ss" output)?

I didn't observe any retransmissions in the logs.

> (3) Would you be able to share some anonymized headers-only packet
> traces that capture some connections that are manifesting this
> problem? That would help us in analyzing, reproducing, and fixing any
> issues you may be running into.

I'll try to provide a tcpdump capture if needed.

> (4) If we post a patch with a proposed fix, would it be possible in
> your environment to build a kernel with the patch, and test it?

Yes, I'd be more than happy to give it a try.

Neal Cardwell

unread,
Apr 19, 2019, 2:01:18 PM4/19/19
to Wei Sun, BBR Development, Ian Swett, Victor Vasiliev
Hi,

Thanks for the extra details!

Given that this is the Chromium QUIC BBR implementation, I have cc-ed some Chromium QUIC BBR developers to help diagnose.

The data point you shared, that congestion_window_ is being set at the spot you highlighted, would be consistent with my theory above: the connections with these small payloads are prematurely exiting Startup (i.e. setting is_at_full_bandwidth_), perhaps because the flows are not being marked as application-limited when they actually are.

What's the server software that is being used here?

You mention logs; if you have anonymized logs that you can share, I imagine that could really help in diagnosing the issue.

thanks,
neal

Ian Swett

unread,
Apr 25, 2019, 9:56:47 PM4/25/19
to Neal Cardwell, Wei Sun, BBR Development, Victor Vasiliev
My past experience is that QUIC BBR is overly slow to exit STARTUP, so I'm a bit surprised there's a case where it exits too soon. But it's also very possible there's a bug I'm not aware of.

Logs would be very useful. Can you export a quic-trace? That would tell us exactly what is going on and contains no application data: https://github.com/google/quic-trace

I'm not sure which portion of the QUIC code you're using, but it's critical to call OnApplicationLimited when app-limited, i.e.: https://cs.chromium.org/chromium/src/net/third_party/quiche/src/quic/core/quic_connection.cc?g=0&l=3749

Wei Sun

unread,
Apr 26, 2019, 1:37:24 PM4/26/19
to Ian Swett, Neal Cardwell, BBR Development, Victor Vasiliev
Hi Ian and Neal,

Thank you for the hints. Yes, we were missing the call to OnApplicationLimited in our code. It worked as expected after fixing that. Thanks a lot for your help.

Thanks,
Wayne


Ian Swett

unread,
Apr 26, 2019, 3:06:06 PM4/26/19
to Wei Sun, Neal Cardwell, BBR Development, Victor Vasiliev
Great to hear that was the issue!

howard liao

unread,
May 8, 2019, 12:57:42 AM5/8/19
to BBR Development
Hey Ian, what a great tool you've provided. quic-trace looks very helpful for diagnosing network issues.



Ian Swett

unread,
May 9, 2019, 9:13:52 AM5/9/19
to BBR Development
Thanks.  Quic-trace was actually created by Victor Vasiliev, so he should take credit.