softirq/LOC interrupt storms when using BBR with fq_codel on 4.15.0-1045-aws Ubuntu 18.04.3 LTS


Benjamin McAlary

Aug 26, 2019, 10:12:30 PM
to BBR Development
Hello Team,

We recently rolled out:

net.ipv4.tcp_congestion_control = bbr

across several thousand nodes running:

4.15.0-1045-aws #47-Ubuntu SMP Fri Aug 2 13:50:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 18.04.3 LTS

and found that we would sometimes see 100% CPU usage on one or more cores, taken up by softirq and LOC (local timer) interrupts, in situations (particularly on edge nodes) where some packet loss was seen or expected.

The benefits of BBR on our long-distance, lossy connections are amazing - up to a 400% increase in transfer rates - so we decided to investigate.

Interestingly, raw packet rates and throughput measured by the hypervisor and the OS during these high-CPU periods were quite low, in the 1-5 Mbit/s region. We expected the softirq time to be taken up by the network IRQ, but instead it was in LOC.

Of course, we expect some CPU usage when packet loss occurs, but found the 99% softirq/LOC usage excessive.

It appeared as if something was getting stuck in a loop.

Disabling BBR (going to CUBIC) as a troubleshooting measure resolved the issue.

We re-enabled it and captured a flame graph on one node (this one running Envoy proxy) during a softirq event and saw the time was spent in retransmission queuing and tasklet.


After reading the following


Particularly the quotes:

"fq is implementing pacing, and is meant for end hosts. It was designed with performance in mind. Codel was not added there, because it was not needed (see point 1)), and only adding wasting cpu cycles."

"sch_fq is for servers, fq_codel is for routers"

We decided to change from Ubuntu 18.04's default fq_codel qdisc to fq (sch_fq) by way of sysctl net.core.default_qdisc=fq.
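For anyone wanting to replicate this, roughly the change we applied - a sketch only (the sysctl.d file name is our own choice, and eth0 needs adjusting per host):

```shell
# Persist the qdisc/congestion-control pair across reboots.
cat >/etc/sysctl.d/90-bbr-fq.conf <<'EOF'
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
EOF
sysctl --system

# net.core.default_qdisc only affects qdiscs created afterwards, so swap
# the root qdisc on already-up interfaces explicitly:
tc qdisc replace dev eth0 root fq
```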

And the results are night and day:
CPU Usage:


We want to share this with the community in case anyone else has seen or experienced the same.

Perhaps this is expected: that on application and reverse-proxy servers working over long-distance or lossy links, softirq/LOC/tasklet usage will hit 100% even at low throughput (1-5 Mbit/s)?

If this is a known issue, should fq not be the default over fq_codel in Linux? There seems to be significant argument over this: https://github.com/systemd/systemd/issues/9725

Either way, moving to "fq" alone has fixed the issue.

Benjamin McAlary

Aug 26, 2019, 10:13:44 PM
to BBR Development
The flame graph that failed to attach to the previous message:

image (6).png

Dave Taht

Aug 27, 2019, 12:17:40 AM
to Benjamin McAlary, BBR Development
Thank you for 

The context of that fq vs fq_codel debate was quite painful - it was about a safe default for the internet. Where things stood at that point in time was:

https://github.com/systemd/systemd/issues/9725#issuecomment-413369212

Which is a benchmark you can run on your own deployments.

If it isn't clear from that:

IF your server's traffic is just tcp, and you are not using network namespaces, vpns or vms or containers, and especially if you want to use BBR - BY ALL MEANS switch to sch_fq. (does edf work on quic/udp stuff now?)

It's not clear to me how well sch_fq actually works in a VM; my impression is Google mostly runs it on bare metal.

My own fear is that fq_codel is the only thing keeping billions of containers from melting down the internet, but I have no data on that aside from a few benchmarks like the one above. I would welcome more testing, and it would be great to have one all-singing, all-dancing qdisc.

Jonathan Morton

Aug 27, 2019, 12:33:39 AM
to Benjamin McAlary, BBR Development
> On 27 Aug, 2019, at 5:12 am, Benjamin McAlary <bmca...@atlassian.com> wrote:
>
> We recently rolled out:
>
> net.ipv4.tcp_congestion_control = bbr
>
> across several thousand nodes
>
> on
>
> 4.15.0-1045-aws #47-Ubuntu

Does that kernel have support for TCP Pacing without sch_fq as the active qdisc? If not, there's your problem.

I would advise installing a newer kernel version, certainly in the 5.x.x series, and also trying BBRv2 rather than the old version of BBR.

- Jonathan Morton

Benjamin McAlary

Aug 27, 2019, 1:22:41 AM
to Jonathan Morton, BBR Development
Yes.

I believe pacing was added to fq_codel in 4.13 (we're on 4.15).

root@:/home/ubuntu# uname -a
4.15.0-1045-aws #47-Ubuntu SMP Fri Aug 2 13:50:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 18.04.3 LTS

root@:/home/ubuntu# tc qdisc show
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn

^ I believe this output would have "nopacing" if pacing were disabled; example output with "nopacing" is in https://groups.google.com/forum/#!topic/bbr-dev/_K2FJXgGUBg

In any case, you can see pacing is working for a BBR flow:

root@:/home/ubuntu# ss -int4 state established
Recv-Q                          Send-Q                                                      Local Address:Port                                                       Peer Address:Port                            
0                               36                                                                          
bbr wscale:6,7 rto:440 rtt:230.997/17.948 ato:40 mss:1424 pmtu:9001 rcvmss:1392 advmss:8949 cwnd:19 bytes_acked:5473 bytes_received:3729 segs_out:32 segs_in:45 data_segs_out:25 data_segs_in:20 bbr:(bw:182.0Kbps,mrtt:219.197,pacing_gain:2.88672,cwnd_gain:2.88672) send 937.0Kbps pacing_rate 1.5Mbps delivery_rate 182.3Kbps app_limited busy:3640ms unacked:1 rcv_rtt:232 rcv_space:26847 rcv_ssthresh:35199 minrtt:219.197
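As an aside, for spot-checking pacing across many sockets, a quick-and-dirty parser (our own helper, nothing standard) for the pacing_rate field in that ss output:

```shell
# Extract the pacing_rate value from one line of `ss -int` detail output.
get_pacing_rate() {
    echo "$1" | grep -o 'pacing_rate [^ ]*' | awk '{print $2}'
}

line='bbr wscale:6,7 rto:440 pacing_rate 1.5Mbps delivery_rate 182.3Kbps'
get_pacing_rate "$line"   # prints: 1.5Mbps
```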

Ben McAlary
Network Engineer - Atlassian
http://www.atlassian.com

Jonathan Morton

Aug 27, 2019, 1:31:29 AM
to Benjamin McAlary, BBR Development
> On 27 Aug, 2019, at 8:22 am, Benjamin McAlary <bmca...@atlassian.com> wrote:
>
> qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
>
> ^ I believe this output would have "nopacing" if there was no pacing. example output with "nopacing" in https://groups.google.com/forum/#!topic/bbr-dev/_K2FJXgGUBg

The example given there is for sch_fq, *not* for sch_fq_codel. The latter does not have, and never has had, explicit support for pacing, and you wouldn't see a "nopacing" keyword in tc's output.

In recent Linux kernels, and I don't think 4.15 is recent enough, pacing was added to the core TCP stack. That is what you need here.

> In any case you can see pacing is working for a BBR

That only shows that BBR is setting the pacing variables in the socket. Most likely nothing is actually using them - and something must use them for BBR to work correctly.

More generally, when reporting a bug in any software under active development, the *first* thing you need to do is upgrade to at least the latest stable version, since it's very likely that other people noticed the same bug and fixed it meanwhile.

- Jonathan Morton

Benjamin McAlary

Aug 27, 2019, 1:33:49 AM
to Dave Taht, BBR Development
Thanks Dave,

Regarding your questions:

1. We run Envoy as a Docker container for the workloads most significantly impacted by this problem. Our infrastructure is on AWS.
2. The server traffic is HTTPS over TCP only. Nothing fancy.

Our feeling is that, although we were unable to recreate this issue in significant lab testing, since it was so apparent in live/production we are either 1. not testing right (guaranteed), 2. screwing something up really badly by running Envoy as a container on Ubuntu with BBR and the OS-default fq_codel, or 3. hitting a real problem that we didn't create through some mistake of our own - in which case fq_codel and BBR might be a bad mix for other workloads as well, to the point that it can cause drastic softirq storms.

We're fine with fq_codel and fq being better or worse in some scenarios than others, and with benchmarking showing as much. We just think these IRQ storms taking 100% of CPU were of note outside the "one algorithm is better than the other" conversation - as it's not something we expected.


Ben McAlary
Network Engineer - Atlassian
http://www.atlassian.com

Dave Taht

Aug 27, 2019, 1:45:10 AM
to Benjamin McAlary, BBR Development
On Mon, Aug 26, 2019 at 9:17 PM Dave Taht <dave...@gmail.com> wrote:
>
> Thank you for

the flame graph

> The context of that fq vs fq_codel debate was quite painful - and in the context of a safe default for the internet. where things stood at that point in time was:
>
> https://github.com/systemd/systemd/issues/9725#issuecomment-413369212
>
> Which is a benchmark you can run on your own deployments.
>
> If it isn't clear from that:
>
> IF your server's traffic is just tcp, and you are not using network namespaces, vpns or vms or containers, and especially if you want to use BBR - BY ALL MEANS switch to sch_fq. (does edf work on quic/udp stuff now?)

I should clarify this a bit more. If you are using a reverse proxy (as
you are) to get at your containers, sch_fq with pacing is even more the
right thing (with or without bbr). In other circumstances, where
linux is acting essentially as a router, not so much.

As for tcp - particularly when observing large rtts - I'd like it if
more folk were actually monitoring their rtts relative to what the
physical path should be achieving. "Out there" are tons of folk trying,
via inadequate means like policers and inbound shaping, to keep
their networks usable for gaming, voip and videoconferencing, while
making the assumption that a tcp will react to a drop with a reno-like
response that bbrv1 does not have. Overbuffering along the edge is often
measured in seconds, when 10s of ms is needed.

As for the cpu hit you are observing - I have no way to duplicate your
workload. It does sound buggy and a worthwhile thing to analyze
independently. But I'd recommend a kernel upgrade first as a test -
and then packet captures and more flame graphs. :/

sch_fq can self-congest - I've seen googlers recommend using a shaper
on it - and nobody actually knows how AWS does rate management in the
first place! I'd love it if more interactive-ish apps used tcp_notsent_lowat.
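For reference, the shaper-on-sch_fq idea can be approximated with fq's own maxrate knob - noting that maxrate caps each flow's pacing rate, not the aggregate (the 900mbit figure is just a placeholder):

```shell
# Cap each flow's pacing rate below the provider's rate limit.
tc qdisc replace dev eth0 root fq maxrate 900mbit
tc qdisc show dev eth0
```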

BBRv2 looks quite promising, except for the rfc3168 ECN and SCE-vs-L4S debate.


> It's not clear to me how well sch_fq actually works in a vm, my impression is google mostly runs it on bare
> metal.
>
> My own fear is that fq_codel is only thing keeping billions of containers from melting down the internet, but
> I have no data on it aside from a few benchmarks like that. I would welcome more testing and it would be
> great to have one singing all dancing qdisc.
>

--

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

Eric Dumazet

Aug 27, 2019, 1:45:53 AM
to Benjamin McAlary, Dave Taht, BBR Development
Let's clarify things a bit:

TCP internal pacing is significantly more expensive than pacing done
in the FQ layer. This is because internal pacing has to send one packet
at a time, arming a high-resolution timer between each packet.
FQ pacing is better because TCP can queue several packets at once,
meaning that only the first packet of a batch has to bring some cache
lines into the CPU caches. The following packets traverse all the
stacks (IP, qdisc) at a much lower cost.

Also, the FQ qdisc has a single (shared) high-resolution timer, and when it
fires, the fq_check_throttled() function can very often process many
flows at once (especially when dealing with many concurrent flows).

And many packets are not paced at all (they are sent right after receiving
an ACK). In the FQ world, no timer is programmed for this very common case,
while TCP internal pacing might have started one.

Note that Linux got much better pacing support (both in FQ and internal
pacing) in 4.20 with the adoption of the EDT (Earliest Departure Time) model.
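A rough sketch of how a rollout script might act on that 4.20 boundary (the helper below is illustrative only, not anything from the kernel tree):

```shell
# True if the given kernel release string is at least MAJOR.MINOR.
version_at_least() {
    # usage: version_at_least "<kernel release>" MAJOR MINOR
    local maj min
    maj=$(echo "$1" | cut -d. -f1)
    min=$(echo "$1" | cut -d. -f2)
    [ "$maj" -gt "$2" ] || { [ "$maj" -eq "$2" ] && [ "$min" -ge "$3" ]; }
}

if version_at_least "$(uname -r)" 4 20; then
    echo "EDT-model pacing available"
else
    echo "pre-EDT kernel: pair BBR with sch_fq"
fi
```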

Eric Dumazet

Aug 27, 2019, 1:51:23 AM
to Benjamin McAlary, Dave Taht, BBR Development
Another argument in favor of FQ pacing is that it interacts
better with a saturated link.

BQL and/or NIC xoff signals mean that the high-resolution timers might
never be started at all, since all packets can leave the FQ layer with
an skb->tstamp already in the past.

TCP internal pacing, never knowing about that, would start
all these timers anyway.

Benjamin McAlary

Aug 27, 2019, 1:51:33 AM
to Jonathan Morton, BBR Development
Hi Jonathan,

Thanks for your response.
I defer to your judgement but want to double check I understood you correctly:

>The example given there is for sch_fq, *not* for sch_fq_codel.

Are you sure we're not using codel in the problematic case?
The sysctl shows codel:
net.core.default_qdisc = fq_codel

And the tc output shows codel as well
root@ip-10-125-126-73:/home/ubuntu# tc qdisc show
qdisc noqueue 0: dev lo root refcnt 2
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn

Whereas on an updated node where we have sch_fq manually specified we see:

root@ip-10-125-120-6:/home/ubuntu# sysctl net.core.default_qdisc
net.core.default_qdisc = fq
root@ip-10-125-120-6:/home/ubuntu# tc qdisc show
qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 0: dev ens5 root
qdisc fq 0: dev ens5 parent :2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028 initial_quantum 15140 low_rate_threshold 550Kbit refill_delay 40.0ms

Perhaps I misunderstood you.

> More generally, when reporting a bug in any software under active development, the *first* thing you need to do is upgrade to at least the latest stable version, since it's very likely that other people noticed the same bug and fixed it meanwhile.

You're absolutely right, we should have done this, but changing kernel versions for our production environment (the only place the issue is seen) is not as easy as enabling and disabling fq/fq_codel (still, this isn't this group's problem, it's our problem). Just to give you our perspective on why we opened this thread: we found only breadcrumbs online and wanted to raise a flag with the appropriate group, since there may be a lot of other deployments out there, or similarly troubled developers/teams, who might blindly set sysctl -w net.ipv4.tcp_congestion_control=bbr on 18.04 LTS (a very common OS in production environments - maybe the most common?) as we did and then begin seeing softirq/LOC storms.
We wanted to simply reach out and say "hey, we booted Ubuntu 18.04 LTS, enabled BBR, started serving traffic, and started hitting 100% CPU at 1-5 Mbit/s, and we don't think that's right".
We also wanted to offer a short-term workaround of setting "fq" instead of the default "fq_codel", which seems not to suffer from the 100% softirq storms.

I'd be happy to run a node in production with whatever combination of sysctls and Linux kernel you think would be useful to the group here if there is an interesting problem; otherwise we can leave this as a sign-post for future engineers. :)



Ben McAlary
Network Engineer - Atlassian
http://www.atlassian.com

Benjamin McAlary

Aug 27, 2019, 2:12:25 AM
to Dave Taht, BBR Development
> As for the cpu hit you are observing - I have no way to duplicate your
workload. Does sound buggy and a worthwhile thing to analyze
independently. But I'd recommend a kernel upgrade first as a test -
and then,
packet captures and more flame graphs. :/

We agree :) haha - just need to find that elusive time.

Ben McAlary
Network Engineer - Atlassian
http://www.atlassian.com

Benjamin McAlary

Aug 27, 2019, 2:19:04 AM
to Jonathan Morton, BBR Development
Jonathan, apologies - you're right, fq_codel on 4.15 has no configurable pacing parameters. This whole thread does seem to be the result of us simply using an incorrect default thrown at us by Ubuntu:

root@ip-10-125-126-73:/home/ubuntu# tc qdisc replace dev eth0  root fq asdads
What is "asdads"?
Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
              [ quantum BYTES ] [ initial_quantum BYTES ]
              [ maxrate RATE  ] [ buckets NUMBER ]
              [ [no]pacing ] [ refill_delay TIME ]
              [ low_rate_threshold RATE ]
              [ orphan_mask MASK]
root@ip-10-125-126-73:/home/ubuntu# tc qdisc replace dev eth0  root fq_codel asdas
What is "asdas"?
Usage: ... fq_codel [ limit PACKETS ] [ flows NUMBER ]
                    [ target TIME ] [ interval TIME ]
                    [ quantum BYTES ] [ [no]ecn ]
                    [ ce_threshold TIME ]

From the start we should have gone with sch_fq.

Thanks Jonathan.

And thanks Eric Dumazet and Dave.

We're big fans of all the work Jonathan, Eric, Dave and everyone here are doing. We've definitely seen some major improvements in a variety of scenarios.

Ben McAlary
Network Engineer - Atlassian
http://www.atlassian.com

Dave Taht

Aug 27, 2019, 2:59:54 PM
to Benjamin McAlary, BBR Development
On Mon, Aug 26, 2019 at 11:12 PM Benjamin McAlary
<bmca...@atlassian.com> wrote:
>
> > As for the cpu hit you are observing - I have no way to duplicate your
> workload. Does sound buggy and a worthwhile thing to analyze
> independently. But I'd recommend a kernel upgrade first as a test -
> and then,
> packet captures and more flame graphs. :/
>
> We agree :) haha - just need to find that elusive time.

groovy.

How AWS does shaping underneath the VM or exiting the network is a
mystery (anyone want to talk?). Azure polices, but that's all I know.

I've sometimes thought it would be better if your VM itself did the
rate limiting below what you are buying, but I don't know. Recently Eric
proposed a means (with EDT support in 4.20 and later) using sch_fq + eBPF
which is quite interesting (
https://www.mail-archive.com/net...@vger.kernel.org/msg312364.html )
vis-à-vis htb or cake.

I'd LOVE a flame graph of those alternatives on your real workload
(but sch_fq is the right thing!). My gut loathes the idea of invoking
an interpreter on every packet, but the method there is *really
elegant*. If we could somehow use timestamps and EDT-style stuff for
routed and udp/vpn'd/namespaced packets also, we'd have hit a holy
grail...

... and then offload it all to hw.

Dave Taht

Aug 27, 2019, 5:04:57 PM
to Benjamin McAlary, BBR Development
On Tue, Aug 27, 2019 at 11:59 AM Dave Taht <dave...@gmail.com> wrote:
>
> On Mon, Aug 26, 2019 at 11:12 PM Benjamin McAlary
> <bmca...@atlassian.com> wrote:
> >
> > > As for the cpu hit you are observing - I have no way to duplicate your
> > workload. Does sound buggy and a worthwhile thing to analyze
> > independently. But I'd recommend a kernel upgrade first as a test -
> > and then,
> > packet captures and more flame graphs. :/
> >
> > We agree :) haha - just need to find that elusive time.
>
> groovy.
>
> how AWS does shaping underneath the vm or exiting the network is a
> mystery. (anyone want to talk?) Azure polices, but that's all I know.
>
> I've sometimes thought it would be better if your vm itself did the
> rate limiting below what you are buying,
> but don't know. Recently eric proposed a means (with edf support in
> 4.20 and later) using sch_fq + ebpf which
> is quite interesting (
> https://www.mail-archive.com/net...@vger.kernel.org/msg312364.html )
> vs a vs htb or cake.

To build on this a little bit - if you are seeing local queue delays in
excess of X (2.5ms? 5ms?) and you are saturating your available
bandwidth, it's time to spin up another docker instance.

Seeing drops or marks locally, from either that bpf proggie or an fq_codel,
for a sustained period definitely should trigger spinning up another
instance; conversely, if you aren't queueing like mad you can spin down some.
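A sketch of what such a trigger could watch - the drop counter in "tc -s qdisc show" output (helper name and sample data are illustrative):

```shell
# Read the first "dropped" counter from `tc -s qdisc show dev <if>` output;
# a sustained increase here would be the scale-up signal described above.
drops_from_stats() {
    echo "$1" | grep -o 'dropped [0-9]*' | head -n1 | awk '{print $2}'
}

stats='qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024
 Sent 123456 bytes 789 pkt (dropped 42, overlimits 0 requeues 0)'
drops_from_stats "$stats"   # prints: 42
```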

If the IRQ storm is coming from fq_codel dropping tons of packets and
the tcp stack going nuts to fill in the holes, doing that bpf shaper
will induce similar symptoms (although sch_fq will still be vastly
better at this).

Roland Bless

Aug 28, 2019, 5:44:51 AM
to Benjamin McAlary, BBR Development
Hi Benjamin,

On 27.08.19 at 04:12 Benjamin McAlary wrote:
> We recently rolled out:
>
> net.ipv4.tcp_congestion_control = bbr
>
> across several thousand nodes
>
> on
>
> 4.15.0-1045-aws #47-Ubuntu SMP Fri Aug 2 13:50:30 UTC 2019 x86_64 x86_64
> x86_64 GNU/Linux
> Ubuntu 18.04.3 LTS
>
> and found that we would sometimes experience high 100% CPU usage on one
> or more cores due to softirq and LOC Local timer interrupts in
> situations (particularly edge nodes) where some packet loss was seen or
> expected.
>
> The benefits of BBR on our long distant and lossy connections are
> amazing, upto a 400% increase in transfer rates, so we decided to
> investigate.

Better be careful and prudent:
please, please check that you are not seeing these benefits merely by
pushing away other competing traffic that is using CUBIC.
We did some tests with cloud servers and saw amazing throughput
improvements, but it was also clear that we suppressed other
competing flows due to BBR's (v1) quite aggressive behavior.
Verifying this requires somewhat careful analysis of retransmission
behavior etc. I guess that BBRv2 would be a much safer variant
to roll out, because it does not ignore packet loss as a congestion
signal the way BBRv1 does.

Regards
Roland