
[PATCH] softirq: let ksoftirqd do its job


Eric Dumazet

Aug 31, 2016, 1:43:02 PM
to Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, Jesper Dangaard Brouer, linux-kernel, netdev, Jonathan Corbet
From: Eric Dumazet <edum...@google.com>

A while back, Paolo and Hannes sent an RFC patch adding threadable
napi poll loop support: https://patchwork.ozlabs.org/patch/620657/

The problem is that softirqs are very aggressive and are often
handled by the current process, even if we are under stress and
ksoftirqd was scheduled so that innocent threads would have a better
chance to make progress.

This patch makes sure that if ksoftirqd is running, we let it
perform the softirq work.

Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/

Tested:

- NIC receiving traffic handled by CPU 0
- UDP receiver running on CPU 0, using a single UDP socket.
- Incoming flood of UDP packets targeting the UDP socket.

Before the patch, the UDP receiver could almost never get cpu cycles and
could only receive ~2,000 packets per second.

After the patch, cpu cycles are split 50/50 between user application and
ksoftirqd/0, and we can effectively read ~900,000 packets per second,
a huge improvement in DOS situation. (Note that more packets are now
dropped by the NIC itself, since the BH handlers get less cpu cycles to
drain RX ring buffer)

Since the load runs in well identified threads context, an admin can
more easily tune process scheduling parameters if needed.

Reported-by: Paolo Abeni <pab...@redhat.com>
Reported-by: Hannes Frederic Sowa <han...@stressinduktion.org>
Signed-off-by: Eric Dumazet <edum...@google.com>
Cc: David Miller <da...@davemloft.net>
Cc: Jesper Dangaard Brouer <jbr...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Rik van Riel <ri...@redhat.com>
---
kernel/softirq.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..8ed90e3a88d6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -78,6 +78,17 @@ static void wakeup_softirqd(void)
}

/*
+ * If ksoftirqd is scheduled, we do not want to process pending softirqs
+ * right now. Let ksoftirqd handle this at its own rate, to get fairness.
+ */
+static bool ksoftirqd_running(void)
+{
+	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
+
+	return tsk && (tsk->state == TASK_RUNNING);
+}
+
+/*
* preempt_count and SOFTIRQ_OFFSET usage:
* - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
* softirq processing.
@@ -313,7 +324,7 @@ asmlinkage __visible void do_softirq(void)

pending = local_softirq_pending();

-	if (pending)
+	if (pending && !ksoftirqd_running())
 		do_softirq_own_stack();

local_irq_restore(flags);
@@ -340,6 +351,9 @@ void irq_enter(void)

static inline void invoke_softirq(void)
{
+	if (ksoftirqd_running())
+		return;
+
if (!force_irqthreads) {
#ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
/*


Jesper Dangaard Brouer

Aug 31, 2016, 3:41:01 PM
to Eric Dumazet, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On Wed, 31 Aug 2016 10:42:29 -0700
Eric Dumazet <eric.d...@gmail.com> wrote:

> From: Eric Dumazet <edum...@google.com>
>
> A while back, Paolo and Hannes sent an RFC patch adding threaded-able
> napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)
>
> The problem seems to be that softirqs are very aggressive and are often
> handled by the current process, even if we are under stress and that
> ksoftirqd was scheduled, so that innocent threads would have more chance
> to make progress.
>
> This patch makes sure that if ksoftirq is running, we let it
> perform the softirq work.
>
> Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/
>
> Tested:
>
> - NIC receiving traffic handled by CPU 0
> - UDP receiver running on CPU 0, using a single UDP socket.
> - Incoming flood of UDP packets targeting the UDP socket.
>
> Before the patch, the UDP receiver could almost never get cpu cycles and
> could only receive ~2,000 packets per second.
>
> After the patch, cpu cycles are split 50/50 between user application and
> ksoftirqd/0, and we can effectively read ~900,000 packets per second,
> a huge improvement in DOS situation. (Note that more packets are now
> dropped by the NIC itself, since the BH handlers get less cpu cycles to
> drain RX ring buffer)

I can confirm the improvement of approx 900Kpps (no wonder people have
been complaining about DoS against UDP/DNS servers).

BUT during my extensive testing of this patch, I also think that we
have not gotten to the bottom of this. I was expecting to see a higher
(collective) PPS number as I add more UDP servers, but I don't.

Running many UDP netperf's with command:
super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N

With 'top' I can see ksoftirqd is still getting a higher %CPU time:

PID %CPU TIME+ COMMAND
3 36.5 2:28.98 ksoftirqd/0
10724 9.6 0:01.05 netserver
10722 9.3 0:01.05 netserver
10723 9.3 0:01.05 netserver
10725 9.3 0:01.05 netserver


> Since the load runs in well identified threads context, an admin can
> more easily tune process scheduling parameters if needed.

With this patch applied, I found that changing the UDP server process,
scheduler policy to SCHED_RR or SCHED_FIFO gave me a performance boost
from 900Kpps to 1.7Mpps, and not a single UDP packet dropped (even with
a single UDP stream, also tested with more)

Command used:
sudo chrt --rr -p 20 $(pgrep netserver)

The scheduling picture also changed a lot:

PID %CPU TIME+ COMMAND
10783 24.3 0:21.53 netserver
10784 24.3 0:21.53 netserver
10785 24.3 0:21.52 netserver
10786 24.3 0:21.50 netserver
3 2.7 3:12.18 ksoftirqd/0
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

Eric Dumazet

Aug 31, 2016, 4:42:41 PM
to Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:

> I can confirm the improvement of approx 900Kpps (no wonder people have
> been complaining about DoS against UDP/DNS servers).
>
> BUT during my extensive testing, of this patch, I also think that we
> have not gotten to the bottom of this. I was expecting to see a higher
> (collective) PPS number as I add more UDP servers, but I don't.
>
> Running many UDP netperf's with command:
> super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N

Are you sure sender can send fast enough ?

>
> With 'top' I can see ksoftirq are still getting a higher %CPU time:
>
> PID %CPU TIME+ COMMAND
> 3 36.5 2:28.98 ksoftirqd/0
> 10724 9.6 0:01.05 netserver
> 10722 9.3 0:01.05 netserver
> 10723 9.3 0:01.05 netserver
> 10725 9.3 0:01.05 netserver

Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
and 4 sockets using SO_REUSEPORT)

10755 root 20 0 34948 4 0 S 79.7 0.0 0:33.66 udprcv
3 root 20 0 0 0 0 R 19.9 0.0 0:25.49 ksoftirqd/0

Pressing 'H' in top gives :

3 root 20 0 0 0 0 R 19.9 0.0 0:47.84 ksoftirqd/0
10756 root 20 0 34948 4 0 R 19.9 0.0 0:30.76 udprcv
10757 root 20 0 34948 4 0 R 19.9 0.0 0:30.76 udprcv
10758 root 20 0 34948 4 0 S 19.9 0.0 0:30.76 udprcv
10759 root 20 0 34948 4 0 S 19.9 0.0 0:30.76 udprcv


Patch was on top of commit 071e31e254e0e0c438eecba3dba1d6e2d0da36c2

>
>
> > Since the load runs in well identified threads context, an admin can
> > more easily tune process scheduling parameters if needed.
>
> With this patch applied, I found that changing the UDP server process,
> scheduler policy to SCHED_RR or SCHED_FIFO gave me a performance boost
> from 900Kpps to 1.7Mpps, and not a single UDP packet dropped (even with
> a single UDP stream, also tested with more)
>
> Command used:
> sudo chrt --rr -p 20 $(pgrep netserver)


Sure, this is what I mentioned in my changelog : Once we properly
schedule and rely on ksoftirqd, tuning is available.

Jesper Dangaard Brouer

Aug 31, 2016, 5:51:47 PM
to Eric Dumazet, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On Wed, 31 Aug 2016 13:42:30 -0700
Eric Dumazet <eric.d...@gmail.com> wrote:

> On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
>
> > I can confirm the improvement of approx 900Kpps (no wonder people have
> > been complaining about DoS against UDP/DNS servers).
> >
> > BUT during my extensive testing, of this patch, I also think that we
> > have not gotten to the bottom of this. I was expecting to see a higher
> > (collective) PPS number as I add more UDP servers, but I don't.
> >
> > Running many UDP netperf's with command:
> > super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N
>
> Are you sure sender can send fast enough ?

Yes, as I can see drops (UdpRcvbufErrors from overrunning the UDP
rcvbuf limit). Switching to pktgen and udp_sink to be sure.

> >
> > With 'top' I can see ksoftirq are still getting a higher %CPU time:
> >
> > PID %CPU TIME+ COMMAND
> > 3 36.5 2:28.98 ksoftirqd/0
> > 10724 9.6 0:01.05 netserver
> > 10722 9.3 0:01.05 netserver
> > 10723 9.3 0:01.05 netserver
> > 10725 9.3 0:01.05 netserver
>
> Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
> and 4 sockets using SO_REUSEPORT)
>
> 10755 root 20 0 34948 4 0 S 79.7 0.0 0:33.66 udprcv
> 3 root 20 0 0 0 0 R 19.9 0.0 0:25.49 ksoftirqd/0
>
> Pressing 'H' in top gives :
>
> 3 root 20 0 0 0 0 R 19.9 0.0 0:47.84 ksoftirqd/0
> 10756 root 20 0 34948 4 0 R 19.9 0.0 0:30.76 udprcv
> 10757 root 20 0 34948 4 0 R 19.9 0.0 0:30.76 udprcv
> 10758 root 20 0 34948 4 0 S 19.9 0.0 0:30.76 udprcv
> 10759 root 20 0 34948 4 0 S 19.9 0.0 0:30.76 udprcv

Yes, I'm seeing the same when running 5 instances of my own udp_sink[1]:
sudo taskset -c 0 ./udp_sink --port 10003 --recvmsg --reuse-port --count $((10**10))

PID S %CPU TIME+ COMMAND
3 R 21.6 2:21.33 ksoftirqd/0
3838 R 15.9 0:02.18 udp_sink
3856 R 15.6 0:02.16 udp_sink
3862 R 15.6 0:02.16 udp_sink
3844 R 15.3 0:02.15 udp_sink
3850 S 15.3 0:02.15 udp_sink

This is the expected result: adding more userspace receivers scales
up. I needed 5 udp_sink's before I stopped seeing any drops; either
this says the job performed by ksoftirqd is 5 times faster, or the
collective queue size of the programs was large enough to absorb the
scheduling jitter.

The result from this run was 1,517,248 pps handled, without any
drops, all processes pinned to the same CPU.

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 1517225 0.0
IpInDelivers 1517224 0.0
UdpInDatagrams 1517248 0.0
IpExtInOctets 69793408 0.0
IpExtInNoECTPkts 1517246 0.0

I'm acking this patch:

Acked-by: Jesper Dangaard Brouer <bro...@redhat.com>

>
> Patch was on top of commit 071e31e254e0e0c438eecba3dba1d6e2d0da36c2

Mine on top of commit 84fd1b191a9468

> >
> >
> > > Since the load runs in well identified threads context, an admin can
> > > more easily tune process scheduling parameters if needed.
> >
> > With this patch applied, I found that changing the UDP server process,
> > scheduler policy to SCHED_RR or SCHED_FIFO gave me a performance boost
> > from 900Kpps to 1.7Mpps, and not a single UDP packet dropped (even with
> > a single UDP stream, also tested with more)
> >
> > Command used:
> > sudo chrt --rr -p 20 $(pgrep netserver)
>
>
> Sure, this is what I mentioned in my changelog : Once we properly
> schedule and rely on ksoftirqd, tuning is available.
>
> >
> > The scheduling picture also change a lot:
> >
> > PID %CPU TIME+ COMMAND
> > 10783 24.3 0:21.53 netserver
> > 10784 24.3 0:21.53 netserver
> > 10785 24.3 0:21.52 netserver
> > 10786 24.3 0:21.50 netserver
> > 3 2.7 3:12.18 ksoftirqd/0
> >


[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

Eric Dumazet

Aug 31, 2016, 6:27:42 PM
to Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On Wed, 2016-08-31 at 23:51 +0200, Jesper Dangaard Brouer wrote:

>
> The result from this run were handling 1,517,248 pps, without any
> drops, all processes pinned to the same CPU.
>
> $ nstat > /dev/null && sleep 1 && nstat
> #kernel
> IpInReceives 1517225 0.0
> IpInDelivers 1517224 0.0
> UdpInDatagrams 1517248 0.0
> IpExtInOctets 69793408 0.0
> IpExtInNoECTPkts 1517246 0.0
>
> I'm acking this patch:
>
> Acked-by: Jesper Dangaard Brouer <bro...@redhat.com>
>

Thanks a lot for bringing back the issue to me again, and all your
tests !



Rick Jones

Aug 31, 2016, 6:47:57 PM
to Eric Dumazet, Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
With regard to drops, are both of you sure you're using the same socket
buffer sizes?

In the meantime, is anything interesting happening with TCP_RR or
TCP_STREAM?

happy benchmarking,

rick jones

Eric Dumazet

Aug 31, 2016, 7:17:29 PM
to Rick Jones, Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:
> With regard to drops, are both of you sure you're using the same socket
> buffer sizes?

Does it really matter ?

I used the standard /proc/sys/net/core/rmem_default, but under flood
receive queue is almost always full, even if you make it bigger.

By varying its size, you only make batches bigger and number of context
switches should be lower, if only two threads are competing for the cpu.

Exact 'optimal' size would depend on various factors, depending on
application and platform constraints.

>
> In the meantime, is anything interesting happening with TCP_RR or
> TCP_STREAM?

TCP_RR is driven by the network latency, we do not drop packets in the
socket itself.

TCP_STREAM is normally paced by the ability of the receiver to send ACK
packets. TCP has this auto-regulating mode, unless the sender violates
the RFC(s).

If your question is :

What happens if thousands of threads on the host want the cpu, and
ksoftirqd does not get enough cycles by virtue of being a normal thread ?

Then, you are back to typical provisioning problems, and normally people
play with priorities and containers/cgroups, and/or various techniques
like RPS/RFS

(You can change ksoftirqd priority if you like)



Rick Jones

Aug 31, 2016, 7:30:08 PM
to Eric Dumazet, Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On 08/31/2016 04:11 PM, Eric Dumazet wrote:
> On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:
>> With regard to drops, are both of you sure you're using the same socket
>> buffer sizes?
>
> Does it really matter ?

At least at points in the past I have seen different drop counts at the
SO_RCVBUF limit when using (sometimes much) larger sizes. The hypothesis
I was operating under at the time was that this dealt with those
situations where the netserver was held off from running for "a little
while" from time to time. It didn't change things for a sustained
overload situation though.

>> In the meantime, is anything interesting happening with TCP_RR or
>> TCP_STREAM?
>
> TCP_RR is driven by the network latency, we do not drop packets in the
> socket itself.

I've been of the opinion it (single stream) is driven by path length.
Sometimes by NIC latency. But then I'm almost always measuring in the
LAN rather than across the WAN.

happy benchmarking,

rick

Jesper Dangaard Brouer

Sep 1, 2016, 6:39:14 AM
to Rick Jones, bro...@redhat.com, Eric Dumazet, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Wed, 31 Aug 2016 16:29:56 -0700 Rick Jones <rick....@hpe.com> wrote:
> On 08/31/2016 04:11 PM, Eric Dumazet wrote:
> > On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:
> >> With regard to drops, are both of you sure you're using the same socket
> >> buffer sizes?
> >
> > Does it really matter ?
>
> At least at points in the past I have seen different drop counts at the
> SO_RCVBUF based on using (sometimes much) larger sizes. The hypothesis
> I was operating under at the time was that this dealt with those
> situations where the netserver was held-off from running for "a little
> while" from time to time. It didn't change things for a sustained
> overload situation though.

Yes, Rick, your hypothesis corresponds to my measurements. The
userspace program is held off from running for "a little while" from
time to time. I've measured this with perf sched record/latency. It
is sort of a natural scheduler characteristic.
The userspace UDP socket program consumes/needs more cycles to perform
its job than ksoftirqd. Thus the UDP-prog uses up its sched
time-slice, and periodically ksoftirqd gets scheduled multiple times,
because the UDP-prog doesn't have any credits any longer.

WARNING: Do not increase the socket queue size to paper over this issue;
it is the WRONG solution and will give horrible latency issues.

With the above warning in place: yes, you are also right that
increasing the socket buffer size can be used to mitigate/hide the
packet drops. You can even increase the socket size so much that the
drop problem "goes away". The queue simply needs to be deep enough to
absorb the worst/maximum time the UDP-prog was scheduled out. The
hidden effect that makes this work (without contradicting queueing
theory) is that it also slows down/costs more cycles for ksoftirqd/NAPI,
as it costs more to enqueue a packet than to drop it on a full queue.

You can measure the sched "Maximum delay" using:
sudo perf sched record -C 0 sleep 10
sudo perf sched latency

On my setup I measured a "Maximum delay" of approx 9 ms. Given an
incoming packet rate of 2.4Mpps (of which 880Kpps reach the UDP-prog),
and knowing the network stack uses skb->truesize (approx 2048 bytes on
this driver), I can calculate that I need an approx 45MBytes buffer
((2.4*10^6)*(9/1000)*2048 = 44.2MB)

The PPS measurement comes from:

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 2335926 0.0
IpInDelivers 2335925 0.0
UdpInDatagrams 880086 0.0
UdpInErrors 1455850 0.0
UdpRcvbufErrors 1455850 0.0
IpExtInOctets 107453056 0.0

Changing queue size to 50MBytes :
sysctl -w net/core/rmem_max=$((50*1024*1024)) ;\
sysctl -w net.core.rmem_default=$((50*1024*1024))

New result looks "nice", with no drops, and 1.42Mpps delivered to
UDP-prog, but in reality it is not nice for latency...

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 1425013 0.0
IpInDelivers 1425017 0.0
UdpInDatagrams 1432139 0.0
IpExtInOctets 65539328 0.0
IpExtInNoECTPkts 1424771 0.0

Tracking of queue size, max, min and average:

while (true); do netstat -uan | grep '0.0.0.0:9'; sleep 0.3; done |
awk 'BEGIN {max=0;min=0xffffffff;sum=0;n=0} \
{if ($2 > max) max=$2;
if ($2 < min) min=$2;
n++; sum+=$2;
printf "%s Recv-Q: %d max: %d min: %d ave: %.3f\n",$1,$2,max,min,sum/n;}';
Result:
udp Recv-Q: 23624832 max: 47058176 min: 4352 ave: 25092687.698

I see a max queue of 47MBytes, and worse, an average standing queue of
25MBytes, which is really bad for the latency seen by the
application. Having this much outstanding memory is also bad for
CPU cache effects, and stresses the memory allocator.
I'm actually using this huge queue "misconfig" to stress the page
allocator and my page_pool implementation into worst-case situations ;-)

Jesper Dangaard Brouer

Sep 1, 2016, 7:02:45 AM
to Eric Dumazet, bro...@redhat.com, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
I need some help from scheduler people explaining this!

In the above run of udp_sink (which had the expected behavior), I ran
udp_sink in 5 different xterms/shells. Below, I'm running all 5
udp_sink programs from the same bash shell (just backgrounding them).

PID S %CPU TIME+ COMMAND
3 R 50.0 29:02.23 ksoftirqd/0
10881 R 10.7 1:01.61 udp_sink
10837 R 10.0 1:05.20 udp_sink
10852 S 10.0 1:01.78 udp_sink
10862 R 10.0 1:05.19 udp_sink
10844 S 9.7 1:01.91 udp_sink

This is strange, why is ksoftirqd/0 getting 50% of the CPU time???


And I'm no longer getting the full throughput delivered into userspace
(as I did before with 5 receivers).

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 1234368 0.0
IpInDelivers 1234368 0.0
UdpInDatagrams 1133971 0.0
UdpInErrors 80332 0.0
UdpRcvbufErrors 80332 0.0
IpExtInOctets 56792704 0.0
IpExtInNoECTPkts 1234624 0.0

Hannes Frederic Sowa

Sep 1, 2016, 7:12:03 AM
to Jesper Dangaard Brouer, Eric Dumazet, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, linux-kernel, netdev, Jonathan Corbet
Could you enable schedstats (sysctl schedstats) and show
/proc/ksoftirq*/sched?

Thanks,
Hannes



Peter Zijlstra

Sep 1, 2016, 7:54:07 AM
to Jesper Dangaard Brouer, Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:
> PID S %CPU TIME+ COMMAND
> 3 R 50.0 29:02.23 ksoftirqd/0
> 10881 R 10.7 1:01.61 udp_sink
> 10837 R 10.0 1:05.20 udp_sink
> 10852 S 10.0 1:01.78 udp_sink
> 10862 R 10.0 1:05.19 udp_sink
> 10844 S 9.7 1:01.91 udp_sink
>
> This is strange, why is ksoftirqd/0 getting 50% of the CPU time???

Do you run your udp_sink thingy in a cpu-cgroup?

Hannes Frederic Sowa

Sep 1, 2016, 8:01:23 AM
to Eric Dumazet, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Jesper Dangaard Brouer, linux-kernel, netdev, Jonathan Corbet
Acked-by: Hannes Frederic Sowa <han...@stressinduktion.org>

Thanks,
Hannes


Hannes Frederic Sowa

Sep 1, 2016, 8:06:03 AM
to Eric Dumazet, Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, linux-kernel, netdev, Jonathan Corbet
On 31.08.2016 22:42, Eric Dumazet wrote:
> On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
>
>> I can confirm the improvement of approx 900Kpps (no wonder people have
>> been complaining about DoS against UDP/DNS servers).
>>
>> BUT during my extensive testing, of this patch, I also think that we
>> have not gotten to the bottom of this. I was expecting to see a higher
>> (collective) PPS number as I add more UDP servers, but I don't.
>>
>> Running many UDP netperf's with command:
>> super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N
>
> Are you sure sender can send fast enough ?
>
>>
>> With 'top' I can see ksoftirq are still getting a higher %CPU time:
>>
>> PID %CPU TIME+ COMMAND
>> 3 36.5 2:28.98 ksoftirqd/0
>> 10724 9.6 0:01.05 netserver
>> 10722 9.3 0:01.05 netserver
>> 10723 9.3 0:01.05 netserver
>> 10725 9.3 0:01.05 netserver
>
> Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
> and 4 sockets using SO_REUSEPORT)

Would it make sense to include the used socket backlog in the UDP socket
lookup compute_score calculation? Just throwing the idea out there; I
could actually imagine it also causing bad side effects.

Jesper Dangaard Brouer

Sep 1, 2016, 8:30:41 AM
to Peter Zijlstra, Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet, bro...@redhat.com
That was also Paolo's feedback (IRC). I'm not aware of it, but it
might be some distribution (Fedora 22) default thing.

How do I verify/check if I have enabled a cpu-cgroup?

Jesper Dangaard Brouer

Sep 1, 2016, 8:39:13 AM
to Peter Zijlstra, Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet, bro...@redhat.com
On Thu, 1 Sep 2016 14:29:25 +0200
Jesper Dangaard Brouer <bro...@redhat.com> wrote:

> On Thu, 1 Sep 2016 13:53:56 +0200
> Peter Zijlstra <pet...@infradead.org> wrote:
>
> > On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:
> > > PID S %CPU TIME+ COMMAND
> > > 3 R 50.0 29:02.23 ksoftirqd/0
> > > 10881 R 10.7 1:01.61 udp_sink
> > > 10837 R 10.0 1:05.20 udp_sink
> > > 10852 S 10.0 1:01.78 udp_sink
> > > 10862 R 10.0 1:05.19 udp_sink
> > > 10844 S 9.7 1:01.91 udp_sink
> > >
> > > This is strange, why is ksoftirqd/0 getting 50% of the CPU time???
> >
> > Do you run your udp_sink thingy in a cpu-cgroup?
>
> That was also Paolo's feedback (IRC). I'm not aware of it, but it
> might be some distribution (Fedora 22) default thing.

Correction, on the server-under-test, I'm actually running RHEL7.2


> How do I verify/check if I have enabled a cpu-cgroup?

Hannes says I can look in "/proc/self/cgroup"

$ cat /proc/self/cgroup
7:net_cls:/
6:blkio:/
5:devices:/
4:perf_event:/
3:cpu,cpuacct:/
2:cpuset:/
1:name=systemd:/user.slice/user-1000.slice/session-c1.scope

And that "/" indicates I've not enabled cgroups, right?

Peter Zijlstra

Sep 1, 2016, 8:48:50 AM
to Jesper Dangaard Brouer, Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
Mostly so. I think RHEL/Fedora has SCHED_AUTOGROUP enabled, and you can
find that through:

cat /proc/self/autogroup

And disable with the noautogroup boot param, or:

echo 0 > /proc/sys/kernel/sched_autogroup_enabled

although this latter will leave the current state intact while avoiding
creation of any further autogroups iirc.


Eric Dumazet

Sep 1, 2016, 8:51:29 AM
to Hannes Frederic Sowa, Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, linux-kernel, netdev, Jonathan Corbet
On Thu, 2016-09-01 at 14:05 +0200, Hannes Frederic Sowa wrote:

> Would it make sense to include used socket backlog in udp socket lookup
> compute_score calculation? Just want to throw out the idea, I actually
> could imagine to also cause bad side effects.

Hopefully we can get rid of the backlog for UDP, by no longer having to
lock the socket in RX path, and perform memory charging in a better way.

The backlog for TCP is problematic for high speed flows, and for UDP it
is problematic in flood situations as a single recvmsg() might have to
process thousands of skbs before returning to user space.

What you suggest is going to be difficult :

1) Packets of a 5-tuple (eg QUIC flow) won't all land in the same silo,
and will cause reorders or application issues.

2) SO_ATTACH_REUSEPORT_CBPF won't have access to the socket(s) backlog to
perform the choice.

Thanks.


Eric Dumazet

Sep 1, 2016, 9:00:53 AM
to Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On Thu, 2016-09-01 at 14:38 +0200, Jesper Dangaard Brouer wrote:

> Correction, on the server-under-test, I'm actually running RHEL7.2
>
>
> > How do I verify/check if I have enabled a cpu-cgroup?
>
> Hannes says I can look in "/proc/self/cgroup"
>
> $ cat /proc/self/cgroup
> 7:net_cls:/
> 6:blkio:/
> 5:devices:/
> 4:perf_event:/
> 3:cpu,cpuacct:/
> 2:cpuset:/
> 1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
>
> And that "/" indicate I've not enabled cgroups, right?
>

In my experience, I found that the times displayed by top are often off
for softirq processing.

Before applying my patch, top shows a very small amount of cpu time for
udp_rcv and ksoftirqd/0, while obviously cpu 0 is completely busy.

Make sure to try the latest Linus tree, as I did yesterday, because
apparently things are better than a few weeks back.

BTW, even 'perf top' sometimes has problems showing me cycles spent in
softirq. I need to make sure the cpu processing NIC interrupts also
spends cycles in some user space program to get meaningful results.



Eric Dumazet

Sep 1, 2016, 9:06:14 AM
to Jesper Dangaard Brouer, Rick Jones, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On Thu, 2016-09-01 at 12:38 +0200, Jesper Dangaard Brouer wrote:

> I see max queue of 47MBytes, and worse an average standing queue of
> 25Mbytes, which is really bad for the latency seen by the
> application. And having this much outstanding memory is also bad for
> CPU cache size effects, and stressing the memory allocator.
> I'm actually using this huge queue "misconfig" to stress the page
> allocator and my page_pool implementation into worse case situations ;-)
>

Since commit 95766fff6b9a78d11f ("[UDP]: Add memory accounting."),
it is dangerous to have a big SO_RCVBUF value, since it adds unexpected
recvmsg() latencies.

1) User thread locks the socket.
2) Gets one skb from the receive queue.
3) An incoming flood of UDP packets is processed by softirq.
4) The socket is found 'owned by the user'.
5) Packets are parked in the 'socket backlog', up to the SO_RCVBUF
limit.
6) User thread releases the socket.
7) It finds many skbs in the backlog and has to process them _all_,
re-injecting them into the socket receive queue.
8) Return to user space.


Time spent in 7) can be in the order of millions of cpu cycles...

At least starting from 5413d1babe8f10d ("net: do not block BH while
processing socket backlog") we no longer block BH while doing 7) and we
have cond resched points.






Hannes Frederic Sowa

Sep 1, 2016, 9:08:16 AM
to Eric Dumazet, Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, linux-kernel, netdev, Jonathan Corbet
I think that ksoftirqd time is actually accounted to system:

excerpt from irqtime_account_process_tick in kernel/sched/cputime.c

	if (this_cpu_ksoftirqd() == p) {
		/*
		 * ksoftirqd time do not get accounted in cpu_softirq_time.
		 * So, we have to handle it separately here.
		 * Also, p->stime needs to be updated for ksoftirqd.
		 */
		__account_system_time(p, cputime, scaled, CPUTIME_SOFTIRQ);
	} else if (user_tick) {

Eric Dumazet

unread,
Sep 1, 2016, 9:25:46 AM9/1/16
to Hannes Frederic Sowa, Jesper Dangaard Brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni, linux-kernel, netdev, Jonathan Corbet
Tell me more about kernel/sched/cputime.c stability over recent linux
versions ;)

git log --oneline v4.2.. kernel/sched/cputime.c
03cbc732639ddcad15218c4b2046d255851ff1e3 sched/cputime: Resync steal time when guest & host lose sync
173be9a14f7b2e901cf77c18b1aafd4d672e9d9e sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression
26f2c75cd2cf10a6120ef02ca9a94db77cc9c8e0 sched/cputime: Fix omitted ticks passed in parameter
f9bcf1e0e0145323ba2cf72ecad5264ff3883eb1 sched/cputime: Fix steal time accounting
08fd8c17686c6b09fa410a26d516548dd80ff147 Merge tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
553bf6bbfd8a540c70aee28eb50e24caff456a03 sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
0cfdf9a198b0d4f5ad6c87d894db7830b796b2cc sched/cputime: Clean up the old vtime gen irqtime accounting completely
b58c35840521bb02b150e1d0d34ca9197f8b7145 sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
57430218317e5b280a80582a139b26029c25de6c sched/cputime: Count actually elapsed irq & softirq time
ecb23dc6f2eff0ce64dd60351a81f376f13b12cc xen: add steal_clock support on x86
807e5b80687c06715d62df51a5473b231e3e8b15 sched/cputime: Add steal time support to full dynticks CPU time accounting
f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d sched/cputime: Fix steal_account_process_tick() to always return jiffies
ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
c9bed1cf51011c815d88288b774865d013ca78a8 Merge tag 'for-linus-4.5-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
1fe7c4ef88bd32e039f5f4126537c3f20c340414 missing include asm/paravirt.h in cputime.c
b7ce2277f087fd052e7e1bbf432f7fecbee82bb6 sched/cputime: Convert vtime_seqlock to seqcount
e592539466380279a9e6e6fdfe4545aa54f22593 sched/cputime: Introduce vtime accounting check for readers
55dbdcfa05533f44c9416070b8a9f6432b22314a sched/cputime: Rename vtime_accounting_enabled() to vtime_accounting_cpu_enabled()
cab245d68c38afff1a4c4d018ab7e1d316982f5d sched/cputime: Correctly handle task guest time on housekeepers
7098c1eac75dc03fdbb7249171a6e68ce6044a5a sched/cputime: Clarify vtime symbols and document them
7877a0ba5ec63c7b0111b06c773f1696fa17b35a sched/cputime: Remove extra cost in task_cputime()
2541117b0cf79977fa11a0d6e17d61010677bd7b sched/cputime: Fix invalid gtime in proc
9eec50b8bbe1535c440a1ee88c1958f78fc55957 kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support
9d7fb04276481c59610983362d8e023d262b58ca sched/cputime: Guarantee stime + utime == rtime


Jesper Dangaard Brouer

Sep 1, 2016, 9:32:22 AM
to Peter Zijlstra, Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet, bro...@redhat.com
On Thu, 1 Sep 2016 14:48:39 +0200, Peter Zijlstra wrote:
$ cat /proc/self/autogroup
/autogroup-88 nice 0

> And disable with the noautogroup boot param, or:
>
> echo 0 > /proc/sys/kernel/sched_autogroup_enabled

Looks like it is enabled on my system:

$ grep -H . /proc/sys/kernel/sched_autogroup_enabled
/proc/sys/kernel/sched_autogroup_enabled:1


> although this latter will leave the current state intact while avoiding
> creation of any further autogroups iirc.

$ sudo sh -c 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled'
$ grep -H . /proc/sys/kernel/sched_autogroup_enabled
/proc/sys/kernel/sched_autogroup_enabled:0

$ sudo systemctl restart sshd

Starting a new SSH login:

$ cat /proc/self/autogroup
/autogroup-153 nice 0

Hmmm, still enabled...

$ sudo systemctl stop sshd
$ sudo systemctl start sshd
$ grep -H . /proc/sys/kernel/sched_autogroup_enabled
/proc/sys/kernel/sched_autogroup_enabled:0
$ cat /proc/self/autogroup
/autogroup-158 nice 0

Still... enabled!
Hmmm... any more ideas on how to disable this?

Peter Zijlstra

Sep 1, 2016, 11:28:23 AM
to Jesper Dangaard Brouer, Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet
On Thu, Sep 01, 2016 at 03:30:42PM +0200, Jesper Dangaard Brouer wrote:
> Still... enabled!
> Hmmm... any more ideas on how to disable this?

I think you ought to be able to assign yourself to the root cgroup,
something like:

echo $$ > /cgroup/tasks

or wherever the cpu cgroup controller is mounted.

But it's been a fair while since I touched any of that; it's not a CONFIG
I have enabled much.

David Miller

Sep 2, 2016, 2:39:21 AM
to eric.d...@gmail.com, pet...@infradead.org, ri...@redhat.com, pab...@redhat.com, han...@redhat.com, jbr...@redhat.com, linux-...@vger.kernel.org, net...@vger.kernel.org, cor...@lwn.net
From: Eric Dumazet <eric.d...@gmail.com>
Date: Wed, 31 Aug 2016 10:42:29 -0700
I'm just kind of assuming this won't go through my tree, but I can take
it if that's what everyone agrees to.

Jesper Dangaard Brouer

Sep 2, 2016, 4:35:52 AM
to Peter Zijlstra, Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet, bro...@redhat.com
I could not figure out how to disable autogroups, so I ended up
compiling the kernel without CONFIG_SCHED_AUTOGROUP.

PID PR S %CPU TIME+ COMMAND
3 20 R 20.7 0:53.05 ksoftirqd/0
9299 20 R 16.3 0:03.62 udp_sink
9296 20 S 16.0 0:03.59 udp_sink
9297 20 R 16.0 0:03.58 udp_sink
9298 20 R 16.0 0:03.57 udp_sink
9295 20 R 15.3 0:03.43 udp_sink

Top now shows that the CPU distribution is more correct, so we can
conclude that the artifact I saw was indeed caused by autogroup.

I can also confirm that my netperf UDP_STREAM tests now work again,
but I need around 32 parallel netperf instances to counter the
effectiveness of the ksoftirqd process, while I only need 5 udp_sink
programs.

Daniel Borkmann

Sep 23, 2016, 7:36:18 AM
to David Miller, eric.d...@gmail.com, pet...@infradead.org, ri...@redhat.com, pab...@redhat.com, han...@redhat.com, jbr...@redhat.com, linux-...@vger.kernel.org, net...@vger.kernel.org, cor...@lwn.net
Was this actually picked up somewhere in the meantime?

Peter Zijlstra

Sep 23, 2016, 7:53:54 AM
to Daniel Borkmann, David Miller, eric.d...@gmail.com, ri...@redhat.com, pab...@redhat.com, han...@redhat.com, jbr...@redhat.com, linux-...@vger.kernel.org, net...@vger.kernel.org, cor...@lwn.net, Ingo Molnar
I can queue it for tip. In fact, I've just done so to avoid losing it.
If anybody else wants it, holler.

Jesper Dangaard Brouer

Sep 23, 2016, 12:51:27 PM
to Peter Zijlstra, Daniel Borkmann, David Miller, eric.d...@gmail.com, ri...@redhat.com, pab...@redhat.com, han...@redhat.com, linux-...@vger.kernel.org, net...@vger.kernel.org, cor...@lwn.net, Ingo Molnar
Good that you are picking this up! It is a very important fix, at least
for networking.

This is your git tree, right:
https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/

Doesn't look like you pushed it yet, or do I need to look at a specific
branch?

Peter Zijlstra

Sep 23, 2016, 5:16:30 PM
to Jesper Dangaard Brouer, Daniel Borkmann, David Miller, eric.d...@gmail.com, ri...@redhat.com, pab...@redhat.com, han...@redhat.com, linux-...@vger.kernel.org, net...@vger.kernel.org, cor...@lwn.net, Ingo Molnar
On Fri, Sep 23, 2016 at 06:51:04PM +0200, Jesper Dangaard Brouer wrote:

> This is your git tree, right:
> https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/
>
> Doesn't look like you pushed it yet, or do I need to look at a specific
> branch?

I mainly work from a local quilt queue which I feed to mingo. I
occasionally push out to get build-bot coverage or have people look at
bits I poked together.

That said, I'll try and do a push later tonight.

Do note, however, that that git tree is a complete wipe and rebuild;
don't expect any kind of continuity from it.

tip-bot for Eric Dumazet

Sep 30, 2016, 7:56:06 AM
to linux-ti...@vger.kernel.org, tg...@linutronix.de, linux-...@vger.kernel.org, han...@stressinduktion.org, edum...@google.com, h...@zytor.com, ri...@redhat.com, jbr...@redhat.com, han...@redhat.com, pet...@infradead.org, da...@davemloft.net, cor...@lwn.net, pab...@redhat.com, mi...@kernel.org, torv...@linux-foundation.org
Commit-ID: 4cd13c21b207e80ddb1144c576500098f2d5f882
Gitweb: http://git.kernel.org/tip/4cd13c21b207e80ddb1144c576500098f2d5f882
Author: Eric Dumazet <edum...@google.com>
AuthorDate: Wed, 31 Aug 2016 10:42:29 -0700
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Fri, 30 Sep 2016 10:43:36 +0200

softirq: Let ksoftirqd do its job

A while back, Paolo and Hannes sent an RFC patch adding threaded-able
napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)

The problem seems to be that softirqs are very aggressive and are often
handled by the current process, even if we are under stress and
ksoftirqd was scheduled so that innocent threads would have more of a
chance to make progress.

This patch makes sure that if ksoftirqd is running, we let it
perform the softirq work.

Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/

Tested:

- NIC receiving traffic handled by CPU 0
- UDP receiver running on CPU 0, using a single UDP socket.
- Incoming flood of UDP packets targeting the UDP socket.

Before the patch, the UDP receiver could almost never get CPU cycles and
could only receive ~2,000 packets per second.

After the patch, CPU cycles are split 50/50 between user application and
ksoftirqd/0, and we can effectively read ~900,000 packets per second,
a huge improvement in this DoS situation. (Note that more packets are now
dropped by the NIC itself, since the BH handlers get fewer CPU cycles to
drain the RX ring buffer.)

Since the load runs in well identified threads context, an admin can
more easily tune process scheduling parameters if needed.

Reported-by: Paolo Abeni <pab...@redhat.com>
Reported-by: Hannes Frederic Sowa <han...@stressinduktion.org>
Signed-off-by: Eric Dumazet <edum...@google.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: David Miller <da...@davemloft.net>
Cc: Hannes Frederic Sowa <han...@redhat.com>
Cc: Jesper Dangaard Brouer <jbr...@redhat.com>
Cc: Jonathan Corbet <cor...@lwn.net>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Rik van Riel <ri...@redhat.com>
Cc: Thomas Gleixner <tg...@linutronix.de>
Link: http://lkml.kernel.org/r/1472665349.14...@edumazet-glaptop3.roam.corp.google.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
kernel/softirq.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b..8ed90e3 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -78,6 +78,17 @@ static void wakeup_softirqd(void)
}

/*
+ * If ksoftirqd is scheduled, we do not want to process pending softirqs
+ * right now. Let ksoftirqd handle this at its own rate, to get fairness.
+ */
+static bool ksoftirqd_running(void)
+{
+ struct task_struct *tsk = __this_cpu_read(ksoftirqd);
+
+ return tsk && (tsk->state == TASK_RUNNING);
+}
+
+/*
* preempt_count and SOFTIRQ_OFFSET usage:
* - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
* softirq processing.
@@ -313,7 +324,7 @@ asmlinkage __visible void do_softirq(void)

pending = local_softirq_pending();

- if (pending)
+ if (pending && !ksoftirqd_running())
do_softirq_own_stack();

local_irq_restore(flags);
@@ -340,6 +351,9 @@ void irq_enter(void)

static inline void invoke_softirq(void)
{
+ if (ksoftirqd_running())
+ return;
+
if (!force_irqthreads) {
#ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
/*