Ksoftirqd 0

Pernille Pennebaker

Jul 24, 2024, 7:56:24 AM
to zardjomaka

iperf is not part of the problem: just having a LAN device downloading from the Internet (be it speedtest.net, fast.com, a torrent, or just fast HTTP) is enough to make it struggle when the new OpenWRT is placed in between as the router. Even at 980 Mbps through it as a router, the previous version showed no visible CPU usage in top (or just one or two percent).

The bottleneck begins as soon as ksoftirqd/0 tops out at 25% (I guess that means it is single-threaded, pinned to 1 CPU out of 4). It is all "sirq", which I take to mean "soft IRQ", but people like you will be much better than me at understanding what inside OpenWRT could be using that much CPU time as "sirq".
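For anyone wanting to see where that "sirq" time actually goes, one generic Linux approach (a sketch, not OpenWRT-specific advice) is to diff the per-CPU counters in /proc/softirqs over one second; a fast-climbing NET_RX row in the CPU0 column matches the ksoftirqd/0 symptom described above:

```shell
# Snapshot the per-CPU softirq counters twice, one second apart,
# and show which classes (NET_RX, NET_TX, TIMER, ...) are climbing.
cat /proc/softirqs > /tmp/sirq.before
sleep 1
cat /proc/softirqs > /tmp/sirq.after
# Lines that changed in the last second are the busy softirq classes.
diff /tmp/sirq.before /tmp/sirq.after | grep '^>'
```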

Well, a random observation... may or may not be related... around the time of the dnsmasq-logspam release, my internet went down for half a day, and I noticed a lot of netifd/procd page faults... could be related... (probably a false positive tho')

I may try to see whether the Raspberry Pi kernel (or even better, the exact kernel used by this OpenWRT version) and its related modules have the same issue when run on Raspbian, for example. If so, the problem is on the Raspberry Pi kernel side. If not, it's likely somewhere in the OpenWRT network stack settings - but that is reaching the limits of my knowledge... for now

I'll also search Google a bit more to see what changed between rpi-5.6.y and rpi-5.7.y (and after) regarding network performance. So far I haven't found anything interesting apart from one person who hit the issue, didn't solve it, but worked around it just enough to get his 4 Gbps on a Compute Module... so nothing useful about how this issue actually got fixed

I guess someone who is good at performance tuning and drivers would be able to locate and fix this performance drop, and/or find out how it was fixed in rpi-5.7.y compared to rpi-5.6.y, but that is unfortunately beyond my abilities...

In the following days, I may try to test the different kernel versions against the USB3 ax88179 NIC, and open an issue on the Raspberry Pi kernel GitHub if I'm able to spot when the issue appeared

Edit:
I found a way to gather information about interrupt usage. The thing generating that many interrupts on CPU0 when transferring over the USB3 NIC is BRCM STB PCIe MSI 524288 Edge xhci_hcd
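That kind of check can be done from /proc/interrupts: if the xhci_hcd line only ever increments in the CPU0 column, all USB3 NIC interrupts are landing on one core. A sketch (the IRQ number in the steering comment is a placeholder, not taken from this thread):

```shell
# Show the CPU header plus any xhci (USB3 controller) interrupt lines;
# compare the per-CPU columns to see which core services them.
grep -E 'CPU|xhci' /proc/interrupts

# If everything lands on CPU0, the IRQ can be steered to another core,
# e.g. (replace 42 with the real IRQ number from the output above):
#   echo 2 > /proc/irq/42/smp_affinity_list
```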

I've got an RTL8153 and mine is working fine: just 1% sirq at gigabit.
I can't run the test you ran because my network is shared with others, but you did a really good job testing everything.
It seems like the problem is only with the ax88179

But unless a change occurred in the ax88179 driver, it's strange that the same problem doesn't occur on both the RTL8153 and the ax88179 (I'm going to run the test soon, but EnfermeraSexy, have you tested this on the latest snapshot? Because on previous snapshots, both the ax88179 and the RTL8153 ran perfectly fine at 1 Gbps). That is why I wondered whether this performance regression was located in the PCIe or USB bus handling of the rpi kernel.

If the regression is only in the ax88179 driver (which would then probably not be Raspberry Pi related), that would be useful information for later, when I have more time to dig into the issue (or if anyone gets to it before I do).

By the way, even if buying new stuff is acceptable for testing, trashing and replacing working hardware to work around software regressions should absolutely never be seen as the normal solution. Too often, if not almost always, that is exactly what gets advised. Of course, for old/rare/unsupported hardware it is sometimes the only possible way.

Oh... so I guess it depends on which manufacturer put the chip on the board (mine has an "Edimax" logo on it).
So I hope the one I ordered with the RTL8153 isn't as bad; I'll soon get the answer!
EDIT: I had enough time to modify my order, and ordered the TP-Link one (UE300) suggested by dlakelan, so that my RTL8153 device is more likely to be fine too when it arrives.

As for USB NIC vs. integrated NIC: they sit on two different buses (unless I'm wrong, the only thing on PCIe on the RPi4 is the USB3 controller). This is why I thought the initial issue (affecting both) had a common cause - but the fact that the integrated NIC issue got solved on its own in recent kernels tends to prove I wasn't as right as I thought

After the weekend I'll do some tests with x86_64 and maybe mainline kernels on the RPi4, versus the rpi kernels, and then compare both drivers to spot differences. It will be one of my first times reading driver code, but I'm good at C/C++ and microcontroller programming, so I guess I may succeed in understanding what the code does, and what changed

So the conclusion of this thread is probably going to be the following: the high CPU usage from the integrated NIC has been solved in newer kernels; in any case it doesn't cause as much of a performance problem as the ax88179 regression, which seems to come from the mainline kernel (but only affecting arm64, or maybe just the bcm2711) - and isn't solved in recent kernel versions.

I could probably test a few earlier mainline kernel versions on the Pi 4, but I guess I won't be able to run the bcm2711 SoC with mainline kernels that are too old (and I'm a little too overwhelmed right now to test all of them)

By the way, my TP-Link UE300 / RTL8153 dongle arrived. But lucky as I am, it got squished under some truck(s) before being delivered here ^^ Another one is supposed to arrive next Friday.

I was talking about IRQ9 because at 60k/s it's understandable that ACPI
takes some CPU trying to see if it's the intended receiver, but
ksoftirqd's behaviour is obviously not normal, besides the fact that it
works perfectly for 10-20 minutes at least.

I'll be out for a week at least and I doubt I will be able to answer
back; sorry for the inconvenience, I should've posted this after coming
back. Thanks for the answer though ;)

> I'm running a home PC here, which is acting as a router/firewall too.
> When I get a specific network traffic type (detailed below) ksoftirqd/0
> starts using all CPU time. Network processes also use an extremely high
> amount of CPU time when this happens. I/O is extremely jerky and
> assuming control over the console (being in X) is next to impossible
> without resorting to SysRq (a million thanks to whomever invented it
> :)). Always reproducible after some minutes of this traffic.

I'm having the same problem, discovered today. I was moving some stuff from
a Windows server (Samba) to a Linux server (TCP NFS) via my workstation PC.
Data was transferred at something like 5-6 MB/s (usually no problems up to
this point...) but after I start using SSH connections and writing some
emails or chatting on IRCNet, my computer totally freezes after a couple of
minutes. I haven't been able to reproduce this with normal web browsing or
SSH connections, but it's always reproducible when my eth0 is under heavy
load.

> Maybe some hardware bug that causes it to generate too many interrupts? It's
> strange though, both 8139 cards are identical and eth1 has much more traffic
> than eth2, and something like this has never happened. I use eth2 sparingly
> though, and I have never had it receive so many small packets.

> I just remembered something. In the kernel config, there's an option for
> the RTL-8139 driver to use polling I/O instead of memory-mapped I/O -
> and it usually defaults to PIO. My kernel was compiled using the MMIO
> option. Check and see what you're using.

I'm using the MMIO option; I was just thinking of switching to PIO to see
what will happen. I also just found this thread (URL below), and some
others are having problems too. -kernel/2004/Mar/2295.html

Andrew: did you find any solution to this one? (I guess not, but could I
help somehow to hunt this bug down?)
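For reference, the PIO/MMIO choice discussed above is a kernel config symbol (CONFIG_8139TOO_PIO in mainline configs). Here is a small sketch for checking how a given kernel config was built; the helper name is made up for illustration:

```shell
# Hypothetical helper: report whether an 8139too build uses PIO or MMIO
# by scanning a kernel .config file for CONFIG_8139TOO_PIO.
check_8139_io() {
  if grep -q '^CONFIG_8139TOO_PIO=y' "$1"; then
    echo "PIO"
  else
    echo "MMIO"
  fi
}

# Real usage would be against the running kernel's config, e.g.:
#   check_8139_io /boot/config-"$(uname -r)"
# Demo with a sample config fragment:
printf 'CONFIG_8139TOO=m\nCONFIG_8139TOO_PIO=y\n' > /tmp/sample.config
check_8139_io /tmp/sample.config   # prints "PIO"
```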

> Pasi Sjoholm :
> [...]
> > I haven't been able to reproduce this with normal www-browsing or
> > ssh-connections but it's always reproducible when my eth0 is under heavy
> > load.
> I guess it can be reproduced even if the binary (nvidia ?) module is never
> loaded after boot, right ?

Included in cc.

I haven't changed anything else in my hardware, I just upgraded my
10 Mbps hub to a 100 Mbps switch. I guess 10 Mbps wasn't generating too
many interrupts. =)

Hope this helps..
--
Pasi Sjöholm

Hello!
This looks very much like the problem we see when doing route DoS testing
with Alexey. In summary: high softirq load can totally kill userland. The
reason is that do_softirq() is run from many places (hard interrupts,
local_bh_enable, etc.) and bypasses the ksoftirqd protection. It has just
been discussed at OLS with Andrea, Dipankar and others; current RCU
suffers from this problem as well.

I've experimented with code that defers softirqs to ksoftirqd after a time
limit, as well as with deferring all softirqs to ksoftirqd. Andrea had
some ideas, as did Ingo.

Cheers.
--ro
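A quick way to watch the symptom described in this thread (no kernel changes needed) is to track how much CPU the ksoftirqd threads accumulate; a minimal sketch using standard procps output:

```shell
# List the ksoftirqd kernel threads with the CPU they are bound to
# (psr) and their CPU usage; one thread pegged near 100% of a single
# core is the pathological case discussed above.
ps -eo pid,psr,pcpu,comm | awk 'NR==1 || /ksoftirqd/'
```

(Each ksoftirqd/N thread is per-CPU, so the psr column should match the N in its name.)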

> > In summary: high softirq load can totally kill userland. The reason is
> > that do_softirq() is run from many places (hard interrupts,
> > local_bh_enable, etc.) and bypasses the ksoftirqd protection. It has
> > just been discussed at OLS with Andrea, Dipankar and others; current
> > RCU suffers from this problem as well.
>
> OK, this explanation makes sense, and from my point of view this is a
> quite critical problem if you can "crash" the Linux kernel just by
> sending enough packets to a network interface, for example.
