The uplink is six gigabit Intel cards bonded together using the 802.3ad
algorithm with xmit_hash_policy set to layer3+4. On the other side is a
Cisco 2960 switch. The machine has two quad-core Intel Xeons @ 2.33GHz.
Here is a screenshot of the "top" command. The described behavior
has nothing to do with the 13% io-wait; it happens even at 0%
io-wait.
http://www.titov.net/misc/top-snap.png
kernel configuration:
http://www.titov.net/misc/config.gz
/proc/interrupts, lspci, dmesg (nothing interesting there), ifconfig,
uname -a:
http://www.titov.net/misc/misc.txt.gz
Is it a Linux bug or some hardware limitation?
Regards,
Anton Titov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
This is possibly due to some missing parameters when loading your e1000 drivers.
e1000 NICs support interrupt rate limiting, which proves very
efficient in cases such as yours. I usually limit them to about
5k ints/s. Do a "modinfo e1000" to get the parameter name; I don't
have it quite right in mind.
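For reference, the parameter names can be listed straight from the module (a
sketch; the exact list depends on the driver version):

```shell
# List the e1000 module parameters and their descriptions; the interrupt
# moderation tunable is among them.
modinfo -p e1000
```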
Also, I've CCed linux-net.
Regards,
Willy
# cat /proc/interrupts
           CPU0        CPU1        CPU2        CPU3        CPU4        CPU5        CPU6        CPU7
  0:        342         261         258         278         271         253         264         283   IO-APIC-edge      timer
  1:          0           0           1           0           1           0           0           0   IO-APIC-edge      i8042
  6:          0           1           0           1           0           0           1           0   IO-APIC-edge      floppy
  9:          0           0           0           0           0           0           0           0   IO-APIC-fasteoi   acpi
 12:          1           1           0           0           0           1           1           0   IO-APIC-edge      i8042
 17:        180         190         178         183         182         186         186         188   IO-APIC-fasteoi   uhci_hcd:usb1, ehci_hcd:usb4
 18:     843504      842514      843653      842033      842416      842742      841903      842960   IO-APIC-fasteoi   3w-9xxx, uhci_hcd:usb3
 19:          0           0           0           0           0           0           0           0   IO-APIC-fasteoi   uhci_hcd:usb2
498:  534642903   534635899   534726883   534732377   534701710   534708588   534730550   534742730   PCI-MSI-edge      eth5
499:  531832274   531846609   531917849   531942676   531855140   531850692   531885565   531863468   PCI-MSI-edge      eth4
500:  487251627   487279206   487248030   487220044   487239637   487231454   487281672   487227202   PCI-MSI-edge      eth3
501:  486083953   486062203   486109925   486075793   486036977   486035152   486097551   486117164   PCI-MSI-edge      eth2
502:  528889380   528863624   528760188   528798619   528891886   528890760   528807939   528822746   PCI-MSI-edge      eth1
503:  529043135   529056706   528980250   528975209   529018995   529027386   528941583   528970472   PCI-MSI-edge      eth0
NMI:          0           0           0           0           0           0           0           0   Non-maskable interrupts
LOC:   62893699    62809502    62744208    62746035    62708815    62709055    62739182    62620363   Local timer interrupts
RES:   15454866    15827970    16235695    15386970    15761053    16097167    16190851    16159843   Rescheduling interrupts
CAL:         85          98          85          84          98          93          94          91   function call interrupts
TLB:    3565361     3561798     3570271     3566272     3556996     3555866     3578257     3564557   TLB shootdowns
TRM:          0           0           0           0           0           0           0           0   Thermal event interrupts
THR:          0           0           0           0           0           0           0           0   Threshold APIC interrupts
SPU:          0           0           0           0           0           0           0           0   Spurious interrupts
Yikes! All wrong!
The network IRQs are being ping-ponged around all the cores! Bad!
1) turn the in-kernel IRQBALANCE option off!
2) use either the userspace `irqbalance` daemon, or
3) set smp_affinity manually
Auke
> 2) use either the userspace `irqbalance` daemon or
> 3) set smp_affinity manually
I tried echoing 3 (assuming that CPU0 and CPU1 will share their cache,
as advised in other mails) into the smp_affinity of all ethX interrupts,
and no positive result was observed.
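For reference, the smp_affinity value is a hexadecimal CPU bitmask, so
"echoing 3" selects CPU0 and CPU1. A mask for an arbitrary CPU set can be
computed like this (a sketch; the helper name and CPU layouts are illustrative):

```shell
# Build a /proc/irq/<n>/smp_affinity hex mask from a list of CPU numbers.
cpu_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 0 1        # prints: 3   (CPU0 + CPU1, the mask echoed above)
cpu_mask 0 2 4 6    # prints: 55  (every other core, a hypothetical layout)

# Applied to one of the NIC interrupts from the dump earlier in the thread:
#   echo 3 > /proc/irq/498/smp_affinity
```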
I will try disabling NAPI and limiting e1000 interrupts tomorrow.
But have you disabled irqbalance before doing this? (you must reboot
and pass "noirqbalance" on the command line for this).
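For reference, a way to confirm the flag took effect after rebooting (a
sketch; the bootloader path is an assumption):

```shell
# "noirqbalance" must be on the kernel command line, e.g. in the
# bootloader config:  kernel /vmlinuz root=/dev/sda1 noirqbalance
# After rebooting, check whether the running kernel actually saw the flag:
grep -q noirqbalance /proc/cmdline \
    && echo "in-kernel irq balancing disabled" \
    || echo "noirqbalance not set"
```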
Also, if you are running on quad-core Intel CPUs, I'm told that they're
simply two standard dual-core CPUs in the same package, so cache is only
shared within each pair of cores, not across all of them. You should try
to assign all irqs to CPU0 for a test. It *must* make a difference, in
either direction.
> I will try disabling NAPI and limiting e1000 interrupts tomorrow.
I found the parameter name I was speaking about: InterruptThrottleRate.
Beware, it's an array with one entry per NIC, so you have to set as many
values as you have NICs. I have always observed huge performance boosts
when using the tunables the driver provides.
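For the six-port machine in question that might look like this (a sketch,
assuming all six interfaces are driven by e1000):

```shell
# One comma-separated value per port handled by the driver, ~5000 ints/s each.
rmmod e1000
modprobe e1000 InterruptThrottleRate=5000,5000,5000,5000,5000,5000

# Or persistently, in /etc/modprobe.conf:
#   options e1000 InterruptThrottleRate=5000,5000,5000,5000,5000,5000
```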
yes, I really don't see this as such an amazing discovery - the in-kernel
irqbalance code is totally wrong for network interrupts (and probably for most
interrupts).
on your system with 6 network interrupts it blows chunks, and it's not NAPI that
is the issue - NAPI will work just fine on its own. By disabling NAPI and
reverting to the in-driver irq moderation code you've effectively sidelined the
in-kernel irqbalance code, and this is what makes it work again.
It's not the right solution.
We keep seeing this exact issue pop up everywhere - especially with e1000(e)
datacenter users - this code _has_ to go or be fixed. Since there is a perfectly
viable solution, I strongly suggest disabling it.
This is not the first time I've sent this patch out in some form...
Auke
---
[X86] IRQBALANCE: Mark as BROKEN and disable by default
The IRQBALANCE option causes interrupts to bounce all around on SMP systems
quickly burying the CPU in migration cost and cache misses. Mainly affected are
network interrupts and this results in one CPU pegged in softirqd completely.
Disable this option and provide documentation pointing to a better solution
(the userspace irqbalance daemon does the best overall job to begin with;
only manual setting of smp_affinity will beat it).
Signed-off-by: Auke Kok <auke-ja...@intel.com>
---
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6c70fed..956aa22 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1026,13 +1026,17 @@ config EFI
platforms.
config IRQBALANCE
- def_bool y
+ def_bool n
prompt "Enable kernel irq balancing"
- depends on X86_32 && SMP && X86_IO_APIC
+ depends on X86_32 && SMP && X86_IO_APIC && BROKEN
help
The default yes will allow the kernel to do irq load balancing.
Saying no will keep the kernel from doing irq load balancing.
+ This option is known to cause performance issues on SMP
+ systems. The preferred method is to use the userspace
+ 'irqbalance' daemon instead. See http://irqbalance.org/.
+
config SECCOMP
def_bool y
prompt "Enable seccomp to safely compute untrusted bytecode"
> [X86] IRQBALANCE: Mark as BROKEN and disable by default
>
> The IRQBALANCE option causes interrupts to bounce all around on SMP systems
> quickly burying the CPU in migration cost and cache misses. Mainly affected
> are network interrupts and this results in one CPU pegged in softirqd
> completely.
If this is the problem, maybe it would help to only balance the IRQs every
ten seconds or so? Unfortunately I have no SMP system to try it out.
For example:
Router-KARAM ~ # cat /proc/interrupts
CPU0 CPU1
0: 87956938 1403052485 IO-APIC-edge timer
1: 0 2 IO-APIC-edge i8042
9: 0 0 IO-APIC-fasteoi acpi
19: 140 5714 IO-APIC-fasteoi ohci_hcd:usb1, ohci_hcd:usb2
24: 675673280 1186506694 IO-APIC-fasteoi eth2
26: 717865662 2201633562 IO-APIC-fasteoi eth0
27: 1869190 23075556 IO-APIC-fasteoi eth1
NMI: 0 0 Non-maskable interrupts
LOC: 1403052485 87956683 Local timer interrupts
RES: 75059 25408 Rescheduling interrupts
CAL: 99542 83 function call interrupts
TLB: 616 200 TLB shootdowns
TRM: 0 0 Thermal event interrupts
SPU: 0 0 Spurious interrupts
ERR: 0
MIS: 0
sunfire-1 ~ # cat config|grep -i irq
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
# CONFIG_IRQBALANCE is not set
CONFIG_HT_IRQ=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_DEBUG_SHIRQ is not set
Is it harmful too?
--
------
Technical Manager
Virtual ISP S.A.L.
Lebanon
> [X86] IRQBALANCE: Mark as BROKEN and disable by default
>
> The IRQBALANCE option causes interrupts to bounce all around on SMP systems
> quickly burying the CPU in migration cost and cache misses. Mainly affected are
> network interrupts and this results in one CPU pegged in softirqd completely.
>
> Disable this option and provide documentation to a better solution (userspace
> irqbalance daemon does overall the best job to begin with and only manual setting
> of smp_affinity will beat it).
>
> Signed-off-by: Auke Kok <auke-ja...@intel.com>
>
> ---
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 6c70fed..956aa22 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1026,13 +1026,17 @@ config EFI
> platforms.
>
> config IRQBALANCE
> - def_bool y
> + def_bool n
ACK.
> prompt "Enable kernel irq balancing"
> - depends on X86_32 && SMP && X86_IO_APIC
> + depends on X86_32 && SMP && X86_IO_APIC && BROKEN
This is wrong. irqbalance works, there's nothing wrong with it; but it
has nasty side effects.
> help
> The default yes will allow the kernel to do irq load balancing.
> Saying no will keep the kernel from doing irq load balancing.
>
> + This option is known to cause performance issues on SMP
> + systems. The preferred method is to use the userspace
> + 'irqbalance' daemon instead. See http://irqbalance.org/.
> +
ACK.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> We keep seeing this exact issue pop up everywhere - especially with
> e1000(e) datacenter users - this code _has_ to go or be fixed. Since
> there is a perfectly viable solution, I strongly suggest disabling it.
strongly agreed. Thanks Auke, applied.
Ingo
Be it kernel or user space, for consistent benchmark results it needs to
be able to be turned off without touching the code. That leaves me in
agreement with Stephen that if it must exist, the user-space one would
be preferable. It can be easily terminated with extreme prejudice.
rick jones
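For what it's worth, "extreme prejudice" on a live system is as simple as
this (a sketch):

```shell
# Stop the userspace daemon without rebooting; IRQs then stay wherever
# they were last placed, which gives stable placement for benchmarking.
killall irqbalance
```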
ok, I'm fine with taking that part out of the patch.
Ingo, want me to send an updated patch?
>
>> help
>> The default yes will allow the kernel to do irq load balancing.
>> Saying no will keep the kernel from doing irq load balancing.
>>
>> + This option is known to cause performance issues on SMP
>> + systems. The preferred method is to use the userspace
>> + 'irqbalance' daemon instead. See http://irqbalance.org/.
>> +
>
> ACK.
>
--
excellent, ignore my other reply to Pavel - I didn't see this reply yet :)
Thanks Ingo
Auke
> Ingo Molnar wrote:
>> * Kok, Auke <auke-ja...@intel.com> wrote:
>>
>>> We keep seeing this exact issue pop up everywhere - especially with
>>> e1000(e) datacenter users - this code _has_ to go or be fixed. Since
>>> there is a perfectly viable solution, I strongly suggest disabling it.
>>
>> strongly agreed. Thanks Auke, applied.
>>
>> Ingo
>
>
> excellent, ignore my other reply to Pavel - I didn't see this reply yet :)
Shouldn't you just add it to the feature-removal list too and then remove it
quickly? No need to keep disabled, known-to-be-wrong code around.
-Andi
> > > [X86] IRQBALANCE: Mark as BROKEN and disable by default
> > >
> > > The IRQBALANCE option causes interrupts to bounce all around on SMP
> > > systems quickly burying the CPU in migration cost and cache misses.
> > > Mainly affected are network interrupts and this results in one CPU
> > > pegged in softirqd completely.
> >
> > If this is the problem, maybe it would help to only balance the IRQs
> > every ten seconds or so? Unfortunately I have no SMP system to try it out.
>
> Be it kernel or user space, for consistent benchmark results it needs to be
> able to be turned off without touching the code. That leaves me in agreement
> with Stephen that if it must exist, the user-space one would be preferable.
> It can be easily terminated with extreme prejudice.
I agree that having a full-featured userspace balancer daemon with lots of
intelligence will be theoretically better, but if you can have a simple
daemon doing OK on many machines for less than the userspace daemon's
kernel stack, why not?
--
Funny quotes:
31. Why do "overlook" and "oversee" mean opposite things?
Perhaps my judgement is too colored by benchmark(et)ing, and the desire to
have repeatable results on things like netperf, but I very much like to
know where my interrupts are going and don't like them moving around.
That is why I am not particularly fond of either flavor of irq balancing.
That being the case, whatever is out there ought to be able to be
disabled on a running system without having to roll bits or reboot.
rick jones
> > > Be it kernel or user space, for consistent benchmark results it needs to
> > > be able to be turned off without touching the code. That leaves me in
> > > agreement with Stephen that if it must exist, the user-space one would be
> > > preferable. It can be easily terminated with extreme prejudice.
> >
> >
> > I agree that having a full-featured userspace balancer daemon with lots of
> > intelligence will be theoretically better, but if you can have a simple
> > daemon doing OK on many machines for less than the userspace daemon's
> > kernel stack, why not?
>
> Perhaps my judgement is too colored by benchmark(et)ing, and desires to have
> repeatable results on things like netperf, but I very much like to know where
> my interrupts are going and don't like them moving around. That is why I am
> not particularly fond of either flavor of irq balancing.
>
> That being the case, whatever is out there ought to be able to be disabled on
> a running system without having to roll bits or reboot.
Adding a "module" parameter to disable it should be cheap, shouldn't it?
--
Top 100 things you don't want the sysadmin to say:
34. The network's down, but we're working on it. Come back after diner.
(Usually said at 2200 the night before thesis deadline... )
Except the irq balancing is system-wide. Adding per-device exemptions to an
obsolete feature seems like the wrong way to go.
-- Chris
Since you're changing the default setting, shouldn't the above be
changed to:
Saying yes will allow the kernel to do irq load balancing.
The default no will keep the kernel from doing irq load balancing.
> + This option is known to cause performance issues on SMP
> + systems. The preferred method is to use the userspace
> + 'irqbalance' daemon instead. See http://irqbalance.org/.
> +
> config SECCOMP
> def_bool y
> prompt "Enable seccomp to safely compute untrusted bytecode"
-Bill
No, not a per-device-exemption. My reasoning was: If the IRQ balancer
bounces the IRQ too often, doing it less often seems to be the correct
solution. One cache miss each ten seconds sounds like it should be OK.
As said before, I can't verify this theory.
this is exactly what the userspace irqbalance does, and it's even optimized to
not do those migrations every 10 seconds if things look OK. From that
perspective, it's definitely more mature, and it's maintained as well.
Auke