Spreading NIC interrupts across multiple CPUs

Aaron Seelye

Mar 26, 2014, 2:40:01 PM

I have a question regarding interrupt balancing for a NIC across CPUs.
I have a Dell R710 (dual quad core) with an embedded Broadcom 5709 that
seems to put everything on CPU0. I even threw an Intel Pro/1000 PT
in the Dell, but it shows the same problem.

For a test system, I have an HP DL360-G5 (also dual quad core) with an
embedded Broadcom 5708 that balances across all cores. I've also thrown
in an identical Intel NIC, and it seems to balance across the cores
properly. This leads me to believe that there's something wrong with my
BIOS setup, or there's something inherently wrong with the R710, though
I'm leaning towards the former, as I'm seeing this on two R710s, and
doubt I'd hit a magic breakage across two chassis.

Also, this is with no massaging on my part, both running up-to-date
Debian wheezy 7.4, with the Dell originally installed with 7.1.

My question is this: what option(s) could be present in the R710 BIOS
that would cause something like this to happen? If not the BIOS,
where/what else should I look at?

Thanks,

-Aaron

--
To UNSUBSCRIBE, email to debian-us...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: https://lists.debian.org/53331C52...@eltopia.com

Mr Queue

Mar 26, 2014, 3:10:02 PM

On Wed, 26 Mar 2014 11:28:34 -0700
Aaron Seelye <aseely...@eltopia.com> wrote:

> My question is this: what option(s) could be present in the R710 BIOS
> that would cause something like this to happen? If not the BIOS,
> where/what else should I look at?

You don't have irqbalance running by chance, do you? Because this sounds exactly like what it's designed to do.

https://github.com/Irqbalance/irqbalance

"Irqbalance is a daemon to help balance the cpu load generated by interrupts
across all of a systems cpus. Irqbalance identifies the highest volume
interrupt sources, and isolates them to a single unique cpu, so that load is
spread as much as possible over an entire processor set, while minimizing cache
hit rates for irq handlers."

--
It is wrong always, everywhere and for everyone to believe anything upon
insufficient evidence.
- W. K. Clifford, British philosopher, circa 1876

--
Archive: https://lists.debian.org/20140326140...@mrqueue.com

Aaron Seelye

Mar 26, 2014, 5:10:02 PM

I don't on either the Dell or HP. I tried it on the Dells, but it
didn't do anything on one, and just moved the interrupts from CPU0 to
CPU1 on the other.

On the HP that is balancing perfectly, I don't have the irqbalance
package installed; it just worked from the get-go.

-Aaron

On 3/26/2014 12:09 PM, Mr Queue wrote:
> On Wed, 26 Mar 2014 11:28:34 -0700
> Aaron Seelye <aseely...@eltopia.com> wrote:
>
>> My question is this: what option(s) could be present in the R710 BIOS
>> that would cause something like this to happen? If not the BIOS,
>> where/what else should I look at?
>
> You don't have irqbalance running by chance, do you? Because this sounds exactly like what it's designed to do.
>
> https://github.com/Irqbalance/irqbalance
>
> "Irqbalance is a daemon to help balance the cpu load generated by interrupts
> across all of a systems cpus. Irqbalance identifies the highest volume
> interrupt sources, and isolates them to a single unique cpu, so that load is
> spread as much as possible over an entire processor set, while minimizing cache
> hit rates for irq handlers."
>

--

Archive: https://lists.debian.org/53334186...@eltopia.com

Stan Hoeppner

Mar 26, 2014, 5:50:02 PM

On 3/26/2014 1:28 PM, Aaron Seelye wrote:
> I have a question regarding interrupt balancing for a NIC across CPUs. I
> have a Dell R710 (dual quad core) with an embedded Broadcom 5709 that seems
> to put everything on CPU0. I even threw an Intel Pro/1000 PT in the
> Dell, but it shows the same problem.
>
> For a test system, I have an HP DL360-G5 (also dual quad core) with an
> embedded Broadcom 5708 that balances across all cores. I've also thrown
> in an identical Intel NIC, and it seems to balance across the cores
> properly. This leads me to believe that there's something wrong with my
> BIOS setup, or there's something inherently wrong with the R710, though
> I'm leaning towards the former, as I'm seeing this on two R710s, and
> doubt I'd hit a magic breakage across two chassis.
>
> Also, this is with no massaging on my part, both running up-to-date
> Debian wheezy 7.4, with the Dell originally installed with 7.1.
>
> My question is this: what option(s) could be present in the R710 BIOS
> that would cause something like this to happen? If not the BIOS,
> where/what else should I look at?

Please read this for educational background, especially the Note at the
bottom of the page.

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html

Then ask an intelligent question about IRQ balancing and steering, WRT
the two specific and different hardware systems, and Debian kernel
versions, being used on each.

Cheers,

Stan

--
Archive: https://lists.debian.org/53334A40...@hardwarefreak.com

Aaron Seelye

Mar 26, 2014, 6:30:02 PM

On 3/26/2014 2:44 PM, Stan Hoeppner wrote:
>
> Please read this for educational background, especially the Note at the
> bottom of the page.
>
> https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html
>
> Then ask an intelligent question about IRQ balancing and steering, WRT
> the two specific and different hardware systems, and Debian kernel
> versions, being used on each.

I'd seen other things similar to that; however, they don't seem to get
me any closer to a solution.

The output from one of the Dell (not balanced) systems:

root@conf-2:~# uname -a
Linux conf-2 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
root@conf-2:~# grep eth /proc/interrupts
79: 704642666 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth0
root@conf-2:~# cat /proc/irq/79/smp_affinity
0000ffff
root@conf-2:~# cat /proc/irq/79/smp_affinity_list
0-15

The output from the HP (balanced) system:

root@deb-test:~# grep eth /proc/interrupts
68: 4251 4190 4212 4264 4226 4257 4251 4214 PCI-MSI-edge eth0
root@deb-test:~# cat /proc/irq/68/smp_affinity
ff
root@deb-test:~# cat /proc/irq/68/smp_affinity_list
0-7

As you can see, both systems are running identical kernels, and both
have affinity set to spread across all CPUs. However, the Dell is using
CPU0 exclusively for the ethernet device interrupts, while the HP
spreads them pretty evenly.
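
For readers comparing the two masks: the hex value in smp_affinity is a bitmap with one bit per CPU, which is why "ff" and "0-7" describe the same set. A minimal sketch of that decoding follows. The helper name mask_to_cpus is hypothetical, and the sketch assumes 64-bit shell arithmetic and a mask of at most 63 bits.

```shell
# Decode a hex smp_affinity mask (as printed by /proc/irq/N/smp_affinity)
# into a comma-separated list of the CPU numbers whose bits are set.
# Assumption: the mask fits in the shell's integer type (at most 63 bits).
mask_to_cpus() {
    # Drop the commas the kernel prints between 32-bit groups on big systems.
    hex=$(printf '%s' "$1" | tr -d ',')
    mask=$((0x$hex))
    cpu=0
    out=""
    while [ "$mask" -gt 0 ]; do
        # Test the lowest bit, then shift the mask right by one CPU.
        if [ $((mask & 1)) -eq 1 ]; then
            out="${out:+$out,}$cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    printf '%s\n' "$out"
}
```

Called as `mask_to_cpus 0000ffff` it lists CPUs 0 through 15, and `mask_to_cpus ff` lists CPUs 0 through 7, matching the smp_affinity_list output above. The mask only says where the IRQ is allowed to go, which is consistent with both machines showing a full mask while behaving differently.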

Thanks,

-Aaron

--
Archive: https://lists.debian.org/5333534A...@eltopia.com

Stan Hoeppner

Mar 27, 2014, 1:40:07 AM

On 3/26/2014 5:23 PM, Aaron Seelye wrote:
> On 3/26/2014 2:44 PM, Stan Hoeppner wrote:
>>
>> Please read this for educational background, especially the Note at the
>> bottom of the page.
>>
>> https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html
>>
>>
>> Then ask an intelligent question about IRQ balancing and steering, WRT
>> the two specific and different hardware systems, and Debian kernel
>> versions, being used on each.
>
> I'd seen other things similar to that, however, it doesn't seem to get
> me any closer to the solution.

Please post the full output of "cat /proc/interrupts" without line wrapping.

> The output from one of the Dell (not balanced) systems:
>
> root@conf-2:~# uname -a
> Linux conf-2 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
> root@conf-2:~# grep eth /proc/interrupts
> 79: 704642666 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth0
> root@conf-2:~# cat /proc/irq/79/smp_affinity
> 0000ffff
> root@conf-2:~# cat /proc/irq/79/smp_affinity_list
> 0-15

This is an 8 core machine with HT enabled, 16 logical CPUs, so right off
the bat it is dramatically different from the HP machine below as
far as the kernel is concerned and how scheduling is performed. The
current mask may or may not be correct for this configuration. I never
use HT and I can't find any docs about HT and /proc/irq/xx/smp_affinity.

If this is a production machine and you can't easily reboot it to
disable HT, first try a mask that includes only the physical CPUs and
not the logical:

~# echo ff > /proc/irq/79/smp_affinity

Assuming CPUs 0-7 are the 1st logical processor of each core, this
should schedule IRQs only on the 1st logical processor (physical
CPU) of each core. If that doesn't do the trick, reboot the box and
disable HT. If that doesn't do it, I'll dig further into the scheduler
to figure out what's going on.
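
The mask ff covers one thread per core only if the kernel numbers the first thread of every core as CPUs 0-7, which this thread does not confirm for the R710. A hedged sketch for deriving the mask from the kernel's own topology files instead: first_thread_mask is a hypothetical helper name, and it assumes the files /sys/devices/system/cpu/cpuN/topology/thread_siblings_list exist on the running kernel (worth checking before relying on it).

```shell
# Print a hex smp_affinity mask containing only the lowest-numbered
# hardware thread of each core, based on the kernel's topology files.
# $1 is the CPU sysfs directory (default: /sys/devices/system/cpu); it is
# a parameter only so the function can be tried against a copy or a mock.
first_thread_mask() {
    cpu_root=${1:-/sys/devices/system/cpu}
    mask=0
    for f in "$cpu_root"/cpu[0-9]*/topology/thread_siblings_list; do
        [ -r "$f" ] || continue
        # The file holds e.g. "0,8" or "0-1"; the first number is the
        # lowest-numbered sibling sharing that core.
        first=$(sed 's/[,-].*//' "$f")
        # Siblings report the same first number, so OR-ing repeats is harmless.
        mask=$((mask | (1 << first)))
    done
    printf '%x\n' "$mask"
}
```

Run with no argument, it reads the live system. If the siblings on a 16-thread box are numbered 0,8 / 1,9 / and so on, it prints ff; if they are numbered in adjacent pairs (0-1, 2-3, ...), it prints 5555, and in that case ff would cover both threads of only four cores.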

> The output from the HP (balanced) system:
>
> root@deb-test:~# grep eth /proc/interrupts
> 68: 4251 4190 4212 4264 4226 4257 4251 4214 PCI-MSI-edge eth0
> root@deb-test:~# cat /proc/irq/68/smp_affinity
> ff
> root@deb-test:~# cat /proc/irq/68/smp_affinity_list
> 0-7

This is an 8 core machine without HyperThreading. The mask is correct
for 8 physical CPUs. Oddly though, one box outputs the leading zeros of
the mask while the other does not. Or did you mung either output?

> As you can see, both systems are running identical kernels, and both
> have affinity set to spread across all CPUs.

The latter may not be a correct statement, as HT logical processors are
not CPUs. Also, the smp_affinity mask on the Dell implies 32
processors. Many, but not all, of the functional units are duplicated.
Just as you do not want to schedule two compute intensive tasks to both
logical processors on a core leaving the other cores idle, you also do
not want to assign any interrupts to the 2nd logical processor in
a given core. All this does is pile up context and state switches on
said core. The net effect is decreasing the overall work that can be
performed.

And to this point, it's not usually a good idea to spread interrupts
round robin from any device evenly across all cores in a system. This
is inefficient as each core must load the ISR for every interrupt. This
decreases the effectiveness of L1/L2 caches on all cores, causing
additional cache misses for other processes executing on those cores.
This is precisely why irqbalance was created.

> However, the Dell is using
> CPU0 exclusively for the ethernet device interrupts, while the HP
> spreads them pretty evenly.

This could be as simple as HT being enabled on the Dell. If not, the
contents of your /proc/interrupts files should help me narrow this down
for you.

For future reference, kernel scheduler problems such as this should be
posted on LKML, not a distro list, no matter which distro you use.
There are very few people on debian-user or any of the distro general
help lists with significant knowledge of the kernel, let alone the
scheduler. You typically get help with this kind of thing much faster,
and with more thorough knowledge transfer on LKML.

Cheers,

Stan

--
Archive: https://lists.debian.org/5333B78C...@hardwarefreak.com

Aaron Seelye

Mar 28, 2014, 1:40:03 PM

On 3/26/2014 10:30 PM, Stan Hoeppner wrote:
> This is an 8 core machine with HT enabled, 16 logical CPUs, so right off
> the bat it is dramatically different from the HP machine below as
> far as the kernel is concerned and how scheduling is performed. The
> current mask may or may not be correct for this configuration. I never
> use HT and I can't find any docs about HT and /proc/irq/xx/smp_affinity.

Agreed on the docs; finding any was nigh impossible. I found a way to
offload the traffic for that server, made a few changes to the BIOS
(C-states, HT, etc.), and booted it back up. It didn't seem to change
much in how the interrupts were spread, but that's fine.

> And to this point, it's not usually a good idea to spread interrupts
> round robin from any device evenly across all cores in a system. This
> is inefficient as each core must load the ISR for every interrupt. This
> decreases the effectiveness of L1/L2 caches on all cores, causing
> additional cache misses for other processes executing on those cores.
> This is precisely why irqbalance was created.

A couple of things on this. I did see what you're talking about WRT
spreading the interrupts across the processors. However, I did notice
one thing: irqbalance is set to specifically exempt ethernet/network
interfaces from its balancing. I'm not sure if that's to make sure what I
was seeing with the HP system doesn't inadvertently happen, or to make
sure the queues all stay on the same processor. This leads me to
my next question: in the case of a NIC with multiple queues, should all
queues for a given interface be on a single CPU (actual CPU, not HT)?
(Answered below.)

>
>> However, the Dell is using
>> CPU0 exclusively for the ethernet device interrupts, while the HP
>> spreads them pretty evenly.
>
> This could be as simple as HT being enabled on the Dell. If not, the
> contents of your /proc/interrupts files should help me narrow this down
> for you.

Unfortunately, it didn't change anything on the Dell; no idea why. It
could be as simple as driver differences between the 5708 and 5709.

Looking at
https://we.riseup.net/riseup+tech/balancing-hardware-interrupts and more
specifically
http://www.alexonlinux.com/msi-x-the-right-way-to-spread-interrupt-load,
it looks like the queues enabled on the 5709 (which is on the Dell)
would enable me to manually balance the queues across multiple cores
without problems. I'd been under the impression that MSI-X was what was
to blame for the HP spreading things about, but I see that's not the case.

So far, after a little under one day including a typical peak load, it
looks like this was rather successful: I hit normal traffic patterns
without dropping any outbound packets.
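
The thread does not show the exact commands used, so what follows is only a sketch of the kind of manual per-queue assignment discussed above: each queue IRQ of an interface gets a single-bit mask written to /proc/irq/N/smp_affinity, with CPUs handed out round-robin. The helper name plan_queue_affinity is hypothetical, the queue names (eth0-0, eth0-1, ...) are an assumption about how the driver labels its vectors, and the function only prints the commands rather than running them.

```shell
# Read /proc/interrupts-style text on stdin and print one
# "echo MASK > /proc/irq/N/smp_affinity" command for every IRQ whose name
# starts with the given interface, assigning CPUs round-robin from CPU0.
# Nothing is written; review the output, then run it as root if it fits.
# Assumption: at most 32 CPUs, so each mask is a single hex word.
plan_queue_affinity() {
    iface=$1    # interface name prefix, e.g. eth0
    ncpus=$2    # how many CPUs to spread the queues over
    awk -v iface="$iface" -v ncpus="$ncpus" '
        # The last field is the IRQ name; queue naming is driver-specific.
        index($NF, iface) == 1 {
            irq = $1
            sub(/:$/, "", irq)            # "79:" becomes "79"
            printf "echo %x > /proc/irq/%s/smp_affinity\n", 2 ^ (n % ncpus), irq
            n++
        }
    '
}
```

For example, `plan_queue_affinity eth0 8 < /proc/interrupts` prints one line per matching IRQ. Spreading over 8 CPUs is an arbitrary choice for illustration; the thread leaves open which CPUs are the best targets.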

>
> For future reference, kernel scheduler problems such as this should be
> posted on LKML, not a distro list, no matter which distro you use.
> There are very few people on debian-user or any of the distro general
> help lists with significant knowledge of the kernel, let alone the
> scheduler. You typically get help with this kind of thing much faster,
> and with more thorough knowledge transfer on LKML.

Will do. I'm sorry, but I thought this would have been a pretty
standard question for anyone operating in a production environment where
100k pps is typical (at least, that's what set it off for me). Either
way, I've definitely learned a lot more about this sort of thing and
have a solution that seems to be working well without any real hocus
pocus going on. Thank you for steering me in the right direction.

-Aaron

--
Archive: https://lists.debian.org/5335B242...@eltopia.com