
automatic interrupt affinity for MSI/MSI-X capable devices V2


Christoph Hellwig
Jun 14, 2016, 4:00:06 PM
This series enhances the irq and PCI code to allow spreading around MSI and
MSI-X vectors so that they have per-cpu affinity if possible, or at least
per-node. For that it takes the algorithm from blk-mq, moves it to
a common place, and makes it available through a vastly simplified PCI
interrupt allocation API. It then switches blk-mq to be able to pick up
the queue mapping from the device if available, and demonstrates all this
using the NVMe driver.
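At its core, the spreading this series implements is a round-robin walk over the device's affinity mask, wrapping back to the first CPU once vectors outnumber CPUs (this is the cpumask_next()/cpumask_first() loop visible in msix_setup_entries() in the patch below). A minimal user-space sketch of that walk, with a plain bitmask standing in for struct cpumask and all function names hypothetical:

```c
#include <assert.h>

/* Find the next set bit strictly after 'prev' in 'mask'; -1 if none.
 * Stands in for cpumask_next(); one 64-bit word covers the CPUs here. */
static int next_cpu(unsigned long long mask, int prev)
{
	for (int cpu = prev + 1; cpu < 64; cpu++)
		if (mask & (1ULL << cpu))
			return cpu;
	return -1;
}

/* Assign nvec vectors round-robin over the CPUs set in 'mask',
 * wrapping to the first set CPU when the mask is exhausted, the
 * same walk msix_setup_entries() does in the patch. */
static void spread_vectors(unsigned long long mask, int nvec, int *vec_to_cpu)
{
	int cpu = -1;

	for (int i = 0; i < nvec; i++) {
		cpu = next_cpu(mask, cpu);
		if (cpu < 0)		/* wrapped: restart at the first CPU */
			cpu = next_cpu(mask, -1);
		vec_to_cpu[i] = cpu;
	}
}
```

With CPUs 0-3 in the mask and six vectors, this yields the assignment 0, 1, 2, 3, 0, 1: vectors beyond the CPU count simply wrap around.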

There is also a git tree available at:

git://git.infradead.org/users/hch/block.git

Gitweb:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/msix-spreading.4

Changes since V1:
- irq core improvements to properly assign the affinity before
request_irq (tglx)
- better handling of the MSI vs MSI-X differences in the low level
MSI allocator (hch and tglx)
- various improvements to pci_alloc_irq_vectors (hch)
- removed the blk-mq hardware queue reassignment on CPU hotplug events (hch)
- forward ported to Jens' current for-next tree (hch)

Christoph Hellwig
Jun 14, 2016, 4:10:07 PM
Set the affinity_mask before allocating vectors.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
drivers/pci/msi.c | 26 ++++++++++++++++++++++++--
include/linux/pci.h | 1 +
2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index a33adec..50d694c 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -568,6 +568,7 @@ static struct msi_desc *msi_setup_entry(struct pci_dev *dev, int nvec)
entry->msi_attrib.multi_cap = (control & PCI_MSI_FLAGS_QMASK) >> 1;
entry->msi_attrib.multiple = ilog2(__roundup_pow_of_two(nvec));
entry->nvec_used = nvec;
+ entry->affinity = dev->irq_affinity;

if (control & PCI_MSI_FLAGS_64BIT)
entry->mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
@@ -679,10 +680,18 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
struct msix_entry *entries, int nvec)
{
+ const struct cpumask *mask = NULL;
struct msi_desc *entry;
- int i;
+ int cpu = -1, i;

for (i = 0; i < nvec; i++) {
+ if (dev->irq_affinity) {
+ cpu = cpumask_next(cpu, dev->irq_affinity);
+ if (cpu >= nr_cpu_ids)
+ cpu = cpumask_first(dev->irq_affinity);
+ mask = cpumask_of(cpu);
+ }
+
entry = alloc_msi_entry(&dev->dev);
if (!entry) {
if (!i)
@@ -699,6 +708,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
entry->msi_attrib.default_irq = dev->irq;
entry->mask_base = base;
entry->nvec_used = 1;
+ entry->affinity = mask;

list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
}
@@ -1176,12 +1186,20 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
{
unsigned int vecs, i;
u32 *irqs;
+ int ret;

max_vecs = min(max_vecs, pci_nr_irq_vectors(dev));

+ ret = irq_create_affinity_mask(&dev->irq_affinity, &max_vecs);
+ if (ret)
+ return ret;
+ if (max_vecs < min_vecs)
+ return -ENOSPC;
+
+ ret = -ENOMEM;
irqs = kcalloc(max_vecs, sizeof(u32), GFP_KERNEL);
if (!irqs)
- return -ENOMEM;
+ goto out_free_affinity;

if (!(flags & PCI_IRQ_NOMSIX)) {
vecs = pci_enable_msix_range_wrapper(dev, irqs, min_vecs,
@@ -1208,6 +1226,10 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
done:
dev->irqs = irqs;
return vecs;
+out_free_affinity:
+ kfree(dev->irq_affinity);
+ dev->irq_affinity = NULL;
+ return ret;
}
EXPORT_SYMBOL(pci_alloc_irq_vectors);

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 84a20fc..f474611 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -321,6 +321,7 @@ struct pci_dev {
*/
unsigned int irq;
unsigned int *irqs;
+ struct cpumask *irq_affinity;
struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */

bool match_driver; /* Skip attaching driver */
--
2.1.4

Christoph Hellwig
Jun 16, 2016, 11:30:04 AM
On Thu, Jun 16, 2016 at 11:45:55AM +0200, Bart Van Assche wrote:
> Is my interpretation correct that for an adapter that supports two
> interrupts and on a system with eight CPU cores and no hyperthreading this
> patch series will assign interrupt vector 0 to CPU 0 and interrupt vector 1
> to CPU 1?

Yes - same as the existing blk-mq queue distribution.

> Are you aware that drivers like ib_srp assume that interrupts
> have been spread evenly, that means assigning vector 0 to CPU 0 and vector
> 1 to CPU 4?

That will make them conflict with the current blk-mq assignment, and it is exactly why we're trying to move things to a core location: so that everyone can use the same mapping.
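The two layouts under discussion, consecutive assignment as in this series and blk-mq, versus the even spreading Bart says ib_srp assumes, can be contrasted in a few lines. A hypothetical sketch (not kernel code, and only one plausible reading of ib_srp's convention):

```c
#include <assert.h>

/* Consecutive assignment, as in this series and blk-mq:
 * vector i lands on CPU i, modulo the number of CPUs. */
static int consecutive_cpu(int vec, int ncpus)
{
	return vec % ncpus;
}

/* Even spreading at equal strides across the CPUs, the layout
 * Bart describes for ib_srp (assumes nvec divides ncpus). */
static int spread_cpu(int vec, int nvec, int ncpus)
{
	return vec * (ncpus / nvec);
}
```

For two vectors on eight cores, consecutive assignment gives CPUs 0 and 1, while the strided spread gives CPUs 0 and 4, matching the numbers in the exchange above.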

Alexander Gordeev
Jun 25, 2016, 4:30:05 PM
dev->irq_affinity = irq_create_affinity_mask(&max_vecs); ?


> + if (ret)
> + return ret;
> + if (max_vecs < min_vecs)
> + return -ENOSPC;

irq_create_affinity_mask() should be called after MSI-X/MSI is enabled,
because we do not know the number of vectors before the range functions
return it.

Since the affinity mask is a function of the number of vectors and the CPU
topology, the resulting masks might turn out suboptimal in the general case
(and this code is supposed to be general, right?).

I.e. irq_create_affinity_mask() could decide on "per-first-sibling"
spreading given the number of available vectors, while only a subset of
the MSI vectors is actually allocated. For that subset a "per-core"
affinity mask could have been initialized, but we will still go with
"per-first-sibling".
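Alexander's ordering concern can be made concrete with a toy model. The sketch below is hypothetical (the thresholds, the 4-core/2-sibling topology, and all names are illustrative, not the kernel's actual heuristic): a granularity is picked from the vector count, so picking it from the pre-allocation count can leave the granted subset with a stale choice.

```c
#include <assert.h>

/* Spreading granularity picked from the vector count, in the spirit of
 * irq_create_affinity_mask(): enough vectors for every CPU -> per-cpu,
 * enough for every core -> one vector per first sibling, else coarser. */
enum spread { PER_CPU, PER_FIRST_SIBLING, PER_NODE };

static enum spread pick_spread(int nvec, int ncpus, int ncores)
{
	if (nvec >= ncpus)
		return PER_CPU;
	if (nvec >= ncores)
		return PER_FIRST_SIBLING;
	return PER_NODE;
}
```

On a hypothetical 8-CPU, 4-core box: deciding from max_vecs = 8 picks PER_CPU, but if the range function then grants only 4 vectors, PER_FIRST_SIBLING would have been the matching choice, yet the mask computed up front is the one that sticks.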

Alexander Gordeev
Jun 26, 2016, 3:40:06 PM
On Tue, Jun 14, 2016 at 09:58:53PM +0200, Christoph Hellwig wrote:
> This series enhances the irq and PCI code to allow spreading around MSI and
> MSI-X vectors so that they have per-cpu affinity if possible, or at least
> per-node. For that it takes the algorithm from blk-mq, moves it to
> a common place, and makes it available through a vastly simplified PCI
> interrupt allocation API. It then switches blk-mq to be able to pick up
> the queue mapping from the device if available, and demonstrates all this
> using the NVMe driver.

Hi Christoph,

One general comment. As a result of this series there will be
three locations that store or point to affinities: the IRQ descriptor,
the MSI descriptor and the PCI device descriptor.

The IRQ and MSI descriptors merely refer to duplicate masks, while
the PCI device mask is the union of all of its MSI interrupts' masks.

Besides, the MSI descriptor and PCI device affinity masks are used
only once, at MSI initialization.

Overall, it looks like some cleanup is possible here.