[patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 3


Thomas Gleixner

Nov 26, 2021, 8:22:30 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
This is the third part of the [PCI]MSI refactoring which aims to provide the
ability to expand MSI-X vectors after MSI-X has been enabled.

The first two parts of this work can be found here:

https://lore.kernel.org/r/202111262227...@linutronix.de
https://lore.kernel.org/r/202111262241...@linutronix.de

This third part has the following important changes:

1) Add locking to protect the MSI descriptor storage

Right now the MSI descriptor storage (a linked list) is not protected
by anything, under the assumption that the list is installed before
use and destroyed after use. As this is about to change, the storage
has to be protected.

2) A new set of iterators which allow filtering on the state of the
descriptors, namely whether a descriptor is associated with a Linux
interrupt or not.

This cleans up a lot of use cases which have to do this filtering
manually (see the sketch after this list).

3) A new set of MSI descriptor allocation functions which make the usage
sites simpler and confine the storage handling to the core code.

Trivial MSI descriptors (non-PCI) are now allocated by the core code
automatically when the underlying irq domain requests that.

4) Rework of sysfs handling to prepare for dynamic extension of MSI-X

The current mechanism which creates the directory and the attributes
for all MSI descriptors in one go is obviously not suitable for
dynamic extension. The rework splits the directory creation out and
lets the MSI interrupt allocation create the per descriptor
attributes.

5) Conversion of the MSI descriptor storage to xarray

The linked list based storage is suboptimal even without dynamic
expansion as it requires full list walks to get to a specific
descriptor. With dynamic expansion this gets even more
convoluted. Xarray is way more suitable and simplifies the
final goal of dynamic expansion of the MSI-X space.
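
As a rough illustration of items 1 and 2, a filtered walk over a device's
MSI descriptors ends up looking like this (hypothetical driver code, not
part of the series; do_something() is made up):

	struct msi_desc *desc;

	msi_lock_descs(dev);
	/* Visit only descriptors which have a Linux interrupt assigned */
	msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED)
		do_something(desc->irq);
	msi_unlock_descs(dev);

This replaces the open coded 'if (!desc->irq) continue;' checks which are
scattered over the usage sites today.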

This third series is based on:

git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-2

and also available from git:

git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-3

For the curious who can't wait for the next part to arrive, the full series
is available via:

git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-4

Thanks,

tglx
---
.clang-format | 1
arch/powerpc/platforms/4xx/hsta_msi.c | 7
arch/powerpc/platforms/cell/axon_msi.c | 7
arch/powerpc/platforms/pasemi/msi.c | 9
arch/powerpc/sysdev/fsl_msi.c | 8
arch/powerpc/sysdev/mpic_u3msi.c | 9
arch/s390/pci/pci_irq.c | 6
arch/x86/pci/xen.c | 14
drivers/base/core.c | 3
drivers/base/platform-msi.c | 110 -----
drivers/bus/fsl-mc/fsl-mc-msi.c | 61 --
drivers/ntb/msi.c | 19
drivers/pci/controller/pci-hyperv.c | 15
drivers/pci/msi/irqdomain.c | 11
drivers/pci/msi/legacy.c | 20
drivers/pci/msi/msi.c | 255 +++++------
drivers/pci/xen-pcifront.c | 2
drivers/soc/ti/ti_sci_inta_msi.c | 77 +--
include/linux/device.h | 4
include/linux/msi.h | 135 +++++-
include/linux/soc/ti/ti_sci_inta_msi.h | 1
kernel/irq/msi.c | 719 ++++++++++++++++++++++-----------
22 files changed, 841 insertions(+), 652 deletions(-)


Thomas Gleixner

Nov 26, 2021, 8:22:31 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Move the MSI descriptor list out of struct device and into struct
msi_device_data, which is allocated on demand: the list is only required
when MSI is in use.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/base/core.c | 3 ---
include/linux/device.h | 4 ----
include/linux/msi.h | 4 +++-
kernel/irq/msi.c | 5 ++++-
4 files changed, 7 insertions(+), 9 deletions(-)

--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2874,9 +2874,6 @@ void device_initialize(struct device *de
INIT_LIST_HEAD(&dev->devres_head);
device_pm_init(dev);
set_dev_node(dev, NUMA_NO_NODE);
-#ifdef CONFIG_GENERIC_MSI_IRQ
- INIT_LIST_HEAD(&dev->msi_list);
-#endif
INIT_LIST_HEAD(&dev->links.consumers);
INIT_LIST_HEAD(&dev->links.suppliers);
INIT_LIST_HEAD(&dev->links.defer_sync);
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -422,7 +422,6 @@ struct dev_msi_info {
* @em_pd: device's energy model performance domain
* @pins: For device pin management.
* See Documentation/driver-api/pin-control.rst for details.
- * @msi_list: Hosts MSI descriptors
* @numa_node: NUMA node this device is close to.
* @dma_ops: DMA mapping operations for this device.
* @dma_mask: Dma mask (if dma'ble device).
@@ -518,9 +517,6 @@ struct device {
struct dev_pin_info *pins;
#endif
struct dev_msi_info msi;
-#ifdef CONFIG_GENERIC_MSI_IRQ
- struct list_head msi_list;
-#endif
#ifdef CONFIG_DMA_OPS
const struct dma_map_ops *dma_ops;
#endif
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -145,12 +145,14 @@ struct msi_desc {
* @properties: MSI properties which are interesting to drivers
* @attrs: Pointer to the sysfs attribute group
* @platform_data: Platform-MSI specific data
+ * @list: List of MSI descriptors associated to the device
*/
struct msi_device_data {
raw_spinlock_t lock;
unsigned long properties;
const struct attribute_group **attrs;
struct platform_msi_priv_data *platform_data;
+ struct list_head list;
};

int msi_setup_device_data(struct device *dev);
@@ -187,7 +189,7 @@ static inline unsigned int msi_get_virq(

/* Helpers to hide struct msi_desc implementation details */
#define msi_desc_to_dev(desc) ((desc)->dev)
-#define dev_to_msi_list(dev) (&(dev)->msi_list)
+#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
#define first_msi_entry(dev) \
list_first_entry(dev_to_msi_list((dev)), struct msi_desc, list)
#define for_each_msi_entry(desc, dev) \
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -87,7 +87,9 @@ EXPORT_SYMBOL_GPL(get_cached_msi_msg);

static void msi_device_data_release(struct device *dev, void *res)
{
- WARN_ON_ONCE(!list_empty(&dev->msi_list));
+ struct msi_device_data *md = res;
+
+ WARN_ON_ONCE(!list_empty(&md->list));
dev->msi.data = NULL;
}

@@ -113,6 +115,7 @@ int msi_setup_device_data(struct device
return -ENOMEM;

raw_spin_lock_init(&md->lock);
+ INIT_LIST_HEAD(&md->list);
dev->msi.data = md;
devres_add(dev, md);
return 0;

Thomas Gleixner

Nov 26, 2021, 8:22:32 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
For upcoming runtime extensions of MSI-X interrupts it's required to
protect the MSI descriptor list. Add a mutex to struct msi_device_data and
provide lock/unlock functions.
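
A minimal sketch of the resulting usage pattern (the operation between the
calls is hypothetical):

	msi_lock_descs(dev);
	/* The descriptor storage cannot change behind our back here */
	inspect_or_modify_descriptors(dev);
	msi_unlock_descs(dev);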

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 6 ++++++
kernel/irq/msi.c | 25 +++++++++++++++++++++++++
2 files changed, 31 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -3,6 +3,7 @@
#define LINUX_MSI_H

#include <linux/spinlock.h>
+#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/bits.h>
#include <asm/msi.h>
@@ -146,6 +147,7 @@ struct msi_desc {
* @attrs: Pointer to the sysfs attribute group
* @platform_data: Platform-MSI specific data
* @list: List of MSI descriptors associated to the device
+ * @mutex: Mutex protecting the MSI list
*/
struct msi_device_data {
raw_spinlock_t lock;
@@ -153,6 +155,7 @@ struct msi_device_data {
const struct attribute_group **attrs;
struct platform_msi_priv_data *platform_data;
struct list_head list;
+ struct mutex mutex;
};

int msi_setup_device_data(struct device *dev);
@@ -187,6 +190,9 @@ static inline unsigned int msi_get_virq(
return ret < 0 ? 0 : ret;
}

+void msi_lock_descs(struct device *dev);
+void msi_unlock_descs(struct device *dev);
+
/* Helpers to hide struct msi_desc implementation details */
#define msi_desc_to_dev(desc) ((desc)->dev)
#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -116,12 +116,37 @@ int msi_setup_device_data(struct device

raw_spin_lock_init(&md->lock);
INIT_LIST_HEAD(&md->list);
+ mutex_init(&md->mutex);
dev->msi.data = md;
devres_add(dev, md);
return 0;
}

/**
+ * msi_lock_descs - Lock the MSI descriptor storage of a device
+ * @dev: Device to operate on
+ */
+void msi_lock_descs(struct device *dev)
+{
+ if (WARN_ON_ONCE(!dev->msi.data))
+ return;
+ mutex_lock(&dev->msi.data->mutex);
+}
+EXPORT_SYMBOL_GPL(msi_lock_descs);
+
+/**
+ * msi_unlock_descs - Unlock the MSI descriptor storage of a device
+ * @dev: Device to operate on
+ */
+void msi_unlock_descs(struct device *dev)
+{
+ if (WARN_ON_ONCE(!dev->msi.data))
+ return;
+ mutex_unlock(&dev->msi.data->mutex);
+}
+EXPORT_SYMBOL_GPL(msi_unlock_descs);
+
+/**
* __msi_get_virq - Return Linux interrupt number of a MSI interrupt
* @dev: Device to operate on
* @index: MSI interrupt index to look for (0-based)

Thomas Gleixner

Nov 26, 2021, 8:22:35 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Usage sites which allocate the MSI descriptors before invoking
msi_domain_alloc_irqs() need to hold the MSI descriptor lock across the
whole operation.

Provide entry points which can be called with the MSI mutex held and lock
the mutex in the existing entry points.
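
As an illustration, a user which preallocates its own descriptors would end
up with roughly this pattern (hypothetical caller, my_alloc_msi_descs() is
made up, error handling trimmed):

	msi_lock_descs(dev);
	ret = my_alloc_msi_descs(dev, nvec);
	if (!ret)
		ret = msi_domain_alloc_irqs_descs_locked(domain, dev, nvec);
	msi_unlock_descs(dev);

The existing msi_domain_alloc_irqs()/msi_domain_free_irqs() interfaces stay
as they are and simply wrap the new _descs_locked variants in
msi_lock_descs()/msi_unlock_descs().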

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 3 ++
kernel/irq/msi.c | 74 ++++++++++++++++++++++++++++++++++++++++------------
2 files changed, 61 insertions(+), 16 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -413,9 +413,12 @@ struct irq_domain *msi_create_irq_domain
struct irq_domain *parent);
int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
int nvec);
+int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
+ int nvec);
int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
int nvec);
void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev);
void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -691,10 +691,8 @@ int __msi_domain_alloc_irqs(struct irq_d
virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
dev_to_node(dev), &arg, false,
desc->affinity);
- if (virq < 0) {
- ret = msi_handle_pci_fail(domain, desc, allocated);
- goto cleanup;
- }
+ if (virq < 0)
+ return msi_handle_pci_fail(domain, desc, allocated);

for (i = 0; i < desc->nvec_used; i++) {
irq_set_msi_desc_off(virq, i, desc);
@@ -728,7 +726,7 @@ int __msi_domain_alloc_irqs(struct irq_d
}
ret = irq_domain_activate_irq(irq_data, can_reserve);
if (ret)
- goto cleanup;
+ return ret;
}

skip_activate:
@@ -743,38 +741,63 @@ int __msi_domain_alloc_irqs(struct irq_d
}
}
return 0;
-
-cleanup:
- msi_domain_free_irqs(domain, dev);
- return ret;
}

/**
- * msi_domain_alloc_irqs - Allocate interrupts from a MSI interrupt domain
+ * msi_domain_alloc_irqs_descs_locked - Allocate interrupts from a MSI interrupt domain
* @domain: The domain to allocate from
* @dev: Pointer to device struct of the device for which the interrupts
* are allocated
* @nvec: The number of interrupts to allocate
*
+ * Must be invoked from within a msi_lock_descs() / msi_unlock_descs()
+ * pair. Use this for MSI irqdomains which implement their own vector
+ * allocation/free.
+ *
* Return: %0 on success or an error code.
*/
-int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
- int nvec)
+int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
+ int nvec)
{
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;
int ret;

+ lockdep_assert_held(&dev->msi.data->mutex);
+
ret = ops->domain_alloc_irqs(domain, dev, nvec);
if (ret)
- return ret;
+ goto cleanup;

if (!(info->flags & MSI_FLAG_DEV_SYSFS))
return 0;

ret = msi_device_populate_sysfs(dev);
if (ret)
- msi_domain_free_irqs(domain, dev);
+ goto cleanup;
+ return 0;
+
+cleanup:
+ msi_domain_free_irqs_descs_locked(domain, dev);
+ return ret;
+}
+
+/**
+ * msi_domain_alloc_irqs - Allocate interrupts from a MSI interrupt domain
+ * @domain: The domain to allocate from
+ * @dev: Pointer to device struct of the device for which the interrupts
+ * are allocated
+ * @nvec: The number of interrupts to allocate
+ *
+ * Return: %0 on success or an error code.
+ */
+int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec)
+{
+ int ret;
+
+ msi_lock_descs(dev);
+ ret = msi_domain_alloc_irqs_descs_locked(domain, dev, nvec);
+ msi_unlock_descs(dev);
return ret;
}

@@ -804,22 +827,41 @@ void __msi_domain_free_irqs(struct irq_d
}

/**
- * msi_domain_free_irqs - Free interrupts from a MSI interrupt @domain associated to @dev
+ * msi_domain_free_irqs_descs_locked - Free interrupts from a MSI interrupt @domain associated to @dev
* @domain: The domain managing the interrupts
* @dev: Pointer to device struct of the device for which the interrupts
* are free
+ *
+ * Must be invoked from within a msi_lock_descs() / msi_unlock_descs()
+ * pair. Use this for MSI irqdomains which implement their own vector
+ * allocation.
*/
-void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev)
{
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;

+ lockdep_assert_held(&dev->msi.data->mutex);
+
if (info->flags & MSI_FLAG_DEV_SYSFS)
msi_device_destroy_sysfs(dev);
ops->domain_free_irqs(domain, dev);
}

/**
+ * msi_domain_free_irqs - Free interrupts from a MSI interrupt @domain associated to @dev
+ * @domain: The domain managing the interrupts
+ * @dev: Pointer to device struct of the device for which the interrupts
+ * are free
+ */
+void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+{
+ msi_lock_descs(dev);
+ msi_domain_free_irqs_descs_locked(domain, dev);
+ msi_unlock_descs(dev);
+}
+
+/**
* msi_get_domain_info - Get the MSI interrupt domain info for @domain
* @domain: The interrupt domain to retrieve data from
*

Thomas Gleixner

Nov 26, 2021, 8:22:36 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
In preparation for dynamic handling of MSI-X interrupts provide a new set
of MSI descriptor accessor functions and iterators. They are beneficial per
se as they allow cleaning up quite some code in various MSI domain
implementations.
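
For example, the common pattern of tearing down only the descriptors which
have an interrupt assigned condenses to this sketch (teardown_one() is made
up):

	msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED)
		teardown_one(desc);

instead of iterating all entries and checking desc->irq by hand.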

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 58 ++++++++++++++++++++++++++++
kernel/irq/msi.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 165 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -140,6 +140,18 @@ struct msi_desc {
struct pci_msi_desc pci;
};

+/*
+ * Filter values for the MSI descriptor iterators and accessor functions.
+ */
+enum msi_desc_filter {
+ /* All descriptors */
+ MSI_DESC_ALL,
+ /* Descriptors which have no interrupt associated */
+ MSI_DESC_NOTASSOCIATED,
+ /* Descriptors which have an interrupt associated */
+ MSI_DESC_ASSOCIATED,
+};
+
/**
* msi_device_data - MSI per device data
* @lock: Spinlock to protect register access
@@ -148,6 +160,8 @@ struct msi_desc {
* @platform_data: Platform-MSI specific data
* @list: List of MSI descriptors associated to the device
* @mutex: Mutex protecting the MSI list
+ * @__next: Cached pointer to the next entry for iterators
+ * @__filter: Cached descriptor filter
*/
struct msi_device_data {
raw_spinlock_t lock;
@@ -156,6 +170,8 @@ struct msi_device_data {
struct platform_msi_priv_data *platform_data;
struct list_head list;
struct mutex mutex;
+ struct msi_desc *__next;
+ enum msi_desc_filter __filter;
};

int msi_setup_device_data(struct device *dev);
@@ -193,6 +209,48 @@ static inline unsigned int msi_get_virq(
void msi_lock_descs(struct device *dev);
void msi_unlock_descs(struct device *dev);

+struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
+struct msi_desc *msi_next_desc(struct device *dev);
+
+/**
+ * msi_first_desc - Get the first MSI descriptor associated to the device
+ * @dev: Device to search
+ */
+static inline struct msi_desc *msi_first_desc(struct device *dev)
+{
+ return __msi_first_desc(dev, MSI_DESC_ALL, 0);
+}
+
+
+/**
+ * msi_for_each_desc_from - Iterate the MSI descriptors from a given index
+ *
+ * @desc: struct msi_desc pointer used as iterator
+ * @dev: struct device pointer - device to iterate
+ * @filter: Filter for descriptor selection
+ * @base_index: MSI index to iterate from
+ *
+ * Notes:
+ * - The loop must be protected with a msi_lock_descs()/msi_unlock_descs()
+ * pair.
+ * - It is safe to remove a retrieved MSI descriptor in the loop.
+ */
+#define msi_for_each_desc_from(desc, dev, filter, base_index) \
+ for ((desc) = __msi_first_desc((dev), (filter), (base_index)); (desc); \
+ (desc) = msi_next_desc((dev)))
+
+/**
+ * msi_for_each_desc - Iterate the MSI descriptors
+ *
+ * @desc: struct msi_desc pointer used as iterator
+ * @dev: struct device pointer - device to iterate
+ * @filter: Filter for descriptor selection
+ *
+ * See msi_for_each_desc_from() for further information.
+ */
+#define msi_for_each_desc(desc, dev, filter) \
+ msi_for_each_desc_from(desc, dev, filter, 0)
+
/* Helpers to hide struct msi_desc implementation details */
#define msi_desc_to_dev(desc) ((desc)->dev)
#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -142,10 +142,117 @@ void msi_unlock_descs(struct device *dev
{
if (WARN_ON_ONCE(!dev->msi.data))
return;
+ /* Clear the next pointer which was cached by the iterator */
+ dev->msi.data->__next = NULL;
mutex_unlock(&dev->msi.data->mutex);
}
EXPORT_SYMBOL_GPL(msi_unlock_descs);

+static bool msi_desc_match(struct msi_desc *desc, enum msi_desc_filter filter)
+{
+ switch (filter) {
+ case MSI_DESC_ALL:
+ return true;
+ case MSI_DESC_NOTASSOCIATED:
+ return !desc->irq;
+ case MSI_DESC_ASSOCIATED:
+ return !!desc->irq;
+ }
+ WARN_ON_ONCE(1);
+ return false;
+}
+
+static struct msi_desc *msi_find_first_desc(struct device *dev, enum msi_desc_filter filter,
+ unsigned int base_index)
+{
+ struct msi_desc *desc;
+
+ list_for_each_entry(desc, dev_to_msi_list(dev), list) {
+ if (desc->msi_index < base_index)
+ continue;
+ if (msi_desc_match(desc, filter))
+ return desc;
+ }
+ return NULL;
+}
+
+/**
+ * __msi_first_desc - Get the first MSI descriptor of a device
+ * @dev: Device to operate on
+ * @filter: Descriptor state filter
+ * @base_index: MSI index to start from for range based operations
+ *
+ * Must be called with the MSI descriptor mutex held, i.e. msi_lock_descs()
+ * must be invoked before the call.
+ *
+ * Return: Pointer to the first MSI descriptor matching the search
+ * criteria, NULL if none found.
+ */
+struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter,
+ unsigned int base_index)
+{
+ struct msi_desc *desc;
+
+ if (WARN_ON_ONCE(!dev->msi.data))
+ return NULL;
+
+ lockdep_assert_held(&dev->msi.data->mutex);
+
+ /* Invalidate a previous invocation within the same lock section */
+ dev->msi.data->__next = NULL;
+
+ desc = msi_find_first_desc(dev, filter, base_index);
+ if (desc) {
+ dev->msi.data->__next = list_next_entry(desc, list);
+ dev->msi.data->__filter = filter;
+ }
+ return desc;
+}
+EXPORT_SYMBOL_GPL(__msi_first_desc);
+
+static struct msi_desc *__msi_next_desc(struct device *dev, enum msi_desc_filter filter,
+ struct msi_desc *from)
+{
+ struct msi_desc *desc = from;
+
+ list_for_each_entry_from(desc, dev_to_msi_list(dev), list) {
+ if (msi_desc_match(desc, filter))
+ return desc;
+ }
+ return NULL;
+}
+
+/**
+ * msi_next_desc - Get the next MSI descriptor of a device
+ * @dev: Device to operate on
+ *
+ * The first invocation of msi_next_desc() has to be preceded by a
+ * successful invocation of __msi_first_desc(). Consecutive invocations are
+ * only valid if the previous one was successful. All these operations have
+ * to be done within the same MSI mutex held region.
+ *
+ * Return: Pointer to the next MSI descriptor matching the search
+ * criteria, NULL if none found.
+ */
+struct msi_desc *msi_next_desc(struct device *dev)
+{
+ struct msi_device_data *data = dev->msi.data;
+ struct msi_desc *desc;
+
+ if (WARN_ON_ONCE(!data))
+ return NULL;
+
+ lockdep_assert_held(&data->mutex);
+
+ if (!data->__next)
+ return NULL;
+
+ desc = __msi_next_desc(dev, data->__filter, data->__next);
+ dev->msi.data->__next = desc ? list_next_entry(desc, list) : NULL;
+ return desc;
+}
+EXPORT_SYMBOL_GPL(msi_next_desc);
+

Thomas Gleixner

Nov 26, 2021, 8:22:38 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Provide msi_add_msi_desc() which takes a template MSI descriptor for
initializing a newly allocated descriptor. This allows simplifying various
usage sites of alloc_msi_entry() and moves the storage handling into the
core code.

For simple cases where only a linear vector space is required provide
msi_add_simple_msi_descs() which just allocates a linear range of MSI
descriptors and fills msi_desc::msi_index accordingly.
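
A condensed sketch of the intended usage with an on-stack template
(hypothetical index value):

	struct msi_desc desc;

	memset(&desc, 0, sizeof(desc));
	desc.nvec_used = 1;
	desc.msi_index = 17;
	/* Must be invoked with the MSI descriptor mutex held */
	if (msi_add_msi_desc(dev, &desc))
		return -ENOMEM;

The core code allocates the real descriptor, copies the relevant fields
from the template and links it into the device's MSI list.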

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 2 +
kernel/irq/msi.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -302,6 +302,8 @@ static inline void pci_write_msi_msg(uns
}
#endif /* CONFIG_PCI_MSI */

+int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc);
+
struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
const struct irq_affinity_desc *affinity);
void free_msi_entry(struct msi_desc *entry);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -61,6 +61,65 @@ void free_msi_entry(struct msi_desc *ent
}

/**
+ * msi_add_msi_desc - Allocate and initialize a MSI descriptor
+ * @dev: Pointer to the device for which the descriptor is allocated
+ * @init_desc: Pointer to an MSI descriptor to initialize the new descriptor
+ *
+ * Return: 0 on success or an appropriate failure code.
+ */
+int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc)
+{
+ struct msi_desc *desc;
+
+ lockdep_assert_held(&dev->msi.data->mutex);
+
+ desc = alloc_msi_entry(dev, init_desc->nvec_used, init_desc->affinity);
+ if (!desc)
+ return -ENOMEM;
+
+ /* Copy the MSI index and type specific data to the new descriptor. */
+ desc->msi_index = init_desc->msi_index;
+ desc->pci = init_desc->pci;
+
+ list_add_tail(&desc->list, &dev->msi.data->list);
+ return 0;
+}
+
+/**
+ * msi_add_simple_msi_descs - Allocate and initialize MSI descriptors
+ * @dev: Pointer to the device for which the descriptors are allocated
+ * @index: Index for the first MSI descriptor
+ * @ndesc: Number of descriptors to allocate
+ *
+ * Return: 0 on success or an appropriate failure code.
+ */
+static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
+{
+ struct msi_desc *desc, *tmp;
+ LIST_HEAD(list);
+ unsigned int i;
+
+ lockdep_assert_held(&dev->msi.data->mutex);
+
+ for (i = 0; i < ndesc; i++) {
+ desc = alloc_msi_entry(dev, 1, NULL);
+ if (!desc)
+ goto fail;
+ desc->msi_index = index + i;
+ list_add_tail(&desc->list, &list);
+ }
+ list_splice_tail(&list, &dev->msi.data->list);
+ return 0;
+
+fail:
+ list_for_each_entry_safe(desc, tmp, &list, list) {
+ list_del(&desc->list);
+ free_msi_entry(desc);
+ }
+ return -ENOMEM;
+}
+
+/**
* msi_device_has_property - Check whether a device has a specific MSI property
* @dev: Pointer to the device which is queried
* @prop: Property to check for

Thomas Gleixner

Nov 26, 2021, 8:22:39 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Provide domain info flags which tell the core code to allocate simple
descriptors and to free descriptors when the interrupts are freed, and
implement the required functionality.
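
An irqdomain opts in by setting the new flags in its msi_domain_info; a
sketch (my_msi_domain_ops is made up):

	static struct msi_domain_info my_msi_domain_info = {
		.flags	= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
			  MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS | MSI_FLAG_FREE_MSI_DESCS,
		.ops	= &my_msi_domain_ops,
	};

With MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS the core allocates the linear range of
descriptors before invoking the domain ops; with MSI_FLAG_FREE_MSI_DESCS it
frees the descriptors after the interrupts have been freed.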

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 15 +++++++++++++++
kernel/irq/msi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 63 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -303,6 +303,17 @@ static inline void pci_write_msi_msg(uns
#endif /* CONFIG_PCI_MSI */

int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc);
+void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
+ unsigned int base_index, unsigned int ndesc);
+
+/**
+ * msi_free_msi_descs - Free MSI descriptors of a device
+ * @dev: Device to free the descriptors
+ */
+static inline void msi_free_msi_descs(struct device *dev)
+{
+ msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, UINT_MAX);
+}

struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
const struct irq_affinity_desc *affinity);
@@ -463,6 +474,10 @@ enum {
MSI_FLAG_DEV_SYSFS = (1 << 7),
/* MSI-X entries must be contiguous */
MSI_FLAG_MSIX_CONTIGUOUS = (1 << 8),
+ /* Allocate simple MSI descriptors */
+ MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS = (1 << 9),
+ /* Free MSI descriptors */
+ MSI_FLAG_FREE_MSI_DESCS = (1 << 10),
};

int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -120,6 +120,32 @@ static int msi_add_simple_msi_descs(stru
}

/**
+ * msi_free_msi_descs_range - Free MSI descriptors of a device
+ * @dev: Device to free the descriptors
+ * @filter: Descriptor state filter
+ * @base_index: Index to start freeing from
+ * @ndesc: Number of descriptors to free
+ */
+void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
+ unsigned int base_index, unsigned int ndesc)
+{
+ struct msi_desc *desc;
+
+ lockdep_assert_held(&dev->msi.data->mutex);
+
+ msi_for_each_desc(desc, dev, filter) {
+ /*
+ * Stupid for now to handle MSI device domain until the
+ * storage is switched over to an xarray.
+ */
+ if (desc->msi_index < base_index || desc->msi_index >= base_index + ndesc)
+ continue;
+ list_del(&desc->list);
+ free_msi_entry(desc);
+ }
+}
+
+/**
* msi_device_has_property - Check whether a device has a specific MSI property
* @dev: Pointer to the device which is queried
* @prop: Property to check for
@@ -905,6 +931,16 @@ int __msi_domain_alloc_irqs(struct irq_d
return 0;
}

+static int msi_domain_add_simple_msi_descs(struct msi_domain_info *info,
+ struct device *dev,
+ unsigned int num_descs)
+{
+ if (!(info->flags & MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS))
+ return 0;
+
+ return msi_add_simple_msi_descs(dev, 0, num_descs);
+}
+
/**
* msi_domain_alloc_irqs_descs_locked - Allocate interrupts from a MSI interrupt domain
* @domain: The domain to allocate from
@@ -927,6 +963,10 @@ int msi_domain_alloc_irqs_descs_locked(s

lockdep_assert_held(&dev->msi.data->mutex);

+ ret = msi_domain_add_simple_msi_descs(info, dev, nvec);
+ if (ret)
+ return ret;
+
ret = ops->domain_alloc_irqs(domain, dev, nvec);
if (ret)
goto cleanup;
@@ -988,6 +1028,13 @@ void __msi_domain_free_irqs(struct irq_d
}
}

+static void msi_domain_free_msi_descs(struct msi_domain_info *info,
+ struct device *dev)
+{
+ if (info->flags & MSI_FLAG_FREE_MSI_DESCS)
+ msi_free_msi_descs(dev);
+}
+
/**
* msi_domain_free_irqs_descs_locked - Free interrupts from a MSI interrupt @domain associated to @dev
* @domain: The domain managing the interrupts
@@ -1008,6 +1055,7 @@ void msi_domain_free_irqs_descs_locked(s
if (info->flags & MSI_FLAG_DEV_SYSFS)
msi_device_destroy_sysfs(dev);
ops->domain_free_irqs(domain, dev);
+ msi_domain_free_msi_descs(info, dev);
}

/**

Thomas Gleixner

Nov 26, 2021, 8:22:40 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 3 +++
kernel/irq/msi.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -156,6 +156,7 @@ enum msi_desc_filter {
* msi_device_data - MSI per device data
* @lock: Spinlock to protect register access
* @properties: MSI properties which are interesting to drivers
+ * @num_descs: The number of allocated MSI descriptors for the device
* @attrs: Pointer to the sysfs attribute group
* @platform_data: Platform-MSI specific data
* @list: List of MSI descriptors associated to the device
@@ -166,6 +167,7 @@ enum msi_desc_filter {
struct msi_device_data {
raw_spinlock_t lock;
unsigned long properties;
+ unsigned int num_descs;
const struct attribute_group **attrs;
struct platform_msi_priv_data *platform_data;
struct list_head list;
@@ -208,6 +210,7 @@ static inline unsigned int msi_get_virq(

void msi_lock_descs(struct device *dev);
void msi_unlock_descs(struct device *dev);
+unsigned int msi_device_num_descs(struct device *dev);

struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
struct msi_desc *msi_next_desc(struct device *dev);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -82,6 +82,7 @@ int msi_add_msi_desc(struct device *dev,
desc->pci = init_desc->pci;

list_add_tail(&desc->list, &dev->msi.data->list);
+ dev->msi.data->num_descs++;
return 0;
}

@@ -109,6 +110,7 @@ int msi_add_simple_msi_descs(struct devi
list_add_tail(&desc->list, &list);
}
list_splice_tail(&list, &dev->msi.data->list);
+ dev->msi.data->num_descs += ndesc;
return 0;

fail:
@@ -142,6 +144,7 @@ void msi_free_msi_descs_range(struct dev
continue;
list_del(&desc->list);
free_msi_entry(desc);
+ dev->msi.data->num_descs--;
}
}

@@ -157,6 +160,21 @@ bool msi_device_has_property(struct devi
return !!(dev->msi.data->properties & prop);
}

+/**
+ * msi_device_num_descs - Query the number of allocated MSI descriptors of a device
+ * @dev: The device to read from
+ *
+ * Note: This is a lockless snapshot of msi_device_data::num_descs
+ *
+ * Returns the number of MSI descriptors which are allocated for @dev
+ */
+unsigned int msi_device_num_descs(struct device *dev)
+{
+ if (dev->msi.data)
+ return dev->msi.data->num_descs;
+ return 0;
+}
+
void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
*msg = entry->msg;

Thomas Gleixner

Nov 26, 2021, 8:22:42 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
To prepare for dynamic extension of MSI-X vectors, protect the MSI
operations for MSI and MSI-X. This requires moving the invocation of
irq_create_affinity_masks() out of the descriptor lock section to avoid a
reverse lock ordering vs. the CPU hotplug lock, as some callers of the
PCI/MSI allocation interfaces already hold it.
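
The resulting ordering on the allocation side is therefore (condensed from
msi_capability_init() below):

	if (affd)
		masks = irq_create_affinity_masks(nvec, affd);	/* outside the lock */

	msi_lock_descs(&dev->dev);
	/* Set up the MSI descriptors and allocate the interrupts here */
	msi_unlock_descs(&dev->dev);
	kfree(masks);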

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/msi/irqdomain.c | 4 -
drivers/pci/msi/msi.c | 120 ++++++++++++++++++++++++++------------------
2 files changed, 73 insertions(+), 51 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -14,7 +14,7 @@ int pci_msi_setup_msi_irqs(struct pci_de

domain = dev_get_msi_domain(&dev->dev);
if (domain && irq_domain_is_hierarchy(domain))
- return msi_domain_alloc_irqs(domain, &dev->dev, nvec);
+ return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, nvec);

return pci_msi_legacy_setup_msi_irqs(dev, nvec, type);
}
@@ -25,7 +25,7 @@ void pci_msi_teardown_msi_irqs(struct pc

domain = dev_get_msi_domain(&dev->dev);
if (domain && irq_domain_is_hierarchy(domain))
- msi_domain_free_irqs(domain, &dev->dev);
+ msi_domain_free_irqs_descs_locked(domain, &dev->dev);
else
pci_msi_legacy_teardown_msi_irqs(dev);
}
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -322,11 +322,13 @@ static void __pci_restore_msix_state(str

write_msg = arch_restore_msi_irqs(dev);

+ msi_lock_descs(&dev->dev);
for_each_pci_msi_entry(entry, dev) {
if (write_msg)
__pci_write_msi_msg(entry, &entry->msg);
pci_msix_write_vector_ctrl(entry, entry->pci.msix_ctrl);
}
+ msi_unlock_descs(&dev->dev);

pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0);
}
@@ -339,19 +341,15 @@ void pci_restore_msi_state(struct pci_de
EXPORT_SYMBOL_GPL(pci_restore_msi_state);

static struct msi_desc *
-msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd)
+msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity_desc *masks)
{
- struct irq_affinity_desc *masks = NULL;
struct msi_desc *entry;
u16 control;

- if (affd)
- masks = irq_create_affinity_masks(nvec, affd);
-
/* MSI Entry Initialization */
entry = alloc_msi_entry(&dev->dev, nvec, masks);
if (!entry)
- goto out;
+ return NULL;

pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &control);
/* Lies, damned lies, and MSIs */
@@ -377,8 +375,7 @@ msi_setup_entry(struct pci_dev *dev, int
dev->dev.msi.data->properties = MSI_PROP_PCI_MSI;
if (entry->pci.msi_attrib.is_64)
dev->dev.msi.data->properties |= MSI_PROP_64BIT;
-out:
- kfree(masks);
+
return entry;
}

@@ -414,14 +411,21 @@ static int msi_verify_entries(struct pci
static int msi_capability_init(struct pci_dev *dev, int nvec,
struct irq_affinity *affd)
{
+ struct irq_affinity_desc *masks = NULL;
struct msi_desc *entry;
int ret;

pci_msi_set_enable(dev, 0); /* Disable MSI during set up */

- entry = msi_setup_entry(dev, nvec, affd);
- if (!entry)
- return -ENOMEM;
+ if (affd)
+ masks = irq_create_affinity_masks(nvec, affd);
+
+ msi_lock_descs(&dev->dev);
+ entry = msi_setup_entry(dev, nvec, masks);
+ if (!entry) {
+ ret = -ENOMEM;
+ goto unlock;
+ }

/* All MSIs are unmasked by default; mask them all */
pci_msi_mask(entry, msi_multi_mask(entry));
@@ -444,11 +448,14 @@ static int msi_capability_init(struct pc

pcibios_free_irq(dev);
dev->irq = entry->irq;
- return 0;
+ goto unlock;

err:
pci_msi_unmask(entry, msi_multi_mask(entry));
free_msi_irqs(dev);
+unlock:
+ msi_unlock_descs(&dev->dev);
+ kfree(masks);
return ret;
}

@@ -475,23 +482,18 @@ static void __iomem *msix_map_region(str

static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
struct msix_entry *entries, int nvec,
- struct irq_affinity *affd)
+ struct irq_affinity_desc *masks)
{
- struct irq_affinity_desc *curmsk, *masks = NULL;
+ int i, vec_count = pci_msix_vec_count(dev);
+ struct irq_affinity_desc *curmsk;
struct msi_desc *entry;
void __iomem *addr;
- int ret, i;
- int vec_count = pci_msix_vec_count(dev);
-
- if (affd)
- masks = irq_create_affinity_masks(nvec, affd);

for (i = 0, curmsk = masks; i < nvec; i++) {
entry = alloc_msi_entry(&dev->dev, 1, curmsk);
if (!entry) {
/* No enough memory. Don't try again */
- ret = -ENOMEM;
- goto out;
+ return -ENOMEM;
}

entry->pci.msi_attrib.is_msix = 1;
@@ -520,10 +522,7 @@ static int msix_setup_entries(struct pci
curmsk++;
}
dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
- ret = 0;
-out:
- kfree(masks);
- return ret;
+ return 0;
}

static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries)
@@ -550,6 +549,41 @@ static void msix_mask_all(void __iomem *
writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
}

+static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base,
+ struct msix_entry *entries, int nvec,
+ struct irq_affinity *affd)
+{
+ struct irq_affinity_desc *masks = NULL;
+ int ret;
+
+ if (affd)
+ masks = irq_create_affinity_masks(nvec, affd);
+
+ msi_lock_descs(&dev->dev);
+ ret = msix_setup_entries(dev, base, entries, nvec, masks);
+ if (ret)
+ goto out_free;
+
+ ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+ if (ret)
+ goto out_free;
+
+ /* Check if all MSI entries honor device restrictions */
+ ret = msi_verify_entries(dev);
+ if (ret)
+ goto out_free;
+
+ msix_update_entries(dev, entries);
+ goto out_unlock;
+
+out_free:
+ free_msi_irqs(dev);
+out_unlock:
+ msi_unlock_descs(&dev->dev);
+ kfree(masks);
+ return ret;
+}
+
/**
* msix_capability_init - configure device's MSI-X capability
* @dev: pointer to the pci_dev data structure of MSI-X device function
@@ -590,20 +624,9 @@ static int msix_capability_init(struct p
/* Ensure that all table entries are masked. */
msix_mask_all(base, tsize);

- ret = msix_setup_entries(dev, base, entries, nvec, affd);
+ ret = msix_setup_interrupts(dev, base, entries, nvec, affd);
if (ret)
- goto out_free;
-
- ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
- if (ret)
- goto out_free;
-
- /* Check if all MSI entries honor device restrictions */
- ret = msi_verify_entries(dev);
- if (ret)
- goto out_free;
-
- msix_update_entries(dev, entries);
+ goto out_disable;

/* Set MSI-X enabled bits and unmask the function */
pci_intx_for_msi(dev, 0);
@@ -613,12 +636,8 @@ static int msix_capability_init(struct p
pcibios_free_irq(dev);
return 0;

-out_free:
- free_msi_irqs(dev);
-
out_disable:
pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
-
return ret;
}

@@ -723,8 +742,10 @@ void pci_disable_msi(struct pci_dev *dev
if (!pci_msi_enable || !dev || !dev->msi_enabled)
return;

+ msi_lock_descs(&dev->dev);
pci_msi_shutdown(dev);
free_msi_irqs(dev);
+ msi_unlock_descs(&dev->dev);
}
EXPORT_SYMBOL(pci_disable_msi);

@@ -810,8 +831,10 @@ void pci_disable_msix(struct pci_dev *de
if (!pci_msi_enable || !dev || !dev->msix_enabled)
return;

+ msi_lock_descs(&dev->dev);
pci_msix_shutdown(dev);
free_msi_irqs(dev);
+ msi_unlock_descs(&dev->dev);
}
EXPORT_SYMBOL(pci_disable_msix);

@@ -872,7 +895,6 @@ int pci_enable_msi(struct pci_dev *dev)

if (!rc)
rc = __pci_enable_msi_range(dev, 1, 1, NULL);
-
return rc < 0 ? rc : 0;
}
EXPORT_SYMBOL(pci_enable_msi);
@@ -959,11 +981,7 @@ int pci_alloc_irq_vectors_affinity(struc
struct irq_affinity *affd)
{
struct irq_affinity msi_default_affd = {0};
- int ret = msi_setup_device_data(&dev->dev);
- int nvecs = -ENOSPC;
-
- if (ret)
- return ret;
+ int ret, nvecs;

if (flags & PCI_IRQ_AFFINITY) {
if (!affd)
@@ -973,6 +991,10 @@ int pci_alloc_irq_vectors_affinity(struc
affd = NULL;
}

+ ret = msi_setup_device_data(&dev->dev);
+ if (ret)
+ return ret;
+
if (flags & PCI_IRQ_MSIX) {
nvecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs,
affd, flags);
@@ -1001,7 +1023,7 @@ int pci_alloc_irq_vectors_affinity(struc
}
}

- return nvecs;
+ return -ENOSPC;
}
EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);


Thomas Gleixner

Nov 26, 2021, 8:22:43 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Simplify the allocation of MSI descriptors by using msi_add_msi_desc()
which moves the storage handling to core code and prepares for dynamic
extension of the MSI-X vector space.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/msi/msi.c | 121 ++++++++++++++++++++++++--------------------------
1 file changed, 59 insertions(+), 62 deletions(-)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -340,43 +340,49 @@ void pci_restore_msi_state(struct pci_de
}
EXPORT_SYMBOL_GPL(pci_restore_msi_state);

-static struct msi_desc *
-msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity_desc *masks)
+static int msi_setup_msi_desc(struct pci_dev *dev, int nvec,
+ struct irq_affinity_desc *masks)
{
- struct msi_desc *entry;
+ struct msi_desc desc;
u16 control;
+ int ret;

/* MSI Entry Initialization */
- entry = alloc_msi_entry(&dev->dev, nvec, masks);
- if (!entry)
- return NULL;
+ memset(&desc, 0, sizeof(desc));

pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &control);
/* Lies, damned lies, and MSIs */
if (dev->dev_flags & PCI_DEV_FLAGS_HAS_MSI_MASKING)
control |= PCI_MSI_FLAGS_MASKBIT;
+ /* Respect XEN's mask disabling */
+ if (pci_msi_ignore_mask)
+ control &= ~PCI_MSI_FLAGS_MASKBIT;

- entry->pci.msi_attrib.is_64 = !!(control & PCI_MSI_FLAGS_64BIT);
- entry->pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
- !!(control & PCI_MSI_FLAGS_MASKBIT);
- entry->pci.msi_attrib.default_irq = dev->irq;
- entry->pci.msi_attrib.multi_cap = (control & PCI_MSI_FLAGS_QMASK) >> 1;
- entry->pci.msi_attrib.multiple = ilog2(__roundup_pow_of_two(nvec));
+ desc.nvec_used = nvec;
+ desc.pci.msi_attrib.is_64 = !!(control & PCI_MSI_FLAGS_64BIT);
+ desc.pci.msi_attrib.can_mask = !!(control & PCI_MSI_FLAGS_MASKBIT);
+ desc.pci.msi_attrib.default_irq = dev->irq;
+ desc.pci.msi_attrib.multi_cap = (control & PCI_MSI_FLAGS_QMASK) >> 1;
+ desc.pci.msi_attrib.multiple = ilog2(__roundup_pow_of_two(nvec));
+ desc.affinity = masks;

if (control & PCI_MSI_FLAGS_64BIT)
- entry->pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
+ desc.pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
else
- entry->pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_32;
+ desc.pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_32;

/* Save the initial mask status */
- if (entry->pci.msi_attrib.can_mask)
- pci_read_config_dword(dev, entry->pci.mask_pos, &entry->pci.msi_mask);
+ if (desc.pci.msi_attrib.can_mask)
+ pci_read_config_dword(dev, desc.pci.mask_pos, &desc.pci.msi_mask);

- dev->dev.msi.data->properties = MSI_PROP_PCI_MSI;
- if (entry->pci.msi_attrib.is_64)
- dev->dev.msi.data->properties |= MSI_PROP_64BIT;
+ ret = msi_add_msi_desc(&dev->dev, &desc);
+ if (!ret) {
+ dev->dev.msi.data->properties = MSI_PROP_PCI_MSI;
+ if (desc.pci.msi_attrib.is_64)
+ dev->dev.msi.data->properties |= MSI_PROP_64BIT;
+ }

- return entry;
+ return ret;
}

static int msi_verify_entries(struct pci_dev *dev)
@@ -421,17 +427,14 @@ static int msi_capability_init(struct pc
masks = irq_create_affinity_masks(nvec, affd);

msi_lock_descs(&dev->dev);
- entry = msi_setup_entry(dev, nvec, masks);
- if (!entry) {
- ret = -ENOMEM;
+ ret = msi_setup_msi_desc(dev, nvec, masks);
+ if (ret)
goto unlock;
- }

/* All MSIs are unmasked by default; mask them all */
+ entry = first_pci_msi_entry(dev);
pci_msi_mask(entry, msi_multi_mask(entry));

- list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
-
/* Configure MSI capability structure */
ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
if (ret)
@@ -480,49 +483,41 @@ static void __iomem *msix_map_region(str
return ioremap(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
}

-static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
- struct msix_entry *entries, int nvec,
- struct irq_affinity_desc *masks)
+static int msix_setup_msi_descs(struct pci_dev *dev, void __iomem *base,
+ struct msix_entry *entries, int nvec,
+ struct irq_affinity_desc *masks)
{
- int i, vec_count = pci_msix_vec_count(dev);
+ int ret, i, vec_count = pci_msix_vec_count(dev);
struct irq_affinity_desc *curmsk;
- struct msi_desc *entry;
+ struct msi_desc desc;
void __iomem *addr;

- for (i = 0, curmsk = masks; i < nvec; i++) {
- entry = alloc_msi_entry(&dev->dev, 1, curmsk);
- if (!entry) {
- /* No enough memory. Don't try again */
- return -ENOMEM;
- }
-
- entry->pci.msi_attrib.is_msix = 1;
- entry->pci.msi_attrib.is_64 = 1;
-
- if (entries)
- entry->msi_index = entries[i].entry;
- else
- entry->msi_index = i;
-
- entry->pci.msi_attrib.is_virtual = entry->msi_index >= vec_count;
-
- entry->pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
- !entry->pci.msi_attrib.is_virtual;
-
- entry->pci.msi_attrib.default_irq = dev->irq;
- entry->pci.mask_base = base;
+ memset(&desc, 0, sizeof(desc));

- if (entry->pci.msi_attrib.can_mask) {
- addr = pci_msix_desc_addr(entry);
- entry->pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
+ desc.nvec_used = 1;
+ desc.pci.msi_attrib.is_msix = 1;
+ desc.pci.msi_attrib.is_64 = 1;
+ desc.pci.msi_attrib.default_irq = dev->irq;
+ desc.pci.mask_base = base;
+
+ for (i = 0, curmsk = masks; i < nvec; i++, curmsk++) {
+ desc.msi_index = entries ? entries[i].entry : i;
+ desc.affinity = masks ? curmsk : NULL;
+ desc.pci.msi_attrib.is_virtual = desc.msi_index >= vec_count;
+ desc.pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
+ !desc.pci.msi_attrib.is_virtual;
+
+ if (desc.pci.msi_attrib.can_mask) {
+ addr = pci_msix_desc_addr(&desc);
+ desc.pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
}

- list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
- if (masks)
- curmsk++;
+ ret = msi_add_msi_desc(&dev->dev, &desc);
+ if (ret)
+ break;
}
- dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
- return 0;
+
+ return ret;
}

static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries)
@@ -560,10 +555,12 @@ static int msix_setup_interrupts(struct
masks = irq_create_affinity_masks(nvec, affd);

msi_lock_descs(&dev->dev);
- ret = msix_setup_entries(dev, base, entries, nvec, masks);
+ ret = msix_setup_msi_descs(dev, base, entries, nvec, masks);
if (ret)
goto out_free;

+ dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
+
ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
if (ret)
goto out_free;

Thomas Gleixner

Nov 26, 2021, 8:22:45 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Set the domain info flag which tells the core code to free the MSI
descriptors from msi_domain_free_irqs() and add an explicit call to the
core function into the legacy code.
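
With that, the legacy teardown path performs these steps in order, as the
legacy.c hunk below shows:

	msi_device_destroy_sysfs(&dev->dev);
	arch_teardown_msi_irqs(dev);
	msi_free_msi_descs(&dev->dev);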

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/msi/irqdomain.c | 3 ++-
drivers/pci/msi/legacy.c | 1 +
drivers/pci/msi/msi.c | 14 --------------
3 files changed, 3 insertions(+), 15 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -171,7 +171,8 @@ struct irq_domain *pci_msi_create_irq_do
if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
pci_msi_domain_update_chip_ops(info);

- info->flags |= MSI_FLAG_ACTIVATE_EARLY | MSI_FLAG_DEV_SYSFS;
+ info->flags |= MSI_FLAG_ACTIVATE_EARLY | MSI_FLAG_DEV_SYSFS |
+ MSI_FLAG_FREE_MSI_DESCS;
if (IS_ENABLED(CONFIG_GENERIC_IRQ_RESERVATION_MODE))
info->flags |= MSI_FLAG_MUST_REACTIVATE;

--- a/drivers/pci/msi/legacy.c
+++ b/drivers/pci/msi/legacy.c
@@ -81,4 +81,5 @@ void pci_msi_legacy_teardown_msi_irqs(st
{
msi_device_destroy_sysfs(&dev->dev);
arch_teardown_msi_irqs(dev);
+ msi_free_msi_descs(&dev->dev);
}
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -224,22 +224,8 @@ EXPORT_SYMBOL_GPL(pci_write_msi_msg);

static void free_msi_irqs(struct pci_dev *dev)
{
- struct list_head *msi_list = dev_to_msi_list(&dev->dev);
- struct msi_desc *entry, *tmp;
- int i;
-
- for_each_pci_msi_entry(entry, dev)
- if (entry->irq)
- for (i = 0; i < entry->nvec_used; i++)
- BUG_ON(irq_has_action(entry->irq + i));
-
pci_msi_teardown_msi_irqs(dev);

- list_for_each_entry_safe(entry, tmp, msi_list, list) {
- list_del(&entry->list);
- free_msi_entry(entry);
- }
-
if (dev->msix_base) {
iounmap(dev->msix_base);
dev->msix_base = NULL;

Thomas Gleixner

Nov 26, 2021, 8:22:46 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Use the new iterator functions which pave the way for dynamically extending
MSI-X vectors.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/msi/irqdomain.c | 4 ++--
drivers/pci/msi/legacy.c | 19 ++++++++-----------
drivers/pci/msi/msi.c | 30 ++++++++++++++----------------
3 files changed, 24 insertions(+), 29 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -83,7 +83,7 @@ static int pci_msi_domain_check_cap(stru
struct msi_domain_info *info,
struct device *dev)
{
- struct msi_desc *desc = first_pci_msi_entry(to_pci_dev(dev));
+ struct msi_desc *desc = msi_first_desc(dev);

/* Special handling to support __pci_enable_msi_range() */
if (pci_msi_desc_is_multi_msi(desc) &&
@@ -98,7 +98,7 @@ static int pci_msi_domain_check_cap(stru
unsigned int idx = 0;

/* Check for gaps in the entry indices */
- for_each_msi_entry(desc, dev) {
+ msi_for_each_desc(desc, dev, MSI_DESC_ALL) {
if (desc->msi_index != idx++)
return -ENOTSUPP;
}
--- a/drivers/pci/msi/legacy.c
+++ b/drivers/pci/msi/legacy.c
@@ -29,7 +29,7 @@ int __weak arch_setup_msi_irqs(struct pc
if (type == PCI_CAP_ID_MSI && nvec > 1)
return 1;

- for_each_pci_msi_entry(desc, dev) {
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
ret = arch_setup_msi_irq(dev, desc);
if (ret)
return ret < 0 ? ret : -ENOSPC;
@@ -43,27 +43,24 @@ void __weak arch_teardown_msi_irqs(struc
struct msi_desc *desc;
int i;

- for_each_pci_msi_entry(desc, dev) {
- if (desc->irq) {
- for (i = 0; i < entry->nvec_used; i++)
- arch_teardown_msi_irq(desc->irq + i);
- }
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_ASSOCIATED) {
+ for (i = 0; i < desc->nvec_used; i++)
+ arch_teardown_msi_irq(desc->irq + i);
}
}

static int pci_msi_setup_check_result(struct pci_dev *dev, int type, int ret)
{
- struct msi_desc *entry;
+ struct msi_desc *desc;
int avail = 0;

if (type != PCI_CAP_ID_MSIX || ret >= 0)
return ret;

/* Scan the MSI descriptors for successfully allocated ones. */
- for_each_pci_msi_entry(entry, dev) {
- if (entry->irq != 0)
- avail++;
- }
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_ASSOCIATED)
+ avail++;
+
return avail ? avail : ret;
}

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -299,7 +299,6 @@ static void __pci_restore_msix_state(str

if (!dev->msix_enabled)
return;
- BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));

/* route the table */
pci_intx_for_msi(dev, 0);
@@ -309,7 +308,7 @@ static void __pci_restore_msix_state(str
write_msg = arch_restore_msi_irqs(dev);

msi_lock_descs(&dev->dev);
- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_ALL) {
if (write_msg)
__pci_write_msi_msg(entry, &entry->msg);
pci_msix_write_vector_ctrl(entry, entry->pci.msix_ctrl);
@@ -378,14 +377,14 @@ static int msi_verify_entries(struct pci
if (!dev->no_64bit_msi)
return 0;

- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_ALL) {
if (entry->msg.address_hi) {
pci_err(dev, "arch assigned 64-bit MSI address %#x%08x but device only supports 32 bits\n",
entry->msg.address_hi, entry->msg.address_lo);
- return -EIO;
+ break;
}
}
- return 0;
+ return !entry ? 0 : -EIO;
}

/**
@@ -418,7 +417,7 @@ static int msi_capability_init(struct pc
goto unlock;

/* All MSIs are unmasked by default; mask them all */
- entry = first_pci_msi_entry(dev);
+ entry = msi_first_desc(&dev->dev);
pci_msi_mask(entry, msi_multi_mask(entry));

/* Configure MSI capability structure */
@@ -508,11 +507,11 @@ static int msix_setup_msi_descs(struct p

static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries)
{
- struct msi_desc *entry;
+ struct msi_desc *desc;

if (entries) {
- for_each_pci_msi_entry(entry, dev) {
- entries->vector = entry->irq;
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_ALL) {
+ entries->vector = desc->irq;
entries++;
}
}
@@ -705,15 +704,14 @@ static void pci_msi_shutdown(struct pci_
if (!pci_msi_enable || !dev || !dev->msi_enabled)
return;

- BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));
- desc = first_pci_msi_entry(dev);
-
pci_msi_set_enable(dev, 0);
pci_intx_for_msi(dev, 1);
dev->msi_enabled = 0;

/* Return the device with MSI unmasked as initial states */
- pci_msi_unmask(desc, msi_multi_mask(desc));
+ desc = msi_first_desc(&dev->dev);
+ if (!WARN_ON_ONCE(!desc))
+ pci_msi_unmask(desc, msi_multi_mask(desc));

/* Restore dev->irq to its default pin-assertion IRQ */
dev->irq = desc->pci.msi_attrib.default_irq;
@@ -789,7 +787,7 @@ static int __pci_enable_msix(struct pci_

static void pci_msix_shutdown(struct pci_dev *dev)
{
- struct msi_desc *entry;
+ struct msi_desc *desc;

if (!pci_msi_enable || !dev || !dev->msix_enabled)
return;
@@ -800,8 +798,8 @@ static void pci_msix_shutdown(struct pci
}

/* Return the device with MSI-X masked as initial states */
- for_each_pci_msi_entry(entry, dev)
- pci_msix_mask(entry);
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_ALL)
+ pci_msix_mask(desc);

pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
pci_intx_for_msi(dev, 1);

Thomas Gleixner

Nov 26, 2021, 8:22:48 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/x86/pci/xen.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -184,7 +184,7 @@ static int xen_setup_msi_irqs(struct pci
if (ret)
goto error;
i = 0;
- for_each_pci_msi_entry(msidesc, dev) {
+ msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
irq = xen_bind_pirq_msi_to_irq(dev, msidesc, v[i],
(type == PCI_CAP_ID_MSI) ? nvec : 1,
(type == PCI_CAP_ID_MSIX) ?
@@ -235,7 +235,7 @@ static int xen_hvm_setup_msi_irqs(struct
if (type == PCI_CAP_ID_MSI && nvec > 1)
return 1;

- for_each_pci_msi_entry(msidesc, dev) {
+ msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
pirq = xen_allocate_pirq_msi(dev, msidesc);
if (pirq < 0) {
irq = -ENODEV;
@@ -270,7 +270,7 @@ static int xen_initdom_setup_msi_irqs(st
int ret = 0;
struct msi_desc *msidesc;

- for_each_pci_msi_entry(msidesc, dev) {
+ msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
struct physdev_map_pirq map_irq;
domid_t domid;

@@ -389,11 +389,9 @@ static void xen_teardown_msi_irqs(struct
struct msi_desc *msidesc;
int i;

- for_each_pci_msi_entry(msidesc, dev) {
- if (msidesc->irq) {
- for (i = 0; i < msidesc->nvec_used; i++)
- xen_destroy_irq(msidesc->irq + i);
- }
+ msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_ASSOCIATED) {
+ for (i = 0; i < msidesc->nvec_used; i++)
+ xen_destroy_irq(msidesc->irq + i);
}
}


Thomas Gleixner

Nov 26, 2021, 8:22:50 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/xen-pcifront.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -262,7 +262,7 @@ static int pci_frontend_enable_msix(stru
}

i = 0;
- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_NOTASSOCIATED) {
op.msix_entries[i].entry = entry->msi_index;
/* Vector is useless at this point. */
op.msix_entries[i].vector = -1;

Thomas Gleixner

Nov 26, 2021, 8:22:51 PM
to LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
Cc: linux...@vger.kernel.org
Cc: Heiko Carstens <h...@linux.ibm.com>
Cc: Christian Borntraeger <bornt...@de.ibm.com>
---
arch/s390/pci/pci_irq.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

--- a/arch/s390/pci/pci_irq.c
+++ b/arch/s390/pci/pci_irq.c
@@ -303,7 +303,7 @@ int arch_setup_msi_irqs(struct pci_dev *

/* Request MSI interrupts */
hwirq = bit;
- for_each_pci_msi_entry(msi, pdev) {
+ msi_for_each_desc(msi, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
rc = -EIO;
if (hwirq - bit >= msi_vecs)
break;
@@ -362,9 +362,7 @@ void arch_teardown_msi_irqs(struct pci_d
return;

/* Release MSI interrupts */
- for_each_pci_msi_entry(msi, pdev) {
- if (!msi->irq)
- continue;
+ msi_for_each_desc(msi, &pdev->dev, MSI_DESC_ASSOCIATED) {
irq_set_msi_desc(msi->irq, NULL);
irq_free_desc(msi->irq);
msi->msg.address_lo = 0;

Thomas Gleixner, Nov 26, 2021, 8:22:53 PM

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/platforms/4xx/hsta_msi.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

--- a/arch/powerpc/platforms/4xx/hsta_msi.c
+++ b/arch/powerpc/platforms/4xx/hsta_msi.c
@@ -47,7 +47,7 @@ static int hsta_setup_msi_irqs(struct pc
return -EINVAL;
}

- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_NOTASSOCIATED) {
irq = msi_bitmap_alloc_hwirqs(&ppc4xx_hsta_msi.bmp, 1);
if (irq < 0) {
pr_debug("%s: Failed to allocate msi interrupt\n",
@@ -105,10 +105,7 @@ static void hsta_teardown_msi_irqs(struc
struct msi_desc *entry;
int irq;

- for_each_pci_msi_entry(entry, dev) {
- if (!entry->irq)
- continue;
-
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_ASSOCIATED) {
irq = hsta_find_hwirq_offset(entry->irq);

/* entry->irq should always be in irq_map */

Thomas Gleixner, Nov 26, 2021, 8:22:54 PM

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/platforms/cell/axon_msi.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

--- a/arch/powerpc/platforms/cell/axon_msi.c
+++ b/arch/powerpc/platforms/cell/axon_msi.c
@@ -265,7 +265,7 @@ static int axon_msi_setup_msi_irqs(struc
if (rc)
return rc;

- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_NOTASSOCIATED) {
virq = irq_create_direct_mapping(msic->irq_domain);
if (!virq) {
dev_warn(&dev->dev,
@@ -288,10 +288,7 @@ static void axon_msi_teardown_msi_irqs(s

dev_dbg(&dev->dev, "axon_msi: tearing down msi irqs\n");

- for_each_pci_msi_entry(entry, dev) {
- if (!entry->irq)
- continue;
-
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_ASSOCIATED) {
irq_set_msi_desc(entry->irq, NULL);
irq_dispose_mapping(entry->irq);
}

Thomas Gleixner, Nov 26, 2021, 8:22:56 PM

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/platforms/pasemi/msi.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

--- a/arch/powerpc/platforms/pasemi/msi.c
+++ b/arch/powerpc/platforms/pasemi/msi.c
@@ -62,17 +62,12 @@ static void pasemi_msi_teardown_msi_irqs

pr_debug("pasemi_msi_teardown_msi_irqs, pdev %p\n", pdev);

- for_each_pci_msi_entry(entry, pdev) {
- if (!entry->irq)
- continue;
-
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
hwirq = virq_to_hw(entry->irq);
irq_set_msi_desc(entry->irq, NULL);
irq_dispose_mapping(entry->irq);
msi_bitmap_free_hwirqs(&msi_mpic->msi_bitmap, hwirq, ALLOC_CHUNK);
}
-
- return;
}

static int pasemi_msi_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
@@ -90,7 +85,7 @@ static int pasemi_msi_setup_msi_irqs(str
msg.address_hi = 0;
msg.address_lo = PASEMI_MSI_ADDR;

- for_each_pci_msi_entry(entry, pdev) {
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
/* Allocate 16 interrupts for now, since that's the grouping for
* affinity. This can be changed later if it turns out 32 is too
* few MSIs for someone, but restrictions will apply to how the

Thomas Gleixner, Nov 26, 2021, 8:22:58 PM

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/sysdev/fsl_msi.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

--- a/arch/powerpc/sysdev/fsl_msi.c
+++ b/arch/powerpc/sysdev/fsl_msi.c
@@ -125,17 +125,13 @@ static void fsl_teardown_msi_irqs(struct
struct fsl_msi *msi_data;
irq_hw_number_t hwirq;

- for_each_pci_msi_entry(entry, pdev) {
- if (!entry->irq)
- continue;
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
hwirq = virq_to_hw(entry->irq);
msi_data = irq_get_chip_data(entry->irq);
irq_set_msi_desc(entry->irq, NULL);
irq_dispose_mapping(entry->irq);
msi_bitmap_free_hwirqs(&msi_data->bitmap, hwirq, 1);
}
-
- return;
}

static void fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,
@@ -215,7 +211,7 @@ static int fsl_setup_msi_irqs(struct pci
}
}

- for_each_pci_msi_entry(entry, pdev) {
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
/*
* Loop over all the MSI devices until we find one that has an
* available interrupt.

Thomas Gleixner, Nov 26, 2021, 8:22:59 PM

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/sysdev/mpic_u3msi.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

--- a/arch/powerpc/sysdev/mpic_u3msi.c
+++ b/arch/powerpc/sysdev/mpic_u3msi.c
@@ -104,17 +104,12 @@ static void u3msi_teardown_msi_irqs(stru
struct msi_desc *entry;
irq_hw_number_t hwirq;

- for_each_pci_msi_entry(entry, pdev) {
- if (!entry->irq)
- continue;
-
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
hwirq = virq_to_hw(entry->irq);
irq_set_msi_desc(entry->irq, NULL);
irq_dispose_mapping(entry->irq);
msi_bitmap_free_hwirqs(&msi_mpic->msi_bitmap, hwirq, 1);
}
-
- return;
}

static int u3msi_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
@@ -136,7 +131,7 @@ static int u3msi_setup_msi_irqs(struct p
return -ENXIO;
}

- for_each_pci_msi_entry(entry, pdev) {
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
hwirq = msi_bitmap_alloc_hwirqs(&msi_mpic->msi_bitmap, 1);
if (hwirq < 0) {
pr_debug("u3msi: failed allocating hwirq\n");

Thomas Gleixner, Nov 26, 2021, 8:23:00 PM

Replace the about-to-vanish iterators and make use of the filtering. Take
the descriptor lock around the iterators.
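
The resulting idiom, sketched with a placeholder per-descriptor operation
(do_one() is not a real function):

        struct msi_desc *entry;
        int ret = 0;

        msi_lock_descs(&pdev->dev);
        msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
                ret = do_one(entry);
                if (ret)
                        break;  /* the single unlock below covers all exits */
        }
        msi_unlock_descs(&pdev->dev);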

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/controller/pci-hyperv.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3445,18 +3445,23 @@ static int hv_pci_suspend(struct hv_devi

static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
{
- struct msi_desc *entry;
struct irq_data *irq_data;
+ struct msi_desc *entry;
+ int ret = 0;

- for_each_pci_msi_entry(entry, pdev) {
+ msi_lock_descs(&pdev->dev);
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
irq_data = irq_get_irq_data(entry->irq);
- if (WARN_ON_ONCE(!irq_data))
- return -EINVAL;
+ if (WARN_ON_ONCE(!irq_data)) {
+ ret = -EINVAL;
+ break;
+ }

hv_compose_msi_msg(irq_data, &entry->msg);
}
+ msi_unlock_descs(&pdev->dev);

- return 0;
+ return ret;
}

/*

Thomas Gleixner, Nov 26, 2021, 8:23:02 PM

Replace the about-to-vanish iterators, make use of the filtering, and take
the descriptor lock around the iteration.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
Cc: Jon Mason <jdm...@kudzu.us>
Cc: Dave Jiang <dave....@intel.com>
Cc: Allen Hubbe <all...@gmail.com>
Cc: linu...@googlegroups.com
---
drivers/ntb/msi.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)

--- a/drivers/ntb/msi.c
+++ b/drivers/ntb/msi.c
@@ -108,8 +108,10 @@ int ntb_msi_setup_mws(struct ntb_dev *nt
if (!ntb->msi)
return -EINVAL;

- desc = first_msi_entry(&ntb->pdev->dev);
+ msi_lock_descs(&ntb->pdev->dev);
+ desc = msi_first_desc(&ntb->pdev->dev);
addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
+ msi_unlock_descs(&ntb->pdev->dev);

for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
@@ -281,13 +283,15 @@ int ntbm_msi_request_threaded_irq(struct
const char *name, void *dev_id,
struct ntb_msi_desc *msi_desc)
{
+ struct device *dev = &ntb->pdev->dev;
struct msi_desc *entry;
int ret;

if (!ntb->msi)
return -EINVAL;

- for_each_pci_msi_entry(entry, ntb->pdev) {
+ msi_lock_descs(dev);
+ msi_for_each_desc(entry, dev, MSI_DESC_ASSOCIATED) {
if (irq_has_action(entry->irq))
continue;

@@ -304,14 +308,17 @@ int ntbm_msi_request_threaded_irq(struct
ret = ntbm_msi_setup_callback(ntb, entry, msi_desc);
if (ret) {
devm_free_irq(&ntb->dev, entry->irq, dev_id);
- return ret;
+ goto unlock;
}

-
- return entry->irq;
+ ret = entry->irq;
+ goto unlock;
}
+ ret = -ENODEV;

- return -ENODEV;
+unlock:
+ msi_unlock_descs(dev);
+ return ret;
}
EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);


Thomas Gleixner, Nov 26, 2021, 8:23:03 PM

Protect the allocation properly and use the core allocation and free
mechanism.

No functional change intended.
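
The new allocation idiom, as a sketch with an arbitrary example index; the
descriptor mutex is expected to be held by the caller, as in the hunks below:

        struct msi_desc desc;
        int ret;

        memset(&desc, 0, sizeof(desc));
        desc.msi_index = 3;                     /* example index */
        ret = msi_add_msi_desc(dev, &desc);     /* core allocates the real one */

msi_add_msi_desc() copies the template into a freshly allocated descriptor,
so the same on-stack struct can be reused for every index.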

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/soc/ti/ti_sci_inta_msi.c | 71 +++++++++++++--------------------------
1 file changed, 25 insertions(+), 46 deletions(-)

--- a/drivers/soc/ti/ti_sci_inta_msi.c
+++ b/drivers/soc/ti/ti_sci_inta_msi.c
@@ -51,6 +51,7 @@ struct irq_domain *ti_sci_inta_msi_creat
struct irq_domain *domain;

ti_sci_inta_msi_update_chip_ops(info);
+ info->flags |= MSI_FLAG_FREE_MSI_DESCS;

domain = msi_create_irq_domain(fwnode, info, parent);
if (domain)
@@ -60,50 +61,31 @@ struct irq_domain *ti_sci_inta_msi_creat
}
EXPORT_SYMBOL_GPL(ti_sci_inta_msi_create_irq_domain);

-static void ti_sci_inta_msi_free_descs(struct device *dev)
-{
- struct msi_desc *desc, *tmp;
-
- list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
- list_del(&desc->list);
- free_msi_entry(desc);
- }
-}
-
static int ti_sci_inta_msi_alloc_descs(struct device *dev,
struct ti_sci_resource *res)
{
- struct msi_desc *msi_desc;
+ struct msi_desc msi_desc;
int set, i, count = 0;

+ memset(&msi_desc, 0, sizeof(msi_desc));
+
for (set = 0; set < res->sets; set++) {
- for (i = 0; i < res->desc[set].num; i++) {
- msi_desc = alloc_msi_entry(dev, 1, NULL);
- if (!msi_desc) {
- ti_sci_inta_msi_free_descs(dev);
- return -ENOMEM;
- }
-
- msi_desc->msi_index = res->desc[set].start + i;
- INIT_LIST_HEAD(&msi_desc->list);
- list_add_tail(&msi_desc->list, dev_to_msi_list(dev));
- count++;
+ for (i = 0; i < res->desc[set].num; i++, count++) {
+ msi_desc.msi_index = res->desc[set].start + i;
+ if (msi_add_msi_desc(dev, &msi_desc))
+ goto fail;
}
- for (i = 0; i < res->desc[set].num_sec; i++) {
- msi_desc = alloc_msi_entry(dev, 1, NULL);
- if (!msi_desc) {
- ti_sci_inta_msi_free_descs(dev);
- return -ENOMEM;
- }
-
- msi_desc->msi_index = res->desc[set].start_sec + i;
- INIT_LIST_HEAD(&msi_desc->list);
- list_add_tail(&msi_desc->list, dev_to_msi_list(dev));
- count++;
+
+ for (i = 0; i < res->desc[set].num_sec; i++, count++) {
+ msi_desc.msi_index = res->desc[set].start_sec + i;
+ if (msi_add_msi_desc(dev, &msi_desc))
+ goto fail;
}
}
-
return count;
+fail:
+ msi_free_msi_descs(dev);
+ return -ENOMEM;
}

int ti_sci_inta_msi_domain_alloc_irqs(struct device *dev,
@@ -124,20 +106,18 @@ int ti_sci_inta_msi_domain_alloc_irqs(st
if (ret)
return ret;

+ msi_lock_descs(dev);
nvec = ti_sci_inta_msi_alloc_descs(dev, res);
- if (nvec <= 0)
- return nvec;
-
- ret = msi_domain_alloc_irqs(msi_domain, dev, nvec);
- if (ret) {
- dev_err(dev, "Failed to allocate IRQs %d\n", ret);
- goto cleanup;
+ if (nvec <= 0) {
+ ret = nvec;
+ goto unlock;
}

- return 0;
-
-cleanup:
- ti_sci_inta_msi_free_descs(&pdev->dev);
+ ret = msi_domain_alloc_irqs_descs_locked(msi_domain, dev, nvec);
+ if (ret)
+ dev_err(dev, "Failed to allocate IRQs %d\n", ret);
+unlock:
+ msi_unlock_descs(dev);
return ret;
}
EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_alloc_irqs);
@@ -145,6 +125,5 @@ EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain
void ti_sci_inta_msi_domain_free_irqs(struct device *dev)
{
msi_domain_free_irqs(dev->msi.domain, dev);
- ti_sci_inta_msi_free_descs(dev);
}
EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_free_irqs);

Thomas Gleixner, Nov 26, 2021, 8:23:05 PM

The function has no users and is pointless now that the core code frees the
MSI descriptors. Potential users can simply call msi_domain_free_irqs() directly.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/soc/ti/ti_sci_inta_msi.c | 6 ------
include/linux/soc/ti/ti_sci_inta_msi.h | 1 -
2 files changed, 7 deletions(-)

--- a/drivers/soc/ti/ti_sci_inta_msi.c
+++ b/drivers/soc/ti/ti_sci_inta_msi.c
@@ -121,9 +121,3 @@ int ti_sci_inta_msi_domain_alloc_irqs(st
return ret;
}
EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_alloc_irqs);
-
-void ti_sci_inta_msi_domain_free_irqs(struct device *dev)
-{
- msi_domain_free_irqs(dev->msi.domain, dev);
-}
-EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_free_irqs);
--- a/include/linux/soc/ti/ti_sci_inta_msi.h
+++ b/include/linux/soc/ti/ti_sci_inta_msi.h
@@ -18,5 +18,4 @@ struct irq_domain
struct irq_domain *parent);
int ti_sci_inta_msi_domain_alloc_irqs(struct device *dev,
struct ti_sci_resource *res);
-void ti_sci_inta_msi_domain_free_irqs(struct device *dev);
#endif /* __INCLUDE_LINUX_IRQCHIP_TI_SCI_INTA_H */

Thomas Gleixner, Nov 26, 2021, 8:23:07 PM

Let the MSI irq domain code handle descriptor allocation and free.
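
Reduced to its essence, the change is two domain info flags (sketch of a
typical msi_create_irq_domain() caller):

        info->flags |= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |        /* core allocates descriptors */
                       MSI_FLAG_FREE_MSI_DESCS;                 /* core frees them on teardown */

        domain = msi_create_irq_domain(fwnode, info, parent);

With both flags set, the interrupt allocation path creates the simple
descriptors on demand and the free path disposes of them, which makes the
driver-private alloc/free helpers removed below redundant.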

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/bus/fsl-mc/fsl-mc-msi.c | 61 ++--------------------------------------
1 file changed, 4 insertions(+), 57 deletions(-)

--- a/drivers/bus/fsl-mc/fsl-mc-msi.c
+++ b/drivers/bus/fsl-mc/fsl-mc-msi.c
@@ -170,6 +170,7 @@ struct irq_domain *fsl_mc_msi_create_irq
fsl_mc_msi_update_dom_ops(info);
if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
fsl_mc_msi_update_chip_ops(info);
+ info->flags |= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS | MSI_FLAG_FREE_MSI_DESCS;

domain = msi_create_irq_domain(fwnode, info, parent);
if (domain)
@@ -210,45 +211,7 @@ struct irq_domain *fsl_mc_find_msi_domai
return msi_domain;
}

-static void fsl_mc_msi_free_descs(struct device *dev)
-{
- struct msi_desc *desc, *tmp;
-
- list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
- list_del(&desc->list);
- free_msi_entry(desc);
- }
-}
-
-static int fsl_mc_msi_alloc_descs(struct device *dev, unsigned int irq_count)
-
-{
- unsigned int i;
- int error;
- struct msi_desc *msi_desc;
-
- for (i = 0; i < irq_count; i++) {
- msi_desc = alloc_msi_entry(dev, 1, NULL);
- if (!msi_desc) {
- dev_err(dev, "Failed to allocate msi entry\n");
- error = -ENOMEM;
- goto cleanup_msi_descs;
- }
-
- msi_desc->msi_index = i;
- INIT_LIST_HEAD(&msi_desc->list);
- list_add_tail(&msi_desc->list, dev_to_msi_list(dev));
- }
-
- return 0;
-
-cleanup_msi_descs:
- fsl_mc_msi_free_descs(dev);
- return error;
-}
-
-int fsl_mc_msi_domain_alloc_irqs(struct device *dev,
- unsigned int irq_count)
+int fsl_mc_msi_domain_alloc_irqs(struct device *dev, unsigned int irq_count)
{
struct irq_domain *msi_domain;
int error;
@@ -261,28 +224,17 @@ int fsl_mc_msi_domain_alloc_irqs(struct
if (error)
return error;

- if (!list_empty(dev_to_msi_list(dev)))
+ if (msi_device_num_descs(dev))
return -EINVAL;

- error = fsl_mc_msi_alloc_descs(dev, irq_count);
- if (error < 0)
- return error;
-
/*
* NOTE: Calling this function will trigger the invocation of the
* its_fsl_mc_msi_prepare() callback
*/
error = msi_domain_alloc_irqs(msi_domain, dev, irq_count);

- if (error) {
+ if (error)
dev_err(dev, "Failed to allocate IRQs\n");
- goto cleanup_msi_descs;
- }
-
- return 0;
-
-cleanup_msi_descs:
- fsl_mc_msi_free_descs(dev);
return error;
}

@@ -295,9 +247,4 @@ void fsl_mc_msi_domain_free_irqs(struct
return;

msi_domain_free_irqs(msi_domain, dev);
-
- if (list_empty(dev_to_msi_list(dev)))
- return;
-
- fsl_mc_msi_free_descs(dev);
}

Thomas Gleixner, Nov 26, 2021, 8:23:09 PM

Use the core functionality for platform MSI interrupt domains. The platform
device MSI interrupt domains will be converted in a later step.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/base/platform-msi.c | 112 ++++++++++++++++++--------------------------
1 file changed, 48 insertions(+), 64 deletions(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -107,57 +107,6 @@ static void platform_msi_update_chip_ops
info->flags &= ~MSI_FLAG_LEVEL_CAPABLE;
}

-static void platform_msi_free_descs(struct device *dev, int base, int nvec)
-{
- struct msi_desc *desc, *tmp;
-
- list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
- if (desc->msi_index >= base &&
- desc->msi_index < (base + nvec)) {
- list_del(&desc->list);
- free_msi_entry(desc);
- }
- }
-}
-
-static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
- int nvec)
-{
- struct msi_desc *desc;
- int i, base = 0;
-
- if (!list_empty(dev_to_msi_list(dev))) {
- desc = list_last_entry(dev_to_msi_list(dev),
- struct msi_desc, list);
- base = desc->msi_index + 1;
- }
-
- for (i = 0; i < nvec; i++) {
- desc = alloc_msi_entry(dev, 1, NULL);
- if (!desc)
- break;
-
- desc->msi_index = base + i;
- desc->irq = virq ? virq + i : 0;
-
- list_add_tail(&desc->list, dev_to_msi_list(dev));
- }
-
- if (i != nvec) {
- /* Clean up the mess */
- platform_msi_free_descs(dev, base, nvec);
-
- return -ENOMEM;
- }
-
- return 0;
-}
-
-static int platform_msi_alloc_descs(struct device *dev, int nvec)
-{
- return platform_msi_alloc_descs_with_irq(dev, 0, nvec);
-}
-
/**
* platform_msi_create_irq_domain - Create a platform MSI interrupt domain
* @fwnode: Optional fwnode of the interrupt controller
@@ -180,7 +129,8 @@ struct irq_domain *platform_msi_create_i
platform_msi_update_dom_ops(info);
if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
platform_msi_update_chip_ops(info);
- info->flags |= MSI_FLAG_DEV_SYSFS;
+ info->flags |= MSI_FLAG_DEV_SYSFS | MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
+ MSI_FLAG_FREE_MSI_DESCS;

domain = msi_create_irq_domain(fwnode, info, parent);
if (domain)
@@ -262,20 +212,10 @@ int platform_msi_domain_alloc_irqs(struc
if (err)
return err;

- err = platform_msi_alloc_descs(dev, nvec);
- if (err)
- goto out_free_priv_data;
-
err = msi_domain_alloc_irqs(dev->msi.domain, dev, nvec);
if (err)
- goto out_free_desc;
-
- return 0;
+ platform_msi_free_priv_data(dev);

-out_free_desc:
- platform_msi_free_descs(dev, 0, nvec);
-out_free_priv_data:
- platform_msi_free_priv_data(dev);
return err;
}
EXPORT_SYMBOL_GPL(platform_msi_domain_alloc_irqs);
@@ -287,7 +227,6 @@ EXPORT_SYMBOL_GPL(platform_msi_domain_al
void platform_msi_domain_free_irqs(struct device *dev)
{
msi_domain_free_irqs(dev->msi.domain, dev);
- platform_msi_free_descs(dev, 0, MAX_DEV_MSIS);
platform_msi_free_priv_data(dev);
}
EXPORT_SYMBOL_GPL(platform_msi_domain_free_irqs);
@@ -361,6 +300,51 @@ struct irq_domain *
return NULL;
}

+static void platform_msi_free_descs(struct device *dev, int base, int nvec)
+{
+ struct msi_desc *desc, *tmp;
+
+ list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
+ if (desc->msi_index >= base &&
+ desc->msi_index < (base + nvec)) {
+ list_del(&desc->list);
+ free_msi_entry(desc);
+ }
+ }
+}
+
+static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
+ int nvec)
+{
+ struct msi_desc *desc;
+ int i, base = 0;
+
+ if (!list_empty(dev_to_msi_list(dev))) {
+ desc = list_last_entry(dev_to_msi_list(dev),
+ struct msi_desc, list);
+ base = desc->msi_index + 1;
+ }
+
+ for (i = 0; i < nvec; i++) {
+ desc = alloc_msi_entry(dev, 1, NULL);
+ if (!desc)
+ break;
+
+ desc->msi_index = base + i;
+ desc->irq = virq + i;
+
+ list_add_tail(&desc->list, dev_to_msi_list(dev));
+ }
+
+ if (i != nvec) {
+ /* Clean up the mess */
+ platform_msi_free_descs(dev, base, nvec);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
/**
* platform_msi_device_domain_free - Free interrupts associated with a platform-msi
* device domain

Thomas Gleixner, Nov 26, 2021, 8:23:10 PM

The allocation code is overly complex. It tries to keep the MSI index space
packed, which does not work once an interrupt is freed. There is no
requirement for this; the only requirement is that the MSI index is unique.

Move the MSI descriptor allocation into msi_domain_populate_irqs() and use
the Linux interrupt number as the MSI index, which fulfils the uniqueness
requirement.

This requires locking the MSI descriptors, which makes the lock order the
reverse of the regular MSI alloc/free functions vs. the domain mutex.
Assign a separate lockdep class for these MSI device domains.
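
The annotation itself is a one-liner, as in the hunk below:

        static struct lock_class_key platform_device_msi_lock_class;

        lockdep_set_class(&dev->msi.data->mutex, &platform_device_msi_lock_class);

This gives the descriptor mutex of these device domains its own lock class,
so lockdep does not flag the inverted nesting against the regular alloc/free
path as a deadlock.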

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/base/platform-msi.c | 88 +++++++++-----------------------------------
kernel/irq/msi.c | 46 +++++++++++------------
2 files changed, 40 insertions(+), 94 deletions(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -246,6 +246,8 @@ void *platform_msi_get_host_data(struct
return data->host_data;
}

+static struct lock_class_key platform_device_msi_lock_class;
+
/**
* __platform_msi_create_device_domain - Create a platform-msi device domain
*
@@ -278,6 +280,13 @@ struct irq_domain *
if (err)
return NULL;

+ /*
+ * Use a separate lock class for the MSI descriptor mutex on
+ * platform MSI device domains because the descriptor mutex nests
+ * into the domain mutex. See alloc/free below.
+ */
+ lockdep_set_class(&dev->msi.data->mutex, &platform_device_msi_lock_class);
+
data = dev->msi.data->platform_data;
data->host_data = host_data;
domain = irq_domain_create_hierarchy(dev->msi.domain, 0,
@@ -300,75 +309,23 @@ struct irq_domain *
return NULL;
}
- desc->irq = virq + i;
-
- list_add_tail(&desc->list, dev_to_msi_list(dev));
- }
-
- if (i != nvec) {
- /* Clean up the mess */
- platform_msi_free_descs(dev, base, nvec);
- return -ENOMEM;
- }
-
- return 0;
-}
-
/**
* platform_msi_device_domain_free - Free interrupts associated with a platform-msi
* device domain
*
* @domain: The platform-msi device domain
* @virq: The base irq from which to perform the free operation
- * @nvec: How many interrupts to free from @virq
+ * @nr_irqs: How many interrupts to free from @virq
*/
void platform_msi_device_domain_free(struct irq_domain *domain, unsigned int virq,
- unsigned int nvec)
+ unsigned int nr_irqs)
{
struct platform_msi_priv_data *data = domain->host_data;
- struct msi_desc *desc, *tmp;

- for_each_msi_entry_safe(desc, tmp, data->dev) {
- if (WARN_ON(!desc->irq || desc->nvec_used != 1))
- return;
- if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
- continue;
-
- irq_domain_free_irqs_common(domain, desc->irq, 1);
- list_del(&desc->list);
- free_msi_entry(desc);
- }
+ msi_lock_descs(data->dev);
+ irq_domain_free_irqs_common(domain, virq, nr_irqs);
+ msi_free_msi_descs_range(data->dev, MSI_DESC_ALL, virq, nr_irqs);
+ msi_unlock_descs(data->dev);
}

/**
@@ -377,7 +334,7 @@ void platform_msi_device_domain_free(str
*
* @domain: The platform-msi device domain
* @virq: The base irq from which to perform the allocate operation
- * @nr_irqs: How many interrupts to free from @virq
+ * @nr_irqs: How many interrupts to allocate from @virq
*
* Return 0 on success, or an error code on failure. Must be called
* with irq_domain_mutex held (which can only be done as part of a
@@ -387,16 +344,7 @@ int platform_msi_device_domain_alloc(str
unsigned int nr_irqs)
{
struct platform_msi_priv_data *data = domain->host_data;
- int err;
-
- err = platform_msi_alloc_descs_with_irq(data->dev, virq, nr_irqs);
- if (err)
- return err;
-
- err = msi_domain_populate_irqs(domain->parent, data->dev,
- virq, nr_irqs, &data->arg);
- if (err)
- platform_msi_device_domain_free(domain, virq, nr_irqs);
+ struct device *dev = data->dev;

- return err;
+ return msi_domain_populate_irqs(domain->parent, dev, virq, nr_irqs, &data->arg);
}
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -775,43 +775,41 @@ int msi_domain_prepare_irqs(struct irq_d
}

int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
- int virq, int nvec, msi_alloc_info_t *arg)
+ int virq_base, int nvec, msi_alloc_info_t *arg)
{
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;
struct msi_desc *desc;
- int ret = 0;
+ int ret, virq;

- for_each_msi_entry(desc, dev) {
- /* Don't even try the multi-MSI brain damage. */
- if (WARN_ON(!desc->irq || desc->nvec_used != 1)) {
- ret = -EINVAL;
- break;
+ msi_lock_descs(dev);
+ for (virq = virq_base; virq < virq_base + nvec; virq++) {
+ desc = alloc_msi_entry(dev, 1, NULL);
+ if (!desc) {
+ ret = -ENOMEM;
+ goto fail;
}

- if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
- continue;
+ desc->msi_index = virq;
+ desc->irq = virq;
+ list_add_tail(&desc->list, &dev->msi.data->list);
+ dev->msi.data->num_descs++;

ops->set_desc(arg, desc);
- /* Assumes the domain mutex is held! */
- ret = irq_domain_alloc_irqs_hierarchy(domain, desc->irq, 1,
- arg);
+ ret = irq_domain_alloc_irqs_hierarchy(domain, virq, 1, arg);
if (ret)
- break;
+ goto fail;

- irq_set_msi_desc_off(desc->irq, 0, desc);
- }
-
- if (ret) {
- /* Mop up the damage */
- for_each_msi_entry(desc, dev) {
- if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
- continue;
-
- irq_domain_free_irqs_common(domain, desc->irq, 1);
- }
+ irq_set_msi_desc(virq, desc);
}
+ msi_unlock_descs(dev);
+ return 0;

+fail:
+ for (--virq; virq >= virq_base; virq--)
+ irq_domain_free_irqs_common(domain, virq, 1);
+ msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, nvec);
+ msi_unlock_descs(dev);
return ret;
}


Thomas Gleixner, Nov 26, 2021, 8:23:11 PM

There is no real reason to do several loops over the MSI descriptors
instead of just one. In case of an error everything is undone anyway, so it
does not matter whether the rollback is partial or full.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
.clang-format | 1
include/linux/msi.h | 7 --
kernel/irq/msi.c | 129 +++++++++++++++++++++++++++-------------------------
3 files changed, 70 insertions(+), 67 deletions(-)

--- a/.clang-format
+++ b/.clang-format
@@ -216,7 +216,6 @@ ExperimentalAutoDetectBinPacking: false
- 'for_each_migratetype_order'
- 'for_each_msi_entry'
- 'for_each_msi_entry_safe'
- - 'for_each_msi_vector'
- 'for_each_net'
- 'for_each_net_continue_reverse'
- 'for_each_netdev'
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -263,12 +263,7 @@ static inline struct msi_desc *msi_first
list_for_each_entry((desc), dev_to_msi_list((dev)), list)
#define for_each_msi_entry_safe(desc, tmp, dev) \
list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
-#define for_each_msi_vector(desc, __irq, dev) \
- for_each_msi_entry((desc), (dev)) \
- if ((desc)->irq) \
- for (__irq = (desc)->irq; \
- __irq < ((desc)->irq + (desc)->nvec_used); \
- __irq++)
+
#ifdef CONFIG_IRQ_MSI_IOMMU
static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
{
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -873,23 +873,74 @@ static int msi_handle_pci_fail(struct ir
return allocated ? allocated : -ENOSPC;
}

+#define VIRQ_CAN_RESERVE 0x01
+#define VIRQ_ACTIVATE 0x02
+#define VIRQ_NOMASK_QUIRK 0x04
+
+static int msi_init_virq(struct irq_domain *domain, int virq, unsigned int vflags)
+{
+ struct irq_data *irqd = irq_domain_get_irq_data(domain, virq);
+ int ret;
+
+ if (!(vflags & VIRQ_CAN_RESERVE)) {
+ irqd_clr_can_reserve(irqd);
+ if (vflags & VIRQ_NOMASK_QUIRK)
+ irqd_set_msi_nomask_quirk(irqd);
+ }
+
+ if (!(vflags & VIRQ_ACTIVATE))
+ return 0;
+
+ ret = irq_domain_activate_irq(irqd, vflags & VIRQ_CAN_RESERVE);
+ if (ret)
+ return ret;
+ /*
+ * If the interrupt uses reservation mode, clear the activated bit
+ * so request_irq() will assign the final vector.
+ */
+ if (vflags & VIRQ_CAN_RESERVE)
+ irqd_clr_activated(irqd);
+ return 0;
+}
+
int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
int nvec)
{
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;
- struct irq_data *irq_data;
- struct msi_desc *desc;
msi_alloc_info_t arg = { };
+ unsigned int vflags = 0;
+ struct msi_desc *desc;
int allocated = 0;
int i, ret, virq;
- bool can_reserve;

ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
if (ret)
return ret;

- for_each_msi_entry(desc, dev) {
+ /*
+ * This flag is set by the PCI layer as we need to activate
+ * the MSI entries before the PCI layer enables MSI in the
+ * card. Otherwise the card latches a random msi message.
+ */
+ if (info->flags & MSI_FLAG_ACTIVATE_EARLY)
+ vflags |= VIRQ_ACTIVATE;
+
+ /*
+ * Interrupt can use a reserved vector and will not occupy
+ * a real device vector until the interrupt is requested.
+ */
+ if (msi_check_reservation_mode(domain, info, dev)) {
+ vflags |= VIRQ_CAN_RESERVE;
+ /*
+ * MSI affinity setting requires a special quirk (X86) when
+ * reservation mode is active.
+ */
+ if (domain->flags & IRQ_DOMAIN_MSI_NOMASK_QUIRK)
+ vflags |= VIRQ_NOMASK_QUIRK;
+ }
+
+ msi_for_each_desc(desc, dev, MSI_DESC_NOTASSOCIATED) {
ops->set_desc(&arg, desc);

virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
@@ -901,49 +952,12 @@ int __msi_domain_alloc_irqs(struct irq_d
for (i = 0; i < desc->nvec_used; i++) {
irq_set_msi_desc_off(virq, i, desc);
irq_debugfs_copy_devname(virq + i, dev);
+ ret = msi_init_virq(domain, virq + i, vflags);
+ if (ret)
+ return ret;
}
allocated++;
}
-
- can_reserve = msi_check_reservation_mode(domain, info, dev);
-
- /*
- * This flag is set by the PCI layer as we need to activate
- * the MSI entries before the PCI layer enables MSI in the
- * card. Otherwise the card latches a random msi message.
- */
- if (!(info->flags & MSI_FLAG_ACTIVATE_EARLY))
- goto skip_activate;
-
- for_each_msi_vector(desc, i, dev) {
- if (desc->irq == i) {
- virq = desc->irq;
- dev_dbg(dev, "irq [%d-%d] for MSI\n",
- virq, virq + desc->nvec_used - 1);
- }
-
- irq_data = irq_domain_get_irq_data(domain, i);
- if (!can_reserve) {
- irqd_clr_can_reserve(irq_data);
- if (domain->flags & IRQ_DOMAIN_MSI_NOMASK_QUIRK)
- irqd_set_msi_nomask_quirk(irq_data);
- }
- ret = irq_domain_activate_irq(irq_data, can_reserve);
- if (ret)
- return ret;
- }
-
-skip_activate:
- /*
- * If these interrupts use reservation mode, clear the activated bit
- * so request_irq() will assign the final vector.
- */
- if (can_reserve) {
- for_each_msi_vector(desc, i, dev) {
- irq_data = irq_domain_get_irq_data(domain, i);
- irqd_clr_activated(irq_data);
- }
- }
return 0;
}

@@ -1021,26 +1035,21 @@ int msi_domain_alloc_irqs(struct irq_dom

void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
{
- struct irq_data *irq_data;
+ struct irq_data *irqd;
struct msi_desc *desc;
int i;

- for_each_msi_vector(desc, i, dev) {
- irq_data = irq_domain_get_irq_data(domain, i);
- if (irqd_is_activated(irq_data))
- irq_domain_deactivate_irq(irq_data);
- }
-
- for_each_msi_entry(desc, dev) {
- /*
- * We might have failed to allocate an MSI early
- * enough that there is no IRQ associated to this
- * entry. If that's the case, don't do anything.
- */
- if (desc->irq) {
- irq_domain_free_irqs(desc->irq, desc->nvec_used);
- desc->irq = 0;
+ /* Only handle MSI entries which have an interrupt associated */
+ msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
+ /* Make sure all interrupts are deactivated */
+ for (i = 0; i < desc->nvec_used; i++) {
+ irqd = irq_domain_get_irq_data(domain, desc->irq + i);
+ if (irqd && irqd_is_activated(irqd))
+ irq_domain_deactivate_irq(irqd);
}
+
+ irq_domain_free_irqs(desc->irq, desc->nvec_used);
+ desc->irq = 0;
}
}


Thomas Gleixner, Nov 26, 2021, 8:23:12 PM

Use the new iterator functions and add locking where required.
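
For reference, a locked lookup with the new interfaces could look like this
(foo_get_irq() is a hypothetical user, modeled on __msi_get_virq() below):

        static int foo_get_irq(struct device *dev, unsigned int index)
        {
                struct msi_desc *desc;
                int ret = -ENOENT;

                msi_lock_descs(dev);
                msi_for_each_desc_from(desc, dev, MSI_DESC_ASSOCIATED, index) {
                        if (desc->msi_index == index) {
                                ret = desc->irq;
                                break;
                        }
                }
                msi_unlock_descs(dev);
                return ret;
        }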

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
kernel/irq/msi.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -354,6 +354,7 @@ struct msi_desc *msi_next_desc(struct de
int __msi_get_virq(struct device *dev, unsigned int index)
{
struct msi_desc *desc;
+ int ret = -ENOENT;
bool pcimsi;

if (!dev->msi.data)
@@ -361,11 +362,12 @@ int __msi_get_virq(struct device *dev, u

pcimsi = msi_device_has_property(dev, MSI_PROP_PCI_MSI);

- for_each_msi_entry(desc, dev) {
+ msi_lock_descs(dev);
+ msi_for_each_desc_from(desc, dev, MSI_DESC_ASSOCIATED, index) {
/* PCI-MSI has only one descriptor for multiple interrupts. */
if (pcimsi) {
- if (desc->irq && index < desc->nvec_used)
- return desc->irq + index;
+ if (index < desc->nvec_used)
+ ret = desc->irq + index;
break;
}

@@ -373,10 +375,13 @@ int __msi_get_virq(struct device *dev, u
* PCI-MSIX and platform MSI use a descriptor per
* interrupt.
*/
- if (desc->msi_index == index)
- return desc->irq;
+ if (desc->msi_index == index) {
+ ret = desc->irq;
+ break;
+ }
}
- return -ENOENT;
+ msi_unlock_descs(dev);
+ return ret;
}
EXPORT_SYMBOL_GPL(__msi_get_virq);

@@ -407,7 +412,7 @@ static const struct attribute_group **ms
int i;

/* Determine how many msi entries we have */
- for_each_msi_entry(entry, dev)
+ msi_for_each_desc(entry, dev, MSI_DESC_ALL)
num_msi += entry->nvec_used;
if (!num_msi)
return NULL;
@@ -417,7 +422,7 @@ static const struct attribute_group **ms
if (!msi_attrs)
return ERR_PTR(-ENOMEM);

- for_each_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, dev, MSI_DESC_ALL) {
for (i = 0; i < entry->nvec_used; i++) {
msi_dev_attr = kzalloc(sizeof(*msi_dev_attr), GFP_KERNEL);
if (!msi_dev_attr)
@@ -838,7 +843,7 @@ static bool msi_check_reservation_mode(s
* Checking the first MSI descriptor is sufficient. MSIX supports
* masking and MSI does so when the can_mask attribute is set.
*/
- desc = first_msi_entry(dev);
+ desc = msi_first_desc(dev);
return desc->pci.msi_attrib.is_msix || desc->pci.msi_attrib.can_mask;
}


Thomas Gleixner, Nov 26, 2021, 8:23:14 PM

Get rid of the old iterators and alloc/free functions, and adjust the core
code accordingly.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 15 ---------------
kernel/irq/msi.c | 31 +++++++++++++++----------------
2 files changed, 15 insertions(+), 31 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -254,15 +254,7 @@ static inline struct msi_desc *msi_first
#define msi_for_each_desc(desc, dev, filter) \
msi_for_each_desc_from(desc, dev, filter, 0)

-/* Helpers to hide struct msi_desc implementation details */
#define msi_desc_to_dev(desc) ((desc)->dev)
-#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
-#define first_msi_entry(dev) \
- list_first_entry(dev_to_msi_list((dev)), struct msi_desc, list)
-#define for_each_msi_entry(desc, dev) \
- list_for_each_entry((desc), dev_to_msi_list((dev)), list)
-#define for_each_msi_entry_safe(desc, tmp, dev) \
- list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)

#ifdef CONFIG_IRQ_MSI_IOMMU
static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
@@ -288,10 +280,6 @@ static inline void msi_desc_set_iommu_co
#endif

#ifdef CONFIG_PCI_MSI
-#define first_pci_msi_entry(pdev) first_msi_entry(&(pdev)->dev)
-#define for_each_pci_msi_entry(desc, pdev) \
- for_each_msi_entry((desc), &(pdev)->dev)
-
struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc);
void pci_write_msi_msg(unsigned int irq, struct msi_msg *msg);
#else /* CONFIG_PCI_MSI */
@@ -314,9 +302,6 @@ static inline void msi_free_msi_descs(st
msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, UINT_MAX);
}

-struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
- const struct irq_affinity_desc *affinity);
-void free_msi_entry(struct msi_desc *entry);
void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg);

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -19,8 +19,10 @@

#include "internals.h"

+#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
+
/**
- * alloc_msi_entry - Allocate an initialized msi_desc
+ * msi_alloc_desc - Allocate an initialized msi_desc
* @dev: Pointer to the device for which this is allocated
* @nvec: The number of vectors used in this entry
* @affinity: Optional pointer to an affinity mask array size of @nvec
@@ -30,12 +32,11 @@
*
* Return: pointer to allocated &msi_desc on success or %NULL on failure
*/
-struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
- const struct irq_affinity_desc *affinity)
+static struct msi_desc *msi_alloc_desc(struct device *dev, int nvec,
+ const struct irq_affinity_desc *affinity)
{
- struct msi_desc *desc;
+ struct msi_desc *desc = kzalloc(sizeof(*desc), GFP_KERNEL);

- desc = kzalloc(sizeof(*desc), GFP_KERNEL);
if (!desc)
return NULL;

@@ -43,21 +44,19 @@ struct msi_desc *alloc_msi_entry(struct
desc->dev = dev;
desc->nvec_used = nvec;
if (affinity) {
- desc->affinity = kmemdup(affinity,
- nvec * sizeof(*desc->affinity), GFP_KERNEL);
+ desc->affinity = kmemdup(affinity, nvec * sizeof(*desc->affinity), GFP_KERNEL);
if (!desc->affinity) {
kfree(desc);
return NULL;
}
}
-
return desc;
}

-void free_msi_entry(struct msi_desc *entry)
+static void msi_free_desc(struct msi_desc *desc)
{
- kfree(entry->affinity);
- kfree(entry);
+ kfree(desc->affinity);
+ kfree(desc);
}

/**
@@ -73,7 +72,7 @@ int msi_add_msi_desc(struct device *dev,

lockdep_assert_held(&dev->msi.data->mutex);

- desc = alloc_msi_entry(dev, init_desc->nvec_used, init_desc->affinity);
+ desc = msi_alloc_desc(dev, init_desc->nvec_used, init_desc->affinity);
if (!desc)
return -ENOMEM;

@@ -103,7 +102,7 @@ int msi_add_simple_msi_descs(struct devi
lockdep_assert_held(&dev->msi.data->mutex);

for (i = 0; i < ndesc; i++) {
- desc = alloc_msi_entry(dev, 1, NULL);
+ desc = msi_alloc_desc(dev, 1, NULL);
if (!desc)
goto fail;
desc->msi_index = index + i;
@@ -116,7 +115,7 @@ int msi_add_simple_msi_descs(struct devi
fail:
list_for_each_entry_safe(desc, tmp, &list, list) {
list_del(&desc->list);
- free_msi_entry(desc);
+ msi_free_desc(desc);
}
return -ENOMEM;
}
@@ -143,7 +142,7 @@ void msi_free_msi_descs_range(struct dev
if (desc->msi_index < base_index || desc->msi_index >= base_index + ndesc)
continue;
list_del(&desc->list);
- free_msi_entry(desc);
+ msi_free_desc(desc);
dev->msi.data->num_descs--;
}
}
@@ -779,7 +778,7 @@ int msi_domain_populate_irqs(struct irq_

msi_lock_descs(dev);
for (virq = virq_base; virq < virq_base + nvec; virq++) {
- desc = alloc_msi_entry(dev, 1, NULL);
+ desc = msi_alloc_desc(dev, 1, NULL);
if (!desc) {
ret = -ENOMEM;
goto fail;

Thomas Gleixner, Nov 26, 2021, 8:23:16 PM

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -2,6 +2,20 @@
#ifndef LINUX_MSI_H
#define LINUX_MSI_H

+/*
+ * This header file contains MSI data structures and functions which are
+ * only relevant for:
+ * - Interrupt core code
+ * - PCI/MSI core code
+ * - MSI interrupt domain implementations
+ * - IOMMU, low level VFIO, NTB and other justified exceptions
+ * dealing with low level MSI details.
+ *
+ * Regular device drivers have no business with any of these functions and
+ * especially storing MSI descriptor pointers in random code is considered
+ * abuse. The only function which is relevant for drivers is msi_get_virq().
+ */
+
#include <linux/spinlock.h>
#include <linux/mutex.h>
#include <linux/list.h>

Thomas Gleixner, Nov 26, 2021, 8:23:17 PM

The sysfs handling for MSI is a convoluted maze and stands in the way of
supporting dynamic expansion of the MSI-X vectors because it only supports
a one-off bulk population/free of the sysfs entries.

Change it to do:

1) Create an empty sysfs attribute group when msi_device_data is
allocated

2) Populate the entries when the MSI descriptor is initialized

3) Free the entries when an MSI descriptor is detached from a Linux
interrupt.

4) Provide functions for the legacy non-irqdomain fallback code to
do a bulk population/free. This code won't support dynamic
expansion.

This makes the code simpler and reduces the number of allocations as the
empty attribute group can be shared.
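
Per-descriptor population then boils down to adding plain files to the
shared group (condensed sketch of the msi_sysfs_populate_desc() hunk below):

        sysfs_attr_init(&attrs[i].attr);
        attrs[i].attr.name = kasprintf(GFP_KERNEL, "%d", desc->irq + i);
        attrs[i].attr.mode = 0444;
        attrs[i].show = msi_mode_show;
        ret = sysfs_add_file_to_group(&dev->kobj, &attrs[i].attr, "msi_irqs");

sysfs_add_file_to_group() attaches the attribute to the already registered
"msi_irqs" group, so no attribute_group has to be allocated per device.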

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 7 +
kernel/irq/msi.c | 196 +++++++++++++++++++++++-----------------------------
2 files changed, 95 insertions(+), 108 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -72,6 +72,7 @@ struct irq_data;
struct msi_desc;
struct pci_dev;
struct platform_msi_priv_data;
+struct device_attribute;

void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
#ifdef CONFIG_GENERIC_MSI_IRQ
@@ -127,6 +128,7 @@ struct pci_msi_desc {
* @dev: Pointer to the device which uses this descriptor
* @msg: The last set MSI message cached for reuse
* @affinity: Optional pointer to a cpu affinity mask for this descriptor
+ * @sysfs_attr: Pointer to sysfs device attribute
*
* @write_msi_msg: Callback that may be called when the MSI message
* address or data changes
@@ -146,6 +148,9 @@ struct msi_desc {
#ifdef CONFIG_IRQ_MSI_IOMMU
const void *iommu_cookie;
#endif
+#ifdef CONFIG_SYSFS
+ struct device_attribute *sysfs_attrs;
+#endif

void (*write_msi_msg)(struct msi_desc *entry, void *data);
void *write_msi_msg_data;
@@ -171,7 +176,6 @@ enum msi_desc_filter {
* @lock: Spinlock to protect register access
* @properties: MSI properties which are interesting to drivers
* @num_descs: The number of allocated MSI descriptors for the device
- * @attrs: Pointer to the sysfs attribute group
* @platform_data: Platform-MSI specific data
* @list: List of MSI descriptors associated to the device
* @mutex: Mutex protecting the MSI list
@@ -182,7 +186,6 @@ struct msi_device_data {
raw_spinlock_t lock;
unsigned long properties;
unsigned int num_descs;
- const struct attribute_group **attrs;
struct platform_msi_priv_data *platform_data;
struct list_head list;
struct mutex mutex;
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -19,6 +19,7 @@

#include "internals.h"

+static inline int msi_sysfs_create_group(struct device *dev);
#define dev_to_msi_list(dev) (&(dev)->msi.data->list)

/**
@@ -208,6 +209,7 @@ static void msi_device_data_release(stru
int msi_setup_device_data(struct device *dev)
{
struct msi_device_data *md;
+ int ret;

if (dev->msi.data)
return 0;
@@ -216,6 +218,12 @@ int msi_setup_device_data(struct device
if (!md)
return -ENOMEM;

+ ret = msi_sysfs_create_group(dev);
+ if (ret) {
+ devres_free(md);
+ return ret;
+ }
+
raw_spin_lock_init(&md->lock);
INIT_LIST_HEAD(&md->list);
mutex_init(&md->mutex);
@@ -395,6 +403,20 @@ int __msi_get_virq(struct device *dev, u
EXPORT_SYMBOL_GPL(__msi_get_virq);

#ifdef CONFIG_SYSFS
+static struct attribute *msi_dev_attrs[] = {
+ NULL
+};
+
+static const struct attribute_group msi_irqs_group = {
+ .name = "msi_irqs",
+ .attrs = msi_dev_attrs,
+};
+
+static inline int msi_sysfs_create_group(struct device *dev)
+{
+ return devm_device_add_group(dev, &msi_irqs_group);
+}
+
static ssize_t msi_mode_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -404,97 +426,74 @@ static ssize_t msi_mode_show(struct devi
return sysfs_emit(buf, "%s\n", is_msix ? "msix" : "msi");
}

-/**
- * msi_populate_sysfs - Populate msi_irqs sysfs entries for devices
- * @dev: The device(PCI, platform etc) who will get sysfs entries
- */
-static const struct attribute_group **msi_populate_sysfs(struct device *dev)
+static void msi_sysfs_remove_desc(struct device *dev, struct msi_desc *desc)
{
- const struct attribute_group **msi_irq_groups;
- struct attribute **msi_attrs, *msi_attr;
- struct device_attribute *msi_dev_attr;
- struct attribute_group *msi_irq_group;
- struct msi_desc *entry;
- int ret = -ENOMEM;
- int num_msi = 0;
- int count = 0;
+ struct device_attribute *attrs = desc->sysfs_attrs;
int i;

- /* Determine how many msi entries we have */
- msi_for_each_desc(entry, dev, MSI_DESC_ALL)
- num_msi += entry->nvec_used;
- if (!num_msi)
- return NULL;
+ if (!attrs)
+ return;

- /* Dynamically create the MSI attributes for the device */
- msi_attrs = kcalloc(num_msi + 1, sizeof(void *), GFP_KERNEL);
- if (!msi_attrs)
- return ERR_PTR(-ENOMEM);
-
- msi_for_each_desc(entry, dev, MSI_DESC_ALL) {
- for (i = 0; i < entry->nvec_used; i++) {
- msi_dev_attr = kzalloc(sizeof(*msi_dev_attr), GFP_KERNEL);
- if (!msi_dev_attr)
- goto error_attrs;
- msi_attrs[count] = &msi_dev_attr->attr;
-
- sysfs_attr_init(&msi_dev_attr->attr);
- msi_dev_attr->attr.name = kasprintf(GFP_KERNEL, "%d",
- entry->irq + i);
- if (!msi_dev_attr->attr.name)
- goto error_attrs;
- msi_dev_attr->attr.mode = 0444;
- msi_dev_attr->show = msi_mode_show;
- ++count;
- }
+ desc->sysfs_attrs = NULL;
+ for (i = 0; i < desc->nvec_used; i++) {
+ if (attrs[i].show)
+ sysfs_remove_file_from_group(&dev->kobj, &attrs[i].attr, msi_irqs_group.name);
+ kfree(attrs[i].attr.name);
}
+ kfree(attrs);
+}

- msi_irq_group = kzalloc(sizeof(*msi_irq_group), GFP_KERNEL);
- if (!msi_irq_group)
- goto error_attrs;
- msi_irq_group->name = "msi_irqs";
- msi_irq_group->attrs = msi_attrs;
-
- msi_irq_groups = kcalloc(2, sizeof(void *), GFP_KERNEL);
- if (!msi_irq_groups)
- goto error_irq_group;
- msi_irq_groups[0] = msi_irq_group;
+static int msi_sysfs_populate_desc(struct device *dev, struct msi_desc *desc)
+{
+ struct device_attribute *attrs;
+ int ret, i;

- ret = sysfs_create_groups(&dev->kobj, msi_irq_groups);
- if (ret)
- goto error_irq_groups;
+ attrs = kcalloc(desc->nvec_used, sizeof(*attrs), GFP_KERNEL);
+ if (!attrs)
+ return -ENOMEM;
+
+ desc->sysfs_attrs = attrs;
+ for (i = 0; i < desc->nvec_used; i++) {
+ sysfs_attr_init(&attrs[i].attr);
+ attrs[i].attr.name = kasprintf(GFP_KERNEL, "%d", desc->irq + i);
+ if (!attrs[i].attr.name) {
+ ret = -ENOMEM;
+ goto fail;
+ }

- return msi_irq_groups;
+ attrs[i].attr.mode = 0444;
+ attrs[i].show = msi_mode_show;

-error_irq_groups:
- kfree(msi_irq_groups);
-error_irq_group:
- kfree(msi_irq_group);
-error_attrs:
- count = 0;
- msi_attr = msi_attrs[count];
- while (msi_attr) {
- msi_dev_attr = container_of(msi_attr, struct device_attribute, attr);
- kfree(msi_attr->name);
- kfree(msi_dev_attr);
- ++count;
- msi_attr = msi_attrs[count];
+ ret = sysfs_add_file_to_group(&dev->kobj, &attrs[i].attr, msi_irqs_group.name);
+ if (ret) {
+ attrs[i].show = NULL;
+ goto fail;
+ }
}
- kfree(msi_attrs);
- return ERR_PTR(ret);
+ return 0;
+
+fail:
+ msi_sysfs_remove_desc(dev, desc);
+ return ret;
}

+#ifdef CONFIG_PCI_MSI_ARCH_FALLBACK
/**
* msi_device_populate_sysfs - Populate msi_irqs sysfs entries for a device
* @dev: The device(PCI, platform etc) which will get sysfs entries
*/
int msi_device_populate_sysfs(struct device *dev)
{
- const struct attribute_group **group = msi_populate_sysfs(dev);
+ struct msi_desc *desc;
+ int ret;

- if (IS_ERR(group))
- return PTR_ERR(group);
- dev->msi.data->attrs = group;
+ msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
+ if (desc->sysfs_attrs)
+ continue;
+ ret = msi_sysfs_populate_desc(dev, desc);
+ if (ret)
+ return ret;
+ }
return 0;
}

@@ -505,28 +504,17 @@ int msi_device_populate_sysfs(struct dev
*/
void msi_device_destroy_sysfs(struct device *dev)
{
- const struct attribute_group **msi_irq_groups = dev->msi.data->attrs;
- struct device_attribute *dev_attr;
- struct attribute **msi_attrs;
- int count = 0;
-
- dev->msi.data->attrs = NULL;
- if (!msi_irq_groups)
- return;
+ struct msi_desc *desc;

- sysfs_remove_groups(&dev->kobj, msi_irq_groups);
- msi_attrs = msi_irq_groups[0]->attrs;
- while (msi_attrs[count]) {
- dev_attr = container_of(msi_attrs[count], struct device_attribute, attr);
- kfree(dev_attr->attr.name);
- kfree(dev_attr);
- ++count;
- }
- kfree(msi_attrs);
- kfree(msi_irq_groups[0]);
- kfree(msi_irq_groups);
+ msi_for_each_desc(desc, dev, MSI_DESC_ALL)
+ msi_sysfs_remove_desc(dev, desc);
}
-#endif
+#endif /* CONFIG_PCI_MSI_ARCH_FALLBACK */
+#else /* CONFIG_SYSFS */
+static inline int msi_sysfs_create_group(struct device *dev) { return 0; }
+static inline int msi_sysfs_populate_desc(struct device *dev, struct msi_desc *desc) { return 0; }
+static inline void msi_sysfs_remove_desc(struct device *dev, struct msi_desc *desc) { }
+#endif /* !CONFIG_SYSFS */

#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
static inline void irq_chip_write_msi_msg(struct irq_data *data,
@@ -959,6 +947,12 @@ int __msi_domain_alloc_irqs(struct irq_d
ret = msi_init_virq(domain, virq + i, vflags);
if (ret)
return ret;
+
+ if (info->flags & MSI_FLAG_DEV_SYSFS) {
+ ret = msi_sysfs_populate_desc(dev, desc);
+ if (ret)
+ return ret;
+ }
}
allocated++;
}
@@ -1003,18 +997,7 @@ int msi_domain_alloc_irqs_descs_locked(s

ret = ops->domain_alloc_irqs(domain, dev, nvec);
if (ret)
- goto cleanup;
-
- if (!(info->flags & MSI_FLAG_DEV_SYSFS))
- return 0;
-
- ret = msi_device_populate_sysfs(dev);
- if (ret)
- goto cleanup;
- return 0;
-
-cleanup:
- msi_domain_free_irqs_descs_locked(domain, dev);
+ msi_domain_free_irqs_descs_locked(domain, dev);
return ret;
}

@@ -1039,6 +1022,7 @@ int msi_domain_alloc_irqs(struct irq_dom

void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
{
+ struct msi_domain_info *info = domain->host_data;
struct irq_data *irqd;
struct msi_desc *desc;
int i;
@@ -1053,6 +1037,8 @@ void __msi_domain_free_irqs(struct irq_d
}

irq_domain_free_irqs(desc->irq, desc->nvec_used);
+ if (info->flags & MSI_FLAG_DEV_SYSFS)
+ msi_sysfs_remove_desc(dev, desc);
desc->irq = 0;
}
}
@@ -1081,8 +1067,6 @@ void msi_domain_free_irqs_descs_locked(s

lockdep_assert_held(&dev->msi.data->mutex);

- if (info->flags & MSI_FLAG_DEV_SYSFS)
- msi_device_destroy_sysfs(dev);
ops->domain_free_irqs(domain, dev);
msi_domain_free_msi_descs(info, dev);
}

Thomas Gleixner, Nov 26, 2021, 8:23:19 PM

The current linked list storage for MSI descriptors is suboptimal in
several ways:

1) Looking up an MSI descriptor requires an O(n) list walk in the worst case

2) The upcoming support for runtime expansion of MSI-X vectors would need
to do a full list walk to figure out whether a particular index is
already associated.

3) Runtime expansion of sparse allocations is even more complex as the
current implementation assumes an ordered list (increasing MSI index).

Use an xarray, which solves all of the above problems nicely.
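
A sketch of why the xarray fits, with the store indexed by the MSI index:

        /* 1) Direct lookup instead of a list walk */
        desc = xa_load(&dev->msi.data->store, index);

        /* 2) and 3) Sparse insertion in any order; xa_insert() fails with
         * -EBUSY when the index is already occupied, which answers the
         * "is this index already in use?" question directly.
         */
        ret = xa_insert(&dev->msi.data->store, index, desc, GFP_KERNEL);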

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 19 ++---
kernel/irq/msi.c | 188 ++++++++++++++++++++++------------------------------
2 files changed, 90 insertions(+), 117 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -17,6 +17,7 @@
*/

#include <linux/spinlock.h>
+#include <linux/xarray.h>
#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/bits.h>
@@ -122,7 +123,6 @@ struct pci_msi_desc {

/**
* struct msi_desc - Descriptor structure for MSI based interrupts
- * @list: List head for management
* @irq: The base interrupt number
* @nvec_used: The number of vectors used
* @dev: Pointer to the device which uses this descriptor
@@ -139,7 +139,6 @@ struct pci_msi_desc {
*/
struct msi_desc {
/* Shared device/bus type independent data */
- struct list_head list;
unsigned int irq;
unsigned int nvec_used;
struct device *dev;
@@ -177,20 +176,20 @@ enum msi_desc_filter {
* @properties: MSI properties which are interesting to drivers
* @num_descs: The number of allocated MSI descriptors for the device
* @platform_data: Platform-MSI specific data
- * @list: List of MSI descriptors associated to the device
- * @mutex: Mutex protecting the MSI list
- * @__next: Cached pointer to the next entry for iterators
- * @__filter: Cached descriptor filter
+ * @mutex: Mutex protecting the MSI descriptor store
+ * @store: Xarray for storing MSI descriptor pointers
+ * @__iter_idx: Index to search the next entry for iterators
+ * @__iter_filter: Cached descriptor filter
*/
struct msi_device_data {
raw_spinlock_t lock;
unsigned long properties;
unsigned int num_descs;
struct platform_msi_priv_data *platform_data;
- struct list_head list;
struct mutex mutex;
- struct msi_desc *__next;
- enum msi_desc_filter __filter;
+ struct xarray store;
+ unsigned long __iter_idx;
+ enum msi_desc_filter __iter_filter;
};

int msi_setup_device_data(struct device *dev);
@@ -266,7 +265,7 @@ static inline struct msi_desc *msi_first
* @dev: struct device pointer - device to iterate
* @filter: Filter for descriptor selection
*
- * See msi_for_each_desc_from()for further information.
+ * See msi_for_each_desc_from() for further information.
*/
#define msi_for_each_desc(desc, dev, filter) \
msi_for_each_desc_from(desc, dev, filter, 0)
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -20,7 +20,6 @@
#include "internals.h"

static inline int msi_sysfs_create_group(struct device *dev);
-#define dev_to_msi_list(dev) (&(dev)->msi.data->list)

/**
* msi_alloc_desc - Allocate an initialized msi_desc
@@ -41,7 +40,6 @@ static struct msi_desc *msi_alloc_desc(s
if (!desc)
return NULL;

- INIT_LIST_HEAD(&desc->list);
desc->dev = dev;
desc->nvec_used = nvec;
if (affinity) {
@@ -60,6 +58,19 @@ static void msi_free_desc(struct msi_des
kfree(desc);
}

+static int msi_insert_desc(struct msi_device_data *md, struct msi_desc *desc, unsigned int index)
+{
+ int ret;
+
+ desc->msi_index = index;
+ ret = xa_insert(&md->store, index, desc, GFP_KERNEL);
+ if (!ret)
+ md->num_descs++;
+ else
+ msi_free_desc(desc);
+ return ret;
+}
+
/**
* msi_add_msi_desc - Allocate and initialize a MSI descriptor
* @dev: Pointer to the device for which the descriptor is allocated
@@ -77,13 +88,9 @@ int msi_add_msi_desc(struct device *dev,
if (!desc)
return -ENOMEM;

- /* Copy the MSI index and type specific data to the new descriptor. */
- desc->msi_index = init_desc->msi_index;
+ /* Copy type specific data to the new descriptor. */
desc->pci = init_desc->pci;
-
- list_add_tail(&desc->list, &dev->msi.data->list);
- dev->msi.data->num_descs++;
- return 0;
+ return msi_insert_desc(dev->msi.data, desc, init_desc->msi_index);
}

/**
@@ -96,29 +103,41 @@ int msi_add_msi_desc(struct device *dev,
*/
static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
{
- struct msi_desc *desc, *tmp;
- LIST_HEAD(list);
- unsigned int i;
+ struct msi_desc *desc;
+ unsigned long i;
+ int ret;

lockdep_assert_held(&dev->msi.data->mutex);

for (i = 0; i < ndesc; i++) {
desc = msi_alloc_desc(dev, 1, NULL);
if (!desc)
+ goto fail_mem;
+ ret = msi_insert_desc(dev->msi.data, desc, index + i);
+ if (ret)
goto fail;
- desc->msi_index = index + i;
- list_add_tail(&desc->list, &list);
}
- list_splice_tail(&list, &dev->msi.data->list);
- dev->msi.data->num_descs += ndesc;
return 0;

+fail_mem:
+ ret = -ENOMEM;
fail:
- list_for_each_entry_safe(desc, tmp, &list, list) {
- list_del(&desc->list);
- msi_free_desc(desc);
+ msi_free_msi_descs_range(dev, MSI_DESC_NOTASSOCIATED, index, ndesc);
+ return ret;
+}
+
+static bool msi_desc_match(struct msi_desc *desc, enum msi_desc_filter filter)
+{
+ switch (filter) {
+ case MSI_DESC_ALL:
+ return true;
+ case MSI_DESC_NOTASSOCIATED:
+ return !desc->irq;
+ case MSI_DESC_ASSOCIATED:
+ return !!desc->irq;
}
- return -ENOMEM;
+ WARN_ON_ONCE(1);
+ return false;
}

/**
@@ -132,19 +151,16 @@ void msi_free_msi_descs_range(struct dev
unsigned int base_index, unsigned int ndesc)
{
struct msi_desc *desc;
+ unsigned long idx;

lockdep_assert_held(&dev->msi.data->mutex);

- msi_for_each_desc(desc, dev, filter) {
- /*
- * Stupid for now to handle MSI device domain until the
- * storage is switched over to an xarray.
- */
- if (desc->msi_index < base_index || desc->msi_index >= base_index + ndesc)
- continue;
- list_del(&desc->list);
- msi_free_desc(desc);
- dev->msi.data->num_descs--;
+ xa_for_each_range(&dev->msi.data->store, idx, desc, base_index, base_index + ndesc - 1) {
+ if (msi_desc_match(desc, filter)) {
+ xa_erase(&dev->msi.data->store, idx);
+ msi_free_desc(desc);
+ dev->msi.data->num_descs--;
+ }
}
}

@@ -192,7 +208,8 @@ static void msi_device_data_release(stru
{
struct msi_device_data *md = res;

- WARN_ON_ONCE(!list_empty(&md->list));
+ WARN_ON_ONCE(!xa_empty(&md->store));
+ xa_destroy(&md->store);
dev->msi.data = NULL;
}

@@ -225,7 +242,7 @@ int msi_setup_device_data(struct device
}

raw_spin_lock_init(&md->lock);
- INIT_LIST_HEAD(&md->list);
+ xa_init(&md->store);
mutex_init(&md->mutex);
dev->msi.data = md;
devres_add(dev, md);
@@ -252,38 +269,21 @@ void msi_unlock_descs(struct device *dev
{
if (WARN_ON_ONCE(!dev->msi.data))
return;
- /* Clear the next pointer which was cached by the iterator */
- dev->msi.data->__next = NULL;
+	/* Invalidate the index which was cached by the iterator */
+ dev->msi.data->__iter_idx = ULONG_MAX;
mutex_unlock(&dev->msi.data->mutex);
}
EXPORT_SYMBOL_GPL(msi_unlock_descs);

-static bool msi_desc_match(struct msi_desc *desc, enum msi_desc_filter filter)
-{
- switch (filter) {
- case MSI_DESC_ALL:
- return true;
- case MSI_DESC_NOTASSOCIATED:
- return !desc->irq;
- case MSI_DESC_ASSOCIATED:
- return !!desc->irq;
- }
- WARN_ON_ONCE(1);
- return false;
-}
-
-static struct msi_desc *msi_find_first_desc(struct device *dev, enum msi_desc_filter filter,
- unsigned int base_index)
+static struct msi_desc *msi_find_desc(struct msi_device_data *md)
{
struct msi_desc *desc;

- list_for_each_entry(desc, dev_to_msi_list(dev), list) {
- if (desc->msi_index < base_index)
- continue;
- if (msi_desc_match(desc, filter))
- return desc;
+ xa_for_each_start(&md->store, md->__iter_idx, desc, md->__iter_idx) {
+ if (msi_desc_match(desc, md->__iter_filter))
+ break;
}
- return NULL;
+ return desc;
}

/**
@@ -301,43 +301,25 @@ static struct msi_desc *msi_find_first_d
struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter,
unsigned int base_index)
{
- struct msi_desc *desc;
+ struct msi_device_data *md = dev->msi.data;

- if (WARN_ON_ONCE(!dev->msi.data))
+ if (WARN_ON_ONCE(!md))
return NULL;

- lockdep_assert_held(&dev->msi.data->mutex);
+ lockdep_assert_held(&md->mutex);

- /* Invalidate a previous invocation within the same lock section */
- dev->msi.data->__next = NULL;
-
- desc = msi_find_first_desc(dev, filter, base_index);
- if (desc) {
- dev->msi.data->__next = list_next_entry(desc, list);
- dev->msi.data->__filter = filter;
- }
- return desc;
+ md->__iter_filter = filter;
+ md->__iter_idx = base_index;
+ return msi_find_desc(md);
}
EXPORT_SYMBOL_GPL(__msi_first_desc);

-static struct msi_desc *__msi_next_desc(struct device *dev, enum msi_desc_filter filter,
- struct msi_desc *from)
-{
- struct msi_desc *desc = from;
-
- list_for_each_entry_from(desc, dev_to_msi_list(dev), list) {
- if (msi_desc_match(desc, filter))
- return desc;
- }
- return NULL;
-}
-
/**
* msi_next_desc - Get the next MSI descriptor of a device
* @dev: Device to operate on
*
* The first invocation of msi_next_desc() has to be preceeded by a
- * successful incovation of __msi_first_desc(). Consecutive invocations are
+ * successful invocation of __msi_first_desc(). Consecutive invocations are
* only valid if the previous one was successful. All these operations have
* to be done within the same MSI mutex held region.
*
@@ -346,20 +328,18 @@ static struct msi_desc *__msi_next_desc(
*/
struct msi_desc *msi_next_desc(struct device *dev)
{
- struct msi_device_data *data = dev->msi.data;
- struct msi_desc *desc;
+ struct msi_device_data *md = dev->msi.data;

- if (WARN_ON_ONCE(!data))
+ if (WARN_ON_ONCE(!md))
return NULL;

- lockdep_assert_held(&data->mutex);
+ lockdep_assert_held(&md->mutex);

- if (!data->__next)
+ if (md->__iter_idx == ULONG_MAX)
return NULL;

- desc = __msi_next_desc(dev, data->__filter, data->__next);
- dev->msi.data->__next = desc ? list_next_entry(desc, list) : NULL;
- return desc;
+ md->__iter_idx++;
+ return msi_find_desc(md);
}
EXPORT_SYMBOL_GPL(msi_next_desc);

@@ -384,21 +364,18 @@ int __msi_get_virq(struct device *dev, u
pcimsi = msi_device_has_property(dev, MSI_PROP_PCI_MSI);

msi_lock_descs(dev);
- msi_for_each_desc_from(desc, dev, MSI_DESC_ASSOCIATED, index) {
- /* PCI-MSI has only one descriptor for multiple interrupts. */
- if (pcimsi) {
- if (index < desc->nvec_used)
- ret = desc->irq + index;
- break;
- }
-
+ desc = xa_load(&dev->msi.data->store, pcimsi ? 0 : index);
+ if (desc && desc->irq) {
/*
+ * PCI-MSI has only one descriptor for multiple interrupts.
* PCI-MSIX and platform MSI use a descriptor per
* interrupt.
*/
- if (desc->msi_index == index) {
+ if (pcimsi) {
+ if (index < desc->nvec_used)
+ ret = desc->irq + index;
+ } else {
ret = desc->irq;
- break;
}
}
msi_unlock_descs(dev);
@@ -779,17 +756,13 @@ int msi_domain_populate_irqs(struct irq_
int ret, virq;

msi_lock_descs(dev);
- for (virq = virq_base; virq < virq_base + nvec; virq++) {
- desc = msi_alloc_desc(dev, 1, NULL);
- if (!desc) {
- ret = -ENOMEM;
- goto fail;
- }
+ ret = msi_add_simple_msi_descs(dev, virq_base, nvec);
+ if (ret)
+ goto unlock;

- desc->msi_index = virq;
+ for (virq = virq_base; virq < virq_base + nvec; virq++) {
+ desc = xa_load(&dev->msi.data->store, virq);
desc->irq = virq;
- list_add_tail(&desc->list, &dev->msi.data->list);
- dev->msi.data->num_descs++;

ops->set_desc(arg, desc);
ret = irq_domain_alloc_irqs_hierarchy(domain, virq, 1, arg);
@@ -805,6 +778,7 @@ int msi_domain_populate_irqs(struct irq_
for (--virq; virq >= virq_base; virq--)
irq_domain_free_irqs_common(domain, virq, 1);
msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, nvec);
+unlock:
msi_unlock_descs(dev);
return ret;
}

Thomas Gleixner
Nov 26, 2021, 8:23:33 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Move the MSI descriptor list from struct device into struct
msi_device_data; it's only required when MSI is in use.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/base/core.c | 3 ---
include/linux/device.h | 4 ----
include/linux/msi.h | 4 +++-
kernel/irq/msi.c | 5 ++++-
4 files changed, 7 insertions(+), 9 deletions(-)

--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2874,9 +2874,6 @@ void device_initialize(struct device *de
INIT_LIST_HEAD(&dev->devres_head);
device_pm_init(dev);
set_dev_node(dev, NUMA_NO_NODE);
-#ifdef CONFIG_GENERIC_MSI_IRQ
- INIT_LIST_HEAD(&dev->msi_list);
-#endif
INIT_LIST_HEAD(&dev->links.consumers);
INIT_LIST_HEAD(&dev->links.suppliers);
INIT_LIST_HEAD(&dev->links.defer_sync);
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -422,7 +422,6 @@ struct dev_msi_info {
* @em_pd: device's energy model performance domain
* @pins: For device pin management.
* See Documentation/driver-api/pin-control.rst for details.
- * @msi_list: Hosts MSI descriptors
* @numa_node: NUMA node this device is close to.
* @dma_ops: DMA mapping operations for this device.
* @dma_mask: Dma mask (if dma'ble device).
@@ -518,9 +517,6 @@ struct device {
struct dev_pin_info *pins;
#endif
struct dev_msi_info msi;
-#ifdef CONFIG_GENERIC_MSI_IRQ
- struct list_head msi_list;
-#endif
#ifdef CONFIG_DMA_OPS
const struct dma_map_ops *dma_ops;
#endif
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -145,12 +145,14 @@ struct msi_desc {
* @properties: MSI properties which are interesting to drivers
* @attrs: Pointer to the sysfs attribute group
* @platform_data: Platform-MSI specific data
+ * @list: List of MSI descriptors associated to the device
*/
struct msi_device_data {
raw_spinlock_t lock;
unsigned long properties;
const struct attribute_group **attrs;
struct platform_msi_priv_data *platform_data;
+ struct list_head list;
};

int msi_setup_device_data(struct device *dev);
@@ -187,7 +189,7 @@ static inline unsigned int msi_get_virq(

/* Helpers to hide struct msi_desc implementation details */
#define msi_desc_to_dev(desc) ((desc)->dev)
-#define dev_to_msi_list(dev) (&(dev)->msi_list)
+#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
#define first_msi_entry(dev) \
list_first_entry(dev_to_msi_list((dev)), struct msi_desc, list)
#define for_each_msi_entry(desc, dev) \
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -87,7 +87,9 @@ EXPORT_SYMBOL_GPL(get_cached_msi_msg);

static void msi_device_data_release(struct device *dev, void *res)
{
- WARN_ON_ONCE(!list_empty(&dev->msi_list));
+ struct msi_device_data *md = res;
+
+ WARN_ON_ONCE(!list_empty(&md->list));
dev->msi.data = NULL;
}

@@ -113,6 +115,7 @@ int msi_setup_device_data(struct device
return -ENOMEM;

raw_spin_lock_init(&md->lock);
+ INIT_LIST_HEAD(&md->list);
dev->msi.data = md;
devres_add(dev, md);
return 0;

Thomas Gleixner
Nov 26, 2021, 8:23:33 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
For upcoming runtime extensions of MSI-X interrupts, the MSI descriptor
list needs to be protected. Add a mutex to struct msi_device_data and
provide lock/unlock functions.

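A minimal usage sketch (illustrative only; the work done under the lock
is just an example):

	msi_lock_descs(dev);
	for_each_msi_entry(desc, dev) {
		/* The descriptor list is stable while the mutex is held */
		if (desc->irq)
			disable_irq(desc->irq);
	}
	msi_unlock_descs(dev);
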
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 6 ++++++
kernel/irq/msi.c | 25 +++++++++++++++++++++++++
2 files changed, 31 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -3,6 +3,7 @@
#define LINUX_MSI_H

#include <linux/spinlock.h>
+#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/bits.h>
#include <asm/msi.h>
@@ -146,6 +147,7 @@ struct msi_desc {
* @attrs: Pointer to the sysfs attribute group
* @platform_data: Platform-MSI specific data
* @list: List of MSI descriptors associated to the device
+ * @mutex: Mutex protecting the MSI list
*/
struct msi_device_data {
raw_spinlock_t lock;
@@ -153,6 +155,7 @@ struct msi_device_data {
const struct attribute_group **attrs;
struct platform_msi_priv_data *platform_data;
struct list_head list;
+ struct mutex mutex;
};

int msi_setup_device_data(struct device *dev);
@@ -187,6 +190,9 @@ static inline unsigned int msi_get_virq(
return ret < 0 ? 0 : ret;
}

+void msi_lock_descs(struct device *dev);
+void msi_unlock_descs(struct device *dev);
+
/* Helpers to hide struct msi_desc implementation details */
#define msi_desc_to_dev(desc) ((desc)->dev)
#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -116,12 +116,37 @@ int msi_setup_device_data(struct device

raw_spin_lock_init(&md->lock);
INIT_LIST_HEAD(&md->list);
+ mutex_init(&md->mutex);
dev->msi.data = md;
devres_add(dev, md);
return 0;
}

/**
+ * msi_lock_descs - Lock the MSI descriptor storage of a device
+ * @dev: Device to operate on
+ */
+void msi_lock_descs(struct device *dev)
+{
+ if (WARN_ON_ONCE(!dev->msi.data))
+ return;
+ mutex_lock(&dev->msi.data->mutex);
+}
+EXPORT_SYMBOL_GPL(msi_lock_descs);
+
+/**
+ * msi_unlock_descs - Unlock the MSI descriptor storage of a device
+ * @dev: Device to operate on
+ */
+void msi_unlock_descs(struct device *dev)
+{
+ if (WARN_ON_ONCE(!dev->msi.data))
+ return;
+ mutex_unlock(&dev->msi.data->mutex);
+}
+EXPORT_SYMBOL_GPL(msi_unlock_descs);
+
+/**
* __msi_get_virq - Return Linux interrupt number of a MSI interrupt
* @dev: Device to operate on
* @index: MSI interrupt index to look for (0-based)

Thomas Gleixner
Nov 26, 2021, 8:23:36 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Usage sites which allocate MSI descriptors before invoking
msi_domain_alloc_irqs() need to hold the MSI descriptor lock across the
operation.

Provide entry points which can be called with the MSI mutex held and lock
the mutex in the existing entry points.

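The intended calling pattern for such irqdomains is roughly (sketch;
my_alloc_msi_descs() is a hypothetical driver-side allocation step):

	msi_lock_descs(dev);
	ret = my_alloc_msi_descs(dev, nvec);	/* hypothetical */
	if (!ret)
		ret = msi_domain_alloc_irqs_descs_locked(domain, dev, nvec);
	msi_unlock_descs(dev);
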
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 3 ++
kernel/irq/msi.c | 74 ++++++++++++++++++++++++++++++++++++++++------------
2 files changed, 61 insertions(+), 16 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -413,9 +413,12 @@ struct irq_domain *msi_create_irq_domain
struct irq_domain *parent);
int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
int nvec);
+int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
+ int nvec);
int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
int nvec);
void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev);
void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -691,10 +691,8 @@ int __msi_domain_alloc_irqs(struct irq_d
virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
dev_to_node(dev), &arg, false,
desc->affinity);
- if (virq < 0) {
- ret = msi_handle_pci_fail(domain, desc, allocated);
- goto cleanup;
- }
+ if (virq < 0)
+ return msi_handle_pci_fail(domain, desc, allocated);

for (i = 0; i < desc->nvec_used; i++) {
irq_set_msi_desc_off(virq, i, desc);
@@ -728,7 +726,7 @@ int __msi_domain_alloc_irqs(struct irq_d
}
ret = irq_domain_activate_irq(irq_data, can_reserve);
if (ret)
- goto cleanup;
+ return ret;
}

skip_activate:
@@ -743,38 +741,63 @@ int __msi_domain_alloc_irqs(struct irq_d
}
}
return 0;
-
-cleanup:
- msi_domain_free_irqs(domain, dev);
- return ret;
}

/**
- * msi_domain_alloc_irqs - Allocate interrupts from a MSI interrupt domain
+ * msi_domain_alloc_irqs_descs_locked - Allocate interrupts from a MSI interrupt domain
* @domain: The domain to allocate from
* @dev: Pointer to device struct of the device for which the interrupts
* are allocated
* @nvec: The number of interrupts to allocate
*
+ * Must be invoked from within a msi_lock_descs() / msi_unlock_descs()
+ * pair. Use this for MSI irqdomains which implement their own vector
+ * allocation/free.
+ *
* Return: %0 on success or an error code.
*/
-int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
- int nvec)
+int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
+ int nvec)
{
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;
int ret;

+ lockdep_assert_held(&dev->msi.data->mutex);
+
ret = ops->domain_alloc_irqs(domain, dev, nvec);
if (ret)
- return ret;
+ goto cleanup;

if (!(info->flags & MSI_FLAG_DEV_SYSFS))
return 0;

ret = msi_device_populate_sysfs(dev);
if (ret)
- msi_domain_free_irqs(domain, dev);
+ goto cleanup;
+ return 0;
+
+cleanup:
+ msi_domain_free_irqs_descs_locked(domain, dev);
+ return ret;
+}
+
+/**
+ * msi_domain_alloc_irqs - Allocate interrupts from a MSI interrupt domain
+ * @domain: The domain to allocate from
+ * @dev: Pointer to device struct of the device for which the interrupts
+ * are allocated
+ * @nvec: The number of interrupts to allocate
+ *
+ * Return: %0 on success or an error code.
+ */
+int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec)
+{
+ int ret;
+
+ msi_lock_descs(dev);
+ ret = msi_domain_alloc_irqs_descs_locked(domain, dev, nvec);
+ msi_unlock_descs(dev);
return ret;
}

@@ -804,22 +827,41 @@ void __msi_domain_free_irqs(struct irq_d
}

/**
- * msi_domain_free_irqs - Free interrupts from a MSI interrupt @domain associated to @dev
+ * msi_domain_free_irqs_descs_locked - Free interrupts from a MSI interrupt @domain associated to @dev
* @domain: The domain to managing the interrupts
* @dev: Pointer to device struct of the device for which the interrupts
* are free
+ *
+ * Must be invoked from within a msi_lock_descs() / msi_unlock_descs()
+ * pair. Use this for MSI irqdomains which implement their own vector
+ * allocation.
*/
-void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev)
{
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;

+ lockdep_assert_held(&dev->msi.data->mutex);
+
if (info->flags & MSI_FLAG_DEV_SYSFS)
msi_device_destroy_sysfs(dev);
ops->domain_free_irqs(domain, dev);
}

/**
+ * msi_domain_free_irqs - Free interrupts from a MSI interrupt @domain associated to @dev
+ * @domain: The domain to managing the interrupts
+ * @dev: Pointer to device struct of the device for which the interrupts
+ * are free
+ */
+void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+{
+ msi_lock_descs(dev);
+ msi_domain_free_irqs_descs_locked(domain, dev);
+ msi_unlock_descs(dev);
+}
+
+/**
* msi_get_domain_info - Get the MSI interrupt domain info for @domain
* @domain: The interrupt domain to retrieve data from
*

Thomas Gleixner
Nov 26, 2021, 8:23:37 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
In preparation for dynamic handling of MSI-X interrupts, provide a new
set of MSI descriptor accessor functions and iterators. They are
beneficial in their own right as they allow cleaning up quite a bit of
code in various MSI domain implementations.

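A sketch of the resulting pattern at a usage site, e.g. inspecting all
descriptors which have a Linux interrupt associated (the pr_debug() is
just a placeholder):

	msi_lock_descs(dev);
	msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
		/* Only descriptors with desc->irq != 0 are visited */
		pr_debug("index %u -> irq %u\n", desc->msi_index, desc->irq);
	}
	msi_unlock_descs(dev);
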
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 58 ++++++++++++++++++++++++++++
kernel/irq/msi.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 165 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -140,6 +140,18 @@ struct msi_desc {
struct pci_msi_desc pci;
};

+/*
+ * Filter values for the MSI descriptor iterators and accessor functions.
+ */
+enum msi_desc_filter {
+ /* All descriptors */
+ MSI_DESC_ALL,
+ /* Descriptors which have no interrupt associated */
+ MSI_DESC_NOTASSOCIATED,
+ /* Descriptors which have an interrupt associated */
+ MSI_DESC_ASSOCIATED,
+};
+
/**
* msi_device_data - MSI per device data
* @lock: Spinlock to protect register access
@@ -148,6 +160,8 @@ struct msi_desc {
* @platform_data: Platform-MSI specific data
* @list: List of MSI descriptors associated to the device
* @mutex: Mutex protecting the MSI list
+ * @__next: Cached pointer to the next entry for iterators
+ * @__filter: Cached descriptor filter
*/
struct msi_device_data {
raw_spinlock_t lock;
@@ -156,6 +170,8 @@ struct msi_device_data {
struct platform_msi_priv_data *platform_data;
struct list_head list;
struct mutex mutex;
+ struct msi_desc *__next;
+ enum msi_desc_filter __filter;
};

int msi_setup_device_data(struct device *dev);
@@ -193,6 +209,48 @@ static inline unsigned int msi_get_virq(
void msi_lock_descs(struct device *dev);
void msi_unlock_descs(struct device *dev);

+struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
+struct msi_desc *msi_next_desc(struct device *dev);
+
+/**
+ * msi_first_desc - Get the first MSI descriptor associated to the device
+ * @dev: Device to search
+ */
+static inline struct msi_desc *msi_first_desc(struct device *dev)
+{
+ return __msi_first_desc(dev, MSI_DESC_ALL, 0);
+}
+
+
+/**
+ * msi_for_each_desc_from - Iterate the MSI descriptors from a given index
+ *
+ * @desc: struct msi_desc pointer used as iterator
+ * @dev: struct device pointer - device to iterate
+ * @filter: Filter for descriptor selection
+ * @base_index: MSI index to iterate from
+ *
+ * Notes:
+ * - The loop must be protected with a msi_lock_descs()/msi_unlock_descs()
+ * pair.
+ * - It is safe to remove a retrieved MSI descriptor in the loop.
+ */
+#define msi_for_each_desc_from(desc, dev, filter, base_index) \
+ for ((desc) = __msi_first_desc((dev), (filter), (base_index)); (desc); \
+ (desc) = msi_next_desc((dev)))
+
+/**
+ * msi_for_each_desc - Iterate the MSI descriptors
+ *
+ * @desc: struct msi_desc pointer used as iterator
+ * @dev: struct device pointer - device to iterate
+ * @filter: Filter for descriptor selection
+ *
+ * See msi_for_each_desc_from()for further information.
+ */
+#define msi_for_each_desc(desc, dev, filter) \
+ msi_for_each_desc_from(desc, dev, filter, 0)
+
/* Helpers to hide struct msi_desc implementation details */
#define msi_desc_to_dev(desc) ((desc)->dev)
#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -142,10 +142,117 @@ void msi_unlock_descs(struct device *dev
{
if (WARN_ON_ONCE(!dev->msi.data))
return;
+ /* Clear the next pointer which was cached by the iterator */
+ dev->msi.data->__next = NULL;
mutex_unlock(&dev->msi.data->mutex);
}
EXPORT_SYMBOL_GPL(msi_unlock_descs);

+static bool msi_desc_match(struct msi_desc *desc, enum msi_desc_filter filter)
+{
+ switch (filter) {
+ case MSI_DESC_ALL:
+ return true;
+ case MSI_DESC_NOTASSOCIATED:
+ return !desc->irq;
+ case MSI_DESC_ASSOCIATED:
+ return !!desc->irq;
+ }
+ WARN_ON_ONCE(1);
+ return false;
+}
+
+static struct msi_desc *msi_find_first_desc(struct device *dev, enum msi_desc_filter filter,
+ unsigned int base_index)
+{
+ struct msi_desc *desc;
+
+ list_for_each_entry(desc, dev_to_msi_list(dev), list) {
+ if (desc->msi_index < base_index)
+ continue;
+ if (msi_desc_match(desc, filter))
+ return desc;
+ }
+ return NULL;
+}
+
+/**
+ * __msi_first_desc - Get the first MSI descriptor of a device
+ * @dev: Device to operate on
+ * @filter: Descriptor state filter
+ * @base_index: MSI index to start from for range based operations
+ *
+ * Must be called with the MSI descriptor mutex held, i.e. msi_lock_descs()
+ * must be invoked before the call.
+ *
+ * Return: Pointer to the first MSI descriptor matching the search
+ * criteria, NULL if none found.
+ */
+struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter,
+ unsigned int base_index)
+{
+ struct msi_desc *desc;
+
+ if (WARN_ON_ONCE(!dev->msi.data))
+ return NULL;
+
+ lockdep_assert_held(&dev->msi.data->mutex);
+
+ /* Invalidate a previous invocation within the same lock section */
+ dev->msi.data->__next = NULL;
+
+ desc = msi_find_first_desc(dev, filter, base_index);
+ if (desc) {
+ dev->msi.data->__next = list_next_entry(desc, list);
+ dev->msi.data->__filter = filter;
+ }
+ return desc;
+}
+EXPORT_SYMBOL_GPL(__msi_first_desc);
+
+static struct msi_desc *__msi_next_desc(struct device *dev, enum msi_desc_filter filter,
+ struct msi_desc *from)
+{
+ struct msi_desc *desc = from;
+
+ list_for_each_entry_from(desc, dev_to_msi_list(dev), list) {
+ if (msi_desc_match(desc, filter))
+ return desc;
+ }
+ return NULL;
+}
+
+/**
+ * msi_next_desc - Get the next MSI descriptor of a device
+ * @dev: Device to operate on
+ *
+ * The first invocation of msi_next_desc() has to be preceeded by a
+ * successful incovation of __msi_first_desc(). Consecutive invocations are
+ * only valid if the previous one was successful. All these operations have
+ * to be done within the same MSI mutex held region.
+ *
+ * Return: Pointer to the next MSI descriptor matching the search
+ * criteria, NULL if none found.
+ */
+struct msi_desc *msi_next_desc(struct device *dev)
+{
+ struct msi_device_data *data = dev->msi.data;
+ struct msi_desc *desc;
+
+ if (WARN_ON_ONCE(!data))
+ return NULL;
+
+ lockdep_assert_held(&data->mutex);
+
+ if (!data->__next)
+ return NULL;
+
+ desc = __msi_next_desc(dev, data->__filter, data->__next);
+ dev->msi.data->__next = desc ? list_next_entry(desc, list) : NULL;
+ return desc;
+}
+EXPORT_SYMBOL_GPL(msi_next_desc);
+

Thomas Gleixner
Nov 26, 2021, 8:23:39 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Provide msi_add_msi_desc() which takes a template MSI descriptor for
initializing a newly allocated descriptor. This allows simplifying various
usage sites of alloc_msi_entry() and moves the storage handling into the
core code.

For simple cases where only a linear vector space is required, provide
msi_add_simple_msi_descs() which just allocates a linear range of MSI
descriptors and fills msi_desc::msi_index accordingly.

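Usage sites then fill a template on the stack and hand it to the core,
roughly (sketch; 'index' and the field values are examples only):

	struct msi_desc desc;

	memset(&desc, 0, sizeof(desc));
	desc.nvec_used = 1;
	desc.msi_index = index;
	/* plus bus specific data, e.g. desc.pci.* for PCI/MSI */
	ret = msi_add_msi_desc(dev, &desc);
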
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 2 +
kernel/irq/msi.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -302,6 +302,8 @@ static inline void pci_write_msi_msg(uns
}
#endif /* CONFIG_PCI_MSI */

+int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc);
+
struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
const struct irq_affinity_desc *affinity);
void free_msi_entry(struct msi_desc *entry);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -61,6 +61,65 @@ void free_msi_entry(struct msi_desc *ent
}

/**
+ * msi_add_msi_desc - Allocate and initialize a MSI descriptor
+ * @dev: Pointer to the device for which the descriptor is allocated
+ * @init_desc: Pointer to an MSI descriptor to initialize the new descriptor
+ *
+ * Return: 0 on success or an appropriate failure code.
+ */
+int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc)
+{
+ struct msi_desc *desc;
+
+ lockdep_assert_held(&dev->msi.data->mutex);
+
+ desc = alloc_msi_entry(dev, init_desc->nvec_used, init_desc->affinity);
+ if (!desc)
+ return -ENOMEM;
+
+ /* Copy the MSI index and type specific data to the new descriptor. */
+ desc->msi_index = init_desc->msi_index;
+ desc->pci = init_desc->pci;
+
+ list_add_tail(&desc->list, &dev->msi.data->list);
+ return 0;
+}
+
+/**
+ * msi_add_simple_msi_descs - Allocate and initialize MSI descriptors
+ * @dev: Pointer to the device for which the descriptors are allocated
+ * @index: Index for the first MSI descriptor
+ * @ndesc: Number of descriptors to allocate
+ *
+ * Return: 0 on success or an appropriate failure code.
+ */
+static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
+{
+ struct msi_desc *desc, *tmp;
+ LIST_HEAD(list);
+ unsigned int i;
+
+ lockdep_assert_held(&dev->msi.data->mutex);
+
+ for (i = 0; i < ndesc; i++) {
+ desc = alloc_msi_entry(dev, 1, NULL);
+ if (!desc)
+ goto fail;
+ desc->msi_index = index + i;
+ list_add_tail(&desc->list, &list);
+ }
+ list_splice_tail(&list, &dev->msi.data->list);
+ return 0;
+
+fail:
+ list_for_each_entry_safe(desc, tmp, &list, list) {
+ list_del(&desc->list);
+ free_msi_entry(desc);
+ }
+ return -ENOMEM;
+}
+
+/**
* msi_device_has_property - Check whether a device has a specific MSI property
* @dev: Pointer to the device which is queried
* @prop: Property to check for

Thomas Gleixner
Nov 26, 2021, 8:23:40 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Provide domain info flags which tell the core code to allocate simple
descriptors or to free descriptors when the interrupts are freed, and
implement the required functionality.

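An irqdomain opts in via its domain info flags, e.g. (sketch of a
hypothetical domain, other fields elided):

	static struct msi_domain_info my_msi_domain_info = {
		.flags	= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
			  MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS | MSI_FLAG_FREE_MSI_DESCS,
	};
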
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 15 +++++++++++++++
kernel/irq/msi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 63 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -303,6 +303,17 @@ static inline void pci_write_msi_msg(uns
#endif /* CONFIG_PCI_MSI */

int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc);
+void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
+ unsigned int base_index, unsigned int ndesc);
+
+/**
+ * msi_free_msi_descs - Free MSI descriptors of a device
+ * @dev: Device to free the descriptors
+ */
+static inline void msi_free_msi_descs(struct device *dev)
+{
+ msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, UINT_MAX);
+}

struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
const struct irq_affinity_desc *affinity);
@@ -463,6 +474,10 @@ enum {
MSI_FLAG_DEV_SYSFS = (1 << 7),
/* MSI-X entries must be contiguous */
MSI_FLAG_MSIX_CONTIGUOUS = (1 << 8),
+ /* Allocate simple MSI descriptors */
+ MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS = (1 << 9),
+ /* Free MSI descriptors */
+ MSI_FLAG_FREE_MSI_DESCS = (1 << 10),
};

int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -120,6 +120,32 @@ static int msi_add_simple_msi_descs(stru
}

/**
+ * msi_free_msi_descs_range - Free MSI descriptors of a device
+ * @dev: Device to free the descriptors
+ * @filter: Descriptor state filter
+ * @base_index: Index to start freeing from
+ * @ndesc: Number of descriptors to free
+ */
+void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
+ unsigned int base_index, unsigned int ndesc)
+{
+ struct msi_desc *desc;
+
+ lockdep_assert_held(&dev->msi.data->mutex);
+
+ msi_for_each_desc(desc, dev, filter) {
+ /*
+ * Stupid for now to handle MSI device domain until the
+ * storage is switched over to an xarray.
+ */
+ if (desc->msi_index < base_index || desc->msi_index >= base_index + ndesc)
+ continue;
+ list_del(&desc->list);
+ free_msi_entry(desc);
+ }
+}
+
+/**
* msi_device_has_property - Check whether a device has a specific MSI property
* @dev: Pointer to the device which is queried
* @prop: Property to check for
@@ -905,6 +931,16 @@ int __msi_domain_alloc_irqs(struct irq_d
return 0;
}

+static int msi_domain_add_simple_msi_descs(struct msi_domain_info *info,
+ struct device *dev,
+ unsigned int num_descs)
+{
+ if (!(info->flags & MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS))
+ return 0;
+
+ return msi_add_simple_msi_descs(dev, 0, num_descs);
+}
+
/**
* msi_domain_alloc_irqs_descs_locked - Allocate interrupts from a MSI interrupt domain
* @domain: The domain to allocate from
@@ -927,6 +963,10 @@ int msi_domain_alloc_irqs_descs_locked(s

lockdep_assert_held(&dev->msi.data->mutex);

+ ret = msi_domain_add_simple_msi_descs(info, dev, nvec);
+ if (ret)
+ return ret;
+
ret = ops->domain_alloc_irqs(domain, dev, nvec);
if (ret)
goto cleanup;
@@ -988,6 +1028,13 @@ void __msi_domain_free_irqs(struct irq_d
}
}

+static void msi_domain_free_msi_descs(struct msi_domain_info *info,
+ struct device *dev)
+{
+ if (info->flags & MSI_FLAG_FREE_MSI_DESCS)
+ msi_free_msi_descs(dev);
+}
+
/**
* msi_domain_free_irqs_descs_locked - Free interrupts from a MSI interrupt @domain associated to @dev
* @domain: The domain to managing the interrupts
@@ -1008,6 +1055,7 @@ void msi_domain_free_irqs_descs_locked(s
if (info->flags & MSI_FLAG_DEV_SYSFS)
msi_device_destroy_sysfs(dev);
ops->domain_free_irqs(domain, dev);
+ msi_domain_free_msi_descs(info, dev);
}

/**

Thomas Gleixner
Nov 26, 2021, 8:23:41 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Add a counter to track the number of allocated MSI descriptors per device
and provide msi_device_num_descs() to query it.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 3 +++
kernel/irq/msi.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -156,6 +156,7 @@ enum msi_desc_filter {
* msi_device_data - MSI per device data
* @lock: Spinlock to protect register access
* @properties: MSI properties which are interesting to drivers
+ * @num_descs: The number of allocated MSI descriptors for the device
* @attrs: Pointer to the sysfs attribute group
* @platform_data: Platform-MSI specific data
* @list: List of MSI descriptors associated to the device
@@ -166,6 +167,7 @@ enum msi_desc_filter {
struct msi_device_data {
raw_spinlock_t lock;
unsigned long properties;
+ unsigned int num_descs;
const struct attribute_group **attrs;
struct platform_msi_priv_data *platform_data;
struct list_head list;
@@ -208,6 +210,7 @@ static inline unsigned int msi_get_virq(

void msi_lock_descs(struct device *dev);
void msi_unlock_descs(struct device *dev);
+unsigned int msi_device_num_descs(struct device *dev);

struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
struct msi_desc *msi_next_desc(struct device *dev);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -82,6 +82,7 @@ int msi_add_msi_desc(struct device *dev,
desc->pci = init_desc->pci;

list_add_tail(&desc->list, &dev->msi.data->list);
+ dev->msi.data->num_descs++;
return 0;
}

@@ -109,6 +110,7 @@ int msi_add_simple_msi_descs(struct devi
list_add_tail(&desc->list, &list);
}
list_splice_tail(&list, &dev->msi.data->list);
+ dev->msi.data->num_descs += ndesc;
return 0;

fail:
@@ -142,6 +144,7 @@ void msi_free_msi_descs_range(struct dev
continue;
list_del(&desc->list);
free_msi_entry(desc);
+ dev->msi.data->num_descs--;
}
}

@@ -157,6 +160,21 @@ bool msi_device_has_property(struct devi
return !!(dev->msi.data->properties & prop);
}

+/**
+ * msi_device_num_descs - Query the number of allocated MSI descriptors of a device
+ * @dev: The device to read from
+ *
+ * Note: This is a lockless snapshot of msi_device_data::num_descs
+ *
+ * Returns the number of MSI descriptors which are allocated for @dev
+ */
+unsigned int msi_device_num_descs(struct device *dev)
+{
+ if (dev->msi.data)
+ return dev->msi.data->num_descs;
+ return 0;
+}
+
void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
*msg = entry->msg;

Thomas Gleixner
Nov 26, 2021, 8:23:43 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
To prepare for dynamic extension of MSI-X vectors, protect the MSI
operations for MSI and MSI-X. This requires moving the invocation of
irq_create_affinity_masks() out of the descriptor lock section to avoid
reverse lock ordering vs. the CPU hotplug lock, as some callers of the
PCI/MSI allocation interfaces already hold it.

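In outline, the resulting ordering in msi_capability_init() after this
change (abbreviated from the hunk below):

	if (affd)
		masks = irq_create_affinity_masks(nvec, affd);	/* outside the lock */

	msi_lock_descs(&dev->dev);
	entry = msi_setup_entry(dev, nvec, masks);
	/* ... configure, or clean up on failure ... */
	msi_unlock_descs(&dev->dev);
	kfree(masks);
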
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/msi/irqdomain.c | 4 -
drivers/pci/msi/msi.c | 120 ++++++++++++++++++++++++++------------------
2 files changed, 73 insertions(+), 51 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -14,7 +14,7 @@ int pci_msi_setup_msi_irqs(struct pci_de

domain = dev_get_msi_domain(&dev->dev);
if (domain && irq_domain_is_hierarchy(domain))
- return msi_domain_alloc_irqs(domain, &dev->dev, nvec);
+ return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, nvec);

return pci_msi_legacy_setup_msi_irqs(dev, nvec, type);
}
@@ -25,7 +25,7 @@ void pci_msi_teardown_msi_irqs(struct pc

domain = dev_get_msi_domain(&dev->dev);
if (domain && irq_domain_is_hierarchy(domain))
- msi_domain_free_irqs(domain, &dev->dev);
+ msi_domain_free_irqs_descs_locked(domain, &dev->dev);
else
pci_msi_legacy_teardown_msi_irqs(dev);
}
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -322,11 +322,13 @@ static void __pci_restore_msix_state(str

write_msg = arch_restore_msi_irqs(dev);

+ msi_lock_descs(&dev->dev);
for_each_pci_msi_entry(entry, dev) {
if (write_msg)
__pci_write_msi_msg(entry, &entry->msg);
pci_msix_write_vector_ctrl(entry, entry->pci.msix_ctrl);
}
+ msi_unlock_descs(&dev->dev);

pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0);
}
@@ -339,19 +341,15 @@ void pci_restore_msi_state(struct pci_de
EXPORT_SYMBOL_GPL(pci_restore_msi_state);

static struct msi_desc *
-msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd)
+msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity_desc *masks)
{
- struct irq_affinity_desc *masks = NULL;
struct msi_desc *entry;
u16 control;

- if (affd)
- masks = irq_create_affinity_masks(nvec, affd);
-
/* MSI Entry Initialization */
entry = alloc_msi_entry(&dev->dev, nvec, masks);
if (!entry)
- goto out;
+ return NULL;

pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &control);
/* Lies, damned lies, and MSIs */
@@ -377,8 +375,7 @@ msi_setup_entry(struct pci_dev *dev, int
dev->dev.msi.data->properties = MSI_PROP_PCI_MSI;
if (entry->pci.msi_attrib.is_64)
dev->dev.msi.data->properties |= MSI_PROP_64BIT;
-out:
- kfree(masks);
+
return entry;
}

@@ -414,14 +411,21 @@ static int msi_verify_entries(struct pci
static int msi_capability_init(struct pci_dev *dev, int nvec,
struct irq_affinity *affd)
{
+ struct irq_affinity_desc *masks = NULL;
struct msi_desc *entry;
int ret;

pci_msi_set_enable(dev, 0); /* Disable MSI during set up */

- entry = msi_setup_entry(dev, nvec, affd);
- if (!entry)
- return -ENOMEM;
+ if (affd)
+ masks = irq_create_affinity_masks(nvec, affd);
+
+ msi_lock_descs(&dev->dev);
+ entry = msi_setup_entry(dev, nvec, masks);
+ if (!entry) {
+ ret = -ENOMEM;
+ goto unlock;
+ }

/* All MSIs are unmasked by default; mask them all */
pci_msi_mask(entry, msi_multi_mask(entry));
@@ -444,11 +448,14 @@ static int msi_capability_init(struct pc

pcibios_free_irq(dev);
dev->irq = entry->irq;
- return 0;
+ goto unlock;

err:
pci_msi_unmask(entry, msi_multi_mask(entry));
free_msi_irqs(dev);
+unlock:
+ msi_unlock_descs(&dev->dev);
+ kfree(masks);
return ret;
}

@@ -475,23 +482,18 @@ static void __iomem *msix_map_region(str

static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
struct msix_entry *entries, int nvec,
- struct irq_affinity *affd)
+ struct irq_affinity_desc *masks)
{
- struct irq_affinity_desc *curmsk, *masks = NULL;
+ int i, vec_count = pci_msix_vec_count(dev);
+ struct irq_affinity_desc *curmsk;
struct msi_desc *entry;
void __iomem *addr;
- int ret, i;
- int vec_count = pci_msix_vec_count(dev);
-
- if (affd)
- masks = irq_create_affinity_masks(nvec, affd);

for (i = 0, curmsk = masks; i < nvec; i++) {
entry = alloc_msi_entry(&dev->dev, 1, curmsk);
if (!entry) {
/* No enough memory. Don't try again */
- ret = -ENOMEM;
- goto out;
+ return -ENOMEM;
}

entry->pci.msi_attrib.is_msix = 1;
@@ -520,10 +522,7 @@ static int msix_setup_entries(struct pci
curmsk++;
}
dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
- ret = 0;
-out:
- kfree(masks);
- return ret;
+ return 0;
}

static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries)
@@ -550,6 +549,41 @@ static void msix_mask_all(void __iomem *
writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
}

+static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base,
+ struct msix_entry *entries, int nvec,
+ struct irq_affinity *affd)
+{
+ struct irq_affinity_desc *masks = NULL;
+ int ret;
+
+ if (affd)
+ masks = irq_create_affinity_masks(nvec, affd);
+
+ msi_lock_descs(&dev->dev);
+ ret = msix_setup_entries(dev, base, entries, nvec, masks);
+ if (ret)
+ goto out_free;
+
+ ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+ if (ret)
+ goto out_free;
+
+ /* Check if all MSI entries honor device restrictions */
+ ret = msi_verify_entries(dev);
+ if (ret)
+ goto out_free;
+
+ msix_update_entries(dev, entries);
+ goto out_unlock;
+
+out_free:
+ free_msi_irqs(dev);
+out_unlock:
+ msi_unlock_descs(&dev->dev);
+ kfree(masks);
+ return ret;
+}
+
/**
* msix_capability_init - configure device's MSI-X capability
* @dev: pointer to the pci_dev data structure of MSI-X device function
@@ -590,20 +624,9 @@ static int msix_capability_init(struct p
/* Ensure that all table entries are masked. */
msix_mask_all(base, tsize);

- ret = msix_setup_entries(dev, base, entries, nvec, affd);
+ ret = msix_setup_interrupts(dev, base, entries, nvec, affd);
if (ret)
- goto out_free;
-
- ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
- if (ret)
- goto out_free;
-
- /* Check if all MSI entries honor device restrictions */
- ret = msi_verify_entries(dev);
- if (ret)
- goto out_free;
-
- msix_update_entries(dev, entries);
+ goto out_disable;

/* Set MSI-X enabled bits and unmask the function */
pci_intx_for_msi(dev, 0);
@@ -613,12 +636,8 @@ static int msix_capability_init(struct p
pcibios_free_irq(dev);
return 0;

-out_free:
- free_msi_irqs(dev);
-
out_disable:
pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
-
return ret;
}

@@ -723,8 +742,10 @@ void pci_disable_msi(struct pci_dev *dev
if (!pci_msi_enable || !dev || !dev->msi_enabled)
return;

+ msi_lock_descs(&dev->dev);
pci_msi_shutdown(dev);
free_msi_irqs(dev);
+ msi_unlock_descs(&dev->dev);
}
EXPORT_SYMBOL(pci_disable_msi);

@@ -810,8 +831,10 @@ void pci_disable_msix(struct pci_dev *de
if (!pci_msi_enable || !dev || !dev->msix_enabled)
return;

+ msi_lock_descs(&dev->dev);
pci_msix_shutdown(dev);
free_msi_irqs(dev);
+ msi_unlock_descs(&dev->dev);
}
EXPORT_SYMBOL(pci_disable_msix);

@@ -872,7 +895,6 @@ int pci_enable_msi(struct pci_dev *dev)

if (!rc)
rc = __pci_enable_msi_range(dev, 1, 1, NULL);
-
return rc < 0 ? rc : 0;
}
EXPORT_SYMBOL(pci_enable_msi);
@@ -959,11 +981,7 @@ int pci_alloc_irq_vectors_affinity(struc
struct irq_affinity *affd)
{
struct irq_affinity msi_default_affd = {0};
- int ret = msi_setup_device_data(&dev->dev);
- int nvecs = -ENOSPC;
-
- if (ret)
- return ret;
+ int ret, nvecs;

if (flags & PCI_IRQ_AFFINITY) {
if (!affd)
@@ -973,6 +991,10 @@ int pci_alloc_irq_vectors_affinity(struc
affd = NULL;
}

+ ret = msi_setup_device_data(&dev->dev);
+ if (ret)
+ return ret;
+
if (flags & PCI_IRQ_MSIX) {
nvecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs,
affd, flags);
@@ -1001,7 +1023,7 @@ int pci_alloc_irq_vectors_affinity(struc
}
}

- return nvecs;
+ return -ENOSPC;
}
EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);


Thomas Gleixner

unread,
Nov 26, 2021, 8:23:45 PM11/26/21
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Simplify the allocation of MSI descriptors by using msi_add_msi_desc()
which moves the storage handling to core code and prepares for dynamic
extension of the MSI-X vector space.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/msi/msi.c | 121 ++++++++++++++++++++++++--------------------------
1 file changed, 59 insertions(+), 62 deletions(-)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -340,43 +340,49 @@ void pci_restore_msi_state(struct pci_de
}
EXPORT_SYMBOL_GPL(pci_restore_msi_state);

-static struct msi_desc *
-msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity_desc *masks)
+static int msi_setup_msi_desc(struct pci_dev *dev, int nvec,
+ struct irq_affinity_desc *masks)
{
- struct msi_desc *entry;
+ struct msi_desc desc;
u16 control;
+ int ret;

/* MSI Entry Initialization */
- entry = alloc_msi_entry(&dev->dev, nvec, masks);
- if (!entry)
- return NULL;
+ memset(&desc, 0, sizeof(desc));

pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &control);
/* Lies, damned lies, and MSIs */
if (dev->dev_flags & PCI_DEV_FLAGS_HAS_MSI_MASKING)
control |= PCI_MSI_FLAGS_MASKBIT;
+ /* Respect XEN's mask disabling */
+ if (pci_msi_ignore_mask)
+ control &= ~PCI_MSI_FLAGS_MASKBIT;

- entry->pci.msi_attrib.is_64 = !!(control & PCI_MSI_FLAGS_64BIT);
- entry->pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
- !!(control & PCI_MSI_FLAGS_MASKBIT);
- entry->pci.msi_attrib.default_irq = dev->irq;
- entry->pci.msi_attrib.multi_cap = (control & PCI_MSI_FLAGS_QMASK) >> 1;
- entry->pci.msi_attrib.multiple = ilog2(__roundup_pow_of_two(nvec));
+ desc.nvec_used = nvec;
+ desc.pci.msi_attrib.is_64 = !!(control & PCI_MSI_FLAGS_64BIT);
+ desc.pci.msi_attrib.can_mask = !!(control & PCI_MSI_FLAGS_MASKBIT);
+ desc.pci.msi_attrib.default_irq = dev->irq;
+ desc.pci.msi_attrib.multi_cap = (control & PCI_MSI_FLAGS_QMASK) >> 1;
+ desc.pci.msi_attrib.multiple = ilog2(__roundup_pow_of_two(nvec));
+ desc.affinity = masks;

if (control & PCI_MSI_FLAGS_64BIT)
- entry->pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
+ desc.pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
else
- entry->pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_32;
+ desc.pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_32;

/* Save the initial mask status */
- if (entry->pci.msi_attrib.can_mask)
- pci_read_config_dword(dev, entry->pci.mask_pos, &entry->pci.msi_mask);
+ if (desc.pci.msi_attrib.can_mask)
+ pci_read_config_dword(dev, desc.pci.mask_pos, &desc.pci.msi_mask);

- dev->dev.msi.data->properties = MSI_PROP_PCI_MSI;
- if (entry->pci.msi_attrib.is_64)
- dev->dev.msi.data->properties |= MSI_PROP_64BIT;
+ ret = msi_add_msi_desc(&dev->dev, &desc);
+ if (!ret) {
+ dev->dev.msi.data->properties = MSI_PROP_PCI_MSI;
+ if (desc.pci.msi_attrib.is_64)
+ dev->dev.msi.data->properties |= MSI_PROP_64BIT;
+ }

- return entry;
+ return ret;
}

static int msi_verify_entries(struct pci_dev *dev)
@@ -421,17 +427,14 @@ static int msi_capability_init(struct pc
masks = irq_create_affinity_masks(nvec, affd);

msi_lock_descs(&dev->dev);
- entry = msi_setup_entry(dev, nvec, masks);
- if (!entry) {
- ret = -ENOMEM;
+ ret = msi_setup_msi_desc(dev, nvec, masks);
+ if (ret)
goto unlock;
- }

/* All MSIs are unmasked by default; mask them all */
+ entry = first_pci_msi_entry(dev);
pci_msi_mask(entry, msi_multi_mask(entry));

- list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
-
/* Configure MSI capability structure */
ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
if (ret)
@@ -480,49 +483,41 @@ static void __iomem *msix_map_region(str
return ioremap(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
}

-static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
- struct msix_entry *entries, int nvec,
- struct irq_affinity_desc *masks)
+static int msix_setup_msi_descs(struct pci_dev *dev, void __iomem *base,
+ struct msix_entry *entries, int nvec,
+ struct irq_affinity_desc *masks)
{
- int i, vec_count = pci_msix_vec_count(dev);
+ int ret, i, vec_count = pci_msix_vec_count(dev);
struct irq_affinity_desc *curmsk;
- struct msi_desc *entry;
+ struct msi_desc desc;
void __iomem *addr;

- for (i = 0, curmsk = masks; i < nvec; i++) {
- entry = alloc_msi_entry(&dev->dev, 1, curmsk);
- if (!entry) {
- /* No enough memory. Don't try again */
- return -ENOMEM;
- }
-
- entry->pci.msi_attrib.is_msix = 1;
- entry->pci.msi_attrib.is_64 = 1;
-
- if (entries)
- entry->msi_index = entries[i].entry;
- else
- entry->msi_index = i;
-
- entry->pci.msi_attrib.is_virtual = entry->msi_index >= vec_count;
-
- entry->pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
- !entry->pci.msi_attrib.is_virtual;
-
- entry->pci.msi_attrib.default_irq = dev->irq;
- entry->pci.mask_base = base;
+ memset(&desc, 0, sizeof(desc));

- if (entry->pci.msi_attrib.can_mask) {
- addr = pci_msix_desc_addr(entry);
- entry->pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
+ desc.nvec_used = 1;
+ desc.pci.msi_attrib.is_msix = 1;
+ desc.pci.msi_attrib.is_64 = 1;
+ desc.pci.msi_attrib.default_irq = dev->irq;
+ desc.pci.mask_base = base;
+
+ for (i = 0, curmsk = masks; i < nvec; i++, curmsk++) {
+ desc.msi_index = entries ? entries[i].entry : i;
+ desc.affinity = masks ? curmsk : NULL;
+ desc.pci.msi_attrib.is_virtual = desc.msi_index >= vec_count;
+ desc.pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
+ !desc.pci.msi_attrib.is_virtual;
+
+ if (!desc.pci.msi_attrib.can_mask) {
+ addr = pci_msix_desc_addr(&desc);
+ desc.pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
}

- list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
- if (masks)
- curmsk++;
+ ret = msi_add_msi_desc(&dev->dev, &desc);
+ if (ret)
+ break;
}
- dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
- return 0;
+
+ return ret;
}

static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries)
@@ -560,10 +555,12 @@ static int msix_setup_interrupts(struct
masks = irq_create_affinity_masks(nvec, affd);

msi_lock_descs(&dev->dev);
- ret = msix_setup_entries(dev, base, entries, nvec, masks);
+ ret = msix_setup_msi_descs(dev, base, entries, nvec, masks);
if (ret)
goto out_free;

+ dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
+
ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
if (ret)
goto out_free;

Thomas Gleixner
Nov 26, 2021, 8:23:46 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Set the domain info flag which tells the core code to free the MSI
descriptors from msi_domain_free_irqs(), and add an explicit call to the
core function in the legacy code.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/msi/irqdomain.c | 3 ++-
drivers/pci/msi/legacy.c | 1 +
drivers/pci/msi/msi.c | 14 --------------
3 files changed, 3 insertions(+), 15 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -171,7 +171,8 @@ struct irq_domain *pci_msi_create_irq_do
if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
pci_msi_domain_update_chip_ops(info);

- info->flags |= MSI_FLAG_ACTIVATE_EARLY | MSI_FLAG_DEV_SYSFS;
+ info->flags |= MSI_FLAG_ACTIVATE_EARLY | MSI_FLAG_DEV_SYSFS |
+ MSI_FLAG_FREE_MSI_DESCS;
if (IS_ENABLED(CONFIG_GENERIC_IRQ_RESERVATION_MODE))
info->flags |= MSI_FLAG_MUST_REACTIVATE;

--- a/drivers/pci/msi/legacy.c
+++ b/drivers/pci/msi/legacy.c
@@ -81,4 +81,5 @@ void pci_msi_legacy_teardown_msi_irqs(st
{
msi_device_destroy_sysfs(&dev->dev);
arch_teardown_msi_irqs(dev);
+ msi_free_msi_descs(&dev->dev);
}
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -224,22 +224,8 @@ EXPORT_SYMBOL_GPL(pci_write_msi_msg);

static void free_msi_irqs(struct pci_dev *dev)
{
- struct list_head *msi_list = dev_to_msi_list(&dev->dev);
- struct msi_desc *entry, *tmp;
- int i;
-
- for_each_pci_msi_entry(entry, dev)
- if (entry->irq)
- for (i = 0; i < entry->nvec_used; i++)
- BUG_ON(irq_has_action(entry->irq + i));
-
pci_msi_teardown_msi_irqs(dev);

- list_for_each_entry_safe(entry, tmp, msi_list, list) {
- list_del(&entry->list);
- free_msi_entry(entry);
- }
-
if (dev->msix_base) {
iounmap(dev->msix_base);
dev->msix_base = NULL;

Thomas Gleixner

Nov 26, 2021, 8:23:48 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Use the new iterator functions, which pave the way for dynamically
extending MSI-X vectors.
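
For illustration, a minimal sketch of the filtered iteration, assuming
@dev already has MSI descriptors installed (show_associated_irqs() is a
hypothetical helper, not part of this series):

  #include <linux/msi.h>

  static void show_associated_irqs(struct device *dev)
  {
      struct msi_desc *desc;

      msi_lock_descs(dev);
      /* Visit only descriptors which have a Linux interrupt attached */
      msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED)
          dev_info(dev, "MSI index %u -> irq %u\n",
                   desc->msi_index, desc->irq);
      msi_unlock_descs(dev);
  }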

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/msi/irqdomain.c | 4 ++--
drivers/pci/msi/legacy.c | 19 ++++++++-----------
drivers/pci/msi/msi.c | 30 ++++++++++++++----------------
3 files changed, 24 insertions(+), 29 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -83,7 +83,7 @@ static int pci_msi_domain_check_cap(stru
struct msi_domain_info *info,
struct device *dev)
{
- struct msi_desc *desc = first_pci_msi_entry(to_pci_dev(dev));
+ struct msi_desc *desc = msi_first_desc(dev);

/* Special handling to support __pci_enable_msi_range() */
if (pci_msi_desc_is_multi_msi(desc) &&
@@ -98,7 +98,7 @@ static int pci_msi_domain_check_cap(stru
unsigned int idx = 0;

/* Check for gaps in the entry indices */
- for_each_msi_entry(desc, dev) {
+ msi_for_each_desc(desc, dev, MSI_DESC_ALL) {
if (desc->msi_index != idx++)
return -ENOTSUPP;
}
--- a/drivers/pci/msi/legacy.c
+++ b/drivers/pci/msi/legacy.c
@@ -29,7 +29,7 @@ int __weak arch_setup_msi_irqs(struct pc
if (type == PCI_CAP_ID_MSI && nvec > 1)
return 1;

- for_each_pci_msi_entry(desc, dev) {
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
ret = arch_setup_msi_irq(dev, desc);
if (ret)
return ret < 0 ? ret : -ENOSPC;
@@ -43,27 +43,24 @@ void __weak arch_teardown_msi_irqs(struc
struct msi_desc *desc;
int i;

- for_each_pci_msi_entry(desc, dev) {
- if (desc->irq) {
- for (i = 0; i < entry->nvec_used; i++)
- arch_teardown_msi_irq(desc->irq + i);
- }
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_ASSOCIATED) {
+ for (i = 0; i < desc->nvec_used; i++)
+ arch_teardown_msi_irq(desc->irq + i);
}
}

static int pci_msi_setup_check_result(struct pci_dev *dev, int type, int ret)
{
- struct msi_desc *entry;
+ struct msi_desc *desc;
int avail = 0;

if (type != PCI_CAP_ID_MSIX || ret >= 0)
return ret;

/* Scan the MSI descriptors for successfully allocated ones. */
- for_each_pci_msi_entry(entry, dev) {
- if (entry->irq != 0)
- avail++;
- }
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_ASSOCIATED)
+ avail++;
+
return avail ? avail : ret;
}

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -299,7 +299,6 @@ static void __pci_restore_msix_state(str

if (!dev->msix_enabled)
return;
- BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));

/* route the table */
pci_intx_for_msi(dev, 0);
@@ -309,7 +308,7 @@ static void __pci_restore_msix_state(str
write_msg = arch_restore_msi_irqs(dev);

msi_lock_descs(&dev->dev);
- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_ALL) {
if (write_msg)
__pci_write_msi_msg(entry, &entry->msg);
pci_msix_write_vector_ctrl(entry, entry->pci.msix_ctrl);
@@ -378,14 +377,14 @@ static int msi_verify_entries(struct pci
if (!dev->no_64bit_msi)
return 0;

- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_ALL) {
if (entry->msg.address_hi) {
pci_err(dev, "arch assigned 64-bit MSI address %#x%08x but device only supports 32 bits\n",
entry->msg.address_hi, entry->msg.address_lo);
- return -EIO;
+ break;
}
}
- return 0;
+ return !entry ? 0 : -EIO;
}

/**
@@ -418,7 +417,7 @@ static int msi_capability_init(struct pc
goto unlock;

/* All MSIs are unmasked by default; mask them all */
- entry = first_pci_msi_entry(dev);
+ entry = msi_first_desc(&dev->dev);
pci_msi_mask(entry, msi_multi_mask(entry));

/* Configure MSI capability structure */
@@ -508,11 +507,11 @@ static int msix_setup_msi_descs(struct p

static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries)
{
- struct msi_desc *entry;
+ struct msi_desc *desc;

if (entries) {
- for_each_pci_msi_entry(entry, dev) {
- entries->vector = entry->irq;
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_ALL) {
+ entries->vector = desc->irq;
entries++;
}
}
@@ -705,15 +704,14 @@ static void pci_msi_shutdown(struct pci_
if (!pci_msi_enable || !dev || !dev->msi_enabled)
return;

- BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));
- desc = first_pci_msi_entry(dev);
-
pci_msi_set_enable(dev, 0);
pci_intx_for_msi(dev, 1);
dev->msi_enabled = 0;

/* Return the device with MSI unmasked as initial states */
- pci_msi_unmask(desc, msi_multi_mask(desc));
+ desc = msi_first_desc(&dev->dev);
+ if (!WARN_ON_ONCE(!desc))
+ pci_msi_unmask(desc, msi_multi_mask(desc));

/* Restore dev->irq to its default pin-assertion IRQ */
dev->irq = desc->pci.msi_attrib.default_irq;
@@ -789,7 +787,7 @@ static int __pci_enable_msix(struct pci_

static void pci_msix_shutdown(struct pci_dev *dev)
{
- struct msi_desc *entry;
+ struct msi_desc *desc;

if (!pci_msi_enable || !dev || !dev->msix_enabled)
return;
@@ -800,8 +798,8 @@ static void pci_msix_shutdown(struct pci
}

/* Return the device with MSI-X masked as initial states */
- for_each_pci_msi_entry(entry, dev)
- pci_msix_mask(entry);
+ msi_for_each_desc(desc, &dev->dev, MSI_DESC_ALL)
+ pci_msix_mask(desc);

pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
pci_intx_for_msi(dev, 1);

Thomas Gleixner

Nov 26, 2021, 8:23:50 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/x86/pci/xen.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -184,7 +184,7 @@ static int xen_setup_msi_irqs(struct pci
if (ret)
goto error;
i = 0;
- for_each_pci_msi_entry(msidesc, dev) {
+ msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
irq = xen_bind_pirq_msi_to_irq(dev, msidesc, v[i],
(type == PCI_CAP_ID_MSI) ? nvec : 1,
(type == PCI_CAP_ID_MSIX) ?
@@ -235,7 +235,7 @@ static int xen_hvm_setup_msi_irqs(struct
if (type == PCI_CAP_ID_MSI && nvec > 1)
return 1;

- for_each_pci_msi_entry(msidesc, dev) {
+ msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
pirq = xen_allocate_pirq_msi(dev, msidesc);
if (pirq < 0) {
irq = -ENODEV;
@@ -270,7 +270,7 @@ static int xen_initdom_setup_msi_irqs(st
int ret = 0;
struct msi_desc *msidesc;

- for_each_pci_msi_entry(msidesc, dev) {
+ msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
struct physdev_map_pirq map_irq;
domid_t domid;

@@ -389,11 +389,9 @@ static void xen_teardown_msi_irqs(struct
struct msi_desc *msidesc;
int i;

- for_each_pci_msi_entry(msidesc, dev) {
- if (msidesc->irq) {
- for (i = 0; i < msidesc->nvec_used; i++)
- xen_destroy_irq(msidesc->irq + i);
- }
+ msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_ASSOCIATED) {
+ for (i = 0; i < msidesc->nvec_used; i++)
+ xen_destroy_irq(msidesc->irq + i);
}
}


Thomas Gleixner

Nov 26, 2021, 8:23:51 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/xen-pcifront.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -262,7 +262,7 @@ static int pci_frontend_enable_msix(stru
}

i = 0;
- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_NOTASSOCIATED) {
op.msix_entries[i].entry = entry->msi_index;
/* Vector is useless at this point. */
op.msix_entries[i].vector = -1;

Thomas Gleixner

Nov 26, 2021, 8:23:52 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
Cc: linux...@vger.kernel.org
Cc: Heiko Carstens <h...@linux.ibm.com>
Cc: Christian Borntraeger <bornt...@de.ibm.com>
---
arch/s390/pci/pci_irq.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

--- a/arch/s390/pci/pci_irq.c
+++ b/arch/s390/pci/pci_irq.c
@@ -303,7 +303,7 @@ int arch_setup_msi_irqs(struct pci_dev *

/* Request MSI interrupts */
hwirq = bit;
- for_each_pci_msi_entry(msi, pdev) {
+ msi_for_each_desc(msi, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
rc = -EIO;
if (hwirq - bit >= msi_vecs)
break;
@@ -362,9 +362,7 @@ void arch_teardown_msi_irqs(struct pci_d
return;

/* Release MSI interrupts */
- for_each_pci_msi_entry(msi, pdev) {
- if (!msi->irq)
- continue;
+ msi_for_each_desc(msi, &pdev->dev, MSI_DESC_ASSOCIATED) {
irq_set_msi_desc(msi->irq, NULL);
irq_free_desc(msi->irq);
msi->msg.address_lo = 0;

Thomas Gleixner

Nov 26, 2021, 8:23:54 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/platforms/4xx/hsta_msi.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

--- a/arch/powerpc/platforms/4xx/hsta_msi.c
+++ b/arch/powerpc/platforms/4xx/hsta_msi.c
@@ -47,7 +47,7 @@ static int hsta_setup_msi_irqs(struct pc
return -EINVAL;
}

- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_NOTASSOCIATED) {
irq = msi_bitmap_alloc_hwirqs(&ppc4xx_hsta_msi.bmp, 1);
if (irq < 0) {
pr_debug("%s: Failed to allocate msi interrupt\n",
@@ -105,10 +105,7 @@ static void hsta_teardown_msi_irqs(struc
struct msi_desc *entry;
int irq;

- for_each_pci_msi_entry(entry, dev) {
- if (!entry->irq)
- continue;
-
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_ASSOCIATED) {
irq = hsta_find_hwirq_offset(entry->irq);

/* entry->irq should always be in irq_map */

Thomas Gleixner

Nov 26, 2021, 8:23:55 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/platforms/cell/axon_msi.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

--- a/arch/powerpc/platforms/cell/axon_msi.c
+++ b/arch/powerpc/platforms/cell/axon_msi.c
@@ -265,7 +265,7 @@ static int axon_msi_setup_msi_irqs(struc
if (rc)
return rc;

- for_each_pci_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_NOTASSOCIATED) {
virq = irq_create_direct_mapping(msic->irq_domain);
if (!virq) {
dev_warn(&dev->dev,
@@ -288,10 +288,7 @@ static void axon_msi_teardown_msi_irqs(s

dev_dbg(&dev->dev, "axon_msi: tearing down msi irqs\n");

- for_each_pci_msi_entry(entry, dev) {
- if (!entry->irq)
- continue;
-
+ msi_for_each_desc(entry, &dev->dev, MSI_DESC_ASSOCIATED) {
irq_set_msi_desc(entry->irq, NULL);
irq_dispose_mapping(entry->irq);
}

Thomas Gleixner

Nov 26, 2021, 8:23:57 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/platforms/pasemi/msi.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

--- a/arch/powerpc/platforms/pasemi/msi.c
+++ b/arch/powerpc/platforms/pasemi/msi.c
@@ -62,17 +62,12 @@ static void pasemi_msi_teardown_msi_irqs

pr_debug("pasemi_msi_teardown_msi_irqs, pdev %p\n", pdev);

- for_each_pci_msi_entry(entry, pdev) {
- if (!entry->irq)
- continue;
-
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
hwirq = virq_to_hw(entry->irq);
irq_set_msi_desc(entry->irq, NULL);
irq_dispose_mapping(entry->irq);
msi_bitmap_free_hwirqs(&msi_mpic->msi_bitmap, hwirq, ALLOC_CHUNK);
}
-
- return;
}

static int pasemi_msi_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
@@ -90,7 +85,7 @@ static int pasemi_msi_setup_msi_irqs(str
msg.address_hi = 0;
msg.address_lo = PASEMI_MSI_ADDR;

- for_each_pci_msi_entry(entry, pdev) {
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
/* Allocate 16 interrupts for now, since that's the grouping for
* affinity. This can be changed later if it turns out 32 is too
* few MSIs for someone, but restrictions will apply to how the

Thomas Gleixner

Nov 26, 2021, 8:23:59 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/sysdev/fsl_msi.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

--- a/arch/powerpc/sysdev/fsl_msi.c
+++ b/arch/powerpc/sysdev/fsl_msi.c
@@ -125,17 +125,13 @@ static void fsl_teardown_msi_irqs(struct
struct fsl_msi *msi_data;
irq_hw_number_t hwirq;

- for_each_pci_msi_entry(entry, pdev) {
- if (!entry->irq)
- continue;
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
hwirq = virq_to_hw(entry->irq);
msi_data = irq_get_chip_data(entry->irq);
irq_set_msi_desc(entry->irq, NULL);
irq_dispose_mapping(entry->irq);
msi_bitmap_free_hwirqs(&msi_data->bitmap, hwirq, 1);
}
-
- return;
}

static void fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,
@@ -215,7 +211,7 @@ static int fsl_setup_msi_irqs(struct pci
}
}

- for_each_pci_msi_entry(entry, pdev) {
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
/*
* Loop over all the MSI devices until we find one that has an
* available interrupt.

Thomas Gleixner

Nov 26, 2021, 8:24:00 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/powerpc/sysdev/mpic_u3msi.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

--- a/arch/powerpc/sysdev/mpic_u3msi.c
+++ b/arch/powerpc/sysdev/mpic_u3msi.c
@@ -104,17 +104,12 @@ static void u3msi_teardown_msi_irqs(stru
struct msi_desc *entry;
irq_hw_number_t hwirq;

- for_each_pci_msi_entry(entry, pdev) {
- if (!entry->irq)
- continue;
-
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
hwirq = virq_to_hw(entry->irq);
irq_set_msi_desc(entry->irq, NULL);
irq_dispose_mapping(entry->irq);
msi_bitmap_free_hwirqs(&msi_mpic->msi_bitmap, hwirq, 1);
}
-
- return;
}

static int u3msi_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
@@ -136,7 +131,7 @@ static int u3msi_setup_msi_irqs(struct p
return -ENXIO;
}

- for_each_pci_msi_entry(entry, pdev) {
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
hwirq = msi_bitmap_alloc_hwirqs(&msi_mpic->msi_bitmap, 1);
if (hwirq < 0) {
pr_debug("u3msi: failed allocating hwirq\n");

Thomas Gleixner

Nov 26, 2021, 8:24:03 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger
Replace the about-to-vanish iterators, make use of the filtering, and take
the descriptor lock around the iteration.
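
The pattern is sketched below; any access to the descriptor storage now
has to sit inside a msi_lock_descs()/msi_unlock_descs() section
(first_msi_address() is a hypothetical helper):

  #include <linux/msi.h>

  static u64 first_msi_address(struct device *dev)
  {
      struct msi_desc *desc;
      u64 addr = 0;

      msi_lock_descs(dev);
      desc = msi_first_desc(dev);
      if (desc)
          addr = ((u64)desc->msg.address_hi << 32) | desc->msg.address_lo;
      msi_unlock_descs(dev);

      return addr;
  }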

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
Cc: Jon Mason <jdm...@kudzu.us>
Cc: Dave Jiang <dave....@intel.com>
Cc: Allen Hubbe <all...@gmail.com>
Cc: linu...@googlegroups.com
---
drivers/ntb/msi.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)

--- a/drivers/ntb/msi.c
+++ b/drivers/ntb/msi.c
@@ -108,8 +108,10 @@ int ntb_msi_setup_mws(struct ntb_dev *nt
if (!ntb->msi)
return -EINVAL;

- desc = first_msi_entry(&ntb->pdev->dev);
+ msi_lock_descs(&ntb->pdev->dev);
+ desc = msi_first_desc(&ntb->pdev->dev);
addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
+ msi_unlock_descs(&ntb->pdev->dev);

for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
@@ -281,13 +283,15 @@ int ntbm_msi_request_threaded_irq(struct
const char *name, void *dev_id,
struct ntb_msi_desc *msi_desc)
{
+ struct device *dev = &ntb->pdev->dev;
struct msi_desc *entry;
int ret;

if (!ntb->msi)
return -EINVAL;

- for_each_pci_msi_entry(entry, ntb->pdev) {
+ msi_lock_descs(dev);
+ msi_for_each_desc(entry, dev, MSI_DESC_ASSOCIATED) {
if (irq_has_action(entry->irq))
continue;

@@ -304,14 +308,17 @@ int ntbm_msi_request_threaded_irq(struct
ret = ntbm_msi_setup_callback(ntb, entry, msi_desc);
if (ret) {
devm_free_irq(&ntb->dev, entry->irq, dev_id);
- return ret;
+ goto unlock;
}

-
- return entry->irq;
+ ret = entry->irq;
+ goto unlock;
}
+ ret = -ENODEV;

- return -ENODEV;
+unlock:
+ msi_unlock_descs(dev);
+ return ret;
}
EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);


Thomas Gleixner

Nov 26, 2021, 8:24:03 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Replace the about-to-vanish iterators and make use of the filtering. Take
the descriptor lock around the iteration.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/pci/controller/pci-hyperv.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3445,18 +3445,23 @@ static int hv_pci_suspend(struct hv_devi

static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
{
- struct msi_desc *entry;
struct irq_data *irq_data;
+ struct msi_desc *entry;
+ int ret = 0;

- for_each_pci_msi_entry(entry, pdev) {
+ msi_lock_descs(&pdev->dev);
+ msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
irq_data = irq_get_irq_data(entry->irq);
- if (WARN_ON_ONCE(!irq_data))
- return -EINVAL;
+ if (WARN_ON_ONCE(!irq_data)) {
+ ret = -EINVAL;
+ break;
+ }

hv_compose_msi_msg(irq_data, &entry->msg);
}
+ msi_unlock_descs(&pdev->dev);

- return 0;
+ return ret;
}

/*

Thomas Gleixner

Nov 26, 2021, 8:24:05 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Protect the allocation properly and use the core allocation and free
mechanism.

No functional change intended.
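
The core mechanism takes a template descriptor which it copies into
freshly allocated storage. A sketch, assuming the caller holds the
descriptor mutex via msi_lock_descs() (add_descs() is hypothetical):

  #include <linux/msi.h>
  #include <linux/string.h>

  static int add_descs(struct device *dev, unsigned int first, unsigned int num)
  {
      struct msi_desc desc;
      unsigned int i;
      int ret;

      memset(&desc, 0, sizeof(desc));
      desc.nvec_used = 1;

      for (i = 0; i < num; i++) {
          desc.msi_index = first + i;
          /* Copies the template into a newly allocated descriptor */
          ret = msi_add_msi_desc(dev, &desc);
          if (ret) {
              msi_free_msi_descs(dev);
              return ret;
          }
      }
      return 0;
  }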

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/soc/ti/ti_sci_inta_msi.c | 71 +++++++++++++--------------------------
1 file changed, 25 insertions(+), 46 deletions(-)

--- a/drivers/soc/ti/ti_sci_inta_msi.c
+++ b/drivers/soc/ti/ti_sci_inta_msi.c
@@ -51,6 +51,7 @@ struct irq_domain *ti_sci_inta_msi_creat
struct irq_domain *domain;

ti_sci_inta_msi_update_chip_ops(info);
+ info->flags |= MSI_FLAG_FREE_MSI_DESCS;

domain = msi_create_irq_domain(fwnode, info, parent);
if (domain)
@@ -60,50 +61,31 @@ struct irq_domain *ti_sci_inta_msi_creat
}
EXPORT_SYMBOL_GPL(ti_sci_inta_msi_create_irq_domain);

-static void ti_sci_inta_msi_free_descs(struct device *dev)
-{
- struct msi_desc *desc, *tmp;
-
- list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
- list_del(&desc->list);
- free_msi_entry(desc);
- }
-}
-
static int ti_sci_inta_msi_alloc_descs(struct device *dev,
struct ti_sci_resource *res)
{
- struct msi_desc *msi_desc;
+ struct msi_desc msi_desc;
int set, i, count = 0;

+ memset(&msi_desc, 0, sizeof(msi_desc));
+
for (set = 0; set < res->sets; set++) {
- for (i = 0; i < res->desc[set].num; i++) {
- msi_desc = alloc_msi_entry(dev, 1, NULL);
- if (!msi_desc) {
- ti_sci_inta_msi_free_descs(dev);
- return -ENOMEM;
- }
-
- msi_desc->msi_index = res->desc[set].start + i;
- INIT_LIST_HEAD(&msi_desc->list);
- list_add_tail(&msi_desc->list, dev_to_msi_list(dev));
- count++;
+ for (i = 0; i < res->desc[set].num; i++, count++) {
+ msi_desc.msi_index = res->desc[set].start + i;
+ if (msi_add_msi_desc(dev, &msi_desc))
+ goto fail;
}
- for (i = 0; i < res->desc[set].num_sec; i++) {
- msi_desc = alloc_msi_entry(dev, 1, NULL);
- if (!msi_desc) {
- ti_sci_inta_msi_free_descs(dev);
- return -ENOMEM;
- }
-
- msi_desc->msi_index = res->desc[set].start_sec + i;
- INIT_LIST_HEAD(&msi_desc->list);
- list_add_tail(&msi_desc->list, dev_to_msi_list(dev));
- count++;
+
+ for (i = 0; i < res->desc[set].num_sec; i++, count++) {
+ msi_desc.msi_index = res->desc[set].start_sec + i;
+ if (msi_add_msi_desc(dev, &msi_desc))
+ goto fail;
}
}
-
return count;
+fail:
+ msi_free_msi_descs(dev);
+ return -ENOMEM;
}

int ti_sci_inta_msi_domain_alloc_irqs(struct device *dev,
@@ -124,20 +106,18 @@ int ti_sci_inta_msi_domain_alloc_irqs(st
if (ret)
return ret;

+ msi_lock_descs(dev);
nvec = ti_sci_inta_msi_alloc_descs(dev, res);
- if (nvec <= 0)
- return nvec;
-
- ret = msi_domain_alloc_irqs(msi_domain, dev, nvec);
- if (ret) {
- dev_err(dev, "Failed to allocate IRQs %d\n", ret);
- goto cleanup;
+ if (nvec <= 0) {
+ ret = nvec;
+ goto unlock;
}

- return 0;
-
-cleanup:
- ti_sci_inta_msi_free_descs(&pdev->dev);
+ ret = msi_domain_alloc_irqs_descs_locked(msi_domain, dev, nvec);
+ if (ret)
+ dev_err(dev, "Failed to allocate IRQs %d\n", ret);
+unlock:
+ msi_unlock_descs(dev);
return ret;
}
EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_alloc_irqs);
@@ -145,6 +125,5 @@ EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain
void ti_sci_inta_msi_domain_free_irqs(struct device *dev)
{
msi_domain_free_irqs(dev->msi.domain, dev);
- ti_sci_inta_msi_free_descs(dev);
}
EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_free_irqs);

Thomas Gleixner

Nov 26, 2021, 8:24:07 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
The function has no users and is pointless now that the core frees the MSI
descriptors, which means potential users can just use msi_domain_free_irqs().

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/soc/ti/ti_sci_inta_msi.c | 6 ------
include/linux/soc/ti/ti_sci_inta_msi.h | 1 -
2 files changed, 7 deletions(-)

--- a/drivers/soc/ti/ti_sci_inta_msi.c
+++ b/drivers/soc/ti/ti_sci_inta_msi.c
@@ -121,9 +121,3 @@ int ti_sci_inta_msi_domain_alloc_irqs(st
return ret;
}
EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_alloc_irqs);
-
-void ti_sci_inta_msi_domain_free_irqs(struct device *dev)
-{
- msi_domain_free_irqs(dev->msi.domain, dev);
-}
-EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_free_irqs);
--- a/include/linux/soc/ti/ti_sci_inta_msi.h
+++ b/include/linux/soc/ti/ti_sci_inta_msi.h
@@ -18,5 +18,4 @@ struct irq_domain
struct irq_domain *parent);
int ti_sci_inta_msi_domain_alloc_irqs(struct device *dev,
struct ti_sci_resource *res);
-void ti_sci_inta_msi_domain_free_irqs(struct device *dev);
#endif /* __INCLUDE_LINUX_IRQCHIP_TI_SCI_INTA_H */

Thomas Gleixner

Nov 26, 2021, 8:24:08 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Let the MSI irq domain code handle descriptor allocation and free.
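
A sketch of what the opt-in amounts to; with both flags set the core
allocates trivial descriptors (index 0..n-1) in msi_domain_alloc_irqs()
and frees them again in msi_domain_free_irqs(), so the bus code needs no
descriptor handling of its own (my_update_info() is hypothetical):

  #include <linux/msi.h>

  static void my_update_info(struct msi_domain_info *info)
  {
      info->flags |= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS | MSI_FLAG_FREE_MSI_DESCS;
  }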

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/bus/fsl-mc/fsl-mc-msi.c | 61 ++--------------------------------------
1 file changed, 4 insertions(+), 57 deletions(-)

--- a/drivers/bus/fsl-mc/fsl-mc-msi.c
+++ b/drivers/bus/fsl-mc/fsl-mc-msi.c
@@ -170,6 +170,7 @@ struct irq_domain *fsl_mc_msi_create_irq
fsl_mc_msi_update_dom_ops(info);
if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
fsl_mc_msi_update_chip_ops(info);
+ info->flags |= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS | MSI_FLAG_FREE_MSI_DESCS;

domain = msi_create_irq_domain(fwnode, info, parent);
if (domain)
@@ -210,45 +211,7 @@ struct irq_domain *fsl_mc_find_msi_domai
return msi_domain;
}

-static void fsl_mc_msi_free_descs(struct device *dev)
-{
- struct msi_desc *desc, *tmp;
-
- list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
- list_del(&desc->list);
- free_msi_entry(desc);
- }
-}
-
-static int fsl_mc_msi_alloc_descs(struct device *dev, unsigned int irq_count)
-
-{
- unsigned int i;
- int error;
- struct msi_desc *msi_desc;
-
- for (i = 0; i < irq_count; i++) {
- msi_desc = alloc_msi_entry(dev, 1, NULL);
- if (!msi_desc) {
- dev_err(dev, "Failed to allocate msi entry\n");
- error = -ENOMEM;
- goto cleanup_msi_descs;
- }
-
- msi_desc->msi_index = i;
- INIT_LIST_HEAD(&msi_desc->list);
- list_add_tail(&msi_desc->list, dev_to_msi_list(dev));
- }
-
- return 0;
-
-cleanup_msi_descs:
- fsl_mc_msi_free_descs(dev);
- return error;
-}
-
-int fsl_mc_msi_domain_alloc_irqs(struct device *dev,
- unsigned int irq_count)
+int fsl_mc_msi_domain_alloc_irqs(struct device *dev, unsigned int irq_count)
{
struct irq_domain *msi_domain;
int error;
@@ -261,28 +224,17 @@ int fsl_mc_msi_domain_alloc_irqs(struct
if (error)
return error;

- if (!list_empty(dev_to_msi_list(dev)))
+ if (msi_device_num_descs(dev))
return -EINVAL;

- error = fsl_mc_msi_alloc_descs(dev, irq_count);
- if (error < 0)
- return error;
-
/*
* NOTE: Calling this function will trigger the invocation of the
* its_fsl_mc_msi_prepare() callback
*/
error = msi_domain_alloc_irqs(msi_domain, dev, irq_count);

- if (error) {
+ if (error)
dev_err(dev, "Failed to allocate IRQs\n");
- goto cleanup_msi_descs;
- }
-
- return 0;
-
-cleanup_msi_descs:
- fsl_mc_msi_free_descs(dev);
return error;
}

@@ -295,9 +247,4 @@ void fsl_mc_msi_domain_free_irqs(struct
return;

msi_domain_free_irqs(msi_domain, dev);
-
- if (list_empty(dev_to_msi_list(dev)))
- return;
-
- fsl_mc_msi_free_descs(dev);
}

Thomas Gleixner

Nov 26, 2021, 8:24:10 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Use the core functionality for platform MSI interrupt domains. The platform
device MSI interrupt domains will be converted in a later step.
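
From a driver's point of view nothing changes; allocation still goes
through platform_msi_domain_alloc_irqs() and only the descriptor handling
behind it moves into the core. A usage sketch (my_write_msg() and
my_probe() are hypothetical):

  #include <linux/msi.h>

  /* A real driver writes the message to its device registers here */
  static void my_write_msg(struct msi_desc *desc, struct msi_msg *msg)
  {
  }

  static int my_probe(struct device *dev)
  {
      /* The core now allocates and frees the descriptors behind this */
      return platform_msi_domain_alloc_irqs(dev, 4, my_write_msg);
  }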

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/base/platform-msi.c | 112 ++++++++++++++++++--------------------------
1 file changed, 48 insertions(+), 64 deletions(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -107,57 +107,6 @@ static void platform_msi_update_chip_ops
info->flags &= ~MSI_FLAG_LEVEL_CAPABLE;
}

-static void platform_msi_free_descs(struct device *dev, int base, int nvec)
-{
- struct msi_desc *desc, *tmp;
-
- list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
- if (desc->msi_index >= base &&
- desc->msi_index < (base + nvec)) {
- list_del(&desc->list);
- free_msi_entry(desc);
- }
- }
-}
-
-static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
- int nvec)
-{
- struct msi_desc *desc;
- int i, base = 0;
-
- if (!list_empty(dev_to_msi_list(dev))) {
- desc = list_last_entry(dev_to_msi_list(dev),
- struct msi_desc, list);
- base = desc->msi_index + 1;
- }
-
- for (i = 0; i < nvec; i++) {
- desc = alloc_msi_entry(dev, 1, NULL);
- if (!desc)
- break;
-
- desc->msi_index = base + i;
- desc->irq = virq ? virq + i : 0;
-
- list_add_tail(&desc->list, dev_to_msi_list(dev));
- }
-
- if (i != nvec) {
- /* Clean up the mess */
- platform_msi_free_descs(dev, base, nvec);
-
- return -ENOMEM;
- }
-
- return 0;
-}
-
-static int platform_msi_alloc_descs(struct device *dev, int nvec)
-{
- return platform_msi_alloc_descs_with_irq(dev, 0, nvec);
-}
-
/**
* platform_msi_create_irq_domain - Create a platform MSI interrupt domain
* @fwnode: Optional fwnode of the interrupt controller
@@ -180,7 +129,8 @@ struct irq_domain *platform_msi_create_i
platform_msi_update_dom_ops(info);
if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
platform_msi_update_chip_ops(info);
- info->flags |= MSI_FLAG_DEV_SYSFS;
+ info->flags |= MSI_FLAG_DEV_SYSFS | MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
+ MSI_FLAG_FREE_MSI_DESCS;

domain = msi_create_irq_domain(fwnode, info, parent);
if (domain)
@@ -262,20 +212,10 @@ int platform_msi_domain_alloc_irqs(struc
if (err)
return err;

- err = platform_msi_alloc_descs(dev, nvec);
- if (err)
- goto out_free_priv_data;
-
err = msi_domain_alloc_irqs(dev->msi.domain, dev, nvec);
if (err)
- goto out_free_desc;
-
- return 0;
+ platform_msi_free_priv_data(dev);

-out_free_desc:
- platform_msi_free_descs(dev, 0, nvec);
-out_free_priv_data:
- platform_msi_free_priv_data(dev);
return err;
}
EXPORT_SYMBOL_GPL(platform_msi_domain_alloc_irqs);
@@ -287,7 +227,6 @@ EXPORT_SYMBOL_GPL(platform_msi_domain_al
void platform_msi_domain_free_irqs(struct device *dev)
{
msi_domain_free_irqs(dev->msi.domain, dev);
- platform_msi_free_descs(dev, 0, MAX_DEV_MSIS);
platform_msi_free_priv_data(dev);
}
EXPORT_SYMBOL_GPL(platform_msi_domain_free_irqs);
@@ -361,6 +300,51 @@ struct irq_domain *
return NULL;
}

+static void platform_msi_free_descs(struct device *dev, int base, int nvec)
+{
+ struct msi_desc *desc, *tmp;
+
+ list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
+ if (desc->msi_index >= base &&
+ desc->msi_index < (base + nvec)) {
+ list_del(&desc->list);
+ free_msi_entry(desc);
+ }
+ }
+}
+
+static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
+ int nvec)
+{
+ struct msi_desc *desc;
+ int i, base = 0;
+
+ if (!list_empty(dev_to_msi_list(dev))) {
+ desc = list_last_entry(dev_to_msi_list(dev),
+ struct msi_desc, list);
+ base = desc->msi_index + 1;
+ }
+
+ for (i = 0; i < nvec; i++) {
+ desc = alloc_msi_entry(dev, 1, NULL);
+ if (!desc)
+ break;
+
+ desc->msi_index = base + i;
+ desc->irq = virq + i;
+
+ list_add_tail(&desc->list, dev_to_msi_list(dev));
+ }
+
+ if (i != nvec) {
+ /* Clean up the mess */
+ platform_msi_free_descs(dev, base, nvec);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
/**
* platform_msi_device_domain_free - Free interrupts associated with a platform-msi
* device domain

Thomas Gleixner

Nov 26, 2021, 8:24:12 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
The allocation code is overly complex. It tries to keep the MSI index space
packed, which does not work once an interrupt is freed. There is no
requirement for this. The only requirement is that the MSI index is unique.

Move the MSI descriptor allocation into msi_domain_populate_irqs() and use
the Linux interrupt number as the MSI index, which fulfils the uniqueness
requirement.

This requires locking the MSI descriptors, which reverses the lock order
of the descriptor mutex vs. the domain mutex compared to the regular MSI
alloc/free functions. Assign a separate lockdep class to these MSI device
domains.
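
A sketch of the lockdep class assignment, assuming dev->msi.data was set
up via msi_setup_device_data() beforehand (the key and function names are
hypothetical):

  #include <linux/lockdep.h>
  #include <linux/msi.h>

  static struct lock_class_key my_device_msi_lock_class;

  static void my_domain_init(struct device *dev)
  {
      /* Avoid false positive lockdep reports for the reversed nesting */
      lockdep_set_class(&dev->msi.data->mutex, &my_device_msi_lock_class);
  }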

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
drivers/base/platform-msi.c | 88 +++++++++-----------------------------------
kernel/irq/msi.c | 46 +++++++++++------------
2 files changed, 40 insertions(+), 94 deletions(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -246,6 +246,8 @@ void *platform_msi_get_host_data(struct
return data->host_data;
}

+static struct lock_class_key platform_device_msi_lock_class;
+
/**
* __platform_msi_create_device_domain - Create a platform-msi device domain
*
@@ -278,6 +280,13 @@ struct irq_domain *
if (err)
return NULL;

+ /*
+ * Use a separate lock class for the MSI descriptor mutex on
+ * platform MSI device domains because the descriptor mutex nests
+ * into the domain mutex. See alloc/free below.
+ */
+ lockdep_set_class(&dev->msi.data->mutex, &platform_device_msi_lock_class);
+
data = dev->msi.data->platform_data;
data->host_data = host_data;
domain = irq_domain_create_hierarchy(dev->msi.domain, 0,
@@ -300,75 +309,23 @@ struct irq_domain *
return NULL;
}
- desc->irq = virq + i;
-
- list_add_tail(&desc->list, dev_to_msi_list(dev));
- }
-
- if (i != nvec) {
- /* Clean up the mess */
- platform_msi_free_descs(dev, base, nvec);
- return -ENOMEM;
- }
-
- return 0;
-}
-
/**
* platform_msi_device_domain_free - Free interrupts associated with a platform-msi
* device domain
*
* @domain: The platform-msi device domain
* @virq: The base irq from which to perform the free operation
- * @nvec: How many interrupts to free from @virq
+ * @nr_irqs: How many interrupts to free from @virq
*/
void platform_msi_device_domain_free(struct irq_domain *domain, unsigned int virq,
- unsigned int nvec)
+ unsigned int nr_irqs)
{
struct platform_msi_priv_data *data = domain->host_data;
- struct msi_desc *desc, *tmp;

- for_each_msi_entry_safe(desc, tmp, data->dev) {
- if (WARN_ON(!desc->irq || desc->nvec_used != 1))
- return;
- if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
- continue;
-
- irq_domain_free_irqs_common(domain, desc->irq, 1);
- list_del(&desc->list);
- free_msi_entry(desc);
- }
+ msi_lock_descs(data->dev);
+ irq_domain_free_irqs_common(domain, virq, nr_irqs);
+ msi_free_msi_descs_range(data->dev, MSI_DESC_ALL, virq, nr_irqs);
+ msi_unlock_descs(data->dev);
}

/**
@@ -377,7 +334,7 @@ void platform_msi_device_domain_free(str
*
* @domain: The platform-msi device domain
* @virq: The base irq from which to perform the allocate operation
- * @nr_irqs: How many interrupts to free from @virq
+ * @nr_irqs: How many interrupts to allocate from @virq
*
* Return 0 on success, or an error code on failure. Must be called
* with irq_domain_mutex held (which can only be done as part of a
@@ -387,16 +344,7 @@ int platform_msi_device_domain_alloc(str
unsigned int nr_irqs)
{
struct platform_msi_priv_data *data = domain->host_data;
- int err;
-
- err = platform_msi_alloc_descs_with_irq(data->dev, virq, nr_irqs);
- if (err)
- return err;
-
- err = msi_domain_populate_irqs(domain->parent, data->dev,
- virq, nr_irqs, &data->arg);
- if (err)
- platform_msi_device_domain_free(domain, virq, nr_irqs);
+ struct device *dev = data->dev;

- return err;
+ return msi_domain_populate_irqs(domain->parent, dev, virq, nr_irqs, &data->arg);
}
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -775,43 +775,41 @@ int msi_domain_prepare_irqs(struct irq_d
}

int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
- int virq, int nvec, msi_alloc_info_t *arg)
+ int virq_base, int nvec, msi_alloc_info_t *arg)
{
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;
struct msi_desc *desc;
- int ret = 0;
+ int ret, virq;

- for_each_msi_entry(desc, dev) {
- /* Don't even try the multi-MSI brain damage. */
- if (WARN_ON(!desc->irq || desc->nvec_used != 1)) {
- ret = -EINVAL;
- break;
+ msi_lock_descs(dev);
+ for (virq = virq_base; virq < virq_base + nvec; virq++) {
+ desc = alloc_msi_entry(dev, 1, NULL);
+ if (!desc) {
+ ret = -ENOMEM;
+ goto fail;
}

- if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
- continue;
+ desc->msi_index = virq;
+ desc->irq = virq;
+ list_add_tail(&desc->list, &dev->msi.data->list);
+ dev->msi.data->num_descs++;

ops->set_desc(arg, desc);
- /* Assumes the domain mutex is held! */
- ret = irq_domain_alloc_irqs_hierarchy(domain, desc->irq, 1,
- arg);
+ ret = irq_domain_alloc_irqs_hierarchy(domain, virq, 1, arg);
if (ret)
- break;
+ goto fail;

- irq_set_msi_desc_off(desc->irq, 0, desc);
- }
-
- if (ret) {
- /* Mop up the damage */
- for_each_msi_entry(desc, dev) {
- if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
- continue;
-
- irq_domain_free_irqs_common(domain, desc->irq, 1);
- }
+ irq_set_msi_desc(virq, desc);
}
+ msi_unlock_descs(dev);
+ return 0;

+fail:
+ for (--virq; virq >= virq_base; virq--)
+ irq_domain_free_irqs_common(domain, virq, 1);
+ msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, nvec);
+ msi_unlock_descs(dev);
return ret;
}


Thomas Gleixner

unread,
Nov 26, 2021, 8:24:13 PM11/26/21
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
There is no real reason to do several loops over the MSI descriptors
instead of just doing one loop. In case of an error everything is undone
anyway, so it does not matter whether it's a partial or a full rollback.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
.clang-format | 1
include/linux/msi.h | 7 --
kernel/irq/msi.c | 129 +++++++++++++++++++++++++++-------------------------
3 files changed, 70 insertions(+), 67 deletions(-)

--- a/.clang-format
+++ b/.clang-format
@@ -216,7 +216,6 @@ ExperimentalAutoDetectBinPacking: false
- 'for_each_migratetype_order'
- 'for_each_msi_entry'
- 'for_each_msi_entry_safe'
- - 'for_each_msi_vector'
- 'for_each_net'
- 'for_each_net_continue_reverse'
- 'for_each_netdev'
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -263,12 +263,7 @@ static inline struct msi_desc *msi_first
list_for_each_entry((desc), dev_to_msi_list((dev)), list)
#define for_each_msi_entry_safe(desc, tmp, dev) \
list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
-#define for_each_msi_vector(desc, __irq, dev) \
- for_each_msi_entry((desc), (dev)) \
- if ((desc)->irq) \
- for (__irq = (desc)->irq; \
- __irq < ((desc)->irq + (desc)->nvec_used); \
- __irq++)
+
#ifdef CONFIG_IRQ_MSI_IOMMU
static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
{
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -873,23 +873,74 @@ static int msi_handle_pci_fail(struct ir
return allocated ? allocated : -ENOSPC;
}

+#define VIRQ_CAN_RESERVE 0x01
+#define VIRQ_ACTIVATE 0x02
+#define VIRQ_NOMASK_QUIRK 0x04
+
+static int msi_init_virq(struct irq_domain *domain, int virq, unsigned int vflags)
+{
+ struct irq_data *irqd = irq_domain_get_irq_data(domain, virq);
+ int ret;
+
+ if (!(vflags & VIRQ_CAN_RESERVE)) {
+ irqd_clr_can_reserve(irqd);
+ if (vflags & VIRQ_NOMASK_QUIRK)
+ irqd_set_msi_nomask_quirk(irqd);
+ }
+
+ if (!(vflags & VIRQ_ACTIVATE))
+ return 0;
+
+ ret = irq_domain_activate_irq(irqd, vflags & VIRQ_CAN_RESERVE);
+ if (ret)
+ return ret;
+ /*
+ * If the interrupt uses reservation mode, clear the activated bit
+ * so request_irq() will assign the final vector.
+ */
+ if (vflags & VIRQ_CAN_RESERVE)
+ irqd_clr_activated(irqd);
+ return 0;
+}
+
int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
int nvec)
{
struct msi_domain_info *info = domain->host_data;
struct msi_domain_ops *ops = info->ops;
- struct irq_data *irq_data;
- struct msi_desc *desc;
msi_alloc_info_t arg = { };
+ unsigned int vflags = 0;
+ struct msi_desc *desc;
int allocated = 0;
int i, ret, virq;
- bool can_reserve;

ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
if (ret)
return ret;

- for_each_msi_entry(desc, dev) {
+ /*
+ * This flag is set by the PCI layer as we need to activate
+ * the MSI entries before the PCI layer enables MSI in the
+ * card. Otherwise the card latches a random msi message.
+ */
+ if (info->flags & MSI_FLAG_ACTIVATE_EARLY)
+ vflags |= VIRQ_ACTIVATE;
+
+ /*
+ * Interrupt can use a reserved vector and will not occupy
+ * a real device vector until the interrupt is requested.
+ */
+ if (msi_check_reservation_mode(domain, info, dev)) {
+ vflags |= VIRQ_CAN_RESERVE;
+ /*
+ * MSI affinity setting requires a special quirk (X86) when
+ * reservation mode is active.
+ */
+ if (domain->flags & IRQ_DOMAIN_MSI_NOMASK_QUIRK)
+ vflags |= VIRQ_NOMASK_QUIRK;
+ }
+
+ msi_for_each_desc(desc, dev, MSI_DESC_NOTASSOCIATED) {
ops->set_desc(&arg, desc);

virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
@@ -901,49 +952,12 @@ int __msi_domain_alloc_irqs(struct irq_d
for (i = 0; i < desc->nvec_used; i++) {
irq_set_msi_desc_off(virq, i, desc);
irq_debugfs_copy_devname(virq + i, dev);
+ ret = msi_init_virq(domain, virq + i, vflags);
+ if (ret)
+ return ret;
}
allocated++;
}
-
- can_reserve = msi_check_reservation_mode(domain, info, dev);
-
- /*
- * This flag is set by the PCI layer as we need to activate
- * the MSI entries before the PCI layer enables MSI in the
- * card. Otherwise the card latches a random msi message.
- */
- if (!(info->flags & MSI_FLAG_ACTIVATE_EARLY))
- goto skip_activate;
-
- for_each_msi_vector(desc, i, dev) {
- if (desc->irq == i) {
- virq = desc->irq;
- dev_dbg(dev, "irq [%d-%d] for MSI\n",
- virq, virq + desc->nvec_used - 1);
- }
-
- irq_data = irq_domain_get_irq_data(domain, i);
- if (!can_reserve) {
- irqd_clr_can_reserve(irq_data);
- if (domain->flags & IRQ_DOMAIN_MSI_NOMASK_QUIRK)
- irqd_set_msi_nomask_quirk(irq_data);
- }
- ret = irq_domain_activate_irq(irq_data, can_reserve);
- if (ret)
- return ret;
- }
-
-skip_activate:
- /*
- * If these interrupts use reservation mode, clear the activated bit
- * so request_irq() will assign the final vector.
- */
- if (can_reserve) {
- for_each_msi_vector(desc, i, dev) {
- irq_data = irq_domain_get_irq_data(domain, i);
- irqd_clr_activated(irq_data);
- }
- }
return 0;
}

@@ -1021,26 +1035,21 @@ int msi_domain_alloc_irqs(struct irq_dom

void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
{
- struct irq_data *irq_data;
+ struct irq_data *irqd;
struct msi_desc *desc;
int i;

- for_each_msi_vector(desc, i, dev) {
- irq_data = irq_domain_get_irq_data(domain, i);
- if (irqd_is_activated(irq_data))
- irq_domain_deactivate_irq(irq_data);
- }
-
- for_each_msi_entry(desc, dev) {
- /*
- * We might have failed to allocate an MSI early
- * enough that there is no IRQ associated to this
- * entry. If that's the case, don't do anything.
- */
- if (desc->irq) {
- irq_domain_free_irqs(desc->irq, desc->nvec_used);
- desc->irq = 0;
+ /* Only handle MSI entries which have an interrupt associated */
+ msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
+ /* Make sure all interrupts are deactivated */
+ for (i = 0; i < desc->nvec_used; i++) {
+ irqd = irq_domain_get_irq_data(domain, desc->irq + i);
+ if (irqd && irqd_is_activated(irqd))
+ irq_domain_deactivate_irq(irqd);
}
+
+ irq_domain_free_irqs(desc->irq, desc->nvec_used);
+ desc->irq = 0;
}
}


Thomas Gleixner

unread,
Nov 26, 2021, 8:24:14 PM11/26/21
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Use the new iterator functions and add locking where required.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
kernel/irq/msi.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -354,6 +354,7 @@ struct msi_desc *msi_next_desc(struct de
int __msi_get_virq(struct device *dev, unsigned int index)
{
struct msi_desc *desc;
+ int ret = -ENOENT;
bool pcimsi;

if (!dev->msi.data)
@@ -361,11 +362,12 @@ int __msi_get_virq(struct device *dev, u

pcimsi = msi_device_has_property(dev, MSI_PROP_PCI_MSI);

- for_each_msi_entry(desc, dev) {
+ msi_lock_descs(dev);
+ msi_for_each_desc_from(desc, dev, MSI_DESC_ASSOCIATED, index) {
/* PCI-MSI has only one descriptor for multiple interrupts. */
if (pcimsi) {
- if (desc->irq && index < desc->nvec_used)
- return desc->irq + index;
+ if (index < desc->nvec_used)
+ ret = desc->irq + index;
break;
}

@@ -373,10 +375,13 @@ int __msi_get_virq(struct device *dev, u
* PCI-MSIX and platform MSI use a descriptor per
* interrupt.
*/
- if (desc->msi_index == index)
- return desc->irq;
+ if (desc->msi_index == index) {
+ ret = desc->irq;
+ break;
+ }
}
- return -ENOENT;
+ msi_unlock_descs(dev);
+ return ret;
}
EXPORT_SYMBOL_GPL(__msi_get_virq);

@@ -407,7 +412,7 @@ static const struct attribute_group **ms
int i;

/* Determine how many msi entries we have */
- for_each_msi_entry(entry, dev)
+ msi_for_each_desc(entry, dev, MSI_DESC_ALL)
num_msi += entry->nvec_used;
if (!num_msi)
return NULL;
@@ -417,7 +422,7 @@ static const struct attribute_group **ms
if (!msi_attrs)
return ERR_PTR(-ENOMEM);

- for_each_msi_entry(entry, dev) {
+ msi_for_each_desc(entry, dev, MSI_DESC_ALL) {
for (i = 0; i < entry->nvec_used; i++) {
msi_dev_attr = kzalloc(sizeof(*msi_dev_attr), GFP_KERNEL);
if (!msi_dev_attr)
@@ -838,7 +843,7 @@ static bool msi_check_reservation_mode(s
* Checking the first MSI descriptor is sufficient. MSIX supports
* masking and MSI does so when the can_mask attribute is set.
*/
- desc = first_msi_entry(dev);
+ desc = msi_first_desc(dev);
return desc->pci.msi_attrib.is_msix || desc->pci.msi_attrib.can_mask;
}


Thomas Gleixner

unread,
Nov 26, 2021, 8:24:16 PM11/26/21
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Get rid of the old iterators, alloc/free functions and adjust the core code
accordingly.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 15 ---------------
kernel/irq/msi.c | 31 +++++++++++++++----------------
2 files changed, 15 insertions(+), 31 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -254,15 +254,7 @@ static inline struct msi_desc *msi_first
#define msi_for_each_desc(desc, dev, filter) \
msi_for_each_desc_from(desc, dev, filter, 0)

-/* Helpers to hide struct msi_desc implementation details */
#define msi_desc_to_dev(desc) ((desc)->dev)
-#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
-#define first_msi_entry(dev) \
- list_first_entry(dev_to_msi_list((dev)), struct msi_desc, list)
-#define for_each_msi_entry(desc, dev) \
- list_for_each_entry((desc), dev_to_msi_list((dev)), list)
-#define for_each_msi_entry_safe(desc, tmp, dev) \
- list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)

#ifdef CONFIG_IRQ_MSI_IOMMU
static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
@@ -288,10 +280,6 @@ static inline void msi_desc_set_iommu_co
#endif

#ifdef CONFIG_PCI_MSI
-#define first_pci_msi_entry(pdev) first_msi_entry(&(pdev)->dev)
-#define for_each_pci_msi_entry(desc, pdev) \
- for_each_msi_entry((desc), &(pdev)->dev)
-
struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc);
void pci_write_msi_msg(unsigned int irq, struct msi_msg *msg);
#else /* CONFIG_PCI_MSI */
@@ -314,9 +302,6 @@ static inline void msi_free_msi_descs(st
msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, UINT_MAX);
}

-struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
- const struct irq_affinity_desc *affinity);
-void free_msi_entry(struct msi_desc *entry);
void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg);

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -19,8 +19,10 @@

#include "internals.h"

+#define dev_to_msi_list(dev) (&(dev)->msi.data->list)
+
/**
- * alloc_msi_entry - Allocate an initialized msi_desc
+ * msi_alloc_desc - Allocate an initialized msi_desc
* @dev: Pointer to the device for which this is allocated
* @nvec: The number of vectors used in this entry
* @affinity: Optional pointer to an affinity mask array size of @nvec
@@ -30,12 +32,11 @@
*
* Return: pointer to allocated &msi_desc on success or %NULL on failure
*/
-struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
- const struct irq_affinity_desc *affinity)
+static struct msi_desc *msi_alloc_desc(struct device *dev, int nvec,
+ const struct irq_affinity_desc *affinity)
{
- struct msi_desc *desc;
+ struct msi_desc *desc = kzalloc(sizeof(*desc), GFP_KERNEL);

- desc = kzalloc(sizeof(*desc), GFP_KERNEL);
if (!desc)
return NULL;

@@ -43,21 +44,19 @@ struct msi_desc *alloc_msi_entry(struct
desc->dev = dev;
desc->nvec_used = nvec;
if (affinity) {
- desc->affinity = kmemdup(affinity,
- nvec * sizeof(*desc->affinity), GFP_KERNEL);
+ desc->affinity = kmemdup(affinity, nvec * sizeof(*desc->affinity), GFP_KERNEL);
if (!desc->affinity) {
kfree(desc);
return NULL;
}
}
-
return desc;
}

-void free_msi_entry(struct msi_desc *entry)
+static void msi_free_desc(struct msi_desc *desc)
{
- kfree(entry->affinity);
- kfree(entry);
+ kfree(desc->affinity);
+ kfree(desc);
}

/**
@@ -73,7 +72,7 @@ int msi_add_msi_desc(struct device *dev,

lockdep_assert_held(&dev->msi.data->mutex);

- desc = alloc_msi_entry(dev, init_desc->nvec_used, init_desc->affinity);
+ desc = msi_alloc_desc(dev, init_desc->nvec_used, init_desc->affinity);
if (!desc)
return -ENOMEM;

@@ -103,7 +102,7 @@ int msi_add_simple_msi_descs(struct devi
lockdep_assert_held(&dev->msi.data->mutex);

for (i = 0; i < ndesc; i++) {
- desc = alloc_msi_entry(dev, 1, NULL);
+ desc = msi_alloc_desc(dev, 1, NULL);
if (!desc)
goto fail;
desc->msi_index = index + i;
@@ -116,7 +115,7 @@ int msi_add_simple_msi_descs(struct devi
fail:
list_for_each_entry_safe(desc, tmp, &list, list) {
list_del(&desc->list);
- free_msi_entry(desc);
+ msi_free_desc(desc);
}
return -ENOMEM;
}
@@ -143,7 +142,7 @@ void msi_free_msi_descs_range(struct dev
if (desc->msi_index < base_index || desc->msi_index >= base_index + ndesc)
continue;
list_del(&desc->list);
- free_msi_entry(desc);
+ msi_free_desc(desc);
dev->msi.data->num_descs--;
}
}
@@ -779,7 +778,7 @@ int msi_domain_populate_irqs(struct irq_

msi_lock_descs(dev);
for (virq = virq_base; virq < virq_base + nvec; virq++) {
- desc = alloc_msi_entry(dev, 1, NULL);
+ desc = msi_alloc_desc(dev, 1, NULL);
if (!desc) {
ret = -ENOMEM;
goto fail;

Thomas Gleixner

Nov 26, 2021, 8:24:16 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -2,6 +2,20 @@
#ifndef LINUX_MSI_H
#define LINUX_MSI_H

+/*
+ * This header file contains MSI data structures and functions which are
+ * only relevant for:
+ * - Interrupt core code
+ * - PCI/MSI core code
+ * - MSI interrupt domain implementations
+ * - IOMMU, low level VFIO, NTB and other justified exceptions
+ * dealing with low level MSI details.
+ *
+ * Regular device drivers have no business with any of these functions and
+ * especially storing MSI descriptor pointers in random code is considered
+ * abuse. The only function which is relevant for drivers is msi_get_virq().
+ */
+
#include <linux/spinlock.h>
#include <linux/mutex.h>
#include <linux/list.h>
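
For drivers the entire interface thus shrinks to a lookup plus the usual
request_irq(). A sketch, assuming msi_get_virq() returns the Linux
interrupt number on success and zero or a negative value on failure
(my_request_irq() is hypothetical):

  #include <linux/interrupt.h>
  #include <linux/msi.h>

  static int my_request_irq(struct device *dev, irq_handler_t handler, void *data)
  {
      int virq = msi_get_virq(dev, 0);

      if (virq <= 0)
          return -ENOENT;

      return request_irq(virq, handler, 0, dev_name(dev), data);
  }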

Thomas Gleixner

Nov 26, 2021, 8:24:17 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
The sysfs handling for MSI is a convoluted maze and it is in the way of
supporting dynamic expansion of the MSI-X vectors because it only supports
a one-off bulk population/free of the sysfs entries.

Change it to do:

1) Create an empty sysfs attribute group when msi_device_data is
   allocated.

2) Populate the entries when an MSI descriptor is initialized.

3) Free the entries when an MSI descriptor is detached from a Linux
   interrupt.

4) Provide functions for the legacy non-irqdomain fallback code to
   do a bulk population/free. This code won't support dynamic
   expansion.

This makes the code simpler and reduces the number of allocations as the
empty attribute group can be shared.
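
A sketch of the per-descriptor population, assuming the empty "msi_irqs"
group was registered when msi_device_data was allocated and that @attr
already has its name, mode and ->show method set up (my_add_msi_attr() is
hypothetical):

  #include <linux/device.h>
  #include <linux/sysfs.h>

  static int my_add_msi_attr(struct device *dev, struct device_attribute *attr)
  {
      sysfs_attr_init(&attr->attr);
      /* Adds the file to the already existing msi_irqs directory */
      return sysfs_add_file_to_group(&dev->kobj, &attr->attr, "msi_irqs");
  }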

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 7 +
kernel/irq/msi.c | 196 +++++++++++++++++++++++-----------------------------
2 files changed, 95 insertions(+), 108 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -72,6 +72,7 @@ struct irq_data;
struct msi_desc;
struct pci_dev;
struct platform_msi_priv_data;
+struct device_attribute;

void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
#ifdef CONFIG_GENERIC_MSI_IRQ
@@ -127,6 +128,7 @@ struct pci_msi_desc {
* @dev: Pointer to the device which uses this descriptor
* @msg: The last set MSI message cached for reuse
* @affinity: Optional pointer to a cpu affinity mask for this descriptor
+ * @sysfs_attrs: Pointer to sysfs device attributes
*
* @write_msi_msg: Callback that may be called when the MSI message
* address or data changes
@@ -146,6 +148,9 @@ struct msi_desc {
#ifdef CONFIG_IRQ_MSI_IOMMU
const void *iommu_cookie;
#endif
+#ifdef CONFIG_SYSFS
+ struct device_attribute *sysfs_attrs;
+#endif

void (*write_msi_msg)(struct msi_desc *entry, void *data);
void *write_msi_msg_data;
@@ -171,7 +176,6 @@ enum msi_desc_filter {
* @lock: Spinlock to protect register access
* @properties: MSI properties which are interesting to drivers
* @num_descs: The number of allocated MSI descriptors for the device
- * @attrs: Pointer to the sysfs attribute group
* @platform_data: Platform-MSI specific data
* @list: List of MSI descriptors associated to the device
* @mutex: Mutex protecting the MSI list
@@ -182,7 +186,6 @@ struct msi_device_data {
raw_spinlock_t lock;
unsigned long properties;
unsigned int num_descs;
- const struct attribute_group **attrs;
struct platform_msi_priv_data *platform_data;
struct list_head list;
struct mutex mutex;
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -19,6 +19,7 @@

#include "internals.h"

+static inline int msi_sysfs_create_group(struct device *dev);
#define dev_to_msi_list(dev) (&(dev)->msi.data->list)

/**
@@ -208,6 +209,7 @@ static void msi_device_data_release(stru
int msi_setup_device_data(struct device *dev)
{
struct msi_device_data *md;
+ int ret;

if (dev->msi.data)
return 0;
@@ -216,6 +218,12 @@ int msi_setup_device_data(struct device
if (!md)
return -ENOMEM;

+ ret = msi_sysfs_create_group(dev);
+ if (ret) {
+ devres_free(md);
+ return ret;
+ }
+
raw_spin_lock_init(&md->lock);
INIT_LIST_HEAD(&md->list);
mutex_init(&md->mutex);
@@ -395,6 +403,20 @@ int __msi_get_virq(struct device *dev, u
EXPORT_SYMBOL_GPL(__msi_get_virq);

#ifdef CONFIG_SYSFS
+static struct attribute *msi_dev_attrs[] = {
+ NULL
+};
+
+static const struct attribute_group msi_irqs_group = {
+ .name = "msi_irqs",
+ .attrs = msi_dev_attrs,
+};
+
+static inline int msi_sysfs_create_group(struct device *dev)
+{
+ return devm_device_add_group(dev, &msi_irqs_group);
+}
+
static ssize_t msi_mode_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -404,97 +426,74 @@ static ssize_t msi_mode_show(struct devi
return sysfs_emit(buf, "%s\n", is_msix ? "msix" : "msi");
}

-/**
- * msi_populate_sysfs - Populate msi_irqs sysfs entries for devices
- * @dev: The device(PCI, platform etc) who will get sysfs entries
- */
-static const struct attribute_group **msi_populate_sysfs(struct device *dev)
+static void msi_sysfs_remove_desc(struct device *dev, struct msi_desc *desc)
{
- const struct attribute_group **msi_irq_groups;
- struct attribute **msi_attrs, *msi_attr;
- struct device_attribute *msi_dev_attr;
- struct attribute_group *msi_irq_group;
- struct msi_desc *entry;
- int ret = -ENOMEM;
- int num_msi = 0;
- int count = 0;
+ struct device_attribute *attrs = desc->sysfs_attrs;
int i;

- /* Determine how many msi entries we have */
- msi_for_each_desc(entry, dev, MSI_DESC_ALL)
- num_msi += entry->nvec_used;
- if (!num_msi)
- return NULL;
+ if (!attrs)
+ return;

- /* Dynamically create the MSI attributes for the device */
- msi_attrs = kcalloc(num_msi + 1, sizeof(void *), GFP_KERNEL);
- if (!msi_attrs)
- return ERR_PTR(-ENOMEM);
-
- msi_for_each_desc(entry, dev, MSI_DESC_ALL) {
- for (i = 0; i < entry->nvec_used; i++) {
- msi_dev_attr = kzalloc(sizeof(*msi_dev_attr), GFP_KERNEL);
- if (!msi_dev_attr)
- goto error_attrs;
- msi_attrs[count] = &msi_dev_attr->attr;
-
- sysfs_attr_init(&msi_dev_attr->attr);
- msi_dev_attr->attr.name = kasprintf(GFP_KERNEL, "%d",
- entry->irq + i);
- if (!msi_dev_attr->attr.name)
- goto error_attrs;
- msi_dev_attr->attr.mode = 0444;
- msi_dev_attr->show = msi_mode_show;
- ++count;
- }
+ desc->sysfs_attrs = NULL;
+ for (i = 0; i < desc->nvec_used; i++) {
+ if (attrs[i].show)
+ sysfs_remove_file_from_group(&dev->kobj, &attrs[i].attr, msi_irqs_group.name);
+ kfree(attrs[i].attr.name);
}
+ kfree(attrs);
+}

- msi_irq_group = kzalloc(sizeof(*msi_irq_group), GFP_KERNEL);
- if (!msi_irq_group)
- goto error_attrs;
- msi_irq_group->name = "msi_irqs";
- msi_irq_group->attrs = msi_attrs;
-
- msi_irq_groups = kcalloc(2, sizeof(void *), GFP_KERNEL);
- if (!msi_irq_groups)
- goto error_irq_group;
- msi_irq_groups[0] = msi_irq_group;
+static int msi_sysfs_populate_desc(struct device *dev, struct msi_desc *desc)
+{
+ struct device_attribute *attrs;
+ int ret, i;

- ret = sysfs_create_groups(&dev->kobj, msi_irq_groups);
- if (ret)
- goto error_irq_groups;
+ attrs = kcalloc(desc->nvec_used, sizeof(*attrs), GFP_KERNEL);
+ if (!attrs)
+ return -ENOMEM;
+
+ desc->sysfs_attrs = attrs;
+ for (i = 0; i < desc->nvec_used; i++) {
+ sysfs_attr_init(&attrs[i].attr);
+ attrs[i].attr.name = kasprintf(GFP_KERNEL, "%d", desc->irq + i);
+ if (!attrs[i].attr.name) {
+ ret = -ENOMEM;
+ goto fail;
+ }

- return msi_irq_groups;
+ attrs[i].attr.mode = 0444;
+ attrs[i].show = msi_mode_show;

-error_irq_groups:
- kfree(msi_irq_groups);
-error_irq_group:
- kfree(msi_irq_group);
-error_attrs:
- count = 0;
- msi_attr = msi_attrs[count];
- while (msi_attr) {
- msi_dev_attr = container_of(msi_attr, struct device_attribute, attr);
- kfree(msi_attr->name);
- kfree(msi_dev_attr);
- ++count;
- msi_attr = msi_attrs[count];
+ ret = sysfs_add_file_to_group(&dev->kobj, &attrs[i].attr, msi_irqs_group.name);
+ if (ret) {
+ attrs[i].show = NULL;
+ goto fail;
+ }
}
- kfree(msi_attrs);
- return ERR_PTR(ret);
+ return 0;
+
+fail:
+ msi_sysfs_remove_desc(dev, desc);
+ return ret;
}

+#ifdef CONFIG_PCI_MSI_ARCH_FALLBACK
/**
* msi_device_populate_sysfs - Populate msi_irqs sysfs entries for a device
* @dev: The device(PCI, platform etc) which will get sysfs entries
*/
int msi_device_populate_sysfs(struct device *dev)
{
- const struct attribute_group **group = msi_populate_sysfs(dev);
+ struct msi_desc *desc;
+ int ret;

- if (IS_ERR(group))
- return PTR_ERR(group);
- dev->msi.data->attrs = group;
+ msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
+ if (desc->sysfs_attrs)
+ continue;
+ ret = msi_sysfs_populate_desc(dev, desc);
+ if (ret)
+ return ret;
+ }
return 0;
}

@@ -505,28 +504,17 @@ int msi_device_populate_sysfs(struct dev
*/
void msi_device_destroy_sysfs(struct device *dev)
{
- const struct attribute_group **msi_irq_groups = dev->msi.data->attrs;
- struct device_attribute *dev_attr;
- struct attribute **msi_attrs;
- int count = 0;
-
- dev->msi.data->attrs = NULL;
- if (!msi_irq_groups)
- return;
+ struct msi_desc *desc;

- sysfs_remove_groups(&dev->kobj, msi_irq_groups);
- msi_attrs = msi_irq_groups[0]->attrs;
- while (msi_attrs[count]) {
- dev_attr = container_of(msi_attrs[count], struct device_attribute, attr);
- kfree(dev_attr->attr.name);
- kfree(dev_attr);
- ++count;
- }
- kfree(msi_attrs);
- kfree(msi_irq_groups[0]);
- kfree(msi_irq_groups);
+ msi_for_each_desc(desc, dev, MSI_DESC_ALL)
+ msi_sysfs_remove_desc(dev, desc);
}
-#endif
+#endif /* CONFIG_PCI_MSI_ARCH_FALLBACK */
+#else /* CONFIG_SYSFS */
+static inline int msi_sysfs_create_group(struct device *dev) { return 0; }
+static inline int msi_sysfs_populate_desc(struct device *dev, struct msi_desc *desc) { return 0; }
+static inline void msi_sysfs_remove_desc(struct device *dev, struct msi_desc *desc) { }
+#endif /* !CONFIG_SYSFS */

#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
static inline void irq_chip_write_msi_msg(struct irq_data *data,
@@ -959,6 +947,12 @@ int __msi_domain_alloc_irqs(struct irq_d
ret = msi_init_virq(domain, virq + i, vflags);
if (ret)
return ret;
+
+ if (info->flags & MSI_FLAG_DEV_SYSFS) {
+ ret = msi_sysfs_populate_desc(dev, desc);
+ if (ret)
+ return ret;
+ }
}
allocated++;
}
@@ -1003,18 +997,7 @@ int msi_domain_alloc_irqs_descs_locked(s

ret = ops->domain_alloc_irqs(domain, dev, nvec);
if (ret)
- goto cleanup;
-
- if (!(info->flags & MSI_FLAG_DEV_SYSFS))
- return 0;
-
- ret = msi_device_populate_sysfs(dev);
- if (ret)
- goto cleanup;
- return 0;
-
-cleanup:
- msi_domain_free_irqs_descs_locked(domain, dev);
+ msi_domain_free_irqs_descs_locked(domain, dev);
return ret;
}

@@ -1039,6 +1022,7 @@ int msi_domain_alloc_irqs(struct irq_dom

void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
{
+ struct msi_domain_info *info = domain->host_data;
struct irq_data *irqd;
struct msi_desc *desc;
int i;
@@ -1053,6 +1037,8 @@ void __msi_domain_free_irqs(struct irq_d
}

irq_domain_free_irqs(desc->irq, desc->nvec_used);
+ if (info->flags & MSI_FLAG_DEV_SYSFS)
+ msi_sysfs_remove_desc(dev, desc);
desc->irq = 0;
}
}
@@ -1081,8 +1067,6 @@ void msi_domain_free_irqs_descs_locked(s

lockdep_assert_held(&dev->msi.data->mutex);

- if (info->flags & MSI_FLAG_DEV_SYSFS)
- msi_device_destroy_sysfs(dev);
ops->domain_free_irqs(domain, dev);
msi_domain_free_msi_descs(info, dev);
}

Thomas Gleixner

Nov 26, 2021, 8:24:21 PM
to LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
The current linked list storage for MSI descriptors is suboptimal in
several ways:

1) Looking up a MSI descriptor requires an O(n) list walk in the worst case

2) The upcoming support of runtime expansion of MSI-X vectors would need
to do a full list walk to figure out whether a particular index is
already associated.

3) Runtime expansion of sparse allocations is even more complex as the
current implementation assumes an ordered list (increasing MSI index).

Use an xarray which solves all of the above problems nicely.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/msi.h | 19 ++---
kernel/irq/msi.c | 188 ++++++++++++++++++++++------------------------------
2 files changed, 90 insertions(+), 117 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -17,6 +17,7 @@
*/

#include <linux/spinlock.h>
+#include <linux/xarray.h>
#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/bits.h>
@@ -122,7 +123,6 @@ struct pci_msi_desc {

/**
* struct msi_desc - Descriptor structure for MSI based interrupts
- * @list: List head for management
* @irq: The base interrupt number
* @nvec_used: The number of vectors used
* @dev: Pointer to the device which uses this descriptor
@@ -139,7 +139,6 @@ struct pci_msi_desc {
*/
struct msi_desc {
/* Shared device/bus type independent data */
- struct list_head list;
unsigned int irq;
unsigned int nvec_used;
struct device *dev;
@@ -177,20 +176,20 @@ enum msi_desc_filter {
* @properties: MSI properties which are interesting to drivers
* @num_descs: The number of allocated MSI descriptors for the device
* @platform_data: Platform-MSI specific data
- * @list: List of MSI descriptors associated to the device
- * @mutex: Mutex protecting the MSI list
- * @__next: Cached pointer to the next entry for iterators
- * @__filter: Cached descriptor filter
+ * @mutex: Mutex protecting the MSI descriptor store
+ * @store: Xarray for storing MSI descriptor pointers
+ * @__iter_idx: Index to search the next entry for iterators
+ * @__iter_filter: Cached descriptor filter
*/
struct msi_device_data {
raw_spinlock_t lock;
unsigned long properties;
unsigned int num_descs;
struct platform_msi_priv_data *platform_data;
- struct list_head list;
struct mutex mutex;
- struct msi_desc *__next;
- enum msi_desc_filter __filter;
+ struct xarray store;
+ unsigned long __iter_idx;
+ enum msi_desc_filter __iter_filter;
};

int msi_setup_device_data(struct device *dev);
@@ -266,7 +265,7 @@ static inline struct msi_desc *msi_first
* @dev: struct device pointer - device to iterate
* @filter: Filter for descriptor selection
*
- * See msi_for_each_desc_from()for further information.
+ * See msi_for_each_desc_from() for further information.
*/
#define msi_for_each_desc(desc, dev, filter) \
msi_for_each_desc_from(desc, dev, filter, 0)
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -20,7 +20,6 @@
#include "internals.h"

static inline int msi_sysfs_create_group(struct device *dev);
-#define dev_to_msi_list(dev) (&(dev)->msi.data->list)

/**
* msi_alloc_desc - Allocate an initialized msi_desc
@@ -41,7 +40,6 @@ static struct msi_desc *msi_alloc_desc(s
if (!desc)
return NULL;

- INIT_LIST_HEAD(&desc->list);
desc->dev = dev;
desc->nvec_used = nvec;
if (affinity) {
@@ -60,6 +58,19 @@ static void msi_free_desc(struct msi_des
kfree(desc);
}

+static int msi_insert_desc(struct msi_device_data *md, struct msi_desc *desc, unsigned int index)
+{
+ int ret;
+
+ desc->msi_index = index;
+ ret = xa_insert(&md->store, index, desc, GFP_KERNEL);
+ if (!ret)
+ md->num_descs++;
+ else
+ msi_free_desc(desc);
+ return ret;
+}
+
/**
* msi_add_msi_desc - Allocate and initialize a MSI descriptor
* @dev: Pointer to the device for which the descriptor is allocated
@@ -77,13 +88,9 @@ int msi_add_msi_desc(struct device *dev,
if (!desc)
return -ENOMEM;

- /* Copy the MSI index and type specific data to the new descriptor. */
- desc->msi_index = init_desc->msi_index;
+ /* Copy type specific data to the new descriptor. */
desc->pci = init_desc->pci;
-
- list_add_tail(&desc->list, &dev->msi.data->list);
- dev->msi.data->num_descs++;
- return 0;
+ return msi_insert_desc(dev->msi.data, desc, init_desc->msi_index);
}

/**
@@ -96,29 +103,41 @@ int msi_add_msi_desc(struct device *dev,
*/
static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
{
- struct msi_desc *desc, *tmp;
- LIST_HEAD(list);
- unsigned int i;
+ struct msi_desc *desc;
+ unsigned long i;
+ int ret;

lockdep_assert_held(&dev->msi.data->mutex);

for (i = 0; i < ndesc; i++) {
desc = msi_alloc_desc(dev, 1, NULL);
if (!desc)
+ goto fail_mem;
+ ret = msi_insert_desc(dev->msi.data, desc, index + i);
+ if (ret)
goto fail;
- desc->msi_index = index + i;
- list_add_tail(&desc->list, &list);
}
- list_splice_tail(&list, &dev->msi.data->list);
- dev->msi.data->num_descs += ndesc;
return 0;

+fail_mem:
+ ret = -ENOMEM;
fail:
- list_for_each_entry_safe(desc, tmp, &list, list) {
- list_del(&desc->list);
- msi_free_desc(desc);
+ msi_free_msi_descs_range(dev, MSI_DESC_NOTASSOCIATED, index, ndesc);
+ return ret;
+}
+
+static bool msi_desc_match(struct msi_desc *desc, enum msi_desc_filter filter)
+{
+ switch (filter) {
+ case MSI_DESC_ALL:
+ return true;
+ case MSI_DESC_NOTASSOCIATED:
+ return !desc->irq;
+ case MSI_DESC_ASSOCIATED:
+ return !!desc->irq;
}
- return -ENOMEM;
+ WARN_ON_ONCE(1);
+ return false;
}

/**
@@ -132,19 +151,16 @@ void msi_free_msi_descs_range(struct dev
unsigned int base_index, unsigned int ndesc)
{
struct msi_desc *desc;
+ unsigned long idx;

lockdep_assert_held(&dev->msi.data->mutex);

- msi_for_each_desc(desc, dev, filter) {
- /*
- * Stupid for now to handle MSI device domain until the
- * storage is switched over to an xarray.
- */
- if (desc->msi_index < base_index || desc->msi_index >= base_index + ndesc)
- continue;
- list_del(&desc->list);
- msi_free_desc(desc);
- dev->msi.data->num_descs--;
+ xa_for_each_range(&dev->msi.data->store, idx, desc, base_index, base_index + ndesc - 1) {
+ if (msi_desc_match(desc, filter)) {
+ xa_erase(&dev->msi.data->store, idx);
+ msi_free_desc(desc);
+ dev->msi.data->num_descs--;
+ }
}
}

@@ -192,7 +208,8 @@ static void msi_device_data_release(stru
{
struct msi_device_data *md = res;

- WARN_ON_ONCE(!list_empty(&md->list));
+ WARN_ON_ONCE(!xa_empty(&md->store));
+ xa_destroy(&md->store);
dev->msi.data = NULL;
}

@@ -225,7 +242,7 @@ int msi_setup_device_data(struct device
}

raw_spin_lock_init(&md->lock);
- INIT_LIST_HEAD(&md->list);
+ xa_init(&md->store);
mutex_init(&md->mutex);
dev->msi.data = md;
devres_add(dev, md);
@@ -252,38 +269,21 @@ void msi_unlock_descs(struct device *dev
{
if (WARN_ON_ONCE(!dev->msi.data))
return;
- /* Clear the next pointer which was cached by the iterator */
- dev->msi.data->__next = NULL;
+ /* Invalidate the index which was cached by the iterator */
+ dev->msi.data->__iter_idx = ULONG_MAX;
mutex_unlock(&dev->msi.data->mutex);
}
EXPORT_SYMBOL_GPL(msi_unlock_descs);

-static bool msi_desc_match(struct msi_desc *desc, enum msi_desc_filter filter)
-{
- switch (filter) {
- case MSI_DESC_ALL:
- return true;
- case MSI_DESC_NOTASSOCIATED:
- return !desc->irq;
- case MSI_DESC_ASSOCIATED:
- return !!desc->irq;
- }
- WARN_ON_ONCE(1);
- return false;
-}
-
-static struct msi_desc *msi_find_first_desc(struct device *dev, enum msi_desc_filter filter,
- unsigned int base_index)
+static struct msi_desc *msi_find_desc(struct msi_device_data *md)
{
struct msi_desc *desc;

- list_for_each_entry(desc, dev_to_msi_list(dev), list) {
- if (desc->msi_index < base_index)
- continue;
- if (msi_desc_match(desc, filter))
- return desc;
+ xa_for_each_start(&md->store, md->__iter_idx, desc, md->__iter_idx) {
+ if (msi_desc_match(desc, md->__iter_filter))
+ break;
}
- return NULL;
+ return desc;
}

/**
@@ -301,43 +301,25 @@ static struct msi_desc *msi_find_first_d
struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter,
unsigned int base_index)
{
- struct msi_desc *desc;
+ struct msi_device_data *md = dev->msi.data;

- if (WARN_ON_ONCE(!dev->msi.data))
+ if (WARN_ON_ONCE(!md))
return NULL;

- lockdep_assert_held(&dev->msi.data->mutex);
+ lockdep_assert_held(&md->mutex);

- /* Invalidate a previous invocation within the same lock section */
- dev->msi.data->__next = NULL;
-
- desc = msi_find_first_desc(dev, filter, base_index);
- if (desc) {
- dev->msi.data->__next = list_next_entry(desc, list);
- dev->msi.data->__filter = filter;
- }
- return desc;
+ md->__iter_filter = filter;
+ md->__iter_idx = base_index;
+ return msi_find_desc(md);
}
EXPORT_SYMBOL_GPL(__msi_first_desc);

-static struct msi_desc *__msi_next_desc(struct device *dev, enum msi_desc_filter filter,
- struct msi_desc *from)
-{
- struct msi_desc *desc = from;
-
- list_for_each_entry_from(desc, dev_to_msi_list(dev), list) {
- if (msi_desc_match(desc, filter))
- return desc;
- }
- return NULL;
-}
-
/**
* msi_next_desc - Get the next MSI descriptor of a device
* @dev: Device to operate on
*
 * The first invocation of msi_next_desc() has to be preceded by a
- * successful incovation of __msi_first_desc(). Consecutive invocations are
+ * successful invocation of __msi_first_desc(). Consecutive invocations are
* only valid if the previous one was successful. All these operations have
* to be done within the same MSI mutex held region.
*
@@ -346,20 +328,18 @@ static struct msi_desc *__msi_next_desc(
*/
struct msi_desc *msi_next_desc(struct device *dev)
{
- struct msi_device_data *data = dev->msi.data;
- struct msi_desc *desc;
+ struct msi_device_data *md = dev->msi.data;

- if (WARN_ON_ONCE(!data))
+ if (WARN_ON_ONCE(!md))
return NULL;

- lockdep_assert_held(&data->mutex);
+ lockdep_assert_held(&md->mutex);

- if (!data->__next)
+ if (md->__iter_idx == ULONG_MAX)
return NULL;

- desc = __msi_next_desc(dev, data->__filter, data->__next);
- dev->msi.data->__next = desc ? list_next_entry(desc, list) : NULL;
- return desc;
+ md->__iter_idx++;
+ return msi_find_desc(md);
}
EXPORT_SYMBOL_GPL(msi_next_desc);

@@ -384,21 +364,18 @@ int __msi_get_virq(struct device *dev, u
pcimsi = msi_device_has_property(dev, MSI_PROP_PCI_MSI);

msi_lock_descs(dev);
- msi_for_each_desc_from(desc, dev, MSI_DESC_ASSOCIATED, index) {
- /* PCI-MSI has only one descriptor for multiple interrupts. */
- if (pcimsi) {
- if (index < desc->nvec_used)
- ret = desc->irq + index;
- break;
- }
-
+ desc = xa_load(&dev->msi.data->store, pcimsi ? 0 : index);
+ if (desc && desc->irq) {
/*
+ * PCI-MSI has only one descriptor for multiple interrupts.
* PCI-MSIX and platform MSI use a descriptor per
* interrupt.
*/
- if (desc->msi_index == index) {
+ if (pcimsi) {
+ if (index < desc->nvec_used)
+ ret = desc->irq + index;
+ } else {
ret = desc->irq;
- break;
}
}
msi_unlock_descs(dev);
@@ -779,17 +756,13 @@ int msi_domain_populate_irqs(struct irq_
int ret, virq;

msi_lock_descs(dev);
- for (virq = virq_base; virq < virq_base + nvec; virq++) {
- desc = msi_alloc_desc(dev, 1, NULL);
- if (!desc) {
- ret = -ENOMEM;
- goto fail;
- }
+ ret = msi_add_simple_msi_descs(dev, virq_base, nvec);
+ if (ret)
+ goto unlock;

- desc->msi_index = virq;
+ for (virq = virq_base; virq < virq_base + nvec; virq++) {
+ desc = xa_load(&dev->msi.data->store, virq);
desc->irq = virq;
- list_add_tail(&desc->list, &dev->msi.data->list);
- dev->msi.data->num_descs++;

ops->set_desc(arg, desc);
ret = irq_domain_alloc_irqs_hierarchy(domain, virq, 1, arg);
@@ -805,6 +778,7 @@ int msi_domain_populate_irqs(struct irq_
for (--virq; virq >= virq_base; virq--)
irq_domain_free_irqs_common(domain, virq, 1);
msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, nvec);
+unlock:
msi_unlock_descs(dev);
return ret;
}

Greg Kroah-Hartman

Nov 27, 2021, 7:19:13 AM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, Nov 27, 2021 at 02:22:38AM +0100, Thomas Gleixner wrote:
> Signed-off-by: Thomas Gleixner <tg...@linutronix.de>

No changelog?

Anyway, why do we care about the number of descriptors?


> ---
> include/linux/msi.h | 3 +++
> kernel/irq/msi.c | 18 ++++++++++++++++++
> 2 files changed, 21 insertions(+)
>
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -156,6 +156,7 @@ enum msi_desc_filter {
> * msi_device_data - MSI per device data
> * @lock: Spinlock to protect register access
> * @properties: MSI properties which are interesting to drivers
> + * @num_descs: The number of allocated MSI descriptors for the device
> * @attrs: Pointer to the sysfs attribute group
> * @platform_data: Platform-MSI specific data
> * @list: List of MSI descriptors associated to the device
> @@ -166,6 +167,7 @@ enum msi_desc_filter {
> struct msi_device_data {
> raw_spinlock_t lock;
> unsigned long properties;
> + unsigned int num_descs;
> const struct attribute_group **attrs;
> struct platform_msi_priv_data *platform_data;
> struct list_head list;
> @@ -208,6 +210,7 @@ static inline unsigned int msi_get_virq(
>
> void msi_lock_descs(struct device *dev);
> void msi_unlock_descs(struct device *dev);
> +unsigned int msi_device_num_descs(struct device *dev);
>
> struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
> struct msi_desc *msi_next_desc(struct device *dev);
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -82,6 +82,7 @@ int msi_add_msi_desc(struct device *dev,
> desc->pci = init_desc->pci;
>
> list_add_tail(&desc->list, &dev->msi.data->list);
> + dev->msi.data->num_descs++;
> return 0;
> }
>
> @@ -109,6 +110,7 @@ int msi_add_simple_msi_descs(struct devi
> list_add_tail(&desc->list, &list);
> }
> list_splice_tail(&list, &dev->msi.data->list);
> + dev->msi.data->num_descs += ndesc;
> return 0;
>
> fail:
> @@ -142,6 +144,7 @@ void msi_free_msi_descs_range(struct dev
> continue;
> list_del(&desc->list);
> free_msi_entry(desc);
> + dev->msi.data->num_descs--;
> }
> }
>
> @@ -157,6 +160,21 @@ bool msi_device_has_property(struct devi
> return !!(dev->msi.data->properties & prop);
> }
>
> +/**
> + * msi_device_num_descs - Query the number of allocated MSI descriptors of a device
> + * @dev: The device to read from
> + *
> + * Note: This is a lockless snapshot of msi_device_data::num_descs
> + *
> + * Returns the number of MSI descriptors which are allocated for @dev
> + */
> +unsigned int msi_device_num_descs(struct device *dev)
> +{
> + if (dev->msi.data)
> + return dev->msi.data->num_descs;

As this number can change after it is read, what will callers do with
it?

thanks,

greg k-h

Greg Kroah-Hartman

Nov 27, 2021, 7:19:46 AM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, Nov 27, 2021 at 02:22:29AM +0100, Thomas Gleixner wrote:
> It's only required when MSI is in use.
>
> Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
> ---
> drivers/base/core.c | 3 ---
> include/linux/device.h | 4 ----
> include/linux/msi.h | 4 +++-
> kernel/irq/msi.c | 5 ++++-
> 4 files changed, 7 insertions(+), 9 deletions(-)

Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>

Greg Kroah-Hartman

Nov 27, 2021, 7:32:38 AM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, Nov 27, 2021 at 02:23:15AM +0100, Thomas Gleixner wrote:
> The sysfs handling for MSI is a convoluted maze and it is in the way of
> supporting dynamic expansion of the MSI-X vectors because it only supports
> a one off bulk population/free of the sysfs entries.
>
> Change it to do:
>
> 1) Creating an empty sysfs attribute group when msi_device_data is
> allocated
>
> 2) Populate the entries when the MSI descriptor is initialized

How much later does this happen? Can it happen while the device has a
driver bound to it?

Much nicer, but you changed the lifetime rules of when these attributes
will be removed, is that ok?

I still worry that these attributes show up "after" the device is
registered with the driver core, but hey, it's no worse than it
currently is, so that's not caused by this patch series...

That's a cute hack, but should be documented somewhere in the code (that
if there is no show function, that means no attribute was registered
here).

If you add a comment for this (either here or when you register the
attribute), feel free to add:

Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>

Greg Kroah-Hartman

Nov 27, 2021, 7:33:29 AM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, Nov 27, 2021 at 02:23:17AM +0100, Thomas Gleixner wrote:
> The current linked list storage for MSI descriptors is suboptimal in
> several ways:
>
> 1) Looking up a MSI descriptor requires an O(n) list walk in the worst case
>
> 2) The upcoming support of runtime expansion of MSI-X vectors would need
> to do a full list walk to figure out whether a particular index is
> already associated.
>
> 3) Runtime expansion of sparse allocations is even more complex as the
> current implementation assumes an ordered list (increasing MSI index).
>
> Use an xarray which solves all of the above problems nicely.
>
> Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
> ---
> include/linux/msi.h | 19 ++---
> kernel/irq/msi.c | 188 ++++++++++++++++++++++------------------------------
> 2 files changed, 90 insertions(+), 117 deletions(-)

Much simpler code too, nice!

Reviewed-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>

Thomas Gleixner

Nov 27, 2021, 2:22:28 PM
to Greg Kroah-Hartman, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, Nov 27 2021 at 13:19, Greg Kroah-Hartman wrote:
> On Sat, Nov 27, 2021 at 02:22:38AM +0100, Thomas Gleixner wrote:
>> Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
>
> No changelog?

Bah. This one should not be there at all.

> Anyway, why do we care about the number of decriptors?
>> +/**
>> + * msi_device_num_descs - Query the number of allocated MSI descriptors of a device
>> + * @dev: The device to read from
>> + *
>> + * Note: This is a lockless snapshot of msi_device_data::num_descs
>> + *
>> + * Returns the number of MSI descriptors which are allocated for @dev
>> + */
>> +unsigned int msi_device_num_descs(struct device *dev)
>> +{
>> + if (dev->msi.data)
>> + return dev->msi.data->num_descs;
>
> As this number can change after it is read, what will callers do with
> it?

I wanted to get rid of this, but then forgot. Getting old.

Thanks,

tglx

Thomas Gleixner

Nov 27, 2021, 2:31:39 PM
to Greg Kroah-Hartman, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, Nov 27 2021 at 13:32, Greg Kroah-Hartman wrote:
> On Sat, Nov 27, 2021 at 02:23:15AM +0100, Thomas Gleixner wrote:
>> The sysfs handling for MSI is a convoluted maze and it is in the way of
>> supporting dynamic expansion of the MSI-X vectors because it only supports
>> a one off bulk population/free of the sysfs entries.
>>
>> Change it to do:
>>
>> 1) Creating an empty sysfs attribute group when msi_device_data is
>> allocated
>>
>> 2) Populate the entries when the MSI descriptor is initialized
>
> How much later does this happen? Can it happen while the device has a
> driver bound to it?

That's not later than before. It's when the driver initializes the
MSI[X] interrupts, which usually happens in the probe() function.

The difference is that the group (i.e. the directory) is created
slightly earlier.

>> +
>> +static inline int msi_sysfs_create_group(struct device *dev)
>> +{
>> + return devm_device_add_group(dev, &msi_irqs_group);
>
> Much nicer, but you changed the lifetime rules of when these attributes
> will be removed, is that ok?

The msi entries are removed at the same place as they are removed in the
current mainline code, i.e. when the device driver shuts the device
down and disables MSI[X], which happens usually during remove()

What's different now is that the empty group stays around a bit
longer. I don't see how that matters.

> I still worry that these attributes show up "after" the device is
> registered with the driver core, but hey, it's no worse than it
> currently is, so that's not caused by this patch series...

Does that registration happen before or after driver->probe()?

>> - }
>> + desc->sysfs_attrs = NULL;
>> + for (i = 0; i < desc->nvec_used; i++) {
>> + if (attrs[i].show)
>> + sysfs_remove_file_from_group(&dev->kobj, &attrs[i].attr, msi_irqs_group.name);
>> + kfree(attrs[i].attr.name);
>
> That's a cute hack, but should be documented somewhere in the code (that
> if there is no show function, that means no attribute was registered
> here).
>
> If you add a comment for this (either here or when you register the
> attribute), feel free to add:

Will do.

Thanks,

tglx

Thomas Gleixner

Nov 27, 2021, 2:45:22 PM
to Greg Kroah-Hartman, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, Nov 27 2021 at 20:22, Thomas Gleixner wrote:

> On Sat, Nov 27 2021 at 13:19, Greg Kroah-Hartman wrote:
>> On Sat, Nov 27, 2021 at 02:22:38AM +0100, Thomas Gleixner wrote:
>>> Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
>>
>> No changelog?
>
> Bah. This one should not be there at all.
>
>> Anyway, why do we care about the number of descriptors?

The last part of this really cares about it for the dynamic extension
part, but that's core code which looks at the counter under the lock.
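A minimal sketch of that pattern, assuming the locking API from this
series ('wanted' and the calling context are invented for illustration):

	/* Core code only consults num_descs with the mutex held */
	msi_lock_descs(dev);
	if (dev->msi.data->num_descs < wanted)
		ret = msi_add_simple_msi_descs(dev, dev->msi.data->num_descs,
					       wanted - dev->msi.data->num_descs);
	msi_unlock_descs(dev);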

Thanks,

tglx

Jason Gunthorpe

Nov 27, 2021, 8:00:41 PM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, Nov 27, 2021 at 02:22:33AM +0100, Thomas Gleixner wrote:
> In preparation for dynamic handling of MSI-X interrupts provide a new set
> of MSI descriptor accessor functions and iterators. They are beneficial per
> se as they allow cleaning up quite a bit of code in various MSI domain
> implementations.
>
> Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
> include/linux/msi.h | 58 ++++++++++++++++++++++++++++
> kernel/irq/msi.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 165 insertions(+)
>
> +++ b/include/linux/msi.h
> @@ -140,6 +140,18 @@ struct msi_desc {
> struct pci_msi_desc pci;
> };
>
> +/*
> + * Filter values for the MSI descriptor iterators and accessor functions.
> + */
> +enum msi_desc_filter {
> + /* All descriptors */
> + MSI_DESC_ALL,
> + /* Descriptors which have no interrupt associated */
> + MSI_DESC_NOTASSOCIATED,
> + /* Descriptors which have an interrupt associated */
> + MSI_DESC_ASSOCIATED,
> +};
> +
> /**
> * msi_device_data - MSI per device data
> * @lock: Spinlock to protect register access
> @@ -148,6 +160,8 @@ struct msi_desc {
> * @platform_data: Platform-MSI specific data
> * @list: List of MSI descriptors associated to the device
> * @mutex: Mutex protecting the MSI list
> + * @__next: Cached pointer to the next entry for iterators
> + * @__filter: Cached descriptor filter
> */
> struct msi_device_data {
> raw_spinlock_t lock;
> @@ -156,6 +170,8 @@ struct msi_device_data {
> struct platform_msi_priv_data *platform_data;
> struct list_head list;
> struct mutex mutex;
> + struct msi_desc *__next;
> + enum msi_desc_filter __filter;
> };
>
> int msi_setup_device_data(struct device *dev);
> @@ -193,6 +209,48 @@ static inline unsigned int msi_get_virq(
> void msi_lock_descs(struct device *dev);
> void msi_unlock_descs(struct device *dev);
>
> +struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
> +struct msi_desc *msi_next_desc(struct device *dev);
> +
> +/**
> + * msi_first_desc - Get the first MSI descriptor associated to the device
> + * @dev: Device to search
> + */
> +static inline struct msi_desc *msi_first_desc(struct device *dev)
> +{
> + return __msi_first_desc(dev, MSI_DESC_ALL, 0);
> +}
> +
> +
> +/**
> + * msi_for_each_desc_from - Iterate the MSI descriptors from a given index
> + *
> + * @desc: struct msi_desc pointer used as iterator
> + * @dev: struct device pointer - device to iterate
> + * @filter: Filter for descriptor selection
> + * @base_index: MSI index to iterate from
> + *
> + * Notes:
> + * - The loop must be protected with a msi_lock_descs()/msi_unlock_descs()
> + * pair.
> + * - It is safe to remove a retrieved MSI descriptor in the loop.
> + */
> +#define msi_for_each_desc_from(desc, dev, filter, base_index) \
> + for ((desc) = __msi_first_desc((dev), (filter), (base_index)); (desc); \
> + (desc) = msi_next_desc((dev)))

Given this ends up as an xarray it feels really weird that there is a
hidden shared __next/__iter_idx instead of having the caller provide
the index storage as is normal for xa operations.

I understand why that isn't desirable at this patch where the storage
would have to be a list_head pointer, but still, seems like an odd
place to end up at the end of the series.

eg add index here unused and then the last patch uses it instead of
__iter_idx.
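For reference, a rough sketch of what caller-owned index storage could
look like once the store is an xarray (not what this series implements;
'filter' stands in for whatever the call site wants):

	struct msi_desc *desc;
	unsigned long index;

	xa_for_each(&dev->msi.data->store, index, desc) {
		if (!msi_desc_match(desc, filter))
			continue;
		/* no state cached in dev->msi.data, so loops can nest */
	}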

Also, I don't understand why filter was stored in the dev and not
passed into msi_next_desc() in the macro here?

Jason

Greg Kroah-Hartman

Nov 28, 2021, 6:07:12 AM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, Nov 27, 2021 at 08:31:37PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 27 2021 at 13:32, Greg Kroah-Hartman wrote:
> > On Sat, Nov 27, 2021 at 02:23:15AM +0100, Thomas Gleixner wrote:
> >> The sysfs handling for MSI is a convoluted maze and it is in the way of
> >> supporting dynamic expansion of the MSI-X vectors because it only supports
> >> a one off bulk population/free of the sysfs entries.
> >>
> >> Change it to do:
> >>
> >> 1) Creating an empty sysfs attribute group when msi_device_data is
> >> allocated
> >>
> >> 2) Populate the entries when the MSI descriptor is initialized
> >
> > How much later does this happen? Can it happen while the device has a
> > driver bound to it?
>
> That's not later than before. It's when the driver initializes the
> MSI[X] interrupts, which usually happens in the probe() function.
>
> The difference is that the group, (i.e.) directory is created slightly
> earlier.

Ok, but that still happens when probe() is called for the driver, right?

> >> +
> >> +static inline int msi_sysfs_create_group(struct device *dev)
> >> +{
> >> + return devm_device_add_group(dev, &msi_irqs_group);
> >
> > Much nicer, but you changed the lifetime rules of when these attributes
> > will be removed, is that ok?
>
> The msi entries are removed at the same place as they are removed in the
> current mainline code, i.e. when the device driver shuts the device
> down and disables MSI[X], which happens usually during remove()
>
> What's different now is that the empty group stays around a bit
> longer. I don't see how that matters.

How much longer does it stick around?

What happens if this sequence happens:
- probe()
- disconnect()
- probe()
with the same device (i.e. the device is not removed from the system)?

Which can happen as userspace can trigger disconnect() or even worse, if
the driver is unloaded and then loaded again? Will the second call to
create this directory fail as it is not cleaned up yet?

I can never remember if devm_*() stuff sticks around for the device
lifecycle, or for the driver/device lifecycle, which is one big reason
why I don't like that api...

> > I still worry that these attributes show up "after" the device is
> > registered with the driver core, but hey, it's no worse than it
> > currently is, so that's not caused by this patch series...
>
> Happens that register before or after driver->probe()?

During probe is a bit too late, but we can handle that as we are used to
it. If it happens after probe() succeeds, based on something else being
asked for in the driver (like the device being opened), then userspace
has no chance of ever noticing these attributes being added.

But again, this isn't new to your code series, so I wouldn't worry about
it. Obviously userspace tools do not care or really notice these
attributes at all otherwise the authors of them would have complained
a long time ago :)

So again, no real objection from me here, just meta-comments, except for
the above thing with the devm_* call to ensure that the
probe/disconnect/probe sequence will still work just as well as it does
today. Should be easy enough to test out by just unloading a module and
then loading it again with this patch series applied.

thanks,

greg k-h

Greg Kroah-Hartman

Nov 28, 2021, 6:07:46 AM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Ah, that should be documented well, as right now you are saying "this is
done lockless" in the comment :)

thanks,

greg k-h

Thomas Gleixner

Nov 28, 2021, 2:22:32 PM
to Jason Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
TBH, I didn't think about doing just that. OTOH, given the experience of
looking at the creative mess people create, this was probably also a
vain attempt to make it harder in the future.

> Also, I don't understand why filter was stored in the dev and not
> passed into msi_next_desc() in the macro here?

No real reason. I probably just stored it along with the rest. Lemme try
that index approach.

Thanks,

tglx

Thomas Gleixner

Nov 28, 2021, 2:23:17 PM
to Greg Kroah-Hartman, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
I already zapped the whole patch as the function is not required for the
core code.

Thanks,

tglx

Thomas Gleixner

Nov 28, 2021, 2:33:15 PM
to Greg Kroah-Hartman, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, Neil Horman
On Sun, Nov 28 2021 at 12:07, Greg Kroah-Hartman wrote:
> On Sat, Nov 27, 2021 at 08:31:37PM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 27 2021 at 13:32, Greg Kroah-Hartman wrote:
>> > On Sat, Nov 27, 2021 at 02:23:15AM +0100, Thomas Gleixner wrote:
>> >> The sysfs handling for MSI is a convoluted maze and it is in the way of
>> >> supporting dynamic expansion of the MSI-X vectors because it only supports
>> >> a one off bulk population/free of the sysfs entries.
>> >>
>> >> Change it to do:
>> >>
>> >> 1) Creating an empty sysfs attribute group when msi_device_data is
>> >> allocated
>> >>
>> >> 2) Populate the entries when the MSI descriptor is initialized
>> >
>> > How much later does this happen? Can it happen while the device has a
>> > driver bound to it?
>>
>> That's not later than before. It's when the driver initializes the
>> MSI[X] interrupts, which usually happens in the probe() function.
>>
>> The difference is that the group (i.e. the directory) is created
>> slightly earlier.
>
> Ok, but that still happens when probe() is called for the driver,
> right?

Yes.

>> >> +static inline int msi_sysfs_create_group(struct device *dev)
>> >> +{
>> >> + return devm_device_add_group(dev, &msi_irqs_group);
>> >
>> > Much nicer, but you changed the lifetime rules of when these attributes
>> > will be removed, is that ok?
>>
>> The msi entries are removed at the same place as they are removed in the
>> current mainline code, i.e. when the device driver shuts the device
>> down and disables MSI[X], which happens usually during remove()
>>
>> What's different now is that the empty group stays around a bit
>> longer. I don't see how that matters.
>
> How much longer does it stick around?
>
> What happens if this sequence happens:
> - probe()
> - disconnect()
> - probe()
> with the same device (i.e. the device is not removed from the system)?
>
> Which can happen as userspace can trigger disconnect() or even worse, if
> the driver is unloaded and then loaded again? Will the second call to
> create this directory fail as it is not cleaned up yet?
>
> I can never remember if devm_*() stuff sticks around for the device
> lifecycle, or for the driver/device lifecycle, which is one big reason
> why I don't like that api...

Driver lifecycle AFAICT.

>> > I still worry that these attributes show up "after" the device is
>> > registered with the driver core, but hey, it's no worse than it
>> > currently is, so that's not caused by this patch series...
>>
>> Does that registration happen before or after driver->probe()?
>
> During probe is a bit too late, but we can handle that as we are used to
> it. If it happens after probe() succeeds, based on something else being
> asked for in the driver (like the device being opened), then userspace
> has no chance of ever noticing these attributes being added.
>
> But again, this isn't new to your code series, so I wouldn't worry about
> it. Obviously userspace tools do not care or really notice these
> attributes at all otherwise the authors of them would have complained
> a long time ago :)

I have no idea how these attributes are used at all. Neil should know,
as he added them in the first place.

> So again, no real objection from me here, just meta-comments, except for
> the above thing with the devm_* call to ensure that the
> probe/disconnect/probe sequence will still work just as well as it does
> today. Should be easy enough to test out by just unloading a module and
> then loading it again with this patch series applied.

That works just fine. Tested that already before posting. After module
removal the directory is gone.

Thanks,

tglx

Thomas Gleixner

Nov 29, 2021, 4:26:15 AM
to Jason Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Jason,
After looking at all the call sites again, there is no real usage for
this local index variable.

If anything needs the index of a descriptor then it's available in the
descriptor itself. That won't change because the low level message write
code needs the index too and the only accessible storage there is
msi_desc.

So the "gain" would be to have a pointless 'unsigned long index;'
variable at all usage sites.

What for? The usage sites should not have to care about the storage
details of a facility they are using.

So it might look odd at first, but in the end it's convenient and does
not put any restrictions on changing the underlying mechanics.
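For illustration, a typical usage site under this API then boils down
to something like this (a sketch; the pr_debug() is made up):

	msi_lock_descs(dev);
	msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED)
		pr_debug("MSI index %u -> Linux irq %u\n",
			 desc->msi_index, desc->irq);
	msi_unlock_descs(dev);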

Thanks,

tglx

Niklas Schnelle

Nov 29, 2021, 5:31:18 AM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Sat, 2021-11-27 at 02:23 +0100, Thomas Gleixner wrote:
> Replace the about to vanish iterators and make use of the filtering.
>
> Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
> Cc: linux...@vger.kernel.org
> Cc: Heiko Carstens <h...@linux.ibm.com>
> Cc: Christian Borntraeger <bornt...@de.ibm.com>
> ---
> arch/s390/pci/pci_irq.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> --- a/arch/s390/pci/pci_irq.c
> +++ b/arch/s390/pci/pci_irq.c
> @@ -303,7 +303,7 @@ int arch_setup_msi_irqs(struct pci_dev *
>
> /* Request MSI interrupts */
> hwirq = bit;
> - for_each_pci_msi_entry(msi, pdev) {
> + msi_for_each_desc(msi, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
> rc = -EIO;
> if (hwirq - bit >= msi_vecs)
> break;
> @@ -362,9 +362,7 @@ void arch_teardown_msi_irqs(struct pci_d
> return;
>
> /* Release MSI interrupts */
> - for_each_pci_msi_entry(msi, pdev) {
> - if (!msi->irq)
> - continue;
> + msi_for_each_desc(msi, &pdev->dev, MSI_DESC_ASSOCIATED) {
> irq_set_msi_desc(msi->irq, NULL);
> irq_free_desc(msi->irq);
> msi->msg.address_lo = 0;
>

Hi Thomas,

while the change looks good to me, I ran into some trouble trying to
test it. I tried with the git repository you linked in the cover
letter:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-3

But with that I get the following linker error on s390:

s390x-11.2.0-ld: drivers/pci/msi/legacy.o: in function `pci_msi_legacy_setup_msi_irqs':
/home/nschnelle/mainline/drivers/pci/msi/legacy.c:72: undefined reference to `msi_device_populate_sysfs'
s390x-11.2.0-ld: drivers/pci/msi/legacy.o: in function `pci_msi_legacy_teardown_msi_irqs':
/home/nschnelle/mainline/drivers/pci/msi/legacy.c:78: undefined reference to `msi_device_destroy_sysfs'
make: *** [Makefile:1161: vmlinux] Error 1

This is caused by a misspelling of CONFIG_PCI_MSI_ARCH_FALLBACKS
(missing the final S) in kernel/irq/msi.c. With that fixed everything
builds and MSI IRQs work fine. So with that fixed you have my

Acked-by: Niklas Schnelle <schn...@linux.ibm.com>
Tested-by: Niklas Schnelle <schn...@linux.ibm.com>

Best regards,
Niklas

Thomas Gleixner

Nov 29, 2021, 8:04:46 AM
to Niklas Schnelle, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Niklas,

On Mon, Nov 29 2021 at 11:31, Niklas Schnelle wrote:
> On Sat, 2021-11-27 at 02:23 +0100, Thomas Gleixner wrote:
>
> while the change looks good to me I ran into some trouble trying to
> test it. I tried with the git repository you linked in the cover
> letter:
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-3
>
> But with that I get the following linker error on s390:
>
> s390x-11.2.0-ld: drivers/pci/msi/legacy.o: in function `pci_msi_legacy_setup_msi_irqs':
> /home/nschnelle/mainline/drivers/pci/msi/legacy.c:72: undefined reference to `msi_device_populate_sysfs'
> s390x-11.2.0-ld: drivers/pci/msi/legacy.o: in function `pci_msi_legacy_teardown_msi_irqs':
> /home/nschnelle/mainline/drivers/pci/msi/legacy.c:78: undefined reference to `msi_device_destroy_sysfs'
> make: *** [Makefile:1161: vmlinux] Error 1

Yes, that got reported before and I fixed it locally already.

> This is caused by a misspelling of CONFIG_PCI_MSI_ARCH_FALLBACKS
> (missing the final S) in kernel/irq/msi.c. With that fixed everything
> builds and MSI IRQs work fine. So with that fixed you have my
>
> Acked-by: Niklas Schnelle <schn...@linux.ibm.com>
> Tested-by: Niklas Schnelle <schn...@linux.ibm.com>

Thanks for testing and dealing with my ineptness.

tglx

Jason Gunthorpe

Nov 29, 2021, 9:01:15 AM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
On Mon, Nov 29, 2021 at 10:26:11AM +0100, Thomas Gleixner wrote:
> Jason,
>
> On Sun, Nov 28 2021 at 20:22, Thomas Gleixner wrote:
> > On Sat, Nov 27 2021 at 21:00, Jason Gunthorpe wrote:
> >> On Sat, Nov 27, 2021 at 02:22:33AM +0100, Thomas Gleixner wrote:
> >> I understand why that isn't desirable at this patch where the storage
> >> would have to be a list_head pointer, but still, seems like an odd
> >> place to end up at the end of the series.
> >>
> >> eg add index here unused and then the last patch uses it instead of
> >> __iter_idx.
> >
> > TBH, I didn't think about doing just that. OTOH, given the experience of
> > looking at the creative mess people create, this was probably also a
> > vain attempt to make it harder in the future.
> >
> >> Also, I don't understand why filter was stored in the dev and not
> >> passed into msi_next_desc() in the macro here?
> >
> > No real reason. I probably just stored it along with the rest. Lemme try
> > that index approach.
>
> After looking at all the call sites again, there is no real usage for
> this local index variable.
>
> If anything needs the index of a descriptor then it's available in the
> descriptor itself. That won't change because the low level message write
> code needs the index too and the only accessible storage there is
> msi_desc.

Oh, that makes it simpler, just use the current desc->index as the
input to the xa_for_each_start() and then there should be no need of
hidden state?

> What for? The usage sites should not have to care about the storage
> details of a facility they are using.

Generally for_each things shouldn't have hidden state that prevents
them from being nested. It is just an unexpected design pattern..

Jason

Thomas Gleixner

Nov 29, 2021, 9:46:02 AM
to Jason Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com
Jason,

On Mon, Nov 29 2021 at 10:01, Jason Gunthorpe wrote:
> On Mon, Nov 29, 2021 at 10:26:11AM +0100, Thomas Gleixner wrote:
>> After looking at all the call sites again, there is no real usage for
>> this local index variable.
>>
>> If anything needs the index of a descriptor then it's available in the
>> descriptor itself. That won't change because the low level message write
>> code needs the index too and the only accessible storage there is
>> msi_desc.
>
> Oh, that makes it simpler, just use the current desc->index as the
> input to the xa_for_each_start() and then there should be no need of
> hidden state?

That works for alloc, but on free that's going to end up badly.

>> What for? The usage sites should not have to care about the storage
>> details of a facility they are using.
>
> Generally for_each things shouldn't have hidden state that prevents
> them from being nested. It is just an unexpected design pattern..

I'm not seeing any sensible use case for:

msi_for_each_desc(dev)
        msi_for_each_desc(dev)

If that ever comes forth, I'm happy to debate this further :)

Thanks,

tglx

Logan Gunthorpe

Nov 29, 2021, 1:21:40 PM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger


On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
> Replace the about to vanish iterators, make use of the filtering and take
> the descriptor lock around the iteration.
>
> Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
> Cc: Jon Mason <jdm...@kudzu.us>
> Cc: Dave Jiang <dave....@intel.com>
> Cc: Allen Hubbe <all...@gmail.com>
> Cc: linu...@googlegroups.com

This patch looks good to me:

Reviewed-by: Logan Gunthorpe <log...@deltatee.com>

Thanks,

Logan

Thomas Gleixner

Nov 29, 2021, 3:51:13 PM
to Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger
Logan,

On Mon, Nov 29 2021 at 11:21, Logan Gunthorpe wrote:
> On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
>> Replace the about to vanish iterators, make use of the filtering and take
>> the descriptor lock around the iteration.
>
> This patch looks good to me:
>
> Reviewed-by: Logan Gunthorpe <log...@deltatee.com>

thanks for having a look at this. While I have your attention, I have a
question related to NTB.

The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
to allocate non-hardware backed MSI-X descriptors.

AFAIU these descriptors are not MSI-X descriptors in the regular sense
of PCI/MSI-X. They are allocated via the PCI/MSI mechanism, but they are
used somewhere in NTB and have nothing to do with the way the real MSI-X
interrupts of a device work, which explains why we have those
pci.msi_attrib.is_virtual checks all over the place.

I assume that there are other variants feeding into NTB which can handle
that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
in that code.

Could you please shed some light on the larger picture of this?

Thanks,

tglx

Logan Gunthorpe

Nov 29, 2021, 5:27:29 PM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger
Yes, of course. I'll try to explain:

The NTB code here is trying to create an MSI interrupt that is not
triggered by the PCI device itself but from a peer behind the
Non-Transparent Bridge (or, more carefully: from the CPU's perspective
the interrupt will come from the PCI device, but nothing in the PCI
device's firmware or hardware will have triggered the interrupt).

In most cases, the NTB code needs more interrupts than the hardware
actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
for: it allows the driver to request more interrupts than the hardware
advertises (ie. pci_msix_vec_count()). These extra interrupts are
created, but get flagged with msi_attrib.is_virtual which ensures
functions that program the MSI-X table don't try to write past the end
of the hardware's table.
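As a sketch (the vector counts here are made up), the allocation side
looks like a regular MSI-X allocation with the extra flag:

	/* Request up to 16 vectors even if the hardware table only
	 * provides, say, 4; the surplus is allocated as virtual and
	 * never written to the hardware MSI-X table. */
	nvec = pci_alloc_irq_vectors(pdev, 1, 16,
				     PCI_IRQ_MSIX | PCI_IRQ_VIRTUAL);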

The NTB code in drivers/ntb/msi.c uses these virtual MSI-X interrupts.
(Or rather it can use any interrupt: it doesn't care whether it's virtual
or not, it would be fine if it is just a spare interrupt in hardware,
but in practice, it will usually be a virtual one). The code uses these
interrupts by setting up a memory window in the bridge to cover the MMIO
addresses of MSI-X interrupts. It communicates the offsets of the
interrupts (and the MSI message data) to the peer so that the peer can
trigger the interrupt simply by writing the message data to its side of
the memory window. (In the code: ntbm_msi_request_threaded_irq() is
called to request an interrupt, which fills in the ntb_msi_desc with the
offset and data, which is transferred to the peer which would then use
ntb_msi_peer_trigger() to trigger the interrupt.)
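Roughly, using the drivers/ntb/msi.c API (the handler name and the
out-of-band transfer of the descriptor are placeholders):

	struct ntb_msi_desc desc;
	int irq;

	/* local side: request an interrupt and capture its memory
	 * window offset and MSI message data in 'desc' */
	irq = ntbm_msi_request_threaded_irq(ntb, demo_isr, NULL,
					    "ntb_demo", ctx, &desc);

	/* ... hand 'desc' (offset + data) to the peer ... */

	/* peer side: writing desc.data at desc.addr_offset within its
	 * side of the memory window raises the interrupt locally */
	ntb_msi_peer_trigger(peer_ntb, peer_idx, &desc);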

Existing NTB hardware does already have what's called a doorbell which
provides the same functionality as the above technique. However, existing
hardware implementations of doorbells have significant latency and thus
slow down performance substantially. Implementing the MSI interrupts as
described above increased the performance of ntb_transport by more than
three times[1].

There aren't really other "variants". In theory, IDT hardware would also
require the same quirk but the drivers in the tree aren't quite up to
snuff and don't even support ntb_transport (so nobody has added
support). Intel and AMD drivers could probably do this as well (provided
they have extra memory windows) but I don't know that anyone has tried.

Let me know if anything is still unclear or you have further questions.
You can also read the last posting of the patch series[2] if you'd like
more information.

Logan

[1] 2b0569b3b7e6 ("NTB: Add MSI interrupt support to ntb_transport")
[2]
https://lore.kernel.org/all/20190523223100...@deltatee.com/T/#u




Dave Jiang

Nov 29, 2021, 5:50:28 PM
to Logan Gunthorpe, Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger
The Intel driver used to do something similar, in code that was never
upstream, to work around a doorbell hardware erratum on pre-Skylake Xeon
hardware. I'd like to get this working for the performance reasons you
mentioned; I just really need to find some time to test this with the
second memory window Intel NTB has.

Jason Gunthorpe

Nov 29, 2021, 6:31:37 PM
to Logan Gunthorpe, Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger
On Mon, Nov 29, 2021 at 03:27:20PM -0700, Logan Gunthorpe wrote:

> In most cases, the NTB code needs more interrupts than the hardware
> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> for: it allows the driver to request more interrupts than the hardware
> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> created, but get flagged with msi_attrib.is_virtual which ensures
> functions that program the MSI-X table don't try to write past the end
> of the hardware's table.

AFAICT what you've described is what Intel is calling IMS in other
contexts.

IMS is fundamentally a way to control MSI interrupt descriptors that
are not accessed through PCI SIG compliant means. In this case the NTB
driver has to do its magic to relay the addr/data pairs to the real
MSI storage in the hidden devices.

PCI_IRQ_VIRTUAL should probably be fully replaced by the new dynamic
APIs in the fullness of time.

> Existing NTB hardware already has what's called a doorbell, which
> provides the same functionality as the above technique. However, existing
> hardware implementations of doorbells have significant latency and thus
> slow down performance substantially. Implementing the MSI interrupts as
> described above increased the performance of ntb_transport by more than
> three times[1].

Does the doorbell scheme allow as many interrupts?

Jason

Logan Gunthorpe

Nov 29, 2021, 6:52:42 PM
to Jason Gunthorpe, Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger


On 2021-11-29 4:31 p.m., Jason Gunthorpe wrote:
> On Mon, Nov 29, 2021 at 03:27:20PM -0700, Logan Gunthorpe wrote:
>
>> In most cases, the NTB code needs more interrupts than the hardware
>> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
>> for: it allows the driver to request more interrupts than the hardware
>> advertises (ie. pci_msix_vec_count()). These extra interrupts are
>> created, but get flagged with msi_attrib.is_virtual which ensures
>> functions that program the MSI-X table don't try to write past the end
>> of the hardware's table.
>
> AFAICT what you've described is what Intel is calling IMS in other
> contexts.
>
> IMS is fundamentally a way to control MSI interrupt descriptors that
> are not accessed through PCI SIG compliant means. In this case the NTB
> driver has to do its magic to relay the addr/data pairs to the real
> MSI storage in the hidden devices.

With current applications, it isn't that there is real "MSI storage"
anywhere; the device on the other side of the bridge is always another
Linux host which holds the address (or rather mw offset) and data in
memory to use when it needs to trigger the interrupt of the other
machine. There are many prototypes and proprietary messes that try to
put other PCI devices (e.g. NVMe) behind the non-transparent bridge,
but the Linux subsystem has no support for this.

> PCI_IRQ_VIRTUAL should probably be fully replaced by the new dynamic
> APIs in the fullness of time.

Perhaps, I don't really know much about IMS or how close a match it is.

>> Existing NTB hardware already has what's called a doorbell, which
>> provides the same functionality as the above technique. However, existing
>> hardware implementations of doorbells have significant latency and thus
>> slow down performance substantially. Implementing the MSI interrupts as
>> described above increased the performance of ntb_transport by more than
>> three times[1].
>
> Does the doorbell scheme allow as many interrupts?

No, but for current applications there are plenty of doorbells.

Switchtec hardware (and I think other hardware) typically has 64
doorbells for the entire network (they must be split among the number of
hosts in the network; a two-host system could have 32 per host). The NTB
subsystem in Linux currently supports only 2 hosts, but switchtec
hardware supports up to 48 hosts, in which case you might have only 1
doorbell per host, and that might be limiting depending on the
application.

Logan

Jason Gunthorpe

Nov 29, 2021, 7:01:54 PM
to Logan Gunthorpe, Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger
On Mon, Nov 29, 2021 at 04:52:35PM -0700, Logan Gunthorpe wrote:
>
>
> On 2021-11-29 4:31 p.m., Jason Gunthorpe wrote:
> > On Mon, Nov 29, 2021 at 03:27:20PM -0700, Logan Gunthorpe wrote:
> >
> >> In most cases, the NTB code needs more interrupts than the hardware
> >> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> >> for: it allows the driver to request more interrupts than the hardware
> >> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> >> created, but get flagged with msi_attrib.is_virtual which ensures
> >> functions that program the MSI-X table don't try to write past the end
> >> of the hardware's table.
> >
> > AFAICT what you've described is what Intel is calling IMS in other
> > contexts.
> >
> > IMS is fundamentally a way to control MSI interrupt descriptors that
> > are not accessed through PCI SIG compliant means. In this case the NTB
> > driver has to do its magic to relay the addr/data pairs to the real
> > MSI storage in the hidden devices.
>
> With current applications, it isn't that there is real "MSI storage"
> anywhere; the device on the other side of the bridge is always another
> Linux host which holds the address (or rather mw offset) and data in
> memory to use when it needs to trigger the interrupt of the other
> machine.

Sure, that qualifies as "MSI Storage". The triggering device only needs
to store the addr/data pair someplace to count as "MSI Storage".

Jason

Thomas Gleixner

Nov 29, 2021, 7:29:43 PM
to Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, x...@kernel.org
Logan,

On Mon, Nov 29 2021 at 15:27, Logan Gunthorpe wrote:
> On 2021-11-29 1:51 p.m., Thomas Gleixner wrote:
>> The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
>> to allocate non-hardware backed MSI-X descriptors.
>>
>> AFAIU these descriptors are not MSI-X descriptors in the regular sense
>> of PCI/MSI-X. They are allocated via the PCI/MSI mechanism, but they are
>> consumed somewhere in NTB, which has nothing to do with how the real
>> MSI-X interrupts of a device work. That explains why we have those
>> pci.msi_attrib.is_virtual checks all over the place.
>>
>> I assume that there are other variants feeding into NTB which can handle
>> that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
>> in that code.
>>
>> Could you please shed some light on the larger picture of this?
>
> Yes, of course. I'll try to explain:
>
> The NTB code here is trying to create an MSI interrupt that is not
> triggered by the PCI device itself but by a peer behind the
> Non-Transparent Bridge (or, more carefully: from the CPU's perspective
> the interrupt will come from the PCI device, but nothing in the PCI
> device's firmware or hardware will have triggered the interrupt).

That far I got.

> In most cases, the NTB code needs more interrupts than the hardware
> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> for: it allows the driver to request more interrupts than the hardware
> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> created, but get flagged with msi_attrib.is_virtual which ensures
> functions that program the MSI-X table don't try to write past the end
> of the hardware's table.
>
> The NTB code in drivers/ntb/msi.c uses these virtual MSI-X interrupts.
> (Or rather it can use any interrupt: it doesn't care whether it's virtual
> or not; it would be fine if it were just a spare interrupt in hardware,
> but in practice it will usually be a virtual one.) The code uses these
> interrupts by setting up a memory window in the bridge to cover the MMIO
> addresses of MSI-X interrupts. It communicates the offsets of the
> interrupts (and the MSI message data) to the peer so that the peer can
> trigger the interrupt simply by writing the message data to its side of
> the memory window. (In the code: ntbm_msi_request_threaded_irq() is
> called to request an interrupt and fills in the ntb_msi_desc with the
> offset and data, which is transferred to the peer, which then uses
> ntb_msi_peer_trigger() to trigger the interrupt.)

So the whole thing looks like this:

  PCIe
   |
   |    _______________________
   |   |                       |
   |   |          NTB          |
   |   |                       |
   |   |   PCI config space    |
   |   |   MSI-X space         |
   |   |_______________________|
   |---|                       |
       |   Memory window A     |
       |   Memory window ..    |
       |   Memory window N     |
       |_______________________|

The peers can inject an interrupt through the associated memory window,
just like the NTB device itself raises its real MSI-X interrupts: by
writing the MSI message data to the corresponding address (either
directly to the targeted APIC or to the IOMMU table).

As this message goes through the NTB device, it is tagged as originating
from the NTB PCIe device as seen by the IOMMU/Interrupt remapping unit.
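In code terms the injection boils down to a single posted write of the
message data through the peer's mapping of the window (a sketch;
'mw_vbase' and the offset handshake are assumptions):

    /* Peer side: replay the MSI message through the memory window.
     * mw_vbase is the peer's ioremap of its side of the window;
     * desc.addr_offset and desc.data were communicated beforehand. */
    iowrite32(desc.data, mw_vbase + desc.addr_offset);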

So far so good.

I completely understand why you went down the road to add this
PCI_IRQ_VIRTUAL "feature", but TBH this should have never happened.

Why?

These interrupts have absolutely nothing to do with PCI/MSI-X as defined
by the spec and as handled by the PCI/MSI core.

The fact that the MSI message is transported by PCIe does not change
that at all, nor does it matter that from an IOMMU/Interrupt
remapping POV these messages are tagged to come from that particular
PCIe device.

At the conceptual level these interrupts are in separate irq domains:

   |    _______________________
   |   |                       |
   |   |          NTB          |
   |   |                       |
   |   |   PCI config space    |
   |   |   MSI-X space         | <- #1 Global or per IOMMU zone PCI/MSI domain
   |   |_______________________|
   |---|                       |
       |   Memory window A     |
       |   Memory window ..    | <- #2 Per device NTB domain
       |   Memory window N     |
       |_______________________|

You surely can see the distinction between #1 and #2, right?

And because they are in different domains, they simply cannot share the
interrupt chip implementation taking care of the particular oddities of
the "hardware". I know, you made it 'shared' by adding these
'is_virtual' conditionals all over the place, but that pretty much
defeats the purpose of having separate domains.

This is also reflected in drivers/ntb/msi.c::ntbm_msi_request_threaded_irq():

    for_each_pci_msi_entry(entry, ntb->pdev) {
            if (irq_has_action(entry->irq))
                    continue;

What on earth guarantees that this check has any relevance? The fact
that an entry does not have an interrupt requested on it tells you
nothing.

That might be an actual entry which belongs to the physical PCI NTB
device's MSI-X space and simply has not been requested yet. Just because
the switchtec device does not fall into that category does not make it
any more correct.

Seriously, infrastructure code which relies on undocumented assumptions
based on a particular hardware device is broken by definition.

I'm way too tired to come up with a proper solution for that, but that
PCI_IRQ_VIRTUAL has to die ASAP.

Thanks,

tglx

Logan Gunthorpe

Nov 30, 2021, 2:21:32 PM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, x...@kernel.org


On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
> At the conceptual level these interrupts are in separate irq domains:
>
>    |    _______________________
>    |   |                       |
>    |   |          NTB          |
>    |   |                       |
>    |   |   PCI config space    |
>    |   |   MSI-X space         | <- #1 Global or per IOMMU zone PCI/MSI domain
>    |   |_______________________|
>    |---|                       |
>        |   Memory window A     |
>        |   Memory window ..    | <- #2 Per device NTB domain
>        |   Memory window N     |
>        |_______________________|
>
> You surely can see the distinction between #1 and #2, right?

I wouldn't say that's entirely obvious or even irrefutable. However, I'm
certainly open to this approach if it improves the code.

> I'm way too tired to come up with a proper solution for that, but that
> PCI_IRQ_VIRTUAL has to die ASAP.

I'm willing to volunteer a bit of my time to clean this up, but I'd need
a bit more direction on what a proper solution would look like. The MSI
domain code is neither well documented nor easy to understand.

Logan

Thomas Gleixner

Nov 30, 2021, 2:48:06 PM
to Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, x...@kernel.org
Logan,
Fair enough. I'm struggling with finding time to document that properly.

I've not yet made my mind up what the best way forward for this is, but
I have a few ideas which I want to explore deeper.

But the most important question right now is on which architectures
these switchtec virtual interrupt things are used.

If it's used on any architecture which does not use hierarchical
irqdomains for MSI (x86, arm, arm64 and power64 do), then using
irqdomains is obviously a non-starter, unless falling back to a single
interrupt would not be considered a regression :)

Thanks,

tglx

Logan Gunthorpe

Nov 30, 2021, 3:14:48 PM
to Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, x...@kernel.org, Doug Meyer
Well, that's a hard question to answer. The switchtec hardware is a very
generic PCI switch that can technically be used on any architecture that
supports PCI. However, NTB is a very specialized use case and only a
handful of companies have attempted to use it for anything as is. I
can't say I know for sure, but my gut says the vast majority (if not
all) are using x86. Maybe some are trying arm64 or power64, but the
only architecture not in that list that I could conceivably see someone
caring about down the road is riscv.

Most other NTB hardware (specifically AMD and Intel) is built into x86
CPUs, so it should be fine with this restriction.

I personally expect it would be fine to add a dependency on
CONFIG_IRQ_DOMAIN_HIERARCHY to CONFIG_NTB_MSI. However, I've copied Doug
from GigaIO, who is the only user I know who might have a better-informed
opinion on this.

Logan

Jason Gunthorpe

Nov 30, 2021, 3:28:06 PM
to Thomas Gleixner, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, x...@kernel.org
On Tue, Nov 30, 2021 at 08:48:03PM +0100, Thomas Gleixner wrote:
> Logan,
>
> On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
> > On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
> >> I'm way too tired to come up with a proper solution for that, but that
> >> PCI_IRQ_VIRTUAL has to die ASAP.
> >
> > I'm willing to volunteer a bit of my time to clean this up, but I'd need
> > a bit more direction on what a proper solution would look like. The MSI
> domain code is neither well documented nor easy to understand.
>
> Fair enough. I'm struggling with finding time to document that properly.
>
> I've not yet made my mind up what the best way forward for this is, but
> I have a few ideas which I want to explore deeper.

I may have lost the plot in all of these patches, but I thought the
direction was moving toward the msi_domain_alloc_irqs() approach IDXD
demo'd here:

https://lore.kernel.org/kvm/162164243591.261970.34...@djiang5-desk3.ch.intel.com/

I'd expect all the descriptor handling code in drivers/ntb/msi.c to
get wrapped in an irq_chip instead of inserting a single-use callback
into the pci core code's implementation:

void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
        if (entry->write_msi_msg)
                entry->write_msi_msg(entry, entry->write_msi_msg_data);

If this doesn't become an irq_chip what other way is there to properly
program the addr/data pair as drivers/ntb/msi.c is doing?
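For reference, a rough sketch of what such a chip could look like
(names and the chip-data layout are hypothetical, not the existing NTB
code):

    static void ntb_msi_write_msg(struct irq_data *d, struct msi_msg *msg)
    {
            struct ntb_msi_desc *desc = irq_data_get_irq_chip_data(d);

            /* Capture the addr/data pair so it can be handed to the
             * peer instead of being written to device MMIO. */
            desc->addr_offset = msg->address_lo;    /* simplified */
            desc->data = msg->data;
    }

    static struct irq_chip ntb_msi_chip = {
            .name                   = "NTB-MSI",
            .irq_write_msi_msg      = ntb_msi_write_msg,
            .irq_mask               = irq_chip_mask_parent,
            .irq_unmask             = irq_chip_unmask_parent,
    };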

Jason

Thomas Gleixner

Nov 30, 2021, 4:23:19 PM
to Jason Gunthorpe, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, x...@kernel.org
On Tue, Nov 30 2021 at 16:28, Jason Gunthorpe wrote:
> On Tue, Nov 30, 2021 at 08:48:03PM +0100, Thomas Gleixner wrote:
>> On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
>> > On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
>> >> I'm way too tired to come up with a proper solution for that, but that
>> >> PCI_IRQ_VIRTUAL has to die ASAP.
>> >
>> > I'm willing to volunteer a bit of my time to clean this up, but I'd need
>> > a bit more direction on what a proper solution would look like. The MSI
>> > domain code is neither well documented nor easy to understand.
>>
>> Fair enough. I'm struggling with finding time to document that properly.
>>
>> I've not yet made my mind up what the best way forward for this is, but
>> I have a few ideas which I want to explore deeper.
>
> I may have lost the plot in all of these patches, but I thought the
> direction was moving toward the msi_domain_alloc_irqs() approach IDXD
> demo'd here:
>
> https://lore.kernel.org/kvm/162164243591.261970.34...@djiang5-desk3.ch.intel.com/

Yes, that's something I have in mind. Though this patch series is not
really required to support IDXD, it makes things simpler.

The main point of this is to cure the VFIO issue of tearing down MSI-X
of passed-through devices in order to expand the MSI-X vector space on
the host.

> I'd expect all the descriptor handling code in drivers/ntb/msi.c to
> get wrapped in an irq_chip instead of inserting a single-use callback
> into the pci core code's implementation:
>
> void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
> {
>         if (entry->write_msi_msg)
>                 entry->write_msi_msg(entry, entry->write_msi_msg_data);
>
> If this doesn't become an irq_chip what other way is there to properly
> program the addr/data pair as drivers/ntb/msi.c is doing?

That's not the question. This surely will be a separate irq chip and a
separate irqdomain.

The real problem is where to store the MSI descriptors because the PCI
device has its own real PCI/MSI-X interrupts which means it still shares
the storage space.

IDXD is different in that regard because IDXD creates subdevices which
have their own struct device and they just store the MSI descriptors in
the msi data of that device.

I'm currently tending to partition the index space in the xarray:

0x00000000 - 0x0000ffff PCI/MSI-X
0x00010000 - 0x0001ffff NTB

which is feasible now with the range modifications and way simpler to do
with xarray than with the linked list.
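A sketch of that carve-up (constants and field names hypothetical):

    /* Hypothetical partitioning of the per-device descriptor space */
    #define MSI_XA_DOMAIN_SHIFT     16
    #define MSI_XA_PCI_MSIX_BASE    (0 << MSI_XA_DOMAIN_SHIFT)  /* 0x00000000 */
    #define MSI_XA_NTB_BASE         (1 << MSI_XA_DOMAIN_SHIFT)  /* 0x00010000 */

    /* md: hypothetical per-device MSI data with a single xarray store.
     * Store an NTB descriptor at its domain-relative index. */
    ret = xa_insert(&md->store, MSI_XA_NTB_BASE + idx, desc, GFP_KERNEL);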

Thanks,

tglx


Jason Gunthorpe

Nov 30, 2021, 7:17:51 PM
to Thomas Gleixner, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, x...@kernel.org
On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

> > If this doesn't become an irq_chip what other way is there to properly
> > program the addr/data pair as drivers/ntb/msi.c is doing?
>
> That's not the question. This surely will be a separate irq chip and a
> separate irqdomain.

OK

> The real problem is where to store the MSI descriptors because the PCI
> device has its own real PCI/MSI-X interrupts which means it still shares
> the storage space.

Er.. I never realized that just looking at the patches :|

That is relevant to all real "IMS" users. IDXD escaped this because
it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
a messy artifact of VFIO; it should not be required to make the IRQ
layers work.

I don't think it makes sense that the msi_desc would point to an mdev;
the iommu layer consumes msi_desc_to_dev(). It really should point
to the physical device that originates the message, with proper
iommu ops/data/etc.

> I'm currently tending to partition the index space in the xarray:
>
> 0x00000000 - 0x0000ffff PCI/MSI-X
> 0x00010000 - 0x0001ffff NTB

It is OK; with some xarray work it can be range-allocating & reserving
so that the msi_domain_alloc_irqs() flows can carve out chunks of the
number space.
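Something like this, using the allocating xarray API with a range limit
(a sketch; the limits are the hypothetical NTB partition quoted above,
and 'md' is an assumed per-device MSI data structure):

    static const struct xa_limit ntb_limit = {
            .min = 0x10000,         /* partition start */
            .max = 0x1ffff,         /* partition end */
    };
    u32 index;

    /* Let the xarray pick the first free index inside the partition */
    ret = xa_alloc(&md->store, &index, desc, ntb_limit, GFP_KERNEL);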

Another view is that the msi_domain_alloc_irqs() flows should have their
own xarrays.

> which is feasible now with the range modifications and way simpler to do
> with xarray than with the linked list.

Indeed!

Regards,
Jason

Thomas Gleixner

Dec 1, 2021, 5:16:51 AM
to Jason Gunthorpe, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, x...@kernel.org, Joerg Roedel, io...@lists.linux-foundation.org
Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>> The real problem is where to store the MSI descriptors because the PCI
>> device has its own real PCI/MSI-X interrupts which means it still shares
>> the storage space.
>
> Er.. I never realized that just looking at the patches :|
>
> That is relevant to all real "IMS" users. IDXD escaped this because
> it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
> a messy artifact of VFIO; it should not be required to make the IRQ
> layers work.

> I don't think it makes sense that the msi_desc would point to an mdev;
> the iommu layer consumes msi_desc_to_dev(). It really should point
> to the physical device that originates the message, with proper
> iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device
makes a lot of sense from the conceptual level. That makes it pretty
much obvious to manage the MSIs of those devices at this level, like we
do for any other device.

Whether mdev is the right encapsulation for these subdevices is an
orthogonal problem.

I surely agree that msi_desc::dev is an interesting question, but we
already have this disconnect of msi_desc::dev and DMA today due to DMA
aliasing. I haven't looked at that in detail yet, but of course the
alias handling is substantially different across the various IOMMU
implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as a physical function device as a whole, but that's
conceptually a different case.

>> I'm currently tending to partition the index space in the xarray:
>>
>> 0x00000000 - 0x0000ffff PCI/MSI-X
>> 0x00010000 - 0x0001ffff NTB
>
> It is OK; with some xarray work it can be range-allocating & reserving
> so that the msi_domain_alloc_irqs() flows can carve out chunks of the
> number space.
>
> Another view is that the msi_domain_alloc_irqs() flows should have their
> own xarrays.

Yes, I was thinking about that as well. The trivial way would be:

struct xarray store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the
proposed encapsulation of the xarray handling that's definitely
feasible. Whether that buys much is a different question. Let me think
about it some more.
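A sketch of that trivial variant (names hypothetical):

    struct msi_device_data {
            struct xarray store[MSI_MAX_STORES];    /* one xarray per domain */
    };

    static struct msi_desc *msi_get_desc(struct msi_device_data *md,
                                         unsigned int domid,
                                         unsigned long index)
    {
            return xa_load(&md->store[domid], index);
    }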

>> which is feasible now with the range modifications and way simpler to do
>> with xarray than with the linked list.
>
> Indeed!

I'm glad you like the approach.

Thanks,

tglx


Jason Gunthorpe

Dec 1, 2021, 8:00:27 AM
to Thomas Gleixner, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linu...@vger.kernel.org, Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe, linu...@googlegroups.com, linux...@vger.kernel.org, Heiko Carstens, Christian Borntraeger, x...@kernel.org, Joerg Roedel, io...@lists.linux-foundation.org
On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:

> Looking at the device slices as subdevices with their own struct device
> makes a lot of sense from the conceptual level.

Except IMS is not just for subdevices; it should be usable by any
driver in any case as a general interrupt mechanism, as you alluded to
below about ethernet queues. ntb seems to be the current example of
this need.

If it works properly on the physical PCI dev, there is no reason to try
to also make it work on the mdev and add complexity in iommu land.

> Though I fear there is also a use case for MSI-X and IMS tied to the
> same device. That network card you are talking about might end up using
> MSI-X for a control block and then IMS for the actual network queues
> when it is used as a physical function device as a whole, but that's
> conceptually a different case.

Is it? Why?

IDXD is not so much making device "slices", but virtualizing and
sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
described and the VFIO driver is simply allocating queues from a PCI
device for specific usages and assigning them interrupts.

There is already a char dev interface that equally allocates queues
from the same IDXD device, so why shouldn't it be able to access IMS
interrupt pools too?

IMHO a well-designed IDXD driver should put all the PCI MMIO, queue
management, interrupts, etc. in the PCI driver layer, and the VFIO
driver layer should only deal with the MMIO trapping and VFIO
interfacing.

From this view it is conceptually wrong to have the VFIO driver
directly talking to MMIO registers in the PCI device or owning the
irq_chip. It would be very odd for the PCI driver to allocate
interrupts from some random external struct device when it is creating
queues on the PCI device.

> > Another view is that the msi_domain_alloc_irqs() flows should have their
> > own xarrays.
>
> Yes, I was thinking about that as well. The trivial way would be:
>
> struct xarray store[MSI_MAX_STORES];
>
> and then have a store index for each allocation domain. With the
> proposed encapsulation of the xarray handling that's definitely
> feasible. Whether that buys much is a different question. Let me think
> about it some more.

Any possibility that the 'IMS' xarray could be outside the struct
device?

Thanks,
Jason