
[PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64


Eric Auger

Jan 26, 2016, 8:13:31 AM
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, christof...@linaro.org, marc.z...@arm.com, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, suravee.su...@amd.com, linux-...@vger.kernel.org, pat...@linaro.org, io...@lists.linux-foundation.org
This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
It builds on the efforts in [1], [2] and [3]. It also aims at covering the
same need on some PowerPC platforms.

On x86, all accesses to the 1MB PA region [FEE0_0000h - FEF0_0000h] are decoded
as interrupt messages: accesses to this special PA window directly target the
APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.

This is not the case on the above-mentioned platforms, where MSI messages emitted
by devices are conveyed through the IOMMU. This means an IOVA/host PA mapping
must exist for the MSI to reach the MSI controller. The normal way to create
IOVA bindings is to use the VFIO DMA MAP API. However, in this case
the MSI IOVA is not mapped onto guest RAM but onto a host physical page (the MSI
controller frame).

Following the first round of comments, the spirit of [2] is kept: the guest
registers an IOVA range reserved for MSI mapping. When the VFIO-PCI driver
allocates its MSI vectors, it overwrites the MSI controller physical address
with an IOVA allocated within the window provided by userspace. This IOVA is
mapped onto the MSI controller frame physical page.

The series does not yet address the problem of telling userspace how
much IOVA space it should provision.

Best Regards

Eric

Testing:
This is currently tested on ARM64 AMD Overdrive HW (single GICv2m frame)
with an e1000e PCIe card. This is not tested on PPC.

References:
[1] [RFC 0/2] VFIO: Add virtual MSI doorbell support
(https://lkml.org/lkml/2015/7/24/135)
[2] [RFC PATCH 0/6] vfio: Add interface to map MSI pages
(https://lists.cs.columbia.edu/pipermail/kvmarm/2015-September/016607.html)
[3] [PATCH v2 0/3] Introduce MSI hardware mapping for VFIO
(http://permalink.gmane.org/gmane.comp.emulators.kvm.arm.devel/3858)

Git:
https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.5-rc1-pcie-passthrough-v1

History:

RFC v1 [2] -> PATCH v1:
- use the existing dma map/unmap ioctl interface with a flag to register a
reserved IOVA range. Use the legacy RB tree to store this special vfio_dma.
- a single contiguous reserved IOVA region is now allowed
- use of an RB tree indexed by PA to store allocated reserved slots
- use of a vfio_domain iova_domain to manage iova allocation within the
window provided by userspace
- vfio alloc_map/unmap_free take a vfio_group handle
- vfio_group handle is cached in vfio_pci_device
- add ref counting to bindings
- user modality enabled at the end of the series


Eric Auger (10):
iommu: Add DOMAIN_ATTR_MSI_MAPPING attribute
vfio: expose MSI mapping requirement through VFIO_IOMMU_GET_INFO
vfio_iommu_type1: add reserved binding RB tree management
vfio: introduce VFIO_IOVA_RESERVED vfio_dma type
vfio/type1: attach a reserved iova domain to vfio_domain
vfio: introduce vfio_group_alloc_map_/unmap_free_reserved_iova
vfio: pci: cache the vfio_group in vfio_pci_device
vfio: introduce vfio_group_require_msi_mapping
vfio-pci: create an iommu mapping for msi address
vfio: allow the user to register reserved iova range for MSI mapping

drivers/iommu/arm-smmu.c | 2 +
drivers/iommu/fsl_pamu_domain.c | 3 +
drivers/vfio/pci/vfio_pci.c | 8 +
drivers/vfio/pci/vfio_pci_intrs.c | 73 ++++++-
drivers/vfio/pci/vfio_pci_private.h | 1 +
drivers/vfio/vfio.c | 64 ++++++
drivers/vfio/vfio_iommu_type1.c | 412 +++++++++++++++++++++++++++++++++++-
include/linux/iommu.h | 1 +
include/linux/vfio.h | 39 +++-
include/uapi/linux/vfio.h | 10 +
10 files changed, 598 insertions(+), 15 deletions(-)

--
1.9.1

Eric Auger

Jan 26, 2016, 8:13:41 AM
The legacy dma_list is just used to insert the reserved IOVA region and
to check that mappings of reserved IOVAs fall within this window. As opposed
to other vfio_dma slots, the reserved one is not necessarily mapped.

We will need to track which host physical addresses are mapped to
reserved IOVAs. To that end we introduce a new RB tree indexed
by physical address. This reverse RB tree is only used for reserved
IOVA bindings. It belongs to a given iommu domain.

It is expected this RB tree will contain very few bindings. Those
generally correspond to a single page mapping one MSI frame (at least
on ARM, containing the GICv2m frame or the ITS GITS_TRANSLATER frame).

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/vfio/vfio_iommu_type1.c | 63 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c5b57e1..32438d9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -56,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
struct vfio_iommu {
struct list_head domain_list;
struct mutex lock;
+ /* rb tree indexed by IOVA */
struct rb_root dma_list;
bool v2;
bool nesting;
@@ -65,6 +66,8 @@ struct vfio_domain {
struct iommu_domain *domain;
struct list_head next;
struct list_head group_list;
+ /* rb tree indexed by PA, for reserved bindings only */
+ struct rb_root reserved_binding_list;
int prot; /* IOMMU_CACHE */
bool fgsp; /* Fine-grained super pages */
};
@@ -77,11 +80,70 @@ struct vfio_dma {
int prot; /* IOMMU_READ/WRITE */
};

+struct vfio_reserved_binding {
+ struct kref kref;
+ struct rb_node node;
+ struct vfio_domain *domain;
+ phys_addr_t addr;
+ dma_addr_t iova;
+ size_t size;
+};
+
struct vfio_group {
struct iommu_group *iommu_group;
struct list_head next;
};

+/* Reserved binding RB-tree manipulation */
+
+static struct vfio_reserved_binding *vfio_find_reserved_binding(
+ struct vfio_domain *d,
+ phys_addr_t start, size_t size)
+{
+ struct rb_node *node = d->reserved_binding_list.rb_node;
+
+ while (node) {
+ struct vfio_reserved_binding *binding =
+ rb_entry(node, struct vfio_reserved_binding, node);
+
+ if (start + size <= binding->addr)
+ node = node->rb_left;
+ else if (start >= binding->addr + binding->size)
+ node = node->rb_right;
+ else
+ return binding;
+ }
+
+ return NULL;
+}
+
+static void vfio_link_reserved_binding(struct vfio_domain *d,
+ struct vfio_reserved_binding *new)
+{
+ struct rb_node **link = &d->reserved_binding_list.rb_node;
+ struct rb_node *parent = NULL;
+ struct vfio_reserved_binding *binding;
+
+ while (*link) {
+ parent = *link;
+ binding = rb_entry(parent, struct vfio_reserved_binding, node);
+
+ if (new->addr + new->size <= binding->addr)
+ link = &(*link)->rb_left;
+ else
+ link = &(*link)->rb_right;
+ }
+
+ rb_link_node(&new->node, parent, link);
+ rb_insert_color(&new->node, &d->reserved_binding_list);
+}
+
+static void vfio_unlink_reserved_binding(struct vfio_domain *d,
+ struct vfio_reserved_binding *old)
+{
+ rb_erase(&old->node, &d->reserved_binding_list);
+}
+
/*
* This code handles mapping and unmapping of user data buffers
* into DMA'ble space using the IOMMU
@@ -784,6 +846,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
ret = -ENOMEM;
goto out_free;
}
+ domain->reserved_binding_list = RB_ROOT;

group->iommu_group = iommu_group;

--
1.9.1

Eric Auger

Jan 26, 2016, 8:13:51 AM
This patch adds a reserved iova_domain to the vfio_domain struct.
This iova domain makes it possible to allocate IOVAs within the reserved
IOVA region.

alloc_reserved_iova_domain makes it possible to allocate and initialize
this iova domain. Its callers will be introduced in subsequent patches.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/vfio/vfio_iommu_type1.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 33a9ce4..33304c0 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
#include <linux/uaccess.h>
#include <linux/vfio.h>
#include <linux/workqueue.h>
+#include <linux/iova.h>

#define DRIVER_VERSION "0.2"
#define DRIVER_AUTHOR "Alex Williamson <alex.wi...@redhat.com>"
@@ -77,6 +78,7 @@ struct vfio_domain {
struct list_head group_list;
/* rb tree indexed by PA, for reserved bindings only */
struct rb_root reserved_binding_list;
+ struct iova_domain *reserved_iova_domain;
int prot; /* IOMMU_CACHE */
bool fgsp; /* Fine-grained super pages */
};
@@ -178,6 +180,41 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
return NULL;
}

+/* alloc_reserved_iova_domain: allocate an iova domain used to manage
+ * reserved iova
+ * @iova: base iova of the domain
+ * @size: size of the domain
+ * @order: iommu page size order
+ */
+static int alloc_reserved_iova_domain(struct vfio_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order)
+{
+ unsigned long granule, mask;
+ int ret = 0;
+
+ granule = 1UL << order;
+ mask = granule - 1;
+ if (iova & mask)
+ return -EINVAL;
+ if ((!size) || (size & mask))
+ return -EINVAL;
+
+ domain->reserved_iova_domain =
+ kzalloc(sizeof(struct iova_domain), GFP_KERNEL);
+ if (!domain->reserved_iova_domain) {
+ ret = -ENOMEM;
+ goto out_free;
+ }
+ init_iova_domain(domain->reserved_iova_domain,
+ granule, iova >> order, (iova + size - 1) >> order);
+ return ret;
+
+out_free:
+ kfree(domain->reserved_iova_domain);
+ return ret;
+}
+
static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
{
struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
--
1.9.1

Eric Auger

Jan 26, 2016, 8:14:04 AM
The user is allowed to register a reserved IOVA range by using the
DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
It provides the base address and the size. This region is stored in the
vfio_dma RB tree. At that point the IOVA range is not mapped to any target
address yet. The host kernel will use those IOVAs when needed, typically
when the VFIO-PCI device allocates its MSIs.

This patch also handles the destruction of the reserved binding RB-tree and
domain's iova_domains.

Signed-off-by: Eric Auger <eric....@linaro.org>
Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>

---

- Currently the user is not yet informed about the number of pages to
provide

RFC v1 -> RFC v2:
- takes into account Alex comments, based on
[RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
- use the existing dma map/unmap ioctl interface with a flag to register
a reserved IOVA range. A single reserved iova region is allowed.
---
drivers/vfio/vfio_iommu_type1.c | 98 ++++++++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 9 ++++
2 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2f085d3..37c7d78 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -538,10 +538,40 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
vfio_lock_acct(-unlocked);
}

+/* vfio_unmap_reserved: unmap and free all reserved binding nodes
+ * for all domains and destroy their iova_domain
+ *
+ * @iommu: iommu handle
+ */
+static void vfio_unmap_reserved(struct vfio_iommu *iommu)
+{
+ struct vfio_domain *d;
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ struct rb_node *node;
+
+ while ((node = rb_first(&d->reserved_binding_list))) {
+ struct vfio_reserved_binding *b =
+ rb_entry(node,
+ struct vfio_reserved_binding, node);
+
+ while (!kref_put(&b->kref,
+ vfio_reserved_binding_release)) {
+ }
+ }
+ d->reserved_binding_list = RB_ROOT;
+
+ put_iova_domain(d->reserved_iova_domain);
+ kfree(d->reserved_iova_domain);
+ }
+}
+
static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
if (likely(dma->type != VFIO_IOVA_RESERVED))
vfio_unmap_unpin(iommu, dma);
+ else
+ vfio_unmap_reserved(iommu);
vfio_unlink_dma(iommu, dma);
kfree(dma);
}
@@ -785,6 +815,68 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
return ret;
}

+static int vfio_register_reserved_iova_range(struct vfio_iommu *iommu,
+ struct vfio_iommu_type1_dma_map *map)
+{
+ dma_addr_t iova = map->iova;
+ size_t size = map->size;
+ uint64_t mask;
+ struct vfio_dma *dma;
+ int ret = 0;
+ struct vfio_domain *d;
+ unsigned long order;
+
+ /* Verify that none of our __u64 fields overflow */
+ if (map->size != size || map->iova != iova)
+ return -EINVAL;
+
+ order = __ffs(vfio_pgsize_bitmap(iommu));
+ mask = ((uint64_t)1 << order) - 1;
+
+ WARN_ON(mask & PAGE_MASK);
+
+ /* we currently only support MSI_RESERVED_IOVA */
+ if (!(map->flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA))
+ return -EINVAL;
+
+ if (!size || (size | iova) & mask)
+ return -EINVAL;
+
+ /* Don't allow IOVA address wrap */
+ if (iova + size - 1 < iova)
+ return -EINVAL;
+
+ mutex_lock(&iommu->lock);
+
+ /* check that the iova domain has not already been instantiated */
+ d = list_first_entry(&iommu->domain_list,
+ struct vfio_domain, next);
+
+ if (d->reserved_iova_domain || vfio_find_dma(iommu, iova, size)) {
+ ret = -EEXIST;
+ goto out;
+ }
+
+ dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+ if (!dma) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ dma->iova = iova;
+ dma->size = size;
+ dma->type = VFIO_IOVA_RESERVED;
+
+ vfio_link_dma(iommu, dma);
+
+ list_for_each_entry(d, &iommu->domain_list, next)
+ alloc_reserved_iova_domain(d, iova, size, order);
+
+out:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
static int vfio_bus_type(struct device *dev, void *data)
{
struct bus_type **bus = data;
@@ -1297,7 +1389,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
} else if (cmd == VFIO_IOMMU_MAP_DMA) {
struct vfio_iommu_type1_dma_map map;
uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
- VFIO_DMA_MAP_FLAG_WRITE;
+ VFIO_DMA_MAP_FLAG_WRITE |
+ VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;

minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);

@@ -1307,6 +1400,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
if (map.argsz < minsz || map.flags & ~mask)
return -EINVAL;

+ if (map.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA)
+ return vfio_register_reserved_iova_range(iommu, &map);
+
return vfio_dma_do_map(iommu, &map);

} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 43e183b..982e326 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -411,12 +411,21 @@ struct vfio_iommu_type1_info {
*
* Map process virtual addresses to IO virtual addresses using the
* provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * When MSI_RESERVED_IOVA is set, the API only aims at registering an IOVA
+ * region which will be used on some platforms to map the host MSI frame.
+ * In that specific case, vaddr and prot are ignored. The requirement for
+ * provisioning such an IOVA range can be checked by calling VFIO_IOMMU_GET_INFO
+ * with the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute. A single
+ * MSI_RESERVED_IOVA region can be registered.
*/
struct vfio_iommu_type1_dma_map {
__u32 argsz;
__u32 flags;
#define VFIO_DMA_MAP_FLAG_READ (1 << 0) /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */
+/* reserved iova for MSI vectors */
+#define VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA (1 << 2)
__u64 vaddr; /* Process virtual address */
__u64 iova; /* IO virtual address */
__u64 size; /* Size of mapping (bytes) */
--
1.9.1

Eric Auger

Jan 26, 2016, 8:14:15 AM
Some platforms require the MSI address programmed into the PCI
device to be an IOVA and not a host physical address.
This is typically the case for ARM and PowerPC, as opposed
to x86. This patch allocates an IOVA page and maps it onto
the physical page which contains the target MSI write transaction
address. In case an IOMMU binding already exists between those two
pages, the device is simply reprogrammed.

The binding is destroyed by the VFIO IOMMU backend.

Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>
Signed-off-by: Eric Auger <eric....@linaro.org>

---
---
drivers/vfio/pci/vfio_pci_intrs.c | 73 ++++++++++++++++++++++++++++++++++++---
1 file changed, 68 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3b3ba15..bac24c9 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -305,6 +305,57 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
return 0;
}

+/**
+ * vfio_set_mapped_msi_addr: overwrite the msi physical address with an iova
+ *
+ * @vdev: vfio pci device handle
+ * @irq: linux irq number
+ * returns 0 upon success, < 0 on failure
+ */
+static int vfio_set_mapped_msi_addr(struct vfio_pci_device *vdev, int irq)
+{
+ phys_addr_t msi_addr;
+ dma_addr_t msi_iova;
+ struct vfio_group *group = vdev->vfio_group;
+ struct msi_msg msg;
+ int ret;
+
+ get_cached_msi_msg(irq, &msg);
+ msi_addr = (phys_addr_t)(msg.address_hi) << 32 |
+ (phys_addr_t)(msg.address_lo);
+
+ ret = vfio_group_alloc_map_reserved_iova(group, msi_addr,
+ IOMMU_WRITE, &msi_iova);
+ if (ret)
+ goto out;
+
+ /* Re-program the msi-address with the iova */
+ msg.address_hi = (u32)(msi_iova >> 32);
+ msg.address_lo = (u32)(msi_iova & 0xffffffff);
+ pci_write_msi_msg(irq, &msg);
+
+out:
+ return ret;
+}
+
+/**
+ * vfio_unset_mapped_msi_addr: decrement the ref counter of the msi iova page
+ * associated with the linux irq (when it reaches zero, unmap and free resources)
+ *
+ * @vdev: vfio pci device handle
+ * @irq: linux irq number
+ */
+static void vfio_unset_mapped_msi_addr(struct vfio_pci_device *vdev, int irq)
+{
+ dma_addr_t msi_iova;
+ struct vfio_group *group = vdev->vfio_group;
+ struct msi_msg msg;
+
+ get_cached_msi_msg(irq, &msg);
+ msi_iova = (u64)(msg.address_hi) << 32 | (u64)(msg.address_lo);
+ vfio_group_unmap_free_reserved_iova(group, msi_iova);
+}
+
static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
int vector, int fd, bool msix)
{
@@ -318,6 +369,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
return -EINVAL;

if (vdev->ctx[vector].trigger) {
+ vfio_unset_mapped_msi_addr(vdev, irq);
free_irq(irq, vdev->ctx[vector].trigger);
irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
kfree(vdev->ctx[vector].name);
@@ -355,11 +407,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,

ret = request_irq(irq, vfio_msihandler, 0,
vdev->ctx[vector].name, trigger);
- if (ret) {
- kfree(vdev->ctx[vector].name);
- eventfd_ctx_put(trigger);
- return ret;
- }
+ if (ret)
+ goto error_free;

vdev->ctx[vector].producer.token = trigger;
vdev->ctx[vector].producer.irq = irq;
@@ -369,9 +418,23 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
"irq bypass producer (token %p) registration fails: %d\n",
vdev->ctx[vector].producer.token, ret);

+ if (vfio_group_require_msi_mapping(vdev->vfio_group)) {
+ ret = vfio_set_mapped_msi_addr(vdev, irq);
+ if (ret)
+ goto error_free_irq;
+ }
+
vdev->ctx[vector].trigger = trigger;

return 0;
+
+error_free_irq:
+ free_irq(irq, vdev->ctx[vector].trigger);
+error_free:
+ kfree(vdev->ctx[vector].name);
+ eventfd_ctx_put(trigger);
+ return ret;
+
}

static int vfio_msi_set_block(struct vfio_pci_device *vdev, unsigned start,
--
1.9.1

Eric Auger

Jan 26, 2016, 8:14:41 AM
This new function allows callers to know whether MSI write transaction
addresses must be mapped.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/vfio/vfio.c | 22 ++++++++++++++++++++++
drivers/vfio/vfio_iommu_type1.c | 8 ++++++++
include/linux/vfio.h | 2 ++
3 files changed, 32 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 2760d4c..f3df5a10 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -793,6 +793,28 @@ static int vfio_iommu_group_notifier(struct notifier_block *nb,
return NOTIFY_OK;
}

+bool vfio_group_require_msi_mapping(struct vfio_group *group)
+{
+ struct vfio_container *container = group->container;
+ struct vfio_iommu_driver *driver;
+ bool ret;
+
+ down_read(&container->group_lock);
+
+ driver = container->iommu_driver;
+ if (!driver || !driver->ops || !driver->ops->require_msi_mapping) {
+ ret = false;
+ goto up;
+ }
+
+ ret = driver->ops->require_msi_mapping(container->iommu_data);
+
+up:
+ up_read(&container->group_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_group_require_msi_mapping);
+
/**
* VFIO driver API
*/
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a79e2a8..2f085d3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1179,6 +1179,13 @@ unlock:
return ret;
}

+static bool vfio_iommu_type1_require_msi_mapping(void *iommu_data)
+{
+ struct vfio_iommu *iommu = iommu_data;
+
+ return vfio_domains_require_msi_mapping(iommu);
+}
+
static void *vfio_iommu_type1_open(unsigned long arg)
{
struct vfio_iommu *iommu;
@@ -1336,6 +1343,7 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
vfio_iommu_type1_alloc_map_reserved_iova,
.unmap_free_reserved_iova =
vfio_iommu_type1_unmap_free_reserved_iova,
+ .require_msi_mapping = vfio_iommu_type1_require_msi_mapping,
};

static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index f7eaf30..3e6cbeb 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -58,6 +58,7 @@ extern void *vfio_del_group_dev(struct device *dev);
extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
extern void vfio_device_put(struct vfio_device *device);
extern void *vfio_device_data(struct vfio_device *device);
+extern bool vfio_group_require_msi_mapping(struct vfio_group *group);

/**
* struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
@@ -85,6 +86,7 @@ struct vfio_iommu_driver_ops {
int (*unmap_free_reserved_iova)(void *iommu_data,
struct iommu_group *group,
dma_addr_t iova);
+ bool (*require_msi_mapping)(void *iommu_data);
};

extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
--
1.9.1

Eric Auger

Jan 26, 2016, 8:15:00 AM
We introduce a vfio_dma type since we will need to discriminate
legacy vfio_dmas from the new reserved ones. Since the latter are
not mapped at registration time, some handling must be skipped for
them: removal and replay. These paths will be enhanced later on.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/vfio/vfio_iommu_type1.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 32438d9..33a9ce4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -53,6 +53,15 @@ module_param_named(disable_hugepages,
MODULE_PARM_DESC(disable_hugepages,
"Disable VFIO IOMMU support for IOMMU hugepages.");

+enum vfio_iova_type {
+ VFIO_IOVA_USER = 0, /* standard IOVA used to map user vaddr */
+ /*
+ * IOVA reserved to map special host physical addresses,
+ * MSI frames for instance
+ */
+ VFIO_IOVA_RESERVED,
+};
+
struct vfio_iommu {
struct list_head domain_list;
struct mutex lock;
@@ -78,6 +87,7 @@ struct vfio_dma {
unsigned long vaddr; /* Process virtual addr */
size_t size; /* Map size (bytes) */
int prot; /* IOMMU_READ/WRITE */
+ enum vfio_iova_type type; /* type of IOVA */
};

struct vfio_reserved_binding {
@@ -480,7 +490,8 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)

static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
- vfio_unmap_unpin(iommu, dma);
+ if (likely(dma->type != VFIO_IOVA_RESERVED))
+ vfio_unmap_unpin(iommu, dma);
vfio_unlink_dma(iommu, dma);
kfree(dma);
}
@@ -756,6 +767,10 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
dma_addr_t iova;

dma = rb_entry(n, struct vfio_dma, node);
+
+ if (unlikely(dma->type == VFIO_IOVA_RESERVED))
+ continue;
+
iova = dma->iova;

while (iova < dma->iova + dma->size) {
--
1.9.1

Eric Auger

Jan 26, 2016, 8:15:04 AM
Add a new set_group operation which allows the vfio_group
handle to be cached within the vfio_pci_device struct. This is useful
for performing IOMMU operations from the vfio-pci device.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/vfio/pci/vfio_pci.c | 8 ++++++++
drivers/vfio/pci/vfio_pci_private.h | 1 +
drivers/vfio/vfio.c | 3 +++
include/linux/vfio.h | 3 +++
4 files changed, 15 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 2760a7b..2a32856 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -920,6 +920,13 @@ static void vfio_pci_request(void *device_data, unsigned int count)
mutex_unlock(&vdev->igate);
}

+static void vfio_pci_set_group(void *device_data, struct vfio_group *group)
+{
+ struct vfio_pci_device *vdev = device_data;
+
+ vdev->vfio_group = group;
+}
+
static const struct vfio_device_ops vfio_pci_ops = {
.name = "vfio-pci",
.open = vfio_pci_open,
@@ -929,6 +936,7 @@ static const struct vfio_device_ops vfio_pci_ops = {
.write = vfio_pci_write,
.mmap = vfio_pci_mmap,
.request = vfio_pci_request,
+ .set_group = vfio_pci_set_group,
};

static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 0e7394f..2893b10 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -35,6 +35,7 @@ struct vfio_pci_irq_ctx {

struct vfio_pci_device {
struct pci_dev *pdev;
+ struct vfio_group *vfio_group;
void __iomem *barmap[PCI_STD_RESOURCE_END + 1];
u8 *pci_config_map;
u8 *vconfig;
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 3d9de00..2760d4c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -548,6 +548,9 @@ struct vfio_device *vfio_group_create_device(struct vfio_group *group,
/* No need to get group_lock, caller has group reference */
vfio_group_get(group);

+ if (device->ops->set_group)
+ device->ops->set_group(device_data, group);
+
mutex_lock(&group->device_lock);
list_add(&device->group_next, &group->device_list);
mutex_unlock(&group->device_lock);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0020f81..f7eaf30 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,8 @@
#include <linux/poll.h>
#include <uapi/linux/vfio.h>

+struct vfio_group;
+
/**
* struct vfio_device_ops - VFIO bus driver device callbacks
*
@@ -42,6 +44,7 @@ struct vfio_device_ops {
unsigned long arg);
int (*mmap)(void *device_data, struct vm_area_struct *vma);
void (*request)(void *device_data, unsigned int count);
+ void (*set_group)(void *device_data, struct vfio_group *group);
};

extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
--
1.9.1

Eric Auger

Jan 26, 2016, 8:15:35 AM
This patch introduces vfio_group_alloc_map_/unmap_free_reserved_iova
and implements the corresponding vfio_iommu_type1 operations.

alloc_map allocates a new reserved IOVA page and maps it
onto the physical page that contains a given PA. It returns the IOVA
that is mapped onto the provided PA. In case a mapping already exists
between both pages, the IOVA corresponding to the PA is returned directly.

Signed-off-by: Eric Auger <eric....@linaro.org>
Signed-off-by: Ankit Jindal <aji...@apm.com>
Signed-off-by: Pranavkumar Sawargaonkar <prana...@linaro.org>
Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>
---
drivers/vfio/vfio.c | 39 ++++++++++
drivers/vfio/vfio_iommu_type1.c | 163 ++++++++++++++++++++++++++++++++++++++--
include/linux/vfio.h | 34 ++++++++-
3 files changed, 228 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 82f25cc..3d9de00 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -268,6 +268,45 @@ void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
}
EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);

+int vfio_group_alloc_map_reserved_iova(struct vfio_group *group,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova)
+{
+ struct vfio_container *container = group->container;
+ const struct vfio_iommu_driver_ops *ops = container->iommu_driver->ops;
+ int ret;
+
+ if (!ops->alloc_map_reserved_iova)
+ return -EINVAL;
+
+ down_read(&container->group_lock);
+ ret = ops->alloc_map_reserved_iova(container->iommu_data,
+ group->iommu_group,
+ addr, prot, iova);
+ up_read(&container->group_lock);
+ return ret;
+
+}
+EXPORT_SYMBOL_GPL(vfio_group_alloc_map_reserved_iova);
+
+int vfio_group_unmap_free_reserved_iova(struct vfio_group *group,
+ dma_addr_t iova)
+{
+ struct vfio_container *container = group->container;
+ const struct vfio_iommu_driver_ops *ops = container->iommu_driver->ops;
+ int ret;
+
+ if (!ops->unmap_free_reserved_iova)
+ return -EINVAL;
+
+ down_read(&container->group_lock);
+ ret = ops->unmap_free_reserved_iova(container->iommu_data,
+ group->iommu_group, iova);
+ up_read(&container->group_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_group_unmap_free_reserved_iova);
+
/**
* Group minor allocation/free - both called with vfio.group_lock held
*/
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 33304c0..a79e2a8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -156,6 +156,19 @@ static void vfio_unlink_reserved_binding(struct vfio_domain *d,
rb_erase(&old->node, &d->reserved_binding_list);
}

+static void vfio_reserved_binding_release(struct kref *kref)
+{
+ struct vfio_reserved_binding *b =
+ container_of(kref, struct vfio_reserved_binding, kref);
+ struct vfio_domain *d = b->domain;
+ unsigned long order = __ffs(b->size);
+
+ iommu_unmap(d->domain, b->iova, b->size);
+ free_iova(d->reserved_iova_domain, b->iova >> order);
+ vfio_unlink_reserved_binding(d, b);
+ kfree(b);
+}
+
/*
* This code handles mapping and unmapping of user data buffers
* into DMA'ble space using the IOMMU
@@ -1034,6 +1047,138 @@ done:
mutex_unlock(&iommu->lock);
}

+static struct vfio_domain *vfio_find_iommu_domain(void *iommu_data,
+ struct iommu_group *group)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_group *g;
+ struct vfio_domain *d;
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ list_for_each_entry(g, &d->group_list, next) {
+ if (g->iommu_group == group)
+ return d;
+ }
+ }
+ return NULL;
+}
+
+static int vfio_iommu_type1_alloc_map_reserved_iova(void *iommu_data,
+ struct iommu_group *group,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_domain *d;
+ uint64_t mask, iommu_page_size;
+ struct vfio_reserved_binding *b;
+ unsigned long order;
+ struct iova *p_iova;
+ phys_addr_t aligned_addr, offset;
+ int ret = 0;
+
+ order = __ffs(vfio_pgsize_bitmap(iommu));
+ iommu_page_size = (uint64_t)1 << order;
+ mask = iommu_page_size - 1;
+ aligned_addr = addr & ~mask;
+ offset = addr - aligned_addr;
+
+ mutex_lock(&iommu->lock);
+
+ d = vfio_find_iommu_domain(iommu_data, group);
+ if (!d) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ b = vfio_find_reserved_binding(d, aligned_addr, iommu_page_size);
+ if (b) {
+ ret = 0;
+ *iova = b->iova + offset;
+ kref_get(&b->kref);
+ goto unlock;
+ }
+
+ /* allocate a new reserved IOVA page and a new binding node */
+ p_iova = alloc_iova(d->reserved_iova_domain, 1,
+ d->reserved_iova_domain->dma_32bit_pfn, true);
+ if (!p_iova) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+ *iova = p_iova->pfn_lo << order;
+
+ b = kzalloc(sizeof(*b), GFP_KERNEL);
+ if (!b) {
+ ret = -ENOMEM;
+ goto free_iova_unlock;
+ }
+
+ ret = iommu_map(d->domain, *iova, aligned_addr, iommu_page_size, prot);
+ if (ret)
+ goto free_binding_iova_unlock;
+
+ kref_init(&b->kref);
+ kref_get(&b->kref);
+ b->domain = d;
+ b->addr = aligned_addr;
+ b->iova = *iova;
+ b->size = iommu_page_size;
+ vfio_link_reserved_binding(d, b);
+ *iova += offset;
+
+ goto unlock;
+
+free_binding_iova_unlock:
+ kfree(b);
+free_iova_unlock:
+ free_iova(d->reserved_iova_domain, *iova >> order);
+unlock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static int vfio_iommu_type1_unmap_free_reserved_iova(void *iommu_data,
+ struct iommu_group *group,
+ dma_addr_t iova)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_reserved_binding *b;
+ struct vfio_domain *d;
+ phys_addr_t aligned_addr;
+ dma_addr_t aligned_iova, iommu_page_size, mask, offset;
+ unsigned long order;
+ int ret = 0;
+
+ order = __ffs(vfio_pgsize_bitmap(iommu));
+ iommu_page_size = (uint64_t)1 << order;
+ mask = iommu_page_size - 1;
+ aligned_iova = iova & ~mask;
+ offset = iova - aligned_iova;
+
+ mutex_lock(&iommu->lock);
+
+ d = vfio_find_iommu_domain(iommu_data, group);
+ if (!d) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ aligned_addr = iommu_iova_to_phys(d->domain, aligned_iova);
+
+ b = vfio_find_reserved_binding(d, aligned_addr, iommu_page_size);
+ if (!b) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ kref_put(&b->kref, vfio_reserved_binding_release);
+
+unlock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
static void *vfio_iommu_type1_open(unsigned long arg)
{
struct vfio_iommu *iommu;
@@ -1180,13 +1325,17 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
}

static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
- .name = "vfio-iommu-type1",
- .owner = THIS_MODULE,
- .open = vfio_iommu_type1_open,
- .release = vfio_iommu_type1_release,
- .ioctl = vfio_iommu_type1_ioctl,
- .attach_group = vfio_iommu_type1_attach_group,
- .detach_group = vfio_iommu_type1_detach_group,
+ .name = "vfio-iommu-type1",
+ .owner = THIS_MODULE,
+ .open = vfio_iommu_type1_open,
+ .release = vfio_iommu_type1_release,
+ .ioctl = vfio_iommu_type1_ioctl,
+ .attach_group = vfio_iommu_type1_attach_group,
+ .detach_group = vfio_iommu_type1_detach_group,
+ .alloc_map_reserved_iova =
+ vfio_iommu_type1_alloc_map_reserved_iova,
+ .unmap_free_reserved_iova =
+ vfio_iommu_type1_unmap_free_reserved_iova,
};

static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 610a86a..0020f81 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -75,7 +75,13 @@ struct vfio_iommu_driver_ops {
struct iommu_group *group);
void (*detach_group)(void *iommu_data,
struct iommu_group *group);
-
+ int (*alloc_map_reserved_iova)(void *iommu_data,
+ struct iommu_group *group,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova);
+ int (*unmap_free_reserved_iova)(void *iommu_data,
+ struct iommu_group *group,
+ dma_addr_t iova);
};

extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -138,4 +144,30 @@ extern int vfio_virqfd_enable(void *opaque,
void *data, struct virqfd **pvirqfd, int fd);
extern void vfio_virqfd_disable(struct virqfd **pvirqfd);

+/**
+ * vfio_group_alloc_map_reserved_iova: allocates a new iova page and map
+ * it onto the aligned physical page that contains a given physical addr.
+ * page size is the domain iommu page size.
+ *
+ * @group: vfio group handle
+ * @addr: physical address to map
+ * @prot: protection attribute
+ * @iova: returned iova that is mapped onto addr
+ *
+ * returns 0 on success, < 0 on failure
+ */
+extern int vfio_group_alloc_map_reserved_iova(struct vfio_group *group,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova);
+/**
+ * vfio_group_unmap_free_reserved_iova: unmap and free the reserved iova page
+ *
+ * @group: vfio group handle
+ * @iova: base iova, must be aligned on the IOMMU page size
+ *
+ * returns 0 on success, < 0 on failure
+ */
+extern int vfio_group_unmap_free_reserved_iova(struct vfio_group *group,
+ dma_addr_t iova);
+
#endif /* VFIO_H */
--
1.9.1

Eric Auger

Jan 26, 2016, 8:16:28 AM
This patch allows userspace to retrieve whether MSI write
transaction addresses must be mapped. This is returned through the
VFIO_IOMMU_GET_INFO API using a new flag: VFIO_IOMMU_INFO_REQUIRE_MSI_MAP.

Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>
Signed-off-by: Eric Auger <eric....@linaro.org>

---

RFC v1 -> RFC v2:
- derived from
[RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
- renamed allow_msi_reconfig into require_msi_mapping
- fixed VFIO_IOMMU_GET_INFO
---
drivers/vfio/vfio_iommu_type1.c | 26 ++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 1 +
2 files changed, 27 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 6f1ea3d..c5b57e1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -255,6 +255,29 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
}

/*
+ * vfio_domains_require_msi_mapping: indicates whether MSI write transaction
+ * addresses must be mapped
+ *
+ * returns true if it does
+ */
+static bool vfio_domains_require_msi_mapping(struct vfio_iommu *iommu)
+{
+ struct vfio_domain *d;
+ bool ret;
+
+ mutex_lock(&iommu->lock);
+ /* All domains have same require_msi_map property, pick first */
+ d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
+ if (iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_MAPPING, NULL) < 0)
+ ret = false;
+ else
+ ret = true;
+ mutex_unlock(&iommu->lock);
+
+ return ret;
+}
+
+/*
* Attempt to pin pages. We really don't want to track all the pfns and
* the iommu can only map chunks of consecutive pfns anyway, so get the
* first page and all consecutive pages with the same locking.
@@ -997,6 +1020,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,

info.flags = VFIO_IOMMU_INFO_PGSIZES;

+ if (vfio_domains_require_msi_mapping(iommu))
+ info.flags |= VFIO_IOMMU_INFO_REQUIRE_MSI_MAP;
+
info.iova_pgsizes = vfio_pgsize_bitmap(iommu);

return copy_to_user((void __user *)arg, &info, minsz);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 7d7a4c6..43e183b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -400,6 +400,7 @@ struct vfio_iommu_type1_info {
__u32 argsz;
__u32 flags;
#define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
+#define VFIO_IOMMU_INFO_REQUIRE_MSI_MAP (1 << 1)/* MSI must be mapped */
__u64 iova_pgsizes; /* Bitmap of supported page sizes */
};

--
1.9.1

Eric Auger

Jan 26, 2016, 8:17:20 AM
Introduce a new DOMAIN_ATTR_MSI_MAPPING domain attribute. If supported,
this means the MSI addresses need to be mapped in the IOMMU. ARM SMMUs
and FSL PAMU, at least, expose this attribute.

x86 IOMMUs typically don't expose the attribute since on x86 MSI write
transaction addresses always fall within the 1MB PA region [FEE0_0000h -
FEF0_0000h], which directly targets the APIC configuration space and
hence bypasses the IOMMU.

Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>
Signed-off-by: Eric Auger <eric....@linaro.org>

---

to be honest I wonder whether this is a property of the SMMU. Isn't it
a platform property?

RFC v1 -> RFC v2:
- the data field is not used
- for this attribute domain_get_attr simply returns 0 if the MSI_MAPPING
capability is needed or <0 if not.
- removed struct iommu_domain_msi_maps
---
drivers/iommu/arm-smmu.c | 2 ++
drivers/iommu/fsl_pamu_domain.c | 3 +++
include/linux/iommu.h | 1 +
3 files changed, 6 insertions(+)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 59ee4b8..c8b7e71 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1409,6 +1409,8 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
case DOMAIN_ATTR_NESTING:
*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
return 0;
+ case DOMAIN_ATTR_MSI_MAPPING:
+ return 0;
default:
return -ENODEV;
}
diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index da0e1e3..dd2e0d6 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -856,6 +856,9 @@ static int fsl_pamu_get_domain_attr(struct iommu_domain *domain,
case DOMAIN_ATTR_FSL_PAMUV1:
*(int *)data = DOMAIN_ATTR_FSL_PAMUV1;
break;
+ case DOMAIN_ATTR_MSI_MAPPING:
+ ret = 0;
+ break;
default:
pr_debug("Unsupported attribute type\n");
ret = -EINVAL;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f28dff3..3ae2fb6 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -112,6 +112,7 @@ enum iommu_attr {
DOMAIN_ATTR_FSL_PAMU_ENABLE,
DOMAIN_ATTR_FSL_PAMUV1,
DOMAIN_ATTR_NESTING, /* two stages of translation */
+ DOMAIN_ATTR_MSI_MAPPING, /* Require MSIs mapping in iommu */
DOMAIN_ATTR_MAX,
};

--
1.9.1

Eric Auger

Jan 26, 2016, 10:14:52 AM
Hi,
On 01/26/2016 03:43 PM, kbuild test robot wrote:
> Hi Eric,
>
> [auto build test WARNING on v4.5-rc1]
> [also build test WARNING on next-20160125]
> [cannot apply to iommu/next]
> [if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
>
> url: https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64/20160126-211921
> config: i386-allmodconfig (attached as .config)
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All warnings (new ones prefixed by >>):
>
> drivers/vfio/pci/vfio_pci_intrs.c: In function 'vfio_set_mapped_msi_addr':
>>> drivers/vfio/pci/vfio_pci_intrs.c:324:43: warning: left shift count >= width of type [-Wshift-count-overflow]
> msi_addr = (phys_addr_t)(msg.address_hi) << 32 |
> ^
>>> drivers/vfio/pci/vfio_pci_intrs.c:333:34: warning: right shift count >= width of type [-Wshift-count-overflow]
> msg.address_hi = (u32)(msi_iova >> 32);
I definitely need to revisit that code. I better understand Alex's
comment now :-(

Thanks

Eric
> ^
>
> vim +324 drivers/vfio/pci/vfio_pci_intrs.c
>
> 318 dma_addr_t msi_iova;
> 319 struct vfio_group *group = vdev->vfio_group;
> 320 struct msi_msg msg;
> 321 int ret;
> 322
> 323 get_cached_msi_msg(irq, &msg);
> > 324 msi_addr = (phys_addr_t)(msg.address_hi) << 32 |
> 325 (phys_addr_t)(msg.address_lo);
> 326
> 327 ret = vfio_group_alloc_map_reserved_iova(group, msi_addr,
> 328 IOMMU_WRITE, &msi_iova);
> 329 if (ret)
> 330 goto out;
> 331
> 332 /* Re-program the msi-address with the iova */
> > 333 msg.address_hi = (u32)(msi_iova >> 32);
> 334 msg.address_lo = (u32)(msi_iova & 0xffffffff);
> 335 pci_write_msi_msg(irq, &msg);
> 336
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation
>

Eric Auger

Jan 26, 2016, 11:38:18 AM
Hi,
On 01/26/2016 05:17 PM, kbuild test robot wrote:
> Hi Eric,
>
> [auto build test ERROR on v4.5-rc1]
> [also build test ERROR on next-20160125]
> [cannot apply to iommu/next]
> [if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
>
> url: https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64/20160126-211921
> config: x86_64-randconfig-s3-01262306 (attached as .config)
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64
>
> All errors (new ones prefixed by >>):
>
>>> ERROR: "alloc_iova" [drivers/vfio/vfio_iommu_type1.ko] undefined!
>>> ERROR: "free_iova" [drivers/vfio/vfio_iommu_type1.ko] undefined!

I will protect the code with CONFIG_IOMMU_IOVA if the usage of those
functions is acknowledged.

Thanks

Eric

Pavel Fedin

Jan 26, 2016, 12:25:21 PM
Hello!
I'd like just to clarify some things for myself and better wrap my head around it...

> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_000h] are directed
> as interrupt messages: accesses to this special PA window directly target the
> APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.

So, this is effectively the same as always having hardwired 1:1 mappings on all IOMMUs, isn't it?
If so, then why can't we just do the same, by forcing a similar 1:1 mapping? This is what I tried to do in my patchset. All of
you are talking about a situation which arises when we are emulating a different machine with a different physical address layout.
E.g. if our host has MSI at 0xABADCAFE, our target could have valid RAM at the same location, and we need to handle it somehow,
therefore we have to move our MSI window out of the target's RAM. But how does this work on a PC then? What if our host is a PC, and
we want to emulate some ARM board which has RAM at FE00 0000? Or does it mean that the PC architecture is flawed and can reliably
handle PCI passthrough only for itself?

Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia


Eric Auger

Jan 27, 2016, 3:53:00 AM
Hi Pavel,
Alex answered to this I think:
"
x86 isn't problem-free in this space. An x86 VM is going to know that
the 0xfee00000 address range is special, it won't be backed by RAM and
won't be a DMA target, thus we'll never attempt to map it for an iova
address. However, if we run a non-x86 VM or a userspace driver, it
doesn't necessarily know that there's anything special about that range
of iovas. I intend to resolve this with an extension to the iommu info
ioctl that describes the available iova space for the iommu. The
interrupt region would simply be excluded.
"

I am not sure I've addressed this requirement yet but it seems more
future-proof to have an IOMMU mapping for those addresses.

For the ARM use case I think Marc gave guidance:
"
We want userspace to be in control of the memory map, and it
is the kernel's job to tell us whether or not this matches the HW
capabilities or not. A fixed mapping may completely clash with the
memory map I want (think emulating HW x on platform y), and there is no
reason why we should have the restrictions x86 has.
"

That's the rationale behind respining that way.

Waiting for other comments & discussions, I am going to address the iova
and dma_addr_t compilation issues reported by kbuild. My apologies for
those.

Best Regards

Eric

Pavel Fedin

Jan 28, 2016, 2:13:34 AM
Hello!

> x86 isn't problem-free in this space. An x86 VM is going to know that
> the 0xfee00000 address range is special, it won't be backed by RAM and
> won't be a DMA target, thus we'll never attempt to map it for an iova
> address. However, if we run a non-x86 VM or a userspace driver, it
> doesn't necessarily know that there's anything special about that range
> of iovas. I intend to resolve this with an extension to the iommu info
> ioctl that describes the available iova space for the iommu. The
> interrupt region would simply be excluded.

I see now, but I still don't understand how it would work. How can we tell the guest OS that we cannot do DMA to this particular
area? Just exclude it from RAM entirely? But this means we would have to modify the machine's model...
I know that this is a bit of a different story from what we are implementing now. Just curious.

Eric Auger

Jan 28, 2016, 4:50:32 AM
Hi Pavel,
Well in QEMU mach-virt we have a static guest PA memory map. Maybe in
some other virt machines this is different and it is possible to take
into account the fact an IOVA range cannot be used?

Regards

Eric

Alex Williamson

Jan 28, 2016, 4:52:14 PM
On Tue, 2016-01-26 at 13:12 +0000, Eric Auger wrote:
> This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
> It pursues the efforts done on [1], [2], [3]. It also aims at covering the
> same need on some PowerPC platforms.
> 
> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_000h] are directed
> as interrupt messages: accesses to this special PA window directly target the
> APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.
> 
> This is not the case on above mentionned platforms where MSI messages emitted
> by devices are conveyed through the IOMMU. This means an IOVA/host PA mapping
> must exist for the MSI to reach the MSI controller. Normal way to create
> IOVA bindings consists in using VFIO DMA MAP API. However in this case
> the MSI IOVA is not mapped onto guest RAM but on host physical page (the MSI
> controller frame).
> 
> Following first comments, the spirit of [2] is kept: the guest registers
> an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver allocates
> its MSI vectors, it overwrites the MSI controller physical address with an IOVA,
> allocated within the window provided by the userspace. This IOVA is mapped
> onto the MSI controller frame physical page.
> 
> The series does not address yet the problematic of telling the userspace how
> much IOVA he should provision.

I'm sort of on a think-different approach today, so bear with me; how is
it that x86 can make interrupt remapping so transparent to drivers like
vfio-pci while for ARM and ppc we seem to be stuck with doing these
fixups of the physical vector ourselves, implying ugly (no offense)
paths bouncing through vfio to connect the driver and iommu backends?

We know that x86 handles MSI vectors specially, so there is some
hardware that helps the situation.  It's not just that x86 has a fixed
range for MSI, it's how it manages that range when interrupt remapping
hardware is enabled.  A device table indexed by source-ID references a
per device table indexed by data from the MSI write itself.  So we get
much, much finer granularity, but there's still effectively an interrupt
domain per device that's being transparently managed under the covers
whenever we request an MSI vector for a device.

So why can't we do something more like that here?  There's no predefined
MSI vector range, so defining an interface for the user to specify that
is unavoidable.  But why shouldn't everything else be transparent?  We
could add an interface to the IOMMU API that allows us to register that
reserved range for the IOMMU domain.  IOMMU-core (or maybe interrupt
remapping) code might allocate an IOVA domain for this just as you've
done in the type1 code here.  But rather than having any interaction
with vfio-pci, why not do this at lower levels such that the platform
interrupt vector allocation code automatically uses one of those IOVA
ranges and returns the IOVA rather than the physical address for the PCI
code to program into the device?  I think we know what needs to be done,
but we're taking the approach of managing the space ourselves and doing
a fixup of the device after the core code has done its job when we
really ought to be letting the core code manage a space that we define
and programming the device so that it doesn't need a fixup in the
vfio-pci code.  Wouldn't it be nicer if pci_enable_msix_range() returned
with the device properly programmed or generate an error if there's not
enough reserved mapping space in IOMMU domain? Can it be done?  Thanks,

Alex

Eric Auger

Jan 29, 2016, 9:35:58 AM
Hi Alex,
About the granularity, I think ARM GICv3 now provides a similar
capability with the GICv3 ITS (Interrupt Translation Service). Along
with the MSI write transaction, the device outputs a DeviceID conveyed
on the bus. This DeviceID (~ your source-ID) is used to index a device
table. The entry in the device table points to a per-device interrupt
translation table indexed by the EventID found in the MSI message. The
entry in that interrupt translation table eventually gives the
interrupt ID targeted by the MSI.
This translation capability is not available in GICv2m though, i.e. the
one I am currently using.

Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)

> but there's still effectively an interrupt
> domain per device that's being transparently managed under the covers
> whenever we request an MSI vector for a device.
>
> So why can't we do something more like that here? There's no predefined
> MSI vector range, so defining an interface for the user to specify that
> is unavoidable.
Do you confirm that the VFIO user API still is the right choice to
provide that IOVA range?
> But why shouldn't everything else be transparent? We
> could add an interface to the IOMMU API that allows us to register that
> reserved range for the IOMMU domain. IOMMU-core (or maybe interrupt
> remapping) code might allocate an IOVA domain for this just as you've
> done in the type1 code here.
I have no objection to moving that iova allocation scheme somewhere else.
I just need to figure out how to deal with the fact that iova.c is not
compiled everywhere, as I noticed too late ;-)

> But rather than having any interaction
> with vfio-pci, why not do this at lower levels such that the platform
> interrupt vector allocation code automatically uses one of those IOVA
> ranges and returns the IOVA rather than the physical address for the PCI
> code to program into the device? I think we know what needs to be done,
> but we're taking the approach of managing the space ourselves and doing
> a fixup of the device after the core code has done its job when we
> really ought to be letting the core code manage a space that we define
> and programming the device so that it doesn't need a fixup in the
> vfio-pci code. Wouldn't it be nicer if pci_enable_msix_range() returned
> with the device properly programmed or generate an error if there's not
> enough reserved mapping space in IOMMU domain? Can it be done?
I agree with you that it would be cleaner to manage that natively
at the MSI controller level instead of patching the address value in
vfio_pci_intrs.c. I will investigate in that direction but I need some
more time to understand the links between the MSI controller, the PCI
device and the IOMMU.

Best Regards

Eric

> Thanks,
>
> Alex

Alex Williamson

Jan 29, 2016, 2:33:36 PM
So it sounds like the interrupt remapping plumbing needs to be
implemented for those chips.  How does ITS identify an MSI versus any
other DMA write?  Does it need to be within a preconfigured address
space like on x86 or does it know this implicitly by the transaction
(which doesn't seem possible on PCIe)?

Along with this discussion, we should probably be revisiting whether
existing ARM SMMUs should be exposing the IOMMU_CAP_INTR_REMAP
capability.  This capability is meant to indicate interrupt isolation,
but if an entire page of IOVA space is mapped through the IOMMU to a
range of interrupts and some of those interrupts are shared with host
devices or other VMs, then we really don't have that isolation and the
system is susceptible to one VM interfering with another or with the
host.  If that's the case, the SMMU should not be claiming
IOMMU_CAP_INTR_REMAP.

>  but there's still effectively an interrupt
> > domain per device that's being transparently managed under the covers
> > whenever we request an MSI vector for a device.
> > 
> > So why can't we do something more like that here?  There's no predefined
> > MSI vector range, so defining an interface for the user to specify that
> > is unavoidable.
> Do you confirm that VFIO user API still still is the good choice to
> provide that IOVA range?

I don't see that we have an option there unless ARM wants to
retroactively reserve a range of IOVA space in the spec, which is
certainly not going to happen.  The only other thing that comes to mind
would be if there was an existing address space which could never be
backed by RAM or other DMA capable targets.  But that seems far fetched
as well.
Since the current interrupt remapping schemes seem to operate in a
different address space, I expect there will be work to do to fit the
interrupt remapping within a provided address space, but it seems like a
very reasonable constraint to add.  Thanks,

Alex

Eric Auger

Jan 29, 2016, 4:26:24 PM
Hi Alex,
It seems there is a misunderstanding here. Assuming a "simple"
system with a single ITS, all devices likely to produce MSIs must write
those messages into a single register, located in the ITS 64kB MSI frame
(this register is called GITS_TRANSLATER). The ITS then discriminates
between senders using the DeviceID conveyed out-of-band on the bus (or
by other implementation-defined means). For each DeviceID an interrupt
translation table is supposed to exist, otherwise the write faults. If
any "undeclared" device writes into that register, its DeviceID will be
unknown. On Intel the interrupt remapping HW is instead abstracted on
the IOMMU side; I have not yet taken the time to read the VT-d spec
carefully, but maybe the Intel interrupt remapping HW acts as an IOMMU
that takes an input MSI address within the famous window and applies a
translation scheme based on the MSI address & data? On ARM the input
MSI address is always the GITS_TRANSLATER, and the translation scheme is
then based on out-of-band info (DeviceID) plus the data content
(EventID). I hope this clarifies things.
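To make this concrete, here is a standalone sketch of the two-level translation described above: a device table indexed by DeviceID points to a per-device interrupt translation table (ITT) indexed by EventID, and an unknown DeviceID or EventID faults. All names and sizes are illustrative, not the actual irq-gic-v3-its.c structures.

```c
#include <assert.h>

#define NR_DEVICES 16   /* illustrative table sizes */
#define NR_EVENTS  8
#define NO_LPI     (-1) /* "fault": no translation exists */

struct itt_entry { int lpi; };               /* final interrupt ID (LPI) */
struct dev_entry { struct itt_entry *itt; }; /* per-DeviceID ITT pointer */

static struct dev_entry device_table[NR_DEVICES];
static struct itt_entry itts[NR_DEVICES][NR_EVENTS];

/* Populate a translation, lazily attaching an ITT to the DeviceID. */
static void its_map(unsigned int devid, unsigned int eventid, int lpi)
{
	unsigned int i;

	if (!device_table[devid].itt) {
		device_table[devid].itt = itts[devid];
		for (i = 0; i < NR_EVENTS; i++)
			itts[devid][i].lpi = NO_LPI;
	}
	itts[devid][eventid].lpi = lpi;
}

/* Translate the (DeviceID, EventID) pair of an MSI write into an LPI. */
static int its_translate(unsigned int devid, unsigned int eventid)
{
	if (devid >= NR_DEVICES || !device_table[devid].itt)
		return NO_LPI; /* "undeclared" device: unknown DeviceID */
	if (eventid >= NR_EVENTS)
		return NO_LPI;
	return device_table[devid].itt[eventid].lpi;
}
```

Two devices can then use the same EventID (MSI data) without clashing, since each DeviceID selects its own ITT; that is exactly the isolation GICv2m lacks.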
>
> Along with this discussion, we should probably be revisiting whether
> existing ARM SMMUs should be exposing the IOMMU_CAP_INTR_REMAP
> capability.

so according to the above explanation I am not sure it is relevant.
Will/Marc might correct me if I got anything wrong.
> This capability is meant to indicate interrupt isolation,
> but if an entire page of IOVA space is mapped through the IOMMU to a
> range of interrupts and some of those interrupts are shared with host
> devices or other VMs, then we really don't have that isolation and the
> system is susceptible to one VM interfering with another or with the
> host. If that's the case, the SMMU should not be claiming
> IOMMU_CAP_INTR_REMAP.
My understanding is that a PCI device working for the host must have its
own DeviceID translation table, while a device assigned to a guest needs
a separate one. Each of those will then trigger different final
interrupt IDs in separate domains.

To be honest, so far I was not addressing the ITS case but
just the simpler GICv2m case, where there is no interrupt translation.
In GICv2m with a single 4kB MSI frame you still have a single register
written by devices. The msg data content then selects a given interrupt ID.

My kernel series "just" aimed at allowing the device to reach the
physical address of the GICv2m MSI frame through the IOMMU.
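For comparison, a GICv2m MSI message can be sketched as below: every device is programmed with the same doorbell address (the frame's MSI_SETSPI_NS register; the 0x40 offset comes from the GICv2m spec, the struct and function names are illustrative, not kernel code) and only the data payload selects the SPI.

```c
#include <assert.h>
#include <stdint.h>

#define V2M_MSI_SETSPI_NS 0x040 /* doorbell register offset in the 4kB frame */

struct msi_msg { uint64_t address; uint32_t data; };

/* Compose the MSI message for one SPI: the address part is identical for
 * every device using this frame, so the data payload is the only
 * discriminant (no DeviceID is conveyed). */
static struct msi_msg v2m_compose_msg(uint64_t frame_base, uint32_t spi)
{
	struct msi_msg msg = {
		.address = frame_base + V2M_MSI_SETSPI_NS,
		.data    = spi,
	};
	return msg;
}
```

Since any device that can DMA to frame_base can forge any data value, nothing distinguishes senders at the frame level.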

But you're right, I think I should take a larger view of what is
targeted with the ITS. In GICv2m with a single MSI frame the
discrimination only works on the MSI data (there is no DeviceID). It is
however also possible to have several GICv2m 4kB MSI frames, in which
case you can give one 4kB MSI frame per VM, but that is yet another use
case. My AMD system currently exposes a single MSI frame, in which case
we have poor isolation, as you say.
>
>> but there's still effectively an interrupt
>>> domain per device that's being transparently managed under the covers
>>> whenever we request an MSI vector for a device.
>>>
>>> So why can't we do something more like that here? There's no predefined
>>> MSI vector range, so defining an interface for the user to specify that
>>> is unavoidable.
>> Do you confirm that the VFIO user API is still the right choice to
>> provide that IOVA range?
>
> I don't see that we have an option there unless ARM wants to
> retroactively reserve a range of IOVA space in the spec, which is
> certainly not going to happen. The only other thing that comes to mind
> would be if there was an existing address space which could never be
> backed by RAM or other DMA capable targets. But that seems far fetched
> as well.
I don't think there is a plan for such a change, and I am afraid we need
to integrate all the above configurations (GICv2m with a single frame,
GICv2m with several frames, ITS, and there may be others not covered
here that I am not aware of).
I hope this discussion helps. Please, ARM guys, correct me if anything
is unclear or wrong.

Thank you for reading up to here and have a nice weekend!

Best Regards

Eric
>
> Alex
>

Will Deacon

Feb 1, 2016, 9:04:01 AM2/1/16
to Eric Auger, Alex Williamson, eric....@st.com, christof...@linaro.org, marc.z...@arm.com, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, suravee.su...@amd.com, linux-...@vger.kernel.org, pat...@linaro.org, io...@lists.linux-foundation.org
On Fri, Jan 29, 2016 at 10:25:52PM +0100, Eric Auger wrote:
> On 01/29/2016 08:33 PM, Alex Williamson wrote:
> >>> We know that x86 handles MSI vectors specially, so there is some
> >>> hardware that helps the situation. It's not just that x86 has a fixed
> >>> range for MSI, it's how it manages that range when interrupt remapping
> >>> hardware is enabled. A device table indexed by source-ID references a
> >>> per device table indexed by data from the MSI write itself. So we get
> >>> much, much finer granularity,
> >> About the granularity, I think ARM GICv3 now provides a similar
> >> capability with GICv3 ITS (interrupt translation service). Along with
> >> the MSI MSG write transaction, the device outputs a DeviceID conveyed on
> >> the bus. This DeviceID (~ your source-ID) enables to index a device
> >> table. The entry in the device table points to a DeviceId interrupt
> >> translation table indexed by the EventID found in the msi msg. So the
> >> entry in the interrupt translation table eventually gives you the
> >> eventual interrupt ID targeted by the MSI MSG.
> >> This translation capability if not available in GICv2M though, ie. the
> >> one I am currently using.
> >>
> >> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)

That's right. GICv3/ITS disambiguates the interrupt source using the
DeviceID, which for PCI is derived from the Requester ID of the endpoint.
GICv2m is less flexible and requires a separate physical frame per guest
to achieve isolation.

Will

Christoffer Dall

Feb 3, 2016, 7:50:33 AM2/3/16
to Will Deacon, Eric Auger, Alex Williamson, eric....@st.com, marc.z...@arm.com, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, suravee.su...@amd.com, linux-...@vger.kernel.org, pat...@linaro.org, io...@lists.linux-foundation.org
We should still support MSI passthrough with a single MSI frame host
system though, right?

(Users should just be aware that guests are not fully protected against
misbehaving hardware in that case).

-Christoffer

Will Deacon

Feb 3, 2016, 8:11:17 AM2/3/16
to Christoffer Dall, Eric Auger, Alex Williamson, eric....@st.com, marc.z...@arm.com, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, suravee.su...@amd.com, linux-...@vger.kernel.org, pat...@linaro.org, io...@lists.linux-foundation.org
I think we should treat the frame as an exclusive resource and assign it
to a single VM.

> (Users should just be aware that guests are not fully protected against
> misbehaving hardware in that case).

Is it confined to misbehaving hardware? What if a malicious/buggy guest
configures its device to DMA all over the doorbell?

Will

Christoffer Dall

Feb 3, 2016, 10:35:46 AM2/3/16
to Will Deacon, Eric Auger, Alex Williamson, eric....@st.com, marc.z...@arm.com, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, suravee.su...@amd.com, linux-...@vger.kernel.org, pat...@linaro.org, io...@lists.linux-foundation.org
so on a single frame GICv2m system, either your host or a single VM gets
to do MSIs...

>
> > (Users should just be aware that guests are not fully protected against
> > misbehaving hardware in that case).
>
> Is it confined to misbehaving hardware? What if a malicious/buggy guest
> configures its device to DMA all over the doorbell?
>
I guess not. I suppose we can't trap every configuration access and
mediate it for every device. Bummer.

-Christoffer

Eric Auger

Feb 5, 2016, 12:33:10 PM2/5/16
to Christoffer Dall, Will Deacon, Alex Williamson, eric....@st.com, marc.z...@arm.com, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, suravee.su...@amd.com, linux-...@vger.kernel.org, pat...@linaro.org, io...@lists.linux-foundation.org
Hi Alex,

I tried to sketch a proposal for guaranteeing IRQ integrity when
doing ARM PCI/MSI passthrough with the ARM GICv2m MSI controller. It is
based on an extended VFIO group viability check, as detailed below.

As opposed to the ARM GICv3 ITS, this MSI controller does *not* support
IRQ remapping. It can expose 1 or more 4kB MSI frames. Each frame
contains a single register where the MSI data is written.

I would be grateful if you could tell me whether it makes any sense.

Thanks in advance

Best Regards

Eric


1) GICv2m with a single 4kB frame
All devices having this MSI controller as MSI parent share this
single MSI frame. Those devices can work on behalf of the host
or on behalf of 1 or more guests (KVM assigned devices). We
must make sure that either the host only or a single VM can access
the single frame, to guarantee interrupt integrity: a device assigned
to one VM should not be able to trigger MSIs targeting the host
or another VM.

I would propose to extend the VFIO notion of group viability.
Currently a VFIO group is viable if:
all devices belonging to the same group are bound to a VFIO driver
or unbound.

Let's imagine we extend the viability check as follows:

0) keep the current viability check: all the devices belonging to
the group must be VFIO bound or unbound.
1) retrieve the MSI parent of the device and list all the
other devices using that MSI controller as MSI parent (this does not
look straightforward).
2) those devices must be VFIO driver bound or unbound as well (meaning
they are not used by the host). If not, reject the device attachment.
- in case they are VFIO bound (a VFIO group is set):
x if all their VFIO containers are the same as that of the device
we try to attach, that's OK. The other devices may use different
IOMMU mappings and eventually target the MSI frame, but they all
work for the same userspace client/VM.
x if 1 or more devices have a different container than the device
under attachment, they work on behalf of a different userspace
client/VM and we can't attach the new device. I think there is a
case however where several containers can be opened by a single QEMU.

Of course the dynamic aspects, i.e. a new device showing up or an unbind
event, bring significant complexity.
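As a rough illustration of the steps above, a sketch under hypothetical structures (none of the types or fields below exist in VFIO; a real implementation would walk the driver model instead):

```c
#include <assert.h>
#include <stddef.h>

struct container { int id; };

/* Hypothetical flat device model, for illustration only. */
struct device {
	int msi_parent;               /* MSI controller this device uses */
	int vfio_bound;               /* bound to a VFIO driver? */
	int host_bound;               /* bound to a host (non-VFIO) driver? */
	const struct container *cont; /* container, if VFIO-bound */
};

/* A candidate attachment passes the extended check only if every other
 * device sharing its MSI parent is VFIO-bound or unbound, and every
 * VFIO-bound one belongs to the same container (same userspace client). */
static int msi_viable(const struct device *devs, size_t n,
		      const struct device *candidate)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct device *d = &devs[i];

		if (d == candidate || d->msi_parent != candidate->msi_parent)
			continue;
		if (d->host_bound)	/* step 2: in use by the host */
			return 0;
		if (d->vfio_bound && d->cont != candidate->cont)
			return 0;	/* different userspace client/VM */
	}
	return 1;
}
```

The dynamic cases (hotplug, unbind) would have to re-run this check, which is where the real complexity lies.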

2) GICv2m with multiple 4kB frames
Each MSI frame is enumerated as an MSI controller. The device tree
statically defines which devices are attached to each MSI frame.
In case devices are assigned we cannot change this attachment
anyway, since there might be physical constraints behind it.
So devices likely to be assigned to guests should be linked to a
different MSI frame than devices that are not.

I think the extended viability concept can be used here as well.

This model is still not ideal: in case we have an SR-IOV device
plugged onto a host bridge attached to a single MSI parent, you won't
be able to have 1 Virtual Function working for the host and 1 VF
working for a guest. Only interrupt translation (ITS) will bring that
feature.

3) GICv3 ITS
This one supports an interrupt translation service, similar to Intel
IRQ remapping.
This means a single frame can be used by all devices. A DeviceID is
used exclusively by the host or by a guest. I assume the ITS driver
allocates/populates per-DeviceID interrupt translation tables featuring
separate LPI spaces, i.e. by construction different ITTs cannot feature
the same LPIs. So there is no need to do the extended viability test.

The MSI controller should have a property telling whether
it supports interrupt translation. This kind of property currently
exists on the IOMMU side for Intel remapping.

Alex Williamson

Feb 5, 2016, 1:17:13 PM2/5/16
to Eric Auger, Christoffer Dall, Will Deacon, eric....@st.com, marc.z...@arm.com, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, suravee.su...@amd.com, linux-...@vger.kernel.org, pat...@linaro.org, io...@lists.linux-foundation.org
Hi Eric,

Would anyone be terribly upset if we simply assume the worst case
scenario on GICv2m/M, have the IOMMU not claim IOMMU_CAP_INTR_REMAP, and
require the user to opt in via the allow_unsafe_interrupts option on the
vfio_iommu_type1 module? That would make it very compatible with what
we already do on x86, where it really is all or nothing. My assumption
is that GICv2 would be phased out in favor of GICv3, so there's always
a hardware upgrade path to having more complete isolation, but the
return on investment for figuring out whether a given device really has
this sort of isolation seems pretty low. Often users already have some
degree of trust in the VMs they use for device assignment anyway. An
especially prudent user can still look at the hardware specs for their
specific system to understand whether any devices are fully isolated
and only make use of those for device assignment. Does that seem like
a reasonable alternative? Thanks,

Alex

Christoffer Dall

Feb 8, 2016, 4:48:02 AM2/8/16
to Alex Williamson, Eric Auger, Will Deacon, eric....@st.com, marc.z...@arm.com, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, suravee.su...@amd.com, linux-...@vger.kernel.org, pat...@linaro.org, io...@lists.linux-foundation.org
meaning either you allow unsafe multiplexing with passthrough in every
flavor or you don't allow it at all?

I didn't know such an option existed, but it seems to me that this fits
the bill exactly.


> My assumption
> is that GICv2 would be phased out in favor of GICv3, so there's always
> a hardware upgrade path to having more complete isolation, but the
> return on investment for figuring out whether a given device really has
> this sort of isolation seems pretty low. Often users already have some
> degree of trust in the VMs they use for device assignment anyway. An
> especially prudent user can still look at the hardware specs for their
> specific system to understand whether any devices are fully isolated
> and only make use of those for device assignment. Does that seem like
> a reasonable alternative?
>

It sounds good to me, that would allow us to release a GICv2m-based
solution for MSI passthrough on currently available hardware like the
Seattle.

Thanks,
-Christoffer

Eric Auger

Feb 8, 2016, 8:28:17 AM2/8/16
to Christoffer Dall, Alex Williamson, Will Deacon, eric....@st.com, marc.z...@arm.com, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, suravee.su...@amd.com, linux-...@vger.kernel.org, pat...@linaro.org, io...@lists.linux-foundation.org
Hi Alex, Christoffer,
That's my understanding: if the IOMMU does not expose
IOMMU_CAP_INTR_REMAP, the end-user must explicitly turn
allow_unsafe_interrupts on. On ARM we will have to handle the fact that
interrupt translation is handled on the interrupt controller side and
not on the IOMMU side, though.
>
> I didn't know such on option existed, but it seems to me that this fits
> the bill exactly.
Well, I think the support of multiple GICv2m MSI frames was devised to
allow safe interrupts, but extending the VFIO viability notion as
described above effectively seems a huge amount of work for a small
benefit, since I am afraid we don't have much HW featuring multiple
frames. So I think it is a good compromise to have a minimal
integration with GICv2m and the full feature set with the best-fitted
HW, i.e. GICv3 ITS.
>
>
>> My assumption
>> is that GICv2 would be phased out in favor of GICv3, so there's always
>> a hardware upgrade path to having more complete isolation, but the
>> return on investment for figuring out whether a given device really has
>> this sort of isolation seems pretty low. Often users already have some
>> degree of trust in the VMs they use for device assignment anyway. An
>> especially prudent user can still look at the hardware specs for their
>> specific system to understand whether any devices are fully isolated
>> and only make use of those for device assignment. Does that seem like
>> a reasonable alternative?
>>
>
> It sounds good to me, that would allow us to release a GICv2m-based
> solution for MSI passthrough on currently available hardware like the
> Seattle.

Sounds good to me too. I am going to respin the kernel series according
to this discussion and previous comments.

Thanks for your comments!

Best Regards

Eric
>
> Thanks,
> -Christoffer
>

Eric Auger

Feb 11, 2016, 9:35:04 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
It pursues the efforts done in [1], [2], [3]. It also aims at covering the
same need on some PowerPC platforms, where the same kind of integration
should be carried out.

On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_0000h] are treated
as interrupt messages: accesses to this special PA window directly target the
APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.

This is not the case on the above-mentioned platforms, where MSI messages
emitted by devices are conveyed through the IOMMU. This means an IOVA/host PA
mapping must exist for the MSI to reach the MSI controller. The normal way to
create IOVA bindings is to use the VFIO DMA MAP API. However in this case
the MSI IOVA is not mapped onto guest RAM but onto a host physical page (the
MSI controller frame).
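In other words, the x86 special case boils down to a fixed address-range test for which no ARM/PPC equivalent exists; a trivial sketch:

```c
#include <assert.h>
#include <stdint.h>

/* On x86, DMA writes landing in [0xFEE00000, 0xFEF00000) bypass the IOMMU
 * and are delivered as interrupt messages. On ARM/PPC no such window
 * exists, so an IOVA mapping to the MSI doorbell must be created
 * explicitly instead. */
static int x86_pa_is_msi(uint64_t pa)
{
	return pa >= 0xFEE00000ULL && pa < 0xFEF00000ULL;
}
```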

In a nutshell, this series does:
- introduce an IOMMU API to register an IOVA window usable for reserved mappings
- reuse the VFIO DMA MAP ioctl with a new flag to plug onto that new API
- check whether the device's MSI parent controllers allow IRQ remapping
(allow the unsafe interrupt modality) for a given group
- introduce a new IOMMU API to allocate reserved IOVAs and bind them onto
a physical address
- allow the GICv2m and GICv3 ITS PCI irqchips to map/unmap the MSI frame
on irq_write_msi_msg

Best Regards

Eric

Testing:
- functional on ARM64 AMD Overdrive HW (single GICv2m frame) with an e1000e
PCIe card.
- tested there is no regression on
x non-assigned PCIe driver
x platform device passthrough
- not tested: ARM with SR-IOV, ARM GICv3 ITS

References:
[1] [RFC 0/2] VFIO: Add virtual MSI doorbell support
(https://lkml.org/lkml/2015/7/24/135)
[2] [RFC PATCH 0/6] vfio: Add interface to map MSI pages
(https://lists.cs.columbia.edu/pipermail/kvmarm/2015-September/016607.html)
[3] [PATCH v2 0/3] Introduce MSI hardware mapping for VFIO
(http://permalink.gmane.org/gmane.comp.emulators.kvm.arm.devel/3858)

Git:
https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.5-rc3-pcie-passthrough-rfcv2

previous version at
https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.5-rc1-pcie-passthrough-v1

QEMU Integration:
[RFC v2 0/8] KVM PCI/MSI passthrough with mach-virt
(http://lists.gnu.org/archive/html/qemu-arm/2016-01/msg00444.html)
https://git.linaro.org/people/eric.auger/qemu.git/shortlog/refs/heads/v2.5.0-pci-passthrough-rfc-v2

User Hints:
To allow PCI/MSI passthrough with GICv2M, compile VFIO as a module and
load the vfio_iommu_type1 module with allow_unsafe_interrupts param:
sudo modprobe -v vfio_iommu_type1 allow_unsafe_interrupts=1
sudo modprobe -v vfio-pci

History:

PATCH v1 -> RFC v2:
- reverted to RFC since it looks more reasonable ;-) the code is split
between VFIO, IOMMU, MSI controller and I am not sure I made the right
choices. Also the API needs to be further discussed.
- iova API usage in arm-smmu.c.
- the MSI controller natively programs the MSI addr with either the PA or
IOVA. This is not done anymore in the vfio-pci driver, as suggested by Alex.
- check the IRQ remapping capability of the group

RFC v1 [2] -> PATCH v1:
- use the existing dma map/unmap ioctl interface with a flag to register a
reserved IOVA range. Use the legacy RB tree to store this special vfio_dma.
- a single contiguous reserved IOVA region is now allowed
- use of an RB tree indexed by PA to store allocated reserved slots
- use of a vfio_domain iova_domain to manage iova allocation within the
window provided by the userspace
- vfio alloc_map/unmap_free take a vfio_group handle
- vfio_group handle is cached in vfio_pci_device
- add ref counting to bindings
- user modality enabled at the end of the series


Eric Auger (15):
iommu: Add DOMAIN_ATTR_MSI_MAPPING attribute
vfio: expose MSI mapping requirement through VFIO_IOMMU_GET_INFO
vfio: introduce VFIO_IOVA_RESERVED vfio_dma type
iommu: add alloc/free_reserved_iova_domain
iommu/arm-smmu: implement alloc/free_reserved_iova_domain
iommu/arm-smmu: add a reserved binding RB tree
iommu: iommu_get/put_single_reserved
iommu/arm-smmu: implement iommu_get/put_single_reserved
iommu/arm-smmu: relinquish reserved resources on domain deletion
vfio: allow the user to register reserved iova range for MSI mapping
msi: Add a new MSI_FLAG_IRQ_REMAPPING flag
msi: export msi_get_domain_info
vfio/type1: also check IRQ remapping capability at msi domain
iommu/arm-smmu: do not advertise IOMMU_CAP_INTR_REMAP
irqchip/gicv2m/v3-its-pci-msi: IOMMU map the MSI frame when needed

drivers/iommu/arm-smmu.c | 292 +++++++++++++++++++++++++++++--
drivers/iommu/fsl_pamu_domain.c | 2 +
drivers/iommu/iommu.c | 43 +++++
drivers/irqchip/irq-gic-common.c | 60 +++++++
drivers/irqchip/irq-gic-common.h | 5 +
drivers/irqchip/irq-gic-v2m.c | 3 +-
drivers/irqchip/irq-gic-v3-its-pci-msi.c | 3 +-
drivers/irqchip/irq-gic-v3-its.c | 1 +
drivers/vfio/vfio_iommu_type1.c | 156 ++++++++++++++++-
include/linux/iommu.h | 53 ++++++
include/linux/msi.h | 6 +
include/uapi/linux/vfio.h | 10 ++
kernel/irq/msi.c | 1 +
13 files changed, 613 insertions(+), 22 deletions(-)

--
1.9.1

Eric Auger

Feb 11, 2016, 9:35:18 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Introduce alloc/free_reserved_iova_domain in the IOMMU API.
alloc_reserved_iova_domain initializes an iova domain at a given
iova base address and with a given size. This iova domain will
be used to allocate iovas within that window. Those IOVAs will be reserved
for special purposes, typically MSI frame binding. Allocation functions
within the reserved iova domain will be introduced in subsequent patches.

It is the responsibility of the API user to make sure any IOVA
belonging to that domain is allocated with those allocation functions.

Signed-off-by: Eric Auger <eric....@linaro.org>

---

v1 -> v2:
- moved from vfio API to IOMMU API
---
drivers/iommu/iommu.c | 22 ++++++++++++++++++++++
include/linux/iommu.h | 21 +++++++++++++++++++++
2 files changed, 43 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0e3b009..a994f34 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1557,6 +1557,28 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
}
EXPORT_SYMBOL_GPL(iommu_domain_set_attr);

+int iommu_alloc_reserved_iova_domain(struct iommu_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order)
+{
+ int ret;
+
+ if (!domain->ops->alloc_reserved_iova_domain)
+ return -EINVAL;
+ ret = domain->ops->alloc_reserved_iova_domain(domain, iova,
+ size, order);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_alloc_reserved_iova_domain);
+
+void iommu_free_reserved_iova_domain(struct iommu_domain *domain)
+{
+ if (!domain->ops->free_reserved_iova_domain)
+ return;
+ domain->ops->free_reserved_iova_domain(domain);
+}
+EXPORT_SYMBOL_GPL(iommu_free_reserved_iova_domain);
+
void iommu_get_dm_regions(struct device *dev, struct list_head *list)
{
const struct iommu_ops *ops = dev->bus->iommu_ops;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a4fe04a..32c1a4e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -195,6 +195,12 @@ struct iommu_ops {
int (*domain_set_windows)(struct iommu_domain *domain, u32 w_count);
/* Get the number of windows per domain */
u32 (*domain_get_windows)(struct iommu_domain *domain);
+ /* allocates the reserved iova domain */
+ int (*alloc_reserved_iova_domain)(struct iommu_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order);
+ /* frees the reserved iova domain */
+ void (*free_reserved_iova_domain)(struct iommu_domain *domain);

#ifdef CONFIG_OF_IOMMU
int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
@@ -266,6 +272,10 @@ extern int iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr,
void *data);
extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
void *data);
+extern int iommu_alloc_reserved_iova_domain(struct iommu_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order);
+extern void iommu_free_reserved_iova_domain(struct iommu_domain *domain);
struct device *iommu_device_create(struct device *parent, void *drvdata,
const struct attribute_group **groups,
const char *fmt, ...) __printf(4, 5);
@@ -541,6 +551,17 @@ static inline void iommu_device_unlink(struct device *dev, struct device *link)
{
}

+static int iommu_alloc_reserved_iova_domain(struct iommu_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order)
+{
+ return -EINVAL;
+}
+
+static void iommu_free_reserved_iova_domain(struct iommu_domain *domain)
+{
+}
+
#endif /* CONFIG_IOMMU_API */

#endif /* __LINUX_IOMMU_H */
--
1.9.1

Eric Auger

Feb 11, 2016, 9:35:29 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Implement alloc/free_reserved_iova_domain for arm-smmu. We use
the iova allocator (iova.c). The iova_domain is attached to the
arm_smmu_domain struct. A mutex is introduced to protect it.

Signed-off-by: Eric Auger <eric....@linaro.org>

---

v1 -> v2:
- formerly implemented in vfio_iommu_type1
---
drivers/iommu/arm-smmu.c | 87 +++++++++++++++++++++++++++++++++++++++---------
1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index c8b7e71..f42341d 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -42,6 +42,7 @@
#include <linux/platform_device.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/iova.h>

#include <linux/amba/bus.h>

@@ -347,6 +348,9 @@ struct arm_smmu_domain {
enum arm_smmu_domain_stage stage;
struct mutex init_mutex; /* Protects smmu pointer */
struct iommu_domain domain;
+ struct iova_domain *reserved_iova_domain;
+ /* protects reserved domain manipulation */
+ struct mutex reserved_mutex;
};

static struct iommu_ops arm_smmu_ops;
@@ -975,6 +979,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
return NULL;

mutex_init(&smmu_domain->init_mutex);
+ mutex_init(&smmu_domain->reserved_mutex);
spin_lock_init(&smmu_domain->pgtbl_lock);

return &smmu_domain->domain;
@@ -1446,22 +1451,74 @@ out_unlock:
return ret;
}

+static int arm_smmu_alloc_reserved_iova_domain(struct iommu_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order)
+{
+ unsigned long granule, mask;
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ int ret = 0;
+
+ granule = 1UL << order;
+ mask = granule - 1;
+ if (iova & mask || (!size) || (size & mask))
+ return -EINVAL;
+
+ if (smmu_domain->reserved_iova_domain)
+ return -EEXIST;
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+
+ smmu_domain->reserved_iova_domain =
+ kzalloc(sizeof(struct iova_domain), GFP_KERNEL);
+ if (!smmu_domain->reserved_iova_domain) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+
+ init_iova_domain(smmu_domain->reserved_iova_domain,
+ granule, iova >> order, (iova + size - 1) >> order);
+
+unlock:
+ mutex_unlock(&smmu_domain->reserved_mutex);
+ return ret;
+}
+
+static void arm_smmu_free_reserved_iova_domain(struct iommu_domain *domain)
+{
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ struct iova_domain *iovad = smmu_domain->reserved_iova_domain;
+
+ if (!iovad)
+ return;
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+
+ put_iova_domain(iovad);
+ kfree(iovad);
+
+ mutex_unlock(&smmu_domain->reserved_mutex);
+}
+
static struct iommu_ops arm_smmu_ops = {
- .capable = arm_smmu_capable,
- .domain_alloc = arm_smmu_domain_alloc,
- .domain_free = arm_smmu_domain_free,
- .attach_dev = arm_smmu_attach_dev,
- .detach_dev = arm_smmu_detach_dev,
- .map = arm_smmu_map,
- .unmap = arm_smmu_unmap,
- .map_sg = default_iommu_map_sg,
- .iova_to_phys = arm_smmu_iova_to_phys,
- .add_device = arm_smmu_add_device,
- .remove_device = arm_smmu_remove_device,
- .device_group = arm_smmu_device_group,
- .domain_get_attr = arm_smmu_domain_get_attr,
- .domain_set_attr = arm_smmu_domain_set_attr,
- .pgsize_bitmap = -1UL, /* Restricted during device attach */
+ .capable = arm_smmu_capable,
+ .domain_alloc = arm_smmu_domain_alloc,
+ .domain_free = arm_smmu_domain_free,
+ .attach_dev = arm_smmu_attach_dev,
+ .detach_dev = arm_smmu_detach_dev,
+ .map = arm_smmu_map,
+ .unmap = arm_smmu_unmap,
+ .map_sg = default_iommu_map_sg,
+ .iova_to_phys = arm_smmu_iova_to_phys,
+ .add_device = arm_smmu_add_device,
+ .remove_device = arm_smmu_remove_device,
+ .device_group = arm_smmu_device_group,
+ .domain_get_attr = arm_smmu_domain_get_attr,
+ .domain_set_attr = arm_smmu_domain_set_attr,
+ .alloc_reserved_iova_domain = arm_smmu_alloc_reserved_iova_domain,
+ .free_reserved_iova_domain = arm_smmu_free_reserved_iova_domain,
+ /* Page size bitmap, restricted during device attach */
+ .pgsize_bitmap = -1UL,
};

static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
--
1.9.1

Eric Auger

Feb 11, 2016, 9:35:40 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Implement the iommu_get/put_single_reserved API in arm-smmu.

In order to track which physical addresses are already mapped, we
use an RB tree indexed by PA.

Signed-off-by: Eric Auger <eric....@linaro.org>

---

v1 -> v2:
- previously in vfio_iommu_type1.c
---
drivers/iommu/arm-smmu.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 114 insertions(+)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 729a4c6..9961bfd 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1563,6 +1563,118 @@ static void arm_smmu_free_reserved_iova_domain(struct iommu_domain *domain)
mutex_unlock(&smmu_domain->reserved_mutex);
}

+static int arm_smmu_get_single_reserved(struct iommu_domain *domain,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova)
+{
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ unsigned long order = __ffs(domain->ops->pgsize_bitmap);
+ size_t page_size = 1 << order;
+ phys_addr_t mask = page_size - 1;
+ phys_addr_t aligned_addr = addr & ~mask;
+ phys_addr_t offset = addr - aligned_addr;
+ struct arm_smmu_reserved_binding *b;
+ struct iova *p_iova;
+ struct iova_domain *iovad = smmu_domain->reserved_iova_domain;
+ int ret;
+
+ if (!iovad)
+ return -EINVAL;
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+
+ b = find_reserved_binding(smmu_domain, aligned_addr, page_size);
+ if (b) {
+ *iova = b->iova + offset;
+ kref_get(&b->kref);
+ ret = 0;
+ goto unlock;
+ }
+
+ /* there is no existing reserved iova for this pa */
+ p_iova = alloc_iova(iovad, 1, iovad->dma_32bit_pfn, true);
+ if (!p_iova) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+ *iova = p_iova->pfn_lo << order;
+
+ b = kzalloc(sizeof(*b), GFP_KERNEL);
+ if (!b) {
+ ret = -ENOMEM;
+ goto free_iova_unlock;
+ }
+
+ ret = arm_smmu_map(domain, *iova, aligned_addr, page_size, prot);
+ if (ret)
+ goto free_binding_iova_unlock;
+
+ kref_init(&b->kref);
+ kref_get(&b->kref);
+ b->domain = smmu_domain;
+ b->addr = aligned_addr;
+ b->iova = *iova;
+ b->size = page_size;
+
+ link_reserved_binding(smmu_domain, b);
+
+ *iova += offset;
+ goto unlock;
+
+free_binding_iova_unlock:
+ kfree(b);
+free_iova_unlock:
+ free_iova(smmu_domain->reserved_iova_domain, *iova >> order);
+unlock:
+ mutex_unlock(&smmu_domain->reserved_mutex);
+ return ret;
+}
+
+/* called with reserved_mutex locked */
+static void reserved_binding_release(struct kref *kref)
+{
+ struct arm_smmu_reserved_binding *b =
+ container_of(kref, struct arm_smmu_reserved_binding, kref);
+ struct arm_smmu_domain *smmu_domain = b->domain;
+ struct iommu_domain *d = &smmu_domain->domain;
+ unsigned long order = __ffs(b->size);
+
+
+ arm_smmu_unmap(d, b->iova, b->size);
+ free_iova(smmu_domain->reserved_iova_domain, b->iova >> order);
+ unlink_reserved_binding(smmu_domain, b);
+ kfree(b);
+}
+
+static void arm_smmu_put_single_reserved(struct iommu_domain *domain,
+ dma_addr_t iova)
+{
+ unsigned long order;
+ phys_addr_t aligned_addr;
+ dma_addr_t aligned_iova, page_size, mask, offset;
+ struct arm_smmu_reserved_binding *b;
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+ order = __ffs(domain->ops->pgsize_bitmap);
+ page_size = (uint64_t)1 << order;
+ mask = page_size - 1;
+
+ aligned_iova = iova & ~mask;
+ offset = iova - aligned_iova;
+
+ aligned_addr = iommu_iova_to_phys(domain, aligned_iova);
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+
+ b = find_reserved_binding(smmu_domain, aligned_addr, page_size);
+ if (!b)
+ goto unlock;
+ kref_put(&b->kref, reserved_binding_release);
+
+unlock:
+ mutex_unlock(&smmu_domain->reserved_mutex);
+}
+
static struct iommu_ops arm_smmu_ops = {
.capable = arm_smmu_capable,
.domain_alloc = arm_smmu_domain_alloc,
@@ -1580,6 +1692,8 @@ static struct iommu_ops arm_smmu_ops = {
.domain_set_attr = arm_smmu_domain_set_attr,
.alloc_reserved_iova_domain = arm_smmu_alloc_reserved_iova_domain,
.free_reserved_iova_domain = arm_smmu_free_reserved_iova_domain,
+ .get_single_reserved = arm_smmu_get_single_reserved,
+ .put_single_reserved = arm_smmu_put_single_reserved,
/* Page size bitmap, restricted during device attach */
.pgsize_bitmap = -1UL,
};
--
1.9.1

Eric Auger

Feb 11, 2016, 9:35:50 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
The user is allowed to register a reserved IOVA range by using the
DMA MAP API and setting the new flag, VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA,
along with the base address and the size. This region is stored in the
vfio_dma rb tree. At that point the IOVA range is not mapped to any target
address yet. The host kernel will use those IOVAs when needed, typically
when the VFIO-PCI device allocates its MSIs.

This patch also handles the destruction of the reserved binding RB-tree and
domain's iova_domains.

Signed-off-by: Eric Auger <eric....@linaro.org>
Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>

---

v1 -> v2:
- set returned value according to alloc_reserved_iova_domain result
- free the iova domains in case any error occurs

RFC v1 -> v1:
- takes into account Alex comments, based on
[RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
- use the existing dma map/unmap ioctl interface with a flag to register
a reserved IOVA range. A single reserved iova region is allowed.
---
drivers/vfio/vfio_iommu_type1.c | 75 ++++++++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 9 +++++
2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b9326c9..c5d3b48 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -673,6 +673,75 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
return ret;
}

+static int vfio_register_reserved_iova_range(struct vfio_iommu *iommu,
+ struct vfio_iommu_type1_dma_map *map)
+{
+ dma_addr_t iova = map->iova;
+ size_t size = map->size;
+ uint64_t mask;
+ struct vfio_dma *dma;
+ int ret = 0;
+ struct vfio_domain *d;
+ unsigned long order;
+
+ /* Verify that none of our __u64 fields overflow */
+ if (map->size != size || map->iova != iova)
+ return -EINVAL;
+
+ order = __ffs(vfio_pgsize_bitmap(iommu));
+ mask = ((uint64_t)1 << order) - 1;
+
+ WARN_ON(mask & PAGE_MASK);
+
+ /* we currently only support MSI_RESERVED_IOVA */
+ if (!(map->flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA))
+ return -EINVAL;
+
+ if (!size || (size | iova) & mask)
+ return -EINVAL;
+
+ /* Don't allow IOVA address wrap */
+ if (iova + size - 1 < iova)
+ return -EINVAL;
+
+ mutex_lock(&iommu->lock);
+
+ /* check if the iova domain has not been instantiated already*/
+ d = list_first_entry(&iommu->domain_list,
+ struct vfio_domain, next);
+
+ if (vfio_find_dma(iommu, iova, size)) {
+ ret = -EEXIST;
+ goto out;
+ }
+
+ dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+ if (!dma) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ dma->iova = iova;
+ dma->size = size;
+ dma->type = VFIO_IOVA_RESERVED;
+
+ list_for_each_entry(d, &iommu->domain_list, next)
+ ret |= iommu_alloc_reserved_iova_domain(d->domain, iova,
+ size, order);
+
+ if (ret) {
+ list_for_each_entry(d, &iommu->domain_list, next)
+ iommu_free_reserved_iova_domain(d->domain);
+ goto out;
+ }
+
+ vfio_link_dma(iommu, dma);
+
+out:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
static int vfio_bus_type(struct device *dev, void *data)
{
struct bus_type **bus = data;
@@ -1045,7 +1114,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
} else if (cmd == VFIO_IOMMU_MAP_DMA) {
struct vfio_iommu_type1_dma_map map;
uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
- VFIO_DMA_MAP_FLAG_WRITE;
+ VFIO_DMA_MAP_FLAG_WRITE |
+ VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;

minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);

@@ -1055,6 +1125,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
if (map.argsz < minsz || map.flags & ~mask)
return -EINVAL;

+ if (map.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA)
+ return vfio_register_reserved_iova_range(iommu, &map);
+
return vfio_dma_do_map(iommu, &map);

} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 43e183b..982e326 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -411,12 +411,21 @@ struct vfio_iommu_type1_info {
*
* Map process virtual addresses to IO virtual addresses using the
* provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * In case MSI_RESERVED_IOVA is set, the API only aims at registering an IOVA
+ * region which will be used on some platforms to map the host MSI frame.
+ * In that specific case, vaddr and prot are ignored. The requirement for
+ * provisioning such an IOVA range can be checked by calling VFIO_IOMMU_GET_INFO
+ * with the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute. A single
+ * MSI_RESERVED_IOVA region can be registered.
*/
struct vfio_iommu_type1_dma_map {
__u32 argsz;
__u32 flags;
#define VFIO_DMA_MAP_FLAG_READ (1 << 0) /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */
+/* reserved iova for MSI vectors */
+#define VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA (1 << 2)
__u64 vaddr; /* Process virtual address */
__u64 iova; /* IO virtual address */
__u64 size; /* Size of mapping (bytes) */
--
1.9.1

Eric Auger

Feb 11, 2016, 9:35:55 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
We plan to use msi_get_domain_info in VFIO module so let's export it.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
include/linux/msi.h | 4 ++++
kernel/irq/msi.c | 1 +
2 files changed, 5 insertions(+)

diff --git a/include/linux/msi.h b/include/linux/msi.h
index 03eda72..df19315 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -323,6 +323,10 @@ static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
{
return NULL;
}
+static struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain)
+{
+ return NULL;
+}
#endif /* CONFIG_PCI_MSI_IRQ_DOMAIN */

#endif /* LINUX_MSI_H */
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 38e89ce..9b0ba4a 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -400,5 +400,6 @@ struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain)
{
return (struct msi_domain_info *)domain->host_data;
}
+EXPORT_SYMBOL_GPL(msi_get_domain_info);

#endif /* CONFIG_GENERIC_MSI_IRQ_DOMAIN */
--
1.9.1

Eric Auger

Feb 11, 2016, 9:35:58 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
On x86 IRQ remapping is abstracted by the IOMMU. On ARM this is abstracted
by the MSI controller. vfio_msi_parent_irq_remapping_capable checks whether
interrupts are "safe" for a given device. They are safe if the device
does not use MSI, or if the device uses MSI and the msi-parent controller
supports IRQ remapping.

Then we check at group level whether all devices have safe interrupts: if
not, the group is only allowed to attach when allow_unsafe_interrupts is set.

At this point ARM sMMU still advertises IOMMU_CAP_INTR_REMAP. This is
changed in next patch.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/vfio/vfio_iommu_type1.c | 38 ++++++++++++++++++++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c5d3b48..080321b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,8 @@
#include <linux/uaccess.h>
#include <linux/vfio.h>
#include <linux/workqueue.h>
+#include <linux/irqdomain.h>
+#include <linux/msi.h>

#define DRIVER_VERSION "0.2"
#define DRIVER_AUTHOR "Alex Williamson <alex.wi...@redhat.com>"
@@ -754,6 +756,31 @@ static int vfio_bus_type(struct device *dev, void *data)
return 0;
}

+/**
+ * vfio_msi_parent_irq_remapping_capable: returns whether the device msi-parent
+ * controller supports IRQ remapping, aka interrupt translation
+ *
+ * @dev: device handle
+ * @data: unused
+ * returns 0 if irq remapping is supported or -1 if not supported.
+ */
+static int vfio_msi_parent_irq_remapping_capable(struct device *dev, void *data)
+{
+ struct irq_domain *domain;
+ struct msi_domain_info *info;
+
+ domain = dev_get_msi_domain(dev);
+ if (!domain)
+ return 0;
+
+ info = msi_get_domain_info(domain);
+
+ if (!(info->flags & MSI_FLAG_IRQ_REMAPPING))
+ return -1;
+
+ return 0;
+}
+
static int vfio_iommu_replay(struct vfio_iommu *iommu,
struct vfio_domain *domain)
{
@@ -848,7 +875,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
struct vfio_group *group, *g;
struct vfio_domain *domain, *d;
struct bus_type *bus = NULL;
- int ret;
+ int ret, irq_remapping;

mutex_lock(&iommu->lock);

@@ -871,6 +898,13 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,

group->iommu_group = iommu_group;

+ /*
+ * Determine if all the devices of the group have an MSI-parent that
+ * supports irq remapping
+ */
+ irq_remapping = !iommu_group_for_each_dev(iommu_group, &bus,
+ vfio_msi_parent_irq_remapping_capable);
+
/* Determine bus_type in order to allocate a domain */
ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
if (ret)
@@ -899,7 +933,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
list_add(&group->next, &domain->group_list);

if (!allow_unsafe_interrupts &&
- !iommu_capable(bus, IOMMU_CAP_INTR_REMAP)) {
+ (!iommu_capable(bus, IOMMU_CAP_INTR_REMAP) && !irq_remapping)) {
pr_warn("%s: No interrupt remapping support. Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
__func__);
ret = -EPERM;
--
1.9.1

Eric Auger

Feb 11, 2016, 9:36:08 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
In case the msi_desc references a device attached to an iommu
domain, the msi address needs to be mapped in the IOMMU. Else any
MSI write transaction will cause a fault.

gic_set_msi_addr detects that case and allocates an IOVA bound
to the physical page containing the MSI frame. This IOVA is then
used as the msi_msg address. The unset operation decrements the
reference count on the binding.

The functions are called in the irq_write_msi_msg ops implementation.
At that time we can recognize whether the MSI is being set up or torn down
by looking at the msi_msg content. Indeed msi_domain_deactivate zeroes all
the fields.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/irqchip/irq-gic-common.c | 60 ++++++++++++++++++++++++++++++++
drivers/irqchip/irq-gic-common.h | 5 +++
drivers/irqchip/irq-gic-v2m.c | 3 +-
drivers/irqchip/irq-gic-v3-its-pci-msi.c | 3 +-
4 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-common.c b/drivers/irqchip/irq-gic-common.c
index f174ce0..690802b 100644
--- a/drivers/irqchip/irq-gic-common.c
+++ b/drivers/irqchip/irq-gic-common.c
@@ -18,6 +18,8 @@
#include <linux/io.h>
#include <linux/irq.h>
#include <linux/irqchip/arm-gic.h>
+#include <linux/iommu.h>
+#include <linux/msi.h>

#include "irq-gic-common.h"

@@ -121,3 +123,61 @@ void gic_cpu_config(void __iomem *base, void (*sync_access)(void))
if (sync_access)
sync_access();
}
+
+int gic_set_msi_addr(struct irq_data *data, struct msi_msg *msg)
+{
+ struct msi_desc *desc = irq_data_get_msi_desc(data);
+ struct device *dev = msi_desc_to_dev(desc);
+ struct iommu_domain *d;
+ phys_addr_t addr;
+ dma_addr_t iova;
+ int ret;
+
+ d = iommu_get_domain_for_dev(dev);
+ if (!d)
+ return 0;
+
+ addr = ((phys_addr_t)(msg->address_hi) << 32) |
+ msg->address_lo;
+
+ ret = iommu_get_single_reserved(d, addr, IOMMU_WRITE, &iova);
+
+ if (!ret) {
+ msg->address_lo = lower_32_bits(iova);
+ msg->address_hi = upper_32_bits(iova);
+ }
+ return ret;
+}
+
+
+void gic_unset_msi_addr(struct irq_data *data)
+{
+ struct msi_desc *desc = irq_data_get_msi_desc(data);
+ struct device *dev;
+ struct iommu_domain *d;
+ dma_addr_t iova;
+
+ iova = ((dma_addr_t)(desc->msg.address_hi) << 32) |
+ desc->msg.address_lo;
+
+ dev = msi_desc_to_dev(desc);
+ if (!dev)
+ return;
+
+ d = iommu_get_domain_for_dev(dev);
+ if (!d)
+ return;
+
+ iommu_put_single_reserved(d, iova);
+}
+
+void gic_pci_msi_domain_write_msg(struct irq_data *irq_data,
+ struct msi_msg *msg)
+{
+ if (!msg->address_hi && !msg->address_lo && !msg->data)
+ gic_unset_msi_addr(irq_data); /* deactivate */
+ else
+ gic_set_msi_addr(irq_data, msg); /* activate, set_affinity */
+
+ pci_msi_domain_write_msg(irq_data, msg);
+}
diff --git a/drivers/irqchip/irq-gic-common.h b/drivers/irqchip/irq-gic-common.h
index fff697d..e99e321 100644
--- a/drivers/irqchip/irq-gic-common.h
+++ b/drivers/irqchip/irq-gic-common.h
@@ -35,4 +35,9 @@ void gic_cpu_config(void __iomem *base, void (*sync_access)(void));
void gic_enable_quirks(u32 iidr, const struct gic_quirk *quirks,
void *data);

+int gic_set_msi_addr(struct irq_data *data, struct msi_msg *msg);
+void gic_unset_msi_addr(struct irq_data *data);
+void gic_pci_msi_domain_write_msg(struct irq_data *irq_data,
+ struct msi_msg *msg);
+
#endif /* _IRQ_GIC_COMMON_H */
diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
index c779f83..5d7b89f 100644
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -24,6 +24,7 @@
#include <linux/of_pci.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include "irq-gic-common.h"

/*
* MSI_TYPER:
@@ -83,7 +84,7 @@ static struct irq_chip gicv2m_msi_irq_chip = {
.irq_mask = gicv2m_mask_msi_irq,
.irq_unmask = gicv2m_unmask_msi_irq,
.irq_eoi = irq_chip_eoi_parent,
- .irq_write_msi_msg = pci_msi_domain_write_msg,
+ .irq_write_msi_msg = gic_pci_msi_domain_write_msg,
};

static struct msi_domain_info gicv2m_msi_domain_info = {
diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
index aee60ed..6d5cbce 100644
--- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
@@ -19,6 +19,7 @@
#include <linux/of.h>
#include <linux/of_irq.h>
#include <linux/of_pci.h>
+#include "irq-gic-common.h"

static void its_mask_msi_irq(struct irq_data *d)
{
@@ -37,7 +38,7 @@ static struct irq_chip its_msi_irq_chip = {
.irq_unmask = its_unmask_msi_irq,
.irq_mask = its_mask_msi_irq,
.irq_eoi = irq_chip_eoi_parent,
- .irq_write_msi_msg = pci_msi_domain_write_msg,
+ .irq_write_msi_msg = gic_pci_msi_domain_write_msg,
};

struct its_pci_alias {
--
1.9.1

Eric Auger

Feb 11, 2016, 9:36:33 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Do not advertise IOMMU_CAP_INTR_REMAP for arm-smmu. Indeed the
irq_remapping capability is abstracted on irqchip side for ARM as
opposed to Intel IOMMU featuring IRQ remapping HW.

So to check the IRQ remapping capability, the MSI domain needs to be
checked instead.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/iommu/arm-smmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ae8a97d..9a83285 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1354,7 +1354,7 @@ static bool arm_smmu_capable(enum iommu_cap cap)
*/
return true;
case IOMMU_CAP_INTR_REMAP:
- return true; /* MSIs are just memory writes */
+ return false; /* MSIs are just memory writes */
case IOMMU_CAP_NOEXEC:
return true;
default:
--
1.9.1

Eric Auger

Feb 11, 2016, 9:38:01 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Let's introduce a new msi_domain_info flag value, MSI_FLAG_IRQ_REMAPPING,
meant to tell that the domain supports IRQ remapping, also known as
Interrupt Translation Service. On Intel HW this capability is abstracted
on the IOMMU side while on ARM it is abstracted on the MSI controller side.

GICv3 ITS HW is the first HW advertising that feature.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/irqchip/irq-gic-v3-its.c | 1 +
include/linux/msi.h | 2 ++
2 files changed, 3 insertions(+)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 3447549..a5e0d8b 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1544,6 +1544,7 @@ static int its_probe(struct device_node *node, struct irq_domain *parent)
inner_domain->parent = parent;
inner_domain->bus_token = DOMAIN_BUS_NEXUS;
info->ops = &its_msi_domain_ops;
+ info->flags |= MSI_FLAG_IRQ_REMAPPING;
info->data = its;
inner_domain->host_data = info;
}
diff --git a/include/linux/msi.h b/include/linux/msi.h
index a2a0068..03eda72 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -261,6 +261,8 @@ enum {
MSI_FLAG_MULTI_PCI_MSI = (1 << 3),
/* Support PCI MSIX interrupts */
MSI_FLAG_PCI_MSIX = (1 << 4),
+ /* Support MSI IRQ remapping service */
+ MSI_FLAG_IRQ_REMAPPING = (1 << 5),
};

int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
--
1.9.1

Eric Auger

Feb 11, 2016, 9:39:23 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
arm_smmu_unmap_reserved releases all reserved binding resources:
destroy all bindings, free iova, free iova_domain. This happens
on domain deletion.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/iommu/arm-smmu.c | 34 +++++++++++++++++++++++++++++-----
1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 9961bfd..ae8a97d 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -363,6 +363,7 @@ struct arm_smmu_reserved_binding {
dma_addr_t iova;
size_t size;
};
+static void arm_smmu_unmap_reserved(struct iommu_domain *domain);

static struct iommu_ops arm_smmu_ops;

@@ -1057,6 +1058,7 @@ static void arm_smmu_domain_free(struct iommu_domain *domain)
* already been detached.
*/
arm_smmu_destroy_domain_context(domain);
+ arm_smmu_unmap_reserved(domain);
kfree(smmu_domain);
}

@@ -1547,19 +1549,23 @@ unlock:
return ret;
}

-static void arm_smmu_free_reserved_iova_domain(struct iommu_domain *domain)
+static void __arm_smmu_free_reserved_iova_domain(struct arm_smmu_domain *sd)
{
- struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
- struct iova_domain *iovad = smmu_domain->reserved_iova_domain;
+ struct iova_domain *iovad = sd->reserved_iova_domain;

if (!iovad)
return;

- mutex_lock(&smmu_domain->reserved_mutex);
-
put_iova_domain(iovad);
kfree(iovad);
+}

+static void arm_smmu_free_reserved_iova_domain(struct iommu_domain *domain)
+{
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+ __arm_smmu_free_reserved_iova_domain(smmu_domain);
mutex_unlock(&smmu_domain->reserved_mutex);
}

@@ -1675,6 +1681,24 @@ unlock:
mutex_unlock(&smmu_domain->reserved_mutex);
}

+static void arm_smmu_unmap_reserved(struct iommu_domain *domain)
+{
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ struct rb_node *node;
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+ while ((node = rb_first(&smmu_domain->reserved_binding_list))) {
+ struct arm_smmu_reserved_binding *b =
+ rb_entry(node, struct arm_smmu_reserved_binding, node);
+
+ while (!kref_put(&b->kref, reserved_binding_release))
+ ;
+ }
+ smmu_domain->reserved_binding_list = RB_ROOT;
+ __arm_smmu_free_reserved_iova_domain(smmu_domain);
+ mutex_unlock(&smmu_domain->reserved_mutex);
+}
+
static struct iommu_ops arm_smmu_ops = {
.capable = arm_smmu_capable,
.domain_alloc = arm_smmu_domain_alloc,
--
1.9.1

Eric Auger

Feb 11, 2016, 9:39:48 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
This patch introduces iommu_get/put_single_reserved.

iommu_get_single_reserved allows allocating a new reserved IOVA page
and mapping it onto the physical page that contains a given physical
address. It returns the IOVA that is mapped onto the provided physical
address. Hence the physical address passed in argument does not need
to be aligned.

In case a mapping already exists between both pages, the IOVA mapped
to the PA is directly returned.

Each time an IOVA is successfully returned, a binding ref count is
incremented.

iommu_put_single_reserved decrements the ref count and, when it reaches
zero, the mapping is destroyed and the IOVA is released.

Signed-off-by: Eric Auger <eric....@linaro.org>
Signed-off-by: Ankit Jindal <aji...@apm.com>
Signed-off-by: Pranavkumar Sawargaonkar <prana...@linaro.org>
Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>

---

Currently the ref counting is not really used. All bindings are
destroyed when the domain is killed.

v1 -> v2:
- previously a VFIO API, named vfio_alloc_map/unmap_free_reserved_iova
---
drivers/iommu/iommu.c | 21 +++++++++++++++++++++
include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a994f34..14ebde1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1415,6 +1415,27 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
return unmapped;
}
EXPORT_SYMBOL_GPL(iommu_unmap);
+int iommu_get_single_reserved(struct iommu_domain *domain,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova)
+{
+ if (!domain->ops->get_single_reserved)
+ return -ENODEV;
+
+ return domain->ops->get_single_reserved(domain, addr, prot, iova);
+
+}
+EXPORT_SYMBOL_GPL(iommu_get_single_reserved);
+
+void iommu_put_single_reserved(struct iommu_domain *domain,
+ dma_addr_t iova)
+{
+ if (!domain->ops->put_single_reserved)
+ return;
+
+ domain->ops->put_single_reserved(domain, iova);
+}
+EXPORT_SYMBOL_GPL(iommu_put_single_reserved);

size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
struct scatterlist *sg, unsigned int nents, int prot)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32c1a4e..148465b8 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -201,6 +201,21 @@ struct iommu_ops {
unsigned long order);
/* frees the reserved iova domain */
void (*free_reserved_iova_domain)(struct iommu_domain *domain);
+ /**
+ * allocate a reserved iova page and bind it onto the page that
+ * contains a physical address (@addr), returns the @iova bound to
+ * @addr. In case the two pages are already bound, simply return @iova
+ * and increment a ref count.
+ */
+ int (*get_single_reserved)(struct iommu_domain *domain,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova);
+ /**
+ * decrement a ref count of the iova page. If null, unmap the iova page
+ * and release the iova
+ */
+ void (*put_single_reserved)(struct iommu_domain *domain,
+ dma_addr_t iova);

#ifdef CONFIG_OF_IOMMU
int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
@@ -276,6 +291,11 @@ extern int iommu_alloc_reserved_iova_domain(struct iommu_domain *domain,
dma_addr_t iova, size_t size,
unsigned long order);
extern void iommu_free_reserved_iova_domain(struct iommu_domain *domain);
+extern int iommu_get_single_reserved(struct iommu_domain *domain,
+ phys_addr_t paddr, int prot,
+ dma_addr_t *iova);
+extern void iommu_put_single_reserved(struct iommu_domain *domain,
+ dma_addr_t iova);
struct device *iommu_device_create(struct device *parent, void *drvdata,
const struct attribute_group **groups,
const char *fmt, ...) __printf(4, 5);
@@ -562,6 +582,17 @@ static void iommu_free_reserved_iova_domain(struct iommu_domain *domain)
{
}

+static int iommu_get_single_reserved(struct iommu_domain *domain,
+ phys_addr_t paddr, int prot,
+ dma_addr_t *iova)
+{
+ return -EINVAL;
+}
+static void iommu_put_single_reserved(struct iommu_domain *domain,
+ dma_addr_t iova)
+{
+}

Eric Auger

Feb 11, 2016, 9:40:22 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
We will need to track which host physical addresses are mapped to
reserved IOVAs. To that end we introduce a new RB tree indexed
by physical address. This RB tree is only used for reserved IOVA
bindings.

It is expected this RB tree will contain very few bindings. Those
generally correspond to single page mapping one MSI frame (GICv2m
frame or ITS GITS_TRANSLATER frame).

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/iommu/arm-smmu.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index f42341d..729a4c6 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -349,10 +349,21 @@ struct arm_smmu_domain {
struct mutex init_mutex; /* Protects smmu pointer */
struct iommu_domain domain;
struct iova_domain *reserved_iova_domain;
- /* protects reserved domain manipulation */
+ /* rb tree indexed by PA, for reserved bindings only */
+ struct rb_root reserved_binding_list;
+ /* protects reserved domain and rbtree manipulation */
struct mutex reserved_mutex;
};

+struct arm_smmu_reserved_binding {
+ struct kref kref;
+ struct rb_node node;
+ struct arm_smmu_domain *domain;
+ phys_addr_t addr;
+ dma_addr_t iova;
+ size_t size;
+};
+
static struct iommu_ops arm_smmu_ops;

static DEFINE_SPINLOCK(arm_smmu_devices_lock);
@@ -400,6 +411,57 @@ static struct device_node *dev_get_dev_node(struct device *dev)
return dev->of_node;
}

+/* Reserved binding RB-tree manipulation */
+
+static struct arm_smmu_reserved_binding *find_reserved_binding(
+ struct arm_smmu_domain *d,
+ phys_addr_t start, size_t size)
+{
+ struct rb_node *node = d->reserved_binding_list.rb_node;
+
+ while (node) {
+ struct arm_smmu_reserved_binding *binding =
+ rb_entry(node, struct arm_smmu_reserved_binding, node);
+
+ if (start + size <= binding->addr)
+ node = node->rb_left;
+ else if (start >= binding->addr + binding->size)
+ node = node->rb_right;
+ else
+ return binding;
+ }
+
+ return NULL;
+}
+
+static void link_reserved_binding(struct arm_smmu_domain *d,
+ struct arm_smmu_reserved_binding *new)
+{
+ struct rb_node **link = &d->reserved_binding_list.rb_node;
+ struct rb_node *parent = NULL;
+ struct arm_smmu_reserved_binding *binding;
+
+ while (*link) {
+ parent = *link;
+ binding = rb_entry(parent, struct arm_smmu_reserved_binding,
+ node);
+
+ if (new->addr + new->size <= binding->addr)
+ link = &(*link)->rb_left;
+ else
+ link = &(*link)->rb_right;
+ }
+
+ rb_link_node(&new->node, parent, link);
+ rb_insert_color(&new->node, &d->reserved_binding_list);
+}
+
+static void unlink_reserved_binding(struct arm_smmu_domain *d,
+ struct arm_smmu_reserved_binding *old)
+{
+ rb_erase(&old->node, &d->reserved_binding_list);
+}
+
static struct arm_smmu_master *find_smmu_master(struct arm_smmu_device *smmu,
struct device_node *dev_node)
{
@@ -981,6 +1043,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
mutex_init(&smmu_domain->init_mutex);
mutex_init(&smmu_domain->reserved_mutex);
spin_lock_init(&smmu_domain->pgtbl_lock);
+ smmu_domain->reserved_binding_list = RB_ROOT;

return &smmu_domain->domain;
}
--
1.9.1

Eric Auger

Feb 11, 2016, 9:41:36 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
We introduce a vfio_dma type since we will need to discriminate
legacy vfio_dma entries from new reserved ones. Since the latter are
not mapped at registration time, some code paths need to be reworked:
removal and replay. For now those paths simply skip reserved entries;
subsequent patches will rework them.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/vfio/vfio_iommu_type1.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c5b57e1..b9326c9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -53,6 +53,15 @@ module_param_named(disable_hugepages,
MODULE_PARM_DESC(disable_hugepages,
"Disable VFIO IOMMU support for IOMMU hugepages.");

+enum vfio_iova_type {
+ VFIO_IOVA_USER = 0, /* standard IOVA used to map user vaddr */
+ /*
+ * IOVA reserved to map special host physical addresses,
+ * MSI frames for instance
+ */
+ VFIO_IOVA_RESERVED,
+};
+
struct vfio_iommu {
struct list_head domain_list;
struct mutex lock;
@@ -75,6 +84,7 @@ struct vfio_dma {
unsigned long vaddr; /* Process virtual addr */
size_t size; /* Map size (bytes) */
int prot; /* IOMMU_READ/WRITE */
+ enum vfio_iova_type type; /* type of IOVA */
};

struct vfio_group {
@@ -418,7 +428,8 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)

static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
- vfio_unmap_unpin(iommu, dma);
+ if (likely(dma->type != VFIO_IOVA_RESERVED))
+ vfio_unmap_unpin(iommu, dma);
vfio_unlink_dma(iommu, dma);
kfree(dma);
}
@@ -694,6 +705,10 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
dma_addr_t iova;

dma = rb_entry(n, struct vfio_dma, node);
+
+ if (unlikely(dma->type == VFIO_IOVA_RESERVED))
+ continue;
+
iova = dma->iova;

while (iova < dma->iova + dma->size) {
--
1.9.1

Eric Auger

Feb 11, 2016, 9:41:52 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
This patch allows userspace to query whether MSI write transaction
addresses must be mapped. This is returned through the
VFIO_IOMMU_GET_INFO API and its new flag: VFIO_IOMMU_INFO_REQUIRE_MSI_MAP.

Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>
Signed-off-by: Eric Auger <eric....@linaro.org>

---

RFC v1 -> v1:
- derived from
[RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
- renamed allow_msi_reconfig into require_msi_mapping
- fixed VFIO_IOMMU_GET_INFO
---
drivers/vfio/vfio_iommu_type1.c | 26 ++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 1 +
2 files changed, 27 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 6f1ea3d..c5b57e1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -255,6 +255,29 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
}

/*
+ * vfio_domains_require_msi_mapping: indicates whether MSI write transaction
+ * addresses must be mapped
+ *
+ * returns true if it does
+ */
+static bool vfio_domains_require_msi_mapping(struct vfio_iommu *iommu)
+{
+ struct vfio_domain *d;
+ bool ret;
+
+ mutex_lock(&iommu->lock);
+ /* All domains have same require_msi_map property, pick first */
+ d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
+ if (iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_MAPPING, NULL) < 0)
+ ret = false;
+ else
+ ret = true;
+ mutex_unlock(&iommu->lock);
+
+ return ret;
+}
+
+/*
* Attempt to pin pages. We really don't want to track all the pfns and
* the iommu can only map chunks of consecutive pfns anyway, so get the
* first page and all consecutive pages with the same locking.
@@ -997,6 +1020,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,

info.flags = VFIO_IOMMU_INFO_PGSIZES;

+ if (vfio_domains_require_msi_mapping(iommu))
+ info.flags |= VFIO_IOMMU_INFO_REQUIRE_MSI_MAP;
+
info.iova_pgsizes = vfio_pgsize_bitmap(iommu);

return copy_to_user((void __user *)arg, &info, minsz);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 7d7a4c6..43e183b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -400,6 +400,7 @@ struct vfio_iommu_type1_info {
__u32 argsz;
__u32 flags;
#define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
+#define VFIO_IOMMU_INFO_REQUIRE_MSI_MAP (1 << 1)/* MSI must be mapped */
__u64 iova_pgsizes; /* Bitmap of supported page sizes */
};

--
1.9.1

Eric Auger

Feb 11, 2016, 9:42:25 AM2/11/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Introduce a new DOMAIN_ATTR_MSI_MAPPING domain attribute. If supported,
this means the MSI addresses need to be mapped in the IOMMU. ARM SMMUs
and FSL PAMU, at least, expose this attribute.

x86 IOMMUs typically don't expose the attribute since on x86, MSI write
transaction addresses are always within the 1MB PA region [FEE0_0000h -
FEF0_0000h], which directly targets the APIC configuration space and
hence bypasses the IOMMU.

Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>
Signed-off-by: Eric Auger <eric....@linaro.org>

---

RFC v1 -> v1:
- the data field is not used
- for this attribute, domain_get_attr simply returns 0 if the MSI_MAPPING
capability is supported or <0 if not.
- removed struct iommu_domain_msi_maps
---
drivers/iommu/arm-smmu.c | 2 ++
drivers/iommu/fsl_pamu_domain.c | 2 ++
include/linux/iommu.h | 1 +
3 files changed, 5 insertions(+)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 59ee4b8..c8b7e71 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1409,6 +1409,8 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
case DOMAIN_ATTR_NESTING:
*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
return 0;
+ case DOMAIN_ATTR_MSI_MAPPING:
+ return 0;
default:
return -ENODEV;
}
diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index da0e1e3..46d5c6a 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -856,6 +856,8 @@ static int fsl_pamu_get_domain_attr(struct iommu_domain *domain,
case DOMAIN_ATTR_FSL_PAMUV1:
*(int *)data = DOMAIN_ATTR_FSL_PAMUV1;
break;
+ case DOMAIN_ATTR_MSI_MAPPING:
+ break;
default:
pr_debug("Unsupported attribute type\n");
ret = -EINVAL;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a5c539f..a4fe04a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -112,6 +112,7 @@ enum iommu_attr {
DOMAIN_ATTR_FSL_PAMU_ENABLE,
DOMAIN_ATTR_FSL_PAMUV1,
DOMAIN_ATTR_NESTING, /* two stages of translation */
+ DOMAIN_ATTR_MSI_MAPPING, /* Require MSIs mapping in iommu */
DOMAIN_ATTR_MAX,
};

--
1.9.1

Eric Auger

Feb 12, 2016, 3:14:11 AM2/12/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
- Not tested: ARM with SR-IOV, ARM GICv3 ITS, ...

References:
[1] [RFC 0/2] VFIO: Add virtual MSI doorbell support
(https://lkml.org/lkml/2015/7/24/135)
[2] [RFC PATCH 0/6] vfio: Add interface to map MSI pages
(https://lists.cs.columbia.edu/pipermail/kvmarm/2015-September/016607.html)
[3] [PATCH v2 0/3] Introduce MSI hardware mapping for VFIO
(http://permalink.gmane.org/gmane.comp.emulators.kvm.arm.devel/3858)

Git:
https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.5-rc3-pcie-passthrough-rfcv3

previous version at
v2: https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.5-rc3-pcie-passthrough-rfcv2
v1: https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.5-rc1-pcie-passthrough-v1

QEMU Integration:
[RFC v2 0/8] KVM PCI/MSI passthrough with mach-virt
(http://lists.gnu.org/archive/html/qemu-arm/2016-01/msg00444.html)
https://git.linaro.org/people/eric.auger/qemu.git/shortlog/refs/heads/v2.5.0-pci-passthrough-rfc-v2

User Hints:
To allow PCI/MSI passthrough with GICv2M, compile VFIO as a module and
load the vfio_iommu_type1 module with allow_unsafe_interrupts param:
sudo modprobe -v vfio-pci
sudo modprobe -r vfio_iommu_type1
sudo modprobe -v vfio_iommu_type1 allow_unsafe_interrupts=1

History:

RFC v2 -> RFC v3:
- should fix wrong handling of some CONFIG combinations:
CONFIG_IOVA, CONFIG_IOMMU_API, CONFIG_PCI_MSI_IRQ_DOMAIN
- fix MSI_FLAG_IRQ_REMAPPING setting in GICv3 ITS (although not tested)

PATCH v1 -> RFC v2:
- reverted to RFC since it looks more reasonable ;-) the code is split
between VFIO, IOMMU, MSI controller and I am not sure I made the right
choices. Also the API needs to be further discussed.
- iova API usage in arm-smmu.c.
- MSI controller natively programs the MSI addr with either the PA or IOVA.
This is not done anymore in vfio-pci driver as suggested by Alex.
- check irq remapping capability of the group

RFC v1 [2] -> PATCH v1:
- use the existing dma map/unmap ioctl interface with a flag to register a
drivers/iommu/Kconfig | 2 +
drivers/iommu/arm-smmu.c | 292 +++++++++++++++++++++++++++++--
drivers/iommu/fsl_pamu_domain.c | 2 +
drivers/iommu/iommu.c | 43 +++++
drivers/irqchip/irq-gic-common.c | 69 ++++++++
drivers/irqchip/irq-gic-common.h | 5 +
drivers/irqchip/irq-gic-v2m.c | 7 +-
drivers/irqchip/irq-gic-v3-its-pci-msi.c | 8 +-
drivers/vfio/vfio_iommu_type1.c | 157 ++++++++++++++++-
include/linux/iommu.h | 31 ++++
include/linux/msi.h | 2 +
include/uapi/linux/vfio.h | 10 ++
kernel/irq/msi.c | 1 +
13 files changed, 607 insertions(+), 22 deletions(-)

--
1.9.1

Eric Auger

Feb 12, 2016, 3:14:22 AM2/12/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com

Eric Auger

Feb 12, 2016, 3:14:37 AM2/12/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
We will need to track which host physical addresses are mapped to
reserved IOVAs. To that end we introduce a new RB tree indexed
by physical address. This RB tree is used only for reserved IOVA
bindings.

This RB tree is expected to contain very few bindings. Those
generally correspond to a single page mapping one MSI frame (a GICv2m
frame or the ITS GITS_TRANSLATER frame).

Signed-off-by: Eric Auger <eric....@linaro.org>

---
---
drivers/iommu/arm-smmu.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index f42341d..729a4c6 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c

Eric Auger

Feb 12, 2016, 3:14:41 AM2/12/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Let's introduce a new msi_domain_info flag value, MSI_FLAG_IRQ_REMAPPING,
meant to tell that the domain supports IRQ remapping, also known as
Interrupt Translation Service. On Intel HW this capability is abstracted
on the IOMMU side while on ARM it is abstracted on the MSI controller side.

GICv3 ITS HW is the first HW advertising that feature.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/irqchip/irq-gic-v3-its-pci-msi.c | 3 ++-
include/linux/msi.h | 2 ++
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
index aee60ed..8223765 100644
--- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
@@ -96,7 +96,8 @@ static struct msi_domain_ops its_pci_msi_ops = {

static struct msi_domain_info its_pci_msi_domain_info = {
.flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
- MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX),
+ MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX |
+ MSI_FLAG_IRQ_REMAPPING),
.ops = &its_pci_msi_ops,
.chip = &its_msi_irq_chip,

Eric Auger

Feb 12, 2016, 3:14:46 AM2/12/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
We plan to use msi_get_domain_info in the VFIO module, so let's export it.

Signed-off-by: Eric Auger <eric....@linaro.org>

---

v2 -> v3:
- remove static implementation in case CONFIG_PCI_MSI_IRQ_DOMAIN is not set
---
kernel/irq/msi.c | 1 +
1 file changed, 1 insertion(+)

Eric Auger

Feb 12, 2016, 3:14:51 AM2/12/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Do not advertise IOMMU_CAP_INTR_REMAP for arm-smmu. Indeed the
IRQ remapping capability is abstracted on the irqchip side for ARM, as
opposed to Intel IOMMUs featuring IRQ remapping HW.

So to check the IRQ remapping capability, the MSI domain needs to be
checked instead.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/iommu/arm-smmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ae8a97d..9a83285 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c

Eric Auger

Feb 12, 2016, 3:15:12 AM2/12/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
In case the msi_desc references a device attached to an IOMMU
domain, the MSI address needs to be mapped in the IOMMU. Otherwise any
MSI write transaction will cause a fault.

gic_set_msi_addr detects that case and allocates an IOVA bound
to the physical address page comprising the MSI frame. This IOVA
is then used as the msi_msg address. The unset operation decrements the
reference on the binding.

The functions are called in the irq_write_msi_msg ops implementation.
At that time we can recognize whether the MSI is being set up or torn
down by looking at the msi_msg content. Indeed msi_domain_deactivate
zeroes all the fields.

Signed-off-by: Eric Auger <eric....@linaro.org>

---

v2 -> v3:
- protect iova/addr manipulation with CONFIG_ARCH_DMA_ADDR_T_64BIT and
CONFIG_PHYS_ADDR_T_64BIT
- only expose gic_pci_msi_domain_write_msg in case CONFIG_IOMMU_API &
CONFIG_PCI_MSI_IRQ_DOMAIN are set.
- gic_set/unset_msi_addr duly become static
---
drivers/irqchip/irq-gic-common.c | 69 ++++++++++++++++++++++++++++++++
drivers/irqchip/irq-gic-common.h | 5 +++
drivers/irqchip/irq-gic-v2m.c | 7 +++-
drivers/irqchip/irq-gic-v3-its-pci-msi.c | 5 +++
4 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-common.c b/drivers/irqchip/irq-gic-common.c
index f174ce0..46cd06c 100644
--- a/drivers/irqchip/irq-gic-common.c
+++ b/drivers/irqchip/irq-gic-common.c
@@ -18,6 +18,8 @@
#include <linux/io.h>
#include <linux/irq.h>
#include <linux/irqchip/arm-gic.h>
+#include <linux/iommu.h>
+#include <linux/msi.h>

#include "irq-gic-common.h"

@@ -121,3 +123,70 @@ void gic_cpu_config(void __iomem *base, void (*sync_access)(void))
if (sync_access)
sync_access();
}
+
+#if defined(CONFIG_IOMMU_API) && defined(CONFIG_PCI_MSI_IRQ_DOMAIN)
+static int gic_set_msi_addr(struct irq_data *data, struct msi_msg *msg)
+{
+ struct msi_desc *desc = irq_data_get_msi_desc(data);
+ struct device *dev = msi_desc_to_dev(desc);
+ struct iommu_domain *d;
+ phys_addr_t addr;
+ dma_addr_t iova;
+ int ret;
+
+ d = iommu_get_domain_for_dev(dev);
+ if (!d)
+ return 0;
+#ifdef CONFIG_PHYS_ADDR_T_64BIT
+ addr = ((phys_addr_t)(msg->address_hi) << 32) | msg->address_lo;
+#else
+ addr = msg->address_lo;
+#endif
+
+ ret = iommu_get_single_reserved(d, addr, IOMMU_WRITE, &iova);
+
+ if (!ret) {
+ msg->address_lo = lower_32_bits(iova);
+ msg->address_hi = upper_32_bits(iova);
+ }
+ return ret;
+}
+
+
+static void gic_unset_msi_addr(struct irq_data *data)
+{
+ struct msi_desc *desc = irq_data_get_msi_desc(data);
+ struct device *dev;
+ struct iommu_domain *d;
+ dma_addr_t iova;
+
+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+ iova = ((dma_addr_t)(desc->msg.address_hi) << 32) |
+ desc->msg.address_lo;
+#else
+ iova = desc->msg.address_lo;
+#endif
+
+ dev = msi_desc_to_dev(desc);
+ if (!dev)
+ return;
+
+ d = iommu_get_domain_for_dev(dev);
+ if (!d)
+ return;
+
+ iommu_put_single_reserved(d, iova);
+}
+
+void gic_pci_msi_domain_write_msg(struct irq_data *irq_data,
+ struct msi_msg *msg)
+{
+ if (!msg->address_hi && !msg->address_lo && !msg->data)
+ gic_unset_msi_addr(irq_data); /* deactivate */
+ else
+ gic_set_msi_addr(irq_data, msg); /* activate, set_affinity */
+
+ pci_msi_domain_write_msg(irq_data, msg);
+}
+#endif
+
diff --git a/drivers/irqchip/irq-gic-common.h b/drivers/irqchip/irq-gic-common.h
index fff697d..98681fd 100644
--- a/drivers/irqchip/irq-gic-common.h
+++ b/drivers/irqchip/irq-gic-common.h
@@ -35,4 +35,9 @@ void gic_cpu_config(void __iomem *base, void (*sync_access)(void));
void gic_enable_quirks(u32 iidr, const struct gic_quirk *quirks,
void *data);

+#if defined(CONFIG_PCI_MSI_IRQ_DOMAIN) && defined(CONFIG_IOMMU_API)
+void gic_pci_msi_domain_write_msg(struct irq_data *irq_data,
+ struct msi_msg *msg);
+#endif
+
#endif /* _IRQ_GIC_COMMON_H */
diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
index c779f83..692d809 100644
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -24,6 +24,7 @@
#include <linux/of_pci.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include "irq-gic-common.h"

/*
* MSI_TYPER:
@@ -83,7 +84,11 @@ static struct irq_chip gicv2m_msi_irq_chip = {
.irq_mask = gicv2m_mask_msi_irq,
.irq_unmask = gicv2m_unmask_msi_irq,
.irq_eoi = irq_chip_eoi_parent,
- .irq_write_msi_msg = pci_msi_domain_write_msg,
+#ifdef CONFIG_IOMMU_API
+ .irq_write_msi_msg = gic_pci_msi_domain_write_msg,
+#else
+ .irq_write_msi_msg = pci_msi_domain_write_msg,
+#endif
};

static struct msi_domain_info gicv2m_msi_domain_info = {
diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
index 8223765..690504e 100644
--- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
@@ -19,6 +19,7 @@
#include <linux/of.h>
#include <linux/of_irq.h>
#include <linux/of_pci.h>
+#include "irq-gic-common.h"

static void its_mask_msi_irq(struct irq_data *d)
{
@@ -37,7 +38,11 @@ static struct irq_chip its_msi_irq_chip = {
.irq_unmask = its_unmask_msi_irq,
.irq_mask = its_mask_msi_irq,
.irq_eoi = irq_chip_eoi_parent,
+#ifdef CONFIG_IOMMU_API
+ .irq_write_msi_msg = gic_pci_msi_domain_write_msg,
+#else
.irq_write_msi_msg = pci_msi_domain_write_msg,
+#endif
};

struct its_pci_alias {
--
1.9.1

Eric Auger

Feb 12, 2016, 3:15:53 AM2/12/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
On x86, IRQ remapping is abstracted by the IOMMU. On ARM it is abstracted
by the MSI controller. vfio_msi_parent_irq_remapping_capable allows
checking whether interrupts are "safe" for a given device. They are if
the device does not use MSI, or if the device uses MSI and the msi-parent
controller supports IRQ remapping.

Then we check at group level whether all devices have safe interrupts:
if not, the group is only allowed to be attached when
allow_unsafe_interrupts is set.

At this point the ARM SMMU driver still advertises IOMMU_CAP_INTR_REMAP.
This is changed in the next patch.

Signed-off-by: Eric Auger <eric....@linaro.org>

---

v2 -> v3:
- protect vfio_msi_parent_irq_remapping_capable with
CONFIG_GENERIC_MSI_IRQ_DOMAIN
---
drivers/vfio/vfio_iommu_type1.c | 39 +++++++++++++++++++++++++++++++++++++--
1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c5d3b48..3afb815 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,8 @@
#include <linux/uaccess.h>
#include <linux/vfio.h>
#include <linux/workqueue.h>
+#include <linux/irqdomain.h>
+#include <linux/msi.h>

#define DRIVER_VERSION "0.2"
#define DRIVER_AUTHOR "Alex Williamson <alex.wi...@redhat.com>"
@@ -754,6 +756,32 @@ static int vfio_bus_type(struct device *dev, void *data)
return 0;
}

+/**
+ * vfio_msi_parent_irq_remapping_capable: returns whether the device msi-parent
+ * controller supports IRQ remapping, aka interrupt translation
+ *
+ * @dev: device handle
+ * @data: unused
+ * returns 0 if irq remapping is supported or -1 if not supported.
+ */
+static int vfio_msi_parent_irq_remapping_capable(struct device *dev, void *data)
+{
+#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+ struct irq_domain *domain;
+ struct msi_domain_info *info;
+
+ domain = dev_get_msi_domain(dev);
+ if (!domain)
+ return 0;
+
+ info = msi_get_domain_info(domain);
+
+ if (!(info->flags & MSI_FLAG_IRQ_REMAPPING))
+ return -1;
+#endif
+ return 0;
+}
+
static int vfio_iommu_replay(struct vfio_iommu *iommu,
struct vfio_domain *domain)
{
@@ -848,7 +876,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
struct vfio_group *group, *g;
struct vfio_domain *domain, *d;
struct bus_type *bus = NULL;
- int ret;
+ int ret, irq_remapping;

mutex_lock(&iommu->lock);

@@ -871,6 +899,13 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,

group->iommu_group = iommu_group;

+ /*
+ * Determine if all the devices of the group have an MSI-parent that
+ * supports IRQ remapping
+ */
+ irq_remapping = !iommu_group_for_each_dev(iommu_group, &bus,
+ vfio_msi_parent_irq_remapping_capable);
+
/* Determine bus_type in order to allocate a domain */
ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
if (ret)
@@ -899,7 +934,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,

Eric Auger

Feb 12, 2016, 3:16:39 AM2/12/16
to eric....@st.com, eric....@linaro.org, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Implement alloc/free_reserved_iova_domain for arm-smmu. We use
the IOVA allocator (iova.c). The iova_domain is attached to the
arm_smmu_domain struct. A mutex is introduced to protect it.

Signed-off-by: Eric Auger <eric....@linaro.org>

---
v2 -> v3:
- select IOMMU_IOVA when ARM_SMMU or ARM_SMMU_V3 is set

v1 -> v2:
- formerly implemented in vfio_iommu_type1
---
drivers/iommu/Kconfig | 2 ++
drivers/iommu/arm-smmu.c | 87 +++++++++++++++++++++++++++++++++++++++---------
2 files changed, 74 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index a1e75cb..1106528 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -289,6 +289,7 @@ config ARM_SMMU
bool "ARM Ltd. System MMU (SMMU) Support"
depends on (ARM64 || ARM) && MMU
select IOMMU_API
+ select IOMMU_IOVA
select IOMMU_IO_PGTABLE_LPAE
select ARM_DMA_USE_IOMMU if ARM
help
@@ -302,6 +303,7 @@ config ARM_SMMU_V3
bool "ARM Ltd. System MMU Version 3 (SMMUv3) Support"
depends on ARM64 && PCI
select IOMMU_API
+ select IOMMU_IOVA
select IOMMU_IO_PGTABLE_LPAE
select GENERIC_MSI_IRQ_DOMAIN
help
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index c8b7e71..f42341d 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -42,6 +42,7 @@
#include <linux/platform_device.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/iova.h>

#include <linux/amba/bus.h>

@@ -347,6 +348,9 @@ struct arm_smmu_domain {
enum arm_smmu_domain_stage stage;
struct mutex init_mutex; /* Protects smmu pointer */
struct iommu_domain domain;
+ struct iova_domain *reserved_iova_domain;
+ /* protects reserved domain manipulation */
+ struct mutex reserved_mutex;
};

static struct iommu_ops arm_smmu_ops;
@@ -975,6 +979,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
return NULL;

mutex_init(&smmu_domain->init_mutex);
+ mutex_init(&smmu_domain->reserved_mutex);
spin_lock_init(&smmu_domain->pgtbl_lock);

return &smmu_domain->domain;
@@ -1446,22 +1451,74 @@ out_unlock:
return ret;
}

+static int arm_smmu_alloc_reserved_iova_domain(struct iommu_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order)
+{
+ unsigned long granule, mask;
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ int ret = 0;
+
+ granule = 1UL << order;
+ mask = granule - 1;
+ if (iova & mask || (!size) || (size & mask))
+ return -EINVAL;
+
+ if (smmu_domain->reserved_iova_domain)
+ return -EEXIST;
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+
+ smmu_domain->reserved_iova_domain =
+ kzalloc(sizeof(struct iova_domain), GFP_KERNEL);
+ if (!smmu_domain->reserved_iova_domain) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+
+ init_iova_domain(smmu_domain->reserved_iova_domain,
+ granule, iova >> order, (iova + size - 1) >> order);
+
+unlock:
+ mutex_unlock(&smmu_domain->reserved_mutex);
+ return ret;
+}
+
+static void arm_smmu_free_reserved_iova_domain(struct iommu_domain *domain)
+{
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ struct iova_domain *iovad = smmu_domain->reserved_iova_domain;
+
+ if (!iovad)
+ return;
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+
+ put_iova_domain(iovad);
+ kfree(iovad);
+
+ mutex_unlock(&smmu_domain->reserved_mutex);
+}
+
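The granule check at the top of arm_smmu_alloc_reserved_iova_domain() can be modelled in plain userspace C. This is a sketch only: check_reserved_window() is a hypothetical helper, not a kernel function, but the arithmetic matches the patch: the reserved window must be non-empty and both its base and size aligned to the IOMMU page granule.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of the validation done by
 * arm_smmu_alloc_reserved_iova_domain(): reject an empty window or one
 * whose base/size is not aligned to the granule (1 << order).
 */
static int check_reserved_window(uint64_t iova, size_t size, unsigned long order)
{
	uint64_t granule = 1ULL << order;
	uint64_t mask = granule - 1;

	if ((iova & mask) || !size || (size & mask))
		return -EINVAL;
	/* the aligned window is then handed to init_iova_domain() as PFNs */
	return 0;
}
```

With a 4K granule (order 12), a 64K window at 0x8000000 passes, while a misaligned base or zero size is rejected, exactly as in the patch.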

Eric Auger
Feb 12, 2016, 3:16:50 AM
Introduce alloc/free_reserved_iova_domain in the IOMMU API.
alloc_reserved_iova_domain initializes an iova domain at a given
iova base address and with a given size. This iova domain will
be used to allocate IOVAs within that window. Those IOVAs will be
reserved for special purposes, typically MSI frame binding. Allocation
functions within the reserved iova domain will be introduced in
subsequent patches.

It is the responsibility of the API user to make sure any IOVAs
belonging to that domain are allocated with those allocation functions.

Signed-off-by: Eric Auger <eric....@linaro.org>

---
v2 -> v3:
- remove iommu_alloc_reserved_iova_domain & iommu_free_reserved_iova_domain
static implementation in case CONFIG_IOMMU_API is not set

v1 -> v2:
- moved from vfio API to IOMMU API
---
drivers/iommu/iommu.c | 22 ++++++++++++++++++++++
include/linux/iommu.h | 10 ++++++++++
2 files changed, 32 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0e3b009..a994f34 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1557,6 +1557,28 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
}
EXPORT_SYMBOL_GPL(iommu_domain_set_attr);

+int iommu_alloc_reserved_iova_domain(struct iommu_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order)
+{
+ int ret;
+
+ if (!domain->ops->alloc_reserved_iova_domain)
+ return -EINVAL;
+ ret = domain->ops->alloc_reserved_iova_domain(domain, iova,
+ size, order);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_alloc_reserved_iova_domain);
+
+void iommu_free_reserved_iova_domain(struct iommu_domain *domain)
+{
+ if (!domain->ops->free_reserved_iova_domain)
+ return;
+ domain->ops->free_reserved_iova_domain(domain);
+}
+EXPORT_SYMBOL_GPL(iommu_free_reserved_iova_domain);
+
void iommu_get_dm_regions(struct device *dev, struct list_head *list)
{
const struct iommu_ops *ops = dev->bus->iommu_ops;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a4fe04a..2d1f155 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -195,6 +195,12 @@ struct iommu_ops {
int (*domain_set_windows)(struct iommu_domain *domain, u32 w_count);
/* Get the number of windows per domain */
u32 (*domain_get_windows)(struct iommu_domain *domain);
+ /* allocates the reserved iova domain */
+ int (*alloc_reserved_iova_domain)(struct iommu_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order);
+ /* frees the reserved iova domain */
+ void (*free_reserved_iova_domain)(struct iommu_domain *domain);

#ifdef CONFIG_OF_IOMMU
int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
@@ -266,6 +272,10 @@ extern int iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr,
void *data);
extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
void *data);
+extern int iommu_alloc_reserved_iova_domain(struct iommu_domain *domain,
+ dma_addr_t iova, size_t size,
+ unsigned long order);
+extern void iommu_free_reserved_iova_domain(struct iommu_domain *domain);
struct device *iommu_device_create(struct device *parent, void *drvdata,
const struct attribute_group **groups,
const char *fmt, ...) __printf(4, 5);
--
1.9.1
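The iommu_alloc_reserved_iova_domain() wrapper above follows the usual optional-callback dispatch idiom of the IOMMU API: if the driver does not implement the hook, the call fails with -EINVAL. A self-contained userspace model of that pattern (all toy_* names are hypothetical, not kernel identifiers):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Dispatch an optional per-driver callback, as the new
 * iommu_alloc_reserved_iova_domain() wrapper does. */
struct toy_ops {
	int (*alloc_reserved)(uint64_t iova, size_t size);
};

struct toy_domain {
	const struct toy_ops *ops;
};

static int toy_alloc_reserved(struct toy_domain *d, uint64_t iova, size_t size)
{
	if (!d->ops->alloc_reserved)
		return -EINVAL;	/* driver did not implement the hook */
	return d->ops->alloc_reserved(iova, size);
}

static int always_ok(uint64_t iova, size_t size)
{
	(void)iova;
	(void)size;
	return 0;
}

static const struct toy_ops with_hook = { .alloc_reserved = always_ok };
static const struct toy_ops without_hook = { .alloc_reserved = NULL };

static struct toy_domain dom_with = { &with_hook };
static struct toy_domain dom_without = { &without_hook };
```

A domain backed by a driver that wires the hook succeeds; one that leaves it NULL gets -EINVAL, mirroring the wrapper in iommu.c.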

Eric Auger
Feb 12, 2016, 3:16:52 AM
The user is allowed to register a reserved IOVA range by using the
DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
The user provides the base address and the size. This region is stored in
the vfio_dma rb tree. At that point the iova range is not mapped to any
target address yet. The host kernel will use those IOVAs when needed,
typically when the VFIO-PCI device allocates its MSIs.

This patch also handles the destruction of the reserved binding RB-tree and
domain's iova_domains.

Signed-off-by: Eric Auger <eric....@linaro.org>
Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>

---

v1 -> v2:
- set returned value according to alloc_reserved_iova_domain result
- free the iova domains in case any error occurs

RFC v1 -> v1:
- takes into account Alex comments, based on
[RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
- use the existing dma map/unmap ioctl interface with a flag to register
a reserved IOVA range. A single reserved iova region is allowed.
---
drivers/vfio/vfio_iommu_type1.c | 75 ++++++++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 9 +++++
2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b9326c9..c5d3b48 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -673,6 +673,75 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
return ret;
}

+static int vfio_register_reserved_iova_range(struct vfio_iommu *iommu,
+ struct vfio_iommu_type1_dma_map *map)
+{
+ dma_addr_t iova = map->iova;
+ size_t size = map->size;
+ uint64_t mask;
+ struct vfio_dma *dma;
+ int ret = 0;
+ struct vfio_domain *d;
+ unsigned long order;
+
+ /* Verify that none of our __u64 fields overflow */
+ if (map->size != size || map->iova != iova)
+ return -EINVAL;
+
+ order = __ffs(vfio_pgsize_bitmap(iommu));
+ mask = ((uint64_t)1 << order) - 1;
+
+ WARN_ON(mask & PAGE_MASK);
+
+ /* we currently only support MSI_RESERVED_IOVA */
+ if (!(map->flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA))
+ return -EINVAL;
+
+ if (!size || (size | iova) & mask)
+ return -EINVAL;
+
+ /* Don't allow IOVA address wrap */
+ if (iova + size - 1 < iova)
+ return -EINVAL;
+
+ mutex_lock(&iommu->lock);
+
+ /* check if the iova domain has not been instantiated already*/
+ d = list_first_entry(&iommu->domain_list,
+ struct vfio_domain, next);
+
+ if (vfio_find_dma(iommu, iova, size)) {
+ ret = -EEXIST;
+ goto out;
+ }
+
+ dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+ if (!dma) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ dma->iova = iova;
+ dma->size = size;
+ dma->type = VFIO_IOVA_RESERVED;
+
+ list_for_each_entry(d, &iommu->domain_list, next)
+ ret |= iommu_alloc_reserved_iova_domain(d->domain, iova,
+ size, order);
+
+ if (ret) {
+ list_for_each_entry(d, &iommu->domain_list, next)
+ iommu_free_reserved_iova_domain(d->domain);
+ goto out;
+ }
+
+ vfio_link_dma(iommu, dma);
+
+out:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
static int vfio_bus_type(struct device *dev, void *data)
{
struct bus_type **bus = data;
@@ -1045,7 +1114,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
} else if (cmd == VFIO_IOMMU_MAP_DMA) {
struct vfio_iommu_type1_dma_map map;
uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
- VFIO_DMA_MAP_FLAG_WRITE;
+ VFIO_DMA_MAP_FLAG_WRITE |
+ VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;

minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);

@@ -1055,6 +1125,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
if (map.argsz < minsz || map.flags & ~mask)
return -EINVAL;

+ if (map.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA)
+ return vfio_register_reserved_iova_range(iommu, &map);
+
return vfio_dma_do_map(iommu, &map);

} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 43e183b..982e326 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -411,12 +411,21 @@ struct vfio_iommu_type1_info {
*
* Map process virtual addresses to IO virtual addresses using the
* provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * In case MSI_RESERVED_IOVA is set, the API only aims at registering an IOVA
+ * region which will be used on some platforms to map the host MSI frame.
+ * In that specific case, vaddr and prot are ignored. The requirement for
+ * provisioning such an IOVA range can be checked by calling VFIO_IOMMU_GET_INFO
+ * with the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute. A single
+ * MSI_RESERVED_IOVA region can be registered.
*/
struct vfio_iommu_type1_dma_map {
__u32 argsz;
__u32 flags;
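The ioctl path above screens the flags word against a whitelist mask and then routes MSI-reserved registrations to a dedicated handler. The routing can be modelled standalone; the flag values below are illustrative only, not the real uapi encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the vfio_iommu_type1_ioctl() flag screening: any bit
 * outside the supported mask rejects the request, and the MSI flag
 * diverts to the reserved-IOVA registration path. Values are toys. */
#define TOY_FLAG_READ			(1u << 0)
#define TOY_FLAG_WRITE			(1u << 1)
#define TOY_FLAG_MSI_RESERVED_IOVA	(1u << 2)

enum toy_route { TOY_EINVAL, TOY_DMA_MAP, TOY_MSI_REGISTER };

static enum toy_route toy_route_map(uint32_t flags)
{
	uint32_t mask = TOY_FLAG_READ | TOY_FLAG_WRITE |
			TOY_FLAG_MSI_RESERVED_IOVA;

	if (flags & ~mask)
		return TOY_EINVAL;	/* unknown flag bit */
	if (flags & TOY_FLAG_MSI_RESERVED_IOVA)
		return TOY_MSI_REGISTER; /* vfio_register_reserved_iova_range() */
	return TOY_DMA_MAP;		 /* legacy vfio_dma_do_map() */
}
```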

Eric Auger
Feb 12, 2016, 3:17:18 AM
arm_smmu_unmap_reserved releases all reserved binding resources:
destroy all bindings, free iova, free iova_domain. This happens
on domain deletion.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/iommu/arm-smmu.c | 34 +++++++++++++++++++++++++++++-----
1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 9961bfd..ae8a97d 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -363,6 +363,7 @@ struct arm_smmu_reserved_binding {
dma_addr_t iova;
size_t size;
};
+static void arm_smmu_unmap_reserved(struct iommu_domain *domain);

static struct iommu_ops arm_smmu_ops;

@@ -1057,6 +1058,7 @@ static void arm_smmu_domain_free(struct iommu_domain *domain)
* already been detached.
*/
arm_smmu_destroy_domain_context(domain);
+ arm_smmu_unmap_reserved(domain);
kfree(smmu_domain);
}

@@ -1547,19 +1549,23 @@ unlock:
return ret;
}

-static void arm_smmu_free_reserved_iova_domain(struct iommu_domain *domain)
+static void __arm_smmu_free_reserved_iova_domain(struct arm_smmu_domain *sd)
{
- struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
- struct iova_domain *iovad = smmu_domain->reserved_iova_domain;
+ struct iova_domain *iovad = sd->reserved_iova_domain;

if (!iovad)
return;

- mutex_lock(&smmu_domain->reserved_mutex);
-
put_iova_domain(iovad);
kfree(iovad);
+}

+static void arm_smmu_free_reserved_iova_domain(struct iommu_domain *domain)
+{
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+ __arm_smmu_free_reserved_iova_domain(smmu_domain);
mutex_unlock(&smmu_domain->reserved_mutex);
}

@@ -1675,6 +1681,24 @@ unlock:
mutex_unlock(&smmu_domain->reserved_mutex);
}

+static void arm_smmu_unmap_reserved(struct iommu_domain *domain)
+{
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ struct rb_node *node;
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+ while ((node = rb_first(&smmu_domain->reserved_binding_list))) {
+ struct arm_smmu_reserved_binding *b =
+ rb_entry(node, struct arm_smmu_reserved_binding, node);
+
+ while (!kref_put(&b->kref, reserved_binding_release))
+ ;
+ }
+ smmu_domain->reserved_binding_list = RB_ROOT;
+ __arm_smmu_free_reserved_iova_domain(smmu_domain);
+ mutex_unlock(&smmu_domain->reserved_mutex);
+}
+
static struct iommu_ops arm_smmu_ops = {
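The teardown loop in arm_smmu_unmap_reserved() drops references until the kref release function fires (kref_put() returns nonzero once the count hits zero). A simplified, single-threaded userspace model of that drain pattern (toy_* names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* One binding with a reference count, as arm_smmu_reserved_binding
 * holds a kref; release fires exactly once, when the count hits zero. */
struct toy_binding {
	int refcount;
	bool released;
};

static int toy_kref_put(struct toy_binding *b)
{
	if (--b->refcount == 0) {
		b->released = true;	/* plays reserved_binding_release() */
		return 1;		/* kref_put() reports the release */
	}
	return 0;
}

/* mirrors: while (!kref_put(&b->kref, reserved_binding_release)) ; */
static bool toy_drain(int initial_refs)
{
	struct toy_binding b = { initial_refs, false };

	while (!toy_kref_put(&b))
		;
	return b.released && b.refcount == 0;
}
```

Whatever the outstanding reference count, the drain terminates with the binding released exactly once, which is what the domain-free path relies on.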

Eric Auger
Feb 12, 2016, 3:18:07 AM
We introduce a vfio_dma type since we will need to discriminate
legacy vfio_dma's from the new reserved ones. Since the latter are
not mapped at registration time, some code paths need to be reworked:
removal and replay. For now those paths are simply unplugged; they
will be reworked in subsequent patches.

Signed-off-by: Eric Auger <eric....@linaro.org>
---
drivers/vfio/vfio_iommu_type1.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c5b57e1..b9326c9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c

Eric Auger
Feb 12, 2016, 3:18:13 AM
Implement the iommu_get/put_single_reserved API in arm-smmu.

In order to track which physical addresses are already mapped, we
use an RB tree indexed by PA.

Signed-off-by: Eric Auger <eric....@linaro.org>

---

v1 -> v2:
- previously in vfio_iommu_type1.c
---
drivers/iommu/arm-smmu.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 114 insertions(+)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 729a4c6..9961bfd 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1563,6 +1563,118 @@ static void arm_smmu_free_reserved_iova_domain(struct iommu_domain *domain)
mutex_unlock(&smmu_domain->reserved_mutex);
}

+static int arm_smmu_get_single_reserved(struct iommu_domain *domain,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova)
+{
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ unsigned long order = __ffs(domain->ops->pgsize_bitmap);
+ size_t page_size = 1 << order;
+ phys_addr_t mask = page_size - 1;
+ phys_addr_t aligned_addr = addr & ~mask;
+ phys_addr_t offset = addr - aligned_addr;
+ struct arm_smmu_reserved_binding *b;
+ struct iova *p_iova;
+ struct iova_domain *iovad = smmu_domain->reserved_iova_domain;
+ int ret;
+
+ if (!iovad)
+ return -EINVAL;
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+
+ b = find_reserved_binding(smmu_domain, aligned_addr, page_size);
+ if (b) {
+ *iova = b->iova + offset;
+ kref_get(&b->kref);
+ ret = 0;
+ goto unlock;
+ }
+
+ /* there is no existing reserved iova for this pa */
+ p_iova = alloc_iova(iovad, 1, iovad->dma_32bit_pfn, true);
+ if (!p_iova) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+ *iova = p_iova->pfn_lo << order;
+
+ b = kzalloc(sizeof(*b), GFP_KERNEL);
+ if (!b) {
+ ret = -ENOMEM;
+ goto free_iova_unlock;
+ }
+
+ ret = arm_smmu_map(domain, *iova, aligned_addr, page_size, prot);
+ if (ret)
+ goto free_binding_iova_unlock;
+
+ kref_init(&b->kref);
+ kref_get(&b->kref);
+ b->domain = smmu_domain;
+ b->addr = aligned_addr;
+ b->iova = *iova;
+ b->size = page_size;
+
+ link_reserved_binding(smmu_domain, b);
+
+ *iova += offset;
+ goto unlock;
+
+free_binding_iova_unlock:
+ kfree(b);
+free_iova_unlock:
+ free_iova(smmu_domain->reserved_iova_domain, *iova >> order);
+unlock:
+ mutex_unlock(&smmu_domain->reserved_mutex);
+ return ret;
+}
+
+/* called with reserved_mutex locked */
+static void reserved_binding_release(struct kref *kref)
+{
+ struct arm_smmu_reserved_binding *b =
+ container_of(kref, struct arm_smmu_reserved_binding, kref);
+ struct arm_smmu_domain *smmu_domain = b->domain;
+ struct iommu_domain *d = &smmu_domain->domain;
+ unsigned long order = __ffs(b->size);
+
+
+ arm_smmu_unmap(d, b->iova, b->size);
+ free_iova(smmu_domain->reserved_iova_domain, b->iova >> order);
+ unlink_reserved_binding(smmu_domain, b);
+ kfree(b);
+}
+
+static void arm_smmu_put_single_reserved(struct iommu_domain *domain,
+ dma_addr_t iova)
+{
+ unsigned long order;
+ phys_addr_t aligned_addr;
+ dma_addr_t aligned_iova, page_size, mask, offset;
+ struct arm_smmu_reserved_binding *b;
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+ order = __ffs(domain->ops->pgsize_bitmap);
+ page_size = (uint64_t)1 << order;
+ mask = page_size - 1;
+
+ aligned_iova = iova & ~mask;
+ offset = iova - aligned_iova;
+
+ aligned_addr = iommu_iova_to_phys(domain, aligned_iova);
+
+ mutex_lock(&smmu_domain->reserved_mutex);
+
+ b = find_reserved_binding(smmu_domain, aligned_addr, page_size);
+ if (!b)
+ goto unlock;
+ kref_put(&b->kref, reserved_binding_release);
+
+unlock:
+ mutex_unlock(&smmu_domain->reserved_mutex);
+}
+
static struct iommu_ops arm_smmu_ops = {
.capable = arm_smmu_capable,
.domain_alloc = arm_smmu_domain_alloc,
@@ -1580,6 +1692,8 @@ static struct iommu_ops arm_smmu_ops = {
.domain_set_attr = arm_smmu_domain_set_attr,
.alloc_reserved_iova_domain = arm_smmu_alloc_reserved_iova_domain,
.free_reserved_iova_domain = arm_smmu_free_reserved_iova_domain,
+ .get_single_reserved = arm_smmu_get_single_reserved,
+ .put_single_reserved = arm_smmu_put_single_reserved,
/* Page size bitmap, restricted during device attach */
.pgsize_bitmap = -1UL,
};
--
1.9.1
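arm_smmu_get_single_reserved() maps only the page containing the doorbell PA and returns the IOVA carrying the same in-page offset. That arithmetic can be checked standalone; toy_msi_iova() is a hypothetical helper whose iova_page argument stands in for `p_iova->pfn_lo << order`:

```c
#include <assert.h>
#include <stdint.h>

/* The page/offset arithmetic from arm_smmu_get_single_reserved():
 * round the PA down to a page boundary for mapping, and return the
 * IOVA with the same in-page offset added back. */
static uint64_t toy_msi_iova(uint64_t addr, uint64_t iova_page,
			     unsigned long order)
{
	uint64_t page_size = 1ULL << order;
	uint64_t mask = page_size - 1;
	uint64_t aligned_addr = addr & ~mask;	/* page actually mapped */
	uint64_t offset = addr - aligned_addr;	/* in-page offset kept */

	return iova_page + offset;
}
```

This is why the physical address passed in does not need to be aligned: a doorbell at 0x8000040 mapped to an IOVA page at 0x10000000 (4K pages) yields 0x10000040.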

Eric Auger
Feb 12, 2016, 3:18:48 AM
This patch introduces iommu_get/put_single_reserved.

iommu_get_single_reserved allows the caller to allocate a new reserved
iova page and map it onto the physical page that contains a given physical
address. It returns the iova mapped onto the provided physical address.
Hence the physical address passed as argument does not need to be aligned.

In case a mapping already exists between both pages, the IOVA mapped
to the PA is directly returned.

Each time an iova is successfully returned a binding ref count is
incremented.

iommu_put_single_reserved decrements the ref count; when it reaches
zero, the mapping is destroyed and the iova is released.

Signed-off-by: Eric Auger <eric....@linaro.org>
Signed-off-by: Ankit Jindal <aji...@apm.com>
Signed-off-by: Pranavkumar Sawargaonkar <prana...@linaro.org>
Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>

---

v2 -> v3:
- remove static implementation of iommu_get_single_reserved &
iommu_put_single_reserved when CONFIG_IOMMU_API is not set

v1 -> v2:
- previously a VFIO API, named vfio_alloc_map/unmap_free_reserved_iova
---
drivers/iommu/iommu.c | 21 +++++++++++++++++++++
include/linux/iommu.h | 20 ++++++++++++++++++++
2 files changed, 41 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a994f34..14ebde1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1415,6 +1415,27 @@ size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
return unmapped;
}
EXPORT_SYMBOL_GPL(iommu_unmap);
+int iommu_get_single_reserved(struct iommu_domain *domain,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova)
+{
+ if (!domain->ops->get_single_reserved)
+ return -ENODEV;
+
+ return domain->ops->get_single_reserved(domain, addr, prot, iova);
+
+}
+EXPORT_SYMBOL_GPL(iommu_get_single_reserved);
+
+void iommu_put_single_reserved(struct iommu_domain *domain,
+ dma_addr_t iova)
+{
+ if (!domain->ops->put_single_reserved)
+ return;
+
+ domain->ops->put_single_reserved(domain, iova);
+}
+EXPORT_SYMBOL_GPL(iommu_put_single_reserved);

size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
struct scatterlist *sg, unsigned int nents, int prot)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2d1f155..1e00c1b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -201,6 +201,21 @@ struct iommu_ops {
unsigned long order);
/* frees the reserved iova domain */
void (*free_reserved_iova_domain)(struct iommu_domain *domain);
+ /**
+ * allocate a reserved iova page and bind it onto the page that
+ * contains a physical address (@addr), returns the @iova bound to
+ * @addr. In case the 2 pages already are bound simply return @iova
+ * and increment a ref count.
+ */
+ int (*get_single_reserved)(struct iommu_domain *domain,
+ phys_addr_t addr, int prot,
+ dma_addr_t *iova);
+ /**
+ * decrement a ref count of the iova page. If null, unmap the iova page
+ * and release the iova
+ */
+ void (*put_single_reserved)(struct iommu_domain *domain,
+ dma_addr_t iova);

#ifdef CONFIG_OF_IOMMU
int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
@@ -276,6 +291,11 @@ extern int iommu_alloc_reserved_iova_domain(struct iommu_domain *domain,
dma_addr_t iova, size_t size,
unsigned long order);
extern void iommu_free_reserved_iova_domain(struct iommu_domain *domain);
+extern int iommu_get_single_reserved(struct iommu_domain *domain,
+ phys_addr_t paddr, int prot,
+ dma_addr_t *iova);
+extern void iommu_put_single_reserved(struct iommu_domain *domain,
+ dma_addr_t iova);
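The get/put contract described above (first get creates the binding, later gets for the same physical page reuse the IOVA and bump the count, the mapping is torn down only at zero) can be sketched as a single-slot, single-threaded userspace model. All toy_* names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* One cached PA->IOVA binding with a reference count, modelling the
 * iommu_get/put_single_reserved() semantics for a single page. */
struct toy_slot {
	uint64_t pa;
	uint64_t iova;
	int refs;
};

static struct toy_slot slot;
static uint64_t next_iova = 0x10000000ULL;

static uint64_t toy_get(uint64_t pa)
{
	if (slot.refs && slot.pa == pa) {
		slot.refs++;		/* existing binding: reuse the iova */
		return slot.iova;
	}
	slot.pa = pa;			/* first user: allocate an iova, "map" */
	slot.iova = next_iova;
	next_iova += 0x1000;
	slot.refs = 1;
	return slot.iova;
}

static int toy_put(void)		/* returns the remaining ref count */
{
	if (slot.refs)
		slot.refs--;		/* at zero the real code unmaps and
					 * releases the iova */
	return slot.refs;
}
```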

Eric Auger
Feb 12, 2016, 3:19:15 AM
This patch allows userspace to retrieve whether MSI write
transaction addresses must be mapped. This is returned through the
VFIO_IOMMU_GET_INFO API and its new flag: VFIO_IOMMU_INFO_REQUIRE_MSI_MAP.

Signed-off-by: Bharat Bhushan <Bharat....@freescale.com>
Signed-off-by: Eric Auger <eric....@linaro.org>

---

RFC v1 -> v1:
- derived from
[RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
- renamed allow_msi_reconfig into require_msi_mapping
- fixed VFIO_IOMMU_GET_INFO
---
drivers/vfio/vfio_iommu_type1.c | 26 ++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 1 +
2 files changed, 27 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 6f1ea3d..c5b57e1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -255,6 +255,29 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
}

/*
+ * vfio_domains_require_msi_mapping: indicates whether MSI write transaction
+ * addresses must be mapped
+ *
+ * returns true if it does
+ */
+static bool vfio_domains_require_msi_mapping(struct vfio_iommu *iommu)
+{
+ struct vfio_domain *d;
+ bool ret;
+
+ mutex_lock(&iommu->lock);
+ /* All domains have same require_msi_map property, pick first */
+ d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
+ if (iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_MAPPING, NULL) < 0)
+ ret = false;
+ else
+ ret = true;
+ mutex_unlock(&iommu->lock);
+
+ return ret;
+}
+
+/*
* Attempt to pin pages. We really don't want to track all the pfns and
* the iommu can only map chunks of consecutive pfns anyway, so get the
* first page and all consecutive pages with the same locking.
@@ -997,6 +1020,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,

info.flags = VFIO_IOMMU_INFO_PGSIZES;

+ if (vfio_domains_require_msi_mapping(iommu))
+ info.flags |= VFIO_IOMMU_INFO_REQUIRE_MSI_MAP;
+
info.iova_pgsizes = vfio_pgsize_bitmap(iommu);

return copy_to_user((void __user *)arg, &info, minsz);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 7d7a4c6..43e183b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -400,6 +400,7 @@ struct vfio_iommu_type1_info {
__u32 argsz;
__u32 flags;

Marc Zyngier
Feb 18, 2016, 4:35:26 AM
nit: this could be simplified as:

ret = (iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_MAPPING, NULL) == 0);
FWIW:

Acked-by: Marc Zyngier <marc.z...@arm.com>

M.
--
Jazz is not dead. It just smells funny.

Marc Zyngier
Feb 18, 2016, 6:06:52 AM
On Fri, 12 Feb 2016 08:13:09 +0000
Eric Auger <eric....@linaro.org> wrote:

> This patch introduces iommu_get/put_single_reserved.
>
> iommu_get_single_reserved allows to allocate a new reserved iova page
> and map it onto the physical page that contains a given physical address.
> It returns the iova that is mapped onto the provided physical address.
> Hence the physical address passed in argument does not need to be aligned.
>
> In case a mapping already exists between both pages, the IOVA mapped
> to the PA is directly returned.
>
> Each time an iova is successfully returned a binding ref count is
> incremented.
>
> iommu_put_single_reserved decrements the ref count and when this latter
> is null, the mapping is destroyed and the iova is released.

I wonder if there is a requirement for the caller to find out about the
size of the mapping, or to impose a given size... MSIs clearly do not
have that requirement (this is always a 32bit value), but since
allocations usually pair address and size, I thought I'd ask...

Thanks,

Robin Murphy
Feb 18, 2016, 6:09:30 AM
Hi Eric,

On 12/02/16 08:13, Eric Auger wrote:
> Implement alloc/free_reserved_iova_domain for arm-smmu. we use
> the iova allocator (iova.c). The iova_domain is attached to the
> arm_smmu_domain struct. A mutex is introduced to protect it.

The IOMMU API currently leaves IOVA management entirely up to the caller
- VFIO is already managing its own IOVA space, so what warrants this
being pushed all the way down to the IOMMU driver? All I see here is
abstract code with no hardware-specific details that'll have to be
copy-pasted into other IOMMU drivers (e.g. SMMUv3), which strongly
suggests it's the wrong place to do it.

As I understand the problem, VFIO has a generic "configure an IOMMU to
point at an MSI doorbell" step to do in the process of attaching a
device, which hasn't needed implementing yet due to VT-d's
IOMMU_CAP_I_AM_ALSO_ACTUALLY_THE_MSI_CONTROLLER_IN_DISGUISE flag, which
most of us have managed to misinterpret so far. AFAICS all the IOMMU
driver should need to know about this is an iommu_map() call (which will
want a slight extension[1] to make things behave properly). We should be
fixing the abstraction to be less x86-centric, not hacking up all the
ARM drivers to emulate x86 hardware behaviour in software.

Robin.

[1]:http://article.gmane.org/gmane.linux.kernel.cross-arch/30833

Marc Zyngier
Feb 18, 2016, 6:33:30 AM
So by doing that, you are specializing this infrastructure to PCI.
If you hijacked irq_compose_msi_msg() instead, you'd have both
platform and PCI MSI for the same price.

I can see a potential problem with the teardown of an MSI (I don't
think the compose method is called on teardown), but I think this could
be easily addressed.
Irrespective of the way you implement the translation procedure, you
should make this unconditional, and have the #ifdefery in the code that
implements it.

> };
>
> static struct msi_domain_info gicv2m_msi_domain_info = {
> diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
> index 8223765..690504e 100644
> --- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c
> +++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
> @@ -19,6 +19,7 @@
> #include <linux/of.h>
> #include <linux/of_irq.h>
> #include <linux/of_pci.h>
> +#include "irq-gic-common.h"
>
> static void its_mask_msi_irq(struct irq_data *d)
> {
> @@ -37,7 +38,11 @@ static struct irq_chip its_msi_irq_chip = {
> .irq_unmask = its_unmask_msi_irq,
> .irq_mask = its_mask_msi_irq,
> .irq_eoi = irq_chip_eoi_parent,
> +#ifdef CONFIG_IOMMU_API
> + .irq_write_msi_msg = gic_pci_msi_domain_write_msg,
> +#else
> .irq_write_msi_msg = pci_msi_domain_write_msg,
> +#endif
> };
>
> struct its_pci_alias {

Eric Auger
Feb 18, 2016, 10:23:13 AM
Hi Robin,
On 02/18/2016 12:09 PM, Robin Murphy wrote:
> Hi Eric,
>
> On 12/02/16 08:13, Eric Auger wrote:
>> Implement alloc/free_reserved_iova_domain for arm-smmu. we use
>> the iova allocator (iova.c). The iova_domain is attached to the
>> arm_smmu_domain struct. A mutex is introduced to protect it.
>
> The IOMMU API currently leaves IOVA management entirely up to the caller
I agree

> - VFIO is already managing its own IOVA space, so what warrants this
> being pushed all the way down to the IOMMU driver?
In practice, with the upstreamed code, VFIO uses IOVA = GPA as provided by
userspace (corresponding to RAM regions) and does not allocate IOVAs
itself. The IOVA is passed through the VFIO_IOMMU_MAP_DMA ioctl.

With this series we propose that userspace provides a pool of unused
IOVAs that can be used to map host physical addresses (the MSI frame
address). So effectively someone needs to use an IOVA allocator to
allocate within that window. This can be VFIO or the IOMMU driver, but
in both cases it is a new capability introduced in either component.
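
Whichever component ends up owning it, the allocator's job is just to hand out pages from the userspace-provided window. A minimal bitmap-based sketch (illustrative only; the series itself uses the iova.c allocator, and the base/function names here are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define WINDOW_PAGES 16

/* Reserved window registered by userspace, e.g. via a new VFIO ioctl. */
static uint64_t window_base = 0x8000000UL; /* illustrative IOVA base */
static uint32_t window_bitmap;             /* 1 bit per page, 0 = free */

/* Allocate one IOVA page from the reserved window, or -1 if exhausted. */
static uint64_t msi_iova_alloc(void)
{
	for (int i = 0; i < WINDOW_PAGES; i++) {
		if (!(window_bitmap & (1u << i))) {
			window_bitmap |= 1u << i;
			return window_base + ((uint64_t)i << PAGE_SHIFT);
		}
	}
	return (uint64_t)-1;
}

static void msi_iova_free(uint64_t iova)
{
	int i = (int)((iova - window_base) >> PAGE_SHIFT);
	window_bitmap &= ~(1u << i);
}
```
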

In the first version of this series
(https://lkml.org/lkml/2016/1/26/371) I put this IOVA allocation in
vfio_iommu_type1. The vfio-pci driver then used this VFIO-internal API
to overwrite the physical address written into the PCI device by the
MSI controller.

However I was advised by Alex to move things to a lower level
(http://www.spinics.net/lists/kvm/msg126809.html), the IOMMU core or
the irq remapping driver; also, the MSI controller should directly
program the IOVA address into the PCI device.

On ARM, irq remapping is somewhat abstracted in the ITS driver. We also
need this functionality for GICv2m, so I eventually chose to put it in
the IOMMU driver. Since iova.c is not compiled by everyone, and since
this functionality is needed only for a restricted set of architectures
(ARM/ARM64 & PowerPC), I chose to implement it in arch-specific code,
for the time being in arm-smmu.c.

This allows the MSI controller to interact with the IOMMU API to bind
its MSI address. I think it may be feasible to have the MSI controller
interact with the VFIO external user API instead, but would that be any
better?

Assuming we can agree on the relevance of adding this functionality at
the IOMMU API level, maybe we can create a separate .c file to share
code between arm-smmu.c and arm-smmu-v3.c? Or I could even dare to add
it to the generic IOMMU code. What is your opinion?

> All I see here is
> abstract code with no hardware-specific details that'll have to be
> copy-pasted into other IOMMU drivers (e.g. SMMUv3), which strongly
> suggests it's the wrong place to do it.
>
> As I understand the problem, VFIO has a generic "configure an IOMMU to
> point at an MSI doorbell" step to do in the process of attaching a
> device, which hasn't needed implementing yet due to VT-d's
> IOMMU_CAP_I_AM_ALSO_ACTUALLY_THE_MSI_CONTROLLER_IN_DISGUISE flag, which
> most of us have managed to misinterpret so far.

Maybe I misunderstood the above comment, but I would say it is the
contrary: i.e., up to now VFIO did not need to care about this issue
since MSI addresses were not mapped in the IOMMU on x86. Now they need
to be, so we need to extend an existing API, be it the VFIO external
user API or the IOMMU API. But please correct me if I misunderstood you.

Also, I found it more practical to have an all-in-one API doing both
the IOVA allocation and the binding (dma_map_single-like). The user of
the API then does not have to care about the IOMMU page size.

Thanks for your time and looking forward to reading from you!

Best Regards

Eric

Eric Auger

unread,
Feb 18, 2016, 10:27:01 AM
to Marc Zyngier, eric....@st.com, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Hi Marc,
sure ;-)
thanks

Eric
>
> M.
>

Eric Auger

unread,
Feb 18, 2016, 10:34:18 AM
to Marc Zyngier, leo....@amd.com, eric....@st.com, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, Thomas....@amd.com
Hi Marc,
Yes, effectively this is the reason why I moved it from
irq_compose_msi_msg (my original implementation) to irq_write_msi_msg.
I noticed I had no way to detect the teardown, whereas
msi_domain_deactivate also calls irq_write_msi_msg, which is quite
practical ;-) To be honest, I still need to look further at the non-PCI
implementation.
OK

Thanks

Eric

Marc Zyngier

unread,
Feb 18, 2016, 10:47:31 AM
to Eric Auger, leo....@amd.com, eric....@st.com, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, Thomas....@amd.com
Another thing to consider is that MSI controllers may use different
doorbells for different CPU affinities. With your implementation,
repeatedly changing the affinity from one CPU to another would increase
the refcounting, and never actually lower it (you don't necessarily go
via an "unmap"). Of course, none of that applies to GICv2m/GICv3-ITS,
but that's worth considering.

So I think we may need some better tracking of the IOVA we program in
the device, and offer a generic infrastructure for this instead of
hiding it in the MSI controller drivers.
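
The affinity problem can be seen in a toy model: if each message write takes a reference on the newly targeted doorbell but nothing releases the previous one, flipping affinity back and forth only ever increases the counts. A sketch of the naive path versus one that tracks the doorbell currently programmed into the device (all names and structures illustrative):

```c
#include <assert.h>

/* Two doorbells, e.g. one per cluster; refs[i] counts live bindings. */
static int refs[2];

/* Naive write path: take a ref on the newly targeted doorbell, but --
 * as noted above -- nothing drops the ref on the previous one, because
 * affinity changes do not necessarily go via an "unmap". */
static void write_msi_msg_naive(int doorbell)
{
	refs[doorbell]++;
}

/* A fixed path remembers which doorbell is currently programmed in the
 * device and releases it before taking the new reference. */
static int current_doorbell = -1;

static void write_msi_msg_tracked(int doorbell)
{
	if (current_doorbell >= 0)
		refs[current_doorbell]--;
	refs[doorbell]++;
	current_doorbell = doorbell;
}
```

This is the "better tracking of the IOVA we program in the device" point: the state has to live somewhere generic, not in each MSI controller driver.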

Thanks,

M.
--
Jazz is not dead. It just smells funny...

Alex Williamson

unread,
Feb 18, 2016, 11:06:32 AM
to Robin Murphy, Eric Auger, eric....@st.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, marc.z...@arm.com, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, Thomas....@amd.com, brijes...@amd.com, pat...@linaro.org, Manish...@caviumnetworks.com, p.f...@samsung.com, linux-...@vger.kernel.org, io...@lists.linux-foundation.org, pranav.sa...@gmail.com, sherry....@amd.com
On Thu, 18 Feb 2016 11:09:17 +0000
Robin Murphy <robin....@arm.com> wrote:

> Hi Eric,
>
> On 12/02/16 08:13, Eric Auger wrote:
> > Implement alloc/free_reserved_iova_domain for arm-smmu. we use
> > the iova allocator (iova.c). The iova_domain is attached to the
> > arm_smmu_domain struct. A mutex is introduced to protect it.
>
> The IOMMU API currently leaves IOVA management entirely up to the caller
> - VFIO is already managing its own IOVA space, so what warrants this
> being pushed all the way down to the IOMMU driver? All I see here is
> abstract code with no hardware-specific details that'll have to be
> copy-pasted into other IOMMU drivers (e.g. SMMUv3), which strongly
> suggests it's the wrong place to do it.
>
> As I understand the problem, VFIO has a generic "configure an IOMMU to
> point at an MSI doorbell" step to do in the process of attaching a
> device, which hasn't needed implementing yet due to VT-d's
> IOMMU_CAP_I_AM_ALSO_ACTUALLY_THE_MSI_CONTROLLER_IN_DISGUISE flag, which
> most of us have managed to misinterpret so far. AFAICS all the IOMMU
> driver should need to know about this is an iommu_map() call (which will
> want a slight extension[1] to make things behave properly). We should be
> fixing the abstraction to be less x86-centric, not hacking up all the
> ARM drivers to emulate x86 hardware behaviour in software.

The gap I see, which the I_AM_ALSO_ACTUALLY_THE_MSI...
solution transparently fixes, is that there's no connection between
pci_enable_msi{x}_range and the IOMMU API. If I want to allow a device
managed by an IOMMU API domain to perform MSI, I need to go scrape the
MSI vectors out of the device, setup a translation into my IOVA space,
and re-write those vectors. Not to mention that as an end user, I
have no idea what might be sharing the page where those vectors are
targeted and what I might be allowing the user DMA access to. MSI
setup is necessarily making use of the IOVA space of the device, so
there's clearly an opportunity to interact with the IOMMU API to manage
that IOVA usage. x86 has an implicit range of IOVA space for MSI, this
makes an explicit range, reserved by the IOMMU API user for this
purpose. At the vfio level, I just want to be able to call the PCI
MSI/X setup routines and have them automatically program vectors that
make use of IOVA space that I've already marked reserved for this
purpose. I don't see how that's x86-centric other than x86 has already
managed to make this transparent and spoiled users into expecting
working IOVAs on the device after using standard MSI vector setup
callbacks. That's the goal I'm looking for. Thanks,

Alex

Eric Auger

unread,
Feb 18, 2016, 11:43:10 AM2/18/16
to Marc Zyngier, eric....@st.com, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Hello,
On 02/18/2016 12:06 PM, Marc Zyngier wrote:
> On Fri, 12 Feb 2016 08:13:09 +0000
> Eric Auger <eric....@linaro.org> wrote:
>
>> This patch introduces iommu_get/put_single_reserved.
>>
>> iommu_get_single_reserved allows to allocate a new reserved iova page
>> and map it onto the physical page that contains a given physical address.
>> It returns the iova that is mapped onto the provided physical address.
>> Hence the physical address passed in argument does not need to be aligned.
>>
>> In case a mapping already exists between both pages, the IOVA mapped
>> to the PA is directly returned.
>>
>> Each time an iova is successfully returned a binding ref count is
>> incremented.
>>
>> iommu_put_single_reserved decrements the ref count and when this latter
>> is null, the mapping is destroyed and the iova is released.
>
> I wonder if there is a requirement for the caller to find out about the
> size of the mapping, or to impose a given size... MSIs clearly do not
> have that requirement (this is always a 32bit value), but since
> allocations usually pair address and size, I thought I'd ask...
Yes. Currently this only makes sure the host PA is mapped and returns
the corresponding IOVA. It is part of the discussion we need to have on
the API, besides the question of which API it should belong to.
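
The semantics quoted above (allocate-or-reuse, plus refcounting on get/put) can be modelled in a few lines of plain C. This is an illustrative userspace model, not the kernel implementation; the table layout and function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 0x1000UL
#define MAX_BINDINGS 16

struct reserved_binding {
	uint64_t pa_page;   /* physical page targeted (e.g. the MSI frame) */
	uint64_t iova_page; /* IOVA page mapped onto it */
	unsigned refs;      /* binding reference count */
};

static struct reserved_binding bindings[MAX_BINDINGS];
static uint64_t next_iova = 0x8000000UL; /* start of the reserved window */

/* Return an IOVA mapped onto the page containing @pa (so @pa need not
 * be aligned); reuse and refcount an existing binding if one exists. */
static uint64_t get_single_reserved(uint64_t pa)
{
	uint64_t pa_page = pa & ~(PAGE_SIZE - 1);

	for (size_t i = 0; i < MAX_BINDINGS; i++) {
		if (bindings[i].refs && bindings[i].pa_page == pa_page) {
			bindings[i].refs++;
			return bindings[i].iova_page | (pa & (PAGE_SIZE - 1));
		}
	}
	for (size_t i = 0; i < MAX_BINDINGS; i++) {
		if (!bindings[i].refs) {
			bindings[i].pa_page = pa_page;
			bindings[i].iova_page = next_iova;
			bindings[i].refs = 1;
			next_iova += PAGE_SIZE; /* allocate from the window */
			return bindings[i].iova_page | (pa & (PAGE_SIZE - 1));
		}
	}
	return (uint64_t)-1;
}

/* Drop one reference; destroy the mapping when it reaches zero. */
static void put_single_reserved(uint64_t iova)
{
	uint64_t iova_page = iova & ~(PAGE_SIZE - 1);

	for (size_t i = 0; i < MAX_BINDINGS; i++) {
		if (bindings[i].refs && bindings[i].iova_page == iova_page) {
			if (--bindings[i].refs == 0)
				bindings[i].pa_page = 0; /* unmap, release IOVA */
			return;
		}
	}
}
```

Two MSI doorbell addresses falling in the same physical page share one binding, which is exactly where the refcount matters.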

Thanks

Eric
>
> Thanks,
>
> M.
>

Marc Zyngier

unread,
Feb 18, 2016, 11:51:22 AM2/18/16
to Eric Auger, eric....@st.com, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
One of the issues I have with the API at the moment is that there is no
control on the page size. Imagine you have allocated a 4kB IOVA window
for your MSI, but your IOMMU can only map 64kB (not unreasonable to
imagine on arm64). What happens then?

Somehow, userspace should be told about it, one way or another.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

Eric Auger

unread,
Feb 18, 2016, 11:59:29 AM2/18/16
to Marc Zyngier, leo....@amd.com, eric....@st.com, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, Thomas....@amd.com
Hi Marc,
OK thanks for pointing this out.

This is also a good confirmation that a single IOVA address is not
always sufficient (at some point we wondered if we could directly use
the MSI controller guest PA instead of having userspace provide an
IOVA pool).

> With your implementation,
> repeatedly changing the affinity from one CPU to another would increase
> the refcounting, and never actually lower it (you don't necessarily go
> via an "unmap").

> Of course, none of that applies to GICv2m/GICv3-ITS,
> but that's worth considering.
>
> So I think we may need some better tracking of the IOVA we program in
> the device, and offer a generic infrastructure for this instead of
> hiding it in the MSI controller drivers.
OK I will study that.

Thanks for your time!

Eric
>
> Thanks,
>
> M.
>

Eric Auger

unread,
Feb 18, 2016, 12:19:54 PM2/18/16
to Marc Zyngier, eric....@st.com, alex.wi...@redhat.com, will....@arm.com, jo...@8bytes.org, tg...@linutronix.de, ja...@lakedaemon.net, christof...@linaro.org, linux-ar...@lists.infradead.org, kvm...@lists.cs.columbia.edu, k...@vger.kernel.org, suravee.su...@amd.com, pat...@linaro.org, linux-...@vger.kernel.org, Manish...@caviumnetworks.com, Bharat....@freescale.com, pranav.sa...@gmail.com, p.f...@samsung.com, io...@lists.linux-foundation.org, sherry....@amd.com, brijes...@amd.com, leo....@amd.com, Thomas....@amd.com
Hi Marc,
The code checks that the IOVA window size is aligned with the IOMMU
page size, so I think that case is handled at iova domain creation
(arm_smmu_alloc_reserved_iova_domain).
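
A sketch of that kind of creation-time check, deriving the minimum page size from a pgsize bitmap the way the IOMMU API does (lowest set bit); the helper names are illustrative, not the actual arm-smmu code:

```c
#include <assert.h>
#include <stdint.h>

/* Smallest page the IOMMU can map: the lowest set bit of the
 * supported-page-size bitmap. */
static uint64_t min_pgsize(uint64_t pgsize_bitmap)
{
	return pgsize_bitmap & (0 - pgsize_bitmap);
}

/* The kind of check done at reserved-iova-domain creation: reject a
 * window whose base or size is not aligned on the IOMMU page size,
 * e.g. a 4kB window on an IOMMU that can only map 64kB pages. */
static int reserved_window_valid(uint64_t base, uint64_t size,
				 uint64_t pgsize_bitmap)
{
	uint64_t pg = min_pgsize(pgsize_bitmap);

	return size && !(base & (pg - 1)) && !(size & (pg - 1));
}
```

Rejecting the window is only half the story, though: as noted, userspace still needs a way to learn the required size and alignment up front.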
>
> Somehow, userspace should be told about it, one way or another.
I agree on that point. Userspace should be provided with the
information about the required IOVA pool size and alignment. This is
missing in the current RFC series.

Best Regards

Eric
>
> Thanks,
>
> M.
>
