[PATCH v1 0/3] uio: add MSI/MSI-X support to uio_pci

Vlad Zolotarov

unread,

Oct 4, 2015, 12:50:05 PM10/4/15

to

This series add support for MSI and MSI-X interrupts to uio_pci_generic driver.

Currently uio_pci_generic supports only legacy INT#x interrupts source. However
there are situations when this is not enough, for instance SR-IOV VF devices that
simply don't have INT#x capability. For such devices uio_pci_generic will simply
fail (more specifically probe() will fail).

When IOMMU is either not available (e.g. Amazon EC2) or not acceptable due to performance
overhead and thus VFIO is not an option users that develop user-space drivers are left
without any option but to develop some proprietary UIO drivers (e.g. igb_uio driver in Intel's
DPDK) just to be able to use UIO infrastructure.

This series provides a generic solution for this problem while preserving the original behaviour
for devices for which the original uio_pci_generic had worked before (i.e. INT#x will be used by default).

Vlad Zolotarov (3):
uio: add ioctl support
uio_pci_generic: add MSI/MSI-X support
Documentation: update uio-howto

Documentation/DocBook/uio-howto.tmpl | 29 ++-
drivers/uio/uio.c | 15 ++
drivers/uio/uio_pci_generic.c | 405 +++++++++++++++++++++++++++++++++--
include/linux/uio_driver.h | 3 +
include/linux/uio_pci_generic.h | 36 ++++
5 files changed, 462 insertions(+), 26 deletions(-)
create mode 100644 include/linux/uio_pci_generic.h

--
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Message has been deleted

Vlad Zolotarov

unread,

Oct 4, 2015, 12:50:05 PM10/4/15

to

Change the chapters related to uio_pci_generic that refer interrupt mode.
Add the relevant explanation regarding MSI and MSI-X interrupt modes
support.

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---
Documentation/DocBook/uio-howto.tmpl | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/Documentation/DocBook/uio-howto.tmpl b/Documentation/DocBook/uio-howto.tmpl
index cd0e452..a176129 100644
--- a/Documentation/DocBook/uio-howto.tmpl
+++ b/Documentation/DocBook/uio-howto.tmpl
@@ -46,6 +46,12 @@ GPL version 2.

<revhistory>
<revision>
+ <revnumber>0.10</revnumber>
+ <date>2015-10-04</date>
+ <authorinitials>vz</authorinitials>
+ <revremark>Added MSI and MSI-X support to uio_pci_generic.</revremark>
+ </revision>
+ <revision>
<revnumber>0.9</revnumber>
<date>2009-07-16</date>
<authorinitials>mst</authorinitials>
@@ -935,15 +941,32 @@ and look in the output for failure reasons
<sect1 id="uio_pci_generic_internals">
<title>Things to know about uio_pci_generic</title>
<para>
-Interrupts are handled using the Interrupt Disable bit in the PCI command
+Interrupts are handled either as MSI-X or MSI interrupts (if the device supports it) or
+as legacy INTx interrupts. By default INTx interrupts are used.
+ </para>
+ <para>
+uio_pci_generic automatically configures a device to use INTx interrupt for backward
+compatibility. If INTx are not available MSI-X interrupts will be used if the device
+supports it and if not MSI interrupts are going to be used. If none of the interrupts
+modes is supported probe() will fail.
+ </para>
+ <para>
+To get the used interrupt mode application has to use UIO_PCI_GENERIC_INT_MODE_GET ioctl
+command.
+UIO_PCI_GENERIC_IRQ_NUM_GET ioctl command may be used to get the total number of IRQs.
+Then UIO_PCI_GENERIC_IRQ_SET ioctl command may be used to bind a specific eventfd to a specific
+IRQ vector.
+ </para>
+ <para>
+Legacy interrupts are handled using the Interrupt Disable bit in the PCI command
register and Interrupt Status bit in the PCI status register. All devices
compliant to PCI 2.3 (circa 2002) and all compliant PCI Express devices should
support these bits. uio_pci_generic detects this support, and won't bind to
devices which do not support the Interrupt Disable Bit in the command register.
</para>
<para>
-On each interrupt, uio_pci_generic sets the Interrupt Disable bit.
-This prevents the device from generating further interrupts
+If legacy interrupts are used, uio_pci_generic sets the Interrupt Disable bit on
+each interrupt. This prevents the device from generating further interrupts
until the bit is cleared. The userspace driver should clear this
bit before blocking and waiting for more interrupts.
</para>

Vlad Zolotarov

unread,

Oct 4, 2015, 12:50:05 PM10/4/15

to

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---

drivers/uio/uio.c | 15 +++++++++++++++
include/linux/uio_driver.h | 3 +++
2 files changed, 18 insertions(+)

diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
index 8196581..714b0e5 100644
--- a/drivers/uio/uio.c
+++ b/drivers/uio/uio.c
@@ -704,6 +704,20 @@ static int uio_mmap(struct file *filep, struct vm_area_struct *vma)
}
}

+static long uio_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
+{
+ struct uio_listener *listener = filep->private_data;
+ struct uio_device *idev = listener->dev;
+
+ if (!idev->info)
+ return -EIO;
+
+ if (!idev->info->ioctl)
+ return -ENOTTY;
+
+ return idev->info->ioctl(idev->info, cmd, arg);
+}
+
static const struct file_operations uio_fops = {
.owner = THIS_MODULE,
.open = uio_open,
@@ -712,6 +726,7 @@ static const struct file_operations uio_fops = {
.write = uio_write,
.mmap = uio_mmap,
.poll = uio_poll,
+ .unlocked_ioctl = uio_ioctl,
.fasync = uio_fasync,
.llseek = noop_llseek,
};
diff --git a/include/linux/uio_driver.h b/include/linux/uio_driver.h
index 32c0e83..10d7833 100644
--- a/include/linux/uio_driver.h
+++ b/include/linux/uio_driver.h
@@ -89,6 +89,7 @@ struct uio_device {
* @mmap: mmap operation for this uio device
* @open: open operation for this uio device
* @release: release operation for this uio device
+ * @ioctl: ioctl handler
* @irqcontrol: disable/enable irqs when 0/1 is written to /dev/uioX
*/
struct uio_info {
@@ -105,6 +106,8 @@ struct uio_info {
int (*open)(struct uio_info *info, struct inode *inode);
int (*release)(struct uio_info *info, struct inode *inode);
int (*irqcontrol)(struct uio_info *info, s32 irq_on);
+ int (*ioctl)(struct uio_info *info, unsigned int cmd,
+ unsigned long arg);
};

extern int __must_check

Vlad Zolotarov

unread,

Oct 4, 2015, 1:40:07 PM10/4/15

to

Change the chapters related to uio_pci_generic that refer interrupt mode.
Add the relevant explanation regarding MSI and MSI-X interrupt modes
support.

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---

Vlad Zolotarov

unread,

Oct 4, 2015, 1:40:07 PM10/4/15

to

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---

Vlad Zolotarov

unread,

Oct 4, 2015, 1:40:08 PM10/4/15

to

This series add support for MSI and MSI-X interrupts to uio_pci_generic driver.

Currently uio_pci_generic supports only legacy INT#x interrupts source. However
there are situations when this is not enough, for instance SR-IOV VF devices that
simply don't have INT#x capability. For such devices uio_pci_generic will simply
fail (more specifically probe() will fail).

When IOMMU is either not available (e.g. Amazon EC2) or not acceptable due to performance
overhead and thus VFIO is not an option users that develop user-space drivers are left
without any option but to develop some proprietary UIO drivers (e.g. igb_uio driver in Intel's
DPDK) just to be able to use UIO infrastructure.

This series provides a generic solution for this problem while preserving the original behaviour
for devices for which the original uio_pci_generic had worked before (i.e. INT#x will be used by default).

New in v2:
- Added #include <linux/uaccess.h> to uio_pci_generic.c

Vlad Zolotarov (3):
uio: add ioctl support
uio_pci_generic: add MSI/MSI-X support
Documentation: update uio-howto

Documentation/DocBook/uio-howto.tmpl | 29 ++-
drivers/uio/uio.c | 15 ++

drivers/uio/uio_pci_generic.c | 410 +++++++++++++++++++++++++++++++++--

include/linux/uio_driver.h | 3 +
include/linux/uio_pci_generic.h | 36 +++

5 files changed, 467 insertions(+), 26 deletions(-)
create mode 100644 include/linux/uio_pci_generic.h

Vlad Zolotarov

unread,

Oct 4, 2015, 4:40:09 PM10/4/15

to

Vlad Zolotarov

unread,

Oct 4, 2015, 4:40:09 PM10/4/15

to

Change the chapters related to uio_pci_generic that refer interrupt mode.
Add the relevant explanation regarding MSI and MSI-X interrupt modes
support.

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---

Vlad Zolotarov

unread,

Oct 4, 2015, 4:40:09 PM10/4/15

to

This series add support for MSI and MSI-X interrupts to uio_pci_generic driver.

Currently uio_pci_generic supports only legacy INT#x interrupts source. However
there are situations when this is not enough, for instance SR-IOV VF devices that
simply don't have INT#x capability. For such devices uio_pci_generic will simply
fail (more specifically probe() will fail).

When IOMMU is either not available (e.g. Amazon EC2) or not acceptable due to performance
overhead and thus VFIO is not an option users that develop user-space drivers are left
without any option but to develop some proprietary UIO drivers (e.g. igb_uio driver in Intel's
DPDK) just to be able to use UIO infrastructure.

This series provides a generic solution for this problem while preserving the original behaviour
for devices for which the original uio_pci_generic had worked before (i.e. INT#x will be used by default).

New in v3:
- Add __iomem qualifier to temp buffer receiving ioremap value.

New in v2:
- Added #include <linux/uaccess.h> to uio_pci_generic.c

Vlad Zolotarov (3):
uio: add ioctl support
uio_pci_generic: add MSI/MSI-X support
Documentation: update uio-howto

Documentation/DocBook/uio-howto.tmpl | 29 ++-
drivers/uio/uio.c | 15 ++
drivers/uio/uio_pci_generic.c | 410 +++++++++++++++++++++++++++++++++--
include/linux/uio_driver.h | 3 +
include/linux/uio_pci_generic.h | 36 +++
5 files changed, 467 insertions(+), 26 deletions(-)
create mode 100644 include/linux/uio_pci_generic.h

Message has been deleted

Vlad Zolotarov

unread,

Oct 4, 2015, 4:50:06 PM10/4/15

to

This is the same v3 but with the correct email address of Greg. In the
first iteration the first letter of the email was missing... ;)

Message has been deleted

Vlad Zolotarov

unread,

Oct 4, 2015, 4:50:06 PM10/4/15

to

2.1.0

Vlad Zolotarov

unread,

Oct 4, 2015, 4:50:06 PM10/4/15

to

Vlad Zolotarov

unread,

Oct 4, 2015, 4:50:07 PM10/4/15

to

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---

Greg KH

unread,

Oct 4, 2015, 11:10:05 PM10/4/15

to

On Sun, Oct 04, 2015 at 11:43:16PM +0300, Vlad Zolotarov wrote:
> Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
> ---
> drivers/uio/uio.c | 15 +++++++++++++++
> include/linux/uio_driver.h | 3 +++
> 2 files changed, 18 insertions(+)

You add an ioctl yet fail to justify _why_ you need/want that ioctl, and
you don't document it at all? Come on, you know better than that, no
one can take a patch that has no changelog comments at all like this :(

Also, I _REALLY_ don't want to add any ioctls to the UIO interface, so
you had better have a really compelling argument as to why this is the
_ONLY_ way you can solve this unknown problem by using such a horrid
thing...

thanks,

greg k-h

Greg KH

unread,

Oct 4, 2015, 11:20:06 PM10/4/15

to

On Sun, Oct 04, 2015 at 11:43:17PM +0300, Vlad Zolotarov wrote:
> Add support for MSI and MSI-X interrupt modes:
> - Interrupt mode selection order is:
> INT#X (for backward compatibility) -> MSI-X -> MSI.
> - Add ioctl() commands:
> - UIO_PCI_GENERIC_INT_MODE_GET: query the current interrupt mode.
> - UIO_PCI_GENERIC_IRQ_NUM_GET: query the maximum number of IRQs.
> - UIO_PCI_GENERIC_IRQ_SET: bind the IRQ to eventfd (similar to vfio).
> - Add mappings to all bars (memory and portio): some devices have
> registers related to MSI/MSI-X handling outside BAR0.

>
> Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
> ---

> New in v3:
> - Add __iomem qualifier to temp buffer receiving ioremap value.
>
> New in v2:
> - Added #include <linux/uaccess.h> to uio_pci_generic.c
>

> Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
> ---

> drivers/uio/uio_pci_generic.c | 410 +++++++++++++++++++++++++++++++++++++---
> include/linux/uio_pci_generic.h | 36 ++++
> 2 files changed, 423 insertions(+), 23 deletions(-)
> create mode 100644 include/linux/uio_pci_generic.h
>
> diff --git a/drivers/uio/uio_pci_generic.c b/drivers/uio/uio_pci_generic.c
> index d0b508b..6b8b1789 100644
> --- a/drivers/uio/uio_pci_generic.c
> +++ b/drivers/uio/uio_pci_generic.c
> @@ -22,16 +22,32 @@
> #include <linux/device.h>
> #include <linux/module.h>
> #include <linux/pci.h>
> +#include <linux/msi.h>
> #include <linux/slab.h>
> #include <linux/uio_driver.h>
> +#include <linux/uio_pci_generic.h>
> +#include <linux/eventfd.h>
> +#include <linux/uaccess.h>
>
> #define DRIVER_VERSION "0.01.0"
> #define DRIVER_AUTHOR "Michael S. Tsirkin <m...@redhat.com>"
> #define DRIVER_DESC "Generic UIO driver for PCI 2.3 devices"
>
> +struct msix_info {
> + int num_irqs;
> + struct msix_entry *table;
> + struct uio_msix_irq_ctx {
> + struct eventfd_ctx *trigger; /* MSI-x vector to eventfd */

Why are you using eventfd for msi vectors? What's the reason for
needing this?

You haven't documented how this api works at all, you are going to have
to a lot more work to justify this, as this greatly increases the
complexity of the user/kernel api in unknown ways.

greg k-h

Vlad Zolotarov

unread,

Oct 5, 2015, 3:40:06 AM10/5/15

to

On 10/05/15 06:03, Greg KH wrote:
> On Sun, Oct 04, 2015 at 11:43:16PM +0300, Vlad Zolotarov wrote:
>> Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
>> ---
>> drivers/uio/uio.c | 15 +++++++++++++++
>> include/linux/uio_driver.h | 3 +++
>> 2 files changed, 18 insertions(+)
> You add an ioctl yet fail to justify _why_ you need/want that ioctl, and
> you don't document it at all? Come on, you know better than that, no
> one can take a patch that has no changelog comments at all like this :(

My bad. U are absolutely right here - it was late and I was tired that I
missed that to someone it may not be so "crystal clear" like it is to
me... :)
Again, my bad - let me clarify it here and if we agree I'll respin the
series with all relevant updates including the changelog.

>
> Also, I _REALLY_ don't want to add any ioctls to the UIO interface, so
> you had better have a really compelling argument as to why this is the
> _ONLY_ way you can solve this unknown problem by using such a horrid
> thing...

Pls., note that this doesn't _ADD_ any ioctls directly to UIO driver,
but only lets the underlying PCI drivers to have them. UIO in this case
is only a proxy.

The main idea of this series is, as mentioned in PATCH0, to add the MSI
and MSI-X support for uio_pci_generic driver.
While with MSI the things are quite simple and we may just ride the
existing infrastructure, with the MSI-X the things get a bit more
complicated since we may have more than one interrupt vector. Therefore
we have to decide which interface we want to give to the user.

One option could be to make all existing interrupts trigger the same
objects in UIO as the current single interrupt does, however this would
create an awkward, quite not-flexible semantics. For instance a regular
(kernel) driver has a separate state machine for each interrupt line,
which sometimes runs on a separate CPU, etc. This way we get to the
second option - allow indication for each separate interrupt vector. And
for obvious reasons (mentioned above) we (Stephen has sent a similar
series on a dpdk-dev list) chose a second approach.

In order not to invent the wheel we mimicked the VFIO approach, which
allows to bind the pre-allocated eventfd descriptor to the specific
interrupt vector using the ioctl().

The interface is simple. The UIO_PCI_GENERIC_IRQ_SET ioctl() data is:

struct uio_pci_generic_irq_set {
int vec; /* index of the IRQ to connect to starting from 0 */
int fd;
};

where "vec" is an index of the IRQ starting from 0 and "fd" is an
eventfd file descriptor a user wants to poll() for in order to get the
interrupt indications. If "fd" is less than 0, ioctl() will unbind the
interrupt from the previously bound eventfd descriptor.

This way a user may poll() for any IRQ it wants separately, or epoll()
for any subset of them, or do whatever he/she wants to do.

That's why we needed the ioctl(). I admit that it may not be the _ONLY_
way to achieve the goal described above but again we took VFIO approach
as a template for a solution and just followed it. If u think there is
more elegant/robust/better way to do so, pls., share. :)

thanks,
vlad

>
> thanks,
>
> greg k-h

Vlad Zolotarov

unread,

Oct 5, 2015, 3:50:06 AM10/5/15

to

A small correction - for MSI-X vectors. There may be only one MSI vector
per PCI function and if it's used it would use the same interface as a
legacy INT#x interrupt uses at the moment.
So, for MSI-X case the reason is that there may be (in most cases there
will be) more than one interrupt vector. Thus, as I've explained in a
PATCH1 thread we need a way to indicated each of them separately.
eventfd seems like a good way of doing so. If u have better ideas, pls.,
share.

>
> You haven't documented how this api works at all, you are going to have
> to a lot more work to justify this, as this greatly increases the
> complexity of the user/kernel api in unknown ways.

I actually do documented it a bit. Pls., check PATCH3 out. I admit that
I could do a better job by for instance providing a code example. I'll
improve this in v4 once we agree on all other details.

thanks,
vlad

Avi Kivity

unread,

Oct 5, 2015, 4:30:10 AM10/5/15

to

Of course it has to be documented, but this just follows vfio.

Eventfd is a natural enough representation of an interrupt; both kvm and
vfio use it, and are also able to share the eventfd, allowing a vfio
interrupt to generate a kvm interrupt, without userspace intervention,
and one day without even kernel intervention.

Stephen Hemminger

unread,

Oct 5, 2015, 4:50:07 AM10/5/15

to

On Sun, 4 Oct 2015 23:43:17 +0300
Vlad Zolotarov <vl...@cloudius-systems.com> wrote:

> +static int setup_maps(struct pci_dev *pdev, struct uio_info *info)
> +{
> + int i, m = 0, p = 0, err;
> + static const char * const bar_names[] = {
> + "BAR0", "BAR1", "BAR2", "BAR3", "BAR4", "BAR5",
> + };
> +
> + for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
> + unsigned long start = pci_resource_start(pdev, i);
> + unsigned long flags = pci_resource_flags(pdev, i);
> + unsigned long len = pci_resource_len(pdev, i);
> +
> + if (start == 0 || len == 0)
> + continue;
> +
> + if (flags & IORESOURCE_MEM) {
> + void __iomem *addr;
> +
> + if (m >= MAX_UIO_MAPS)
> + continue;
> +
> + addr = ioremap(start, len);
> + if (addr == NULL) {
> + err = -EINVAL;
> + goto fail;
> + }
> +
> + info->mem[m].name = bar_names[i];
> + info->mem[m].addr = start;
> + info->mem[m].internal_addr = addr;
> + info->mem[m].size = len;
> + info->mem[m].memtype = UIO_MEM_PHYS;
> + ++m;
> + } else if (flags & IORESOURCE_IO) {
> + if (p >= MAX_UIO_PORT_REGIONS)
> + continue;
> +
> + info->port[p].name = bar_names[i];
> + info->port[p].start = start;
> + info->port[p].size = len;
> + info->port[p].porttype = UIO_PORT_X86;
> + ++p;
> + }
> + }
> +
> + return 0;
> +fail:
> + for (i = 0; i < m; i++) {
> + iounmap(info->mem[i].internal_addr);
> + info->mem[i].internal_addr = NULL;
> + }
> +
> + return err;
> +

I wonder do we really have to setup all the BAR's in uio_pci_generic?
The DPDK code works with uio_pci_generic already, and it didn't setup the BAR's.
One possible issue is that without that maybe kernel would not know about the
region used for MSI-X vectors table.

Vlad Zolotarov

unread,

Oct 5, 2015, 5:10:07 AM10/5/15

to

DPDK never used uio_pci_generic with MSI-X support so far and all MSI-X
capable UIO DPDK drivers like igb_uio and your newly proposed uio_msi do
map them all.
U also mentioned in the other thread that virtio requires portio bars too.
The thing is that generally bars are needed for programming the device
therefore it's logical to have them exposed. In general different
devices have different registers layout therefore a general driver may
not know which bar exactly is going to be required for a specific device
and for a specific usage. That's why I think that "map them all"
approach is rather generic and appropriate. However if there are other
motives not to do so that I'm missing, pls., let me know.

thanks,
vlad

Vlad Zolotarov

unread,

Oct 5, 2015, 5:20:06 AM10/5/15

to

On 10/05/15 11:41, Stephen Hemminger wrote:

So, what's your point? It sounds like u are for setting the mappings in
the uio_pci_generic after all, aren't u? ;)

Greg KH

unread,

Oct 5, 2015, 5:50:06 AM10/5/15

to

On Mon, Oct 05, 2015 at 11:28:03AM +0300, Avi Kivity wrote:
> Of course it has to be documented, but this just follows vfio.
>
> Eventfd is a natural enough representation of an interrupt; both kvm and
> vfio use it, and are also able to share the eventfd, allowing a vfio
> interrupt to generate a kvm interrupt, without userspace intervention, and
> one day without even kernel intervention.

That's nice and wonderful, but it's not how UIO works today, so this is
now going to be a mix and match type interface, with no justification so
far as to why to create this new api and exactly how this is all going
to be used from userspace.

Example code would be even better...

thanks,

greg k-h

Greg KH

unread,

Oct 5, 2015, 5:50:06 AM10/5/15

to

On Mon, Oct 05, 2015 at 10:33:20AM +0300, Vlad Zolotarov wrote:
> On 10/05/15 06:03, Greg KH wrote:
> >On Sun, Oct 04, 2015 at 11:43:16PM +0300, Vlad Zolotarov wrote:
> >>Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
> >>---
> >> drivers/uio/uio.c | 15 +++++++++++++++
> >> include/linux/uio_driver.h | 3 +++
> >> 2 files changed, 18 insertions(+)
> >You add an ioctl yet fail to justify _why_ you need/want that ioctl, and
> >you don't document it at all? Come on, you know better than that, no
> >one can take a patch that has no changelog comments at all like this :(
>
> My bad. U are absolutely right here - it was late and I was tired that I
> missed that to someone it may not be so "crystal clear" like it is to me...
> :)
> Again, my bad - let me clarify it here and if we agree I'll respin the
> series with all relevant updates including the changelog.
>
> >
> >Also, I _REALLY_ don't want to add any ioctls to the UIO interface, so
> >you had better have a really compelling argument as to why this is the
> >_ONLY_ way you can solve this unknown problem by using such a horrid
> >thing...
>
> Pls., note that this doesn't _ADD_ any ioctls directly to UIO driver, but
> only lets the underlying PCI drivers to have them. UIO in this case is only
> a proxy.

Exactly, and I don't want to provide an ioctl "proxy" for UIO drivers.
That way lies madness and horrid code, and other nasty things (hint,
each ioctl is a custom syscall, so you are opening up the box for all
sorts of bad things to happen in drivers...)

For example, your ioctl you use here is incorrect, and will fail
horribly on a large majority of systems. I don't want to open up the
requirements that more people have to know how to "do it right" in order
to use the UIO interface for their drivers, as people will get it wrong
(as this patch series shows...)

> The main idea of this series is, as mentioned in PATCH0, to add the MSI and
> MSI-X support for uio_pci_generic driver.

Yes, I know that, but I don't see anything that shows _how_ to use this
api. And then there's the issue of why we even need this, why not just
write a whole new driver for this, like the previous driver did (which
also used ioctls, yes, I didn't have the chance to object to that before
everyone else did...)

> While with MSI the things are quite simple and we may just ride the existing
> infrastructure, with the MSI-X the things get a bit more complicated since
> we may have more than one interrupt vector. Therefore we have to decide
> which interface we want to give to the user.
>
> One option could be to make all existing interrupts trigger the same objects
> in UIO as the current single interrupt does, however this would create an
> awkward, quite not-flexible semantics. For instance a regular (kernel)
> driver has a separate state machine for each interrupt line, which sometimes
> runs on a separate CPU, etc. This way we get to the second option - allow
> indication for each separate interrupt vector. And for obvious reasons
> (mentioned above) we (Stephen has sent a similar series on a dpdk-dev list)
> chose a second approach.
>
> In order not to invent the wheel we mimicked the VFIO approach, which allows
> to bind the pre-allocated eventfd descriptor to the specific interrupt
> vector using the ioctl().
>
> The interface is simple. The UIO_PCI_GENERIC_IRQ_SET ioctl() data is:
>
> struct uio_pci_generic_irq_set {
> int vec; /* index of the IRQ to connect to starting from 0 */
> int fd;
> };

And that's broken :(

NEVER use an "int" for an ioctl, it is wrong and will cause horrible
issues on a large number of systems. That is what the __u16 and friends
variable types are for. You know better than this :)

Greg KH

unread,

Oct 5, 2015, 5:50:07 AM10/5/15

to

On Mon, Oct 05, 2015 at 10:41:39AM +0300, Vlad Zolotarov wrote:
> >>+struct msix_info {
> >>+ int num_irqs;
> >>+ struct msix_entry *table;
> >>+ struct uio_msix_irq_ctx {
> >>+ struct eventfd_ctx *trigger; /* MSI-x vector to eventfd */
> >Why are you using eventfd for msi vectors? What's the reason for
> >needing this?
>
> A small correction - for MSI-X vectors. There may be only one MSI vector per
> PCI function and if it's used it would use the same interface as a legacy
> INT#x interrupt uses at the moment.
> So, for MSI-X case the reason is that there may be (in most cases there will
> be) more than one interrupt vector. Thus, as I've explained in a PATCH1
> thread we need a way to indicated each of them separately. eventfd seems
> like a good way of doing so. If u have better ideas, pls., share.

You need to document what you are doing here, I don't see any
explaination for using eventfd at all.

And no, I don't know of any other solution as I don't know what you are
trying to do here (hint, the changelog didn't document it...)

> >You haven't documented how this api works at all, you are going to have
> >to a lot more work to justify this, as this greatly increases the
> >complexity of the user/kernel api in unknown ways.
>
> I actually do documented it a bit. Pls., check PATCH3 out.

That provided no information at all about how to use the api.

If it did, you would see that your api is broken for 32/64bit kernels
and will fall over into nasty pieces the first time you try to use it
there, which means it hasn't been tested at all :(

thanks,

Vlad Zolotarov

unread,

Oct 5, 2015, 6:10:06 AM10/5/15

to

Having said all that however I'd agree if someone would say that
mappings setting would rather come as a separate patch in this series... ;)
it will in v4...

Avi Kivity

unread,

Oct 5, 2015, 6:30:07 AM10/5/15

to

On 10/05/2015 12:49 PM, Greg KH wrote:
> On Mon, Oct 05, 2015 at 11:28:03AM +0300, Avi Kivity wrote:
>> Of course it has to be documented, but this just follows vfio.
>>
>> Eventfd is a natural enough representation of an interrupt; both kvm and
>> vfio use it, and are also able to share the eventfd, allowing a vfio
>> interrupt to generate a kvm interrupt, without userspace intervention, and
>> one day without even kernel intervention.
> That's nice and wonderful, but it's not how UIO works today, so this is
> now going to be a mix and match type interface, with no justification so
> far as to why to create this new api and exactly how this is all going
> to be used from userspace.

The intended user is dpdk (http://dpdk.org), which is a family of
userspace networking drivers for high performance networking applications.

The natural device driver for dpdk is vfio, which both provides memory
protection and exposes msi/msix interrupts. However, in many cases vfio
cannot be used, either due to the lack of an iommu (for example, in
virtualized environments) or out of a desire to avoid the iommus
performance impact.

The challenge in exposing msix interrupts to user space is that there
are many of them, so you can't simply poll the device fd. If you do,
how do you know which interrupt was triggered? The solution that vfio
adopted was to associate each interrupt with an eventfd, allowing it to
be individually polled. Since you can pass an eventfd with SCM_RIGHTS,
and since kvm can trigger guest interrupts using an eventfd, the
solution is very flexible.

> Example code would be even better...
>
>

This is the vfio dpdk interface code:

http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c

basically, the equivalent uio msix code would be very similar if uio
adopts a similar interface:

http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_pci_uio.c

(current code lacks msi/msix support, of course).

Vlad Zolotarov

unread,

Oct 5, 2015, 6:40:06 AM10/5/15

to

Sometimes there is no other (better) way to get things done. And bugs -
isn't it what code review is for? ;)
I'll fix the "int" issue.

>
>> The main idea of this series is, as mentioned in PATCH0, to add the MSI and
>> MSI-X support for uio_pci_generic driver.
> Yes, I know that, but I don't see anything that shows _how_ to use this
> api.

I get that, i'll extend PATCH3 of this series with a detailed
description in v4.

U use it as follows:

1. Bind the PCI function to uio_pci_generic.
2. Query for its interrupt mode with UIO_PCI_GENERIC_INT_MODE_GET ioctl.
3. If interrupt mode is INT#x or MSI - use the current UIO interface
for polling, namely use the UIO file descriptor.
4. Else
1. Query for the number of MSI-X vectors with
UIO_PCI_GENERIC_IRQ_NUM_GET ioctl.
2. Allocate the required number of eventfd descriptors using
eventfd() from sys/eventfd.h.
3. Bind them to the required IRQs with UIO_PCI_GENERIC_IRQ_SET ioctl.
5. When done, just unbind the PCI function from the uio_pci_generic.

> And then there's the issue of why we even need this, why not just
> write a whole new driver for this, like the previous driver did (which
> also used ioctls, yes, I didn't have the chance to object to that before
> everyone else did...)

Which "previous driver" do u refer here?
IMHO writing something instead of UIO (not just uio_pci_generic) seems
like an overkill for solving this issue. Supporting MSI-X interrupts
seem like a very beneficial feature for uio_pci_generic and it's really
not _THAT_ complicated API - just look at VFIO for a comparison... ;)
uio_pci_generic is clearly missing this important feature. And creating
another user space driver infrastructure just to add it seems extremely
unjustified.

>
>> While with MSI the things are quite simple and we may just ride the existing
>> infrastructure, with the MSI-X the things get a bit more complicated since
>> we may have more than one interrupt vector. Therefore we have to decide
>> which interface we want to give to the user.
>>
>> One option could be to make all existing interrupts trigger the same objects
>> in UIO as the current single interrupt does, however this would create an
>> awkward, quite not-flexible semantics. For instance a regular (kernel)
>> driver has a separate state machine for each interrupt line, which sometimes
>> runs on a separate CPU, etc. This way we get to the second option - allow
>> indication for each separate interrupt vector. And for obvious reasons
>> (mentioned above) we (Stephen has sent a similar series on a dpdk-dev list)
>> chose a second approach.
>>
>> In order not to invent the wheel we mimicked the VFIO approach, which allows
>> to bind the pre-allocated eventfd descriptor to the specific interrupt
>> vector using the ioctl().
>>
>> The interface is simple. The UIO_PCI_GENERIC_IRQ_SET ioctl() data is:
>>
>> struct uio_pci_generic_irq_set {
>> int vec; /* index of the IRQ to connect to starting from 0 */
>> int fd;
>> };
> And that's broken :(

Good catch. Thanks. Will fix.
I'm not a big ioctl fan myself but unfortunately I don't see a good
alternative here. proc? Would it make it cleaner?

Vlad Zolotarov

unread,

Oct 5, 2015, 6:50:09 AM10/5/15

to

It has been tested of course ;)
I tested it only in 64 bit environment however where both kernel and
user space applications were compiled on the same machine with the same
compiler and it could be that "int" had the same number of bytes both in
kernel and in user space application. Therefore it worked perfectly - I
patched DPDK to use the new uio_pci_generic MSI-X API to test this and I
have verified that all 3 interrupt modes work: MSI-X with SR-IOV VF
device in Amazon EC2 guest and INT#x and MSI with a PF device on bare
metal server.

However I agree using uint32_t for "vec" and "fd" would be much more
correct.

Greg KH

unread,

Oct 5, 2015, 7:10:07 AM10/5/15

to

I don't think file descriptors are __u32 on a 64bit arch, are they?

And NEVER use the _t types in kernel code, the namespaces is all wrong
and it is not applicable for us, sorry.

Avi Kivity

unread,

Oct 5, 2015, 7:10:08 AM10/5/15

to

Wasn't the real reason that they aren't defined (or reserved) by C89,
and therefore could clash with a user identifier, rather than some
inherent wrongness?

Avi Kivity

unread,

Oct 5, 2015, 7:50:08 AM10/5/15

to

On 10/05/2015 02:41 PM, Vlad Zolotarov wrote:

> I think they are "int" on all platforms and as far as I know u32
> should be enough to contain int on any platform.
>

You need to make sure structures have the same layout on both 32-bit and
64-bit systems, or you'll have to code compat ioctl translations for
them. The best way to do that is to use __u32 so the sizes are obvious,
even for int, and to pad everything to 64 bit:

> +struct msix_info {

+ __u32 num_irqs;
+ __u32 pad; // so pointer below is aligned to 64-bit on both 32-bit
and 64-bit userspace

>
> + struct msix_entry *table;
> + struct uio_msix_irq_ctx {
> + struct eventfd_ctx *trigger; /* MSI-x vector to eventfd */

>>

>> And NEVER use the _t types in kernel code,
>

> Never meant it - it was for a user space interface. For a kernel it's
> u32 of course.
>

For interfaces, use __u32. You can't use uint32_t because if someone
uses C89 in 2015, they may not have <cstdint.h>.

Vlad Zolotarov

unread,

Oct 5, 2015, 7:50:08 AM10/5/15

to

I think they are "int" on all platforms and as far as I know u32 should
be enough to contain int on any platform.

>

> And NEVER use the _t types in kernel code,

Never meant it - it was for a user space interface. For a kernel it's
u32 of course.

Vlad Zolotarov

unread,

Oct 5, 2015, 8:00:07 AM10/5/15

to

Sure, but the structure below is not the one that is passed in ioctl() -
it's an internal uio_pci_generic state and there is nothing to worry about.
The one in question is struct uio_pci_generic_irq_set from
uio_pci_generic.h:

struct uio_pci_generic_irq_set {
int vec; /* index of the IRQ to connect to starting from 0 */
int fd;
};

It should be
struct uio_pci_generic_irq_set {
__u32 vec; /* index of the IRQ to connect to starting from 0 */
__u32 fd;
};

instead.

Greg KH

unread,

Oct 5, 2015, 9:10:06 AM10/5/15

to

Kind of, my memory is vague. There's a great rant from Linus about why
they don't work in the kernel somewhere in the lkml archives...

Message has been deleted

Michael S. Tsirkin

unread,

Oct 5, 2015, 4:00:07 PM10/5/15

to

On Sun, Oct 04, 2015 at 11:43:15PM +0300, Vlad Zolotarov wrote:
> This series add support for MSI and MSI-X interrupts to uio_pci_generic driver.
>
> Currently uio_pci_generic supports only legacy INT#x interrupts source. However
> there are situations when this is not enough, for instance SR-IOV VF devices that
> simply don't have INT#x capability. For such devices uio_pci_generic will simply
> fail (more specifically probe() will fail).
>
> When IOMMU is either not available (e.g. Amazon EC2) or not acceptable due to performance
> overhead and thus VFIO is not an option
> users that develop user-space drivers are left
> without any option but to develop some proprietary UIO drivers (e.g. igb_uio driver in Intel's
> DPDK) just to be able to use UIO infrastructure.
>
> This series provides a generic solution for this problem while preserving the original behaviour
> for devices for which the original uio_pci_generic had worked before (i.e. INT#x will be used by default).

What is missing here is that drivers using uio_pci_generic generally
poke at config and BAR sysfs files of the device.

We can not stop them without breaking existing users, but this means
that we can't enable bus mastering and MSI/MSI-X blindly: userspace
bugs will corrupt the MSI-X table and/or MSi/MSI-X capability,
and cause device to overwrite random addresses, corrupting kernel
memory.

Your solution seems to be a warning in dmesg and tainting the
kernel, but that's not enough.

You need to add infrastructure to prevent this.

VFIO has some code to do this, but it's not bound by existing UIO API so it
simply fails the mmap. We want I think existing applications to work,
so I suspect we need to make a hole there (probably map a zero page in
case apps want to read it, and maybe even set it up for COW in case they
tweak the PBA which sometimes happens to be in the same page).

Your patches also seem to add in eventfd and mmap capabilities which
seems to be orthogonal. They are there in VFIO which I'm guessing is the
real reason you do it.

So, what you are trying to do might be closer to extending VFIO which
already has a bunch of checks like that. Yes, it also wants to program
the IOMMU. So maybe do it with a separate device that can be root-only,
so unpriveledged users can't abuse it.

You should Cc, and talk to the VFIO maintainer.

Michael S. Tsirkin

unread,

Oct 5, 2015, 4:10:07 PM10/5/15

to

On Mon, Oct 05, 2015 at 01:36:35PM +0300, Vlad Zolotarov wrote:
> >And then there's the issue of why we even need this, why not just
> >write a whole new driver for this, like the previous driver did (which
> >also used ioctls, yes, I didn't have the chance to object to that before
> >everyone else did...)
>
> Which "previous driver" do u refer here?
> IMHO writing something instead of UIO (not just uio_pci_generic) seems like
> an overkill for solving this issue. Supporting MSI-X interrupts seem like a
> very beneficial feature for uio_pci_generic and it's really not _THAT_
> complicated API - just look at VFIO for a comparison... ;)

Except most things VFIO does is actually there for security.
Which, for a device that can do DMA and isn't even behind an IOMMU,
sounds like a pretty big deal actually.

> uio_pci_generic is clearly missing this important feature. And creating
> another user space driver infrastructure just to add it seems extremely
> unjustified.

uio_pci_generic was always intended to be used with extremely simple
devices which don't do DMA, or where someone else has set up the IOMMU
(like kvm does with CONFIG_KVM_DEVICE_ASSIGNMENT).

We need to be much more careful with MSI if there's no IOMMU.

--
MST

Michael S. Tsirkin

unread,

Oct 5, 2015, 4:20:07 PM10/5/15

to

On Mon, Oct 05, 2015 at 01:06:09PM +0300, Vlad Zolotarov wrote:
> Having said all that however I'd agree if someone would say that mappings
> setting would rather come as a separate patch in this series... ;)
> it will in v4...

Just drop this is my advice. There are enough controversial things here
as it is.

--
MST

Michael S. Tsirkin

unread,

Oct 5, 2015, 6:30:06 PM10/5/15

to

On Tue, Oct 06, 2015 at 12:43:45AM +0300, Vladislav Zolotarov wrote:
> So, like it has already been asked in a different thread I'm going to
> ask a rhetorical question: what adding an MSI and MSI-X interrupts support to
> uio_pci_generic has to do with security?

memory protection is a better term than security.

It's very simple: you enable bus mastering and you ask userspace to map
all device BARs. One of these BARs holds the address to which device
writes to trigger MSI-X interrupt.

This is how MSI-X works, internally: from the point of view of
PCI it's a memory write. It just so happens that the destination
address is in the interrupt controller, that triggers an interrupt.

But a bug in this userspace application can corrupt the MSI-X table,
which in turn can easily corrupt kernel memory, or unrelated processes's
memory. This is in my opinion unacceptable.

So you need to be very careful
- probably need to reset device before you even enable bus master
- prevent userspace from touching msi config
- prevent userspace from moving BARs since msi-x config is within a BAR
- detect reset and prevent linux from touching device while it's under
reset

The list goes on and on.

This is pretty much what VFIO spent the last 3 years doing, except VFIO
also can do IOMMU groups.

> What "security threat" does it add
> that u don't already have today?

Yes, userspace can create this today if it tweaks PCI config space to
enable MSI-X, then corrupts the MSI-X table. It's unfortunate that we
don't yet prevent this, but at least you need two things to go wrong for
this to trigger.

The reason, as I tried to point out, is simply that I didn't think
uio_pci_generic will be used for these configurations.
But there's nothing fundamental here that makes them secure
and that therefore makes your patches secure as well.

Fixing this to make uio_pci_generic write-protect MSI/MSI-X enable
registers sounds kind of reasonable, this shouldn't be too hard.

Vlad Zolotarov

unread,

Oct 6, 2015, 4:40:07 AM10/6/15

to

On 10/05/15 22:50, Michael S. Tsirkin wrote:
> On Sun, Oct 04, 2015 at 11:43:15PM +0300, Vlad Zolotarov wrote:
>> This series add support for MSI and MSI-X interrupts to uio_pci_generic driver.
>>
>> Currently uio_pci_generic supports only legacy INT#x interrupts source. However
>> there are situations when this is not enough, for instance SR-IOV VF devices that
>> simply don't have INT#x capability. For such devices uio_pci_generic will simply
>> fail (more specifically probe() will fail).
>>
>> When IOMMU is either not available (e.g. Amazon EC2) or not acceptable due to performance
>> overhead and thus VFIO is not an option
>> users that develop user-space drivers are left
>> without any option but to develop some proprietary UIO drivers (e.g. igb_uio driver in Intel's
>> DPDK) just to be able to use UIO infrastructure.
>>
>> This series provides a generic solution for this problem while preserving the original behaviour
>> for devices for which the original uio_pci_generic had worked before (i.e. INT#x will be used by default).
> What is missing here is that drivers using uio_pci_generic generally
> poke at config and BAR sysfs files of the device.
>
> We can not stop them without breaking existing users, but this means
> that we can't enable bus mastering and MSI/MSI-X blindly: userspace
> bugs will corrupt the MSI-X table and/or MSi/MSI-X capability,
> and cause device to overwrite random addresses, corrupting kernel
> memory.
>
> Your solution seems to be a warning in dmesg and tainting the
> kernel, but that's not enough.
>
> You need to add infrastructure to prevent this.

Bus mastering is easily enabled from the user space (taken from DPDK code):

static int
pci_uio_set_bus_master(int dev_fd)
{
uint16_t reg;
int ret;

ret = pread(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
if (ret != sizeof(reg)) {
RTE_LOG(ERR, EAL,
"Cannot read command from PCI config space!\n");
return -1;
}

/* return if bus mastering is already on */
if (reg & PCI_COMMAND_MASTER)
return 0;

reg |= PCI_COMMAND_MASTER;

ret = pwrite(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
if (ret != sizeof(reg)) {
RTE_LOG(ERR, EAL,
"Cannot write command to PCI config space!\n");
return -1;
}

return 0;
}

So, this is a non-issue. ;)

Vlad Zolotarov

unread,

Oct 6, 2015, 4:40:09 AM10/6/15

to

Sure. But like u've just pointed out yourself - this is a general issue
and it has nothing to do with the ability to get notifications per
MSI-X/MSI interrupts, which this series adds (bus mastering may and is
easily enabled from the user space - look for pci_uio_set_bus_master()
function in the DPDK).

So, while I absolutely agree with u in regard to the fact that we have a
security/memory corruption threat in the current in-tree uio_pci_generic
- the solution u propose should be a matter of a separate patch and is
obviously orthogonal to this series.

thanks,
vlad

Michael S. Tsirkin

unread,

Oct 6, 2015, 10:20:08 AM10/6/15

to

On Tue, Oct 06, 2015 at 11:33:56AM +0300, Vlad Zolotarov wrote:
> the solution u propose should be a matter of a separate patch and is
> obviously orthogonal to this series.

Doesn't work this way, sorry. You want a patch enabling MSI merged,
you need to secure the MSI configuration.

And it's going to be a lot of work, duplicating a bunch of code from
VFIO.

--
MST

Gleb Natapov

unread,

Oct 6, 2015, 10:40:08 AM10/6/15

to

On Tue, Oct 06, 2015 at 05:19:22PM +0300, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 11:33:56AM +0300, Vlad Zolotarov wrote:
> > the solution u propose should be a matter of a separate patch and is
> > obviously orthogonal to this series.
>
> Doesn't work this way, sorry. You want a patch enabling MSI merged,
> you need to secure the MSI configuration.
>

MSI can be enabled right now without the patch by writing directly into
PCI bar. The only thing this patch adds is forwarding the interrupt to
an eventfd.

--
Gleb.

Michael S. Tsirkin

unread,

Oct 6, 2015, 10:40:10 AM10/6/15

to

On Tue, Oct 06, 2015 at 11:37:59AM +0300, Vlad Zolotarov wrote:
> Bus mastering is easily enabled from the user space (taken from DPDK code):
>
> static int
> pci_uio_set_bus_master(int dev_fd)
> {
> uint16_t reg;
> int ret;
>
> ret = pread(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
> if (ret != sizeof(reg)) {
> RTE_LOG(ERR, EAL,
> "Cannot read command from PCI config space!\n");
> return -1;
> }
>
> /* return if bus mastering is already on */
> if (reg & PCI_COMMAND_MASTER)
> return 0;
>
> reg |= PCI_COMMAND_MASTER;
>
> ret = pwrite(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
> if (ret != sizeof(reg)) {
> RTE_LOG(ERR, EAL,
> "Cannot write command to PCI config space!\n");
> return -1;
> }
>
> return 0;
> }
>
> So, this is a non-issue. ;)

There might be valid reasons for DPDK to do this, e.g. if using VFIO.

I'm guessing it doesn't enable MSI though, does it?

--
MST

Michael S. Tsirkin

unread,

Oct 6, 2015, 10:40:11 AM10/6/15

to

So you really want a driver that behaves exactly like vfio.
Which immediately begs a question: why not extend vfio
to cover your usecase.

--
MST

Vlad Zolotarov

unread,

Oct 6, 2015, 10:50:07 AM10/6/15

to

On 10/06/15 17:30, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 11:37:59AM +0300, Vlad Zolotarov wrote:
>> Bus mastering is easily enabled from the user space (taken from DPDK code):
>>
>> static int
>> pci_uio_set_bus_master(int dev_fd)
>> {
>> uint16_t reg;
>> int ret;
>>
>> ret = pread(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
>> if (ret != sizeof(reg)) {
>> RTE_LOG(ERR, EAL,
>> "Cannot read command from PCI config space!\n");
>> return -1;
>> }
>>
>> /* return if bus mastering is already on */
>> if (reg & PCI_COMMAND_MASTER)
>> return 0;
>>
>> reg |= PCI_COMMAND_MASTER;
>>
>> ret = pwrite(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
>> if (ret != sizeof(reg)) {
>> RTE_LOG(ERR, EAL,
>> "Cannot write command to PCI config space!\n");
>> return -1;
>> }
>>
>> return 0;
>> }
>>
>> So, this is a non-issue. ;)
> There might be valid reasons for DPDK to do this, e.g. if using VFIO.

Michael, I'm afraid u are missing the main point here - the code above
destroys all your long arguments. U can't possibly prevent the root-user
from enabling the device bus mastering. And as me and other people on
this thread have already mentioned MSI and MSI-X device configuration is
controlled from the device BAR, thus may not be prevented too.

>
> I'm guessing it doesn't enable MSI though, does it?

Again, enabling MSI is a matter of a trivial patch configuring device
registers on the device BAR.

Vlad Zolotarov

unread,

Oct 6, 2015, 10:50:07 AM10/6/15

to

The only "like VFIO" behavior we implement here is binding the MSI-X
interrupt notification to eventfd descriptor. This doesn't justifies the
hassle of implementing IOMMU-less VFIO mode.

Michael S. Tsirkin

unread,

Oct 6, 2015, 10:50:08 AM10/6/15

to

On Mon, Oct 05, 2015 at 11:28:03AM +0300, Avi Kivity wrote:
> Eventfd is a natural enough representation of an interrupt; both kvm and
> vfio use it, and are also able to share the eventfd, allowing a vfio
> interrupt to generate a kvm interrupt, without userspace intervention, and
> one day without even kernel intervention.

eventfd without kernel intervention sounds unlikely.

kvm might configure the cpu such that an interrupt will not trigger a
vmexit. eventfd seems like an unlikely interface to do that: with the
eventfd, device triggering it has no info about the interrupt so it
can't send it to the correct VM.

--
MST

Michael S. Tsirkin

unread,

Oct 6, 2015, 11:00:10 AM10/6/15

to

On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote:
> The only "like VFIO" behavior we implement here is binding the MSI-X
> interrupt notification to eventfd descriptor.

There will be more if you add some basic memory protections.

Besides, that's not true.
Your patch queries MSI capability, sets # of vectors.
You even hinted you want to add BAR mapping down the road.

VFIO does all of that.

> This doesn't justifies the
> hassle of implementing IOMMU-less VFIO mode.

This applies to both VFIO and UIO really. I'm not sure the hassle of
maintaining this functionality in tree is justified. It remains to be
seen whether there are any users that won't taint the kernel.
Apparently not in the current form of the patch, but who knows.

--
MST

Michael S. Tsirkin

unread,

Oct 6, 2015, 11:20:08 AM10/6/15

to

> While it is possible that userspace malfunctions and accidentally programs
> MSI incorrectly, the risk is dwarfed by the ability of userspace to program
> DMA incorrectly.

That seems to imply that for the upstream kernel this is not a valid usecase at all.

Avi Kivity

unread,

Oct 6, 2015, 11:20:08 AM10/6/15

to

On 10/06/2015 05:30 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 11:37:59AM +0300, Vlad Zolotarov wrote:
>> Bus mastering is easily enabled from the user space (taken from DPDK code):
>>
>> static int
>> pci_uio_set_bus_master(int dev_fd)
>> {
>> uint16_t reg;
>> int ret;
>>
>> ret = pread(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
>> if (ret != sizeof(reg)) {
>> RTE_LOG(ERR, EAL,
>> "Cannot read command from PCI config space!\n");
>> return -1;
>> }
>>
>> /* return if bus mastering is already on */
>> if (reg & PCI_COMMAND_MASTER)
>> return 0;
>>
>> reg |= PCI_COMMAND_MASTER;
>>
>> ret = pwrite(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
>> if (ret != sizeof(reg)) {
>> RTE_LOG(ERR, EAL,
>> "Cannot write command to PCI config space!\n");
>> return -1;
>> }
>>
>> return 0;
>> }
>>
>> So, this is a non-issue. ;)
> There might be valid reasons for DPDK to do this, e.g. if using VFIO.

DPDK does this when using vfio, and when using uio_pci_generic. All of
the network cards that DPDK supports require DMA.

> I'm guessing it doesn't enable MSI though, does it?

It does not enable MSI, because the main kernel driver used for
interacting with the device, pci_uio_generic, does not support MSI. In
some configurations, PCI INTA is not available, while MSI(X) is, hence
the desire that pci_uio_generic support MSI.

While it is possible that userspace malfunctions and accidentally
programs MSI incorrectly, the risk is dwarfed by the ability of

userspace to program DMA incorrectly. Under normal operation userspace
programs tens of millions of DMA operations per second, while it never
touches the MSI BARs (it is the kernel that programs them).

Michael S. Tsirkin

unread,

Oct 6, 2015, 11:20:12 AM10/6/15

to

On Tue, Oct 06, 2015 at 05:40:23PM +0300, Vlad Zolotarov wrote:
> >I'm guessing it doesn't enable MSI though, does it?
>
> Again, enabling MSI is a matter of a trivial patch configuring device
> registers on the device BAR.

No, not really.

--
mST

Michael S. Tsirkin

unread,

Oct 6, 2015, 11:30:08 AM10/6/15

to

On Tue, Oct 06, 2015 at 05:30:31PM +0300, Gleb Natapov wrote:
> On Tue, Oct 06, 2015 at 05:19:22PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Oct 06, 2015 at 11:33:56AM +0300, Vlad Zolotarov wrote:
> > > the solution u propose should be a matter of a separate patch and is
> > > obviously orthogonal to this series.
> >
> > Doesn't work this way, sorry. You want a patch enabling MSI merged,
> > you need to secure the MSI configuration.
> >
> MSI can be enabled right now without the patch by writing directly into
> PCI bar.

By poking at config registers in sysfs? We can block this, or we
can log this, pretty easily. We don't ATM but it's not hard to do.

> The only thing this patch adds is forwarding the interrupt to
> an eventfd.

This one just adds a bunch of ioctls. The next ones do
more than you describe.

Vlad Zolotarov

unread,

Oct 6, 2015, 11:30:08 AM10/6/15

to

On 10/06/15 17:56, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote:
>> The only "like VFIO" behavior we implement here is binding the MSI-X
>> interrupt notification to eventfd descriptor.
> There will be more if you add some basic memory protections.

I've already explained that there is no need for any additional memory
protections since it won't be able to protect anything.

>
> Besides, that's not true.
> Your patch queries MSI capability, sets # of vectors.

My patch doesn't set # of vectors.

> You even hinted you want to add BAR mapping down the road.
>
> VFIO does all of that.
>
>> This doesn't justifies the
>> hassle of implementing IOMMU-less VFIO mode.
> This applies to both VFIO and UIO really. I'm not sure the hassle of
> maintaining this functionality in tree is justified. It remains to be
> seen whether there are any users that won't taint the kernel.
> Apparently not in the current form of the patch, but who knows.

Again, uio_pci_generic with my patch simply follows the UIO design and
in addition allows mapping MSI-X interrupts to eventfd's and it does it
much more laconically compared to VFIO. Therefore these two modules
won't be more related than they are today.

thanks,
vlad

Avi Kivity

unread,

Oct 6, 2015, 11:30:09 AM10/6/15

to

On 10/06/2015 05:46 PM, Michael S. Tsirkin wrote:
> On Mon, Oct 05, 2015 at 11:28:03AM +0300, Avi Kivity wrote:
>> Eventfd is a natural enough representation of an interrupt; both kvm and
>> vfio use it, and are also able to share the eventfd, allowing a vfio
>> interrupt to generate a kvm interrupt, without userspace intervention, and
>> one day without even kernel intervention.
> eventfd without kernel intervention sounds unlikely.
>
> kvm might configure the cpu such that an interrupt will not trigger a
> vmexit. eventfd seems like an unlikely interface to do that: with the
> eventfd, device triggering it has no info about the interrupt so it
> can't send it to the correct VM.

https://lwn.net/Articles/650863/

Avi Kivity

unread,

Oct 6, 2015, 11:30:09 AM10/6/15

to

On 10/06/2015 05:56 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote:
>> The only "like VFIO" behavior we implement here is binding the MSI-X
>> interrupt notification to eventfd descriptor.
> There will be more if you add some basic memory protections.
>
> Besides, that's not true.
> Your patch queries MSI capability, sets # of vectors.
> You even hinted you want to add BAR mapping down the road.

BAR mapping is already available from sysfs; it is not mandatory.

> VFIO does all of that.
>

Copying vfio maintainer Alex (hi!).

vfio's charter is modern iommu-capable configurations. It is designed to
be secure enough to be usable by an unprivileged user.

For performance and hardware reasons, many dpdk deployments use
uio_pci_generic. They are willing to trade off the security provided by
vfio for the performance and deployment flexibility of pci_uio_generic.
Forcing these features into vfio will compromise its security and
needlessly complicate its code (I guess it can be done with a "null"
iommu, but then vfio will have to decide whether it is secure or not).

>> This doesn't justifies the
>> hassle of implementing IOMMU-less VFIO mode.
> This applies to both VFIO and UIO really. I'm not sure the hassle of
> maintaining this functionality in tree is justified. It remains to be
> seen whether there are any users that won't taint the kernel.
> Apparently not in the current form of the patch, but who knows.

It is not msix that taints the kernel, it's uio_pci_generic. Msix is a
tiny feature addition that doesn't change the security situation one bit.

btw, currently you can map BARs and dd to /dev/mem to your heart's
content without tainting the kernel. I don't see how you can claim that
msix support makes the situation worse, when root can access every bit
of physical memory, either directly or via DMA.

Vlad Zolotarov

unread,

Oct 6, 2015, 11:40:06 AM10/6/15

to

On 10/06/15 18:19, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 05:30:31PM +0300, Gleb Natapov wrote:
>> On Tue, Oct 06, 2015 at 05:19:22PM +0300, Michael S. Tsirkin wrote:
>>> On Tue, Oct 06, 2015 at 11:33:56AM +0300, Vlad Zolotarov wrote:
>>>> the solution u propose should be a matter of a separate patch and is
>>>> obviously orthogonal to this series.
>>> Doesn't work this way, sorry. You want a patch enabling MSI merged,
>>> you need to secure the MSI configuration.
>>>
>> MSI can be enabled right now without the patch by writing directly into
>> PCI bar.
> By poking at config registers in sysfs? We can block this, or we
> can log this, pretty easily. We don't ATM but it's not hard to do.
>
>> The only thing this patch adds is forwarding the interrupt to
>> an eventfd.
> This one just adds a bunch of ioctls. The next ones do
> more than you describe.

This one adds zero ioctls and the next one does exactly what Gleb
describes.

Gleb Natapov

unread,

Oct 6, 2015, 12:00:09 PM10/6/15

to

On Tue, Oct 06, 2015 at 06:19:34PM +0300, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 05:30:31PM +0300, Gleb Natapov wrote:
> > On Tue, Oct 06, 2015 at 05:19:22PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Oct 06, 2015 at 11:33:56AM +0300, Vlad Zolotarov wrote:
> > > > the solution u propose should be a matter of a separate patch and is
> > > > obviously orthogonal to this series.
> > >
> > > Doesn't work this way, sorry. You want a patch enabling MSI merged,
> > > you need to secure the MSI configuration.
> > >
> > MSI can be enabled right now without the patch by writing directly into
> > PCI bar.
>
> By poking at config registers in sysfs? We can block this, or we
> can log this, pretty easily. We don't ATM but it's not hard to do.
>

Blocking this will break userspace API. As a maintainer you should know
that we do not break userspace APIs. Logging this is fine, but how
exactly it helps you with "security"? The patch in question already
taints the kernel which is much stronger than logging.

> > The only thing this patch adds is forwarding the interrupt to
> > an eventfd.
>
> This one just adds a bunch of ioctls. The next ones do
> more than you describe.
>

Yes, it adds bunch of ioctls to do exactly what I wrote above. What
point have you tried to make by this statement? It eluded me.

Avi Kivity

unread,

Oct 6, 2015, 12:10:08 PM10/6/15

to

On 10/06/2015 06:15 PM, Michael S. Tsirkin wrote:
>> While it is possible that userspace malfunctions and accidentally programs
>> MSI incorrectly, the risk is dwarfed by the ability of userspace to program
>> DMA incorrectly.
> That seems to imply that for the upstream kernel this is not a valid usecase at all.
>

That is trivially incorrect, upstream pci_uio_generic is used with dpdk
for years. Are dpdk applications an invalid use case?

Again:

- security is not compromised. you need to be root to (ab)use this.
- all of the potentially compromising functionality has been there from
day 1
- uio_pci_generic is the only way to provide the required performance on
some configurations (where kernel drivers, or userspace drivers + iommu
are too slow)
- uio_pci_generic + msix is the only way to enable userspace drivers on
some configurations (SRIOV)

The proposed functionality does not increase the attack surface.
The proposed functionality marginally increases the bug surface.
The proposed functionality is a natural evolution of uio_pci_generic.

There is a new class of applications (network function virtualization)
which require this. They can't use the kernel drivers because they are
too slow. They can't use the iommu because it is either too slow, or
taken over by the hypervisor. They are willing to live with less kernel
protection, because they are a single user application anyway (and since
they use a kernel bypass, they don't really care that much about the
kernel).

The kernel serves more use-cases than a desktop or a multi-user
servers. Some of these users are willing to trade off protection for
performance or functionality (an extreme, yet similar, example is
linux-nommu, which allows any application to access any bit of memory,
due to the lack of protection hardware).

Gleb Natapov

unread,

Oct 6, 2015, 12:10:08 PM10/6/15

to

On Tue, Oct 06, 2015 at 06:15:54PM +0300, Michael S. Tsirkin wrote:
> > While it is possible that userspace malfunctions and accidentally programs
> > MSI incorrectly, the risk is dwarfed by the ability of userspace to program
> > DMA incorrectly.
>
> That seems to imply that for the upstream kernel this is not a valid usecase at all.
>

Are you implying that uio_pci_generic should be removed from upstream
kernel because Avi did not describe anything that cannot be done with
upstream kernel right now.

--
Gleb.

Vlad Zolotarov

unread,

Oct 6, 2015, 12:20:09 PM10/6/15

to

This series add support for MSI and MSI-X interrupts to uio_pci_generic driver.

Currently uio_pci_generic demands INT#x interrupts source be available. However
there are devices that simply don't have INT#x capability, for instance SR-IOV

VF devices that simply don't have INT#x capability. For such devices

uio_pci_generic will simply fail (more specifically its probe() will fail).

When IOMMU is either not available (e.g. Amazon EC2) or not acceptable due to
performance overhead and thus VFIO is not an option users that develop
user-space drivers are left without any option but to develop some proprietary
UIO drivers (e.g. igb_uio driver in Intel's DPDK) just to be able to use UIO
infrastructure.

This series provides a generic solution for this problem while preserving the
original behaviour for devices for which the original uio_pci_generic had worked
before (i.e. INT#x will be used by default).

New in v4:
- Use portable __u32 and __s32 types from asm/types.h for
defining uio_pci_generic_irq_set fields.
- Use proper _IO macros for defining read and write ioctl()
commands.
- Moved bars mapping setting into a separate patch.
- Update uio_pci_generic example in uio-howto.tmpl.

New in v3:
- Add __iomem qualifier to temp buffer receiving ioremap value.

New in v2:
- Added #include <linux/uaccess.h> to uio_pci_generic.c

Vlad Zolotarov (4):
uio: add ioctl support
uio_pci_generic: properly initialize PCI bars mappings towards UIO

uio_pci_generic: add MSI/MSI-X support
Documentation: update uio-howto

Documentation/DocBook/uio-howto.tmpl | 139 ++++++++++--
drivers/uio/uio.c | 15 ++
drivers/uio/uio_pci_generic.c | 409 +++++++++++++++++++++++++++++++++--
include/linux/uio_driver.h | 3 +
include/uapi/linux/uio_pci_generic.h | 51 +++++
5 files changed, 574 insertions(+), 43 deletions(-)
create mode 100644 include/uapi/linux/uio_pci_generic.h

--
2.1.0

Message has been deleted

Vlad Zolotarov

unread,

Oct 6, 2015, 12:20:12 PM10/6/15

to

Change the chapters related to uio_pci_generic that refer interrupt mode.
Add the relevant explanation regarding MSI and MSI-X interrupt modes
support.

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---
New in v4:

- Update uio_pci_generic example in uio-howto.tmpl

---
Documentation/DocBook/uio-howto.tmpl | 133 ++++++++++++++++++++++++++++++-----
1 file changed, 116 insertions(+), 17 deletions(-)

diff --git a/Documentation/DocBook/uio-howto.tmpl b/Documentation/DocBook/uio-howto.tmpl
index cd0e452..507b2ca 100644
--- a/Documentation/DocBook/uio-howto.tmpl
+++ b/Documentation/DocBook/uio-howto.tmpl
@@ -46,6 +46,12 @@ GPL version 2.

<revhistory>
<revision>
+ <revnumber>0.10</revnumber>
+ <date>2015-10-04</date>
+ <authorinitials>vz</authorinitials>
+ <revremark>Added MSI and MSI-X support to uio_pci_generic.</revremark>
+ </revision>
+ <revision>
<revnumber>0.9</revnumber>
<date>2009-07-16</date>
<authorinitials>mst</authorinitials>
@@ -935,15 +941,32 @@ and look in the output for failure reasons
<sect1 id="uio_pci_generic_internals">
<title>Things to know about uio_pci_generic</title>
<para>
-Interrupts are handled using the Interrupt Disable bit in the PCI command
+Interrupts are handled either as MSI-X or MSI interrupts (if the device supports it) or
+as legacy INTx interrupts. By default INTx interrupts are used.
+ </para>
+ <para>
+uio_pci_generic automatically configures a device to use INTx interrupt for backward
+compatibility. If INTx are not available MSI-X interrupts will be used if the device
+supports it and if not MSI interrupts are going to be used. If none of the interrupts
+modes is supported probe() will fail.
+ </para>
+ <para>
+To get the used interrupt mode application has to use UIO_PCI_GENERIC_INT_MODE_GET ioctl
+command.
+UIO_PCI_GENERIC_IRQ_NUM_GET ioctl command may be used to get the total number of IRQs.
+Then UIO_PCI_GENERIC_IRQ_SET ioctl command may be used to bind a specific eventfd to a specific
+IRQ vector.
+ </para>
+ <para>
+Legacy interrupts are handled using the Interrupt Disable bit in the PCI command
register and Interrupt Status bit in the PCI status register. All devices
compliant to PCI 2.3 (circa 2002) and all compliant PCI Express devices should
support these bits. uio_pci_generic detects this support, and won't bind to
devices which do not support the Interrupt Disable Bit in the command register.
</para>
<para>
-On each interrupt, uio_pci_generic sets the Interrupt Disable bit.
-This prevents the device from generating further interrupts
+If legacy interrupts are used, uio_pci_generic sets the Interrupt Disable bit on
+each interrupt. This prevents the device from generating further interrupts
until the bit is cleared. The userspace driver should clear this
bit before blocking and waiting for more interrupts.
</para>
@@ -966,17 +989,23 @@ Here is some sample userspace driver code using uio_pci_generic:
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
+#include <linux/uio_pci_generic.h>
#include <fcntl.h>
#include <errno.h>
+#include <sys/eventfd.h>
+#include <sys/ioctl.h>

int main()
{
- int uiofd;
+ int uiofd, event_fd;
int configfd;
int err;
int i;
unsigned icount;
+ __u64 read_buf;
unsigned char command_high;
+ __u32 int_mode, num_irqs;
+ int bytes_to_read;

uiofd = open("/dev/uio0", O_RDONLY);
if (uiofd < 0) {
@@ -989,13 +1018,65 @@ int main()
return errno;
}

- /* Read and cache command value */
- err = pread(configfd, &command_high, 1, 5);
- if (err != 1) {
- perror("command config read:");
+ /* Get the interrupt mode */
+ if (ioctl(uiofd, UIO_PCI_GENERIC_INT_MODE_GET, &int_mode) < 0)
+ { perror("getting interrupt mode"); return errno;
+ }
+
+ /* Read the number of available IRQs - for INT#x and MSI 1 (one) will be returned */
+ if (ioctl(uiofd, UIO_PCI_GENERIC_IRQ_NUM_GET, &num_irqs) < 0) {
+ perror("getting IRQs number");
return errno;
}
- command_high &= ~0x4;
+
+ /* if interrupt mode is MSI-X - allocate eventfd file descriptor and bind it to
+ * the IRQ you need. We'll bind it to the first available IRQ in this example.
+ */
+ if (int_mode == UIO_INT_MODE_MSIX) {
+ struct uio_pci_generic_irq_set irq_set;
+
+ if (num_irqs < 1) {
+ printf("Hmmm... Zero IRQ numbers. Something wrong with MSI-X configuration\n");
+ return -1;
+ }
+
+ printf("Interrupt mode is MSI-X, we are going to use IRQ[0]\n");
+
+ /* set up an eventfd for an interrupt */
+ event_fd = eventfd(0, EFD_CLOEXEC);
+ if (event_fd < 0) {
+ perror("cannot create irq eventfd");
+ return errno;
+ }
+
+ /* connect to the first IRQ */
+ irq_set.vec = 0;
+ irq_set.fd = event_fd;
+
+ if (ioctl(uiofd, UIO_PCI_GENERIC_IRQ_SET, &irq_set) < 0) {
+ perror("binding the eventfd descriptor to IRQ[0]");
+ return errno;
+ }
+
+ /* eventfd read() requires to read 8 bytes, while UIO - requires 4 bytes */
+ bytes_to_read = 8;
+ } else if (int_mode == UIO_INT_MODE_MSI || int_mode == UIO_INT_MODE_INTX) {
+ event_fd = uiofd;
+ bytes_to_read = 4;
+
+ if (int_mode == UIO_INT_MODE_INTX) {
+ /* Read and cache command value */
+ err = pread(configfd, &command_high, 1, 5);
+ if (err != 1) {
+ perror("command config read:");
+ return errno;
+ }
+ command_high &= ~0x4;
+ }
+ } else {
+ printf("Interrupts are not supported\n");
+ return -1;
+ }

for(i = 0;; ++i) {
/* Print out a message, for debugging. */
@@ -1006,24 +1087,42 @@ int main()

/****************************************/
/* Here we got an interrupt from the
- device. Do something to it. */
+ device. Do something to it and handle the HW
+ interrupt state machine if needed. */
/****************************************/

- /* Re-enable interrupts. */
- err = pwrite(configfd, &command_high, 1, 5);
- if (err != 1) {
- perror("config write:");
- break;
+ if (int_mode == UIO_INT_MODE_INTX) {
+ /* Re-enable interrupts. */
+ err = pwrite(configfd, &command_high, 1, 5);
+ if (err != 1) {
+ perror("config write:");
+ break;
+ }
}

/* Wait for next interrupt. */
- err = read(uiofd, &icount, 4);
- if (err != 4) {
+ err = read(event_fd, &read_buf, bytes_to_read);
+ if (err != bytes_to_read) {
perror("uio read:");
break;
}

+ icount++;
}
+
+ /* optional: unbind the eventfd from the IRQ */
+ if (int_mode == UIO_INT_MODE_MSIX) {
+ struct uio_pci_generic_irq_set irq_set = {
+ .vec = 0,
+ .fd = -1
+ };
+
+ if (ioctl(uiofd, UIO_PCI_GENERIC_IRQ_SET, &irq_set) < 0) {
+ perror("unbinding the eventfd descriptor from IRQ[0]");
+ return errno;
+ }
+ }
+
return errno;

Vlad Zolotarov

unread,

Oct 6, 2015, 12:30:08 PM10/6/15

to

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---

drivers/uio/uio.c | 15 +++++++++++++++
include/linux/uio_driver.h | 3 +++
2 files changed, 18 insertions(+)

diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
index 8196581..714b0e5 100644
--- a/drivers/uio/uio.c
+++ b/drivers/uio/uio.c
@@ -704,6 +704,20 @@ static int uio_mmap(struct file *filep, struct vm_area_struct *vma)
}
}

+static long uio_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
+{
+ struct uio_listener *listener = filep->private_data;
+ struct uio_device *idev = listener->dev;
+
+ if (!idev->info)
+ return -EIO;
+
+ if (!idev->info->ioctl)
+ return -ENOTTY;
+
+ return idev->info->ioctl(idev->info, cmd, arg);
+}
+
static const struct file_operations uio_fops = {
.owner = THIS_MODULE,
.open = uio_open,
@@ -712,6 +726,7 @@ static const struct file_operations uio_fops = {
.write = uio_write,
.mmap = uio_mmap,
.poll = uio_poll,
+ .unlocked_ioctl = uio_ioctl,
.fasync = uio_fasync,
.llseek = noop_llseek,
};
diff --git a/include/linux/uio_driver.h b/include/linux/uio_driver.h
index 32c0e83..10d7833 100644
--- a/include/linux/uio_driver.h
+++ b/include/linux/uio_driver.h
@@ -89,6 +89,7 @@ struct uio_device {
* @mmap: mmap operation for this uio device
* @open: open operation for this uio device
* @release: release operation for this uio device
+ * @ioctl: ioctl handler
* @irqcontrol: disable/enable irqs when 0/1 is written to /dev/uioX
*/
struct uio_info {
@@ -105,6 +106,8 @@ struct uio_info {
int (*open)(struct uio_info *info, struct inode *inode);
int (*release)(struct uio_info *info, struct inode *inode);
int (*irqcontrol)(struct uio_info *info, s32 irq_on);
+ int (*ioctl)(struct uio_info *info, unsigned int cmd,
+ unsigned long arg);
};

extern int __must_check

Vlad Zolotarov

unread,

Oct 6, 2015, 12:40:07 PM10/6/15

to

On 10/06/15 18:13, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 05:40:23PM +0300, Vlad Zolotarov wrote:
>>> I'm guessing it doesn't enable MSI though, does it?
>> Again, enabling MSI is a matter of a trivial patch configuring device
>> registers on the device BAR.
> No, not really.

Sure that is!
Look at pci_msi_set_enable(): it's a single read and then write, just
like the code for bus master enabling. The msi capability address may be
retrieved from lspci and then mapped for instance using sysfs.

Configuring the MSI table is a few more reads and writes too but at the
bottom line - it's a trivial code too...

Vlad Zolotarov

unread,

Oct 6, 2015, 1:20:08 PM10/6/15

to

Add the ability for underlying device drivers to register the ioctl
commands. This is useful when some interaction with the user space
beyond sysfs capabilities is required, e.g. query the interrupt mode
or bind eventfd to interrupt notifications (similarly to vfio ioctl
VFIO_DEVICE_SET_IRQS).

Vlad Zolotarov

unread,

Oct 6, 2015, 1:20:08 PM10/6/15

to

This series add support for MSI and MSI-X interrupts to uio_pci_generic driver.

Currently uio_pci_generic demands INT#x interrupts source be available. However
there are devices that simply don't have INT#x capability, for instance SR-IOV
VF devices that simply don't have INT#x capability. For such devices
uio_pci_generic will simply fail (more specifically its probe() will fail).

When IOMMU is either not available (e.g. Amazon EC2) or not acceptable due to
performance overhead and thus VFIO is not an option users that develop
user-space drivers are left without any option but to develop some proprietary
UIO drivers (e.g. igb_uio driver in Intel's DPDK) just to be able to use UIO
infrastructure.

This series provides a generic solution for this problem while preserving the
original behaviour for devices for which the original uio_pci_generic had worked
before (i.e. INT#x will be used by default).

New in v5:
- Expanded the commitlog on PATCH1.

New in v4:
- Use portable __u32 and __s32 types from asm/types.h for
defining uio_pci_generic_irq_set fields.
- Use proper _IO macros for defining read and write ioctl()
commands.
- Moved bars mapping setting into a separate patch.
- Update uio_pci_generic example in uio-howto.tmpl.

New in v3:
- Add __iomem qualifier to temp buffer receiving ioremap value.

New in v2:
- Added #include <linux/uaccess.h> to uio_pci_generic.c

Vlad Zolotarov (4):
uio: add ioctl support
uio_pci_generic: properly initialize PCI bars mappings towards UIO
uio_pci_generic: add MSI/MSI-X support
Documentation: update uio-howto

Documentation/DocBook/uio-howto.tmpl | 139 ++++++++++--
drivers/uio/uio.c | 15 ++
drivers/uio/uio_pci_generic.c | 409 +++++++++++++++++++++++++++++++++--
include/linux/uio_driver.h | 3 +
include/uapi/linux/uio_pci_generic.h | 51 +++++
5 files changed, 574 insertions(+), 43 deletions(-)
create mode 100644 include/uapi/linux/uio_pci_generic.h

Message has been deleted

Vlad Zolotarov

unread,

Oct 6, 2015, 1:20:10 PM10/6/15

to

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---

drivers/uio/uio_pci_generic.c | 89 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)

diff --git a/drivers/uio/uio_pci_generic.c b/drivers/uio/uio_pci_generic.c
index d0b508b..2c6e2b1 100644
--- a/drivers/uio/uio_pci_generic.c
+++ b/drivers/uio/uio_pci_generic.c
@@ -40,6 +40,75 @@ to_uio_pci_generic_dev(struct uio_info *info)
return container_of(info, struct uio_pci_generic_dev, info);
}

+/* Unmap previously ioremap'd resources */
+static void release_iomaps(struct uio_pci_generic_dev *gdev)
+{
+ int i;
+ struct uio_mem *mem = gdev->info.mem;
+
+ for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
+ if (mem->internal_addr) {
+ iounmap(mem->internal_addr);
+ mem->internal_addr = NULL;
+ }
+ }
+}
+
+static int setup_maps(struct pci_dev *pdev, struct uio_info *info)
+{
+ int i, m = 0, p = 0, err;
+ static const char * const bar_names[] = {
+ "BAR0", "BAR1", "BAR2", "BAR3", "BAR4", "BAR5",
+ };
+
+ for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
+ unsigned long start = pci_resource_start(pdev, i);
+ unsigned long flags = pci_resource_flags(pdev, i);
+ unsigned long len = pci_resource_len(pdev, i);
+
+ if (start == 0 || len == 0)
+ continue;
+
+ if (flags & IORESOURCE_MEM) {
+ void __iomem *addr;
+
+ if (m >= MAX_UIO_MAPS)
+ continue;
+
+ addr = ioremap(start, len);
+ if (addr == NULL) {
+ err = -EINVAL;
+ goto fail;
+ }
+
+ info->mem[m].name = bar_names[i];
+ info->mem[m].addr = start;
+ info->mem[m].internal_addr = addr;
+ info->mem[m].size = len;
+ info->mem[m].memtype = UIO_MEM_PHYS;
+ ++m;
+ } else if (flags & IORESOURCE_IO) {
+ if (p >= MAX_UIO_PORT_REGIONS)
+ continue;
+
+ info->port[p].name = bar_names[i];
+ info->port[p].start = start;
+ info->port[p].size = len;
+ info->port[p].porttype = UIO_PORT_X86;
+ ++p;
+ }
+ }
+
+ return 0;
+fail:
+ for (i = 0; i < m; i++) {
+ iounmap(info->mem[i].internal_addr);
+ info->mem[i].internal_addr = NULL;
+ }
+
+ return err;
+}
+
/* Interrupt handler. Read/modify/write the command register to disable
* the interrupt. */
static irqreturn_t irqhandler(int irq, struct uio_info *info)
@@ -86,18 +155,35 @@ static int probe(struct pci_dev *pdev,

gdev->info.name = "uio_pci_generic";
gdev->info.version = DRIVER_VERSION;
+
+ err = pci_request_regions(pdev, "uio_pci_generic");
+ if (err != 0) {
+ dev_err(&pdev->dev, "Cannot request regions\n");
+ goto err_request_regions;
+ }
+
gdev->info.irq = pdev->irq;
gdev->info.irq_flags = IRQF_SHARED;
gdev->info.handler = irqhandler;
gdev->pdev = pdev;

+ /* remap resources */
+ err = setup_maps(pdev, &gdev->info);
+ if (err)
+ goto err_maps;
+
err = uio_register_device(&pdev->dev, &gdev->info);
if (err)
goto err_register;
pci_set_drvdata(pdev, gdev);

return 0;
+
err_register:
+ release_iomaps(gdev);
+err_maps:
+ pci_release_regions(pdev);
+err_request_regions:
kfree(gdev);
err_alloc:
err_verify:
@@ -110,8 +196,11 @@ static void remove(struct pci_dev *pdev)
struct uio_pci_generic_dev *gdev = pci_get_drvdata(pdev);

uio_unregister_device(&gdev->info);
+ release_iomaps(gdev);
+ pci_release_regions(pdev);
pci_disable_device(pdev);
kfree(gdev);
+ pci_set_drvdata(pdev, NULL);
}

static struct pci_driver uio_pci_driver = {

Vlad Zolotarov

unread,

Oct 6, 2015, 1:20:13 PM10/6/15

to

Change the chapters related to uio_pci_generic that refer interrupt mode.
Add the relevant explanation regarding MSI and MSI-X interrupt modes
support.

Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
---

Michael S. Tsirkin

unread,

Oct 6, 2015, 2:30:08 PM10/6/15

to

On Tue, Oct 06, 2015 at 08:17:35PM +0300, Vlad Zolotarov wrote:
> This series add support for MSI and MSI-X interrupts to uio_pci_generic driver.
>
> Currently uio_pci_generic demands INT#x interrupts source be available. However
> there are devices that simply don't have INT#x capability, for instance SR-IOV
> VF devices that simply don't have INT#x capability. For such devices
> uio_pci_generic will simply fail (more specifically its probe() will fail).
>
> When IOMMU is either not available (e.g. Amazon EC2) or not acceptable due to
> performance overhead and thus VFIO is not an option users that develop
> user-space drivers are left without any option but to develop some proprietary
> UIO drivers (e.g. igb_uio driver in Intel's DPDK) just to be able to use UIO
> infrastructure.
>
> This series provides a generic solution for this problem while preserving the
> original behaviour for devices for which the original uio_pci_generic had worked
> before (i.e. INT#x will be used by default).
>
> New in v5:
> - Expanded the commitlog on PATCH1.

Looks like you didn't attempt to address any of my review comments.
I don't intend to review this until you do.

Alex Williamson

unread,

Oct 6, 2015, 3:00:09 PM10/6/15

to

On Tue, 2015-10-06 at 18:23 +0300, Avi Kivity wrote:
>
> On 10/06/2015 05:56 PM, Michael S. Tsirkin wrote:
> > On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote:
> >> The only "like VFIO" behavior we implement here is binding the MSI-X
> >> interrupt notification to eventfd descriptor.
> > There will be more if you add some basic memory protections.
> >
> > Besides, that's not true.
> > Your patch queries MSI capability, sets # of vectors.
> > You even hinted you want to add BAR mapping down the road.
>
> BAR mapping is already available from sysfs; it is not mandatory.
>
> > VFIO does all of that.
> >
>
> Copying vfio maintainer Alex (hi!).
>
> vfio's charter is modern iommu-capable configurations. It is designed to
> be secure enough to be usable by an unprivileged user.
>
> For performance and hardware reasons, many dpdk deployments use
> uio_pci_generic. They are willing to trade off the security provided by
> vfio for the performance and deployment flexibility of pci_uio_generic.
> Forcing these features into vfio will compromise its security and
> needlessly complicate its code (I guess it can be done with a "null"
> iommu, but then vfio will have to decide whether it is secure or not).

It's not just the iommu model vfio uses, it's that vfio is built around
iommu groups. For instance to use a device in vfio, the user opens the
vfio group file and asks for the device within that group. That's a
fairly fundamental part of the mechanics to sidestep.

However, is there an opportunity at a lower level? Systems without an
iommu typically have dma ops handled via a software iotlb (ie. bounce
buffers), but I think they simply don't have iommu ops registered.
Could a no-iommu, iommu subsystem provide enough dummy iommu ops to fake
out vfio? It would need to iterate the devices on the bus and come up
with dummy iommu groups and dummy versions of iommu_map and unmap. The
grouping is easy, one device per group, there's no isolation anyway.
The vfio type1 iommu backend will do pinning, which seems like an
improvement over the mlock that uio users probably try to do now. I
guess the no-iommu map would error if the IOVA isn't simply the bus
address of the page mapped.

Of course this is entirely unsafe and this no-iommu driver should taint
the kernel, but it at least standardizes on one userspace API and you're
already doing completely unsafe things with uio. vfio should be
enlightened at least to the point that it allows only privileged users
access to devices under such a (lack of) iommu.

Vlad Zolotarov

unread,

Oct 6, 2015, 4:40:09 PM10/6/15

to

On 10/06/15 21:27, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 08:17:35PM +0300, Vlad Zolotarov wrote:
>> This series add support for MSI and MSI-X interrupts to uio_pci_generic driver.
>>
>> Currently uio_pci_generic demands INT#x interrupts source be available. However
>> there are devices that simply don't have INT#x capability, for instance SR-IOV
>> VF devices that simply don't have INT#x capability. For such devices
>> uio_pci_generic will simply fail (more specifically its probe() will fail).
>>
>> When IOMMU is either not available (e.g. Amazon EC2) or not acceptable due to
>> performance overhead and thus VFIO is not an option users that develop
>> user-space drivers are left without any option but to develop some proprietary
>> UIO drivers (e.g. igb_uio driver in Intel's DPDK) just to be able to use UIO
>> infrastructure.
>>
>> This series provides a generic solution for this problem while preserving the
>> original behaviour for devices for which the original uio_pci_generic had worked
>> before (i.e. INT#x will be used by default).
>>
>> New in v5:
>> - Expanded the commitlog on PATCH1.
> Looks like you didn't attempt to address any of my review comments.
> I don't intend to review this until you do.

So far there hasn't been any comments related to the code in these
patches from your side but rather comments about the general flaws of
the current uio_pci_generic in particular and UIO in general that have
nothing to do with this series. Therefore obviously there was nothing to
address.
If u have any comments related to _THIS_ series I'd be glad to address.
So far I was under the strong impression that u develop an obviously
theoretical discussion about "nice to have fixed" stuff in UIO, which
was obvious to everybody on this thread had nothing to do with this
patch series.

Could it be that I've got u wrong? If so, could u, pls., clarify what
u'd like me to fix in these patches exactly and why?

thanks,
vlad

Vlad Zolotarov

unread,

Oct 6, 2015, 4:50:08 PM10/6/15

to

I beg my pardon, there was one relevant comment of yours - to split the
bars mapping into a separate patch. And this has been addressed. ;)

Stephen Hemminger

unread,

Oct 6, 2015, 5:40:07 PM10/6/15

to

On Tue, 06 Oct 2015 12:51:20 -0600
Alex Williamson <alex.wi...@redhat.com> wrote:

> Of course this is entirely unsafe and this no-iommu driver should taint
> the kernel, but it at least standardizes on one userspace API and you're
> already doing completely unsafe things with uio. vfio should be
> enlightened at least to the point that it allows only privileged users
> access to devices under such a (lack of) iommu

I agree with the design, but not with the taint argument.
(Unless you want to taint any and all use of UIO drivers which can
already do this).

Stephen Hemminger

unread,

Oct 6, 2015, 5:40:08 PM10/6/15

to

On Tue, 6 Oct 2015 20:17:38 +0300
Vlad Zolotarov <vl...@cloudius-systems.com> wrote:

> + int i, vectors = pci_msix_vec_count(pdev);
> +
> + if (vectors <= 0)
> + return false;
> +

Since devices like fm10k can thousands of MSI vectors, and often systems
have limited number of slots, some resource control is needed.
I limited the number in my version. That was one of the reasons for the late
binding of MSI vectors. Intended to allow application to ask for how
many it wanted.

Stephen Hemminger

unread,

Oct 6, 2015, 5:40:08 PM10/6/15

to

On Tue, 6 Oct 2015 20:17:36 +0300
Vlad Zolotarov <vl...@cloudius-systems.com> wrote:

> Add the ability for underlying device drivers to register the ioctl
> commands. This is useful when some interaction with the user space
> beyond sysfs capabilities is required, e.g. query the interrupt mode
> or bind eventfd to interrupt notifications (similarly to vfio ioctl
> VFIO_DEVICE_SET_IRQS).
>
> Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>

After discussions on and off list, the idea of an ioctl interface
is just not going to be accepted upstream because it can be abused.
Therefore another API will be necessary.

Ps: since I did most of this first, I am surprised you never gave
any attribution for the earlier work.

Alex Williamson

unread,

Oct 6, 2015, 5:50:07 PM10/6/15

to

On Tue, 2015-10-06 at 22:32 +0100, Stephen Hemminger wrote:
> On Tue, 06 Oct 2015 12:51:20 -0600
> Alex Williamson <alex.wi...@redhat.com> wrote:
>
> > Of course this is entirely unsafe and this no-iommu driver should taint
> > the kernel, but it at least standardizes on one userspace API and you're
> > already doing completely unsafe things with uio. vfio should be
> > enlightened at least to the point that it allows only privileged users
> > access to devices under such a (lack of) iommu
>
> I agree with the design, but not with the taint argument.
> (Unless you want to taint any and all use of UIO drivers which can
> already do this).

Yes, actually, if the bus master bit gets enabled all bets are off. I
don't see how that leaves a supportable kernel, so we might as well
taint it. Isn't this exactly why we taint for proprietary drivers, we
have no idea what it has mucked with in kernel space. This just moves
the proprietary driver out to userspace without an iommu to protect the
host. Thanks,

Alex

Michael S. Tsirkin

unread,

Oct 6, 2015, 7:00:07 PM10/6/15

to

The issues I pointed out are not "nice to have fixed" at all.
The patchset isn't acceptable if you don't address them.
Sorry, I don't have the time to go over them again.
Please just dig them out of the archive.

Avi Kivity

unread,

Oct 7, 2015, 3:00:07 AM10/7/15

to

Right now, people use hugetlbfs maps, which both locks the memory and
provides better performance.

> I
> guess the no-iommu map would error if the IOVA isn't simply the bus
> address of the page mapped.
>
> Of course this is entirely unsafe and this no-iommu driver should taint
> the kernel, but it at least standardizes on one userspace API and you're
> already doing completely unsafe things with uio. vfio should be
> enlightened at least to the point that it allows only privileged users
> access to devices under such a (lack of) iommu.

There is an additional complication. With an iommu, userspace programs
the device with virtual addresses, but without it, they have to program
physical addresses. So vfio would need to communicate this bit of
information.

We can go further and define a better translation API than the current
one (reading /proc/pagemap). But it's going to be a bigger change to
vfio than I thought at first.

Vlad Zolotarov

unread,

Oct 7, 2015, 4:00:08 AM10/7/15

to

Thanks for clarification, Alex.
One of the important points in the above description is that vfio has
been build around IOMMU groups - and that's a good thing!
This means that this ensures the safety for vfio users and IMHO breaking
this by introducing the no-iommu mode won't bring any good.

What do we have on a negative side of this step:

1. Just a description of the work that has to be done implies a
non-trivial code that will have to be maintained later.
2. This new mode will be absolutely unsafe while some users may
mistakenly assume that using vfio is safe in all situations like it
is now.
3. The vfio user interface is "a bit" more complicated than the one of
UIO's and if there isn't any added value (see below) users will just
prefer continue using UIO.

Let's try to analyze the possible positive sides (as u've described them
above):

1. /This added feature may allow to standardize the user-space drivers
interface. Why is it good? - This could allow us to maintain only
one infrastructure (vfio)./ That's true but unfortunately UIO
interface is already widely used and thus it can't be just killed.
Therefore instead of maintaining one unsafe user-space driver
infrastructure we'll have to maintain two. Therefore the result will
be exactly the opposite from the expected.
2. I'm not very familiar with all vfio features but regarding the "vfio
type1 iommu backend going to do pinning instead of mlock" - another
alternative to pin the pages in the memory is to use hugetlbfs,
which UIO users like DPDK do. I may be wrong but I'm not sure that
in this case using vfio type1 iommu backend would be beneficial it
terms of performance and performance is usually the most important
factor for un-safe mode users (e.g. DPDK).

So, considering the above I think that instead of complicating the
already non-trivial vfio interface even more we'd rather have two types
of user-space interfaces:

* safe - VFIO
* not safe - UIO

The thing is that this is more or less the situation right now and
according to negatives.3 above it is likely to remain this way (at least
the UIO part ;)) for some (long) time so all this "adding unsafe mode to
VFIO" initiative looks completely useless.

thanks,
vlad

Vlad Zolotarov

unread,

Oct 7, 2015, 4:00:10 AM10/7/15

to

On 10/07/15 00:58, Stephen Hemminger wrote:
> Go ahead and submit a seperate taint bit for UIO as a patch.

This patch already does this.

thanks,
vlad

>
>
> On Tue, Oct 6, 2015 at 10:41 PM, Alex Williamson
> <alex.wi...@redhat.com <mailto:alex.wi...@redhat.com>> wrote:
>
> On Tue, 2015-10-06 at 22:32 +0100, Stephen Hemminger wrote:
> > On Tue, 06 Oct 2015 12:51:20 -0600
> > Alex Williamson <alex.wi...@redhat.com

Vlad Zolotarov

unread,

Oct 7, 2015, 4:10:06 AM10/7/15

to

On 10/07/15 09:53, Avi Kivity wrote:

> On 10/07/2015 12:58 AM, Stephen Hemminger wrote:
>> Go ahead and submit a seperate taint bit for UIO as a patch.
>>
>

> Taint should only be applied if bus mastering is enabled (to avoid
> annoying the users of the original uio use case)

Pls., note that this series would enable the legacy INT#X mode if
possible and this, of course, without enabling bus mastering and without
tainting the kernel.
This means that the current users of uio_pci_generic won't feel/get any
difference after/if these patches are applied since before these patches
it could only be used with the devices that do have INT#X capability.

>
>>
>> On Tue, Oct 6, 2015 at 10:41 PM, Alex Williamson
>> <alex.wi...@redhat.com <mailto:alex.wi...@redhat.com>> wrote:
>>

>> On Tue, 2015-10-06 at 22:32 +0100, Stephen Hemminger wrote:
>> > On Tue, 06 Oct 2015 12:51:20 -0600
>> > Alex Williamson <alex.wi...@redhat.com

Vlad Zolotarov

unread,

Oct 7, 2015, 4:10:06 AM10/7/15

to

On 10/07/15 11:00, Vlad Zolotarov wrote:
>
>
> On 10/07/15 09:53, Avi Kivity wrote:
>> On 10/07/2015 12:58 AM, Stephen Hemminger wrote:
>>> Go ahead and submit a seperate taint bit for UIO as a patch.
>>>
>>
>> Taint should only be applied if bus mastering is enabled (to avoid
>> annoying the users of the original uio use case)
>
> Pls., note that this series would enable the legacy INT#X mode if
> possible

By default I meant.

Vlad Zolotarov

unread,

Oct 7, 2015, 4:20:06 AM10/7/15

to

On 10/07/15 11:17, Vlad Zolotarov wrote:

>
>
> On 10/07/15 00:33, Stephen Hemminger wrote:
>> On Tue, 6 Oct 2015 20:17:36 +0300
>> Vlad Zolotarov <vl...@cloudius-systems.com> wrote:
>>
>>> Add the ability for underlying device drivers to register the ioctl
>>> commands. This is useful when some interaction with the user space
>>> beyond sysfs capabilities is required, e.g. query the interrupt mode
>>> or bind eventfd to interrupt notifications (similarly to vfio ioctl
>>> VFIO_DEVICE_SET_IRQS).
>>>
>>> Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
>> After discussions on and off list, the idea of an ioctl interface
>> is just not going to be accepted upstream because it can be abused.
>> Therefore another API will be necessary.
>

> I'm open for ideas. So far ioctl seems like the most appropriate
> candidate for the job... ;)

>
>>
>> Ps: since I did most of this first,
>

> this particular patch and parts of SET_IRQ code - yes. But that would
> be all. Let's just be specific... ;)

Pardon - mapping bars was also snitched from your patches... ;)

> This still doesn't mean that u don't deserve a proper credit for that.
> ;) But there was a reason why I didn't do that. See below.

>> I am surprised you never gave
>> any attribution for the earlier work.
>

> If memory serves me well I asked u about the "attribution" a few days
> ago of the dpdk-dev list but got no answer.
> Therefore I took no steps in this regard since it left me under the
> impression that u just didn't want it. However it seems now
> that that's not the case... ;)
>
> Pls., let me know if u want me to mention that this patch and SET_IRQ
> ioctl code is based on your patches on dpdk-dev list and if yes, in which
> form: just mentioning it in the patch description or putting your
> signed-off to the patch(es).
>
> thanks,
> vlad

Vlad Zolotarov

unread,

Oct 7, 2015, 4:20:06 AM10/7/15

to

On 10/07/15 00:33, Stephen Hemminger wrote:

> On Tue, 6 Oct 2015 20:17:36 +0300
> Vlad Zolotarov <vl...@cloudius-systems.com> wrote:
>
>> Add the ability for underlying device drivers to register the ioctl
>> commands. This is useful when some interaction with the user space
>> beyond sysfs capabilities is required, e.g. query the interrupt mode
>> or bind eventfd to interrupt notifications (similarly to vfio ioctl
>> VFIO_DEVICE_SET_IRQS).
>>
>> Signed-off-by: Vlad Zolotarov <vl...@cloudius-systems.com>
> After discussions on and off list, the idea of an ioctl interface
> is just not going to be accepted upstream because it can be abused.
> Therefore another API will be necessary.

I'm open for ideas. So far ioctl seems like the most appropriate
candidate for the job... ;)

>

> Ps: since I did most of this first,

this particular patch and parts of SET_IRQ code - yes. But that would be
all. Let's just be specific... ;)

This still doesn't mean that u don't deserve a proper credit for that.
;) But there was a reason why I didn't do that. See below.

> I am surprised you never gave
> any attribution for the earlier work.

If memory serves me well I asked u about the "attribution" a few days
ago of the dpdk-dev list but got no answer.
Therefore I took no steps in this regard since it left me under the
impression that u just didn't want it. However it seems now
that that's not the case... ;)

Pls., let me know if u want me to mention that this patch and SET_IRQ
ioctl code is based on your patches on dpdk-dev list and if yes, in which
form: just mentioning it in the patch description or putting your
signed-off to the patch(es).

thanks,
vlad

Vlad Zolotarov

unread,

Oct 7, 2015, 4:40:07 AM10/7/15

to

Let's summarize "the issues u've pointed out" and the explanations why
they are not relevant to this series for those that haven't been addressed.

1. "This patch enables bus mastering when MSI or MSI-X interrupts mode
is used and this would allow bogus user space application trash
kernel memory".
1. It's been explained that simple bug in the application that uses
the newly added interface may not do this. Only specifically
written user space code that explicitly programs MSI table from
the user space may cause such harm and this may be done even
without these patches with the current uio_pci_generic
implementation.
2. It's been described how bus mastering and MSI may be enabled
from the user space right now.
3. It's been explained how root may trash the kernel memory right
now even without UIO.
2. "This patch allows insecure DMA operations since MSI actually
performs them"
1. It's been explained that this series adds no additional threat
to the current driver state since (as described above) this may
be done without these patches right now.
3. "Users should not be able to program MSI table from the user space
therefore it should be zero-mapped".
1. This may be true but this has nothing to do with this patches
since they have nothing to do with this ability.
4. "Why to mimic VFIO SET_IRQ ioctl instead of patching VFIO to be able
to work in iommu-less mode?"
1. Because this would break the basic VFIO design assumption and
would require a lot of non-trivial codding with little-to-none
added value. See the relevant thread for more detail.
5. "Separate bars mapping code into a separate patch"
1. Addressed.

Hope I didn't miss anything.
So, as u may see everything that could be addressed have been addressed
and the rest are as I've already stated - irrelevant for this patch.

thanks,
vlad

Greg KH

unread,

Oct 7, 2015, 4:50:06 AM10/7/15

to

As Stephen said, I will not take this, sorry. It opens up the ability
to add "new system calls" to a huge range of crappy drivers, it's
something that vendors have been trying to push for years and is
something that I will not allow.

greg k-h

Vlad Zolotarov

unread,

Oct 7, 2015, 5:00:10 AM10/7/15

to

Ok. Another alternative could be to add new sysfs attributes for the
MSI-X functionality similarly to what is done with "maps".
Would it be acceptable?

Avi Kivity

unread,

Oct 7, 2015, 6:30:10 AM10/7/15

to

On 10/07/2015 01:25 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 07:09:11PM +0300, Avi Kivity wrote:
>>
>> On 10/06/2015 06:15 PM, Michael S. Tsirkin wrote:
>>>> While it is possible that userspace malfunctions and accidentally programs
>>>> MSI incorrectly, the risk is dwarfed by the ability of userspace to program
>>>> DMA incorrectly.
>>> That seems to imply that for the upstream kernel this is not a valid usecase at all.
>>>
>> That is trivially incorrect, upstream pci_uio_generic is used with dpdk for
>> years.
> dpdk used to do polling for years. patch to use interrupts was posted in
> june 2015.

dpdk used interrupts long before that.

>
>> Are dpdk applications an invalid use case?
> The way dpdk is using UIO/sysfs is borderline at best, and can't be used
> to justify new interfaces. They have a more secure mode using VFIO.
> That one's more reasonable.
>

Maybe this was not stressed enough times, but not all configurations
have an iommu, or want to use one.

Michael S. Tsirkin

unread,

Oct 7, 2015, 6:30:11 AM10/7/15

to

On Tue, Oct 06, 2015 at 07:09:11PM +0300, Avi Kivity wrote:
>
>
> On 10/06/2015 06:15 PM, Michael S. Tsirkin wrote:
> >>While it is possible that userspace malfunctions and accidentally programs
> >>MSI incorrectly, the risk is dwarfed by the ability of userspace to program
> >>DMA incorrectly.
> >That seems to imply that for the upstream kernel this is not a valid usecase at all.
> >
>
> That is trivially incorrect, upstream pci_uio_generic is used with dpdk for
> years.

dpdk used to do polling for years. patch to use interrupts was posted in
june 2015.

> Are dpdk applications an invalid use case?

The way dpdk is using UIO/sysfs is borderline at best, and can't be used
to justify new interfaces. They have a more secure mode using VFIO.
That one's more reasonable.

--
MST

Michael S. Tsirkin

unread,

Oct 7, 2015, 6:50:07 AM10/7/15

to

On Wed, Oct 07, 2015 at 11:38:12AM +0300, Vlad Zolotarov wrote:
> So, as u may see everything that could be addressed have been addressed

Yea. Wrt upstream you've chosen a wrong approach, it creates issues that
can't be addressed.

--
MST

Vlad Zolotarov

unread,

Oct 7, 2015, 9:50:12 AM10/7/15

to

On 10/07/15 13:42, Michael S. Tsirkin wrote:
> On Wed, Oct 07, 2015 at 11:38:12AM +0300, Vlad Zolotarov wrote:
>> So, as u may see everything that could be addressed have been addressed
> Yea. Wrt upstream you've chosen a wrong approach, it creates issues that
> can't be addressed.

It has been explained and proven to u that it creates no issues - see my
previous email.

Vlad Zolotarov

unread,

Oct 7, 2015, 10:00:15 AM10/7/15

to

On 10/07/15 16:44, Vlad Zolotarov wrote:
>
>
> On 10/07/15 13:42, Michael S. Tsirkin wrote:
>> On Wed, Oct 07, 2015 at 11:38:12AM +0300, Vlad Zolotarov wrote:
>>> So, as u may see everything that could be addressed have been addressed
>> Yea. Wrt upstream you've chosen a wrong approach, it creates issues that
>> can't be addressed.
>
> It has been explained and proven to u that it creates no issues - see
> my previous email.

And please be specific. I suggest we define the "issues" this series
adds in a list so that we can address the issues in this list in order
not to explain the same thing more than once.

I've started such list two emails ago and have already addressed all the
issues in it. If there are more "issues" that are not in that list then,
pls., reply to that email and add them so that I'd be able to address
them too.

And if there isn't then could u, pls., clarify your point since to me it
looks like u are blocking this series without any reason at all.

thanks,
vlad

Alex Williamson

unread,

Oct 7, 2015, 12:40:09 PM10/7/15

to

It sounds like a separate vfio iommu backend from type1, one that just
pins the page and returns the bus address. The curse and benefit would
be that existing type1 users wouldn't "just work" in an insecure mode,
the DMA mapping code would need to be aware of the difference. Still, I
do really prefer to keep vfio as only exposing a secure, iommu protected
device to the user because surely someone will try and users would
expect that removing iommu restrictions from vfio means they can do
device assignment to VMs w/o an iommu.

Avi Kivity

unread,

Oct 7, 2015, 12:40:10 PM10/7/15

to

On 10/07/2015 07:31 PM, Alex Williamson wrote:
>>> I
>>> guess the no-iommu map would error if the IOVA isn't simply the bus
>>> address of the page mapped.
>>>
>>> Of course this is entirely unsafe and this no-iommu driver should taint
>>> the kernel, but it at least standardizes on one userspace API and you're
>>> already doing completely unsafe things with uio. vfio should be
>>> enlightened at least to the point that it allows only privileged users
>>> access to devices under such a (lack of) iommu.
>> There is an additional complication. With an iommu, userspace programs
>> the device with virtual addresses, but without it, they have to program
>> physical addresses. So vfio would need to communicate this bit of
>> information.
>>
>> We can go further and define a better translation API than the current
>> one (reading /proc/pagemap). But it's going to be a bigger change to
>> vfio than I thought at first.
> It sounds like a separate vfio iommu backend from type1, one that just
> pins the page and returns the bus address. The curse and benefit would
> be that existing type1 users wouldn't "just work" in an insecure mode,
> the DMA mapping code would need to be aware of the difference. Still, I
> do really prefer to keep vfio as only exposing a secure, iommu protected
> device to the user because surely someone will try and users would
> expect that removing iommu restrictions from vfio means they can do
> device assignment to VMs w/o an iommu.

That's what I thought as well, but apparently adding msix support to the
already insecure uio drivers is even worse.

Greg KH

unread,

Oct 7, 2015, 1:30:08 PM10/7/15

to

If you get everyone else here to agree that this is the interface you
all are going to be using, sure. All I care is that you not add ioctl
to the UIO interface.

[PATCH v1 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Greg KH

Greg KH

Vlad Zolotarov

Vlad Zolotarov

Avi Kivity

Stephen Hemminger

Vlad Zolotarov

Vlad Zolotarov

Greg KH

Greg KH

Greg KH

Vlad Zolotarov

Avi Kivity

Vlad Zolotarov

Vlad Zolotarov

Greg KH

Avi Kivity

Avi Kivity

Vlad Zolotarov

Vlad Zolotarov

Greg KH

Michael S. Tsirkin

Michael S. Tsirkin

Michael S. Tsirkin

Michael S. Tsirkin

Vlad Zolotarov

Vlad Zolotarov

Michael S. Tsirkin

Gleb Natapov

Michael S. Tsirkin

Michael S. Tsirkin

Vlad Zolotarov

Vlad Zolotarov

Michael S. Tsirkin

Michael S. Tsirkin

Michael S. Tsirkin

Avi Kivity

Michael S. Tsirkin

Michael S. Tsirkin

Vlad Zolotarov

Avi Kivity

Avi Kivity

Vlad Zolotarov

Gleb Natapov

Avi Kivity

Gleb Natapov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Vlad Zolotarov

Michael S. Tsirkin

Alex Williamson

Vlad Zolotarov

Vlad Zolotarov

Stephen Hemminger

Stephen Hemminger

Stephen Hemminger

Alex Williamson

Michael S. Tsirkin

Avi Kivity

Vlad Zolotarov

Vlad Zolotarov