[a pony for avi...]
The major new functionality in this version is the ability to deal with
PCI config space accesses (through read & write calls) - but includes table
driven code to determine whats safe to write and what is not. Also, some
virtualization of the config space to allow drivers to think they're writing
some registers when they're not. Also, IO space accesses are also allowed.
Drivers for devices which use MSI-X are now prevented from directly writing
the MSI-X vector area.
All interrupts are now handled using eventfds, which makes things very simple.
The name VFIO refers to the Virtual Function capabilities of SR-IOV devices
but the driver does support many more types of devices. I was none too sure
what driver directory this should live in, so for now I made up my own under
drivers/vfio. As a new driver/new directory, who makes the commit decision?
I currently have user level drivers working for 3 different network adapters
- the Cisco "Palo" enic, the Intel 82599 VF, and the Intel 82576 VF (but the
whole user level framework is a long ways from release). This driver could
also clearly replace a number of other drivers written just to give user
access to certain devices - but that will take time.
diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
--- linux-2.6.34/Documentation/vfio.txt 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/Documentation/vfio.txt 2010-05-28 14:03:05.000000000 -0700
@@ -0,0 +1,176 @@
+-------------------------------------------------------------------------------
+The VFIO "driver" is used to allow privileged AND non-privileged processes to
+implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
+devices.
+
+Why is this interesting? Some applications, especially in the high performance
+computing field, need access to hardware functions with as little overhead as
+possible. Examples are in network adapters (typically non tcp/ip based) and
+in compute accelerators - i.e., array processors, FPGA processors, etc.
+Previous to the VFIO drivers these apps would need either a kernel-level
+driver (with corrsponding overheads), or else root permissions to directly
+access the hardware. The VFIO driver allows generic access to the hardware
+from non-privileged apps IF the hardware is "well-behaved" enough for this
+to be safe.
+
+While there have long been ways to implement user-level drivers using specific
+corresponding drivers in the kernel, it was not until the introduction of the
+UIO driver framework, and the uio_pci_generic driver that one could have a
+generic kernel component supporting many types of user level drivers. However,
+even with the uio_pci_generic driver, processes implementing the user level
+drivers had to be trusted - they could do dangerous manipulation of DMA
+addreses and were required to be root to write PCI configuration space
+registers.
+
+Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
+new hardware capabilities which the VFIO solution exploits to allow non-root
+user level drivers. The main role of the IOMMU is to ensure that DMA accesses
+from devices go only to the appropriate memory locations, this allows VFIO to
+ensure that user level drivers do not corrupt inappropriate memory. PCI I/O
+virtualization (SR-IOV) was defined to allow "pass-through" of virtual devices
+to guest virtual machines. VFIO in essence implements pass-through of devices
+to user processes, not virtual machines. SR-IOV devices implement a
+traditional PCI device (the physical function) and a dynamic number of special
+PCI devices (virtual functions) whose feature set is somewhat restricted - in
+order to allow the operating system or virtual machine monitor to ensure the
+safe operation of the system.
+
+Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
+there are many other non-IOV PCI devices which also meet the defintion.
+Elements of this definition are:
+- The size of any memory BARs to be mmap'ed into the user process space must be
+ a multiple of the system page size.
+- If MSI-X interrupts are used, the device driver must not attempt to mmap or
+ write the MSI-X vector area.
+- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
+ revision 2.3 to allow its interrupts to be masked in a generic way.
+- The device must not use the PCI configuration space in any non-standard way,
+ i.e., the user level driver will be permitted only to read and write standard
+ fields of the PCI config space, and only if those fields cannot cause harm to
+ the system. In addition, some fields are "virtualized", so that the user
+ driver can read/write them like a kernel driver, but they do not affect the
+ real device.
+- For now, there is no support for user access to the PCIe and PCI-X extended
+ capabilities configuration space.
+
+Even with these restrictions, there are bound to be devices which are unsafe
+for user level use - it is still up to the system admin to decide whether to
+grant access to the device. When the vfio module is loaded, it will have
+access to no devices until the desired PCI devices are "bound" to the driver.
+First, make sure the devices are not bound to another kernel driver. You can
+unload that driver if you wish to unbind all its devices, or else enter the
+driver's sysfs directory, and unbind a specific device:
+ cd /sys/bus/pci/drivers/<drivername>
+ echo 0000:06:02.00 > unbind
+(The 0000:06:02.00 is a fully qualified PCI device name - different for each
+device). Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
+write the PCI device type of the target device to the new_id file:
+ echo 8086 10ca > new_id
+(8086 10ca are the vendor and device type for the Intel 82576 virtual function
+devices). A /dev/vfio<N> entry will be created for each device bound. The final
+step is to grant users permission by changing the mode and/or owner of the /dev
+entry - "chmod 666 /dev/vfio0".
+
+Reads & Writes:
+
+The user driver will typically use mmap to access the memory BAR(s) of a
+device; the I/O BARs and the PCI config space may be accessed through normal
+read and write system calls. Only 1 file descriptor is needed for all driver
+functions -- the desired BAR for I/O, memory, or config space is indicated via
+high-order bits of the file offset. For instance, the following implements a
+write to the PCI config space:
+
+ #include <linux/vfio.h>
+ void pci_write_config_word(int pci_fd, u16 off, u16 wd)
+ {
+ off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
+
+ if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
+ perror("pwrite config_dword");
+ }
+
+The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
+in vfio.h to convert bar numbers to file offsets and vice-versa.
+
+Interrupts:
+
+Device interrupts are translated by the vfio driver into input events on event
+notification file descriptors created by the eventfd system call. The user
+program must one or more event descriptors and pass them to the vfio driver
+via ioctls to arrange for the interrupt mapping:
+1.
+ efd = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
+ This provides an eventfd for traditional IRQ interrupts.
+ IRQs will be disable after each interrupt until the driver
+ re-enables them via the PCI COMMAND register.
+2.
+ efd = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
+ This connects MSI interrupts to an eventfd.
+3.
+ int arg[N+1];
+ arg[0] = N;
+ arg[1..N] = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
+ This connects N MSI-X interrupts with N eventfds.
+
+Waiting and checking for interrupts is done by the user program by reads,
+polls, or selects on the related event file descriptors.
+
+DMA:
+
+The VFIO driver uses ioctls to allow the user level driver to get DMA
+addresses which correspond to virtual addresses. In systems with IOMMUs,
+each PCI device will have its own address space for DMA operations, so when
+the user level driver programs the device registers, only addresses known to
+the IOMMU will be valid, any others will be rejected. The IOMMU creates the
+illusion (to the device) that multi-page buffers are physically contiguous,
+so a single DMA operation can safely span multiple user pages. Note that
+the VFIO driver is still useful in systems without IOMMUs, but only for
+trusted processes which can deal with DMAs which do not span pages (Huge
+pages count as a single page also).
+
+If the user process desires many DMA buffers, it may be wise to do a mapping
+of a single large buffer, and then allocate the smaller buffers from the
+large one.
+
+The DMA buffers are locked into physical memory for the duration of their
+existence - until VFIO_DMA_UNMAP is called, until the user pages are
+unmapped from the user process, or until the vfio file descriptor is closed.
+The user process must have permission to lock the pages given by the ulimit(-l)
+command, which in turn relies on settings in the /etc/security/limits.conf
+file.
+
+The vfio_dma_map structure is used as an argument to the ioctls which
+do the DMA mapping. Its vaddr, dmaaddr, and size fields must always be a
+multiple of a page. Its rdwr field is zero for read-only (outbound), and
+non-zero for read/write buffers.
+
+ struct vfio_dma_map {
+ __u64 vaddr; /* process virtual addr */
+ __u64 dmaaddr; /* desired and/or returned dma address */
+ __u64 size; /* size in bytes */
+ int rdwr; /* bool: 0 for r/o; 1 for r/w */
+ };
+
+The VFIO_DMA_MAP_ANYWHERE is called with a vfio_dma_map structure as its
+argument, and returns the structure with a valid dmaaddr field.
+
+The VFIO_DMA_MAP_IOVA is called with a vfio_dma_map structure with the
+dmaaddr field already assigned. The system will attempt to map the DMA
+buffer into the IO space at the givne dmaaddr. This is expected to be
+useful if KVM or other virtualization facilities use this driver.
+
+The VFIO_DMA_UNMAP takes a fully filled vfio_dma_map structure and unmaps
+the buffer and releases the corresponding system resources.
+
+The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
+(device dependent). It takes a single unsigned 64 bit integer as an argument.
+This call also has the side effect on enabled PCI bus mastership.
+
+Miscellaneous:
+
+The VFIO_BAR_LEN ioctl provides an easy way to determine the size of a PCI
+device's base address region. It is passed a single integer specifying which
+BAR (0-5 or 6 for ROM bar), and passes back the length in the same field.
diff -uprN linux-2.6.34/drivers/Kconfig vfio-linux-2.6.34/drivers/Kconfig
--- linux-2.6.34/drivers/Kconfig 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/drivers/Kconfig 2010-05-27 17:01:02.000000000 -0700
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
source "drivers/staging/Kconfig"
source "drivers/platform/Kconfig"
+
+source "drivers/vfio/Kconfig"
endmenu
diff -uprN linux-2.6.34/drivers/Makefile vfio-linux-2.6.34/drivers/Makefile
--- linux-2.6.34/drivers/Makefile 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/drivers/Makefile 2010-05-27 17:25:33.000000000 -0700
@@ -52,6 +52,7 @@ obj-$(CONFIG_FUSION) += message/
obj-$(CONFIG_FIREWIRE) += firewire/
obj-y += ieee1394/
obj-$(CONFIG_UIO) += uio/
+obj-$(CONFIG_VFIO) += vfio/
obj-y += cdrom/
obj-y += auxdisplay/
obj-$(CONFIG_PCCARD) += pcmcia/
diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
--- linux-2.6.34/drivers/vfio/Kconfig 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/Kconfig 2010-05-27 17:07:25.000000000 -0700
@@ -0,0 +1,9 @@
+menuconfig VFIO
+ tristate "Non-Priv User Space PCI drivers"
+ depends on PCI
+ help
+ Driver to allow advanced user space drivers for PCI, PCI-X,
+ and PCIe devices. Requires IOMMU to allow non-privilged
+ processes to directly control the PCI devices.
+
+ If you don't know what to do here, say N.
diff -uprN linux-2.6.34/drivers/vfio/Makefile vfio-linux-2.6.34/drivers/vfio/Makefile
--- linux-2.6.34/drivers/vfio/Makefile 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/Makefile 2010-05-27 17:32:35.000000000 -0700
@@ -0,0 +1,5 @@
+obj-$(CONFIG_VFIO) := vfio.o
+
+vfio-y := vfio_main.o vfio_dma.o vfio_intrs.o \
+ vfio_pci_config.o vfio_rdwr.o vfio_sysfs.o
+
diff -uprN linux-2.6.34/drivers/vfio/vfio_dma.c vfio-linux-2.6.34/drivers/vfio/vfio_dma.c
--- linux-2.6.34/drivers/vfio/vfio_dma.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_dma.c 2010-05-28 14:04:04.000000000 -0700
@@ -0,0 +1,372 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/iommu.h>
+#include <linux/sched.h>
+
+#include <linux/vfio.h>
+
+/* Unmap DMA region */
+static void vfio_dma_unmap(struct vfio_listener *listener,
+ struct dma_map_page *mlp)
+{
+ int i;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+
+ mutex_lock(&vdev->gate);
+ list_del(&mlp->list);
+ if (mlp->sg) {
+ dma_unmap_sg(&pdev->dev, mlp->sg, mlp->npage,
+ DMA_BIDIRECTIONAL);
+ kfree(mlp->sg);
+ } else {
+ for (i = 0; i < mlp->npage; i++)
+ (void) iommu_unmap_range(vdev->domain,
+ mlp->daddr + i*PAGE_SIZE, PAGE_SIZE);
+ }
+ for (i = 0; i < mlp->npage; i++) {
+ if (mlp->rdwr)
+ SetPageDirty(mlp->pages[i]);
+ put_page(mlp->pages[i]);
+ }
+ listener->mm->locked_vm -= mlp->npage;
+ vdev->locked_pages -= mlp->npage;
+ kfree(mlp->pages);
+ kfree(mlp);
+ vdev->mapcount--;
+ mutex_unlock(&vdev->gate);
+}
+
+/* Unmap ALL DMA regions */
+void vfio_dma_unmapall(struct vfio_listener *listener)
+{
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ vfio_dma_unmap(listener, mlp);
+ }
+}
+
+int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map *dmp)
+{
+ unsigned long start, npage;
+ struct dma_map_page *mlp;
+ struct list_head *pos, *pos2;
+ int ret;
+
+ start = dmp->vaddr & ~PAGE_SIZE;
+ npage = dmp->size >> PAGE_SHIFT;
+
+ ret = -ENXIO;
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
+ continue;
+ ret = 0;
+ vfio_dma_unmap(listener, mlp);
+ break;
+ }
+ return ret;
+}
+
+/* Handle MMU notifications - user process freed or realloced memory
+ * which may be in use in a DMA region. Clean up region if so.
+ */
+static void vfio_dma_handle_mmu_notify(struct mmu_notifier *mn,
+ unsigned long start, unsigned long end)
+{
+ struct vfio_listener *listener;
+ unsigned long myend;
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ listener = container_of(mn, struct vfio_listener, mmu_notifier);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (mlp->vaddr >= end)
+ continue;
+ /*
+ * Ranges overlap if they're not disjoint; and they're
+ * disjoint if the end of one is before the start of
+ * the other one.
+ */
+ myend = mlp->vaddr + (mlp->npage << PAGE_SHIFT) - 1;
+ if (!(myend <= start || end <= mlp->vaddr)) {
+ printk(KERN_WARNING
+ "%s: demap start %lx end %lx va %lx pa %lx\n",
+ __func__, start, end,
+ mlp->vaddr, (long)mlp->daddr);
+ vfio_dma_unmap(listener, mlp);
+ }
+ }
+}
+
+static void vfio_dma_inval_page(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long addr)
+{
+ vfio_dma_handle_mmu_notify(mn, addr, addr + PAGE_SIZE);
+}
+
+static void vfio_dma_inval_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long start, unsigned long end)
+{
+ vfio_dma_handle_mmu_notify(mn, start, end);
+}
+
+static const struct mmu_notifier_ops vfio_dma_mmu_notifier_ops = {
+ .invalidate_page = vfio_dma_inval_page,
+ .invalidate_range_start = vfio_dma_inval_range_start,
+};
+
+/*
+ * Map usr buffer at specific IO virtual address
+ */
+static int vfio_dma_map_iova(
+ struct vfio_listener *listener,
+ unsigned long start_iova,
+ struct page **pages,
+ int npage,
+ int rdwr,
+ struct dma_map_page **mlpp)
+{
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ int ret;
+ int i;
+ phys_addr_t hpa;
+ struct dma_map_page *mlp;
+ unsigned long iova = start_iova;
+
+ if (vdev->domain == NULL) {
+ /* can't mix iova with anywhere */
+ if (vdev->mapcount > 0)
+ return -EINVAL;
+ if (!iommu_found())
+ return -EINVAL;
+ vdev->domain = iommu_domain_alloc();
+ if (vdev->domain == NULL)
+ return -ENXIO;
+ vdev->cachec = iommu_domain_has_cap(vdev->domain,
+ IOMMU_CAP_CACHE_COHERENCY);
+ ret = iommu_attach_device(vdev->domain, &pdev->dev);
+ if (ret) {
+ iommu_domain_free(vdev->domain);
+ vdev->domain = NULL;
+ printk(KERN_ERR "%s: device_attach failed %d\n",
+ __func__, ret);
+ return ret;
+ }
+ }
+ for (i = 0; i < npage; i++) {
+ if (iommu_iova_to_phys(vdev->domain, iova + i*PAGE_SIZE))
+ return -EBUSY;
+ }
+ rdwr = rdwr ? IOMMU_READ|IOMMU_WRITE : IOMMU_READ;
+ if (vdev->cachec)
+ rdwr |= IOMMU_CACHE;
+ for (i = 0; i < npage; i++) {
+ hpa = page_to_phys(pages[i]);
+ ret = iommu_map_range(vdev->domain, iova,
+ hpa, PAGE_SIZE, rdwr);
+ if (ret) {
+ while (--i > 0) {
+ iova -= PAGE_SIZE;
+ (void) iommu_unmap_range(vdev->domain,
+ iova, PAGE_SIZE);
+ }
+ return ret;
+ }
+ iova += PAGE_SIZE;
+ }
+
+ mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
+ mlp->pages = pages;
+ mlp->daddr = start_iova;
+ mlp->npage = npage;
+ *mlpp = mlp;
+ return 0;
+}
+
+/*
+ * Map user buffer - return IO virtual address
+ */
+static int vfio_dma_map_anywhere(
+ struct vfio_listener *listener,
+ struct page **pages,
+ int npage,
+ int rdwr,
+ struct dma_map_page **mlpp)
+{
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ struct scatterlist *sg, *nsg;
+ int i, nents;
+ struct dma_map_page *mlp;
+ unsigned long length;
+
+ if (vdev->domain) {
+ /* map anywhere and map iova don't mix */
+ if (vdev->mapcount > 0)
+ return -EINVAL;
+ iommu_domain_free(vdev->domain);
+ vdev->domain = NULL;
+ }
+ sg = kzalloc(npage * sizeof(struct scatterlist), GFP_KERNEL);
+ if (sg == NULL)
+ return -ENOMEM;
+ for (i = 0; i < npage; i++)
+ sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);
+ nents = dma_map_sg(&pdev->dev, sg, npage,
+ rdwr ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE);
+ /* The API for dma_map_sg suggests that it may squash together
+ * adjacent pages, but noone seems to really do that. So we squash
+ * it ourselves, because the user level wants a single buffer.
+ * This works if (a) there is an iommu, or (b) the user allocates
+ * large buffers from a huge page
+ */
+ nsg = sg;
+ for (i = 1; i < nents; i++) {
+ length = sg[i].dma_length;
+ sg[i].dma_length = 0;
+ if (sg[i].dma_address == (nsg->dma_address + nsg->dma_length)) {
+ nsg->dma_length += length;
+ } else {
+ nsg++;
+ nsg->dma_address = sg[i].dma_address;
+ nsg->dma_length = length;
+ }
+ }
+ nents = 1 + (nsg - sg);
+ if (nents != 1) {
+ if (nents > 0)
+ dma_unmap_sg(&pdev->dev, sg, npage,
+ DMA_BIDIRECTIONAL);
+ for (i = 0; i < npage; i++)
+ put_page(pages[i]);
+ kfree(sg);
+ printk(KERN_ERR "%s: sequential dma mapping failed\n",
+ __func__);
+ return -EFAULT;
+ }
+
+ mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
+ mlp->pages = pages;
+ mlp->sg = sg;
+ mlp->daddr = sg_dma_address(sg);
+ mlp->npage = npage;
+ *mlpp = mlp;
+ return 0;
+}
+
+int vfio_dma_map_common(struct vfio_listener *listener,
+ unsigned int cmd, struct vfio_dma_map *dmp)
+{
+ int locked, lock_limit;
+ struct page **pages;
+ int npage;
+ struct dma_map_page *mlp = NULL;
+ int ret = 0;
+
+ if (dmp->vaddr & (PAGE_SIZE-1))
+ return -EINVAL;
+ if (dmp->size & (PAGE_SIZE-1))
+ return -EINVAL;
+ if (dmp->size <= 0)
+ return -EINVAL;
+ npage = dmp->size >> PAGE_SHIFT;
+
+ mutex_lock(&listener->vdev->gate);
+
+ /* account for locked pages */
+ locked = npage + current->mm->locked_vm;
+ lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
+ >> PAGE_SHIFT;
+ if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ printk(KERN_WARNING "%s: RLIMIT_MEMLOCK exceeded\n",
+ __func__);
+ ret = -ENOMEM;
+ goto out_lock;
+ }
+ /* only 1 address space per fd */
+ if (current->mm != listener->mm) {
+ if (listener->mm != NULL)
+ return -EINVAL;
+ listener->mm = current->mm;
+ listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
+ ret = mmu_notifier_register(&listener->mmu_notifier,
+ listener->mm);
+ if (ret)
+ printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
+ __func__, ret);
+ ret = 0;
+ }
+
+ pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
+ if (pages == NULL) {
+ ret = ENOMEM;
+ goto out_lock;
+ }
+ ret = get_user_pages_fast(dmp->vaddr, npage, dmp->rdwr, pages);
+ if (ret != npage) {
+ printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
+ __func__, ret, npage);
+ kfree(pages);
+ ret = -EFAULT;
+ goto out_lock;
+ }
+
+ if (cmd == VFIO_DMA_MAP_IOVA)
+ ret = vfio_dma_map_iova(listener, dmp->dmaaddr,
+ pages, npage, dmp->rdwr, &mlp);
+ else
+ ret = vfio_dma_map_anywhere(listener, pages,
+ npage, dmp->rdwr, &mlp);
+ if (ret) {
+ kfree(pages);
+ goto out_lock;
+ }
+ listener->vdev->mapcount++;
+ mlp->vaddr = dmp->vaddr;
+ mlp->rdwr = dmp->rdwr;
+ dmp->dmaaddr = mlp->daddr;
+ list_add(&mlp->list, &listener->dm_list);
+
+ current->mm->locked_vm += npage;
+ listener->vdev->locked_pages += npage;
+out_lock:
+ mutex_unlock(&listener->vdev->gate);
+ return ret;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_intrs.c vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c
--- linux-2.6.34/drivers/vfio/vfio_intrs.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c 2010-05-28 14:09:15.000000000 -0700
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+
+/*
+ * vfio_interrupt - IRQ hardware interrupt handler
+ */
+irqreturn_t vfio_interrupt(int irq, void *dev_id)
+{
+ struct vfio_dev *vdev = (struct vfio_dev *)dev_id;
+ struct pci_dev *pdev = vdev->pdev;
+ irqreturn_t ret = IRQ_NONE;
+ u32 cmd_status_dword;
+ u16 origcmd, newcmd, status;
+
+ spin_lock_irq(&vdev->lock);
+ pci_block_user_cfg_access(pdev);
+
+ /* Read both command and status registers in a single 32-bit operation.
+ * Note: we could cache the value for command and move the status read
+ * out of the lock if there was a way to get notified of user changes
+ * to command register through sysfs. Should be good for shared irqs. */
+ pci_read_config_dword(pdev, PCI_COMMAND, &cmd_status_dword);
+ origcmd = cmd_status_dword;
+ status = cmd_status_dword >> 16;
+
+ /* Check interrupt status register to see whether our device
+ * triggered the interrupt. */
+ if (!(status & PCI_STATUS_INTERRUPT))
+ goto done;
+
+ /* We triggered the interrupt, disable it. */
+ newcmd = origcmd | PCI_COMMAND_INTX_DISABLE;
+ if (newcmd != origcmd)
+ pci_write_config_word(pdev, PCI_COMMAND, newcmd);
+
+ ret = IRQ_HANDLED;
+done:
+ pci_unblock_user_cfg_access(pdev);
+ spin_unlock_irq(&vdev->lock);
+ if (ret != IRQ_HANDLED)
+ return ret;
+ if (vdev->ev_irq)
+ eventfd_signal(vdev->ev_irq, 1);
+ return ret;
+}
+
+/*
+ * MSI and MSI-X Interrupt handler.
+ * Just signal an event
+ */
+static irqreturn_t msihandler(int irq, void *arg)
+{
+ struct eventfd_ctx *ctx = arg;
+
+ eventfd_signal(ctx, 1);
+ return IRQ_HANDLED;
+}
+
+void vfio_disable_msi(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+
+ if (vdev->ev_msi) {
+ eventfd_ctx_put(vdev->ev_msi);
+ free_irq(pdev->irq, vdev->ev_msi);
+ vdev->ev_msi = NULL;
+ }
+ pci_disable_msi(pdev);
+}
+
+int vfio_enable_msi(struct vfio_dev *vdev, int fd)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct eventfd_ctx *ctx;
+ int ret;
+
+ ctx = eventfd_ctx_fdget(fd);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+ vdev->ev_msi = ctx;
+ pci_enable_msi(pdev);
+ ret = request_irq(pdev->irq, msihandler, 0,
+ vdev->name, ctx);
+ if (ret) {
+ eventfd_ctx_put(ctx);
+ pci_disable_msi(pdev);
+ vdev->ev_msi = NULL;
+ }
+ return ret;
+}
+
+void vfio_disable_msix(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int i;
+
+ if (vdev->ev_msix && vdev->msix) {
+ for (i = 0; i < vdev->nvec; i++) {
+ free_irq(vdev->msix[i].vector, vdev->ev_msix[i]);
+ if (vdev->ev_msix[i])
+ eventfd_ctx_put(vdev->ev_msix[i]);
+ }
+ }
+ kfree(vdev->ev_msix);
+ vdev->ev_msix = NULL;
+ kfree(vdev->msix);
+ vdev->msix = NULL;
+ vdev->nvec = 0;
+ pci_disable_msix(pdev);
+}
+
+int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct eventfd_ctx *ctx;
+ int ret = 0;
+ int i;
+ int fd;
+
+ vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
+ GFP_KERNEL);
+ vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
+ GFP_KERNEL);
+ if (vdev->msix == NULL || vdev->ev_msix == NULL)
+ ret = -ENOMEM;
+ else {
+ for (i = 0; i < nvec; i++) {
+ if (copy_from_user(&fd, uarg, sizeof fd)) {
+ ret = -EFAULT;
+ break;
+ }
+ uarg += sizeof fd;
+ ctx = eventfd_ctx_fdget(fd);
+ if (IS_ERR(ctx)) {
+ ret = PTR_ERR(ctx);
+ break;
+ }
+ vdev->msix[i].entry = i;
+ vdev->ev_msix[i] = ctx;
+ }
+ }
+ if (!ret)
+ ret = pci_enable_msix(pdev, vdev->msix, nvec);
+ vdev->nvec = 0;
+ for (i = 0; i < nvec && !ret; i++) {
+ ret = request_irq(vdev->msix[i].vector, msihandler, 0,
+ vdev->name, vdev->ev_msix[i]);
+ if (ret)
+ break;
+ vdev->nvec = i+1;
+ }
+ if (ret)
+ vfio_disable_msix(vdev);
+ return ret;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_main.c vfio-linux-2.6.34/drivers/vfio/vfio_main.c
--- linux-2.6.34/drivers/vfio/vfio_main.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_main.c 2010-05-28 14:13:38.000000000 -0700
@@ -0,0 +1,627 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mm.h>
+#include <linux/idr.h>
+#include <linux/string.h>
+#include <linux/interrupt.h>
+#include <linux/fs.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+
+#include <linux/vfio.h>
+
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "Tom Lyon <pu...@cisco.com>"
+#define DRIVER_DESC "VFIO - User Level PCI meta-driver"
+
+static int vfio_major = -1;
+DEFINE_IDR(vfio_idr);
+/* Protect idr accesses */
+DEFINE_MUTEX(vfio_minor_lock);
+
+/*
+ * Does [a1,b1) overlap [a2,b2) ?
+ */
+static inline int overlap(int a1, int b1, int a2, int b2)
+{
+ /*
+ * Ranges overlap if they're not disjoint; and they're
+ * disjoint if the end of one is before the start of
+ * the other one.
+ */
+ return !(b2 <= a1 || b1 <= a2);
+}
+
+static int vfio_open(struct inode *inode, struct file *filep)
+{
+ struct vfio_dev *vdev;
+ struct vfio_listener *listener;
+ int ret = 0;
+
+ mutex_lock(&vfio_minor_lock);
+ vdev = idr_find(&vfio_idr, iminor(inode));
+ mutex_unlock(&vfio_minor_lock);
+ if (!vdev) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ listener = kzalloc(sizeof(*listener), GFP_KERNEL);
+ if (!listener) {
+ ret = -ENOMEM;
+ goto err_alloc_listener;
+ }
+
+ listener->vdev = vdev;
+ INIT_LIST_HEAD(&listener->dm_list);
+ filep->private_data = listener;
+
+ mutex_lock(&vdev->gate);
+ if (vdev->listeners == 0) { /* first open */
+ if (vdev->pmaster && !iommu_found() &&
+ !capable(CAP_SYS_RAWIO)) {
+ mutex_unlock(&vdev->gate);
+ ret = -EPERM;
+ goto err_perm;
+ }
+ /* reset to known state if we can */
+ (void) pci_reset_function(vdev->pdev);
+ }
+ vdev->listeners++;
+ mutex_unlock(&vdev->gate);
+ return 0;
+
+err_perm:
+ kfree(listener);
+
+err_alloc_listener:
+out:
+ return ret;
+}
+
+static int vfio_release(struct inode *inode, struct file *filep)
+{
+ int ret = 0;
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+
+ vfio_dma_unmapall(listener);
+ if (listener->mm) {
+ mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
+ listener->mm = NULL;
+ }
+
+ mutex_lock(&vdev->gate);
+ if (--vdev->listeners <= 0) {
+ if (vdev->ev_msix)
+ vfio_disable_msix(vdev);
+ if (vdev->ev_msi)
+ vfio_disable_msi(vdev);
+ if (vdev->ev_irq) {
+ eventfd_ctx_put(vdev->ev_msi);
+ vdev->ev_irq = NULL;
+ }
+ if (vdev->domain) {
+ iommu_domain_free(vdev->domain);
+ vdev->domain = NULL;
+ }
+ /* reset to known state if we can */
+ (void) pci_reset_function(vdev->pdev);
+ }
+ mutex_unlock(&vdev->gate);
+
+ kfree(listener);
+ return ret;
+}
+
+static ssize_t vfio_read(struct file *filep, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ int pci_space;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+ return vfio_config_readwrite(0, vdev, buf, count, ppos);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+ return vfio_io_readwrite(0, vdev, buf, count, ppos);
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM)
+ return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+ if (pci_space == PCI_ROM_RESOURCE)
+ return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+ return -EINVAL;
+}
+
+static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u16 pos;
+ u32 table_offset;
+ u16 table_size;
+ u8 bir;
+ u32 lo, hi, startp, endp;
+
+ pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+ if (!pos)
+ return 0;
+
+ pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
+ table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
+ pci_read_config_dword(pdev, pos + 4, &table_offset);
+ bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
+ lo = table_offset >> PAGE_SHIFT;
+ hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
+ >> PAGE_SHIFT;
+ startp = start >> PAGE_SHIFT;
+ endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (bir == vfio_offset_to_pci_space(start) &&
+ overlap(lo, hi, startp, endp)) {
+ printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
+ __func__);
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static ssize_t vfio_write(struct file *filep, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ int pci_space;
+ int ret;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+ return vfio_config_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+ return vfio_io_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
+ /* don't allow writes to msi-x vectors */
+ ret = vfio_msix_check(vdev, *ppos, count);
+ if (ret)
+ return ret;
+ return vfio_mem_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ }
+ return -EINVAL;
+}
+
+static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ unsigned long requested, actual;
+ int pci_space;
+ u64 start;
+ u32 len;
+ unsigned long phys;
+ int ret;
+
+ if (vma->vm_end < vma->vm_start)
+ return -EINVAL;
+ if ((vma->vm_flags & VM_SHARED) == 0)
+ return -EINVAL;
+
+
+ pci_space = vfio_offset_to_pci_space((u64)vma->vm_pgoff << PAGE_SHIFT);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ switch (pci_space) {
+ case PCI_ROM_RESOURCE:
+ if (vma->vm_flags & VM_WRITE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, PCI_ROM_RESOURCE) == 0)
+ return -EINVAL;
+ actual = pci_resource_len(pdev, PCI_ROM_RESOURCE) >> PAGE_SHIFT;
+ break;
+ default:
+ if ((pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) == 0)
+ return -EINVAL;
+ actual = pci_resource_len(pdev, pci_space) >> PAGE_SHIFT;
+ break;
+ }
+
+ requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ if (requested > actual || actual == 0)
+ return -EINVAL;
+
+ /*
+ * Can't allow non-priv users to mmap MSI-X vectors
+ * else they can write anywhere in phys memory
+ */
+ start = vma->vm_pgoff << PAGE_SHIFT;
+ len = vma->vm_end - vma->vm_start;
+ if (vma->vm_flags & VM_WRITE) {
+ ret = vfio_msix_check(vdev, start, len);
+ if (ret)
+ return ret;
+ }
+
+ vma->vm_private_data = vdev;
+ vma->vm_flags |= VM_IO | VM_RESERVED;
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
+
+ return remap_pfn_range(vma, vma->vm_start, phys,
+ vma->vm_end - vma->vm_start,
+ vma->vm_page_prot);
+}
+
+static long vfio_unl_ioctl(struct file *filep,
+ unsigned int cmd,
+ unsigned long arg)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ void __user *uarg = (void __user *)arg;
+ struct pci_dev *pdev = vdev->pdev;
+ struct vfio_dma_map dm;
+ int ret = 0;
+ u64 mask;
+ int fd, nfd;
+ int bar;
+
+ if (vdev == NULL)
+ return -EINVAL;
+
+ switch (cmd) {
+
+ case VFIO_DMA_MAP_ANYWHERE:
+ case VFIO_DMA_MAP_IOVA:
+ if (copy_from_user(&dm, uarg, sizeof dm))
+ return -EFAULT;
+ ret = vfio_dma_map_common(listener, cmd, &dm);
+ if (!ret && copy_to_user(uarg, &dm, sizeof dm))
+ ret = -EFAULT;
+ break;
+
+ case VFIO_DMA_UNMAP:
+ if (copy_from_user(&dm, uarg, sizeof dm))
+ return -EFAULT;
+ ret = vfio_dma_unmap_dm(listener, &dm);
+ break;
+
+ case VFIO_DMA_MASK: /* set master mode and DMA mask */
+ if (copy_from_user(&mask, uarg, sizeof mask))
+ return -EFAULT;
+ pci_set_master(pdev);
+ ret = pci_set_dma_mask(pdev, mask);
+ break;
+
+ case VFIO_EVENTFD_IRQ:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ if (vdev->ev_irq)
+ eventfd_ctx_put(vdev->ev_irq);
+ if (fd >= 0) {
+ vdev->ev_irq = eventfd_ctx_fdget(fd);
+ if (vdev->ev_irq == NULL)
+ ret = -EINVAL;
+ }
+ break;
+
+ case VFIO_EVENTFD_MSI:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+ ret = vfio_enable_msi(vdev, fd);
+ else if (fd < 0 && vdev->ev_msi)
+ vfio_disable_msi(vdev);
+ else
+ ret = -EINVAL;
+ break;
+
+ case VFIO_EVENTFDS_MSIX:
+ if (copy_from_user(&nfd, uarg, sizeof nfd))
+ return -EFAULT;
+ uarg += sizeof nfd;
+ if (nfd > 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+ ret = vfio_enable_msix(vdev, nfd, uarg);
+ else if (nfd == 0 && vdev->ev_msix)
+ vfio_disable_msix(vdev);
+ else
+ ret = -EINVAL;
+ break;
+
+ case VFIO_BAR_LEN:
+ if (copy_from_user(&bar, uarg, sizeof bar))
+ return -EFAULT;
+ if (bar < 0 || bar > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ bar = pci_resource_len(pdev, bar);
+ if (copy_to_user(uarg, &bar, sizeof bar))
+ return -EFAULT;
+ break;
+
+ default:
+ return -EINVAL;
+ }
+ return ret;
+}
+
+static const struct file_operations vfio_fops = {
+ .owner = THIS_MODULE,
+ .open = vfio_open,
+ .release = vfio_release,
+ .read = vfio_read,
+ .write = vfio_write,
+ .unlocked_ioctl = vfio_unl_ioctl,
+ .mmap = vfio_mmap,
+};
+
+static int vfio_get_devnum(struct vfio_dev *vdev)
+{
+ int retval = -ENOMEM;
+ int id;
+
+ mutex_lock(&vfio_minor_lock);
+ if (idr_pre_get(&vfio_idr, GFP_KERNEL) == 0)
+ goto exit;
+
+ retval = idr_get_new(&vfio_idr, vdev, &id);
+ if (retval < 0) {
+ if (retval == -EAGAIN)
+ retval = -ENOMEM;
+ goto exit;
+ }
+ if (id > MINORMASK) {
+ idr_remove(&vfio_idr, id);
+ retval = -ENOMEM;
+ }
+ if (vfio_major < 0) {
+ retval = register_chrdev(0, "vfio", &vfio_fops);
+ if (retval < 0)
+ goto exit;
+ vfio_major = retval;
+ }
+
+ retval = MKDEV(vfio_major, id);
+exit:
+ mutex_unlock(&vfio_minor_lock);
+ return retval;
+}
+
+static void vfio_free_minor(struct vfio_dev *vdev)
+{
+ mutex_lock(&vfio_minor_lock);
+ idr_remove(&vfio_idr, MINOR(vdev->devnum));
+ mutex_unlock(&vfio_minor_lock);
+}
+
+/*
+ * Verify that the device supports Interrupt Disable bit in command register,
+ * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
+ * in PCI 2.2. (from uio_pci_generic)
+ */
+static int verify_pci_2_3(struct pci_dev *pdev)
+{
+ u16 orig, new;
+ int err = 0;
+ u8 line;
+
+ pci_block_user_cfg_access(pdev);
+
+ pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &line);
+ if (line == 0)
+ goto out;
+
+ pci_read_config_word(pdev, PCI_COMMAND, &orig);
+ pci_write_config_word(pdev, PCI_COMMAND,
+ orig ^ PCI_COMMAND_INTX_DISABLE);
+ pci_read_config_word(pdev, PCI_COMMAND, &new);
+ /* There's no way to protect against
+ * hardware bugs or detect them reliably, but as long as we know
+ * what the value should be, let's go ahead and check it. */
+ if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
+ err = -EBUSY;
+ dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
+ "driver or HW bug?\n", orig, new);
+ goto out;
+ }
+ if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
+ dev_warn(&pdev->dev, "Device does not support "
+ "disabling interrupts: unable to bind.\n");
+ err = -ENODEV;
+ goto out;
+ }
+ /* Now restore the original value. */
+ pci_write_config_word(pdev, PCI_COMMAND, orig);
+out:
+ pci_unblock_user_cfg_access(pdev);
+ return err;
+}
+
+static int pci_is_master(struct pci_dev *pdev)
+{
+ int ret;
+ u16 orig, new;
+
+ if (pci_find_capability(pdev, PCI_CAP_ID_MSI))
+ return 1;
+ if (pci_find_capability(pdev, PCI_CAP_ID_MSIX))
+ return 1;
+
+ pci_block_user_cfg_access(pdev);
+
+ pci_read_config_word(pdev, PCI_COMMAND, &orig);
+ ret = orig & PCI_COMMAND_MASTER;
+ if (!ret) {
+ new = orig | PCI_COMMAND_MASTER;
+ pci_write_config_word(pdev, PCI_COMMAND, new);
+ pci_read_config_word(pdev, PCI_COMMAND, &new);
+ ret = new & PCI_COMMAND_MASTER;
+ pci_write_config_word(pdev, PCI_COMMAND, orig);
+ }
+
+ pci_unblock_user_cfg_access(pdev);
+ return ret;
+}
+
+static int vfio_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ struct vfio_dev *vdev;
+ int err;
+
+ err = pci_enable_device(pdev);
+ if (err) {
+ dev_err(&pdev->dev, "%s: pci_enable_device failed: %d\n",
+ __func__, err);
+ return err;
+ }
+
+ err = verify_pci_2_3(pdev);
+ if (err)
+ goto err_verify;
+
+ vdev = kzalloc(sizeof(struct vfio_dev), GFP_KERNEL);
+ if (!vdev) {
+ err = -ENOMEM;
+ goto err_alloc;
+ }
+ vdev->pdev = pdev;
+ vdev->pmaster = pci_is_master(pdev);
+
+ err = vfio_class_init();
+ if (err)
+ goto err_class;
+
+ mutex_init(&vdev->gate);
+
+ err = vfio_get_devnum(vdev);
+ if (err < 0)
+ goto err_get_devnum;
+ vdev->devnum = err;
+ err = 0;
+
+ sprintf(vdev->name, "vfio%d", MINOR(vdev->devnum));
+ pci_set_drvdata(pdev, vdev);
+ vdev->dev = device_create(vfio_class->class, &pdev->dev,
+ vdev->devnum, vdev, vdev->name);
+ if (IS_ERR(vdev->dev)) {
+ printk(KERN_ERR "VFIO: device register failed\n");
+ err = PTR_ERR(vdev->dev);
+ goto err_device_create;
+ }
+
+ err = vfio_dev_add_attributes(vdev);
+ if (err)
+ goto err_vfio_dev_add_attributes;
+
+
+ if (pdev->irq > 0) {
+ err = request_irq(pdev->irq, vfio_interrupt,
+ IRQF_SHARED, "vfio", vdev);
+ if (err)
+ goto err_request_irq;
+ }
+ vdev->vinfo.bardirty = 1;
+
+ return 0;
+
+err_request_irq:
+#ifdef notdef
+ vfio_dev_del_attributes(vdev);
+#endif
+err_vfio_dev_add_attributes:
+ device_destroy(vfio_class->class, vdev->devnum);
+err_device_create:
+ vfio_free_minor(vdev);
+err_get_devnum:
+err_class:
+ kfree(vdev);
+err_alloc:
+err_verify:
+ pci_disable_device(pdev);
+ return err;
+}
+
+static void vfio_remove(struct pci_dev *pdev)
+{
+ struct vfio_dev *vdev = pci_get_drvdata(pdev);
+
+ vfio_free_minor(vdev);
+
+ if (pdev->irq > 0)
+ free_irq(pdev->irq, vdev);
+
+#ifdef notdef
+ vfio_dev_del_attributes(vdev);
+#endif
+
+ pci_set_drvdata(pdev, NULL);
+ device_destroy(vfio_class->class, vdev->devnum);
+ kfree(vdev);
+ vfio_class_destroy();
+ pci_disable_device(pdev);
+}
+
+static struct pci_driver driver = {
+ .name = "vfio",
+ .id_table = NULL, /* only dynamic id's */
+ .probe = vfio_probe,
+ .remove = vfio_remove,
+};
+
+static int __init init(void)
+{
+ pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+ return pci_register_driver(&driver);
+}
+
+static void __exit cleanup(void)
+{
+ if (vfio_major >= 0)
+ unregister_chrdev(vfio_major, "vfio");
+ pci_unregister_driver(&driver);
+}
+
+module_init(init);
+module_exit(cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff -uprN linux-2.6.34/drivers/vfio/vfio_pci_config.c vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c
--- linux-2.6.34/drivers/vfio/vfio_pci_config.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c 2010-05-28 14:26:47.000000000 -0700
@@ -0,0 +1,554 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#define PCI_CAP_ID_BASIC 0
+#ifndef PCI_CAP_ID_MAX
+#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
+#endif
+
+/*
+ * Lengths of PCI Config Capabilities
+ * 0 means unknown (but at least 4)
+ * FF means special/variable
+ */
+static u8 pci_capability_length[] = {
+ [PCI_CAP_ID_BASIC] = 64, /* pci config header */
+ [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
+ [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
+ [PCI_CAP_ID_VPD] = 8,
+ [PCI_CAP_ID_SLOTID] = 4,
+ [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, or 24 */
+ [PCI_CAP_ID_CHSWP] = 4,
+ [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
+ [PCI_CAP_ID_HT] = 28,
+ [PCI_CAP_ID_VNDR] = 0xFF,
+ [PCI_CAP_ID_DBG] = 0,
+ [PCI_CAP_ID_CCRC] = 0,
+ [PCI_CAP_ID_SHPC] = 0,
+ [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
+ [PCI_CAP_ID_AGP3] = 0,
+ [PCI_CAP_ID_EXP] = 36,
+ [PCI_CAP_ID_MSIX] = 12,
+ [PCI_CAP_ID_AF] = 6,
+};
+
+/*
+ * Read/Write Permission Bits - one bit for each bit in capability
+ * Any field can be read if it exists,
+ * but what is read depends on whether the field
+ * is 'virtualized', or just pass thru to the hardware.
+ * Any virtualized field is also virtualized for writes.
+ * Writes are only permitted if they have a 1 bit here.
+ */
+struct perm_bits {
+ u32 rvirt; /* read bits which must be virtualized */
+ u32 write; /* writeable bits - virt if read virt */
+};
+
+static struct perm_bits pci_cap_basic_perm[] = {
+ { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
+ { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
+ { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
+ { 0, 0, }, /* 0x24 cardbus - not yet */
+ { 0, 0, }, /* 0x28 subsys vendor & dev */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
+ { 0, 0, }, /* 0x30 capability ptr & resv */
+ { 0, 0, }, /* 0x34 resv */
+ { 0, 0, }, /* 0x38 resv */
+ { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
+};
+
+static struct perm_bits pci_cap_pm_perm[] = {
+ { 0, 0, }, /* 0x00 PM capabilities */
+ { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
+};
+
+static struct perm_bits pci_cap_vpd_perm[] = {
+ { 0, 0xFFFF0000, }, /* 0x00 address */
+ { 0, 0xFFFFFFFF, }, /* 0x04 data */
+};
+
+static struct perm_bits pci_cap_slotid_perm[] = {
+ { 0, 0, }, /* 0x00 all read only */
+};
+
+static struct perm_bits pci_cap_msi_perm[] = {
+ { 0, 0, }, /* 0x00 MSI message control */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message addr/data */
+ { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
+ { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
+ { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
+};
+
+static struct perm_bits pci_cap_pcix_perm[] = {
+ { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
+ { 0, 0, }, /* 0x04 PCI_X_STATUS */
+ { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
+ { 0, 0, }, /* 0x0c ECC first addr */
+ { 0, 0, }, /* 0x10 ECC second addr */
+ { 0, 0, }, /* 0x14 ECC attr */
+};
+
+/* pci express capabilities */
+static struct perm_bits pci_cap_exp_perm[] = {
+ { 0, 0, }, /* 0x00 PCIe capabilities */
+ { 0, 0, }, /* 0x04 PCIe device capabilities */
+ { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
+ { 0, 0, }, /* 0x0c PCIe link capabilities */
+ { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
+ { 0, 0, }, /* 0x14 PCIe slot capabilities */
+ { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
+ { 0, 0, }, /* 0x1c PCIe root port stuff */
+ { 0, 0, }, /* 0x20 PCIe root port stuff */
+};
+
+static struct perm_bits pci_cap_msix_perm[] = {
+ { 0, 0, }, /* 0x00 MSI-X Enable */
+ { 0, 0, }, /* 0x04 table offset & bir */
+ { 0, 0, }, /* 0x08 pba offset & bir */
+};
+
+static struct perm_bits pci_cap_af_perm[] = {
+ { 0, 0, }, /* 0x00 af capability */
+ { 0, 0x0001, }, /* 0x04 af flr bit */
+};
+
+static struct perm_bits *pci_cap_perms[] = {
+ [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
+ [PCI_CAP_ID_PM] = pci_cap_pm_perm,
+ [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
+ [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
+ [PCI_CAP_ID_MSI] = pci_cap_msi_perm,
+ [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
+ [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
+ [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
+ [PCI_CAP_ID_AF] = pci_cap_af_perm,
+};
+
+/*
+ * We build a map of the config space that tells us where
+ * and what capabilities exist, so that we can map reads and
+ * writes back to capabilities, and thus figure out what to
+ * allow, deny, or virtualize
+ */
+int vfio_build_config_map(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map;
+ int i, len;
+ u8 pos, cap, tmp;
+ u16 flags;
+ int ret;
+ int loops = 100;
+
+ map = kmalloc(pdev->cfg_size, GFP_KERNEL);
+ if (map == NULL)
+ return -ENOMEM;
+ for (i = 0; i < pdev->cfg_size; i++)
+ map[i] = 0xFF;
+ vdev->pci_config_map = map;
+
+ /* default config space */
+ for (i = 0; i < pci_capability_length[0]; i++)
+ map[i] = 0;
+
+ /* any capabilities? */
+ ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
+ if (ret < 0)
+ return ret;
+ if ((flags & PCI_STATUS_CAP_LIST) == 0)
+ return 0;
+
+ ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
+ if (ret < 0)
+ return ret;
+ while (pos && --loops > 0) {
+ ret = pci_read_config_byte(pdev, pos, &cap);
+ if (ret < 0)
+ return ret;
+ if (cap == 0) {
+ printk(KERN_WARNING "%s: cap 0\n", __func__);
+ break;
+ }
+ if (cap > PCI_CAP_ID_MAX) {
+ printk(KERN_WARNING "%s: unknown pci capability id %x\n",
+ __func__, cap);
+ len = 0;
+ } else
+ len = pci_capability_length[cap];
+ if (len == 0) {
+ printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
+ __func__, cap);
+ len = 4;
+ }
+ if (len == 0xFF) {
+ switch (cap) {
+ case PCI_CAP_ID_MSI:
+ ret = pci_read_config_word(pdev,
+ pos + PCI_MSI_FLAGS, &flags);
+ if (ret < 0)
+ return ret;
+ if (flags & PCI_MSI_FLAGS_MASKBIT)
+ /* per vec masking */
+ len = 24;
+ else if (flags & PCI_MSI_FLAGS_64BIT)
+ /* 64 bit */
+ len = 14;
+ else
+ len = 10;
+ break;
+ case PCI_CAP_ID_PCIX:
+ ret = pci_read_config_word(pdev, pos + 2,
+ &flags);
+ if (ret < 0)
+ return ret;
+ if (flags & 0x3000)
+ len = 24;
+ else
+ len = 8;
+ break;
+ case PCI_CAP_ID_VNDR:
+ /* length follows next field */
+ ret = pci_read_config_byte(pdev, pos + 2, &tmp);
+ if (ret < 0)
+ return ret;
+ len = tmp;
+ break;
+ default:
+ len = 0;
+ break;
+ }
+ }
+
+ for (i = 0; i < len; i++) {
+ if (map[pos+i] != 0xFF)
+ printk(KERN_WARNING
+ "%s: pci config conflict at %x, "
+ "caps %x %x\n",
+ __func__, i, map[pos+i], cap);
+ map[pos+i] = cap;
+ }
+ ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
+ if (ret < 0)
+ return ret;
+ }
+ if (loops <= 0)
+ printk(KERN_ERR "%s: config space loop!\n", __func__);
+ return 0;
+}
+
+static void vfio_virt_init(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int bar;
+ u32 *lp;
+ u32 val;
+ u8 *map, pos;
+ u16 flags;
+ int i, len;
+ int ret;
+
+ for (bar = 0; bar <= 5; bar++) {
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4*bar, &val);
+ *lp++ = val;
+ }
+ lp = (u32 *)vdev->vinfo.rombar;
+ pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &val);
+ *lp = val;
+
+ vdev->vinfo.intr = pdev->irq;
+
+ pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
+ map = vdev->pci_config_map + pos;
+ if (pos > 0) {
+ ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
+ if (ret < 0)
+ return;
+ if (flags & PCI_MSI_FLAGS_MASKBIT) /* per vec masking */
+ len = 24;
+ else if (flags & PCI_MSI_FLAGS_64BIT) /* 64 bit */
+ len = 14;
+ else
+ len = 10;
+ for (i = 0; i < len; i++)
+ (void) pci_read_config_byte(pdev, pos + i,
+ &vdev->vinfo.msi[i]);
+ }
+}
+
+static void vfio_bar_fixup(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int bar;
+ u32 *lp;
+ u32 len;
+
+ for (bar = 0; bar <= 5; bar++) {
+ len = pci_resource_len(pdev, bar);
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ if (len == 0) {
+ *lp = 0;
+ } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
+ *lp &= ~0x1;
+ *lp = (*lp & ~(len-1)) |
+ (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
+ if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
+ bar++;
+ } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
+ *lp |= PCI_BASE_ADDRESS_SPACE_IO;
+ *lp = (*lp & ~(len-1)) |
+ (*lp & ~PCI_BASE_ADDRESS_IO_MASK);
+ }
+ }
+ lp = (u32 *)vdev->vinfo.rombar;
+ len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
+ *lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
+ vdev->vinfo.bardirty = 0;
+}
+
+static int vfio_config_rwbyte(int write,
+ struct vfio_dev *vdev,
+ int pos,
+ char __user *buf)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map = vdev->pci_config_map;
+ u8 cap, val, newval;
+ u16 start, off;
+ int p;
+ struct perm_bits *perm;
+ u8 wr, virt;
+ int ret;
+
+ cap = map[pos];
+ if (cap == 0xFF) { /* unknown region */
+ if (write)
+ return 0; /* silent no-op */
+ val = 0;
+ if (pos <= pci_capability_length[0]) /* ok to read */
+ (void) pci_read_config_byte(pdev, pos, &val);
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ return 0;
+ }
+
+ /* scan back to start of cap region */
+ for (p = pos; p >= 0; p--) {
+ if (map[p] != cap)
+ break;
+ start = p;
+ }
+ off = pos - start; /* offset within capability */
+
+ perm = pci_cap_perms[cap];
+ if (perm == NULL) {
+ wr = 0;
+ virt = 0;
+ } else {
+ perm += (off >> 2);
+ wr = perm->write >> ((off & 3) * 8);
+ virt = perm->rvirt >> ((off & 3) * 8);
+ }
+ if (write && !wr) /* no writeable bits */
+ return 0;
+ if (!virt) {
+ if (write) {
+ if (copy_from_user(&val, buf, 1))
+ return -EFAULT;
+ val &= wr;
+ if (wr != 0xFF) {
+ u8 existing;
+
+ ret = pci_read_config_byte(pdev, pos,
+ &existing);
+ if (ret < 0)
+ return ret;
+ val |= (existing & ~wr);
+ }
+ pci_write_config_byte(pdev, pos, val);
+ } else {
+ ret = pci_read_config_byte(pdev, pos, &val);
+ if (ret < 0)
+ return ret;
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ }
+ return 0;
+ }
+
+ if (write) {
+ if (copy_from_user(&newval, buf, 1))
+ return -EFAULT;
+ }
+ /*
+ * We get here if there are some virt bits
+ * handle remaining real bits, if any
+ */
+ if (~virt) {
+ u8 rbits = (~virt) & wr;
+
+ ret = pci_read_config_byte(pdev, pos, &val);
+ if (ret < 0)
+ return ret;
+ if (write && rbits) {
+ val &= ~rbits;
+ newval &= rbits;
+ val |= newval;
+ pci_write_config_byte(pdev, pos, val);
+ }
+ }
+ /*
+ * Now handle entirely virtual fields
+ */
+ switch (cap) {
+ case PCI_CAP_ID_BASIC: /* virtualize BARs */
+ switch (off) {
+ /*
+ * vendor and device are virt because they don't
+ * show up otherwise for sr-iov vfs
+ */
+ case PCI_VENDOR_ID:
+ val = pdev->vendor;
+ break;
+ case PCI_VENDOR_ID + 1:
+ val = pdev->vendor >> 8;
+ break;
+ case PCI_DEVICE_ID:
+ val = pdev->device;
+ break;
+ case PCI_DEVICE_ID + 1:
+ val = pdev->device >> 8;
+ break;
+ case PCI_INTERRUPT_LINE:
+ if (write)
+ vdev->vinfo.intr = newval;
+ else
+ val = vdev->vinfo.intr;
+ break;
+ case PCI_ROM_ADDRESS:
+ case PCI_ROM_ADDRESS+1:
+ case PCI_ROM_ADDRESS+2:
+ case PCI_ROM_ADDRESS+3:
+ if (write) {
+ vdev->vinfo.rombar[off & 3] = newval;
+ vdev->vinfo.bardirty = 1;
+ } else {
+ if (vdev->vinfo.bardirty)
+ vfio_bar_fixup(vdev);
+ val = vdev->vinfo.rombar[off & 3];
+ }
+ break;
+ default:
+ if (off >= PCI_BASE_ADDRESS_0 &&
+ off <= PCI_BASE_ADDRESS_5 + 3) {
+ int boff = off - PCI_BASE_ADDRESS_0;
+
+ if (write) {
+ vdev->vinfo.bar[boff] = newval;
+ vdev->vinfo.bardirty = 1;
+ } else {
+ if (vdev->vinfo.bardirty)
+ vfio_bar_fixup(vdev);
+ val = vdev->vinfo.bar[boff];
+ }
+ }
+ break;
+ }
+ break;
+ case PCI_CAP_ID_MSI: /* virtualize MSI */
+ if (off >= PCI_MSI_ADDRESS_LO && off <= (PCI_MSI_DATA_64 + 2)) {
+ int moff = off - PCI_MSI_ADDRESS_LO;
+
+ if (write)
+ vdev->vinfo.msi[moff] = newval;
+ else
+ val = vdev->vinfo.msi[moff];
+ break;
+ }
+ break;
+ }
+ if (!write && copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ return 0;
+}
+
+ssize_t vfio_config_readwrite(int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int done = 0;
+ int ret;
+ int pos;
+
+ pci_block_user_cfg_access(pdev);
+
+ if (vdev->pci_config_map == NULL) {
+ ret = vfio_build_config_map(vdev);
+ if (ret < 0)
+ goto out;
+ vfio_virt_init(vdev);
+ }
+
+ while (count > 0) {
+ pos = *ppos;
+ if (pos == pdev->cfg_size)
+ break;
+ if (pos > pdev->cfg_size) {
+ ret = -EINVAL;
+ goto out;
+ }
+ ret = vfio_config_rwbyte(write, vdev, pos, buf);
+ if (ret < 0)
+ goto out;
+ buf++;
+ done++;
+ count--;
+ (*ppos)++;
+ }
+ ret = done;
+out:
+ pci_unblock_user_cfg_access(pdev);
+ return ret;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_rdwr.c vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c
--- linux-2.6.34/drivers/vfio/vfio_rdwr.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c 2010-05-28 14:27:40.000000000 -0700
@@ -0,0 +1,147 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/mmu_notifier.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+
+#include <linux/vfio.h>
+
+ssize_t vfio_io_readwrite(
+ int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ size_t done = 0;
+ resource_size_t end;
+ void __iomem *io;
+ loff_t pos;
+ int pci_space;
+ int unit;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ pos = (*ppos & 0xFFFFFFFF);
+
+ if (vdev->bar[pci_space] == NULL)
+ vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+ io = vdev->bar[pci_space];
+ end = pci_resource_len(pdev, pci_space);
+ if (pos + count > end)
+ return -EINVAL;
+
+ while (count > 0) {
+ if ((pos % 4) == 0 && count >= 4) {
+ u32 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 4))
+ return -EFAULT;
+ iowrite32(val, io + pos);
+ } else {
+ val = ioread32(io + pos);
+ if (copy_to_user(buf, &val, 4))
+ return -EFAULT;
+ }
+ unit = 4;
+ } else if ((pos % 2) == 0 && count >= 2) {
+ u16 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 2))
+ return -EFAULT;
+ iowrite16(val, io + pos);
+ } else {
+ val = ioread16(io + pos);
+ if (copy_to_user(buf, &val, 2))
+ return -EFAULT;
+ }
+ unit = 2;
+ } else {
+ u8 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 1))
+ return -EFAULT;
+ iowrite8(val, io + pos);
+ } else {
+ val = ioread8(io + pos);
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ }
+ unit = 1;
+ }
+ pos += unit;
+ buf += unit;
+ count -= unit;
+ done += unit;
+ }
+ *ppos += done;
+ return done;
+}
+
+ssize_t vfio_mem_readwrite(
+ int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ resource_size_t end;
+ void __iomem *io;
+ loff_t pos;
+ int pci_space;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ pos = (*ppos & 0xFFFFFFFF);
+
+ if (vdev->bar[pci_space] == NULL)
+ vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+ io = vdev->bar[pci_space];
+ end = pci_resource_len(pdev, pci_space);
+ if (pos > end)
+ return -EINVAL;
+ if (pos == end)
+ return 0;
+ if (pos + count > end)
+ count = end - pos;
+ if (write) {
+ if (copy_from_user(io + pos, buf, count))
+ return -EFAULT;
+ } else {
+ if (copy_to_user(buf, io + pos, count))
+ return -EFAULT;
+ }
+ *ppos += count;
+ return count;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_sysfs.c vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c
--- linux-2.6.34/drivers/vfio/vfio_sysfs.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c 2010-05-28 14:04:34.000000000 -0700
@@ -0,0 +1,153 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+struct vfio_class *vfio_class;
+
+int vfio_class_init(void)
+{
+ int ret = 0;
+
+ if (vfio_class != NULL) {
+ kref_get(&vfio_class->kref);
+ goto exit;
+ }
+
+ vfio_class = kzalloc(sizeof(*vfio_class), GFP_KERNEL);
+ if (!vfio_class) {
+ ret = -ENOMEM;
+ goto err_kzalloc;
+ }
+
+ kref_init(&vfio_class->kref);
+ vfio_class->class = class_create(THIS_MODULE, "vfio");
+ if (IS_ERR(vfio_class->class)) {
+ ret = IS_ERR(vfio_class->class);
+ printk(KERN_ERR "class_create failed for vfio\n");
+ goto err_class_create;
+ }
+ return 0;
+
+err_class_create:
+ kfree(vfio_class);
+ vfio_class = NULL;
+err_kzalloc:
+exit:
+ return ret;
+}
+
+static void vfio_class_release(struct kref *kref)
+{
+ /* Ok, we cheat as we know we only have one vfio_class */
+ class_destroy(vfio_class->class);
+ kfree(vfio_class);
+ vfio_class = NULL;
+}
+
+void vfio_class_destroy(void)
+{
+ if (vfio_class)
+ kref_put(&vfio_class->kref, vfio_class_release);
+}
+
+ssize_t config_map_read(struct kobject *kobj, struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t count)
+{
+ struct vfio_dev *vdev = bin_attr->private;
+ int ret;
+
+ if (off >= 256)
+ return 0;
+ if (off + count > 256)
+ count = 256 - off;
+ if (vdev->pci_config_map == NULL) {
+ ret = vfio_build_config_map(vdev);
+ if (ret < 0)
+ return ret;
+ }
+ memcpy(buf, vdev->pci_config_map + off, count);
+ return count;
+}
+
+static ssize_t show_locked_pages(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vfio_dev *vdev = dev_get_drvdata(dev);
+
+ if (vdev == NULL)
+ return -ENODEV;
+ return sprintf(buf, "%u\n", vdev->locked_pages);
+}
+
+static DEVICE_ATTR(locked_pages, S_IRUGO, show_locked_pages, NULL);
+
+static struct attribute *vfio_attrs[] = {
+ &dev_attr_locked_pages.attr,
+ NULL,
+};
+
+static struct attribute_group vfio_attr_grp = {
+ .attrs = vfio_attrs,
+};
+
+struct bin_attribute config_map_bin_attribute = {
+ .attr = {
+ .name = "config_map",
+ .mode = S_IRUGO,
+ },
+ .size = 256,
+ .read = config_map_read,
+};
+
+int vfio_dev_add_attributes(struct vfio_dev *vdev)
+{
+ struct bin_attribute *bi;
+ int ret;
+
+ ret = sysfs_create_group(&vdev->dev->kobj, &vfio_attr_grp);
+ if (ret)
+ return ret;
+ bi = kmalloc(sizeof(*bi), GFP_KERNEL);
+ if (bi == NULL)
+ return -ENOMEM;
+ *bi = config_map_bin_attribute;
+ bi->private = vdev;
+ return sysfs_create_bin_file(&vdev->dev->kobj, bi);
+}
diff -uprN linux-2.6.34/include/linux/vfio.h vfio-linux-2.6.34/include/linux/vfio.h
--- linux-2.6.34/include/linux/vfio.h 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/include/linux/vfio.h 2010-05-28 14:29:49.000000000 -0700
@@ -0,0 +1,193 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+/*
+ * VFIO driver - allow mapping and use of certain PCI devices
+ * in unprivileged user processes. (If IOMMU is present)
+ * Especially useful for Virtual Function parts of SR-IOV devices
+ */
+
+#ifdef __KERNEL__
+
+struct vfio_dev {
+ struct device *dev;
+ struct pci_dev *pdev;
+ u8 *pci_config_map;
+ int pci_config_size;
+ char name[8];
+ int devnum;
+ int pmaster;
+ void __iomem *bar[PCI_ROM_RESOURCE+1];
+ spinlock_t lock; /* guards command register accesses */
+ int listeners;
+ int mapcount;
+ u32 locked_pages;
+ struct mutex gate;
+ struct msix_entry *msix;
+ int nvec;
+ struct iommu_domain *domain;
+ int cachec;
+ struct eventfd_ctx *ev_irq;
+ struct eventfd_ctx *ev_msi;
+ struct eventfd_ctx **ev_msix;
+ struct {
+ u8 intr;
+ u8 bardirty;
+ u8 rombar[4];
+ u8 bar[6*4];
+ u8 msi[24];
+ } vinfo;
+};
+
+struct vfio_listener {
+ struct vfio_dev *vdev;
+ struct list_head dm_list;
+ struct mm_struct *mm;
+ struct mmu_notifier mmu_notifier;
+};
+
+/*
+ * Structure for keeping track of memory nailed down by the
+ * user for DMA
+ */
+struct dma_map_page {
+ struct list_head list;
+ struct page **pages;
+ struct scatterlist *sg;
+ dma_addr_t daddr;
+ unsigned long vaddr;
+ int npage;
+ int rdwr;
+};
+
+/* VFIO class infrastructure */
+struct vfio_class {
+ struct kref kref;
+ struct class *class;
+};
+extern struct vfio_class *vfio_class;
+
+ssize_t vfio_io_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+ssize_t vfio_mem_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+ssize_t vfio_config_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+
+void vfio_disable_msi(struct vfio_dev *);
+void vfio_disable_msix(struct vfio_dev *);
+int vfio_enable_msi(struct vfio_dev *, int);
+int vfio_enable_msix(struct vfio_dev *, int, void __user *);
+
+#ifndef PCI_MSIX_ENTRY_SIZE
+#define PCI_MSIX_ENTRY_SIZE 16
+#endif
+#ifndef PCI_STATUS_INTERRUPT
+#define PCI_STATUS_INTERRUPT 0x08
+#endif
+
+struct vfio_dma_map;
+void vfio_dma_unmapall(struct vfio_listener *);
+int vfio_dma_unmap_dm(struct vfio_listener *, struct vfio_dma_map *);
+int vfio_dma_map_common(struct vfio_listener *, unsigned int,
+ struct vfio_dma_map *);
+
+int vfio_class_init(void);
+void vfio_class_destroy(void);
+int vfio_dev_add_attributes(struct vfio_dev *);
+extern struct idr vfio_idr;
+extern struct mutex vfio_minor_lock;
+int vfio_build_config_map(struct vfio_dev *);
+
+irqreturn_t vfio_interrupt(int, void *);
+
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for ioctls */
+
+/*
+ * Structure for DMA mapping of user buffers
+ * vaddr, dmaaddr, and size must all be page aligned
+ * buffer may only be larger than 1 page if (a) there is
+ * an iommu in the system, or (b) buffer is part of a huge page
+ */
+struct vfio_dma_map {
+ __u64 vaddr; /* process virtual addr */
+ __u64 dmaaddr; /* desired and/or returned dma address */
+ __u64 size; /* size in bytes */
+ int rdwr; /* bool: 0 for r/o; 1 for r/w */
+};
+
+/* map user pages at any dma address */
+#define VFIO_DMA_MAP_ANYWHERE _IOWR(';', 100, struct vfio_dma_map)
+
+/* map user pages at specific dma address */
+#define VFIO_DMA_MAP_IOVA _IOWR(';', 101, struct vfio_dma_map)
+
+/* unmap user pages */
+#define VFIO_DMA_UNMAP _IOW(';', 102, struct vfio_dma_map)
+
+/* set device DMA mask & master status */
+#define VFIO_DMA_MASK _IOW(';', 103, __u64)
+
+/* request IRQ interrupts; use given eventfd */
+#define VFIO_EVENTFD_IRQ _IOW(';', 104, int)
+
+/* request MSI interrupts; use given eventfd */
+#define VFIO_EVENTFD_MSI _IOW(';', 105, int)
+
+/* Request MSI-X interrupts: arg[0] is #, arg[1-n] are eventfds */
+#define VFIO_EVENTFDS_MSIX _IOW(';', 106, int)
+
+/* Get length of a BAR */
+#define VFIO_BAR_LEN _IOWR(';', 107, __u32)
+
+/*
+ * Reads, writes, and mmaps determine which PCI BAR (or config space)
+ * from the high level bits of the file offset
+ */
+#define VFIO_PCI_BAR0_RESOURCE 0x0
+#define VFIO_PCI_BAR1_RESOURCE 0x1
+#define VFIO_PCI_BAR2_RESOURCE 0x2
+#define VFIO_PCI_BAR3_RESOURCE 0x3
+#define VFIO_PCI_BAR4_RESOURCE 0x4
+#define VFIO_PCI_BAR5_RESOURCE 0x5
+#define VFIO_PCI_ROM_RESOURCE 0x6
+#define VFIO_PCI_CONFIG_RESOURCE 0xF
+#define VFIO_PCI_SPACE_SHIFT 32
+#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
+
+static inline int vfio_offset_to_pci_space(__u64 off)
+{
+ return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
+}
+
+static __u64 vfio_pci_space_to_offset(int sp)
+{
+ return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
+}
diff -uprN linux-2.6.34/MAINTAINERS vfio-linux-2.6.34/MAINTAINERS
--- linux-2.6.34/MAINTAINERS 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/MAINTAINERS 2010-05-28 12:30:21.000000000 -0700
@@ -5968,6 +5968,13 @@ S: Maintained
F: Documentation/fb/uvesafb.txt
F: drivers/video/uvesafb.*
+VFIO DRIVER
+M: Tom Lyon <pu...@cisco.com>
+S: Supported
+F: Documentation/vfio.txt
+F: drivers/vfio/
+F: include/linux/vfio.h
+
VFAT/FAT/MSDOS FILESYSTEM
M: OGAWA Hirofumi <hiro...@mail.parknet.co.jp>
S: Maintained
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote:
Missing diffstat -p1 -w 70:
Documentation/vfio.txt | 176 ++++++++
MAINTAINERS | 7
drivers/Kconfig | 2
drivers/Makefile | 1
drivers/vfio/Kconfig | 9
drivers/vfio/Makefile | 5
drivers/vfio/vfio_dma.c | 372 ++++++++++++++++++
drivers/vfio/vfio_intrs.c | 189 +++++++++
drivers/vfio/vfio_main.c | 627 +++++++++++++++++++++++++++++++
drivers/vfio/vfio_pci_config.c | 554 +++++++++++++++++++++++++++
drivers/vfio/vfio_rdwr.c | 147 +++++++
drivers/vfio/vfio_sysfs.c | 153 +++++++
include/linux/vfio.h | 193 +++++++++
13 files changed, 2435 insertions(+)
which shows that the patch is missing an update to
Documentation/ioctl/ioctl-number.txt for ioctl code ';'. Please add that.
> diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
> --- linux-2.6.34/drivers/vfio/Kconfig 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/Kconfig 2010-05-27 17:07:25.000000000 -0700
> @@ -0,0 +1,9 @@
> +menuconfig VFIO
> + tristate "Non-Priv User Space PCI drivers"
Non-privileged
> + depends on PCI
> + help
> + Driver to allow advanced user space drivers for PCI, PCI-X,
> + and PCIe devices. Requires IOMMU to allow non-privilged
non-privileged
> + processes to directly control the PCI devices.
> +
> + If you don't know what to do here, say N.
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
> diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
> --- linux-2.6.34/Documentation/vfio.txt 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/Documentation/vfio.txt 2010-05-28 14:03:05.000000000 -0700
> @@ -0,0 +1,176 @@
> +-------------------------------------------------------------------------------
> +The VFIO "driver" is used to allow privileged AND non-privileged processes to
> +implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> +devices.
> +
> +Why is this interesting? Some applications, especially in the high performance
> +computing field, need access to hardware functions with as little overhead as
> +possible. Examples are in network adapters (typically non tcp/ip based) and
non-TCP/IP-based)
> +in compute accelerators - i.e., array processors, FPGA processors, etc.
> +Previous to the VFIO drivers these apps would need either a kernel-level
> +driver (with corrsponding overheads), or else root permissions to directly
corresponding
> +access the hardware. The VFIO driver allows generic access to the hardware
> +from non-privileged apps IF the hardware is "well-behaved" enough for this
> +to be safe.
> +
> +While there have long been ways to implement user-level drivers using specific
> +corresponding drivers in the kernel, it was not until the introduction of the
> +UIO driver framework, and the uio_pci_generic driver that one could have a
> +generic kernel component supporting many types of user level drivers. However,
> +even with the uio_pci_generic driver, processes implementing the user level
> +drivers had to be trusted - they could do dangerous manipulation of DMA
> +addreses and were required to be root to write PCI configuration space
> +registers.
> +
> +Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
> +new hardware capabilities which the VFIO solution exploits to allow non-root
> +user level drivers. The main role of the IOMMU is to ensure that DMA accesses
> +from devices go only to the appropriate memory locations, this allows VFIO to
locations;
BAR
> +
> +Interrupts:
> +
> +Device interrupts are translated by the vfio driver into input events on event
> +notification file descriptors created by the eventfd system call. The user
> +program must one or more event descriptors and pass them to the vfio driver
must ___ ? missing word?
> +via ioctls to arrange for the interrupt mapping:
> +1.
> + efd = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
> + This provides an eventfd for traditional IRQ interrupts.
> + IRQs will be disable after each interrupt until the driver
disabled
given
> +useful if KVM or other virtualization facilities use this driver.
> +
> +The VFIO_DMA_UNMAP takes a fully filled vfio_dma_map structure and unmaps
> +the buffer and releases the corresponding system resources.
> +
> +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> +(device dependent). It takes a single unsigned 64 bit integer as an argument.
> +This call also has the side effect on enabled PCI bus mastership.
eh? I don't get that last sentence...
> +
> +Miscellaneous:
> +
> +The VFIO_BAR_LEN ioctl provides an easy way to determine the size of a PCI
> +device's base address region. It is passed a single integer specifying which
> +BAR (0-5 or 6 for ROM bar), and passes back the length in the same field.
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Please add a 32 bit padding word at the end of this, otherwise the
size of the data structure is incompatible between 32 x86 applications
and 64 bit kernels.
Arnd
Might as well call it 'flags' and reserve a bit more space (keeping
64-bit aligned size) for future expansion.
rdwr can be folded into it.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
How do we enforce security then? We need to ensure that unprivileged
users can only use the device with an iommu.
>> [a pony for avi...]
>> The major new functionality in this version is the ability to deal with
>> PCI config space accesses (through read& write calls) - but includes table
>> driven code to determine whats safe to write and what is not.
>>
> I don't really see why this is helpful: a driver written corrrectly
> will not access these addresses, and we need an iommu anyway to protect
> us against a drivers.
>
Haven't reviewed the code (yet) but things like the BARs, MSI, and
interrupt disable need to be protected from the guest regardless of the
iommu.
--
error compiling committee.c: too many arguments to function
IMO this was because this driver does two things: programming iommu and
handling interrupts. uio does interrupt handling.
We could have moved iommu / DMA programming to
a separate driver, and have uio work with it.
This would solve limitation of the current driver
that is needs an iommu domain per device.
> [a pony for avi...]
> The major new functionality in this version is the ability to deal with
> PCI config space accesses (through read & write calls) - but includes table
> driven code to determine whats safe to write and what is not.
I don't really see why this is helpful: a driver written corrrectly
will not access these addresses, and we need an iommu anyway to protect
us against a drivers.
> Also, some
> virtualization of the config space to allow drivers to think they're writing
> some registers when they're not.
This emulation seems unlikely to be complete.
It can be done in userspace. What's the need for it in kernel?
IMO we definitely do not want to enable this non-IOMMU mode of operation.
If you are writing a driver that can DMA into arbitrary memory,
you should do this in kernel.
> +If the user process desires many DMA buffers, it may be wise to do a mapping
> +of a single large buffer, and then allocate the smaller buffers from the
> +large one.
> +
> +The DMA buffers are locked into physical memory for the duration of their
> +existence - until VFIO_DMA_UNMAP is called, until the user pages are
> +unmapped from the user process, or until the vfio file descriptor is closed.
> +The user process must have permission to lock the pages given by the ulimit(-l)
> +command, which in turn relies on settings in the /etc/security/limits.conf
> +file.
I think it's better to have userspace handle this by calling mlock.
To protect against, buggy userspace,
you can detect when a page is unmapped from a process and
remove it from iommu.
So there's a domain per device? Since a domain uses resources,
and since a single application is likely to use multiple devices,
I think it is better to enable sharing a domain between them.
> + if (vdev->domain == NULL)
> + return -ENXIO;
An ioctl to attach/detach from domain explicitly would be cleaner.
It says 'account' but does not seem to change anything?
Also, this seems racy: userspace
can do mlock after you checked the limit.
> + /* only 1 address space per fd */
> + if (current->mm != listener->mm) {
> + if (listener->mm != NULL)
> + return -EINVAL;
return with lock held
> + listener->mm = current->mm;
> + listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
> + ret = mmu_notifier_register(&listener->mmu_notifier,
> + listener->mm);
> + if (ret)
> + printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
> + __func__, ret);
debugging code?
> + ret = 0;
> + }
> +
> + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
> + if (pages == NULL) {
> + ret = ENOMEM;
> + goto out_lock;
> + }
> + ret = get_user_pages_fast(dmp->vaddr, npage, dmp->rdwr, pages);
> + if (ret != npage) {
> + printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
> + __func__, ret, npage);
above too.
root already can control devices through sysfs.
Let's not add more unsafe ways to do this:
if there's no iommu, just fail the open.
I don't get the above comment. If device can do DMA,
with BAR access you can make it DMA to an arbitrary address
anyway. Does not this driver enforce an iommu to protect against
this?
> + start = vma->vm_pgoff << PAGE_SHIFT;
> + len = vma->vm_end - vma->vm_start;
> + if (vma->vm_flags & VM_WRITE) {
> + ret = vfio_msix_check(vdev, start, len);
> + if (ret)
> + return ret;
> + }
> +
> + vma->vm_private_data = vdev;
> + vma->vm_flags |= VM_IO | VM_RESERVED;
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
> +
> + return remap_pfn_range(vma, vma->vm_start, phys,
> + vma->vm_end - vma->vm_start,
> + vma->vm_page_prot);
This seems to map PCI BAR directly into userspace, but does not
seem to do any accounting.
What prevents userspace from hanging on to the mapped
address after closing an fd?
If this happens, iommu won't protect us.
Add some comments + named constants above?
Force assigning to iommu before we allow any other operation?
>>> [a pony for avi...]
>>> The major new functionality in this version is the ability to deal with
>>> PCI config space accesses (through read& write calls) - but includes table
>>> driven code to determine whats safe to write and what is not.
>>>
>> I don't really see why this is helpful: a driver written corrrectly
>> will not access these addresses, and we need an iommu anyway to protect
>> us against a drivers.
>>
>
> Haven't reviewed the code (yet) but things like the BARs, MSI, and
> interrupt disable need to be protected from the guest regardless of the
> iommu.
Yes but userspace can do this. As long as userspace can not
crash the kernel, no reason to put this policy into kernel.
> +
> +Why is this interesting? Some applications, especially in the high performance
> +computing field, need access to hardware functions with as little overhead as
> +possible. Examples are in network adapters (typically non tcp/ip based) and
> +in compute accelerators - i.e., array processors, FPGA processors, etc.
> +Previous to the VFIO drivers these apps would need either a kernel-level
> +driver (with corrsponding overheads), or else root permissions to directly
> +access the hardware. The VFIO driver allows generic access to the hardware
> +from non-privileged apps IF the hardware is "well-behaved" enough for this
> +to be safe.
>
> +
> +Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
> +there are many other non-IOV PCI devices which also meet the defintion.
> +Elements of this definition are:
> +- The size of any memory BARs to be mmap'ed into the user process space must be
> + a multiple of the system page size.
>
You can relax this.
- smaller than page size can be mapped if the rest of the page unused
and if the platform tolerates writes to unused areas
- if the rest of the page is used, we can relocate the BAR
- otherwise, we can prevent mmap() but still allow mediated access via
a syscall
> +- If MSI-X interrupts are used, the device driver must not attempt to mmap or
> + write the MSI-X vector area.
>
We can allow mediated access (that's what qemu-kvm does). I guess the
ioctls for setting up msi interrupts are equivalent to this mediated access.
(later I see you do provide mediated access via pwrite - please confirm)
> +- The device must not use the PCI configuration space in any non-standard way,
> + i.e., the user level driver will be permitted only to read and write standard
> + fields of the PCI config space, and only if those fields cannot cause harm to
> + the system. In addition, some fields are "virtualized", so that the user
> + driver can read/write them like a kernel driver, but they do not affect the
> + real device.
>
What's wrong with nonstandard fields?
> +
> +Even with these restrictions, there are bound to be devices which are unsafe
> +for user level use - it is still up to the system admin to decide whether to
> +grant access to the device. When the vfio module is loaded, it will have
> +access to no devices until the desired PCI devices are "bound" to the driver.
> +First, make sure the devices are not bound to another kernel driver. You can
> +unload that driver if you wish to unbind all its devices, or else enter the
> +driver's sysfs directory, and unbind a specific device:
> + cd /sys/bus/pci/drivers/<drivername>
> + echo 0000:06:02.00> unbind
> +(The 0000:06:02.00 is a fully qualified PCI device name - different for each
> +device). Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
> +write the PCI device type of the target device to the new_id file:
> + echo 8086 10ca> new_id
> +(8086 10ca are the vendor and device type for the Intel 82576 virtual function
> +devices). A /dev/vfio<N> entry will be created for each device bound. The final
> +step is to grant users permission by changing the mode and/or owner of the /dev
> +entry - "chmod 666 /dev/vfio0".
>
What if I have several such devices? Isn't it better to bind by topoloy
(device address)?
> +
> +Reads& Writes:
> +
> +The user driver will typically use mmap to access the memory BAR(s) of a
> +device; the I/O BARs and the PCI config space may be accessed through normal
> +read and write system calls. Only 1 file descriptor is needed for all driver
> +functions -- the desired BAR for I/O, memory, or config space is indicated via
> +high-order bits of the file offset.
My preference would be one fd per BAR, but that's a matter of personal
taste.
> For instance, the following implements a
> +write to the PCI config space:
> +
> + #include<linux/vfio.h>
> + void pci_write_config_word(int pci_fd, u16 off, u16 wd)
> + {
> + off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
> +
> + if (pwrite(pci_fd,&wd, 2, cfg_off) != 2)
> + perror("pwrite config_dword");
> + }
> +
>
Nice, has the benefit of avoiding endianness issues in the interface.
> +The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
> +in vfio.h to convert bar numbers to file offsets and vice-versa.
> +
> +Interrupts:
> +
> +Device interrupts are translated by the vfio driver into input events on event
> +notification file descriptors created by the eventfd system call. The user
> +program must one or more event descriptors and pass them to the vfio driver
> +via ioctls to arrange for the interrupt mapping:
> +1.
> + efd = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFD_IRQ,&efd);
> + This provides an eventfd for traditional IRQ interrupts.
> + IRQs will be disable after each interrupt until the driver
> + re-enables them via the PCI COMMAND register.
>
My thinking was to emulate a level-triggered interrupt but I think your
way is better. For virtualization, it becomes the responsibility of
user space to multiplex between the guest writing PCI COMMAND and
userspace writing PCI COMMAND to re-enable interrupts, but that's fine.
> +2.
> + efd = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFD_MSI,&efd);
> + This connects MSI interrupts to an eventfd.
> +3.
> + int arg[N+1];
> + arg[0] = N;
> + arg[1..N] = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
> + This connects N MSI-X interrupts with N eventfds.
> +
> +Waiting and checking for interrupts is done by the user program by reads,
> +polls, or selects on the related event file descriptors.
>
This all looks nice and clean.
> +
> +DMA:
> +
> +The VFIO driver uses ioctls to allow the user level driver to get DMA
> +addresses which correspond to virtual addresses. In systems with IOMMUs,
> +each PCI device will have its own address space for DMA operations, so when
> +the user level driver programs the device registers, only addresses known to
> +the IOMMU will be valid, any others will be rejected. The IOMMU creates the
> +illusion (to the device) that multi-page buffers are physically contiguous,
> +so a single DMA operation can safely span multiple user pages. Note that
> +the VFIO driver is still useful in systems without IOMMUs, but only for
> +trusted processes which can deal with DMAs which do not span pages (Huge
> +pages count as a single page also).
> +
> +If the user process desires many DMA buffers, it may be wise to do a mapping
> +of a single large buffer, and then allocate the smaller buffers from the
> +large one.
>
Or use scatter/gather, if the device supports it.
How many such mappings can be mapped simultaneously?
Note you need privileges (RLIMIT_MEMLOCK) to lock memory, this should be
accounted for.
> + /* account for locked pages */
> + locked = npage + current->mm->locked_vm;
> + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
> + >> PAGE_SHIFT;
>
Ah, you already do.
> +/* Kernel& User level defines for ioctls */
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + * buffer may only be larger than 1 page if (a) there is
> + * an iommu in the system, or (b) buffer is part of a huge page
> + */
> +struct vfio_dma_map {
> + __u64 vaddr; /* process virtual addr */
> + __u64 dmaaddr; /* desired and/or returned dma address */
> + __u64 size; /* size in bytes */
> + int rdwr; /* bool: 0 for r/o; 1 for r/w */
> +};
>
As noted before, align, add flags, and reserve space.
> +
> +/* Get length of a BAR */
> +#define VFIO_BAR_LEN _IOWR(';', 107, __u32)
>
A 64-bit BAR will overflow on a 32-bit system.
> +
> +/*
> + * Reads, writes, and mmaps determine which PCI BAR (or config space)
> + * from the high level bits of the file offset
> + */
> +#define VFIO_PCI_BAR0_RESOURCE 0x0
> +#define VFIO_PCI_BAR1_RESOURCE 0x1
> +#define VFIO_PCI_BAR2_RESOURCE 0x2
> +#define VFIO_PCI_BAR3_RESOURCE 0x3
> +#define VFIO_PCI_BAR4_RESOURCE 0x4
> +#define VFIO_PCI_BAR5_RESOURCE 0x5
> +#define VFIO_PCI_ROM_RESOURCE 0x6
> +#define VFIO_PCI_CONFIG_RESOURCE 0xF
> +#define VFIO_PCI_SPACE_SHIFT 32
>
64-bit BARs break this. 51 would be a good value for x86 systems (the
PTE format makes bits 52:62 available to software, so the address space
cannot grow beyond 2PB).
> +#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
> +
> +static inline int vfio_offset_to_pci_space(__u64 off)
> +{
> + return (off>> VFIO_PCI_SPACE_SHIFT)& 0xF;
> +}
> +
> +static __u64 vfio_pci_space_to_offset(int sp)
> +{
> + return (__u64)(sp)<< VFIO_PCI_SPACE_SHIFT;
> +}
>
Needs to be inline too.
Suggest the last function also take the offset, and add a function to
extract the offset from a space/offset combo.
--
error compiling committee.c: too many arguments to function
--
That means the driver must be aware of the iommu.
The userspace driver? Yes. And It is a good thing to be explicit
there anyway, since this lets userspace map a non-contigious
virtual address list into a contiguous bus address range.
No, the kernel driver. It cannot allow userspace to enable bus
mastering unless it knows the iommu is enabled for the device and remaps
dma to user pages.
So what I suggested is failing any kind of access until iommu
is assigned.
So, the kernel driver must be aware of the iommu. In which case it may
as well program it.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
> + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
Not good at that point. I think you need to allocate it first, error if
it can't be allocated and then do the work and free it on error ?
> + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> + mlp->pages = pages;
Ditto
> +int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + struct eventfd_ctx *ctx;
> + int ret = 0;
> + int i;
> + int fd;
> +
> + vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
> + GFP_KERNEL);
> + vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
> + GFP_KERNEL);
These don't seem to get freed on the error path - or indeed protected
against being allocated twice (eg two parallel ioctls ?)
> + case VFIO_DMA_MAP_ANYWHERE:
> + case VFIO_DMA_MAP_IOVA:
> + if (copy_from_user(&dm, uarg, sizeof dm))
> + return -EFAULT;
> + ret = vfio_dma_map_common(listener, cmd, &dm);
> + if (!ret && copy_to_user(uarg, &dm, sizeof dm))
So the vfio_dma_map is untrusted. That seems to be checked ok later but
the dma_map_common code then plays in current->mm-> without apparently
holding any locks to stop the values getting corrupted by a parallel
mlock ?
Actually no I take that back
dmp->size is 64bit
So npage can end up with values like 0xFFFFFFFF and cause 32bit
boxes to go kerblam
> +
> + case VFIO_EVENTFD_IRQ:
> + if (copy_from_user(&fd, uarg, sizeof fd))
> + return -EFAULT;
> + if (vdev->ev_irq)
> + eventfd_ctx_put(vdev->ev_irq);
These paths need locking - suppose two EVENTFD irq ioctls occur at once
(in general these paths seem not to be covered)
>
> + case VFIO_BAR_LEN:
> + if (copy_from_user(&bar, uarg, sizeof bar))
> + return -EFAULT;
> + if (bar < 0 || bar > PCI_ROM_RESOURCE)
> + return -EINVAL;
> + bar = pci_resource_len(pdev, bar);
> + if (copy_to_user(uarg, &bar, sizeof bar))
> + return -EFAULT;
How does this all work out if the device is a bridge ?
> + pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &line);
> + if (line == 0)
> + goto out;
That may produce some interestingly wrong answers. Firstly the platform
has interrupt abstraction so dev->irq may not match PCI_INTERRUPT_LINE,
secondly you have devices that report their IRQ via other paths as per
spec (notably IDE class devices in non-native mode)
So that would also want extra checks.
> + pci_read_config_word(pdev, PCI_COMMAND, &orig);
> + ret = orig & PCI_COMMAND_MASTER;
> + if (!ret) {
> + new = orig | PCI_COMMAND_MASTER;
> + pci_write_config_word(pdev, PCI_COMMAND, new);
> + pci_read_config_word(pdev, PCI_COMMAND, &new);
> + ret = new & PCI_COMMAND_MASTER;
> + pci_write_config_word(pdev, PCI_COMMAND, orig);
The master bit on some devices can be turned on but not off. Not sure it
matters here.
> + vdev->pdev = pdev;
Probably best to take/drop a reference. Not needed if you can prove your
last use is before the end of the remove path though.
Does look like it needs a locking audit, some memory and error checks
reviewing and some further review of the ioctl security and
overflows/trusted values.
Rather a nice way of attacking the user space PCI problem.
It's a kernel driver anyway. Point is that
the *device* driver is better off not programming iommu,
this way we do not need to reprogram it for each device.
The device driver is in userspace. It can't program the iommu. What
the patch proposes is that userspace tells vfio about the needed
mappings, and vfio programs the iommu.
--
error compiling committee.c: too many arguments to function
--
I mean the kernel driver that grants userspace the access.
> It can't program the iommu.
> What
> the patch proposes is that userspace tells vfio about the needed
> mappings, and vfio programs the iommu.
There seems to be some misunderstanding. The userspace interface
proposed forces a separate domain per device and forces userspace to
repeat iommu programming for each device. We are better off sharing a
domain between devices and programming the iommu once.
The natural way to do this is to have an iommu driver for programming
iommu.
This likely means we will have to pass the domain to 'vfio' or uio or
whatever the driver that gives userspace the access to device is called,
but this is only for security, there's no need to support programming
iommu there.
And using this design means the uio framework changes
required would be minor, so we won't have to duplicate code.
iommufd = open(/dev/iommu);
ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
?
If so, I agree.
> The natural way to do this is to have an iommu driver for programming
> iommu.
>
> This likely means we will have to pass the domain to 'vfio' or uio or
> whatever the driver that gives userspace the access to device is called,
> but this is only for security, there's no need to support programming
> iommu there.
>
> And using this design means the uio framework changes
> required would be minor, so we won't have to duplicate code.
>
Since vfio would be the only driver, there would be no duplication. But
a separate object for the iommu mapping is a good thing. Perhaps we can
even share it with vhost (without actually using the mmu, since vhost is
software only).
Yes.
> If so, I agree.
Good.
>> The natural way to do this is to have an iommu driver for programming
>> iommu.
>>
>> This likely means we will have to pass the domain to 'vfio' or uio or
>> whatever the driver that gives userspace the access to device is called,
>> but this is only for security, there's no need to support programming
>> iommu there.
>>
>> And using this design means the uio framework changes
>> required would be minor, so we won't have to duplicate code.
>>
>
> Since vfio would be the only driver, there would be no duplication. But
> a separate object for the iommu mapping is a good thing. Perhaps we can
> even share it with vhost (without actually using the mmu, since vhost is
> software only).
Main difference is that vhost works fine with unlocked
memory, paging it in on demand. iommu needs to unmap
memory when it is swapped out or relocated.
So you'd just take the memory map and not pin anything. This way you
can reuse the memory map.
But no, it doesn't handle the dirty bitmap, so no go.
I'm not really opposed to multiple devices per domain, but let me point out how I
ended up here. First, the driver has two ways of mapping pages, one based on the
iommu api and one based on the dma_map_sg api. With the latter, the system
already allocates a domain per device and there's no way to control it. This was
presumably done to help isolation between drivers. If there are multiple drivers
in the user level, do we not want the same isoation to apply to them?
Also, domains are not a very scarce resource - my little core i5 has 256,
and the intel architecture goes to 64K.
And then there's the fact that it is possible to have multiple disjoint iommus on a system,
so it may not even be possible to bring 2 devices under one domain.
Given all that, I am inclined to leave it alone until someone has a real problem.
Note that not sharing iommu domains doesn't mean you can't share device memory,
just that you have to do multiple mappings
In the case of kvm, we don't want isolation between devices, because
that doesn't happen on real hardware. So if the guest programs devices
to dma to each other, we want that to succeed.
> Also, domains are not a very scarce resource - my little core i5 has 256,
> and the intel architecture goes to 64K.
>
But there is a 0.2% of mapped memory per domain cost for the page
tables. For the kvm use case, that could be significant since a guest
may have large amounts of memory and large numbers of assigned devices.
> And then there's the fact that it is possible to have multiple disjoint iommus on a system,
> so it may not even be possible to bring 2 devices under one domain.
>
That's indeed a deficiency.
> Given all that, I am inclined to leave it alone until someone has a real problem.
> Note that not sharing iommu domains doesn't mean you can't share device memory,
> just that you have to do multiple mappings
>
I think we do have a real problem (though a mild one).
The only issue I see with deferring the solution is that the API becomes
gnarly; both the kernel and userspace will have to support both APIs
forever. Perhaps we can implement the new API but defer the actual
sharing until later, don't know how much work this saves. Or Alex/Chris
can pitch in and help.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
It seems part of the annoyance of the current KVM device assignment is
that we have multiple files open, we mmap here, read there, write over
there, maybe, if it's not emulated. I quite like Tom's approach that we
have one stop shopping with /dev/vfio<n>, including config space
emulation so each driver doesn't have to try to write their own. So
continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
ioctl to vfio so that after we setup one device we can bind the next to
the same domain?
Alex
This is just what I was thinking. But rather than a get/set, just use two fds.
ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);
This may fail if there are really 2 different IOMMUs, so user code must be
prepared for failure, In addition, this is strictlyupwards compatible with
what is there now, so maybe we can add it later.
What happens if one of the fds is later closed?
I don't like this conceptually. There is a 1:n relationship between the
memory map and the devices. Ignoring it will cause the API to have
warts. It's more straightforward to have an object to represent the
memory mapping (and talk to the iommus), and have devices bind to this
object.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
Sure it does. That's exactly what happens when there's an iommu
involved with bare metal.
> So if the guest programs
> devices to dma to each other, we want that to succeed.
And it will as long as ATS is enabled (this is a basic requirement
for PCIe peer-to-peer traffic to succeed with an iommu involved on
bare metal).
That's how things currently are, i.e. we put all devices belonging to a
single guest in the same domain. However, it can be useful to put each
device belonging to a guest in a unique domain. Especially as qemu
grows support for iommu emulation, and guest OSes begin to understand
how to use a hw iommu.
> >Also, domains are not a very scarce resource - my little core i5 has 256,
> >and the intel architecture goes to 64K.
>
> But there is a 0.2% of mapped memory per domain cost for the page
> tables. For the kvm use case, that could be significant since a
> guest may have large amounts of memory and large numbers of assigned
> devices.
>
> >And then there's the fact that it is possible to have multiple disjoint iommus on a system,
> >so it may not even be possible to bring 2 devices under one domain.
>
> That's indeed a deficiency.
Not sure it's a deficiency. Typically to share page table mappings
across multiple iommu's you just have to do update/invalidate to each
hw iommu that is sharing the mapping. Alternatively, you can use more
memory and build/maintain identical mappings (as Tom alludes to below).
> >Given all that, I am inclined to leave it alone until someone has a real problem.
> >Note that not sharing iommu domains doesn't mean you can't share device memory,
> >just that you have to do multiple mappings
>
> I think we do have a real problem (though a mild one).
>
> The only issue I see with deferring the solution is that the API
> becomes gnarly; both the kernel and userspace will have to support
> both APIs forever. Perhaps we can implement the new API but defer
> the actual sharing until later, don't know how much work this saves.
> Or Alex/Chris can pitch in and help.
It really shouldn't be that complicated to create the API to allow for
flexible device <-> domain mappings, so I agree, makes sense to do it
right up front.
thanks,
-chris
But we are emulating a machine without an iommu.
When we emulate a machine with an iommu, then yes, we'll want to use as
many domains as the guest does.
>> So if the guest programs
>> devices to dma to each other, we want that to succeed.
>>
> And it will as long as ATS is enabled (this is a basic requirement
> for PCIe peer-to-peer traffic to succeed with an iommu involved on
> bare metal).
>
> That's how things currently are, i.e. we put all devices belonging to a
> single guest in the same domain. However, it can be useful to put each
> device belonging to a guest in a unique domain. Especially as qemu
> grows support for iommu emulation, and guest OSes begin to understand
> how to use a hw iommu.
>
Right, we need to keep flexibility.
>>> And then there's the fact that it is possible to have multiple disjoint iommus on a system,
>>> so it may not even be possible to bring 2 devices under one domain.
>>>
>> That's indeed a deficiency.
>>
> Not sure it's a deficiency. Typically to share page table mappings
> across multiple iommu's you just have to do update/invalidate to each
> hw iommu that is sharing the mapping. Alternatively, you can use more
> memory and build/maintain identical mappings (as Tom alludes to below).
>
Sharing the page tables is just an optimization, I was worried about
devices in separate domains not talking to each other. if ATS fixes
that, great.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
> There seems to be some misunderstanding. The userspace interface
> proposed forces a separate domain per device and forces userspace to
> repeat iommu programming for each device. We are better off sharing a
> domain between devices and programming the iommu once.
>
> The natural way to do this is to have an iommu driver for programming
> iommu.
IMO a seperate iommu-userspace driver is a nightmare for a userspace
interface. It is just too complicated to use. We can solve the problem
of multiple devices-per-domain with an ioctl which allows binding one
uio-device to the address-space on another. Thats much simpler.
Joerg
>> Main difference is that vhost works fine with unlocked
>> memory, paging it in on demand. iommu needs to unmap
>> memory when it is swapped out or relocated.
>>
> So you'd just take the memory map and not pin anything. This way you
> can reuse the memory map.
>
> But no, it doesn't handle the dirty bitmap, so no go.
IOMMU mapped memory can not be swapped out because we can't do demand
paging on io-page-faults with current devices. We have to pin _all_
userspace memory that is mapped into an IOMMU domain.
Joerg
How can this fail with multiple IOMMUs? This should be handled
transparently by the IOMMU driver.
Joerg
One advantage would be that we can reuse the uio framework
for the devices themselves. So an existing app can just program
an iommu for DMA and keep using uio for interrupts and access.
> We can solve the problem
> of multiple devices-per-domain with an ioctl which allows binding one
> uio-device to the address-space on another.
This would imply switching an iommu domain for a device while
it could potentially be doing DMA. No idea whether this can be done
in a safe manner.
Forcing iommu assignment to be done as a first step seems much saner.
> Thats much simpler.
>
> Joerg
So instead of
dev = open();
ioctl(dev, ASSIGN, iommu)
mmap
and if we for ioctl mmap will fail
we have
dev = open();
if (ndevices > 0)
ioctl(devices[0], ASSIGN, dev)
mmap
And if we forget ioctl we get errors from device.
Seems more complicated to me.
There will also always exist the confusion: address space for
which device are we modifying? With a separate driver for iommu,
we can safely check that binding is done correctly.
--
MST
This is non trivial with hotplug.
--
error compiling committee.c: too many arguments to function
--
vhost doesn't pin memory.
What I proposed is to describe the memory map using an object (fd), and
pass it around to clients that use it: kvm, vhost, vfio. That way you
maintain the memory map in a central location and broadcast changes to
clients. Only a vfio client would result in memory being pinned.
It can still work, but the interface needs to be extended to include
dirty bitmap logging.
--
error compiling committee.c: too many arguments to function
--
Ah ok, so its only about the database which keeps the mapping
information.
> It can still work, but the interface needs to be extended to include
> dirty bitmap logging.
Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
dirty-bits in the page-table. And without demand-paging we can't really
tell what pages a device has written to. The only choice is to mark all
IOMMU-mapped pages dirty as long as they are mapped.
Joerg
> > IMO a seperate iommu-userspace driver is a nightmare for a userspace
> > interface. It is just too complicated to use.
>
> One advantage would be that we can reuse the uio framework
> for the devices themselves. So an existing app can just program
> an iommu for DMA and keep using uio for interrupts and access.
The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
IOMMU mappings belongs there.
> > We can solve the problem
> > of multiple devices-per-domain with an ioctl which allows binding one
> > uio-device to the address-space on another.
>
> This would imply switching an iommu domain for a device while
> it could potentially be doing DMA. No idea whether this can be done
> in a safe manner.
It can. The worst thing that can happen is an io-page-fault.
> Forcing iommu assignment to be done as a first step seems much saner.
If we force it, there is no reason why not doing it implicitly.
We can do something like this then:
dev1 = open();
ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to
it*/
dev2 = open();
ioctl(dev2, IOMMU_MAP, ...);
/* Now dev1 and dev2 are in seperate domains */
ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
assigns it to the same domain as
dev1. Domain has a refcount of two
now */
close(dev1); /* domain refcount goes down to one */
close(dev2); /* domain refcount is zero and domain gets destroyed */
Joerg
Or mark them dirty when they are unmapped.
--
MST
MMU notifiers are problematic because they are designed for situations
where we can do demand paging. The invalidate_range_start and
invalidate_range_end functions are not only called on munmap, they also
run when mprotect is called (in which case we don't want to tear down
iommu mappings). So what may happen with mmu notifiers is that we
accidentially tear down iommu mappings. With demand-paging this is no
problem because the io-ptes could be re-faulted. But that does not work
here.
devices might not be able to recover from this.
> > Forcing iommu assignment to be done as a first step seems much saner.
>
> If we force it, there is no reason why not doing it implicitly.
What you describe below does 3 ioctls for what can be done with 1.
> We can do something like this then:
>
> dev1 = open();
> ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to
> it*/
>
> dev2 = open();
> ioctl(dev2, IOMMU_MAP, ...);
>
> /* Now dev1 and dev2 are in seperate domains */
>
> ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
> assigns it to the same domain as
> dev1. Domain has a refcount of two
> now */
Or maybe it destroys mapping for dev1?
How do you remember?
> close(dev1); /* domain refcount goes down to one */
> close(dev2); /* domain refcount is zero and domain gets destroyed */
>
>
> Joerg
Also, no way to unshare? That seems limiting.
--
MST
One of the issues I see with the current patch is that
it uses the mlock rlimit to do this pinning. So this wastes the rlimit
for an app that did mlockall already, and also consumes
this resource transparently, so an app might call mlock
on a small buffer and be surprised that it fails.
Using mmu notifiers might help?
--
MST
> > It can. The worst thing that can happen is an io-page-fault.
>
> devices might not be able to recover from this.
With the userspace interface a process can create io-page-faults
anyway if it wants. We can't protect us from this. And the process is
also responsible to not tear down iommu-mappings that are currently in
use.
> What you describe below does 3 ioctls for what can be done with 1.
The second IOMMU_MAP ioctl is just to show that existing mappings would
be destroyed if the device is assigned to another address space. Not
strictly necessary. So we have two ioctls but save one call to create
the iommu-domain.
> > ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
> > assigns it to the same domain as
> > dev1. Domain has a refcount of two
> > now */
>
> Or maybe it destroys mapping for dev1?
> How do you remember?
Because we express here that "dev2 shares the iommu mappings of dev1".
Thats easy to remember.
> Also, no way to unshare? That seems limiting.
Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
because that would require a second parameter) ioctl is certainly also
required.
Joerg
Maybe it could be put there but the patch posted did not use uio.
And one of the reasons is that uio framework provides for
device access and interrupts but not for programming memory mappings.
Solutions (besides giving up on uio completely)
could include extending the framework in some way
(which was tried, but the result was not pretty) or adding
a separate driver for iommu and binding to that.
--
MST
We could fail all operations until an iommu is bound.
This will help catch bugs with access before setup. We can not do this
if a domain is bound by default.
> And the process is
> also responsible to not tear down iommu-mappings that are currently in
> use.
>
> > What you describe below does 3 ioctls for what can be done with 1.
>
> The second IOMMU_MAP ioctl is just to show that existing mappings would
> be destroyed if the device is assigned to another address space. Not
> strictly necessary. So we have two ioctls but save one call to create
> the iommu-domain.
With 10 devices you have 10 extra ioctls.
> > > ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
> > > assigns it to the same domain as
> > > dev1. Domain has a refcount of two
> > > now */
> >
> > Or maybe it destroys mapping for dev1?
> > How do you remember?
>
> Because we express here that "dev2 shares the iommu mappings of dev1".
> Thats easy to remember.
they both share the mappings. which one gets the iommu
destroyed (breaking the device if it is now doing DMA)?
> > Also, no way to unshare? That seems limiting.
>
> Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
> because that would require a second parameter) ioctl is certainly also
> required.
>
> Joerg
--
MST
> > With the userspace interface a process can create io-page-faults
> > anyway if it wants. We can't protect us from this.
>
> We could fail all operations until an iommu is bound. This will help
> catch bugs with access before setup. We can not do this if a domain is
> bound by default.
Even if it is bound to a domain the userspace driver could program the
device to do dma to unmapped regions causing io-page-faults. The kernel
can't do anything about it.
> > The second IOMMU_MAP ioctl is just to show that existing mappings would
> > be destroyed if the device is assigned to another address space. Not
> > strictly necessary. So we have two ioctls but save one call to create
> > the iommu-domain.
>
> With 10 devices you have 10 extra ioctls.
And this works implicitly with your proposal? Remember that we still
need to be able to provide seperate mappings for each device to support
IOMMU emulation for the guest. I think my proposal does not have any
extra costs.
> > Because we express here that "dev2 shares the iommu mappings of dev1".
> > Thats easy to remember.
>
> they both share the mappings. which one gets the iommu
> destroyed (breaking the device if it is now doing DMA)?
As I wrote the domain has a reference count and is destroyed only when
it goes down to zero. This does not happen as long as a device is bound
to it.
Joerg
It can always corrupt its own memory directly as well :)
But that is not a reason not to detect errors if we can,
and not to make APIs hard to misuse.
> > > The second IOMMU_MAP ioctl is just to show that existing mappings would
> > > be destroyed if the device is assigned to another address space. Not
> > > strictly necessary. So we have two ioctls but save one call to create
> > > the iommu-domain.
> >
> > With 10 devices you have 10 extra ioctls.
>
> And this works implicitly with your proposal?
Yes. so you do:
iommu = open
ioctl(dev1, BIND, iommu)
ioctl(dev2, BIND, iommu)
ioctl(dev3, BIND, iommu)
ioctl(dev4, BIND, iommu)
No need to add a SHARE ioctl.
> Remember that we still
> need to be able to provide seperate mappings for each device to support
> IOMMU emulation for the guest.
Generally not true. E.g. guest can enable iommu passthrough
or have domain per a group of devices.
> I think my proposal does not have any
> extra costs.
with my proposal we have 1 ioctl per device + 1 per domain.
with yours we have 2 ioctls per device is iommu is shared
and 1 if it is not shared.
as current apps share iommu it seems to make sense
to optimize for that.
> > > Because we express here that "dev2 shares the iommu mappings of dev1".
> > > Thats easy to remember.
> >
> > they both share the mappings. which one gets the iommu
> > destroyed (breaking the device if it is now doing DMA)?
>
> As I wrote the domain has a reference count and is destroyed only when
> it goes down to zero. This does not happen as long as a device is bound
> to it.
>
> Joerg
We were talking about UNSHARE ioctl:
ioctl(dev1, UNSHARE, dev2)
Does it change the domain for dev1 or dev2?
If you make a mistake you get a hard to debug bug.
--
MST
Yes.
>
>> It can still work, but the interface needs to be extended to include
>> dirty bitmap logging.
>>
> Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
> dirty-bits in the page-table. And without demand-paging we can't really
> tell what pages a device has written to. The only choice is to mark all
> IOMMU-mapped pages dirty as long as they are mapped.
>
>
The interface would only work for clients which support it: kvm, vhost,
and iommu/devices with restartable dma.
Note dirty logging is not very interesting for vfio anyway, since you
can't live migrate with assigned devices.
--
error compiling committee.c: too many arguments to function
--
> > Even if it is bound to a domain the userspace driver could program the
> > device to do dma to unmapped regions causing io-page-faults. The kernel
> > can't do anything about it.
>
> It can always corrupt its own memory directly as well :)
> But that is not a reason not to detect errors if we can,
> and not to make APIs hard to misuse.
Changing the domain of a device while dma can happen is the same type of
bug as unmapping potential dma target addresses. We can't catch this
kind of misuse.
> > > With 10 devices you have 10 extra ioctls.
> >
> > And this works implicitly with your proposal?
>
> Yes. so you do:
> iommu = open
> ioctl(dev1, BIND, iommu)
> ioctl(dev2, BIND, iommu)
> ioctl(dev3, BIND, iommu)
> ioctl(dev4, BIND, iommu)
>
> No need to add a SHARE ioctl.
In my proposal this looks like:
dev1 = open();
ioctl(dev2, SHARE, dev1);
ioctl(dev3, SHARE, dev1);
ioctl(dev4, SHARE, dev1);
So we actually save an ioctl.
> > Remember that we still need to be able to provide seperate mappings
> > for each device to support IOMMU emulation for the guest.
>
> Generally not true. E.g. guest can enable iommu passthrough
> or have domain per a group of devices.
What I meant was that there may me multiple io-addresses spaces
necessary for one process. I didn't want to say that every device
_needs_ to have its own address space.
> > As I wrote the domain has a reference count and is destroyed only when
> > it goes down to zero. This does not happen as long as a device is bound
> > to it.
> >
> > Joerg
>
> We were talking about UNSHARE ioctl:
> ioctl(dev1, UNSHARE, dev2)
> Does it change the domain for dev1 or dev2?
> If you make a mistake you get a hard to debug bug.
As I already wrote we would have an UNBIND ioctl which just removes a
device from its current domain. UNBIND is better than UNSHARE for
exactly the reason you pointed out above. I thought I stated that
already.
Joerg
The problem with this is that it is assymetric, dev1 is treated
differently from dev[234]. It's an unintuitive API.
--
error compiling committee.c: too many arguments to function
--
you normally need device mapped to start DMA.
SHARE makes this bug more likely as you allow
switching domains: mmap could be done before switching.
> > > > With 10 devices you have 10 extra ioctls.
> > >
> > > And this works implicitly with your proposal?
> >
> > Yes. so you do:
> > iommu = open
> > ioctl(dev1, BIND, iommu)
> > ioctl(dev2, BIND, iommu)
> > ioctl(dev3, BIND, iommu)
> > ioctl(dev4, BIND, iommu)
> >
> > No need to add a SHARE ioctl.
>
> In my proposal this looks like:
>
>
> dev1 = open();
> ioctl(dev2, SHARE, dev1);
> ioctl(dev3, SHARE, dev1);
> ioctl(dev4, SHARE, dev1);
>
> So we actually save an ioctl.
I thought we had a BIND ioctl?
> > > Remember that we still need to be able to provide seperate mappings
> > > for each device to support IOMMU emulation for the guest.
> >
> > Generally not true. E.g. guest can enable iommu passthrough
> > or have domain per a group of devices.
>
> What I meant was that there may me multiple io-addresses spaces
> necessary for one process. I didn't want to say that every device
> _needs_ to have its own address space.
>
> > > As I wrote the domain has a reference count and is destroyed only when
> > > it goes down to zero. This does not happen as long as a device is bound
> > > to it.
> > >
> > > Joerg
> >
> > We were talking about UNSHARE ioctl:
> > ioctl(dev1, UNSHARE, dev2)
> > Does it change the domain for dev1 or dev2?
> > If you make a mistake you get a hard to debug bug.
>
> As I already wrote we would have an UNBIND ioctl which just removes a
> device from its current domain. UNBIND is better than UNSHARE for
> exactly the reason you pointed out above. I thought I stated that
> already.
>
> Joerg
You undo SHARE with UNBIND?
Its by far more unintuitive that a process needs to explicitly bind a
device to an iommu domain before it can do anything with it. If its
required anyway the binding can happen implicitly. We could allow to do
a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
Note that this way of handling userspace iommu mappings is also a lot
simpler for most use-cases outside of KVM. If a developer wants to write
a userspace driver all it needs to do is:
dev = open();
ioctl(dev, MAP, ...);
/* use device with mappings */
close(dev);
Which is much easier than the need to create a domain explicitly.
Joerg
I don't really care about the iommu domain. It's a side effect. The
kernel takes care of it. I'm only worried about the API.
We have a memory map that is (often) the same for a set of devices. If
you were coding a non-kernel interface, how would you code it?
struct memory_map;
void memory_map_init(struct memory_map *mm, ...);
struct device;
void device_set_memory_map(struct device *device, struct memory_map *mm);
or
struct device;
void device_init_memory_map(struct device *dev, ...);
void device_clone_memory_map(struct device *dev, struct device *other);
I wouldn't even think of the second one personally.
> If its
> required anyway the binding can happen implicitly. We could allow to do
> a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
>
It's still special. You define the memory map only for the first
device. You have to make sure dev1 doesn't go away while sharing it.
> Note that this way of handling userspace iommu mappings is also a lot
> simpler for most use-cases outside of KVM. If a developer wants to write
> a userspace driver all it needs to do is:
>
> dev = open();
> ioctl(dev, MAP, ...);
> /* use device with mappings */
> close(dev);
>
> Which is much easier than the need to create a domain explicitly.
>
mm = open()
ioctl(mm, MAP, ...)
dev = open();
ioctl(dev, BIND, mm);
...
close(mm);
close(dev);
so yes, more work, but once you have multiple devices which come and go
dynamically things become simpler. The map object has global lifetime
(you can even construct it if you don't assign any devices), the devices
attach to it, memory hotplug updates the memory map but doesn't touch
devices.
--
error compiling committee.c: too many arguments to function
--
> you normally need device mapped to start DMA.
> SHARE makes this bug more likely as you allow
> switching domains: mmap could be done before switching.
We need to support domain switching anyway for iommu emulation in a
guest. So if you consider this to be a problem (I don't) it will not go
away with your proposal.
> > dev1 = open();
> > ioctl(dev2, SHARE, dev1);
> > ioctl(dev3, SHARE, dev1);
> > ioctl(dev4, SHARE, dev1);
> >
> > So we actually save an ioctl.
>
> I thought we had a BIND ioctl?
I can't remember a BIND ioctl in my proposal. I remember an UNBIND, but
thats bad naming as you pointed out below. See my statement on this
below too.
> You undo SHARE with UNBIND?
Thats bad naming, agreed. Lets keep UNSHARE. Point is, we only need one
parameter to do this which removes any ambiguity:
ioctl(dev1, UNSHARE);
Joerg
The reason it is more intuitive is because it is harder to get it
wrong. If you swap iommu and device in the call, you get BADF
so you know you made a mistake. We can even make it work
both ways if we wanted to. With ioctl(dev1, BIND, dev2)
it breaks silently.
> If its
> required anyway the binding can happen implicitly. We could allow to do
> a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
And then when we assign meaning to it we find that half the apps
are broken because they did not call this ioctl.
> Note that this way of handling userspace iommu mappings is also a lot
> simpler for most use-cases outside of KVM. If a developer wants to write
> a userspace driver all it needs to do is:
>
> dev = open();
> ioctl(dev, MAP, ...);
> /* use device with mappings */
> close(dev);
>
> Which is much easier than the need to create a domain explicitly.
>
> Joerg
This simple scenario ignores all the real-life corner cases.
For example, with an explicit iommu open and bind application
can naturally detect that:
- we have run out of iommu domains
- iommu is unsupported
- iommu is in use by another, incompatible device
- device is in bad state
because each is a separate operation, so it is easy to produce meaningful
errors.
Another interesting thing that a separate iommu device supports is when
application A controls the iommu and application B
controls the device. This might be good to e.g. improve security
(B is run by root, A is unpriveledged and passes commands to/from B
over a pipe).
This is not possible when same fd is used for iommu and device.
--
MST
>> Its by far more unintuitive that a process needs to explicitly bind a
>> device to an iommu domain before it can do anything with it.
>
> I don't really care about the iommu domain. It's a side effect. The
> kernel takes care of it. I'm only worried about the API.
The proposed memory-map object is nothing else than a userspace
abstraction of an iommu-domain.
> We have a memory map that is (often) the same for a set of devices. If
> you were coding a non-kernel interface, how would you code it?
>
> struct memory_map;
> void memory_map_init(struct memory_map *mm, ...);
> struct device;
> void device_set_memory_map(struct device *device, struct memory_map *mm);
>
> or
>
> struct device;
> void device_init_memory_map(struct device *dev, ...);
> void device_clone_memory_map(struct device *dev, struct device *other);
>
> I wouldn't even think of the second one personally.
Right, a kernel-interface would be designed the first way. The IOMMU-API
is actually designed in this manner. But I still think we should keep it
simpler for userspace.
>> If its required anyway the binding can happen implicitly. We could
>> allow to do a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
>
> It's still special. You define the memory map only for the first
> device. You have to make sure dev1 doesn't go away while sharing it.
Must be a misunderstanding. In my proposal the domain is not owned by
one device. It is owned by all devices that share it and will only
vanish if all devices that use it are unbound (which happens when the file
descriptor is closed, for example).
> so yes, more work, but once you have multiple devices which come and go
> dynamically things become simpler. The map object has global lifetime
> (you can even construct it if you don't assign any devices), the devices
> attach to it, memory hotplug updates the memory map but doesn't touch
> devices.
I still think a userspace interface should be as simple as possible. But
since both ways will work I am not really opposed to Michael's proposal.
I just think its overkill for the common non-kvm usecase (a userspace
device driver).
Joerg
>
> > If its
> > required anyway the binding can happen implicitly. We could allow to do
> > a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
>
> And then when we assign meaning to it we find that half the apps
> are broken because they did not call this ioctl.
The meaning is already assigned and chaning it means changing the
userspace-abi which is a no-go.
> This simple scenario ignores all the real-life corner cases.
> For example, with an explicit iommu open and bind application
> can naturally detect that:
> - we have run out of iommu domains
ioctl(dev, MAP, ...) will fail in this case.
> - iommu is unsupported
Is best checked by open() anyway because userspace can't do anything
with the device before it is bound to a domain.
> - iommu is in use by another, incompatible device
How should this happen?
> - device is in bad state
How is this checked with your proposal and why can this not be detected
with my one?
> because each is a separate operation, so it is easy to produce meaningful
> errors.
Ok, this is true.
> Another interesting thing that a separate iommu device supports is when
> application A controls the iommu and application B
> controls the device.
Until Linux becomes a micro-kernel the IOMMU itself will _never_ be
controlled by an application.
> This might be good to e.g. improve security (B is run by root, A is
> unpriveledged and passes commands to/from B over a pipe).
Micro-kernel arguments. I hope a userspace controlled IOMMU in Linux
will never happen ;-)
Joerg
BTW, there is no such thing as restartable dma. There is a provision in
new specs (read: no real hardware) that allows a device to request pages
before using them. So it's akin to demand paging, but the demand is an
explicit request rather than a page fault.
thanks,
-chris
This is not any hot path, so saving an ioctl shouldn't be a consideration.
Only important consideration is a good API. I may have lost context here,
but the SHARE API is limited to the vfio fd. The BIND API expects a new
iommu object. Are there other uses for this object? Tom's current vfio
driver exposes a dma mapping interface, would the iommu object expose
one as well? Current interface is device specific DMA interface for
host device drivers typically mapping in-flight dma buffers, and IOMMU
specific interface for assigned devices typically mapping entire virtual
address space.
thanks,
-chris
Actually, it a domain object - which may be usable among iommus (Joerg?).
However, you can't really do the dma mapping with just the domain because
every device supports a different size address space as a master, i.e.,
the dma_mask.
And I don't know how kvm would deal with devices with varying dma mask support,
or why they'd be in the same domain.
> > This is not any hot path, so saving an ioctl shouldn't be a consideration.
> > Only important consideration is a good API. I may have lost context here,
> > but the SHARE API is limited to the vfio fd. The BIND API expects a new
> > iommu object. Are there other uses for this object? Tom's current vfio
> > driver exposes a dma mapping interface, would the iommu object expose
> > one as well? Current interface is device specific DMA interface for
> > host device drivers typically mapping in-flight dma buffers, and IOMMU
> > specific interface for assigned devices typically mapping entire virtual
> > address space.
>
> Actually, it a domain object - which may be usable among iommus (Joerg?).
Yes, this 'iommu' thing is would be a domain object. But sharing among
iommus is not necessary because the fact that there are multiple iommus
in the system is hidden by the iommu drivers. This fact is not even
exposed by the iommu-api. This makes protection domains system global.
> However, you can't really do the dma mapping with just the domain because
> every device supports a different size address space as a master, i.e.,
> the dma_mask.
The dma_mask has to be handled by the device driver. With the
iommu-mapping interface the driver can specify the target io-address and
has to consider the dma_mask for that too.
> And I don't know how kvm would deal with devices with varying dma mask support,
> or why they'd be in the same domain.
KVM does not care about these masks. This is the business of the guest
device drivers.
Joerg
Both kvm and vhost use similar memory maps, so they could use the new
object (without invoking the iommu unless they want dma).
> Tom's current vfio
> driver exposes a dma mapping interface, would the iommu object expose
> one as well? Current interface is device specific DMA interface for
> host device drivers typically mapping in-flight dma buffers, and IOMMU
> specific interface for assigned devices typically mapping entire virtual
> address space.
>
A per-request mapping sounds like a device API since it would only
affect that device (whereas the address space API affects multiple devices).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
1. Create a user-iommu-domain driver - opening it will give a new empty domain.
Ultimately this can also populate sysfs with the state of its world, which would
also be a good addition to the base iommu stuff.
If someone closes the fd while in use, the domain stays valid anyway until users
drop off.
2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver. Require that
a domain be set before using the VFIO_DMA_MAP_IOVA ioctl (this is the one
that KVM wants). However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
which uses the dma_sg interface which has no expicit control of domains. I
intend to keep it the way it is, but expect only non-hypervisor programs would
want to use it.
3. Clean up the docs and other nits that folks have found.
Comments?
Require domain to be set before you allow any access to the device:
mmap, write, read. IMO this is the only safe way to make sure userspace
does not corrupt memory, and this removes the need to special-case
MSI memory, play with bus master enable and hope it can be cleared without
reset, etc.
> (this is the one
> that KVM wants).
Not sure I understand. I think that MAP should be done on the domain,
not the device, this handles pinning pages correctly and
this way you don't need any special checks.
> However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
> which uses the dma_sg interface which has no expicit control of domains. I
> intend to keep it the way it is, but expect only non-hypervisor programs would
> want to use it.
If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
non-hypervisors just pick an address?
Thanks for the correction.
--
error compiling committee.c: too many arguments to function
--
Michael - the light bulb finally lit for me and I now understand what you've been
saying the past few weeks. Of course you're right - we need iommu set before any
register access. I had thought that was done by default but now I see that the
dma_map_sg routine only attaches to the iommu on demand.
So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl
with the vfio fd so that the user can still do everything with one fd. I'm thinking the
fd opens and iommu bindings could be done in a program before spinning out the
program with the user driver.
This would kind of break Avi's idea that mappings are programmed
at the domain and shared by multiple devices, won't it?
Various locking, security, and documentation issues have also been fixed.
Please commit - it or me!
But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?
Blurb from previous patch version:
This patch is the evolution of code which was first proposed as a patch to
uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
out of the uio framework, and things seem much cleaner. Of course, there is
a lot of functional overlap with uio, but the previous version just seemed
like a giant mode switch in the uio code that did not lead to clarity for
either the new or old code.
[a pony for avi...]
The major new functionality in this version is the ability to deal with
PCI config space accesses (through read & write calls) - but includes table
driven code to determine whats safe to write and what is not. Also, some
virtualization of the config space to allow drivers to think they're writing
some registers when they're not. Also, IO space accesses are also allowed.
Drivers for devices which use MSI-X are now prevented from directly writing
the MSI-X vector area.
All interrupts are now handled using eventfds, which makes things very simple.
The name VFIO refers to the Virtual Function capabilities of SR-IOV devices
but the driver does support many more types of devices. I was none too sure
what driver directory this should live in, so for now I made up my own under
drivers/vfio. As a new driver/new directory, who makes the commit decision?
I currently have user level drivers working for 3 different network adapters
- the Cisco "Palo" enic, the Intel 82599 VF, and the Intel 82576 VF (but the
whole user level framework is a long ways from release). This driver could
also clearly replace a number of other drivers written just to give user
access to certain devices - but that will take time.
diff -uprN linux-2.6.34/Documentation/ioctl/ioctl-number.txt vfio-linux-2.6.34/Documentation/ioctl/ioctl-number.txt
--- linux-2.6.34/Documentation/ioctl/ioctl-number.txt 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/Documentation/ioctl/ioctl-number.txt 2010-05-28 16:57:58.000000000 -0700
@@ -87,6 +87,7 @@ Code Seq#(hex) Include File Comments
and kernel/power/user.c
'8' all SNP8023 advanced NIC card
<mailto:m...@solidum.com>
+';' 64-6F linux/vfio.h
'@' 00-0F linux/radeonfb.h conflict!
'@' 00-0F drivers/video/aty/aty128fb.c conflict!
'A' 00-1F linux/apm_bios.h conflict!
diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
--- linux-2.6.34/Documentation/vfio.txt 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/Documentation/vfio.txt 2010-06-07 15:05:42.000000000 -0700
@@ -0,0 +1,177 @@
+-------------------------------------------------------------------------------
+The VFIO "driver" is used to allow privileged AND non-privileged processes to
+implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
+devices.
+
+Why is this interesting? Some applications, especially in the high performance
+computing field, need access to hardware functions with as little overhead as
+possible. Examples are in network adapters (typically non TCP/IP based) and
+in compute accelerators - i.e., array processors, FPGA processors, etc.
+Previous to the VFIO drivers these apps would need either a kernel-level
+driver (with corresponding overheads), or else root permissions to directly
+access the hardware. The VFIO driver allows generic access to the hardware
+from non-privileged apps IF the hardware is "well-behaved" enough for this
+to be safe.
+
+While there have long been ways to implement user-level drivers using specific
+corresponding drivers in the kernel, it was not until the introduction of the
+UIO driver framework, and the uio_pci_generic driver that one could have a
+generic kernel component supporting many types of user level drivers. However,
+even with the uio_pci_generic driver, processes implementing the user level
+drivers had to be trusted - they could do dangerous manipulation of DMA
+addreses and were required to be root to write PCI configuration space
+registers.
+
+Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
+new hardware capabilities which the VFIO solution exploits to allow non-root
+user level drivers. The main role of the IOMMU is to ensure that DMA accesses
+from devices go only to the appropriate memory locations; this allows VFIO to
+ensure that user level drivers do not corrupt inappropriate memory. PCI I/O
+virtualization (SR-IOV) was defined to allow "pass-through" of virtual devices
+to guest virtual machines. VFIO in essence implements pass-through of devices
+to user processes, not virtual machines. SR-IOV devices implement a
+traditional PCI device (the physical function) and a dynamic number of special
+PCI devices (virtual functions) whose feature set is somewhat restricted - in
+order to allow the operating system or virtual machine monitor to ensure the
+safe operation of the system.
+
+Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
+there are many other non-IOV PCI devices which also meet the defintion.
+Elements of this definition are:
+- The size of any memory BARs to be mmap'ed into the user process space must be
+ a multiple of the system page size.
+- If MSI-X interrupts are used, the device driver must not attempt to mmap or
+ write the MSI-X vector area.
+- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
+ revision 2.3 to allow its interrupts to be masked in a generic way.
+- The device must not use the PCI configuration space in any non-standard way,
+ i.e., the user level driver will be permitted only to read and write standard
+ fields of the PCI config space, and only if those fields cannot cause harm to
+ the system. In addition, some fields are "virtualized", so that the user
+ driver can read/write them like a kernel driver, but they do not affect the
+ real device.
+- For now, there is no support for user access to the PCIe and PCI-X extended
+ capabilities configuration space.
+
+Even with these restrictions, there are bound to be devices which are unsafe
+for user level use - it is still up to the system admin to decide whether to
+grant access to the device. When the vfio module is loaded, it will have
+access to no devices until the desired PCI devices are "bound" to the driver.
+First, make sure the devices are not bound to another kernel driver. You can
+unload that driver if you wish to unbind all its devices, or else enter the
+driver's sysfs directory, and unbind a specific device:
+ cd /sys/bus/pci/drivers/<drivername>
+ echo 0000:06:02.00 > unbind
+(The 0000:06:02.00 is a fully qualified PCI device name - different for each
+device). Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
+write the PCI device type of the target device to the new_id file:
+ echo 8086 10ca > new_id
+(8086 10ca are the vendor and device type for the Intel 82576 virtual function
+devices). A /dev/vfio<N> entry will be created for each device bound. The final
+step is to grant users permission by changing the mode and/or owner of the /dev
+entry - "chmod 666 /dev/vfio0".
+
+Reads & Writes:
+
+The user driver will typically use mmap to access the memory BAR(s) of a
+device; the I/O BARs and the PCI config space may be accessed through normal
+read and write system calls. Only 1 file descriptor is needed for all driver
+functions -- the desired BAR for I/O, memory, or config space is indicated via
+high-order bits of the file offset. For instance, the following implements a
+write to the PCI config space:
+
+ #include <linux/vfio.h>
+ void pci_write_config_word(int pci_fd, u16 off, u16 wd)
+ {
+ off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
+
+ if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
+ perror("pwrite config_dword");
+ }
+
+The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
+in vfio.h to convert BAR numbers to file offsets and vice-versa.
+
+Interrupts:
+
+Device interrupts are translated by the vfio driver into input events on event
+notification file descriptors created by the eventfd system call. The user
+program must create one or more event descriptors and pass them to the vfio
+driver via ioctls to arrange for the interrupt mapping:
+1.
+ efd = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
+ This provides an eventfd for traditional IRQ interrupts.
+ IRQs will be disable after each interrupt until the driver
+ re-enables them via the PCI COMMAND register.
+2.
+ efd = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
+ This connects MSI interrupts to an eventfd.
+3.
+ int arg[N+1];
+ arg[0] = N;
+ arg[1..N] = eventfd(0, 0);
+ ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
+ This connects N MSI-X interrupts with N eventfds.
+
+Waiting and checking for interrupts is done by the user program by reads,
+polls, or selects on the related event file descriptors.
+
+DMA:
+
+The VFIO driver uses ioctls to allow the user level driver to get DMA
+addresses which correspond to virtual addresses. In systems with IOMMUs,
+each PCI device will have its own address space for DMA operations, so when
+the user level driver programs the device registers, only addresses known to
+the IOMMU will be valid, any others will be rejected. The IOMMU creates the
+illusion (to the device) that multi-page buffers are physically contiguous,
+so a single DMA operation can safely span multiple user pages.
+
+If the user process desires many DMA buffers, it may be wise to do a mapping
+of a single large buffer, and then allocate the smaller buffers from the
+large one.
+
+The DMA buffers are locked into physical memory for the duration of their
+existence - until VFIO_DMA_UNMAP is called, until the user pages are
+unmapped from the user process, or until the vfio file descriptor is closed.
+The user process must have permission to lock the pages given by the ulimit(-l)
+command, which in turn relies on settings in the /etc/security/limits.conf
+file.
+
+The vfio_dma_map structure is used as an argument to the ioctls which
+do the DMA mapping. Its vaddr, dmaaddr, and size fields must always be a
+multiple of a page. Its rdwr field is zero for read-only (outbound), and
+non-zero for read/write buffers.
+
+ struct vfio_dma_map {
+ __u64 vaddr; /* process virtual addr */
+ __u64 dmaaddr; /* desired and/or returned dma address */
+ __u64 size; /* size in bytes */
+ int rdwr; /* bool: 0 for r/o; 1 for r/w */
+ };
+
+The VFIO_DMA_MAP_IOVA is called with a vfio_dma_map structure with the
+dmaaddr field already assigned. The system will attempt to map the DMA
+buffer into the IO space at the given dmaaddr. This is expected to be
+useful if KVM or other virtualization facilities use this driver.
+Use of VFIO_DMA_MAP_IOVA requires an explicit assignment of the device
+to an IOMMU domain. A file descriptor for an empty IOMMU domain is
+acquired by opening /dev/uiommu. The device is then attached to the
+domain by issuing a VFIO_DOMAIN_SET ioctl with the domain fd address as
+the argument. The device may be detached from the domain with the
+VFIO_DOMAIN_UNSET ioctl (no argument). It is expected that hypervisors
+may wish to attach many devices to the same domain.
+
+The VFIO_DMA_UNMAP takes a fully filled vfio_dma_map structure and unmaps
+the buffer and releases the corresponding system resources.
+
+The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
+(device dependent). It takes a single unsigned 64 bit integer as an argument.
+This call also has the side effect of enabling PCI bus mastership.
+
+Miscellaneous:
+
+The VFIO_BAR_LEN ioctl provides an easy way to determine the size of a PCI
+device's base address region. It is passed a single integer specifying which
+BAR (0-5 or 6 for ROM bar), and passes back the length in the same field.
diff -uprN linux-2.6.34/drivers/Kconfig vfio-linux-2.6.34/drivers/Kconfig
--- linux-2.6.34/drivers/Kconfig 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/drivers/Kconfig 2010-05-27 17:01:02.000000000 -0700
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
source "drivers/staging/Kconfig"
source "drivers/platform/Kconfig"
+
+source "drivers/vfio/Kconfig"
endmenu
diff -uprN linux-2.6.34/drivers/Makefile vfio-linux-2.6.34/drivers/Makefile
--- linux-2.6.34/drivers/Makefile 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/drivers/Makefile 2010-05-27 17:25:33.000000000 -0700
@@ -52,6 +52,7 @@ obj-$(CONFIG_FUSION) += message/
obj-$(CONFIG_FIREWIRE) += firewire/
obj-y += ieee1394/
obj-$(CONFIG_UIO) += uio/
+obj-$(CONFIG_VFIO) += vfio/
obj-y += cdrom/
obj-y += auxdisplay/
obj-$(CONFIG_PCCARD) += pcmcia/
diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
--- linux-2.6.34/drivers/vfio/Kconfig 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/Kconfig 2010-06-07 15:28:14.000000000 -0700
@@ -0,0 +1,18 @@
+menuconfig VFIO
+ tristate "Non-Priv User Space PCI drivers"
+ depends on UIOMMU && PCI && IOMMU_API
+ help
+ Driver to allow advanced user space drivers for PCI, PCI-X,
+ and PCIe devices. Requires IOMMU to allow non-privilged
+ processes to directly control the PCI devices.
+
+ If you don't know what to do here, say N.
+
+menuconfig UIOMMU
+ tristate "User level manipulation of IOMMU"
+ help
+ Device driver to allow user level programs to
+ manipulate IOMMU domains
+
+ If you don't know what to do here, say N.
+
diff -uprN linux-2.6.34/drivers/vfio/Makefile vfio-linux-2.6.34/drivers/vfio/Makefile
--- linux-2.6.34/drivers/vfio/Makefile 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/Makefile 2010-06-03 16:32:32.000000000 -0700
@@ -0,0 +1,6 @@
+obj-$(CONFIG_VFIO) := vfio.o
+obj-$(CONFIG_UIOMMU) += uiommu.o
+
+vfio-y := vfio_main.o vfio_dma.o vfio_intrs.o \
+ vfio_pci_config.o vfio_rdwr.o vfio_sysfs.o
+
diff -uprN linux-2.6.34/drivers/vfio/uiommu.c vfio-linux-2.6.34/drivers/vfio/uiommu.c
--- linux-2.6.34/drivers/vfio/uiommu.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/uiommu.c 2010-06-07 15:28:24.000000000 -0700
@@ -0,0 +1,126 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+*/
+
+/*
+ * uiommu driver - issue fd handles for IOMMU domains
+ * so they may be passed to vfio (and others?)
+ */
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/iommu.h>
+#include <linux/uiommu.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tom Lyon <pu...@cisco.com>");
+MODULE_DESCRIPTION("User IOMMU driver");
+
+static struct uiommu_domain *uiommu_domain_alloc(void)
+{
+ struct iommu_domain *domain;
+ struct uiommu_domain *udomain;
+
+ domain = iommu_domain_alloc();
+ if (domain == NULL)
+ return NULL;
+ udomain = kzalloc(sizeof *udomain, GFP_KERNEL);
+ if (udomain == NULL) {
+ iommu_domain_free(domain);
+ return NULL;
+ }
+ udomain->domain = domain;
+ atomic_inc(&udomain->refcnt);
+ return udomain;
+}
+
+static int uiommu_open(struct inode *inode, struct file *file)
+{
+ struct uiommu_domain *udomain;
+
+ udomain = uiommu_domain_alloc();
+ if (udomain == NULL)
+ return -ENOMEM;
+ file->private_data = udomain;
+ return 0;
+}
+
+static int uiommu_release(struct inode *inode, struct file *file)
+{
+ struct uiommu_domain *udomain;
+
+ udomain = file->private_data;
+ uiommu_put(udomain);
+ return 0;
+}
+
+static const struct file_operations uiommu_fops = {
+ .owner = THIS_MODULE,
+ .open = uiommu_open,
+ .release = uiommu_release,
+};
+
+static struct miscdevice uiommu_dev = {
+ .name = "uiommu",
+ .minor = MISC_DYNAMIC_MINOR,
+ .fops = &uiommu_fops,
+};
+
+struct uiommu_domain *uiommu_fdget(int fd)
+{
+ struct file *file;
+ struct uiommu_domain *udomain;
+
+ file = fget(fd);
+ if (!file)
+ return ERR_PTR(-EBADF);
+ if (file->f_op != &uiommu_fops) {
+ fput(file);
+ return ERR_PTR(-EINVAL);
+ }
+ udomain = file->private_data;
+ atomic_inc(&udomain->refcnt);
+ return udomain;
+}
+EXPORT_SYMBOL(uiommu_fdget);
+
+void uiommu_put(struct uiommu_domain *udomain)
+{
+ if (atomic_dec_and_test(&udomain->refcnt)) {
+ iommu_domain_free(udomain->domain);
+ kfree(udomain);
+ }
+}
+EXPORT_SYMBOL(uiommu_put);
+
+static int __init uiommu_init(void)
+{
+ return misc_register(&uiommu_dev);
+}
+
+static void __exit uiommu_exit(void)
+{
+ misc_deregister(&uiommu_dev);
+}
+
+module_init(uiommu_init);
+module_exit(uiommu_exit);
diff -uprN linux-2.6.34/drivers/vfio/vfio_dma.c vfio-linux-2.6.34/drivers/vfio/vfio_dma.c
--- linux-2.6.34/drivers/vfio/vfio_dma.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_dma.c 2010-06-07 15:28:30.000000000 -0700
@@ -0,0 +1,324 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/iommu.h>
+#include <linux/uiommu.h>
+#include <linux/sched.h>
+#include <linux/vfio.h>
+
+/* Unmap DMA region */
+/* dgate must be held */
+static void vfio_dma_unmap(struct vfio_listener *listener,
+ struct dma_map_page *mlp)
+{
+ int i;
+ struct vfio_dev *vdev = listener->vdev;
+
+ list_del(&mlp->list);
+ for (i = 0; i < mlp->npage; i++)
+ (void) uiommu_unmap_range(vdev->udomain,
+ mlp->daddr + i*PAGE_SIZE, PAGE_SIZE);
+ for (i = 0; i < mlp->npage; i++) {
+ if (mlp->rdwr)
+ SetPageDirty(mlp->pages[i]);
+ put_page(mlp->pages[i]);
+ }
+ listener->mm->locked_vm -= mlp->npage;
+ vdev->locked_pages -= mlp->npage;
+ kfree(mlp->pages);
+ kfree(mlp);
+}
+
+/* Unmap ALL DMA regions */
+void vfio_dma_unmapall(struct vfio_listener *listener)
+{
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ vfio_dma_unmap(listener, mlp);
+ }
+ mutex_unlock(&listener->vdev->dgate);
+}
+
+int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map *dmp)
+{
+ unsigned long start, npage;
+ struct dma_map_page *mlp;
+ struct list_head *pos, *pos2;
+ int ret;
+
+ start = dmp->vaddr & ~PAGE_SIZE;
+ npage = dmp->size >> PAGE_SHIFT;
+
+ ret = -ENXIO;
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
+ continue;
+ ret = 0;
+ vfio_dma_unmap(listener, mlp);
+ break;
+ }
+ mutex_unlock(&listener->vdev->dgate);
+ return ret;
+}
+
+#ifdef CONFIG_MMU_NOTIFIER
+/* Handle MMU notifications - user process freed or realloced memory
+ * which may be in use in a DMA region. Clean up region if so.
+ */
+static void vfio_dma_handle_mmu_notify(struct mmu_notifier *mn,
+ unsigned long start, unsigned long end)
+{
+ struct vfio_listener *listener;
+ unsigned long myend;
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ listener = container_of(mn, struct vfio_listener, mmu_notifier);
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (mlp->vaddr >= end)
+ continue;
+ /*
+ * Ranges overlap if they're not disjoint; and they're
+ * disjoint if the end of one is before the start of
+ * the other one.
+ */
+ myend = mlp->vaddr + (mlp->npage << PAGE_SHIFT) - 1;
+ if (!(myend <= start || end <= mlp->vaddr)) {
+ printk(KERN_WARNING
+ "%s: demap start %lx end %lx va %lx pa %lx\n",
+ __func__, start, end,
+ mlp->vaddr, (long)mlp->daddr);
+ vfio_dma_unmap(listener, mlp);
+ }
+ }
+ mutex_unlock(&listener->vdev->dgate);
+}
+
+static void vfio_dma_inval_page(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long addr)
+{
+ vfio_dma_handle_mmu_notify(mn, addr, addr + PAGE_SIZE);
+}
+
+static void vfio_dma_inval_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long start, unsigned long end)
+{
+ vfio_dma_handle_mmu_notify(mn, start, end);
+}
+
+static const struct mmu_notifier_ops vfio_dma_mmu_notifier_ops = {
+ .invalidate_page = vfio_dma_inval_page,
+ .invalidate_range_start = vfio_dma_inval_range_start,
+};
+#endif /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Map usr buffer at specific IO virtual address
+ */
+static struct dma_map_page *vfio_dma_map_iova(
+ struct vfio_listener *listener,
+ unsigned long start_iova,
+ struct page **pages,
+ int npage,
+ int rdwr)
+{
+ struct vfio_dev *vdev = listener->vdev;
+ int ret;
+ int i;
+ phys_addr_t hpa;
+ struct dma_map_page *mlp;
+ unsigned long iova = start_iova;
+
+ if (vdev->udomain == NULL)
+ return ERR_PTR(-EINVAL);
+
+ for (i = 0; i < npage; i++) {
+ if (uiommu_iova_to_phys(vdev->udomain, iova + i*PAGE_SIZE))
+ return ERR_PTR(-EBUSY);
+ }
+
+ mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
+ if (mlp == NULL)
+ return ERR_PTR(-ENOMEM);
+ rdwr = rdwr ? IOMMU_READ|IOMMU_WRITE : IOMMU_READ;
+ if (vdev->cachec)
+ rdwr |= IOMMU_CACHE;
+ for (i = 0; i < npage; i++) {
+ hpa = page_to_phys(pages[i]);
+ ret = uiommu_map_range(vdev->udomain, iova,
+ hpa, PAGE_SIZE, rdwr);
+ if (ret) {
+ while (--i > 0) {
+ iova -= PAGE_SIZE;
+ (void) uiommu_unmap_range(vdev->udomain,
+ iova, PAGE_SIZE);
+ }
+ kfree(mlp);
+ return ERR_PTR(ret);
+ }
+ iova += PAGE_SIZE;
+ }
+
+ mlp->pages = pages;
+ mlp->daddr = start_iova;
+ mlp->npage = npage;
+ return mlp;
+}
+
+int vfio_dma_map_common(struct vfio_listener *listener,
+ unsigned int cmd, struct vfio_dma_map *dmp)
+{
+ int locked, lock_limit;
+ struct page **pages;
+ int npage;
+ struct dma_map_page *mlp;
+ int rdwr = (dmp->flags & VFIO_FLAG_WRITE) ? 1 : 0;
+ int ret = 0;
+
+ if (dmp->vaddr & (PAGE_SIZE-1))
+ return -EINVAL;
+ if (dmp->size & (PAGE_SIZE-1))
+ return -EINVAL;
+ if (dmp->size <= 0)
+ return -EINVAL;
+ npage = dmp->size >> PAGE_SHIFT;
+ if (npage <= 0)
+ return -EINVAL;
+
+ mutex_lock(&listener->vdev->dgate);
+
+ /* account for locked pages */
+ locked = npage + current->mm->locked_vm;
+ lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
+ >> PAGE_SHIFT;
+ if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ printk(KERN_WARNING "%s: RLIMIT_MEMLOCK exceeded\n",
+ __func__);
+ ret = -ENOMEM;
+ goto out_lock;
+ }
+ /* only 1 address space per fd */
+ if (current->mm != listener->mm) {
+ if (listener->mm != NULL) {
+ ret = -EINVAL;
+ goto out_lock;
+ }
+ listener->mm = current->mm;
+#ifdef CONFIG_MMU_NOTIFIER
+ listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
+ ret = mmu_notifier_register(&listener->mmu_notifier,
+ listener->mm);
+ if (ret)
+ printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
+ __func__, ret);
+ ret = 0;
+#endif
+ }
+
+ pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
+ if (pages == NULL) {
+ ret = ENOMEM;
+ goto out_lock;
+ }
+ ret = get_user_pages_fast(dmp->vaddr, npage, rdwr, pages);
+ if (ret != npage) {
+ printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
+ __func__, ret, npage);
+ kfree(pages);
+ ret = -EFAULT;
+ goto out_lock;
+ }
+ ret = 0;
+
+ mlp = vfio_dma_map_iova(listener, dmp->dmaaddr,
+ pages, npage, rdwr);
+ if (IS_ERR(mlp)) {
+ ret = PTR_ERR(mlp);
+ kfree(pages);
+ goto out_lock;
+ }
+ mlp->vaddr = dmp->vaddr;
+ mlp->rdwr = rdwr;
+ dmp->dmaaddr = mlp->daddr;
+ list_add(&mlp->list, &listener->dm_list);
+
+ current->mm->locked_vm += npage;
+ listener->vdev->locked_pages += npage;
+out_lock:
+ mutex_unlock(&listener->vdev->dgate);
+ return ret;
+}
+
+void vfio_domain_unset(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+
+ if (vdev->udomain == NULL)
+ return;
+ uiommu_detach_device(vdev->udomain, &pdev->dev);
+ uiommu_put(vdev->udomain);
+ vdev->udomain = NULL;
+}
+
+int vfio_domain_set(struct vfio_dev *vdev, int fd)
+{
+ struct uiommu_domain *udomain;
+ struct pci_dev *pdev = vdev->pdev;
+ int ret;
+
+ if (vdev->udomain)
+ return -EBUSY;
+ udomain = uiommu_fdget(fd);
+ if (IS_ERR(udomain))
+ return PTR_ERR(udomain);
+ vfio_domain_unset(vdev);
+ ret = uiommu_attach_device(udomain, &pdev->dev);
+ if (ret) {
+ printk(KERN_ERR "%s: attach_device failed %d\n",
+ __func__, ret);
+ uiommu_put(udomain);
+ return ret;
+ }
+ vdev->cachec = iommu_domain_has_cap(udomain->domain,
+ IOMMU_CAP_CACHE_COHERENCY);
+ vdev->udomain = udomain;
+ return 0;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_intrs.c vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c
--- linux-2.6.34/drivers/vfio/vfio_intrs.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c 2010-06-01 15:08:18.000000000 -0700
@@ -0,0 +1,191 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+
+/*
+ * vfio_interrupt - IRQ hardware interrupt handler
+ */
+irqreturn_t vfio_interrupt(int irq, void *dev_id)
+{
+ struct vfio_dev *vdev = (struct vfio_dev *)dev_id;
+ struct pci_dev *pdev = vdev->pdev;
+ irqreturn_t ret = IRQ_NONE;
+ u32 cmd_status_dword;
+ u16 origcmd, newcmd, status;
+
+ spin_lock_irq(&vdev->irqlock);
+ pci_block_user_cfg_access(pdev);
+
+ /* Read both command and status registers in a single 32-bit operation.
+ * Note: we could cache the value for command and move the status read
+ * out of the lock if there was a way to get notified of user changes
+ * to command register through sysfs. Should be good for shared irqs. */
+ pci_read_config_dword(pdev, PCI_COMMAND, &cmd_status_dword);
+ origcmd = cmd_status_dword;
+ status = cmd_status_dword >> 16;
+
+ /* Check interrupt status register to see whether our device
+ * triggered the interrupt. */
+ if (!(status & PCI_STATUS_INTERRUPT))
+ goto done;
+
+ /* We triggered the interrupt, disable it. */
+ newcmd = origcmd | PCI_COMMAND_INTX_DISABLE;
+ if (newcmd != origcmd)
+ pci_write_config_word(pdev, PCI_COMMAND, newcmd);
+
+ ret = IRQ_HANDLED;
+done:
+ pci_unblock_user_cfg_access(pdev);
+ spin_unlock_irq(&vdev->irqlock);
+ if (ret != IRQ_HANDLED)
+ return ret;
+ if (vdev->ev_irq)
+ eventfd_signal(vdev->ev_irq, 1);
+ return ret;
+}
+
+/*
+ * MSI and MSI-X Interrupt handler.
+ * Just signal an event
+ */
+static irqreturn_t msihandler(int irq, void *arg)
+{
+ struct eventfd_ctx *ctx = arg;
+
+ eventfd_signal(ctx, 1);
+ return IRQ_HANDLED;
+}
+
+void vfio_disable_msi(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+
+ if (vdev->ev_msi) {
+ eventfd_ctx_put(vdev->ev_msi);
+ free_irq(pdev->irq, vdev->ev_msi);
+ vdev->ev_msi = NULL;
+ }
+ pci_disable_msi(pdev);
+}
+
+int vfio_enable_msi(struct vfio_dev *vdev, int fd)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct eventfd_ctx *ctx;
+ int ret;
+
+ ctx = eventfd_ctx_fdget(fd);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+ vdev->ev_msi = ctx;
+ pci_enable_msi(pdev);
+ ret = request_irq(pdev->irq, msihandler, 0,
+ vdev->name, ctx);
+ if (ret) {
+ eventfd_ctx_put(ctx);
+ pci_disable_msi(pdev);
+ vdev->ev_msi = NULL;
+ }
+ return ret;
+}
+
+void vfio_disable_msix(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int i;
+
+ if (vdev->ev_msix && vdev->msix) {
+ for (i = 0; i < vdev->nvec; i++) {
+ free_irq(vdev->msix[i].vector, vdev->ev_msix[i]);
+ if (vdev->ev_msix[i])
+ eventfd_ctx_put(vdev->ev_msix[i]);
+ }
+ }
+ kfree(vdev->ev_msix);
+ vdev->ev_msix = NULL;
+ kfree(vdev->msix);
+ vdev->msix = NULL;
+ vdev->nvec = 0;
+ pci_disable_msix(pdev);
+}
+
+int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct eventfd_ctx *ctx;
+ int ret = 0;
+ int i;
+ int fd;
+
+ vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
+ GFP_KERNEL);
+ if (vdev->msix == NULL)
+ return -ENOMEM;
+ vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
+ GFP_KERNEL);
+ if (vdev->ev_msix == NULL) {
+ kfree(vdev->msix);
+ return -ENOMEM;
+ }
+ for (i = 0; i < nvec; i++) {
+ if (copy_from_user(&fd, uarg, sizeof fd)) {
+ ret = -EFAULT;
+ break;
+ }
+ uarg += sizeof fd;
+ ctx = eventfd_ctx_fdget(fd);
+ if (IS_ERR(ctx)) {
+ ret = PTR_ERR(ctx);
+ break;
+ }
+ vdev->msix[i].entry = i;
+ vdev->ev_msix[i] = ctx;
+ }
+ if (!ret)
+ ret = pci_enable_msix(pdev, vdev->msix, nvec);
+ vdev->nvec = 0;
+ for (i = 0; i < nvec && !ret; i++) {
+ ret = request_irq(vdev->msix[i].vector, msihandler, 0,
+ vdev->name, vdev->ev_msix[i]);
+ if (ret)
+ break;
+ vdev->nvec = i+1;
+ }
+ if (ret)
+ vfio_disable_msix(vdev);
+ return ret;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_main.c vfio-linux-2.6.34/drivers/vfio/vfio_main.c
--- linux-2.6.34/drivers/vfio/vfio_main.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_main.c 2010-06-07 12:39:17.000000000 -0700
@@ -0,0 +1,624 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mm.h>
+#include <linux/idr.h>
+#include <linux/string.h>
+#include <linux/interrupt.h>
+#include <linux/fs.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+
+#include <linux/vfio.h>
+
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "Tom Lyon <pu...@cisco.com>"
+#define DRIVER_DESC "VFIO - User Level PCI meta-driver"
+
+static int vfio_major = -1;
+DEFINE_IDR(vfio_idr);
+/* Protect idr accesses */
+DEFINE_MUTEX(vfio_minor_lock);
+
+/*
+ * Does [a1,b1) overlap [a2,b2) ?
+ */
+static inline int overlap(int a1, int b1, int a2, int b2)
+{
+ /*
+ * Ranges overlap if they're not disjoint; and they're
+ * disjoint if the end of one is before the start of
+ * the other one.
+ */
+ return !(b2 <= a1 || b1 <= a2);
+}
+
+static int vfio_open(struct inode *inode, struct file *filep)
+{
+ struct vfio_dev *vdev;
+ struct vfio_listener *listener;
+ int ret = 0;
+
+ mutex_lock(&vfio_minor_lock);
+ vdev = idr_find(&vfio_idr, iminor(inode));
+ mutex_unlock(&vfio_minor_lock);
+ if (!vdev) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ listener = kzalloc(sizeof(*listener), GFP_KERNEL);
+ if (!listener) {
+ ret = -ENOMEM;
+ goto err_alloc_listener;
+ }
+
+ listener->vdev = vdev;
+ INIT_LIST_HEAD(&listener->dm_list);
+ filep->private_data = listener;
+
+ mutex_lock(&vdev->lgate);
+ if (vdev->listeners == 0) { /* first open */
+ /* reset to known state if we can */
+ (void) pci_reset_function(vdev->pdev);
+ }
+ vdev->listeners++;
+ mutex_unlock(&vdev->lgate);
+ return 0;
+
+err_alloc_listener:
+out:
+ return ret;
+}
+
+static int vfio_release(struct inode *inode, struct file *filep)
+{
+ int ret = 0;
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+
+ vfio_dma_unmapall(listener);
+ if (listener->mm) {
+#ifdef CONFIG_MMU_NOTIFIER
+ mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
+#endif
+ listener->mm = NULL;
+ }
+
+ mutex_lock(&vdev->lgate);
+ if (--vdev->listeners <= 0) {
+ /* we don't need to hold igate here since there are
+ * no more listeners doing ioctls
+ */
+ if (vdev->ev_msix)
+ vfio_disable_msix(vdev);
+ if (vdev->ev_msi)
+ vfio_disable_msi(vdev);
+ if (vdev->ev_irq) {
+ eventfd_ctx_put(vdev->ev_msi);
+ vdev->ev_irq = NULL;
+ }
+ vfio_domain_unset(vdev);
+ /* reset to known state if we can */
+ (void) pci_reset_function(vdev->pdev);
+ }
+ mutex_unlock(&vdev->lgate);
+
+ kfree(listener);
+ return ret;
+}
+
+static ssize_t vfio_read(struct file *filep, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ int pci_space;
+
+ /* no reads or writes until IOMMU domain set */
+ if (vdev->udomain == NULL)
+ return -EINVAL;
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+ return vfio_config_readwrite(0, vdev, buf, count, ppos);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+ return vfio_io_readwrite(0, vdev, buf, count, ppos);
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM)
+ return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+ if (pci_space == PCI_ROM_RESOURCE)
+ return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+ return -EINVAL;
+}
+
+static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u16 pos;
+ u32 table_offset;
+ u16 table_size;
+ u8 bir;
+ u32 lo, hi, startp, endp;
+
+ pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+ if (!pos)
+ return 0;
+
+ pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
+ table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
+ pci_read_config_dword(pdev, pos + 4, &table_offset);
+ bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
+ lo = table_offset >> PAGE_SHIFT;
+ hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
+ >> PAGE_SHIFT;
+ startp = start >> PAGE_SHIFT;
+ endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (bir == vfio_offset_to_pci_space(start) &&
+ overlap(lo, hi, startp, endp)) {
+ printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
+ __func__);
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static ssize_t vfio_write(struct file *filep, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ int pci_space;
+ int ret;
+
+ /* no reads or writes until IOMMU domain set */
+ if (vdev->udomain == NULL)
+ return -EINVAL;
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+ return vfio_config_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+ return vfio_io_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
+ /* don't allow writes to msi-x vectors */
+ ret = vfio_msix_check(vdev, *ppos, count);
+ if (ret)
+ return ret;
+ return vfio_mem_readwrite(1, vdev,
+ (char __user *)buf, count, ppos);
+ }
+ return -EINVAL;
+}
+
+static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ struct pci_dev *pdev = vdev->pdev;
+ unsigned long requested, actual;
+ int pci_space;
+ u64 start;
+ u32 len;
+ unsigned long phys;
+ int ret;
+
+ /* no reads or writes until IOMMU domain set */
+ if (vdev->udomain == NULL)
+ return -EINVAL;
+
+ if (vma->vm_end < vma->vm_start)
+ return -EINVAL;
+ if ((vma->vm_flags & VM_SHARED) == 0)
+ return -EINVAL;
+
+
+ pci_space = vfio_offset_to_pci_space((u64)vma->vm_pgoff << PAGE_SHIFT);
+ if (pci_space > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ switch (pci_space) {
+ case PCI_ROM_RESOURCE:
+ if (vma->vm_flags & VM_WRITE)
+ return -EINVAL;
+ if (pci_resource_flags(pdev, PCI_ROM_RESOURCE) == 0)
+ return -EINVAL;
+ actual = pci_resource_len(pdev, PCI_ROM_RESOURCE) >> PAGE_SHIFT;
+ break;
+ default:
+ if ((pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) == 0)
+ return -EINVAL;
+ actual = pci_resource_len(pdev, pci_space) >> PAGE_SHIFT;
+ break;
+ }
+
+ requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ if (requested > actual || actual == 0)
+ return -EINVAL;
+
+ /*
+ * Can't allow non-priv users to mmap MSI-X vectors
+ * else they can write anywhere in phys memory
+ */
+ start = vma->vm_pgoff << PAGE_SHIFT;
+ len = vma->vm_end - vma->vm_start;
+ if (vma->vm_flags & VM_WRITE) {
+ ret = vfio_msix_check(vdev, start, len);
+ if (ret)
+ return ret;
+ }
+
+ vma->vm_private_data = vdev;
+ vma->vm_flags |= VM_IO | VM_RESERVED;
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
+
+ return remap_pfn_range(vma, vma->vm_start, phys,
+ vma->vm_end - vma->vm_start,
+ vma->vm_page_prot);
+}
+
+static long vfio_unl_ioctl(struct file *filep,
+ unsigned int cmd,
+ unsigned long arg)
+{
+ struct vfio_listener *listener = filep->private_data;
+ struct vfio_dev *vdev = listener->vdev;
+ void __user *uarg = (void __user *)arg;
+ struct pci_dev *pdev = vdev->pdev;
+ struct vfio_dma_map dm;
+ int ret = 0;
+ u64 mask;
+ int fd, nfd;
+ int bar;
+
+ if (vdev == NULL)
+ return -EINVAL;
+
+ switch (cmd) {
+
+ case VFIO_DMA_MAP_IOVA:
+ if (copy_from_user(&dm, uarg, sizeof dm))
+ return -EFAULT;
+ ret = vfio_dma_map_common(listener, cmd, &dm);
+ if (!ret && copy_to_user(uarg, &dm, sizeof dm))
+ ret = -EFAULT;
+ break;
+
+ case VFIO_DMA_UNMAP:
+ if (copy_from_user(&dm, uarg, sizeof dm))
+ return -EFAULT;
+ ret = vfio_dma_unmap_dm(listener, &dm);
+ break;
+
+ case VFIO_DMA_MASK: /* set master mode and DMA mask */
+ if (copy_from_user(&mask, uarg, sizeof mask))
+ return -EFAULT;
+ pci_set_master(pdev);
+ ret = pci_set_dma_mask(pdev, mask);
+ break;
+
+ case VFIO_EVENTFD_IRQ:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ mutex_lock(&vdev->igate);
+ if (vdev->ev_irq)
+ eventfd_ctx_put(vdev->ev_irq);
+ if (fd >= 0) {
+ vdev->ev_irq = eventfd_ctx_fdget(fd);
+ if (vdev->ev_irq == NULL)
+ ret = -EINVAL;
+ }
+ mutex_unlock(&vdev->igate);
+ break;
+
+ case VFIO_EVENTFD_MSI:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ mutex_lock(&vdev->igate);
+ if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+ ret = vfio_enable_msi(vdev, fd);
+ else if (fd < 0 && vdev->ev_msi)
+ vfio_disable_msi(vdev);
+ else
+ ret = -EINVAL;
+ mutex_unlock(&vdev->igate);
+ break;
+
+ case VFIO_EVENTFDS_MSIX:
+ if (copy_from_user(&nfd, uarg, sizeof nfd))
+ return -EFAULT;
+ uarg += sizeof nfd;
+ mutex_lock(&vdev->igate);
+ if (nfd > 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+ ret = vfio_enable_msix(vdev, nfd, uarg);
+ else if (nfd == 0 && vdev->ev_msix)
+ vfio_disable_msix(vdev);
+ else
+ ret = -EINVAL;
+ mutex_unlock(&vdev->igate);
+ break;
+
+ case VFIO_BAR_LEN:
+ if (copy_from_user(&bar, uarg, sizeof bar))
+ return -EFAULT;
+ if (bar < 0 || bar > PCI_ROM_RESOURCE)
+ return -EINVAL;
+ bar = pci_resource_len(pdev, bar);
+ if (copy_to_user(uarg, &bar, sizeof bar))
+ return -EFAULT;
+ break;
+
+ case VFIO_DOMAIN_SET:
+ if (copy_from_user(&fd, uarg, sizeof fd))
+ return -EFAULT;
+ ret = vfio_domain_set(vdev, fd);
+ break;
+
+ case VFIO_DOMAIN_UNSET:
+ vfio_domain_unset(vdev);
+ ret = 0;
+ break;
+
+ default:
+ return -EINVAL;
+ }
+ return ret;
+}
+
+static const struct file_operations vfio_fops = {
+ .owner = THIS_MODULE,
+ .open = vfio_open,
+ .release = vfio_release,
+ .read = vfio_read,
+ .write = vfio_write,
+ .unlocked_ioctl = vfio_unl_ioctl,
+ .mmap = vfio_mmap,
+};
+
+static int vfio_get_devnum(struct vfio_dev *vdev)
+{
+ int retval = -ENOMEM;
+ int id;
+
+ mutex_lock(&vfio_minor_lock);
+ if (idr_pre_get(&vfio_idr, GFP_KERNEL) == 0)
+ goto exit;
+
+ retval = idr_get_new(&vfio_idr, vdev, &id);
+ if (retval < 0) {
+ if (retval == -EAGAIN)
+ retval = -ENOMEM;
+ goto exit;
+ }
+ if (id > MINORMASK) {
+ idr_remove(&vfio_idr, id);
+ retval = -ENOMEM;
+ }
+ if (vfio_major < 0) {
+ retval = register_chrdev(0, "vfio", &vfio_fops);
+ if (retval < 0)
+ goto exit;
+ vfio_major = retval;
+ }
+
+ retval = MKDEV(vfio_major, id);
+exit:
+ mutex_unlock(&vfio_minor_lock);
+ return retval;
+}
+
+static void vfio_free_minor(struct vfio_dev *vdev)
+{
+ mutex_lock(&vfio_minor_lock);
+ idr_remove(&vfio_idr, MINOR(vdev->devnum));
+ mutex_unlock(&vfio_minor_lock);
+}
+
+/*
+ * Verify that the device supports Interrupt Disable bit in command register,
+ * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
+ * in PCI 2.2. (from uio_pci_generic)
+ */
+static int verify_pci_2_3(struct pci_dev *pdev)
+{
+ u16 orig, new;
+ int err = 0;
+ u8 pin;
+
+ pci_block_user_cfg_access(pdev);
+
+ pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
+ if (pin == 0) /* irqs not needed */
+ goto out;
+
+ pci_read_config_word(pdev, PCI_COMMAND, &orig);
+ pci_write_config_word(pdev, PCI_COMMAND,
+ orig ^ PCI_COMMAND_INTX_DISABLE);
+ pci_read_config_word(pdev, PCI_COMMAND, &new);
+ /* There's no way to protect against
+ * hardware bugs or detect them reliably, but as long as we know
+ * what the value should be, let's go ahead and check it. */
+ if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
+ err = -EBUSY;
+ dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
+ "driver or HW bug?\n", orig, new);
+ goto out;
+ }
+ if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
+ dev_warn(&pdev->dev, "Device does not support "
+ "disabling interrupts: unable to bind.\n");
+ err = -ENODEV;
+ goto out;
+ }
+ /* Now restore the original value. */
+ pci_write_config_word(pdev, PCI_COMMAND, orig);
+out:
+ pci_unblock_user_cfg_access(pdev);
+ return err;
+}
+
+static int vfio_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ struct vfio_dev *vdev;
+ int err;
+
+ if (!iommu_found())
+ return -EINVAL;
+
+ err = pci_enable_device(pdev);
+ if (err) {
+ dev_err(&pdev->dev, "%s: pci_enable_device failed: %d\n",
+ __func__, err);
+ return err;
+ }
+
+ err = verify_pci_2_3(pdev);
+ if (err)
+ goto err_verify;
+
+ vdev = kzalloc(sizeof(struct vfio_dev), GFP_KERNEL);
+ if (!vdev) {
+ err = -ENOMEM;
+ goto err_alloc;
+ }
+ vdev->pdev = pdev;
+
+ err = vfio_class_init();
+ if (err)
+ goto err_class;
+
+ mutex_init(&vdev->lgate);
+ mutex_init(&vdev->dgate);
+ mutex_init(&vdev->igate);
+
+ err = vfio_get_devnum(vdev);
+ if (err < 0)
+ goto err_get_devnum;
+ vdev->devnum = err;
+ err = 0;
+
+ sprintf(vdev->name, "vfio%d", MINOR(vdev->devnum));
+ pci_set_drvdata(pdev, vdev);
+ vdev->dev = device_create(vfio_class->class, &pdev->dev,
+ vdev->devnum, vdev, vdev->name);
+ if (IS_ERR(vdev->dev)) {
+ printk(KERN_ERR "VFIO: device register failed\n");
+ err = PTR_ERR(vdev->dev);
+ goto err_device_create;
+ }
+
+ err = vfio_dev_add_attributes(vdev);
+ if (err)
+ goto err_vfio_dev_add_attributes;
+
+
+ if (pdev->irq > 0) {
+ err = request_irq(pdev->irq, vfio_interrupt,
+ IRQF_SHARED, "vfio", vdev);
+ if (err)
+ goto err_request_irq;
+ }
+ vdev->vinfo.bardirty = 1;
+
+ return 0;
+
+err_request_irq:
+#ifdef notdef
+ vfio_dev_del_attributes(vdev);
+#endif
+err_vfio_dev_add_attributes:
+ device_destroy(vfio_class->class, vdev->devnum);
+err_device_create:
+ vfio_free_minor(vdev);
+err_get_devnum:
+err_class:
+ kfree(vdev);
+err_alloc:
+err_verify:
+ pci_disable_device(pdev);
+ return err;
+}
+
+static void vfio_remove(struct pci_dev *pdev)
+{
+ struct vfio_dev *vdev = pci_get_drvdata(pdev);
+
+ vfio_free_minor(vdev);
+
+ if (pdev->irq > 0)
+ free_irq(pdev->irq, vdev);
+
+#ifdef notdef
+ vfio_dev_del_attributes(vdev);
+#endif
+
+ pci_set_drvdata(pdev, NULL);
+ device_destroy(vfio_class->class, vdev->devnum);
+ kfree(vdev);
+ vfio_class_destroy();
+ pci_disable_device(pdev);
+}
+
+static struct pci_driver driver = {
+ .name = "vfio",
+ .id_table = NULL, /* only dynamic id's */
+ .probe = vfio_probe,
+ .remove = vfio_remove,
+};
+
+static int __init init(void)
+{
+ pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+ return pci_register_driver(&driver);
+}
+
+static void __exit cleanup(void)
+{
+ if (vfio_major >= 0)
+ unregister_chrdev(vfio_major, "vfio");
+ pci_unregister_driver(&driver);
+}
+
+module_init(init);
+module_exit(cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff -uprN linux-2.6.34/drivers/vfio/vfio_pci_config.c vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c
--- linux-2.6.34/drivers/vfio/vfio_pci_config.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c 2010-05-28 14:26:47.000000000 -0700
@@ -0,0 +1,554 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#define PCI_CAP_ID_BASIC 0
+#ifndef PCI_CAP_ID_MAX
+#define PCI_CAP_ID_MAX PCI_CAP_ID_AF
+#endif
+
+/*
+ * Lengths of PCI Config Capabilities
+ * 0 means unknown (but at least 4)
+ * FF means special/variable
+ */
+static u8 pci_capability_length[] = {
+ [PCI_CAP_ID_BASIC] = 64, /* pci config header */
+ [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
+ [PCI_CAP_ID_AGP] = PCI_AGP_SIZEOF,
+ [PCI_CAP_ID_VPD] = 8,
+ [PCI_CAP_ID_SLOTID] = 4,
+ [PCI_CAP_ID_MSI] = 0xFF, /* 10, 14, or 24 */
+ [PCI_CAP_ID_CHSWP] = 4,
+ [PCI_CAP_ID_PCIX] = 0xFF, /* 8 or 24 */
+ [PCI_CAP_ID_HT] = 28,
+ [PCI_CAP_ID_VNDR] = 0xFF,
+ [PCI_CAP_ID_DBG] = 0,
+ [PCI_CAP_ID_CCRC] = 0,
+ [PCI_CAP_ID_SHPC] = 0,
+ [PCI_CAP_ID_SSVID] = 0, /* bridge only - not supp */
+ [PCI_CAP_ID_AGP3] = 0,
+ [PCI_CAP_ID_EXP] = 36,
+ [PCI_CAP_ID_MSIX] = 12,
+ [PCI_CAP_ID_AF] = 6,
+};
+
+/*
+ * Read/Write Permission Bits - one bit for each bit in capability
+ * Any field can be read if it exists,
+ * but what is read depends on whether the field
+ * is 'virtualized', or just pass thru to the hardware.
+ * Any virtualized field is also virtualized for writes.
+ * Writes are only permitted if they have a 1 bit here.
+ */
+struct perm_bits {
+ u32 rvirt; /* read bits which must be virtualized */
+ u32 write; /* writeable bits - virt if read virt */
+};
+
+static struct perm_bits pci_cap_basic_perm[] = {
+ { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
+ { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
+ { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
+ { 0, 0, }, /* 0x24 cardbus - not yet */
+ { 0, 0, }, /* 0x28 subsys vendor & dev */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
+ { 0, 0, }, /* 0x30 capability ptr & resv */
+ { 0, 0, }, /* 0x34 resv */
+ { 0, 0, }, /* 0x38 resv */
+ { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
+};
+
+static struct perm_bits pci_cap_pm_perm[] = {
+ { 0, 0, }, /* 0x00 PM capabilities */
+ { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
+};
+
+static struct perm_bits pci_cap_vpd_perm[] = {
+ { 0, 0xFFFF0000, }, /* 0x00 address */
+ { 0, 0xFFFFFFFF, }, /* 0x04 data */
+};
+
+static struct perm_bits pci_cap_slotid_perm[] = {
+ { 0, 0, }, /* 0x00 all read only */
+};
+
+static struct perm_bits pci_cap_msi_perm[] = {
+ { 0, 0, }, /* 0x00 MSI message control */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x04 MSI message address */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x08 MSI message addr/data */
+ { 0x0000FFFF, 0x0000FFFF, }, /* 0x0c MSI message data */
+ { 0, 0xFFFFFFFF, }, /* 0x10 MSI mask bits */
+ { 0, 0xFFFFFFFF, }, /* 0x14 MSI pending bits */
+};
+
+static struct perm_bits pci_cap_pcix_perm[] = {
+ { 0, 0xFFFF0000, }, /* 0x00 PCI_X_CMD */
+ { 0, 0, }, /* 0x04 PCI_X_STATUS */
+ { 0, 0xFFFFFFFF, }, /* 0x08 ECC ctlr & status */
+ { 0, 0, }, /* 0x0c ECC first addr */
+ { 0, 0, }, /* 0x10 ECC second addr */
+ { 0, 0, }, /* 0x14 ECC attr */
+};
+
+/* pci express capabilities */
+static struct perm_bits pci_cap_exp_perm[] = {
+ { 0, 0, }, /* 0x00 PCIe capabilities */
+ { 0, 0, }, /* 0x04 PCIe device capabilities */
+ { 0, 0xFFFFFFFF, }, /* 0x08 PCIe device control & status */
+ { 0, 0, }, /* 0x0c PCIe link capabilities */
+ { 0, 0x000000FF, }, /* 0x10 PCIe link ctl/stat - SAFE? */
+ { 0, 0, }, /* 0x14 PCIe slot capabilities */
+ { 0, 0x00FFFFFF, }, /* 0x18 PCIe link ctl/stat - SAFE? */
+ { 0, 0, }, /* 0x1c PCIe root port stuff */
+ { 0, 0, }, /* 0x20 PCIe root port stuff */
+};
+
+static struct perm_bits pci_cap_msix_perm[] = {
+ { 0, 0, }, /* 0x00 MSI-X Enable */
+ { 0, 0, }, /* 0x04 table offset & bir */
+ { 0, 0, }, /* 0x08 pba offset & bir */
+};
+
+static struct perm_bits pci_cap_af_perm[] = {
+ { 0, 0, }, /* 0x00 af capability */
+ { 0, 0x0001, }, /* 0x04 af flr bit */
+};
+
+static struct perm_bits *pci_cap_perms[] = {
+ [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
+ [PCI_CAP_ID_PM] = pci_cap_pm_perm,
+ [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
+ [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
+ [PCI_CAP_ID_MSI] = pci_cap_msi_perm,
+ [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
+ [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
+ [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
+ [PCI_CAP_ID_AF] = pci_cap_af_perm,
+};
+
+/*
+ * We build a map of the config space that tells us where
+ * and what capabilities exist, so that we can map reads and
+ * writes back to capabilities, and thus figure out what to
+ * allow, deny, or virtualize
+ */
+int vfio_build_config_map(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map;
+ int i, len;
+ u8 pos, cap, tmp;
+ u16 flags;
+ int ret;
+ int loops = 100;
+
+ map = kmalloc(pdev->cfg_size, GFP_KERNEL);
+ if (map == NULL)
+ return -ENOMEM;
+ for (i = 0; i < pdev->cfg_size; i++)
+ map[i] = 0xFF;
+ vdev->pci_config_map = map;
+
+ /* default config space */
+ for (i = 0; i < pci_capability_length[0]; i++)
+ map[i] = 0;
+
+ /* any capabilities? */
+ ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
+ if (ret < 0)
+ return ret;
+ if ((flags & PCI_STATUS_CAP_LIST) == 0)
+ return 0;
+
+ ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
+ if (ret < 0)
+ return ret;
+ while (pos && --loops > 0) {
+ ret = pci_read_config_byte(pdev, pos, &cap);
+ if (ret < 0)
+ return ret;
+ if (cap == 0) {
+ printk(KERN_WARNING "%s: cap 0\n", __func__);
+ break;
+ }
+ if (cap > PCI_CAP_ID_MAX) {
+ printk(KERN_WARNING "%s: unknown pci capability id %x\n",
+ __func__, cap);
+ len = 0;
+ } else
+ len = pci_capability_length[cap];
+ if (len == 0) {
+ printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
+ __func__, cap);
+ len = 4;
+ }
+ if (len == 0xFF) {
+ switch (cap) {
+ case PCI_CAP_ID_MSI:
+ ret = pci_read_config_word(pdev,
+ pos + PCI_MSI_FLAGS, &flags);
+ if (ret < 0)
+ return ret;
+ if (flags & PCI_MSI_FLAGS_MASKBIT)
+ /* per vec masking */
+ len = 24;
+ else if (flags & PCI_MSI_FLAGS_64BIT)
+ /* 64 bit */
+ len = 14;
+ else
+ len = 10;
+ break;
+ case PCI_CAP_ID_PCIX:
+ ret = pci_read_config_word(pdev, pos + 2,
+ &flags);
+ if (ret < 0)
+ return ret;
+ if (flags & 0x3000)
+ len = 24;
+ else
+ len = 8;
+ break;
+ case PCI_CAP_ID_VNDR:
+ /* length follows next field */
+ ret = pci_read_config_byte(pdev, pos + 2, &tmp);
+ if (ret < 0)
+ return ret;
+ len = tmp;
+ break;
+ default:
+ len = 0;
+ break;
+ }
+ }
+
+ for (i = 0; i < len; i++) {
+ if (map[pos+i] != 0xFF)
+ printk(KERN_WARNING
+ "%s: pci config conflict at %x, "
+ "caps %x %x\n",
+ __func__, i, map[pos+i], cap);
+ map[pos+i] = cap;
+ }
+ ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
+ if (ret < 0)
+ return ret;
+ }
+ if (loops <= 0)
+ printk(KERN_ERR "%s: config space loop!\n", __func__);
+ return 0;
+}
+
+static void vfio_virt_init(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int bar;
+ u32 *lp;
+ u32 val;
+ u8 *map, pos;
+ u16 flags;
+ int i, len;
+ int ret;
+
+ for (bar = 0; bar <= 5; bar++) {
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4*bar, &val);
+ *lp++ = val;
+ }
+ lp = (u32 *)vdev->vinfo.rombar;
+ pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &val);
+ *lp = val;
+
+ vdev->vinfo.intr = pdev->irq;
+
+ pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
+ map = vdev->pci_config_map + pos;
+ if (pos > 0) {
+ ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
+ if (ret < 0)
+ return;
+ if (flags & PCI_MSI_FLAGS_MASKBIT) /* per vec masking */
+ len = 24;
+ else if (flags & PCI_MSI_FLAGS_64BIT) /* 64 bit */
+ len = 14;
+ else
+ len = 10;
+ for (i = 0; i < len; i++)
+ (void) pci_read_config_byte(pdev, pos + i,
+ &vdev->vinfo.msi[i]);
+ }
+}
+
+static void vfio_bar_fixup(struct vfio_dev *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int bar;
+ u32 *lp;
+ u32 len;
+
+ for (bar = 0; bar <= 5; bar++) {
+ len = pci_resource_len(pdev, bar);
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ if (len == 0) {
+ *lp = 0;
+ } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
+ *lp &= ~0x1;
+ *lp = (*lp & ~(len-1)) |
+ (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
+ if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
+ bar++;
+ } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
+ *lp |= PCI_BASE_ADDRESS_SPACE_IO;
+ *lp = (*lp & ~(len-1)) |
+ (*lp & ~PCI_BASE_ADDRESS_IO_MASK);
+ }
+ }
+ lp = (u32 *)vdev->vinfo.rombar;
+ len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
+ *lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
+ vdev->vinfo.bardirty = 0;
+}
+
+static int vfio_config_rwbyte(int write,
+ struct vfio_dev *vdev,
+ int pos,
+ char __user *buf)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ u8 *map = vdev->pci_config_map;
+ u8 cap, val, newval;
+ u16 start, off;
+ int p;
+ struct perm_bits *perm;
+ u8 wr, virt;
+ int ret;
+
+ cap = map[pos];
+ if (cap == 0xFF) { /* unknown region */
+ if (write)
+ return 0; /* silent no-op */
+ val = 0;
+ if (pos <= pci_capability_length[0]) /* ok to read */
+ (void) pci_read_config_byte(pdev, pos, &val);
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ return 0;
+ }
+
+ /* scan back to start of cap region */
+ for (p = pos; p >= 0; p--) {
+ if (map[p] != cap)
+ break;
+ start = p;
+ }
+ off = pos - start; /* offset within capability */
+
+ perm = pci_cap_perms[cap];
+ if (perm == NULL) {
+ wr = 0;
+ virt = 0;
+ } else {
+ perm += (off >> 2);
+ wr = perm->write >> ((off & 3) * 8);
+ virt = perm->rvirt >> ((off & 3) * 8);
+ }
+ if (write && !wr) /* no writeable bits */
+ return 0;
+ if (!virt) {
+ if (write) {
+ if (copy_from_user(&val, buf, 1))
+ return -EFAULT;
+ val &= wr;
+ if (wr != 0xFF) {
+ u8 existing;
+
+ ret = pci_read_config_byte(pdev, pos,
+ &existing);
+ if (ret < 0)
+ return ret;
+ val |= (existing & ~wr);
+ }
+ pci_write_config_byte(pdev, pos, val);
+ } else {
+ ret = pci_read_config_byte(pdev, pos, &val);
+ if (ret < 0)
+ return ret;
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ }
+ return 0;
+ }
+
+ if (write) {
+ if (copy_from_user(&newval, buf, 1))
+ return -EFAULT;
+ }
+ /*
+ * We get here if there are some virt bits
+ * handle remaining real bits, if any
+ */
+ if (~virt) {
+ u8 rbits = (~virt) & wr;
+
+ ret = pci_read_config_byte(pdev, pos, &val);
+ if (ret < 0)
+ return ret;
+ if (write && rbits) {
+ val &= ~rbits;
+ newval &= rbits;
+ val |= newval;
+ pci_write_config_byte(pdev, pos, val);
+ }
+ }
+ /*
+ * Now handle entirely virtual fields
+ */
+ switch (cap) {
+ case PCI_CAP_ID_BASIC: /* virtualize BARs */
+ switch (off) {
+ /*
+ * vendor and device are virt because they don't
+ * show up otherwise for sr-iov vfs
+ */
+ case PCI_VENDOR_ID:
+ val = pdev->vendor;
+ break;
+ case PCI_VENDOR_ID + 1:
+ val = pdev->vendor >> 8;
+ break;
+ case PCI_DEVICE_ID:
+ val = pdev->device;
+ break;
+ case PCI_DEVICE_ID + 1:
+ val = pdev->device >> 8;
+ break;
+ case PCI_INTERRUPT_LINE:
+ if (write)
+ vdev->vinfo.intr = newval;
+ else
+ val = vdev->vinfo.intr;
+ break;
+ case PCI_ROM_ADDRESS:
+ case PCI_ROM_ADDRESS+1:
+ case PCI_ROM_ADDRESS+2:
+ case PCI_ROM_ADDRESS+3:
+ if (write) {
+ vdev->vinfo.rombar[off & 3] = newval;
+ vdev->vinfo.bardirty = 1;
+ } else {
+ if (vdev->vinfo.bardirty)
+ vfio_bar_fixup(vdev);
+ val = vdev->vinfo.rombar[off & 3];
+ }
+ break;
+ default:
+ if (off >= PCI_BASE_ADDRESS_0 &&
+ off <= PCI_BASE_ADDRESS_5 + 3) {
+ int boff = off - PCI_BASE_ADDRESS_0;
+
+ if (write) {
+ vdev->vinfo.bar[boff] = newval;
+ vdev->vinfo.bardirty = 1;
+ } else {
+ if (vdev->vinfo.bardirty)
+ vfio_bar_fixup(vdev);
+ val = vdev->vinfo.bar[boff];
+ }
+ }
+ break;
+ }
+ break;
+ case PCI_CAP_ID_MSI: /* virtualize MSI */
+ if (off >= PCI_MSI_ADDRESS_LO && off <= (PCI_MSI_DATA_64 + 2)) {
+ int moff = off - PCI_MSI_ADDRESS_LO;
+
+ if (write)
+ vdev->vinfo.msi[moff] = newval;
+ else
+ val = vdev->vinfo.msi[moff];
+ break;
+ }
+ break;
+ }
+ if (!write && copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ return 0;
+}
+
+ssize_t vfio_config_readwrite(int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int done = 0;
+ int ret;
+ int pos;
+
+ pci_block_user_cfg_access(pdev);
+
+ if (vdev->pci_config_map == NULL) {
+ ret = vfio_build_config_map(vdev);
+ if (ret < 0)
+ goto out;
+ vfio_virt_init(vdev);
+ }
+
+ while (count > 0) {
+ pos = *ppos;
+ if (pos == pdev->cfg_size)
+ break;
+ if (pos > pdev->cfg_size) {
+ ret = -EINVAL;
+ goto out;
+ }
+ ret = vfio_config_rwbyte(write, vdev, pos, buf);
+ if (ret < 0)
+ goto out;
+ buf++;
+ done++;
+ count--;
+ (*ppos)++;
+ }
+ ret = done;
+out:
+ pci_unblock_user_cfg_access(pdev);
+ return ret;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_rdwr.c vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c
--- linux-2.6.34/drivers/vfio/vfio_rdwr.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c 2010-05-28 14:27:40.000000000 -0700
@@ -0,0 +1,147 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/mmu_notifier.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+
+#include <linux/vfio.h>
+
+ssize_t vfio_io_readwrite(
+ int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ size_t done = 0;
+ resource_size_t end;
+ void __iomem *io;
+ loff_t pos;
+ int pci_space;
+ int unit;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ pos = (*ppos & 0xFFFFFFFF);
+
+ if (vdev->bar[pci_space] == NULL)
+ vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+ io = vdev->bar[pci_space];
+ end = pci_resource_len(pdev, pci_space);
+ if (pos + count > end)
+ return -EINVAL;
+
+ while (count > 0) {
+ if ((pos % 4) == 0 && count >= 4) {
+ u32 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 4))
+ return -EFAULT;
+ iowrite32(val, io + pos);
+ } else {
+ val = ioread32(io + pos);
+ if (copy_to_user(buf, &val, 4))
+ return -EFAULT;
+ }
+ unit = 4;
+ } else if ((pos % 2) == 0 && count >= 2) {
+ u16 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 2))
+ return -EFAULT;
+ iowrite16(val, io + pos);
+ } else {
+ val = ioread16(io + pos);
+ if (copy_to_user(buf, &val, 2))
+ return -EFAULT;
+ }
+ unit = 2;
+ } else {
+ u8 val;
+
+ if (write) {
+ if (copy_from_user(&val, buf, 1))
+ return -EFAULT;
+ iowrite8(val, io + pos);
+ } else {
+ val = ioread8(io + pos);
+ if (copy_to_user(buf, &val, 1))
+ return -EFAULT;
+ }
+ unit = 1;
+ }
+ pos += unit;
+ buf += unit;
+ count -= unit;
+ done += unit;
+ }
+ *ppos += done;
+ return done;
+}
+
+ssize_t vfio_mem_readwrite(
+ int write,
+ struct vfio_dev *vdev,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ resource_size_t end;
+ void __iomem *io;
+ loff_t pos;
+ int pci_space;
+
+ pci_space = vfio_offset_to_pci_space(*ppos);
+ pos = (*ppos & 0xFFFFFFFF);
+
+ if (vdev->bar[pci_space] == NULL)
+ vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+ io = vdev->bar[pci_space];
+ end = pci_resource_len(pdev, pci_space);
+ if (pos > end)
+ return -EINVAL;
+ if (pos == end)
+ return 0;
+ if (pos + count > end)
+ count = end - pos;
+ if (write) {
+ if (copy_from_user(io + pos, buf, count))
+ return -EFAULT;
+ } else {
+ if (copy_to_user(buf, io + pos, count))
+ return -EFAULT;
+ }
+ *ppos += count;
+ return count;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_sysfs.c vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c
--- linux-2.6.34/drivers/vfio/vfio_sysfs.c 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c 2010-05-28 14:04:34.000000000 -0700
@@ -0,0 +1,153 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+struct vfio_class *vfio_class;
+
+int vfio_class_init(void)
+{
+ int ret = 0;
+
+ if (vfio_class != NULL) {
+ kref_get(&vfio_class->kref);
+ goto exit;
+ }
+
+ vfio_class = kzalloc(sizeof(*vfio_class), GFP_KERNEL);
+ if (!vfio_class) {
+ ret = -ENOMEM;
+ goto err_kzalloc;
+ }
+
+ kref_init(&vfio_class->kref);
+ vfio_class->class = class_create(THIS_MODULE, "vfio");
+ if (IS_ERR(vfio_class->class)) {
+ ret = IS_ERR(vfio_class->class);
+ printk(KERN_ERR "class_create failed for vfio\n");
+ goto err_class_create;
+ }
+ return 0;
+
+err_class_create:
+ kfree(vfio_class);
+ vfio_class = NULL;
+err_kzalloc:
+exit:
+ return ret;
+}
+
+static void vfio_class_release(struct kref *kref)
+{
+ /* Ok, we cheat as we know we only have one vfio_class */
+ class_destroy(vfio_class->class);
+ kfree(vfio_class);
+ vfio_class = NULL;
+}
+
+void vfio_class_destroy(void)
+{
+ if (vfio_class)
+ kref_put(&vfio_class->kref, vfio_class_release);
+}
+
+ssize_t config_map_read(struct kobject *kobj, struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t count)
+{
+ struct vfio_dev *vdev = bin_attr->private;
+ int ret;
+
+ if (off >= 256)
+ return 0;
+ if (off + count > 256)
+ count = 256 - off;
+ if (vdev->pci_config_map == NULL) {
+ ret = vfio_build_config_map(vdev);
+ if (ret < 0)
+ return ret;
+ }
+ memcpy(buf, vdev->pci_config_map + off, count);
+ return count;
+}
+
+static ssize_t show_locked_pages(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vfio_dev *vdev = dev_get_drvdata(dev);
+
+ if (vdev == NULL)
+ return -ENODEV;
+ return sprintf(buf, "%u\n", vdev->locked_pages);
+}
+
+static DEVICE_ATTR(locked_pages, S_IRUGO, show_locked_pages, NULL);
+
+static struct attribute *vfio_attrs[] = {
+ &dev_attr_locked_pages.attr,
+ NULL,
+};
+
+static struct attribute_group vfio_attr_grp = {
+ .attrs = vfio_attrs,
+};
+
+struct bin_attribute config_map_bin_attribute = {
+ .attr = {
+ .name = "config_map",
+ .mode = S_IRUGO,
+ },
+ .size = 256,
+ .read = config_map_read,
+};
+
+int vfio_dev_add_attributes(struct vfio_dev *vdev)
+{
+ struct bin_attribute *bi;
+ int ret;
+
+ ret = sysfs_create_group(&vdev->dev->kobj, &vfio_attr_grp);
+ if (ret)
+ return ret;
+ bi = kmalloc(sizeof(*bi), GFP_KERNEL);
+ if (bi == NULL)
+ return -ENOMEM;
+ *bi = config_map_bin_attribute;
+ bi->private = vdev;
+ return sysfs_create_bin_file(&vdev->dev->kobj, bi);
+}
diff -uprN linux-2.6.34/include/linux/uiommu.h vfio-linux-2.6.34/include/linux/uiommu.h
--- linux-2.6.34/include/linux/uiommu.h 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/include/linux/uiommu.h 2010-06-03 20:35:35.000000000 -0700
@@ -0,0 +1,62 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+/*
+ * uiommu driver - manipulation of iommu domains from user progs
+ */
+struct uiommu_domain {
+ struct iommu_domain *domain;
+ atomic_t refcnt;
+};
+
+struct uiommu_domain *uiommu_fdget(int fd);
+void uiommu_put(struct uiommu_domain *);
+
+static inline int uiommu_attach_device(struct uiommu_domain *udomain,
+ struct device *dev)
+{
+ return iommu_attach_device(udomain->domain, dev);
+}
+
+static inline void uiommu_detach_device(struct uiommu_domain *udomain,
+ struct device *dev)
+{
+ iommu_detach_device(udomain->domain, dev);
+}
+
+static inline int uiommu_map_range(struct uiommu_domain *udomain,
+ unsigned long iova,
+ phys_addr_t paddr,
+ size_t size,
+ int prot)
+{
+ return iommu_map_range(udomain->domain, iova, paddr, size, prot);
+}
+
+static inline void uiommu_unmap_range(struct uiommu_domain *udomain,
+ unsigned long iova,
+ size_t size)
+{
+ iommu_unmap_range(udomain->domain, iova, size);
+}
+
+static inline phys_addr_t uiommu_iova_to_phys(struct uiommu_domain *udomain,
+ unsigned long iova)
+{
+ return iommu_iova_to_phys(udomain->domain, iova);
+}
diff -uprN linux-2.6.34/include/linux/vfio.h vfio-linux-2.6.34/include/linux/vfio.h
--- linux-2.6.34/include/linux/vfio.h 1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/include/linux/vfio.h 2010-06-07 12:20:06.000000000 -0700
@@ -0,0 +1,200 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, pu...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spr...@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tg...@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <h...@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <gr...@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <m...@redhat.com>
+ */
+
+/*
+ * VFIO driver - allow mapping and use of certain PCI devices
+ * in unprivileged user processes. (If IOMMU is present)
+ * Especially useful for Virtual Function parts of SR-IOV devices
+ */
+
+#ifdef __KERNEL__
+
+struct vfio_dev {
+ struct device *dev;
+ struct pci_dev *pdev;
+ u8 *pci_config_map;
+ int pci_config_size;
+ char name[8];
+ int devnum;
+ int pmaster;
+ void __iomem *bar[PCI_ROM_RESOURCE+1];
+ spinlock_t irqlock; /* guards command register accesses */
+ int listeners;
+ u32 locked_pages;
+ struct mutex lgate; /* listener gate */
+ struct mutex dgate; /* dma op gate */
+ struct mutex igate; /* intr op gate */
+ struct msix_entry *msix;
+ int nvec;
+ struct uiommu_domain *udomain;
+ int cachec;
+ struct eventfd_ctx *ev_irq;
+ struct eventfd_ctx *ev_msi;
+ struct eventfd_ctx **ev_msix;
+ struct {
+ u8 intr;
+ u8 bardirty;
+ u8 rombar[4];
+ u8 bar[6*4];
+ u8 msi[24];
+ } vinfo;
+};
+
+struct vfio_listener {
+ struct vfio_dev *vdev;
+ struct list_head dm_list;
+ struct mm_struct *mm;
+ struct mmu_notifier mmu_notifier;
+};
+
+/*
+ * Structure for keeping track of memory nailed down by the
+ * user for DMA
+ */
+struct dma_map_page {
+ struct list_head list;
+ struct page **pages;
+ dma_addr_t daddr;
+ unsigned long vaddr;
+ int npage;
+ int rdwr;
+};
+
+/* VFIO class infrastructure */
+struct vfio_class {
+ struct kref kref;
+ struct class *class;
+};
+extern struct vfio_class *vfio_class;
+
+ssize_t vfio_io_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+ssize_t vfio_mem_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+ssize_t vfio_config_readwrite(int, struct vfio_dev *,
+ char __user *, size_t, loff_t *);
+
+void vfio_disable_msi(struct vfio_dev *);
+void vfio_disable_msix(struct vfio_dev *);
+int vfio_enable_msi(struct vfio_dev *, int);
+int vfio_enable_msix(struct vfio_dev *, int, void __user *);
+
+#ifndef PCI_MSIX_ENTRY_SIZE
+#define PCI_MSIX_ENTRY_SIZE 16
+#endif
+#ifndef PCI_STATUS_INTERRUPT
+#define PCI_STATUS_INTERRUPT 0x08
+#endif
+
+struct vfio_dma_map;
+void vfio_dma_unmapall(struct vfio_listener *);
+int vfio_dma_unmap_dm(struct vfio_listener *, struct vfio_dma_map *);
+int vfio_dma_map_common(struct vfio_listener *, unsigned int,
+ struct vfio_dma_map *);
+int vfio_domain_set(struct vfio_dev *, int);
+void vfio_domain_unset(struct vfio_dev *);
+
+int vfio_class_init(void);
+void vfio_class_destroy(void);
+int vfio_dev_add_attributes(struct vfio_dev *);
+extern struct idr vfio_idr;
+extern struct mutex vfio_minor_lock;
+int vfio_build_config_map(struct vfio_dev *);
+
+irqreturn_t vfio_interrupt(int, void *);
+
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for ioctls */
+
+/*
+ * Structure for DMA mapping of user buffers
+ * vaddr, dmaaddr, and size must all be page aligned
+ * buffer may only be larger than 1 page if (a) there is
+ * an iommu in the system, or (b) buffer is part of a huge page
+ */
+struct vfio_dma_map {
+ __u64 vaddr; /* process virtual addr */
+ __u64 dmaaddr; /* desired and/or returned dma address */
+ __u64 size; /* size in bytes */
+ __u64 flags; /* bool: 0 for r/o; 1 for r/w */
+#define VFIO_FLAG_WRITE 0x1 /* req writeable DMA mem */
+};
+
+/* map user pages at specific dma address */
+/* requires previous VFIO_DOMAIN_SET */
+#define VFIO_DMA_MAP_IOVA _IOWR(';', 101, struct vfio_dma_map)
+
+/* unmap user pages */
+#define VFIO_DMA_UNMAP _IOW(';', 102, struct vfio_dma_map)
+
+/* set device DMA mask & master status */
+#define VFIO_DMA_MASK _IOW(';', 103, __u64)
+
+/* request IRQ interrupts; use given eventfd */
+#define VFIO_EVENTFD_IRQ _IOW(';', 104, int)
+
+/* request MSI interrupts; use given eventfd */
+#define VFIO_EVENTFD_MSI _IOW(';', 105, int)
+
+/* Request MSI-X interrupts: arg[0] is #, arg[1-n] are eventfds */
+#define VFIO_EVENTFDS_MSIX _IOW(';', 106, int)
+
+/* Get length of a BAR */
+#define VFIO_BAR_LEN _IOWR(';', 107, __u32)
+
+/* Set the IOMMU domain - arg is fd from uiommu driver */
+#define VFIO_DOMAIN_SET _IOW(';', 108, int)
+
+/* Unset the IOMMU domain */
+#define VFIO_DOMAIN_UNSET _IO(';', 109)
+
+/*
+ * Reads, writes, and mmaps determine which PCI BAR (or config space)
+ * from the high level bits of the file offset
+ */
+#define VFIO_PCI_BAR0_RESOURCE 0x0
+#define VFIO_PCI_BAR1_RESOURCE 0x1
+#define VFIO_PCI_BAR2_RESOURCE 0x2
+#define VFIO_PCI_BAR3_RESOURCE 0x3
+#define VFIO_PCI_BAR4_RESOURCE 0x4
+#define VFIO_PCI_BAR5_RESOURCE 0x5
+#define VFIO_PCI_ROM_RESOURCE 0x6
+#define VFIO_PCI_CONFIG_RESOURCE 0xF
+#define VFIO_PCI_SPACE_SHIFT 32
+#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
+
+static inline int vfio_offset_to_pci_space(__u64 off)
+{
+ return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
+}
+
+static inline __u64 vfio_pci_space_to_offset(int sp)
+{
+ return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
+}
diff -uprN linux-2.6.34/MAINTAINERS vfio-linux-2.6.34/MAINTAINERS
--- linux-2.6.34/MAINTAINERS 2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/MAINTAINERS 2010-05-28 12:30:21.000000000 -0700
@@ -5968,6 +5968,13 @@ S: Maintained
F: Documentation/fb/uvesafb.txt
F: drivers/video/uvesafb.*
+VFIO DRIVER
+M: Tom Lyon <pu...@cisco.com>
+S: Supported
+F: Documentation/vfio.txt
+F: drivers/vfio/
+F: include/linux/vfio.h
+
VFAT/FAT/MSDOS FILESYSTEM
M: OGAWA Hirofumi <hiro...@mail.parknet.co.jp>
S: Maintained
> diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
> --- linux-2.6.34/Documentation/vfio.txt 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/Documentation/vfio.txt 2010-06-07 15:05:42.000000000 -0700
> @@ -0,0 +1,177 @@
...
> +Interrupts:
> +
> +Device interrupts are translated by the vfio driver into input events on event
> +notification file descriptors created by the eventfd system call. The user
> +program must create one or more event descriptors and pass them to the vfio
> +driver via ioctls to arrange for the interrupt mapping:
> +1.
> + efd = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
> + This provides an eventfd for traditional IRQ interrupts.
> + IRQs will be disable after each interrupt until the driver
disabled
> + re-enables them via the PCI COMMAND register.
> +2.
> + efd = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
> + This connects MSI interrupts to an eventfd.
> +3.
> + int arg[N+1];
> + arg[0] = N;
> + arg[1..N] = eventfd(0, 0);
> + ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
> + This connects N MSI-X interrupts with N eventfds.
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
one missing piece (again):
Documentation/ioctl/ioctl-number.txt | 1
Documentation/vfio.txt | 177 +++++++
MAINTAINERS | 7
drivers/Kconfig | 2
drivers/Makefile | 1
drivers/vfio/Kconfig | 18
drivers/vfio/Makefile | 6
drivers/vfio/uiommu.c | 126 +++++
drivers/vfio/vfio_dma.c | 324 ++++++++++++
drivers/vfio/vfio_intrs.c | 191 +++++++
drivers/vfio/vfio_main.c | 624 +++++++++++++++++++++++++
drivers/vfio/vfio_pci_config.c | 554 ++++++++++++++++++++++
drivers/vfio/vfio_rdwr.c | 147 +++++
drivers/vfio/vfio_sysfs.c | 153 ++++++
include/linux/uiommu.h | 62 ++
include/linux/vfio.h | 200 ++++++++
16 files changed, 2593 insertions(+)
> diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
> --- linux-2.6.34/drivers/vfio/Kconfig 1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/Kconfig 2010-06-07 15:28:14.000000000 -0700
> @@ -0,0 +1,18 @@
> +menuconfig VFIO
> + tristate "Non-Priv User Space PCI drivers"
Non-privileged
(again)
> + depends on UIOMMU && PCI && IOMMU_API
> + help
> + Driver to allow advanced user space drivers for PCI, PCI-X,
> + and PCIe devices. Requires IOMMU to allow non-privilged
non-privileged
(again) :(
> + processes to directly control the PCI devices.
> +
> + If you don't know what to do here, say N.
> +
> +menuconfig UIOMMU
> + tristate "User level manipulation of IOMMU"
> + help
> + Device driver to allow user level programs to
> + manipulate IOMMU domains
domains.
> +
> + If you don't know what to do here, say N.
> +
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Some general comments:
- Please pass this through ./scripts/checkpatch.pl to fix some formatting.
- Lots of hard-coded constants. Please try using pci_regs.h much more,
where not possible please add named enums.
- There are places where you get parameters from userspace and pass them
on to kmalloc etc. Everything you get from userspace needs to be
validated.
- You play non-standard tricks with minor numbers.
Won't it be easier to just make udev create a node
for the device in the way everyone does it? The name could be
descriptive including e.g. bus/dev/fn so userspace can find
your device.
- I note that if we exclude the iommu mapping, the rest conceptually could belong
in pci_generic driver in uio. So if we move these ioctls to the iommu driver,
as Avi suggested, then vfio can be part of the uio framework.
> ---
> This version now requires an IOMMU domain to be set before any access to
> device registers is granted (except that config space may be read). In
> addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
> which does not have sufficient controls around IOMMU usage. The IOMMU domain
> is obtained from the 'uiommu' driver which is included in this patch.
>
> Various locking, security, and documentation issues have also been fixed.
>
> Please commit - it or me!
> But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?
>
> Blurb from previous patch version:
>
> This patch is the evolution of code which was first proposed as a patch to
> uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
> out of the uio framework, and things seem much cleaner. Of course, there is
> a lot of functional overlap with uio, but the previous version just seemed
> like a giant mode switch in the uio code that did not lead to clarity for
> either the new or old code.
>
> [a pony for avi...]
> The major new functionality in this version is the ability to deal with
> PCI config space accesses (through read & write calls) - but includes table
> driven code to determine whats safe to write and what is not. Also, some
> virtualization of the config space to allow drivers to think they're writing
> some registers when they're not. Also, IO space accesses are also allowed.
> Drivers for devices which use MSI-X are now prevented from directly writing
> the MSI-X vector area.
This table adds a lot of complexity to the code,
and I don't really understand why we need this code in
kernel: isn't the point of iommu that it protects us
from buggy devices? If yes, we should be able to
just ask userspace to be careful and avoid doing silly things
like overwriting MSI-X vectors, and if it's not careful,
no one gets hurt :)
If some registers absolutely must be protected,
I think we should document this carefully in code.
kzalloc with size coming from userspace? And it's signed. Ugh.
I think you should just enable all vectors and map them,
at startup.
Instead of all this complexity, can't we just stick a pointer to your device
in 'struct cdev *i_cdev' on the inode?
> + if (!vdev) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> + listener = kzalloc(sizeof(*listener), GFP_KERNEL);
> + if (!listener) {
> + ret = -ENOMEM;
> + goto err_alloc_listener;
> + }
> +
> + listener->vdev = vdev;
> + INIT_LIST_HEAD(&listener->dm_list);
> + filep->private_data = listener;
> +
> + mutex_lock(&vdev->lgate);
> + if (vdev->listeners == 0) { /* first open */
> + /* reset to known state if we can */
> + (void) pci_reset_function(vdev->pdev);
We are opening the device - how can it not be in a known state?
This is too late - device could be doing DMA here and we moved it from under the domain!
And we should make sure (at open time) we *can* reset on close, fail binding if we can't.
PAGE_ALIGN here and elsewhere?
> + if (bir == vfio_offset_to_pci_space(start) &&
> + overlap(lo, hi, startp, endp)) {
> + printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
> + __func__);
> + return -EINVAL;
> + }
Tricky, slow, and - is it really necessary?
And it won't work if PAGE_SIZE is > 4K, because MSIX page is only 4K in size.
What happens if we don't do all these checks?
I don't think we really care about non-memory mmaps. They can all go
through read.
> + requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> + if (requested > actual || actual == 0)
> + return -EINVAL;
> +
> + /*
> + * Can't allow non-priv users to mmap MSI-X vectors
> + * else they can write anywhere in phys memory
> + */
not if there's an iommu.
> + start = vma->vm_pgoff << PAGE_SHIFT;
> + len = vma->vm_end - vma->vm_start;
> + if (vma->vm_flags & VM_WRITE) {
> + ret = vfio_msix_check(vdev, start, len);
> + if (ret)
> + return ret;
> + }
> +
> + vma->vm_private_data = vdev;
> + vma->vm_flags |= VM_IO | VM_RESERVED;
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
> +
> + return remap_pfn_range(vma, vma->vm_start, phys,
> + vma->vm_end - vma->vm_start,
> + vma->vm_page_prot);
I think there's a security problem here:
userspace can do mmap, then close the file and unbind
device from iommu, now that device is not bound
(or bound to anothe rprocess)
access device through mmap and crash the system.
We must make sure device stays open until no one
maps the memory range.
This better be done by userspace when it sees fit.
Otherwise device might corrupt userspace memory.
> + ret = pci_set_dma_mask(pdev, mask);
> + break;
Is the above needed?
So what happens if someone has a device file open and device
is being hot-unplugged? At a minimum, we want userspace to
have a way to get and handle these notifications.
But also remember we can not trust userspace to be well-behaved.
> + struct vfio_dev *vdev = pci_get_drvdata(pdev);
> +
> + vfio_free_minor(vdev);
> +
> + if (pdev->irq > 0)
> + free_irq(pdev->irq, vdev);
> +
> +#ifdef notdef
> + vfio_dev_del_attributes(vdev);
> +#endif
> +
> + pci_set_drvdata(pdev, NULL);
> + device_destroy(vfio_class->class, vdev->devnum);
> + kfree(vdev);
> + vfio_class_destroy();
> + pci_disable_device(pdev);
> +}
> +
> +static struct pci_driver driver = {
> + .name = "vfio",
> + .id_table = NULL, /* only dynamic id's */
> + .probe = vfio_probe,
> + .remove = vfio_remove,
Also - I think we need to handle e.g. suspend in some way.
Again, this likely involves notifying userspace so it can
save state to memory.
So the above looks like it is very unlikely to be exhaustive and
correct. Maybe there aren't bugs in this table to be found just by
looking hard at the spec, but likely will surface when someone tries
to actually run driver with e.g. a working pm on top.
Let's ask another question:
since we have the iommu protecting us, can't all or most of this be
done in userspace? What can userspace do that will harm the host?
I think each place where we block access to a register, there should
be a very specific documentation for why we do this.
> +int vfio_build_config_map(struct vfio_dev *vdev)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + u8 *map;
> + int i, len;
> + u8 pos, cap, tmp;
> + u16 flags;
> + int ret;
> + int loops = 100;
100?
.....
Why is that?
> + if (listener->mm != NULL) {
> + ret = -EINVAL;
> + goto out_lock;
> + }
> + listener->mm = current->mm;
> +#ifdef CONFIG_MMU_NOTIFIER
> + listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
> + ret = mmu_notifier_register(&listener->mmu_notifier,
> + listener->mm);
> + if (ret)
> + printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
> + __func__, ret);
> + ret = 0;
> +#endif
> + }
> +
> + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
If you map a 4G region, this will try to allocate 8Mbytes?
> - Lots of hard-coded constants. Please try using pci_regs.h much more,
> where not possible please add named enums.
This is mostly for lengths not specified in pci_regs, but given in standards docs.
> - There are places where you get parameters from userspace and pass them
> on to kmalloc etc. Everything you get from userspace needs to be
> validated.
I thought I had. Thats what more eyeballs are for.
> - You play non-standard tricks with minor numbers.
> Won't it be easier to just make udev create a node
> for the device in the way everyone does it? The name could be
> descriptive including e.g. bus/dev/fn so userspace can find
> your device.
I just copied what uio was doing. What is "the way everyone does it?"
>
> - I note that if we exclude the iommu mapping, the rest conceptually could belong
> in pci_generic driver in uio. So if we move these ioctls to the iommu driver,
> as Avi suggested, then vfio can be part of the uio framework.
But the interrupt handling is different in uio; uio doesn't support read or write calls
to read and write registers or memory, and it doesn't support ioctls at all for other
misc stuff. If we could blow off backwards compatibility with uio, then, sure we
could have a nice unified solution.
Thanks for the catch.
Again, just copied this from uio.
>
> > + if (!vdev) {
> > + ret = -ENODEV;
> > + goto out;
> > + }
> > +
> > + listener = kzalloc(sizeof(*listener), GFP_KERNEL);
> > + if (!listener) {
> > + ret = -ENOMEM;
> > + goto err_alloc_listener;
> > + }
> > +
> > + listener->vdev = vdev;
> > + INIT_LIST_HEAD(&listener->dm_list);
> > + filep->private_data = listener;
> > +
> > + mutex_lock(&vdev->lgate);
> > + if (vdev->listeners == 0) { /* first open */
> > + /* reset to known state if we can */
> > + (void) pci_reset_function(vdev->pdev);
>
> We are opening the device - how can it not be in a known state?
If an alternate driver left it in a weird state.
OK - how about a pci_clear_master before the domain_unset?
>
> And we should make sure (at open time) we *can* reset on close, fail binding if we can't.
How do you propose?
It can be sped up with some caching.
BTW, MSI-X can be up to 2048 entries of 16 bytes..
These are just sorting out config, io, and memory read/writes.
Its conceivable that a virtual machine may want to jump to ROM code.
>
> > + requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > + if (requested > actual || actual == 0)
> > + return -EINVAL;
> > +
> > + /*
> > + * Can't allow non-priv users to mmap MSI-X vectors
> > + * else they can write anywhere in phys memory
> > + */
>
> not if there's an iommu.
I'm not totally convinced that the IOMMU code, as implemented, forces the
devices to use only their own vectors. But the iommu code is deep.
>
> > + start = vma->vm_pgoff << PAGE_SHIFT;
> > + len = vma->vm_end - vma->vm_start;
> > + if (vma->vm_flags & VM_WRITE) {
> > + ret = vfio_msix_check(vdev, start, len);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + vma->vm_private_data = vdev;
> > + vma->vm_flags |= VM_IO | VM_RESERVED;
> > + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > + phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
> > +
> > + return remap_pfn_range(vma, vma->vm_start, phys,
> > + vma->vm_end - vma->vm_start,
> > + vma->vm_page_prot);
>
> I think there's a security problem here:
> userspace can do mmap, then close the file and unbind
> device from iommu, now that device is not bound
> (or bound to anothe rprocess)
> access device through mmap and crash the system.
>
> We must make sure device stays open until no one
> maps the memory range.
The memory system holds a file reference when pages are mapped;
the driver release routine won't be called until the region is unmapped or
killed.
For now, hotplug and suspend/resume not supported - sys admin must
not enable vfio for these devices. I think they are doable, but lots of testing
work - and not important for my use case.
I think, in an ideal world, you would be correct. I don't trust either
the hardware or the iommu software to feel good about this though.
>
>
> > +int vfio_build_config_map(struct vfio_dev *vdev)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + u8 *map;
> > + int i, len;
> > + u8 pos, cap, tmp;
> > + u16 flags;
> > + int ret;
> > + int loops = 100;
>
> 100?
Why not?
Its OK to have multiple fds per device, but multiple address spaces per fd
means somebody forked - and I don't know how to keep track of mmu
notifications once that happens.
>
> > + if (listener->mm != NULL) {
> > + ret = -EINVAL;
> > + goto out_lock;
> > + }
> > + listener->mm = current->mm;
> > +#ifdef CONFIG_MMU_NOTIFIER
> > + listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
> > + ret = mmu_notifier_register(&listener->mmu_notifier,
> > + listener->mm);
> > + if (ret)
> > + printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
> > + __func__, ret);
> > + ret = 0;
> > +#endif
> > + }
> > +
> > + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
>
> If you map a 4G region, this will try to allocate 8Mbytes?
Yes. Ce la vie.
Heh, it passed, sorry.
> > - Lots of hard-coded constants. Please try using pci_regs.h much more,
> > where not possible please add named enums.
> This is mostly for lengths not specified in pci_regs, but given in standards docs.
Add them to pci_regs.h, or to named define in your code.
The largest problem is the tables though.
> > - There are places where you get parameters from userspace and pass them
> > on to kmalloc etc. Everything you get from userspace needs to be
> > validated.
> I thought I had. Thats what more eyeballs are for.
That's what code comments are for :)
Go over malloc calls - each time you malloc in response
to ioctl, and keep memory around, ask yourself why isn't this a DOS
attack. If it's not, document why.
> > - You play non-standard tricks with minor numbers.
> > Won't it be easier to just make udev create a node
> > for the device in the way everyone does it? The name could be
> > descriptive including e.g. bus/dev/fn so userspace can find
> > your device.
> I just copied what uio was doing. What is "the way everyone does it?"
Everyone is an over-statement :), but look at virtio-serial for example.
Your open routine could be as simple as:
struct cdev *cdev = inode->i_cdev;
struct vfio_dev *dev = container_of(cdev, struct vfio_dev, vfio_cdev);
filp->private_data = dev;
> >
> > - I note that if we exclude the iommu mapping, the rest conceptually could belong
> > in pci_generic driver in uio. So if we move these ioctls to the iommu driver,
> > as Avi suggested, then vfio can be part of the uio framework.
> But the interrupt handling is different in uio;
eventfd? This can be supported IMO.
> uio doesn't support read or write calls
> to read and write registers or memory,
We don't use write in pci generic uio so this can be fixed:
all we need is pass offset and size to control call, right?
> and it doesn't support ioctls at all for other
> misc stuff.
> If we could blow off backwards compatibility with uio, then, sure we
> could have a nice unified solution.
What I said above, *if* you move the mapping ioctl to iommu
device - no important ioctls are left after this, really.
....
>
> > > + /* reset to known state if we can */
> > > + (void) pci_reset_function(vdev->pdev);
> >
> > We are opening the device - how can it not be in a known state?
> If an alternate driver left it in a weird state.
Don't we care if it fails then? I think we do ...
Not all devices let you do it at any random time.
How about pci_reset_function before domain_unset?
> >
> > And we should make sure (at open time) we *can* reset on close, fail binding if we can't.
> How do you propose?
Fail open if reset fails on open?
> > > + if (bir == vfio_offset_to_pci_space(start) &&
> > > + overlap(lo, hi, startp, endp)) {
> > > + printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
> > > + __func__);
> > > + return -EINVAL;
> > > + }
> >
> > Tricky, slow, and - is it really necessary?
> > And it won't work if PAGE_SIZE is > 4K, because MSIX page is only 4K in size.
> It can be sped up with some caching.
> BTW, MSI-X can be up to 2048 entries of 16 bytes..
Right. Point is, it can be < PAGE_SIZE, so page will have other stuff in it,
so prohibiting it will create problems.
Sorry, I mean the one below.
It can always shadow ROM if it wants to do that. And I didn't think
you can jump to code mmaped in this way - can you?
> >
> > > + requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > > + if (requested > actual || actual == 0)
> > > + return -EINVAL;
> > > +
> > > + /*
> > > + * Can't allow non-priv users to mmap MSI-X vectors
> > > + * else they can write anywhere in phys memory
> > > + */
> >
> > not if there's an iommu.
> I'm not totally convinced that the IOMMU code, as implemented, forces the
> devices to use only their own vectors. But the iommu code is deep.
MSI is just a memory write. So if it didn't, then device could
trigger interrupt for another device just by doing a memory write.
IOW, if you let userspace control the device, it's useless from security POW
to prevent control to the MSI/MSIX table.
Right. sorry about the noise.
Hope the above is clear: userspace driver might need
to do some initializations before it lets
device write into its memory.
> > > + ret = pci_set_dma_mask(pdev, mask);
> > > + break;
> >
> > Is the above needed?
> >
....
> > So what happens if someone has a device file open and device
> > is being hot-unplugged? At a minimum, we want userspace to
> > have a way to get and handle these notifications.
> > But also remember we can not trust userspace to be well-behaved.
> For now,
And when it's fixed, how will the sysadmin know it's now safe?
It's the driver's responsibility to prevent crash and burn.
> hotplug and suspend/resume not supported -
> sys admin must not enable vfio for these devices.
For *which* devices? suspend is a system wide feature.
If device doesn't support it, make it depend on !SUSPEND && !HOTPLUG
or something?
> I think they are doable,
Issue is, if userspace interface is designed not to support these
events, applications will be written ignoring it, and we will
be stuck. Had this problem with infiniband userspace, no
hotplug/suspend support to this day.
> but lots of testing
> work - and not important for my use case.
That's what users are for :)
Using '' implies it is not really virtualized?
what exactly do you mean?
> or just pass thru to the hardware.
> > > + * Any virtualized field is also virtualized for writes.
virtualized for writes? what does it mean?
> > > + * Writes are only permitted if they have a 1 bit here.
you mean they are ignored if register is all 0, right?
> > > + */
> > > +struct perm_bits {
> > > + u32 rvirt; /* read bits which must be virtualized */
> > > + u32 write; /* writeable bits - virt if read virt */
> > > +};
By the time I got half page down, I forgot the order.
So please use .rvirt/.write initializers.
> > > +
When this table changes - as it invariably will - how
will userspace know?
We could add an ioctl to query this table - but more
ioctls, more bugs in kernel and userspace.
It would be easier if we had just one or two registers,
and just did protection, failing illegal reads/writes.
Then userspace would have a simple standard way to
find out, right where it happens, and it could deal with virtualization.
> > > +static struct perm_bits pci_cap_basic_perm[] = {
what endian-ness is all this in, btw?
> > > + { 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
you virtualize vendor and device id?
> > > + { 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
> > > + { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
Virtualizing bars is not that simple. PCI is full of
this stuff, and qemu has to do a ton of tricks in userspace.
If you are really sure we want this in kernel - and I still don't see
why - still better pass through as much as possible.
> > > + { 0, 0, }, /* 0x24 cardbus - not yet */
> > > + { 0, 0, }, /* 0x28 subsys vendor & dev */
> > > + { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
> > > + { 0, 0, }, /* 0x30 capability ptr & resv */
> > > + { 0, 0, }, /* 0x34 resv */
> > > + { 0, 0, }, /* 0x38 resv */
> > > + { 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
Use [] initializers here as well.
More importantly, please document *why* you are virtualizing
or blocking access, not just what you do.
irq is safe to write I think, it's for the OS, device does not use it.
> > > +};
Above it broken for non-type 0 devices. Is there a check in code
that this is what we get? Should be ...
> > > +
> > > +static struct perm_bits pci_cap_pm_perm[] = {
> > > + { 0, 0, }, /* 0x00 PM capabilities */
> > > + { 0, 0xFFFFFFFF, }, /* 0x04 PM control/status */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_vpd_perm[] = {
> > > + { 0, 0xFFFF0000, }, /* 0x00 address */
You don't let userspace write address?
> > > + { 0, 0xFFFFFFFF, }, /* 0x04 data */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_slotid_perm[] = {
> > > + { 0, 0, }, /* 0x00 all read only */
Why do we bother protecting readonly registers?
better check you don't get a root port then.
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_msix_perm[] = {
> > > + { 0, 0, }, /* 0x00 MSI-X Enable */
> > > + { 0, 0, }, /* 0x04 table offset & bir */
> > > + { 0, 0, }, /* 0x08 pba offset & bir */
> > > +};
> > > +
> > > +static struct perm_bits pci_cap_af_perm[] = {
> > > + { 0, 0, }, /* 0x00 af capability */
> > > + { 0, 0x0001, }, /* 0x04 af flr bit */
> > > +};
> > > +
> > > +static struct perm_bits *pci_cap_perms[] = {
> > > + [PCI_CAP_ID_BASIC] = pci_cap_basic_perm,
> > > + [PCI_CAP_ID_PM] = pci_cap_pm_perm,
> > > + [PCI_CAP_ID_VPD] = pci_cap_vpd_perm,
> > > + [PCI_CAP_ID_SLOTID] = pci_cap_slotid_perm,
> > > + [PCI_CAP_ID_MSI] = pci_cap_msi_perm,
> > > + [PCI_CAP_ID_PCIX] = pci_cap_pcix_perm,
> > > + [PCI_CAP_ID_EXP] = pci_cap_exp_perm,
> > > + [PCI_CAP_ID_MSIX] = pci_cap_msix_perm,
> > > + [PCI_CAP_ID_AF] = pci_cap_af_perm,
> > > +};
Disallowing access or 'virtualizing' - by which I think
you mean replacing write with read modify write? Should
really be very special case.
Can't simple open-coded C work?
if (addr == PCI_COMMAND)
return -EPERM;
....
> > > +
> > > +/*
> > > + * We build a map of the config space that tells us where
> > > + * and what capabilities exist, so that we can map reads and
> > > + * writes back to capabilities, and thus figure out what to
> > > + * allow, deny, or virtualize
> > > + */
> >
> > So the above looks like it is very unlikely to be exhaustive and
> > correct. Maybe there aren't bugs in this table to be found just by
> > looking hard at the spec, but likely will surface when someone tries
> > to actually run driver with e.g. a working pm on top.
> > Let's ask another question:
> >
> > since we have the iommu protecting us, can't all or most of this be
> > done in userspace? What can userspace do that will harm the host?
> > I think each place where we block access to a register, there should
> > be a very specific documentation for why we do this.
> I think, in an ideal world, you would be correct. I don't trust either
> the hardware or the iommu software
Not sure adding emulation bugs on top will help though: and
we will break userspace. Note how there's no way for userspace
to query the tables to figure out what works and what doesn't.
> to feel good about this though.
About documenting the tables? Me too, but for different reasons:
if there are bugs, we should be finding them and documenting
what they are, not plastering over.
> >
> >
> > > +int vfio_build_config_map(struct vfio_dev *vdev)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + u8 *map;
> > > + int i, len;
> > > + u8 pos, cap, tmp;
> > > + u16 flags;
> > > + int ret;
> > > + int loops = 100;
> >
> > 100?
> Why not?
The answer's 42. Document the constants please.
Maybe we need a notifier for that :) I thought we lock the memory
though - using the mlock rlimit - so why are we using notifiers
at all?
Just noticed this btw:
+static void vfio_dma_handle_mmu_notify(struct mmu_notifier *mn,
+ unsigned long start, unsigned long end)
+{
+ struct vfio_listener *listener;
+ unsigned long myend;
+ struct list_head *pos, *pos2;
+ struct dma_map_page *mlp;
+
+ listener = container_of(mn, struct vfio_listener, mmu_notifier);
+ mutex_lock(&listener->vdev->dgate);
+ list_for_each_safe(pos, pos2, &listener->dm_list) {
+ mlp = list_entry(pos, struct dma_map_page, list);
+ if (mlp->vaddr >= end)
I think you can't do a mutex: mmu notifiers are called
under rcu critical section. Also, userspace can make the dm_list
very long, making the rcu critical section very long.
Set some limit on the number of entries, and replace list
with an array? Or use a tree?
> >
> > > + if (listener->mm != NULL) {
> > > + ret = -EINVAL;
> > > + goto out_lock;
> > > + }
> > > + listener->mm = current->mm;
> > > +#ifdef CONFIG_MMU_NOTIFIER
> > > + listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
> > > + ret = mmu_notifier_register(&listener->mmu_notifier,
> > > + listener->mm);
> > > + if (ret)
> > > + printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
> > > + __func__, ret);
> > > + ret = 0;
> > > +#endif
> > > + }
> > > +
> > > + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
> >
> > If you map a 4G region, this will try to allocate 8Mbytes?
> Yes. Ce la vie.
First of all, this will fail - and the request is quite real with
decent sized guests. Second, with appropriately sized allocs before failing
it will stress the system pretty hard. Split it in chunks of 4K or something.
Definitely not me.
> or GregKH for drivers?
>
I guess.
--
error compiling committee.c: too many arguments to function
--
This seems to be missing a change to include/linux/Kbuild that
adds vfio.h to the exported files. Without the export, you cannot
use the definitions from user space programs unless they come with
their own copy of the header.
Arnd
If this ever gets that far, I'll be glad to take it.
thanks,
greg k-h
What if I do:
SET
mmap
UNSET
Now I have access to device which is not behind an iommu.
Simplest solution is to remove the UNSET ioctl:
it's not terribly useful anyway.
--
MST
EXPORT_SYMBOL_GPL
.. snip
> +EXPORT_SYMBOL(uiommu_put);
ditto.
Is there a definitive explanation somewhere of when to use each?
For a driver like this, that is very tied to the way that the kernel
works, I would recommend the _GPL marking, like the UIO interface has.
But that's just me. :)
thanks,
greg k-h
Always use _GPL unless you have a defensible reason why you shouldn't.
The kernel's license if GPL, exporting a symbol without GPL can be seen
as adding an exception to the license.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
On Tuesday 08 June 2010 10:45:57 pm Michael S. Tsirkin wrote:
> On Tue, Jun 08, 2010 at 04:54:43PM -0700, Tom Lyon wrote:
> > On Tuesday 08 June 2010 03:38:44 pm Michael S. Tsirkin wrote:
> > > On Tue, Jun 08, 2010 at 02:21:52PM -0700, Tom Lyon wrote:
...
> > > > + /* reset to known state if we can */
> > > > + (void) pci_reset_function(vdev->pdev);
> > >
> > > We are opening the device - how can it not be in a known state?
> > If an alternate driver left it in a weird state.
>
> Don't we care if it fails then? I think we do ...
> > > And we should make sure (at open time) we *can* reset on close, fail binding if we can't.
> > How do you propose?
>
> Fail open if reset fails on open?
OK, it'll now fail open if reset fails.
[ bunch of stuff about MSI-X checking and IOMMUs and config registers...]
OK, here's the thing. The IOMMU API today does not do squat about dealing with
interrupts. Interrupts are special because the APIC addresses are not each in
their own page. Yes, the IOMMU hardware supports it (at least Intel), and there's
some Intel intr remapping code (not AMD), but it doesn't look like it is enough.
Therefore, we must not allow the user level driver to diddle the MSI or MSI-X
areas - either in config space or in the device memory space. If the device
doesn't have its MSI-X registers in nice page aligned areas, then it is not
"well-behaved" and it is S.O.L. The SR-IOV spec recommends that devices be
designed the well-behaved way.
When the code in vfio_pci_config speaks of "virtualization" it means that there
are fake registers which the user driver can read or write, but do not affect the
real registers. BARs are one case, MSI regs another. The PCI vendor and device
ID are virtual because SR-IOV doesn't supply them but I wanted the user driver
to find them in the same old place.
> > > > + case VFIO_DMA_MASK: /* set master mode and DMA mask */
> > > > + if (copy_from_user(&mask, uarg, sizeof mask))
> > > > + return -EFAULT;
> > > > + pci_set_master(pdev);
> > >
> > > This better be done by userspace when it sees fit.
> > > Otherwise device might corrupt userspace memory.
This ioctl is gone now - it was vestigial from the dma_sg_map interface.
[ Re: Hotplug and Suspend/Resume]
There are *plenty* of real drivers - brand new ones - which don't bother
with these today. Yeah, I can see adding them to the framework someday -
but if there's no urgent need then it is way down the priority list. Meanwhile,
the other uses beckon. And I never heard the Infiniband users complaining
about not having these things.
> > > > + pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
> > >
> > > If you map a 4G region, this will try to allocate 8Mbytes?
> > Yes. Ce la vie.
>
> First of all, this will fail - and the request is quite real with
> decent sized guests. Second, with appropriately sized allocs before failing
> it will stress the system pretty hard. Split it in chunks of 4K or something.
Changed to use vmalloc/vfree - don't need physical contiguity.
The iommu book from AMD seems to say that interrupt remapping table
address is taken from the device table entry. So hardware support seems
to be there, and to me it looks like it should be enough.
Need to look at the iommu/msi code some more to figure out
whether what linux does is handling this correctly -
if it doesn't we need to fix that.
> Therefore, we must not allow the user level driver to diddle the MSI
> or MSI-X areas - either in config space or in the device memory space.
It won't help.
Consider that you want to let a userspace driver control
the device with DMA capabilities.
So if there is a range of addresses that device
can write into that can break host, these writes
can be triggered by userspace. Limiting
userspace access to MSI registers won't help:
you need a way to protect host from the device.
> If the device doesn't have its MSI-X registers in nice page aligned
> areas, then it is not "well-behaved" and it is S.O.L. The SR-IOV spec
> recommends that devices be designed the well-behaved way.
>
> When the code in vfio_pci_config speaks of "virtualization" it means
> that there are fake registers which the user driver can read or write,
> but do not affect the real registers. BARs are one case, MSI regs
> another. The PCI vendor and device ID are virtual because SR-IOV
> doesn't supply them but I wanted the user driver to find them in the
> same old place.
Sorry, I still don't understand why do we bother. All this is already
implemented in userspace. Why can't we just use this existing userspace
implementation? It seems that all kernel needs to do is prevent
userspace from writing BARs.
Why can't we replace all this complexity with basically:
if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
return -ENOPERM;
And maybe another register or two. Most registers should be fine.
> [ Re: Hotplug and Suspend/Resume]
> There are *plenty* of real drivers - brand new ones - which don't
> bother with these today. Yeah, I can see adding them to the framework
> someday - but if there's no urgent need then it is way down the
> priority list.
Well, for kernel drivers everything mostly works out of the box, it is
handled by the PCI subsystem. So some kind of framework will need to be
added for userspace drivers as well. And I suspect this issue won't be
fixable later without breaking applications.
> Meanwhile, the other uses beckon.
Which other uses? I thought the whole point was fixing
what's broken with current kvm implementation.
So it seems to be we should not rush it ignoring existing issues such as
hotplug.
> And I never heard
> the Infiniband users complaining about not having these things.
I did.
--
MST
OK, after more investigation, I realize you are right.
We definitely need the IOMMU protection for interrupts, and
if we have it, a lot of the code for config space protection is pointless.
It does seem that the Intel intr_remapping code does what we want
(accidentally) but that the AMD iommu code does not yet do any
interrupt remapping. Joerg - can you comment? On the roadmap?
I should have an AMD system w IOMMU in a couple of days, so I
can check this out.
>
> > If the device doesn't have its MSI-X registers in nice page aligned
> > areas, then it is not "well-behaved" and it is S.O.L. The SR-IOV spec
> > recommends that devices be designed the well-behaved way.
> >
> > When the code in vfio_pci_config speaks of "virtualization" it means
> > that there are fake registers which the user driver can read or write,
> > but do not affect the real registers. BARs are one case, MSI regs
> > another. The PCI vendor and device ID are virtual because SR-IOV
> > doesn't supply them but I wanted the user driver to find them in the
> > same old place.
>
> Sorry, I still don't understand why do we bother. All this is already
> implemented in userspace. Why can't we just use this existing userspace
> implementation? It seems that all kernel needs to do is prevent
> userspace from writing BARs.
I assume the userspace of which you speak is qemu? This is not what I'm
doing with vfio - I'm interested in the HPC networking model of direct
user space access to the network.
> Why can't we replace all this complexity with basically:
>
> if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
> return -ENOPERM;
>
> And maybe another register or two. Most registers should be fine.
>
> > [ Re: Hotplug and Suspend/Resume]
> > There are *plenty* of real drivers - brand new ones - which don't
> > bother with these today. Yeah, I can see adding them to the framework
> > someday - but if there's no urgent need then it is way down the
> > priority list.
>
> Well, for kernel drivers everything mostly works out of the box, it is
> handled by the PCI subsystem. So some kind of framework will need to be
> added for userspace drivers as well. And I suspect this issue won't be
> fixable later without breaking applications.
Whatever works out of the box for the kernel drivers which don't implement
suspend/resume will work for the user level drivers which don't.
>
> > Meanwhile, the other uses beckon.
>
> Which other uses? I thought the whole point was fixing
> what's broken with current kvm implementation.
> So it seems to be we should not rush it ignoring existing issues such as
> hotplug.
Non-kvm cases. That don't care about suspend/resume.
In that case, I don't see the point of fake config space registers
at all. For virtualization we need them to run unmodified guest drivers,
but we have an implementation in qemu. If you write your own custom drivers
for HPC, just get -EPERM from kernel and handle this.
> > Why can't we replace all this complexity with basically:
> >
> > if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
> > return -ENOPERM;
> >
> > And maybe another register or two. Most registers should be fine.
> >
> > > [ Re: Hotplug and Suspend/Resume]
> > > There are *plenty* of real drivers - brand new ones - which don't
> > > bother with these today. Yeah, I can see adding them to the framework
> > > someday - but if there's no urgent need then it is way down the
> > > priority list.
> >
> > Well, for kernel drivers everything mostly works out of the box, it is
> > handled by the PCI subsystem. So some kind of framework will need to be
> > added for userspace drivers as well. And I suspect this issue won't be
> > fixable later without breaking applications.
>
> Whatever works out of the box for the kernel drivers which don't implement
> suspend/resume will work for the user level drivers which don't.
How will hotplug/driver unload work then?
> >
> > > Meanwhile, the other uses beckon.
> >
> > Which other uses? I thought the whole point was fixing
> > what's broken with current kvm implementation.
> > So it seems to be we should not rush it ignoring existing issues such as
> > hotplug.
> Non-kvm cases.
Try describing these use-cases in more detail then.
And given that you want to use this driver for networking,
please copy netdev, not just linux-kernel.
> That don't care about suspend/resume.
>
>
>
Let's start with hotplug please. This is a really clear-cut issue:
all pci drivers get an event on hotplug and hotunplug, you need
such events for userspace and you need them on day 0, otherwise
userspace which ignores hotplug will get written, and you will
have to support it forever.
--
MST
> OK, after more investigation, I realize you are right.
> We definitely need the IOMMU protection for interrupts, and
> if we have it, a lot of the code for config space protection is pointless.
> It does seem that the Intel intr_remapping code does what we want
> (accidentally) but that the AMD iommu code does not yet do any
> interrupt remapping. Joerg - can you comment? On the roadmap?
Work on this is planned, but not at a high priority by now. I can
re-prioritize this item if needed.
Joerg
As a stop-gap measure, we could get by with a
portable API that let us figure out whether a given iommu supports
interrupt remapping.
--
MST
Hi Tom,
I found a few bugs. Patch below. The first chunk clears the
pci_config_map on close, otherwise we end up passing virtualized state
from one user to the next. The second is an off by one in the basic
perms. Finally, vfio_bar_fixup() needs an overhaul. It wasn't setting
the lower bits right and is allowing virtual writes of bits that aren't
aligned to the size. This section probably needs another pass or two of
refinement. Thanks,
Alex
Signed-off-by: Alex Williamson <alex.wi...@redhat.com>
---
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 96639e5..a0e8227 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -129,6 +129,10 @@ static int vfio_release(struct inode *inode, struct file *filep)
eventfd_ctx_put(vdev->ev_msi);
vdev->ev_irq = NULL;
}
+ if (vdev->pci_config_map) {
+ kfree(vdev->pci_config_map);
+ vdev->pci_config_map = NULL;
+ }
vfio_domain_unset(vdev);
/* reset to known state if we can */
(void) pci_reset_function(vdev->pdev);
diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
index c821b5d..f6e26b1 100644
--- a/drivers/vfio/vfio_pci_config.c
+++ b/drivers/vfio/vfio_pci_config.c
@@ -79,18 +79,18 @@ struct perm_bits {
static struct perm_bits pci_cap_basic_perm[] = {
{ 0xFFFFFFFF, 0, }, /* 0x00 vendor & device id - RO */
{ 0, 0xFFFFFFFC, }, /* 0x04 cmd & status except mem/io */
- { 0, 0xFF00FFFF, }, /* 0x08 bist, htype, lat, cache */
- { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x0c bar */
+ { 0, 0, }, /* 0x08 class code & revision id */
+ { 0, 0xFF00FFFF, }, /* 0x0c bist, htype, lat, cache */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x10 bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x14 bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x18 bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x1c bar */
{ 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x20 bar */
- { 0, 0, }, /* 0x24 cardbus - not yet */
- { 0, 0, }, /* 0x28 subsys vendor & dev */
- { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x2c rom bar */
- { 0, 0, }, /* 0x30 capability ptr & resv */
- { 0, 0, }, /* 0x34 resv */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x24 bar */
+ { 0, 0, }, /* 0x28 cardbus - not yet */
+ { 0, 0, }, /* 0x2c subsys vendor & dev */
+ { 0xFFFFFFFF, 0xFFFFFFFF, }, /* 0x30 rom bar */
+ { 0, 0, }, /* 0x34 capability ptr & resv */
{ 0, 0, }, /* 0x38 resv */
{ 0x000000FF, 0x000000FF, }, /* 0x3c max_lat ... irq */
};
@@ -318,30 +318,55 @@ static void vfio_virt_init(struct vfio_dev *vdev)
static void vfio_bar_fixup(struct vfio_dev *vdev)
{
struct pci_dev *pdev = vdev->pdev;
- int bar;
- u32 *lp;
- u32 len;
+ int bar, mem64 = 0;
+ u32 *lp = NULL;
+ u64 len = 0;
for (bar = 0; bar <= 5; bar++) {
- len = pci_resource_len(pdev, bar);
- lp = (u32 *)&vdev->vinfo.bar[bar * 4];
- if (len == 0) {
- *lp = 0;
- } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
- *lp &= ~0x1;
- *lp = (*lp & ~(len-1)) |
- (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
- if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
- bar++;
- } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
+ if (!mem64) {
+ len = pci_resource_len(pdev, bar);
+ lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+ if (len == 0) {
+ *lp = 0;
+ continue;
+ }
+
+ len = ~(len - 1);
+ } else
+ len >>= 32;
+
+ if (*lp == ~0U)
+ *lp = (u32)len;
+ else
+ *lp &= (u32)len;
+
+ if (mem64) {
+ mem64 = 0;
+ continue;
+ }
+
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
*lp |= PCI_BASE_ADDRESS_SPACE_IO;
- *lp = (*lp & ~(len-1)) |
- (*lp & ~PCI_BASE_ADDRESS_IO_MASK);
+ else {
+ *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
+ *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+ if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
+ *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+ mem64 = 1;
+ }
}
}
+
lp = (u32 *)vdev->vinfo.rombar;
len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
- *lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
+ len = ~(len - 1);
+
+ if (*lp == ~PCI_ROM_ADDRESS_ENABLE)
+ *lp = (u32)len;
+ else
+ *lp = *lp & ((u32)len | PCI_ROM_ADDRESS_ENABLE);
+
vdev->vinfo.bardirty = 0;
}
I still don't see why are we sticking all this emulation in kernel. It
is far from performance hotpath and can easily be emulated in userspace.
qemu does this, you can lift code from there if you like.
Maybe we need to protect the BARs from being manipulated by userspace,
but that should be all. No need for tables.
The benefit I see so far is that it removes duplicate code. Should
every user of this interface try to extract qemu's PCI config space
emulation and jury rig it into their code base? Tom is already
providing access to more capability bits than the kvm device assignment
code. If the kernel community will accept it, I think it saves vfio
usperspace writers some hassle and provides a better environment by
having emulation in a single, well tested, hopefully well used place.
Thanks,
Alex
I get it there's no chance you'll drop the "virtualization"
from the driver then?
--
MST
I am now stuck trying to get access to a system with interrupt remapping.
Turns out my Intel IOMMU doesn't have it.
Hi Tom,
Shouldn't we be matching the mlp based on daddr instead of vaddr? We
can have multiple dma address pointing at the same virtual address, so
dma address is the unique element. I'm also nervous about this dm_list.
For qemu device assignment, we're potentially statically mapping many GB
of iova space. It seems like this could get incredibly bloated and
slow. Thanks,
Alex
In weird circumstances, differing user vaddrs could reolve to the same physical address,
so the uniqueness of any mapping is the <vaddr,len>.
Yes, a linear list is slow, but does qemu need a lot of mappings, or just big ones?
That sounds like another argument for using daddr to me, no? There's
only one address space on the PCI bus, daddr. There's a 1:1 mapping of
daddr to physical page, but an N:1 mapping of vaddr to a physical page.
> Yes, a linear list is slow, but does qemu need a lot of mappings, or just big ones?
I'm not sure yet, the first interface I tried seems to be giving me
handfuls of pages. We sometimes reprogram the guest physical address
(daddr) to a new virtual address (vaddr), but I don't know what the old
virtual address was, so I can't unmap it. That's when I ran into the
issue above. Thanks,
Alex
Hi Tom,
This interface doesn't make sense for the MAP_IOVA user. Especially in
qemu, we have no idea what the DMA mask is for the device we're
assigning. It doesn't really matter though because the guest will use
bounce buffers internally once it loads the device specific drivers and
discovers the DMA mask. This only seems relevant if we're using a
DMA_MAP call that gets to pick the dmaaddr, so I'd propose we only make
this a required call for that interface, and create a separate ioctl for
actually enabling bus master. Thanks,
I expect there's no need for a separate ioctl to do this:
you can do this by write to the control register.
--
MST
Nope, vfio only allows direct writes to the memory and io space bits of
the command register, all other bits are virtualized. I wonder if
that's necessary though since we require the device to be attached to an
iommu domain before we allow config space access.
Alex
I don't see why's there need to protect the control register.
As far as I can see, nothing userspace does with it
can damage the host.
> all other bits are virtualized. I wonder if
> that's necessary though since we require the device to be attached to an
> iommu domain before we allow config space access.
>
> Alex
>
I don't think it's necessary. IMHO all the virtualization
tables can just be replaced with
if (pci header type == PCI_HEADER_TYPE_NORMAL)
if (addr < PCI_BASE_ADDRESS_0 + 24 &&
addr + len >= PCI_BASE_ADDRESS_0)
return -EPERM;
else /* similar thing for the bridge and cardbus
types */
Much simpler and more readable than tables full of 0xffff.
Reason this is enough is because virt drivers like qemu already
have the code to treat interrupt disable, MSI/MSIX capabilities
and BARs registers specially. custom userspace drivers simply
have no reason to touch anything besides the interrupt disable bit.
--
MST