vfio-pci


Javier Guerra Giraldez

Nov 15, 2013, 2:29:54 AM
to snabb...@googlegroups.com
still churning with com

i've added (still only in my fork) lib/hardware/vfio.lua, which should
be a drop-in replacement for lib/hardware/pci.lua

but am i going blind, or is there no interrupt handling? i guess the
CPU is more than fast enough to busy-wait for data and space in the
buffers, but i still assumed there would be something

--
Javier

Luke Gorrie

Nov 15, 2013, 5:41:21 AM
to snabb...@googlegroups.com
On 15 November 2013 08:29, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
> i've added (still only in my fork) lib/hardware/vfio.lua, which should
> be a drop-replacement of lib/hardware/pci.lua

Cool :-)

Can we keep both PCI interfaces going forward? Then we can use vfio
when it's supported by kernel+BIOS+CPU and the old style otherwise.

How does DMA address space work with vfio? (do we use snabbswitch's
own virtual process addresses and the IOMMU makes that work? or do we
get to decide somehow?)

without vfio we need to use physical memory addresses for DMA. If
that's different with vfio then drivers will need to know and choose
between e.g. buffer->pointer (virtual address) and buffer->physical
(physical address) for DMA addresses depending on which PCI access
method they are using.

> but am i getting blind or there are no interrupt handling? i guess
> the CPU is more than fast enough to busy-wait for data and space on
> the buffers, but still i assumed there was something

There are no interrupts at all. Life is simple :-)

Julian Stecklina

Nov 15, 2013, 7:15:06 AM
to snabb...@googlegroups.com
On 11/15/2013 11:41 AM, Luke Gorrie wrote:
> How does DMA address space work with vfio? (do we use snabbswitch's
> own virtual process addresses and the IOMMU makes that work? or do we
> get to decide somehow?)
>
> without vfio we need to use physical memory addresses for DMA. If
> that's different with vfio then drivers will need to know and choose
> between e.g. buffer->pointer (virtual address) and buffer->physical
> (physical address) for DMA addresses depending on which PCI access
> method they are using.

You can establish mappings in the IOMMU using an ioctl. I use it to
create 1:1 mappings, which makes life very easy. The problem with that
is that the IOMMU has an address limit that is usually smaller than the
size of virtual memory, so you can run into situations where you cannot
create 1:1 mappings.
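
For reference, a minimal LuaJIT FFI sketch of that ioctl (not the actual
sv3 or snabbswitch code): VFIO_IOMMU_MAP_DMA on an open VFIO container
fd, with iova == vaddr to get the 1:1 mapping. The struct layout and
flag values follow linux/vfio.h; the request number is computed by hand
from _IO(';', 113) and should be checked against your own headers.

local ffi = require("ffi")

ffi.cdef[[
struct vfio_iommu_type1_dma_map {
   uint32_t argsz;
   uint32_t flags;  /* read/write permission for the device */
   uint64_t vaddr;  /* process virtual address of the memory */
   uint64_t iova;   /* IO virtual address the device will use */
   uint64_t size;   /* length of the mapping in bytes */
};
int ioctl(int fd, unsigned long request, void *argp);
]]

local VFIO_DMA_MAP_FLAG_READ  = 1       -- 1 << 0
local VFIO_DMA_MAP_FLAG_WRITE = 2       -- 1 << 1
local VFIO_IOMMU_MAP_DMA      = 0x3b71  -- assumed: _IO(';', 100 + 13)

-- Map `size` bytes at virtual address `vaddr` so that the device sees
-- them at the same address (the 1:1 mapping described above).
local function map_1to1 (container_fd, vaddr, size)
   local m = ffi.new("struct vfio_iommu_type1_dma_map[1]")
   m[0].argsz = ffi.sizeof("struct vfio_iommu_type1_dma_map")
   m[0].flags = VFIO_DMA_MAP_FLAG_READ + VFIO_DMA_MAP_FLAG_WRITE
   m[0].vaddr = ffi.cast("uintptr_t", vaddr)
   m[0].iova  = m[0].vaddr
   m[0].size  = size
   assert(ffi.C.ioctl(container_fd, VFIO_IOMMU_MAP_DMA, m) == 0,
          "VFIO_IOMMU_MAP_DMA failed")
end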

Julian


Javier Guerra Giraldez

Nov 15, 2013, 9:00:40 AM
to snabb...@googlegroups.com
On Fri, Nov 15, 2013 at 5:41 AM, Luke Gorrie <lu...@snabb.co> wrote:
> Can we keep both PCI interfaces, going forward? so we can use vfio
> when supported by kernel+bios+cpu and otherwise old-style.

sure. i haven't implemented (yet) any alternative-picking code, but in
theory it's even possible to use VFIO for one device and 'old-style'
for another. i'm not sure if that would be useful, but it's
conceivable.

what i definitely don't want is to make any change to the driver code!


> How does DMA address space work with vfio? (do we use snabbswitch's
> own virtual process addresses and the IOMMU makes that work? or do we
> get to decide somehow?)

there's an ioctl() to map some memory into the whole "container" (which
handles an iommu_group, probably with more than one device in it); you
get to choose where in IO space that memory goes.

in sv3, this ioctl() is called with the same pointer as the IO virtual
address, which establishes the 1:1 mapping Julian mentions.

what i still don't have totally clear is at which moment that mapping
is created. in the VFIO sample code, they simply add a 1MB block
(mapped at IO address 0x0) before asking for the device's fd. in sv3,
it's called once for each region and in Switch::announce_dma_memory(),
but i don't get what that means.

probably i should do a similar thing: at device opening, get all the
regions and add a mapping for each one. that also solves the question
of how much RAM to map (since each region has an offset and size)
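
For orientation, a rough sketch of the VFIO open sequence that both the
kernel sample code and sv3 follow (container, then group, then device
fd), which is the point where those mappings get created. The ioctl
request numbers are assumptions derived from linux/vfio.h and should be
verified against your headers; error handling is minimal on purpose.

local ffi = require("ffi")

ffi.cdef[[
int open(const char *pathname, int flags);
int ioctl(int fd, unsigned long request, ...);
]]

local O_RDWR                   = 2
local VFIO_SET_IOMMU           = 0x3b66  -- assumed: _IO(';', 100 + 2)
local VFIO_GROUP_SET_CONTAINER = 0x3b68  -- assumed: _IO(';', 100 + 4)
local VFIO_GROUP_GET_DEVICE_FD = 0x3b6a  -- assumed: _IO(';', 100 + 6)
local VFIO_TYPE1_IOMMU         = 1

-- groupno is the NN of /dev/vfio/NN; pciaddr e.g. "0000:88:00.0".
local function open_vfio_device (groupno, pciaddr)
   local container = ffi.C.open("/dev/vfio/vfio", O_RDWR)
   local group     = ffi.C.open("/dev/vfio/"..groupno, O_RDWR)
   assert(container >= 0 and group >= 0, "cannot open VFIO files")
   -- attach the group to the container and pick the type1 IOMMU backend
   local cfd = ffi.new("int[1]", container)
   assert(ffi.C.ioctl(group, VFIO_GROUP_SET_CONTAINER, cfd) == 0)
   assert(ffi.C.ioctl(container, VFIO_SET_IOMMU,
                      ffi.cast("unsigned long", VFIO_TYPE1_IOMMU)) == 0)
   -- from here on VFIO_IOMMU_MAP_DMA works on `container`, so this is
   -- where the per-region / per-chunk mappings would be added
   local device = ffi.C.ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
                              ffi.cast("const char *", pciaddr))
   assert(device >= 0, "cannot get device fd")
   return container, group, device
end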


> without vfio we need to use physical memory addresses for DMA. If
> that's different with vfio then drivers will need to know and choose
> between e.g. buffer->pointer (virtual address) and buffer->physical
> (physical address) for DMA addresses depending on which PCI access
> method they are using.

as said before, i want total equivalence between both methods from the
point of view of the driver. in SnS the mapping is managed via the
/sys/bus/pci/...../resourceX file; i assumed that puts some device
memory into the CPU's address map and not the other way around. I guess
the exact IO starting address only matters when the value you write to
an IO register happens to be a memory address, right? I haven't seen
any translating code, so I guess it assumes a 1:1 mapping.
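
For reference, that "classic" register mapping boils down to mmap()ing
the sysfs resource file. A simplified stand-in for pci.map_pci_memory()
(not the real snabbswitch code), with the usual Linux constant values:

local ffi = require("ffi")

ffi.cdef[[
int open(const char *pathname, int flags);
void *mmap(void *addr, size_t length, int prot, int flags, int fd, long offset);
]]

local O_RDWR     = 2
local PROT_READ  = 1
local PROT_WRITE = 2
local MAP_SHARED = 1

-- Map BAR `n` of the device at `pciaddr` and return a pointer for
-- 32-bit register access.
local function map_pci_resource (pciaddr, n, size)
   local path = ("/sys/bus/pci/devices/%s/resource%d"):format(pciaddr, n)
   local fd = ffi.C.open(path, O_RDWR)
   assert(fd >= 0, "cannot open "..path)
   local ptr = ffi.C.mmap(nil, size, PROT_READ + PROT_WRITE, MAP_SHARED, fd, 0)
   assert(ffi.cast("intptr_t", ptr) ~= -1, "mmap failed")
   return ffi.cast("uint32_t *", ptr)
end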


> There are no interrupts at all. Life is simple :-)

good!


--
Javier

Luke Gorrie

Nov 15, 2013, 9:49:46 AM
to snabb...@googlegroups.com
On 15 November 2013 15:00, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
> there's an ioctl() to mmap some memory to the whole "container" (that
> handles an iommu_group, probably more than one device there) you get
> to choose where in IO space that memory goes.

That's neat.

There are a few plausible ways we might want to use this:

1:1 mapping between IOMMU addresses and Snabb Switch process address
space. Then we pass normal pointer values for DMA addresses.

1:1 mapping between IOMMU addresses and physical memory address space.
This way we need to translate our own virtual addresses into physical
addresses for DMA, which sounds like work, but is what the drivers
already do today. (Currently we allocate all of our DMA buffers from
~2MB HugeTLB pages and when we allocate a new HugeTLB we use
/proc/self/pagemap to learn its physical address and then keep track
of this for each buffer.)
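
As a reminder of what that pagemap lookup involves, here is a condensed
sketch of the idea behind memory.lua's virtual_to_physical() (simplified,
not the actual code): each 8-byte entry in /proc/self/pagemap holds the
physical frame number of one 4KB virtual page in bits 0..54.

local ffi = require("ffi")

local PAGE_SIZE = 4096

local function virtual_to_physical (pointer)
   local vaddr = ffi.cast("uint64_t", ffi.cast("uintptr_t", pointer))
   local f = assert(io.open("/proc/self/pagemap", "r"))
   f:seek("set", tonumber(vaddr / PAGE_SIZE) * 8)  -- one entry per page
   local entry = assert(f:read(8), "pagemap read failed")
   f:close()
   local e = ffi.new("uint64_t[1]")
   ffi.copy(e, entry, 8)
   local frame = e[0] % 2^55                       -- PFN is in bits 0..54
   return frame * PAGE_SIZE + vaddr % PAGE_SIZE
end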

1:1 mapping between IOMMU and the address space that QEMU guests will
dictate to us. This is tricky for us. Just like the IOMMU has to make
our life easy by using a memory map that we supply, so will SnS have
to make life easy for QEMU guests by using the memory map that they
tell us to use. (The guests will give us addresses that they think are
physical - in the VM - but are not really physical - on the hardware -
and we have to adapt.)

.. there is one other mapping case in our Virtio client towards the
Linux kernel. There we use an ioctl to setup a memory map that the
kernel needs to interpret (like with the IOMMU). Today we use a 1:1
map of our own address space.

So! Lots of exciting address spaces. I don't know if there is a nice
trick to reconcile them all together. It is at least nice that all
memory - DMA, guests, etc - will be mapped in and accessible
_somewhere_ in the SnS process address space.

> what i still don't get totally clear is at which moment is that
> mapping created. in the VFIO sample code, they simply add a 1MB block
> (mapped at IO address 0x0) before asking for the device's fd. in sv3,
> it's called once for each region and on Switch::announce_dma_memory(),
> but i don't get what that means.
>
> probably i should to a similar thing: at device opening get all the
> regions and add a mapping for each one. that also solves the question
> about how much RAM to map (since each region has an offset and size)

Today we have something close to this: update_vhost_memory_map() in
vhost.lua. That is: each time we allocate new memory that we want to
use for DMA we make an ioctl() to tell the kernel about it and create
a mapping.

This should possibly run on a 'hook' when new DMA memory is allocated.
For now in vhost it's done by this function in the driver regularly
checking whether new DMA memory has been allocated.

> as said before, i want total equivalence on both methods from the
> point of view of the driver. in SnS the mapping is managed by the
> /sys/bus/pci/...../resourceX file; i assumed that put some device
> memory in the CPU map and not the other way around. I guess the exact
> IO starting address is important when the value you write to an IO
> register happens to be a memory address, right? I haven't seen any
> translating code, so I guess it assumes 1:1 mapping.

Maybe already explained above .. for total equivalence we want to use
physical addresses for DMA, because the drivers are already doing the
conversion from virtual to physical addresses.

Javier Guerra Giraldez

Nov 15, 2013, 10:32:47 AM
to snabb...@googlegroups.com
On Fri, Nov 15, 2013 at 9:49 AM, Luke Gorrie <lu...@snabb.co> wrote:
> So! Lots of exciting address spaces. I don't know if there is a nice
> trick to reconcile them all together. It is at least nice that all
> memory - DMA, guests, etc - will be mapped in and accessible
> _somewhere_ in the SnS process address space.


ok, so i'll refactor this a little more, to make it easy to either
provide the starting address or by default use some hints. i'll check
the HugeTLB and /proc/self/pagemap handling to see how to wrap this in
a flexible enough way.

the way i see it, it should by default do the same as
mmap(/sys/pci..../resourceX), but if you already have some addresses
dictated, just pass them to the VFIO mapping call and it will try to
use them.

also, the current mmap() code returns the address of the RAM block; it
could just as easily also return the IO start address that was used.
that way it could use the provided address as a hint, but if it lies
outside the IOMMU-addressable space, try other hints. in any case, the
calling code gets the IO address that was used (and can maybe avoid
checking /proc/self/pagemap)


[PD: about the PCI/driver/VFIO shuffling of devices, after a few tests
i'm somewhat more comfortable about being able to return to sane states
without rebooting. in short, the current code uses
/sys/pci/.../unbind to remove the device from its current driver; for
vfio it's exactly the same: remove it from whatever driver is currently
holding it and add it to /dev/vfio. the other way around is more
difficult in the basic case, since most kernel drivers don't like to
take back some hardware after initialization; but in our case we're not
interested in any kernel driver. just taking it out of /dev/vfio and
not giving it back to anybody leaves it in the same state as the
current SnS code does.]

--
Javier

Javier Guerra Giraldez

Nov 16, 2013, 10:36:05 AM
to snabb...@googlegroups.com
On Fri, Nov 15, 2013 at 10:32 AM, Javier Guerra Giraldez
<jav...@guerrag.com> wrote:
> On Fri, Nov 15, 2013 at 9:49 AM, Luke Gorrie <lu...@snabb.co> wrote:
>> So! Lots of exciting address spaces. I don't know if there is a nice
>> trick to reconcile them all together. It is at least nice that all
>> memory - DMA, guests, etc - will be mapped in and accessible
>> _somewhere_ in the SnS process address space.
>
>
> ok, so i'll refactor this a little more, to make it easy to either
> provide the starting address or by default use some hints. i'll check
> the HugeTBL and /proc/self/pagemap handling to see how to wrap this in
> a flexible enough way.


ok, i had some misconceptions and was somewhat confused about the
meaning of memory mapping vs io mapping.

so, this is how i've seen it's done now:

SnS:

- on pci.map_pci_memory(), IO registers are mapped via the
/sys/bus/pci/devices/.../resourceX file

- memory.dma_alloc() allocates buffers from HugeTLB pages; for each one
it calls virtual_to_physical() to consult /proc/self/pagemap and returns
both the virtual and physical addresses. when the driver has to tell
the NIC about a buffer, it uses that stored physical address.


sv3:

- IO registers are mapped with the VfioDevice::map_bar() method,
which mmap()s a segment of the VFIO device, using the offset/size of a
device region, as reported by the VFIO_DEVICE_GET_REGION_INFO ioctl()


- buffers are allocated in different regions, and each is 'notified'
to the IOMMU in VfioGroup::map_memory_to_device(), which uses the
VFIO_IOMMU_MAP_DMA ioctl to map that memory region, using the same
address as the iova parameter, resulting in a 1:1 mapping. later, it
directly gives the NIC registers a buffer pointer.



currently, in my fork the first part (mmapping the IO registers) is
already done very similarly to sv3 with vfio.map_pci_memory(), which
directly replaces pci.map_pci_memory()
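
A compact sketch of what that sv3-style register mapping amounts to
(illustrative, not the actual vfio.map_pci_memory() code): ask the VFIO
device fd for a region's offset/size with VFIO_DEVICE_GET_REGION_INFO
and mmap() that window. The struct layout follows linux/vfio.h and the
request number (_IO(';', 100 + 8)) is an assumption to double-check.

local ffi = require("ffi")

ffi.cdef[[
struct vfio_region_info {
   uint32_t argsz, flags, index, resv;
   uint64_t size, offset;
};
int ioctl(int fd, unsigned long request, void *argp);
void *mmap(void *addr, size_t length, int prot, int flags, int fd, long offset);
]]

local VFIO_DEVICE_GET_REGION_INFO = 0x3b6c
local PROT_READ, PROT_WRITE, MAP_SHARED = 1, 2, 1

-- Map region `index` (0 = the register BAR) of an open VFIO device fd.
local function map_region (device_fd, index)
   local info = ffi.new("struct vfio_region_info[1]")
   info[0].argsz = ffi.sizeof("struct vfio_region_info")
   info[0].index = index
   assert(ffi.C.ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info) == 0)
   local ptr = ffi.C.mmap(nil, info[0].size, PROT_READ + PROT_WRITE,
                          MAP_SHARED, device_fd, info[0].offset)
   assert(ffi.cast("intptr_t", ptr) ~= -1, "mmap of region failed")
   return ffi.cast("uint32_t *", ptr), info[0].size
end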

for buffers, we could replace memory.allocate_next_chunk() to notify
the IOMMU about the just-allocated chunk. we get to choose the IOVA
parameter (either the physical address, or the virtual address (no
need to check /proc/self/pagemap), or do our own allocation of IOVA
space).

in any case, we store that IOVA as the 'physical' address and the
driver shouldn't care which we choose.
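
A hypothetical sketch of that interface: a dma_alloc() wrapper that
allocates a chunk, registers it with the IOMMU under a chosen IOVA, and
hands the driver a (pointer, IOVA) pair, so the driver never cares which
scheme was picked. `allocate_chunk` and `map_memory_to_iommu` stand in
for whatever the real memory.lua / vfio.lua functions end up being; the
choose_iova shown here picks the 1:1 (IOVA == virtual address) option.

local ffi = require("ffi")

local function choose_iova (pointer, size)
   -- 1:1 scheme: the device uses our virtual address as its IO address
   return ffi.cast("uint64_t", ffi.cast("uintptr_t", pointer))
end

local function dma_alloc (size, allocate_chunk, map_memory_to_iommu)
   local pointer = allocate_chunk(size)       -- HugeTLB, malloc, mmap, ...
   local iova = choose_iova(pointer, size)
   map_memory_to_iommu(pointer, iova, size)   -- e.g. VFIO_IOMMU_MAP_DMA
   -- the driver stores `iova` where it used to store the physical address
   return pointer, iova
end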

in other, different cases (qemu, kernel's virtio interface, etc), we
could use a given buffer and notify some ( 1:1? pagemap? other?) IOVA
to the IOMMU, and still keep it as the 'physical' in the chunk
structure.


--
Javier

Luke Gorrie

Nov 17, 2013, 5:28:50 AM
to snabb...@googlegroups.com
I think you understand the issues as well as I do, at least :-)

I reckon we have to live with these complexities:

1. Translate buffer addresses in a device-dependent way on RX/TX. e.g.
translate to SnS virtual address on RX and then on TX either: use the
virtual address again (for vfio), or translate to physical (for PCI
direct access), or translate to VM-physical (for virtio to a guest).
(There is a small sketch of this after point 2 below.)

2. Not all memory is usable for all I/O interfaces. Particularly, KVM
guests can only do Virtio-DMA with their own memory. They can't see
our HugeTLB memory. So if we want to send from HugeTLB to a guest then
it's not enough to translate the address, we need to make a copy into
memory owned by that guest (see sv3's VirtioDevice::receive).
Alternatively, if we want to support zero-copy with guests, we need to
be smart and make sure that hardware DMA transfers packets directly
into memory owned by the right VM, which should be either easy or hard
depending on the application.
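
A hypothetical sketch of point 1, picking the DMA address for a buffer
based on the access method of the device it is handed to. The field
names buffer.pointer / buffer.physical follow the existing buffer
struct; the `method` tag and the guest lookup are illustrative only.

local ffi = require("ffi")

local function dma_address (dev, buffer)
   if dev.method == "vfio" then
      -- 1:1 IOMMU map: the device can use our virtual address directly
      return ffi.cast("uint64_t", ffi.cast("uintptr_t", buffer.pointer))
   elseif dev.method == "rawpci" then
      -- direct PCI access: the NIC needs the real physical address
      return buffer.physical
   elseif dev.method == "virtio-guest" then
      -- virtio to a guest: translate to the guest-"physical" address the
      -- VM dictated (lookup maintained from the vhost memory map)
      return dev.guest_address_for(buffer)
   end
end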

Javier Guerra Giraldez

Nov 17, 2013, 11:45:52 PM
to snabb...@googlegroups.com
On Sun, Nov 17, 2013 at 5:28 AM, Luke Gorrie <lu...@snabb.co> wrote:
> 1. Translate buffer addresses in a device-dependent way on RX/TX. e.g.
> translate to SnS virtual address on RX and then on TX either: use
> virtual address again (for vfio), or translate to physical (for PCI
> direct access), or translate to VM-physical (for virtio to a guest).

yes, what i've written (and am polishing right now, it's still not as
readable as it should be) retains the virtual/physical address pair,
just as before (that way the drivers should just work); but what is
reported as physical is in fact what is given as the IOVA in the IOMMU
map, so it could be almost anything. right now it uses the values taken
from the pagemap, but ideally we might use a simple allocator just for
that space.

of course, this is only when it finds that a given NIC is on VFIO.
if it's not, the 'normal' pci.lua code is used exactly as before.

i'm also hoping that when all the NICs are used via VFIO, we can skip
the C.lock_memory() at the end of memory.lua, since it seems that the
VFIO_IOMMU_MAP_DMA ioctl() pins the given memory.

specifically, the way i'm doing it now is that most of the code
shouldn't require("pci") anymore. i'm moving the hardware scanning
code to a new "bus.lua" file, which checks each device to see if it's
accessible either via VFIO or via the classic way, and wires in the
corresponding set of functions to handle it.

so, the driver shouldn't call pci.map_pci_memory(),
pci.set_bus_master() or memory.dma_alloc() directly anymore, but
instead dev:map_memory(), dev:set_bus_master(), and dev:dma_alloc()
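
A hypothetical sketch of that bus.lua dispatch: each scanned device gets
a table of methods wired in according to how it can be accessed, so
driver code only ever calls dev:... methods. All names below are
placeholders for illustration, not the real module contents.

-- placeholder method sets; in the real code these would wrap the
-- vfio.lua and pci.lua implementations respectively
local vfio_methods = {
   map_memory     = function (dev, n) --[[ VFIO region mmap ]] end,
   set_bus_master = function (dev, enable) end,
   dma_alloc      = function (dev, size) end,
}
local classic_methods = {
   map_memory     = function (dev, n) --[[ sysfs resourceN mmap ]] end,
   set_bus_master = function (dev, enable) end,
   dma_alloc      = function (dev, size) end,
}

local function is_vfio_bound (pciaddress)
   -- bound to vfio-pci if the device shows up under that driver in sysfs
   local f = io.open("/sys/bus/pci/drivers/vfio-pci/"..pciaddress.."/enable")
   if f then f:close() return true end
   return false
end

local function wire_device (pciaddress)
   local dev = { pciaddress = pciaddress }
   dev.method = is_vfio_bound(pciaddress) and "vfio" or "classic"
   local methods = (dev.method == "vfio") and vfio_methods or classic_methods
   for name, fn in pairs(methods) do dev[name] = fn end
   return dev
end

-- driver code is then access-method agnostic:
--   local regs = dev:map_memory(0)
--   dev:set_bus_master(true)
--   local pointer, dma_addr = dev:dma_alloc(4096)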


> 2. Not all memory is usable for all I/O interfaces. Particularly, KVM
> guests can only do Virtio-DMA with their own memory. They can't see
> our HugeTLB memory. So if we want to send from HugeTLB to a guest then
> it's not enough to translate the address, we need to make a copy into
> memory owned by that guest (see sv3's VirtioDevice::receive).
> Alternatively, if we want to support zero-copy with guests, we need to
> be smart and make sure that hardware DMA transfers packets directly
> into memory owned by the right VM, which should be either easy or hard
> depending on the application.


right, that's a complex set of complications :-)

in theory, if the driver "inside" the KVM specifies a memory buffer,
we could iommu map it, assigning a unique IOVA, and use that in the IO
registers (the current physical address); but only when the switch
code determines that this packet must go from the NIC to that specific
VM. if the switch has to read the packet to direct it, then we've
already lost the advantage, no?

still, in the (very common) cases where the VM gets either a whole
port, or something easily handled in hardware (e.g., discriminating
just by MAC address or VLAN id), direct iommu mapping sounds
highly desirable.



on a related note, i've noticed that when i 'move' a single PCI
address (like 0000:88:00.0) to the VFIO driver, it grabs a whole lot
of them

(specifically: 0000:01:00.0 0000:01:00.1 0000:03:00.0 0000:03:00.1
0000:05:00.0 0000:05:00.1 0000:07:00.0 0000:07:00.1 0000:09:00.0
0000:09:00.1 0000:82:00.0 0000:82:00.1 0000:84:00.0 0000:84:00.1
0000:86:00.0 0000:86:00.1 0000:88:00.0 0000:88:00.1 0000:8a:00.0
0000:8a:00.1; occupying iommu groups 15, 16, 17, 18, 19, 39, 40,
41, 42 and 43, with two devices each).

it seems that the IOMMU grouping granularity isn't very fine. of
course, that's not too bad, since each iommu group can be 'given' to a
different user or group (just chown the /dev/vfio/xx file). Also, I
still hope that 'virtual' devices (SR-IOV?) could fall into a different
IOMMU group, so each endpoint could be given to a different user.



--
Javier

Julian Stecklina

Nov 18, 2013, 8:57:28 AM
to snabb...@googlegroups.com
On 11/18/2013 05:45 AM, Javier Guerra Giraldez wrote:
> on a related note, i've noticed that when i 'move' a single PCI
> address (like 0000:88:00.0) to the VFIO driver, it grabs a whole lot
> of them
>
> (specifically: 0000:01:00.0 0000:01:00.1 0000:03:00.0 0000:03:00.1
> 0000:05:00.0 0000:05:00.1 0000:07:00.0 0000:07:00.1 0000:09:00.0
> 0000:09:00.1 0000:82:00.0 0000:82:00.1 0000:84:00.0 0000:84:00.1
> 0000:86:00.0 0000:86:00.1 0000:88:00.0 0000:88:00.1 0000:8a:00.0
> 0000:8a:00.1; occupying iommu groups 15, 16, 17, 18, 19, 39, 40,
> 41, 42 and 43, with two devices each).
>
> it seems to be that the IOMMU granularity isn't too fine. of course,
> that's not too bad, since each iommu group can be 'given' to a
> different user or group (just chown the /dev/vfio/xx file). Also, I
> still hope that 'virtual' devices (SR-IOV?) could fall in a different
> IOMMU group, so each endpoint could be given to a different user.

I have the same problem with my development box (DQ77MK motherboard),
where I have to assign both ports of the dual-port 10G NIC plus one of
the onboard NICs to one IOMMU group.

The workstation boards we tried were usually fine.

What do you use?

Julian


Luke Gorrie

Nov 18, 2013, 9:14:54 AM
to snabb...@googlegroups.com
On 18 November 2013 05:45, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
> yes, what i've written (and am polishing right now, it's still not so
> readable as it should)

I just read through the full diff on your github fork, that code is
quite clear :)

> retains the virtual/physical address pair, just
> as before (that way the drivers should just work); but what is
> reported as physical is in fact what is given as the IOVA in the IOMMU
> map, so could be almost anything. right now it uses the values taken
> from the pagemap, but ideally we might use a simple allocator just for
> that space.

In the short term I think it's best to continue using these real
physical addresses from pagemap. That way we should be able to use
vfio and rawpci devices in the same SnS instance -- since they both
expect a real physical address in buffer->physical.

we can rethink this once we throw the KVM guest virtio into the mix.
It should be easier to figure out the neat solution when we have all
the messy details in front of us.

> i'm also hoping that when all the NICs are used via VFIO, we can skip
> the C.lock_memory() at the end of memory.lua, since it seems that the
> VFIO_IOMMU_MAP_DMA ioctl() pins the given memory.

lock_memory() is a feature! :-)

I'm a firmware hacker at heart. I want to program the hardware, not
the kernel. So I'm always looking for ways to politely ask the kernel
to step out of the way. lock_memory() stops the kernel from moving us
around in memory/swap, Linux 'isolcpus' + 'taskset -c' stops the
kernel multiplexing stuff on our CPUs, explicitly allocating HugeTLBs
stops the kernel being creative about what memory we get, and so on.
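
For context, locking the process into RAM is a one-line libc call. A
minimal LuaJIT sketch of what C.lock_memory() presumably amounts to
(mlockall with the usual Linux flag values); this is an illustration,
not the actual memory.c code:

local ffi = require("ffi")
ffi.cdef[[ int mlockall(int flags); ]]
local MCL_CURRENT, MCL_FUTURE = 1, 2

-- pin all current and future pages of this process into physical memory
-- (needs root or CAP_IPC_LOCK / a generous RLIMIT_MEMLOCK)
assert(ffi.C.mlockall(MCL_CURRENT + MCL_FUTURE) == 0, "mlockall failed")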

This in my mind is the key to getting really predictable high
performance. I see predictability as the really valuable thing that
x86 has traditionally failed to deliver and the reason people have
been buying so many weird hardware network devices. If we can have
transparent and predictable performance then it will really make
users' lives easier and allow them to deploy snabb switches instead of
hardware devices.

There even exists an "IOTLB bottleneck problem" where the IOMMU can
kill DMA performance due to cache misses in its address translation
hardware. So it may be that IOMMU is only desirable with really modern
hardware e.g. Linux 3.8 and Xeon Ivy Bridge.

So: I am a bit of a technological cave man when it comes to all these
fancy kernel features. I once spent a few months programming in Forth
and I felt completely at home :-).

> specifically, the way i'm doing it now is that most of the code
> should't require("pci") anymore. i'm moving the hardware scanning
> code to a new "bus.lua" file, that checks each device to see if it's
> accesible either via VFIO, or via the classic way and wires in the
> corresponding set of functions to handle it.
>
> so, the driver shouldn't just call pci.map_pci_memory(),
> pci.set_bus_master() or memory.dma_alloc(), but instead
> dev:map_memory(), dev:set_bus_master(), and dev:dma_alloc()

Sounds good!

> in theory, if the driver "inside" the KVM specifies a memory buffer,
> we could iommu map it, assigning a unique IOVA, and use that in the IO
> registers (the current physical address); but only when the switch
> code determines that this packet must go from the NIC to that specific
> VM. if the switch has to read the packet to direct it, then we've
> already lost the advantage, no?

Yes. This can't be done for the general case but it can be done for
interesting special cases.

Example of what I actually want to implement:

The Intel 82599 NIC has 128 separate DMA receive queues.
Each queue could be filled with buffers belonging to one specific VM.
The NIC can select which queue to use based on packet characteristics:
MAC address, IP address, VLAN, etc.
SnS can then send the packet to the VM (or not) simply by passing a
descriptor, without copying packet data.

So full zero-copy I/O is not solvable for the theoretical general case
but it should work for interesting special cases e.g. VMs with unique
MAC addresses connected to the network. (How could we have such
awesome network hardware that can filter and direct packets before
they are even stored in memory and not bother to use it? :-))

> (specifically: 0000:01:00.0 0000:01:00.1 0000:03:00.0 0000:03:00.1
> 0000:05:00.0 0000:05:00.1 0000:07:00.0 0000:07:00.1 0000:09:00.0
> 0000:09:00.1 0000:82:00.0 0000:82:00.1 0000:84:00.0 0000:84:00.1
> 0000:86:00.0 0000:86:00.1 0000:88:00.0 0000:88:00.1 0000:8a:00.0
> 0000:8a:00.1; occupying iommu groups 15, 16, 17, 18, 19, 39, 40,
> 41, 42 and 43, with two devices each).

Interesting. Each PCI card is physically two ports which each appear
as a PCI device. The IOMMU perhaps treats them both as a unit?

Luke Gorrie

Nov 18, 2013, 9:30:54 AM
to snabb...@googlegroups.com
On 18 November 2013 14:57, Julian Stecklina
<jste...@os.inf.tu-dresden.de> wrote:
> I have the same problem with my development box (DQ77MK motherboard),
> where I have to assign both ports of the dual-port 10G NIC plus one of
> the onboard NICs to one IOMMU group.
>
> The workstation boards we tried were usually fine.
>
> What do you use?

Javier is on chur.snabb.co which is this Supermicro:
https://www.ahead-it.eu/en/shop/servers/3u-servers/DIS832/info/supermicro-dis832-max-8-x-satasas-hot-swap--16-x-ddr-3--redundant-psu--11-x-expansion-card

Julian Stecklina

Nov 18, 2013, 10:05:27 AM
to snabb...@googlegroups.com
I have a very similar setup (dual socket, same chipset) at home in my
workstation that I haven't used for sv3 development yet. This is quite
unfortunate... I'll have to check the Linux code to see why it wants to
add so many devices to a single group. My rough plan for the future was
one process per device, but given the state of VT-d that doesn't seem
viable.

Julian



Julian Stecklina

Nov 18, 2013, 10:08:22 AM
to snabb...@googlegroups.com
On 11/18/2013 03:14 PM, Luke Gorrie wrote:
> There even exists an "IOTLB bottleneck problem" where the IOMMU can
> kill DMA performance due to cache misses in its address translation
> hardware. So it may be that IOMMU is only desirable with really modern
> hardware e.g. Linux 3.8 and Xeon Ivy Bridge.

Do you have any references to people experiencing this in the wild? I
can imagine it being a problem for small packets.

Julian


Luke Gorrie

Nov 18, 2013, 10:34:33 AM
to snabb...@googlegroups.com
On 18 November 2013 16:08, Julian Stecklina <jste...@os.inf.tu-dresden.de> wrote:
Just rumors. But I am itching to have a nice benchmarking setup on
e.g. chur that can run network-heavy OpenStack workloads and start
peeking in CPU performance counters :-)

Javier Guerra Giraldez

Nov 18, 2013, 10:46:05 AM
to snabb...@googlegroups.com
On Mon, Nov 18, 2013 at 9:14 AM, Luke Gorrie <lu...@snabb.co> wrote:
> On 18 November 2013 05:45, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
>> yes, what i've written (and am polishing right now, it's still not so
>> readable as it should)
>
> I just read through the full diff on your github fork, that code is
> quite clear :)

thanks, i kept ironing it for a few hours after that email


>> retains the virtual/physical address pair, just
>> as before (that way the drivers should just work); but what is
>> reported as physical is in fact what is given as the IOVA in the IOMMU
>> map, so could be almost anything. right now it uses the values taken
>> from the pagemap, but ideally we might use a simple allocator just for
>> that space.
>
> In the short term I think it's best to continue using these real
> physical addresses from pagemap. That way we should be able to use
> vfio and rawpci devices in the same SnS instance -- since they both
> expect a real physical address in buffer->physical.

right, it's the easiest way at the moment.

i've just changed the dma_alloc() semantics from returning a (virtual,
physical) pair of addresses to (virtual, "something the NIC
understands, don't worry too much about it"), where that "something"
just happens to coincide with the physical address.


> we can rethink this once we throw the KVM guest virtio into the mix.
> It should be easier to figure out the neat solution when we have all
> the messy details in front of us.

yep, no matter what we plan, i'm sure we'll get some unexpected curves.


>> i'm also hoping that when all the NICs are used via VFIO, we can skip
>> the C.lock_memory() at the end of memory.lua, since it seems that the
>> VFIO_IOMMU_MAP_DMA ioctl() pins the given memory.
>
> lock_memory() is a feature! :-)

hehe, right; but a privileged feature! if you want SnS to be able
to run as non-root, then this call must be optional.


> There even exists an "IOTLB bottleneck problem" where the IOMMU can
> kill DMA performance due to cache misses in its address translation
> hardware. So it may be that IOMMU is only desirable with really modern
> hardware e.g. Linux 3.8 and Xeon Ivy Bridge.

yeah, i think it must be clear that in some cases running non-root
(VFIO, no mlock()) might not be as fast as possible.


> So: I am a bit of a technological cave man when it comes to all these
> fancy kernel features. I once spent a few months programming in Forth
> and I felt completely at home :-).

yeah, it's so nice when you're able to hold the whole machine state in
your head!


>> specifically, the way i'm doing it now is that most of the code
>> should't require("pci") anymore. i'm moving the hardware scanning
>> code to a new "bus.lua" file, that checks each device to see if it's
>> accesible either via VFIO, or via the classic way and wires in the
>> corresponding set of functions to handle it.
>>
>> so, the driver shouldn't just call pci.map_pci_memory(),
>> pci.set_bus_master() or memory.dma_alloc(), but instead
>> dev:map_memory(), dev:set_bus_master(), and dev:dma_alloc()
>
> Sounds good!

great, it's still a little "too OOP", and most of these calls have no
use for the 'self', but i think it's better to include it in the usage
contract.

i guess 0000:01:00.0 and 0000:01:00.1 are the two ports of the same
(first?) card; both belong to IOMMU group 15. the VFIO docs make it
clear that you have to assign all devices of a group to the same
process; but the strange part is that it immediately assigns so many
devices, from several groups

--
Javier

Luke Gorrie

Nov 18, 2013, 10:48:29 AM
to snabb...@googlegroups.com
On 18 November 2013 16:46, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
> hehe, right; but a privileged feature!. if you want SnS to be able
> to run on non-root, then this call must be optional.

Oh good point :-)

Javier Guerra Giraldez

Nov 18, 2013, 11:19:36 AM
to snabb...@googlegroups.com
AFAICT, it doesn't assign them to a single group. these are 10 groups,
2 devices each (i guess each group is a card and each device is a NIC
port). but what happens is that it takes all of them, creating 10
/dev/vfio/XX files.

maybe it's because all of them are already unbound?

i've been trying to find where in the Linux sources the handling of
/sys/bus/pci/drivers/vfio-pci/new_id lives, but to no avail....

--
Javier

Javier Guerra Giraldez

Nov 19, 2013, 1:46:28 AM
to snabb...@googlegroups.com
On Mon, Nov 18, 2013 at 11:19 AM, Javier Guerra Giraldez
<jav...@guerrag.com> wrote:
> AFAICT, it doesn't assign to a single group. these are 10 groups, 2
> devices each (i guess each group is a car and each device is a NIC
> port). but what happens is that it takes all of them, creating 10
> /dev/vfio/XX files.
>
> maybe it's because all of them are already unbound?


mystery solved:

writing to the /sys/bus/pci/drivers/<driver-name>/new_id file is
handled by store_new_id() in linux_src/drivers/pci/pci-driver.c; it
takes 7 hex numbers: vendor, device, subvendor, subdevice,
class, class_mask, driver_data (only the first two are required). as
far as i can tell, none of that identifies a single card, all
non-bound devices that fit that description are added to the driver.

looking in the sources for an alternative way, i found that
/sys/bus/pci/drivers/<driver-name>/bind takes the "name" of the
device, so:

echo '0000:88:00.0' | sudo tee /sys/bus/pci/drivers/vfio-pci/bind

does add just the single 0000:88:00.0 device to the vfio-pci, creating
the /dev/vfio/42 file. of course, it doesn't work until i add
"0000:88:00.1" too to complete the group #42

i wonder why this isn't listed as the main way to add devices to the
vfio driver.
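
A hypothetical Lua helper for that sysfs dance (the function name and
error handling are illustrative, not code from the fork): unbind the
device from whatever driver holds it, then bind it to vfio-pci by PCI
address via the per-device `bind` file rather than the shotgun `new_id`
interface.

local function bind_to_vfio (pciaddress)
   -- release the device from its current driver, if it has one
   local unbind = io.open("/sys/bus/pci/devices/"..pciaddress.."/driver/unbind", "w")
   if unbind then unbind:write(pciaddress) unbind:close() end
   -- tell vfio-pci to pick up exactly this device
   local bind = assert(io.open("/sys/bus/pci/drivers/vfio-pci/bind", "w"))
   bind:write(pciaddress)
   bind:close()
end

-- e.g. bind_to_vfio("0000:88:00.0"); repeat for every device in the same
-- IOMMU group before the group's /dev/vfio/NN file becomes usable.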

--
Javier

Julian Stecklina

Nov 19, 2013, 7:49:15 AM
to snabb...@googlegroups.com
On 11/19/2013 07:46 AM, Javier Guerra Giraldez wrote:
> echo '0000:88:00.0' | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
>
> does add just the single 0000:88:00.0 device to the vfio-pci, creating
> the /dev/vfio/42 file. of course, it doesn't work until i add
> "0000:88:00.1" too to complete the group #42

On my development system I end up with 3 NICs in a single group. This sucks. :(

> i wonder why this isn't listed as the main way to add devices to the
> vfio driver.

Good point.

Julian



Javier Guerra Giraldez

Nov 20, 2013, 3:29:19 AM
to snabb...@googlegroups.com
it's working!

the app/ and design/ scripts no longer require("pci"), and they work
with both vfio and classic pci access at the same time; no distinction
is made

the designs.spammer.spammer script crashes with an "out of packets"
message; is that normal? it seems it does that in the trunk version
too.

to set a device to vfio, this can be used:

sudo src/snabbswitch -l lib.hardware.vfio -e
"lib.hardware.vfio.setup_vfio('0000:88:00.1', true)"

if the last parameter is true, it moves all devices of the same group.

to release a device from vfio:

sudo src/snabbswitch -l lib.hardware.pci -e
"lib.hardware.pci.unbind_device_from_linux('0000:88:00.1')"

i guess shorter names are in order...

--
Javier

Luke Gorrie

Nov 20, 2013, 9:30:26 AM
to snabb...@googlegroups.com
On Wednesday, November 20, 2013, Javier Guerra Giraldez wrote:
> it's working!

Awesome !!!

Will take it for a spin and check on the troubles you mention :)
 

Luke Gorrie

Nov 21, 2013, 5:44:15 AM
to snabb...@googlegroups.com
On 20 November 2013 09:29, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
> the designs.spammer.spammer script crashes with an "out of packets"
> message; is that normal? it seems it does that in the trunk version
> too.

I think there's a general issue with doing DMA on chur that started
roughly at the time we rebooted with "intel_iommu=on".

I'm seeing a bunch of messages like this:

luke@chur:~/hacking/javier/snabbswitch/src$ dmesg | tail
[699465.911029] DMAR:[fault reason 01] Present bit in root entry is clear
[792724.120210] dmar: DRHD: handling fault status reg 102
[792724.131122] dmar: DMAR:[DMA Read] Request device [01:00.0] fault
addr 73f620000
[792724.131122] DMAR:[fault reason 06] PTE Read access is not set
[792760.900862] dmar: DRHD: handling fault status reg 202
[792760.912039] dmar: DMAR:[DMA Read] Request device [01:00.0] fault
addr 73fa20000
[792760.912039] DMAR:[fault reason 06] PTE Read access is not set
[792809.321244] dmar: DRHD: handling fault status reg 302
[792809.331871] dmar: DMAR:[DMA Read] Request device [01:00.0] fault
addr 73f620000
[792809.331871] DMAR:[fault reason 06] PTE Read access is not set

Do we need to ask the IOMMU for more permissions? (Is this perhaps
even the case when not using vfio but simply having intel_iommu=on?)

Javier Guerra Giraldez

Nov 21, 2013, 10:14:29 AM
to snabb...@googlegroups.com
On Thu, Nov 21, 2013 at 5:44 AM, Luke Gorrie <lu...@snabb.co> wrote:
> [792809.331871] dmar: DMAR:[DMA Read] Request device [01:00.0] fault
> addr 73f620000
> [792809.331871] DMAR:[fault reason 06] PTE Read access is not set
>
> Do we need to ask the IOMMU for more permissions? (Is this perhaps
> even the case when not using vfio but simply having intel_iommu=on?)


i see some forum comments [1] saying that it's a startup issue for
some hardware. they advise adding "iommu=pt" (passthrough mode) to the
boot line, but reading the sources it seems to be an AMD thing only. i
don't see a description of it anywhere; the kernel docs only list it
[2], without any comment.

these messages are mentioned in [3], but only as an example of how the
Intel iommu driver reports faults. it seems that the DMA remapping
initialization tries to put this device (e.g. 01:00.0) at this address
(e.g. 73f620000) but no read access was set up before that. i think
that document assumes kernelspace drivers, so most of that
initialization should occur at startup; but for userspace drivers, it
could stay in undefined states for a long time until the user actually
runs the driver.

do these messages appear at startup or when executing snabbswitch? if
at startup, it might not be a problem; but if it's when executing our
code, then i guess we should choose the RAM address that gets mapped to
IO differently.


[1]:https://groups.google.com/forum/#!topic/fa.linux.kernel/QfkxD-1ejlU
[2]:https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/kernel-parameters.txt?id=refs/tags/v3.12#n1276
[3]:https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/Intel-IOMMU.txt?id=refs/tags/v3.12#n70


--
Javier

Luke Gorrie

Nov 21, 2013, 10:31:36 AM
to snabb...@googlegroups.com
On 21 November 2013 16:14, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
> do this messages appear at startup or when executing snabbswitch? if
> at startup, it might not be a problem; but if it's when executing our
> code, then i guess we should choose differently the ram address that
> gets mapped to IO.

It happens when executing snabb switch. I test with the loadgen design
updated to run on your branch
(~luke/hacking/javier/snabbswitch/src/designs/loadgen/loadgen.lua).

Commandline:
make && sudo flock -x /tmp/snabb.lock ./snabbswitch
designs.loadgen.loadgen small-icmp.cap 01:00.0

prints reports showing lack of DMA activity:
0000:01:00.0 TXDGPC (TX packets) 0 GOTCL (TX octets) 0

and does create those messages freshly in dmesg.

Maybe DMA memory is not accessible?

Javier Guerra Giraldez

Nov 21, 2013, 11:39:15 AM
to snabb...@googlegroups.com
On Thu, Nov 21, 2013 at 10:31 AM, Luke Gorrie <lu...@snabb.co> wrote:
> It happens when executing snabb switch. I test with the loadgen design
> updated to run on your branch
> (~luke/hacking/javier/snabbswitch/src/designs/loadgen/loadgen.lua).

ah, ok. then it is an issue.


> Commandline:
> make && sudo flock -x /tmp/snabb.lock ./snabbswitch
> designs.loadgen.loadgen small-icmp.cap 01:00.0
>
> prints reports showing lack of DMA activity:
> 0000:01:00.0 TXDGPC (TX packets) 0 GOTCL (TX octets) 0
>
> and does create those messages freshly in dmesg.
>
> Maybe DMA memory is not accessible?

yes, it seems that the phys_addr might not be a good IOVA address.
i'll check tonight what happens if i do some 1:1 mappings, or maybe
whip out a simplistic allocator.

in any case there shouldn't be any change on the driver, it won't
complain if the phys_addr just happens to coincide with the memory
pointer value...

btw, is there any way to check if some real packets are finally sent
to the wire? maybe a tcpdump on the port at the other end of the
loopback wire


--
Javier

Luke Gorrie

Nov 21, 2013, 11:52:27 AM
to snabb...@googlegroups.com
Easy way to check network activity is the hardware statistics counters in the NIC. See dump_stats() in intel10g. A bunch of counters will become nonzero when there is real tx/rx.

Now with cabling your tcpdump idea should work too.

Javier Guerra Giraldez

Nov 22, 2013, 3:47:55 AM
to snabb...@googlegroups.com
On Thu, Nov 21, 2013 at 11:39 AM, Javier Guerra Giraldez
<jav...@guerrag.com> wrote:
> On Thu, Nov 21, 2013 at 10:31 AM, Luke Gorrie <lu...@snabb.co> wrote:
>>
>> Maybe DMA memory is not accessible?
>
> yes, it seems that the phys_addr might not be a good IOVA address.
> i'll check tonight what happens if i do some 1:1 mappings, or maybe
> whip out a simplistic allocator.

things i've found so far:

- the dmesg lines appear at the "dev.r.TDT(dev.tdt)" line at the end of
intel10g:sync_transmit()

- the error mentions a read failure, even though this is a register
write operation.

- it happens no matter if the card was accessed via VFIO or classic code.

- the register file is mapped by map_pci_memory(), which is the
"most different" function between normal pci and VFIO. the first one
mmaps /sys/bus/pci/devices/<dev>/resource0, the second one reads
region #0 and mmap()s the resulting offset/size (which is 0 /
128KB in this case)

- originally I thought the error would be about the memory mapped to
the card and not the io registers. for the first one we get to choose
the IOVA; for the other we only get the size/offset into a descriptor
file. therefore it's not about the choice of mapping, and wouldn't be
fixed by creating an allocator for the IOVA argument

- the addresses mentioned in the fault lines seem to grow slightly with
each invocation. there might be some resource release missing.

- i couldn't find that address used anywhere.

--
Javier

Julian Stecklina

Nov 22, 2013, 10:43:06 AM
to snabb...@googlegroups.com
On 11/22/2013 09:47 AM, Javier Guerra Giraldez wrote:
> On Thu, Nov 21, 2013 at 11:39 AM, Javier Guerra Giraldez
> <jav...@guerrag.com> wrote:
>> On Thu, Nov 21, 2013 at 10:31 AM, Luke Gorrie <lu...@snabb.co> wrote:
>>>
>>> Maybe DMA memory is not accessible?
>>
>> yes, it seems that the phys_addr might not be a good IOVA address.
>> i'll check tonight what happens if i do some 1:1 mappings, or maybe
>> whip out a simplistic allocator.
>
> things i've found so far:
>
> - the dmesg lines appear at the "dev.r.TDT(dev.tdt)" line at the end
> intel10g:sync_transmit()
>
> - the error mentions failure reading, even if this is a register write
> operation.

The IOMMU is only involved when the device itself reads or writes
memory (DMA). Writes to the MMIO space by the CPU do not involve the
IOMMU.

> - it happens no matter if the card was accessed via VFIO or classic code.

Then only two things can happen:
- the device can fault reading the TX descriptor ring
- the device can fault reading packet data

which means that something failed to set up the I/O memory mappings for
those two regions.

> - the address mentioned in the fail lines seem to grow slightly with
> each invocation. there might be some resource release missing.

If it grows by 16 or 32 or 64 bytes, it would be the descriptor ring.
Check the values in the corresponding TDBAL/H registers.
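
A quick check along those lines, assuming snabbswitch's register
convention where dev.r.REG() reads a register and dev.r.REG(v) writes it
(the write form appears above as dev.r.TDT(dev.tdt)), and assuming
TDBAL/TDBAH are among the registers intel10g.lua defines:

-- read back the TX descriptor ring base the driver programmed and
-- compare it with the "fault addr" value reported in dmesg
local function txdesc_ring_base (dev)
   local lo = tonumber(dev.r.TDBAL())   -- low 32 bits of the ring base
   local hi = tonumber(dev.r.TDBAH())   -- high 32 bits
   if lo < 0 then lo = lo + 2^32 end    -- in case the read comes back signed
   return hi * 2^32 + lo
end

-- usage sketch:
--   local base = txdesc_ring_base(dev)
--   print(base, base <= fault_addr and fault_addr < base + ring_bytes)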


Julian


Javier Guerra Giraldez

Nov 22, 2013, 11:19:18 AM
to snabb...@googlegroups.com
On Fri, Nov 22, 2013 at 10:43 AM, Julian Stecklina
<jste...@os.inf.tu-dresden.de> wrote:
>> - the dmesg lines appear at the "dev.r.TDT(dev.tdt)" line at the end
>> intel10g:sync_transmit()
>>
>> - the error mentions failure reading, even if this is a register write
>> operation.
>
> The IOMMU is only involved when the device writes to memory. Writes to
> the MMIO space by the CPU do not involve the IOMMU.


of course! the dev.r.TDT(dev.tdt) write triggers the device into
reading the rest of the data, right? _then_ it does a read through the
IOMMU, which fails (as reported in dmesg), so the problem is in the
choice of IOVA, as suspected all along!

the two clues that i misinterpreted, wasting lots of time:

- once i determined that the message was triggered by a register
access, I focused on the register access itself, instead of thinking
about the side effects of that operation

- i interpreted the "read failure" as a read from the CPU, but it could
just as well be a "read from the device"

i spent lots of time reading the kernel and trying to see how the
mmap()ing of IO space could go wrong... let's hope for a more fruitful
debugging session tonight!

thanks a lot

--
Javier

Luke Gorrie

Nov 22, 2013, 12:53:31 PM
to snabb...@googlegroups.com
On 22 November 2013 17:19, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
> i spent lots of time reading the kernel and trying to see how the
> mmap()ing of IO space could go wrong... lets hope a more fruitful
> debugging session tonight!

Good luck! :-)

Javier Guerra Giraldez

Nov 23, 2013, 2:50:54 AM
to snabb...@googlegroups.com
On Fri, Nov 22, 2013 at 10:43 AM, Julian Stecklina
<jste...@os.inf.tu-dresden.de> wrote:
> Check the values in the corresponding TDBAL/H registers.

right, that was it. it took a lot of work to get all the relevant
information, but in the end the numbers matched the dmesg error.

short story: it was the packet data, allocated by core/buffer.lua
using memory.dma_alloc(). the issue is that that memory is allocated
from HugeTLB and its physical address is looked up, but it's never
registered with the IOMMU, so it's not mapped for access from the
device.

changing buffer.lua to use vfio.dma_alloc() makes it work, at least
for vfio-managed devices. i think it's ugly to call a
lib/hardware/xxx file from a core/xxx, but maybe that means that the
memory management in vfio.lua really belongs somewhere else.

I also changed vfio.dma_alloc() to not bother checking the physical
memory address; it simply keeps a 'cursor' to assign new addresses for
each mapping.
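
A hypothetical sketch of that 'cursor' scheme (names are illustrative,
not the actual fork code): instead of looking up physical addresses in
pagemap, hand out IO virtual addresses sequentially and register each
chunk with the IOMMU under the address it was given.

local iova_cursor = 0x100000        -- start above zero so IOVA 0 stays unused

local function allocate_iova (size, align)
   align = align or 4096
   iova_cursor = math.ceil(iova_cursor / align) * align
   local iova = iova_cursor
   iova_cursor = iova_cursor + size
   return iova
end

-- map_memory_to_iommu stands in for the VFIO_IOMMU_MAP_DMA wrapper
local function dma_map_chunk (pointer, size, map_memory_to_iommu)
   local iova = allocate_iova(size)
   map_memory_to_iommu(pointer, iova, size)
   return iova    -- stored as the chunk's "physical" address for the driver
end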

now, about the 'normal' pci access: in Documentation/Intel-IOMMU [1]
it says: "Well behaved drivers call pci_map_*() calls before sending
command to device that needs to perform DMA." those pci_map_*() calls
are aliases for the dma_map_*() functions available to kernel drivers,
but not to userspace drivers.

for "classic userspace drivers", there's some mentions about mmap()ing
/dev/mem to get access to physical pages, but not to handle the IOMMU.

there's also the UIO driver family, specifically the Generic PCI UIO
driver[2], but this seems almost like a root-user-only VFIO. maybe
the biggest difference is that it doesn't have a concept of device
containers and groups.

in short: it works, and packet transmission shows up in the
counters... but only for devices handled via VFIO, and with a layering
violation in buffer.lua


[1]: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/Intel-IOMMU.txt?id=refs/tags/v3.12#n70
[2]: http://www.free-electrons.com/kerneldoc/latest/DocBook/uio-howto/uio_pci_generic.html
--
Javier

Luke Gorrie

Nov 24, 2013, 6:25:27 AM
to snabb...@googlegroups.com
On 23 November 2013 08:50, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
> right, that was it.  it took a lot of work to get all the relevant
> information, but in the end the numbers matched the dmesg error.

Awesome detective work ! Great fun reading along on the progress :)

> changing buffer.lua to use vfio.dma_alloc() makes it work, at least
> for vfio-managed devices.  i think it's ugly to call a
> lib/hardware/xxx file from a core/xxx, but maybe that means that the
> memory management in vfio.lua really belongs somewhere else.

I have been pondering Emacs-style hooks for this kind of situation. Like:

vfio.lua:

  function vfio_register_memory_hook (virtual, physical, length)
      -- register with IOMMU...
  end
  table.insert(memory.new_dma_memory_hooks, vfio_register_memory_hook)

memory.lua:

  for i, hook in ipairs(new_dma_memory_hooks) do
    hook(virtual, physical, length)
  end

then with a modest set of hooks in core.* we can perhaps make everything play nicely together?

> I also changed vfio.dma_alloc() to not bother checking the physical
> memory address, it simply keeps a 'cursor' to assign new addresses for
> each mapping.

Gonna be interesting to see what address spaces we need to track and how we do this, e.g. with more fields in the buffer struct, or separate lookup tables, etc.

> in short: it works, registers packet transmission... but only for
> devices handled via VFIO, and with a layering violation in buffer.lua

Looking forward to playing with this :-)

So do I understand correctly that non-VFIO drivers are only working when booting the kernel without "intel_iommu=on" and we see no obvious way to fix this?

In that case maybe we should start using vfio as the default. In benchmarking to find peak performance we would want to A/B test vfio vs. raw but that is perhaps the only really interesting case on Xeon systems that all have IOMMUs.

Great hacking !!
-Luke


Luke Gorrie

Nov 24, 2013, 6:27:27 AM
to snabb...@googlegroups.com
On 24 November 2013 12:25, Luke Gorrie <lu...@snabb.co> wrote:
> So do I understand correctly that non-VFIO drivers are only working when booting the kernel without "intel_iommu=on" and we see no obvious way to fix this?

Hah!

You know, I actually clicked Send on an old draft of that mail. Here is what I really meant to say on reflection :-)

So the situation is that we need to decide at boot time whether or not to use the IOMMU (kernel intel_iommu=on param). If the IOMMU is enabled then we must use vfio for drivers. If the IOMMU is disabled then we must use raw PCI for the drivers.

VFIO/IOMMU is a nicer environment: we can run non-root, we don't risk crashing the machine with bad DMA, and we can choose our own memory map for DMA. (Maybe other benefits, e.g. less risk of two snabbswitch processes both opening the same device at the same time?)

Raw/non-IOMMU is more basic: we have to run as root, we can crash the machine with bad DMA, and we have to use physical memory addresses for DMA. The advantage is not depending on the IOMMU to be available and work as advertised.

In this case IOMMU/VFIO seems the most appealing as the default choice?

If we are doing peak-performance testing and are concerned about the IOMMU (e.g. "IOTLB thrashing") then we can A/B test by booting with IOMMU on/off.

I had thought it would be neat to run a mix of VFIO and raw drivers at the same time and perhaps in the same switch. This is perhaps not so interesting if the "raw" drivers' process must be managed by the IOMMU in this case - then it's only an illusion that we aren't using IOMMU and not a valid way to compare performance for example.

So! This is fantastic hacking Javier :-). Let's use VFIO for a while and see how it works. Maybe we can throw away the raw code later. With only 10,000 lines of code budget to play with each feature will soon have to really earn its keep :-)

Great hacking :-)
-Luke

Luke Gorrie

Nov 24, 2013, 6:44:54 AM
to snabb...@googlegroups.com
On 24 November 2013 12:27, Luke Gorrie <lu...@snabb.co> wrote:
> With only 10,000 lines of code budget to play with each feature will soon have to really earn its keep :-)

Speaking of which: VFIO also frees us from requiring HugeTLB to make the system function. It's needed with raw drivers because hardware demands ~1MB of contiguous memory for Intel TX/RX descriptor rings and the easiest way to get this in userspace is with a HugeTLB. However, with VFIO we should be able to simply malloc() the memory (memory.use_hugetlb = false) and depend on the IOMMU to make it look contiguous to the hardware.

This would downgrade HugeTLB to be an optimization and one that could potentially be done in a different way e.g. transparent HugeTLB instead of explicit allocation.

So we seem to be headed down the path of code simplification and that's a great thing! (If we want to show off how smart we are we can always point to our crazy old hacks in the Git history.. ;-))


Javier Guerra Giraldez

Nov 25, 2013, 8:40:58 AM
to snabb...@googlegroups.com
On Sun, Nov 24, 2013 at 6:25 AM, Luke Gorrie <lu...@snabb.co> wrote:
> On 23 November 2013 08:50, Javier Guerra Giraldez <jav...@guerrag.com>
> wrote:
>> right, that was it. it took a lot of work to get all the relevant
>> information, but in the end the numbers matched the dmesg error.
>
> Awesome detective work ! Great fun reading along on the progress :)

hehe, thanks! maybe i should write some mystery novels.. otoh, who
would read about "the case of the missing allocator"... i'll stick to
programming.


>> changing buffer.lua to use vfio.dma_alloc() makes it work, at least
>> for vfio-managed devices. i think it's ugly to call a
>> lib/hardware/xxx file from a core/xxx, but maybe that means that the
>> memory management in vfio.lua really belongs somewhere else.
>
> I have been pondering Emacs-style hooks for this kind of situation. Like:
>
> vfio.lua:
>
> function vfio_register_memory_hook (virtual, physical, length)
> -- register with IOMMU...
> end
> table.insert(memory.new_dma_memory_hooks, vfio_register_hook)
>
> memory.lua:
>
> for i, hook in ipairs(new_dma_memory_hooks) do
> hook(virtual, physical, length)
> end
>
> then with a modest set of hooks in core.* we can perhaps make everything
> play nicely together?

i guess the most obvious way would be to add a memory.set_allocator() function.

for the other points where vfio.lua replaces pci.lua it's done by
adding methods to the device info table, but core.buffer isn't
concerned with any device. fortunately, the vfio mapping is done for
the container, not for the device. (currently, vfio.c supports a
single container per process, but that's ok because a container can
handle any number of iommu groups)

still, i guess we'll use other allocators to get buffers that are more
friendly to virtual machines, so it might be better to split
allocators (defaulting to malloc, or mmap(NULL)) and iomappings into
different settable hooks. it could even try different allocators
until one succeeds, so we would be able to do zero-copy in some cases
and fall back to bounce buffers if nothing else works.


> So do I understand correctly that non-VFIO drivers are only working when
> booting the kernel without "intel_iommu=on" and we see no obvious way to fix
> this?

a long shot would be to add iommu=pt to the boot line. i couldn't
find where in the kernel this is handled, but it seems to mean
"passthrough", which might be "1:1 unless defined differently". it
also seems to be AMD-only, but again, no proof of it.


> In that case maybe we should start using vfio as the default. In
> benchmarking to find peak performance we would want to A/B test vfio vs. raw
> but that is perhaps the only really interesting case on Xeon systems that
> all have IOMMUs.
>
> So the situation is that we need to decide at boot time whether or not to
> use the IOMMU (kernel intel_iommu=on param). If the IOMMU is enabled then we
> must use vfio for drivers. If the IOMMU is disabled then we must use raw PCI
> for the drivers.

well, it's not hard to check whether there are vfio capabilities; in
bus.lua there's a function just for that.

>
> VFIO/IOMMU is a nicer environment: we can run non-root, we don't risk
> crashing the machine with bad DMA, and we can choose our own memory map for
> DMA. (Maybe other benefits, e.g. less risk of two snabbswitch processes both
> opening the same device at the same time?)
>
> Raw/non-IOMMU is more basic: we have to run as root, we can crash the
> machine with bad DMA, and we have to use physical addresses memory for DMA.
> The advantage is not depending on the IOMMU to be available and work as
> advertised.
>
> In this case IOMMU/VFIO seems the most appealing as the default choice?
>
> If we are doing peak-performance testing and are concerned about the IOMMU
> (e.g. "IOTLB thrashing") then we can A/B test by booting with IOMMU on/off.
>
> I had thought it would be neat to run a mix of VFIO and raw drivers at the
> same time and perhaps in the same switch. This is perhaps not so interesting
> if the "raw" drivers' process must be managed by the IOMMU in this case -
> then it's only an illusion that we aren't using IOMMU and not a valid way to
> compare performance for example.

well, i guess that the IOMMU is always there. having a fixed 1:1
translation might not be faster, but it likely takes a lot less time
to switch contexts.

> So! This is fantastic hacking Javier :-). Let's use VFIO for a while and see
> how it works. Maybe we can throw away the raw code later. With only 10,000
> lines of code budget to play with each feature will soon have to really earn
> its keep :-)

not sure what that means... is there a limit on how much code there can be?


> Speaking of which: VFIO also frees us from requiring HugeTLB to make the
> system function. It's needed with raw drivers because hardware demands ~1MB
> of contiguous memory for Intel TX/RX descriptor rings and the easiest way to
> get this in userspace is with a HugeTLB. However, with VFIO we should be
> able to simply malloc() the memory (memory.use_hugetlb = false) and depend
> on the IOMMU to make it look contiguous to the hardware.

right, there might be some limitations, but it seems the IOMMU was
added as a requirement for the 64-bit architecture precisely because
the memory map was getting too complex, and to make it easy for 32-bit
pci devices to access memory anywhere.

all VFIO examples i've seen allocate memory with mmap(NULL..), which is
what malloc() ends up doing anyway, so it might be the easiest way.
also, currently we hardly deallocate anything, and the packet and
buffer structures are reused as much as possible, so it might be
better to allocate bigger chunks (currently it's 2MB at a time)


> This would downgrade HugeTLB to be an optimization and one that could
> potentially be done in a different way e.g. transparent HugeTLB instead of
> explicit allocation.
>
> So we seem to be headed down the path of code simplification and that's a
> great thing! (If we want to show off how smart we are we can always point to
> our crazy old hacks in the Git history.. ;-))


hehe, right :-) i'll move the allocators around a little and add some
hooks to remove the layering violation. then it should be easy to
start pruning old code.

and what would be the next step? SR-IOV?


--
Javier

Luke Gorrie

unread,
Nov 25, 2013, 9:54:53 AM11/25/13
to snabb...@googlegroups.com
On 25 November 2013 14:40, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
still, i guess we'll use other allocators to get buffers that are
friendlier to virtual machines, so it might be better to split
allocators (defaulting to malloc, or mmap(NULL)) and io mappings into
different settable hooks.  it could even try different allocators
until one succeeds, so we would be able to do zero-copy in some cases
and bounce-buffers if nothing else works.

I have been pondering allocation schemes. I think the first thing we need is more use cases i.e. apps/designs. So I'm inclined to start with dumb libraries (buffer/memory/etc) and make the apps/designs explicitly implement the memory allocation they want and then see what use cases we end up with.

For example in the NFV [1] application sketch it looks like it will be simplest to manage memory manually in a simple application-specific scheme. I think I can exploit a 1:1:1 relationship between buffers, virtio RX descriptors, and hardware TX descriptors.

You know how they say there are three good numbers: zero, one, and infinity? Well, we started with one central memory pool, now I wonder if we should try zero central pools (i.e. apps manage their own memory), and perhaps we find a nice solution for infinity central pools along the way (managing NUMA, VMs, etc).


a long shot would be to add iommu=pt to the boot line.  i couldn't
find where in the kernel this is handled, but it seems to mean
"passthrough", which might be "1:1 unless defined differently".  it
also seems to be AMD-only, but again, i have no proof of that.

This we can try some time!

> So! This is fantastic hacking Javier :-). Let's use VFIO for a while and see
> how it works. Maybe we can throw away the raw code later. With only 10,000
> lines of code budget to play with each feature will soon have to really earn
> its keep :-)

not sure what that means... is there a limit on how much code there can be?

Yes :-)

From the start there is a whimsical project budget:
- 10,000 lines of source.
- 1MB executable.
- 1 second compile ('make' in snabbswitch/src/)
- 1 minute compile w/ dependencies ('make' in snabbswitch/)

and I wonder if perhaps 1% of lines should be allowed to exceed 80 columns ;-)

and what would be the next step?  SR-IOV?

Yes!

Well, first let's get vfio merged onto master in a small number of commits please :-)

SR-IOV is intriguing. I would like to know how many VFs can we really have in practice? On paper it looks like chur could have 1280 VFs: 20 PFs and 64 VFs per PF. But does it really work to set that up? (are there some system-wide limits in theory or in practice?)

And how many IOMMU groups can we have? (Can we also have a separate memory map for 1280 VFs? or what's the limit?)

If we really can have a high number of VFs then I think it will be very interesting to have a VF driver for SnS. If not then we should consider looking at generic NIC multiqueue support instead.

The use case I have in mind is DMA directly to VMs by assigning either a VF to the VM (with SnS in between) or otherwise by assigning a TX/RX queue of the PF with a match on e.g. MAC address. I am looking into this in the context of nfv.lua - more ideas as the week unfolds...


Javier Guerra Giraldez

unread,
Nov 26, 2013, 2:49:16 AM11/26/13
to snabb...@googlegroups.com
On Mon, Nov 25, 2013 at 9:54 AM, Luke Gorrie <lu...@snabb.co> wrote:
> I have been pondering allocation schemes. I think the first thing we need is
> more use cases i.e. apps/designs. So I'm inclined to start with dumb
> libraries (buffer/memory/etc) and make the apps/designs explicitly implement
> the memory allocation they want and then see what use cases we end up with.

i've refactored a little, splitting the allocator into three hooks:

memory.dma_alloc => takes a size in bytes, returns allocated memory /
IO address. defaults to the existing function, which uses the next
two hooks.

memory.allocate_RAM => called by the default memory.dma_alloc(), takes a
size in bytes, returns allocated memory. if memory.use_hugetlb is
true, it defaults to C.allocate_huge_page(); if not, it points to
C.malloc

memory.ram_to_io_addr => called by the default memory.dma_alloc(), takes a
ram pointer and size, returns the corresponding IO address. there's no
default, but calling memory.set_use_physical_memory() sets it to the
existing memory.virtual_to_physical and locks all memory.
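to make the relationship concrete, the default dma_alloc() is basically
the composition of the other two hooks (sketch only; the real code
differs in the details):

   -- default dma_alloc: get backing RAM, then map it for device access
   function memory.dma_alloc (size)
      local ram = memory.allocate_RAM(size)              -- hook #2
      local io_addr = memory.ram_to_io_addr(ram, size)   -- hook #3
      return ram, io_addr
   end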


now, lib.hardware.vfio doesn't require(memory) anymore, but at startup
lib.hardware.bus checks if the host has VFIO functionality and sets
memory.ram_to_io_addr to the vfio mapping code. if not, it calls
memory.set_use_physical_memory().
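in other words, the selection in lib.hardware.bus amounts to something
like this (sketch; the vfio-side names are placeholders, not the real
function names):

   if vfio.is_available() then                          -- placeholder name
      memory.ram_to_io_addr = vfio.map_memory_to_iommu  -- placeholder name
   else
      memory.set_use_physical_memory()
   end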


therefore, now SnS can be run without root privileges, if the devices
have all been assigned to the vfio driver and the /dev/vfio/XX files
have been given rw access for the user or group running SnS.

Also, since the io mappings pin memory, it quickly runs into the 64KB
limit on pinned RAM for non-privileged users. "ulimit -l XX" raises
it for new processes started from the current shell (ulimit is a bash
builtin, not an executable)


> For example in the NFV [1] application sketch it looks like it will be
> simplest to manage memory manually in a simple application-specific scheme.
> I think I can exploit a 1:1:1 relationship between buffers, virtio RX
> descriptors, and hardware TX descriptors.

i'm not sure how this works: reads a buffer from one device, changes the
'physical' address of that buffer and notifies the other device?  as i
understand it, b.physical means the "hardware" view of the allocated
buffer.  just changing it won't make it visible to the other device.

but if both devices can access the same ram address, then there
shouldn't be any problem using the same buffer.  if both devices
are on the same iommu container, they get the same io address mapping.
or if one is a virtio, then i guess we should find some kind of
equivalent addressing.


> You know how they say there are three good numbers: zero, one, and infinity?
> well, we started with one central memory pool, now I wonder if we should try
> zero central pools (i.e. apps manage their own memory), and perhaps we find
> a nice solution for infinity central pools along the way (managing numa,
> VMs, etc).

yes, i think each kind of 'problem domain' might get its own
addressing, and maybe its own memory pool. it seems [1] a single vfio
container can't define overlapping maps; but if (for example) NUMA
wants to take from specific physical pages, then we should allocate
from there and then create the iommap; if not, we stick with plain
malloc()+iommap


[1]: linux kernel/drivers/vfio/vfio_iommu_type1.c vfio_dma_do_map()


> From the start there is a whimsical project budget:
> - 10,000 lines of source.
> - 1MB executable.
> - 1 second compile ('make' in snabbswitch/src/)
> - 1 minute compile w/ dependencies ('make' in snabbswitch/)
>
> and I wonder if perhaps 1% of lines should be allowed to exceed 80 columns
> ;-)

well, i've just reduced a few lines... but still it's bigger than
before VFIO. a quick wc tells it's 4,846 lines, so we're almost
halfway there


>> and what would be the next step? SR-IOV?
>
>
> Yes!
>
> Well first to get vfio merged onto master in a small number of commits
> please :-)

i'm not too handy with git (i use fossil for most of my own projects),
but the diff is in a mostly clean state now. should i do a pull
request, or a rebase first?




> SR-IOV is intriguing. I would like to know how many VFs can we really have
> in practice? On paper it looks like chur could have 1280 VFs: 20 PFs and 64
> VFs per PF. But does it really work to set that up? (are there some
> system-wide limits in theory or in practice?)

judging from the kernel docs, it seems IOV is part of the PCIe
specification, so i guess it should be possible to activate all of
them. whether there's a point of diminishing returns, only some
benchmarking would tell.

I tried to activate a few VFs by "echo '3' | sudo tee
/sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs", but got "invalid
argument" no matter what non-zero number i tried to put.

maybe i have to define some queues or other hardware configuration first?


> And how many IOMMU groups can we have? (Can we also have a separate memory
> map for 1280 VFs? or what's the limit?)

for physical hardware the IOMMU groups are fixed: each device belongs
to a single group according to the hardware topology. the only
flexibility is choosing which groups are used by which processes, by
adding them to a 'container'. the io mappings are set per container,
so each process can be isolated with some devices and have no means to
affect the others.

i don't know how that mixes with SR-IOV. i guess the simplest would
be that all VFs belong to the same group as the PF, after all, they're
on the same point of the PCI tree. but it would be much more
interesting if there's some flexibility on that. maybe the group
splits into smaller groups with some VFs each....


> If we really can have a high number of VFs then I think it will be very
> interesting to have a VF driver for SnS. If not then we should consider
> looking at generic NIC multiqueue support instead.

i guess that multiqueue is a much better low-hanging fruit; but it
totally depends on the on-chip capabilities, of which i know next to
nothing.


> The use case I have in mind is DMA directly to VMs by assigning either a VF
> to the VM (with SnS in between) or otherwise by assigning a TX/RX queue of
> the PF with a match on e.g. MAC address. I am looking into this in the
> context of nfv.lua - more ideas as the week unfolds...

yep, that would be great to offload the most common case for VMs to
the chip. of course, that depends on the chip being able to also
internally route between queues, and not only between wire and
queues....


--
Javier

Luke Gorrie

unread,
Nov 26, 2013, 6:02:12 AM11/26/13
to snabb...@googlegroups.com
On 26 November 2013 08:49, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
On Mon, Nov 25, 2013 at 9:54 AM, Luke Gorrie <lu...@snabb.co> wrote:
> I have been pondering allocation schemes. I think the first thing we need is
> more use cases i.e. apps/designs. So I'm inclined to start with dumb
> libraries (buffer/memory/etc) and make the apps/designs explicitly implement
> the memory allocation they want and then see what use cases we end up with.

i've refactored a little, splitting the allocator into three hooks:

Cool!

Let's use this as a chance to get some experience with Github workflow. So far we haven't used it very much but I hope it's worth getting into. How about sending a pull request and we can discuss the code on Github?

therefore, now SnS can be run without root privileges

That is awesome :-)

> For example in the NFV [1] application sketch it looks like it will be
> simplest to manage memory manually in a simple application-specific scheme.
> I think I can exploit a 1:1:1 relationship between buffers, virtio RX
> descriptors, and hardware TX descriptors.

i'm not sure how this works: reads a buffer from one device, changes the
'physical' address of that buffer and notifies the other device?  as i
understand it, b.physical means the "hardware" view of the allocated
buffer.  just changing it won't make it visible to the other device.

This code is really using the buffer->physical as a general I/O address and translating the address when used in descriptors for Virtio (guest-physical address) vs. NIC (host-physical address). The buffers are all "sourced" from the Virtio device (i.e. guest memory) so no copies are needed (guest can address its own memory, NIC can address all memory) but address translation is needed.

So it's doing the job of the IOMMU. It would be neat if vfio/IOMMU could do this natively going forward. I don't yet have a feeling for how well that will work with multiple VMs and NICs in the same process (need to draw some pictures..)

It would be important to have some extra safeguards in this programming style e.g. an assert() that the IO address being used really belongs to the right namespace. DMA bugs with bad addresses would not be so much fun.
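For illustration, the translation plus safeguard could look like this (a sketch only; the region fields and names are assumptions, not the actual nfv.lua code):

   -- Translate a guest-physical address into a host-physical address the
   -- NIC can use for DMA, asserting it belongs to this guest's memory.
   local function guest_to_dma (addr, region)
      -- region = { guest_base = ..., host_base = ..., size = ... }
      assert(addr >= region.guest_base and
             addr <  region.guest_base + region.size,
             "DMA address outside guest memory region")
      return region.host_base + (addr - region.guest_base)
   end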
 
> From the start there is a whimsical project budget:
> - 10,000 lines of source.
> - 1MB executable.
> - 1 second compile ('make' in snabbswitch/src/)
> - 1 minute compile w/ dependencies ('make' in snabbswitch/)
>
> and I wonder if perhaps 1% of lines should be allowed to exceed 80 columns
> ;-)

well, i've just reduced a few lines... but still it's bigger than
before VFIO.  a quick wc tells it's 4,846 lines, so we're almost
halfway there

Yep. The source is growing and it's increasingly valuable to look for places where we can use less code to accomplish the same thing. This can be an iterative process. I enjoy sitting down and rewriting things to be shorter and I like the effect that it has on programs. It's probably also a good way for people to come in to the project: read some code, understand it, send in a shorter version. We can also tag-and-remove features if they should prove to be relatively unimportant e.g. 1G network drivers, non-VFIO PCI access, HugeTLB, etc.

Once the 10,000 line limit starts to _really_ bite then it will be time to think about how to scale up features/applications/etc in a good way.

I tried to activate a few VFs by "echo '3' | sudo tee
/sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs", but got "invalid
argument" no matter what non-zero number i tried to put.

maybe i have to define some queues or other hardware configuration first?

I think you need to specify this at module-load time, like 'modprobe ixgbe max_vfs=63'.
 
i don't know how that mixes with SR-IOV.  i guess the simplest would
be that all VFs belong to the same group as the PF, after all, they're
on the same point of the PCI tree.  but it would be much more
interesting if there's some flexibility on that.  maybe the group
splits into smaller groups with some VFs each....

I think the key question for me is: can each VF have its own IO mapping? If so then each VF can be associated with a specific Virtio-net device and all the DMA can be done using the same address space i.e. guest-local addresses.

i guess that multiqueue is a much better low-hanging fruit; but it
totally depends on the on-chip capabilities, of which i know next to
nothing.

The reason that I find SR-IOV appealing is that it has a definite use case: multiplexing VMs onto a physical device. Hardware vendors will all need to implement SR-IOV in such a way that this works usefully. So (a) different NICs are likely to behave in a similar way and (b) their features should be well matched for the Network Functions Virtualization niche that SnS is currently moving into.

Multiple TX/RX queues also have other use cases e.g. to spread load across multiple CPUs using a hash function instead of exactly matching traffic to individual queues. So vendors may well-meaningly implement multiqueue support in different ways and not always supporting our use case.

Question is also whether it's easier to write a VF driver for vendor X than either our own PF driver (which is hard if the hardware or documentation is bad) or linking in a vendor "kernel-bypass" library (which eats our 1MB object code budget, is 3rd party code we need to be able to debug, and is not always freely redistributable).

It would be nice to be able to have this discussion with NIC hardware makers. I've mostly had a cold shoulder so far -- they don't want us writing fancy drivers, they want to do it themselves, and sell licenses to the results. To me this hardware/software bundling/lock-in agenda mostly devalues the hardware.

yep, that would be great to offload the most common case for VMs to
the chip.  of course, that depends on the chip being able to also
internally route between queues, and not only between wire and
queues....

Yeah. They are already working on solving this e.g. our Intel NICs have an embedded switch that can do "hairpin turns" on VM<->VM traffic. It's not perfect but at least it's something the hardware people are already working on (be it with NICs or with switches).


Luke Gorrie

unread,
Nov 26, 2013, 7:39:01 AM11/26/13
to snabb...@googlegroups.com
On 26 November 2013 12:02, Luke Gorrie <lu...@snabb.co> wrote:
Let's use this as a chance to get some experience with Github workflow. So far we haven't used it very much but I hope it's worth getting into. How about sending a pull request and we can discuss the code on Github?

Checking around briefly: the OpenShift workflow looks quite simple and good to me: https://www.openshift.com/wiki/github-workflow-for-submitting-pull-requests

That is: pull request from feature branch is rebased to the current master and "squashed" into one commit.

Likely we don't want to merge feature branch history into master -- though in some cases a change might be broken into several commits.

Myself I have recently gotten into the habit of doing "rebase -i" to squash features into a small number of commits before merging.


Javier Guerra Giraldez

unread,
Nov 26, 2013, 8:48:16 AM11/26/13
to snabb...@googlegroups.com
On Tue, Nov 26, 2013 at 6:02 AM, Luke Gorrie <lu...@snabb.co> wrote:
>
> I think you need to specify this at module-load time, like 'modprobe ixgbe
> max_vfs=63'.

hum.... aren't we pulling the device from the driver? if it's not
there anymore, the error wouldn't be surprising at all. OTOH, there's
still a ../sriov_totalvfs = 63 file there.


>> i don't know how that mixes with SR-IOV. i guess the simplest would
>> be that all VFs belong to the same group as the PF, after all, they're
>> on the same point of the PCI tree. but it would be much more
>> interesting if there's some flexibility on that. maybe the group
>> splits into smaller groups with some VFs each....
>
>
> I think the key question for me is: can each VF have its own IO mapping? If
> so then each VF can be associated with a specific Virtio-net device and all
> the DMA can be done using the same address space i.e. guest-local addresses.

IO mappings are applied to the container, which can have any subset of
iommu groups but not individual devices; so the group is the smallest
mapping granularity.

currently (no iov), each card puts both ports on a single group, so
it's likely that all VFs would fall there too, unless the iov
functionality creates a new level in the PCI tree with some
PCI-bridge-like features, splitting the group.


> It would be nice to be able to have this discussion with NIC hardware
> makers. I've mostly had a cold shoulder so far -- they don't want us writing
> fancy drivers, they want to do it themselves, and sell licenses to the
> results. To me this hardware/software bundling/lock-in agenda mostly
> devalues the hardware.

yes, just like the ATi/nVidia gpu hostage situation. AMD has a slightly
better attitude than ATi did, but the drivers are still the worst of all.
ironically, Intel is the good guy in that space.

but the main difference is that even if the makers aren't directly
collaborating, the chips are well documented, right? I haven't read
the hard docs for those chips; do you have a good reference?



>> yep, that would be great to offload the most common case for VMs to
>> the chip. of course, that depends on the chip being able to also
>> internally route between queues, and not only between wire and
>> queues....
>
>
> Yeah. They are already working on solving this e.g. our Intel NICs have an
> embedded switch that can do "hairpin turns" on VM<->VM traffic. It's not
> perfect but at least it's something the hardware people are already working
> on (be it with NICs or with switches).

yes, that kind of architecture (an internal switch for several
real/virtual NICs/queues, as opposed to just a big processor with
firmware to offload work from a single driver) is what i think leads
to much better flexibility for userspace drivers.


> https://www.openshift.com/wiki/github-workflow-for-submitting-pull-requests
>
> That is: pull request from feature branch is rebased to the current master
> and "squashed" into one commit.


good, i'll read it and apply it to our situation


> Likely we don't want to merge feature branch history into master -- though
> in some cases a change might be broken into several commits.
>
> Myself I have recently gotten into the habit of doing "rebase -i" to squash
> features into a small number of commits before merging.

sounds like good advice for keeping the history nice.


--
Javier

Julian Stecklina

unread,
Nov 26, 2013, 9:10:51 AM11/26/13
to snabb...@googlegroups.com
On 11/25/2013 03:54 PM, Luke Gorrie wrote:
> SR-IOV is intriguing. I would like to know how many VFs can we really
> have in practice? On paper it looks like chur could have 1280 VFs: 20
> PFs and 64 VFs per PF. But does it really work to set that up? (are
> there some system-wide limits in theory or in practice?)

The immediate problem would be IRQs. These VFs consume I think 3 MSI
vectors per VF. I don't know how Linux handles more than 256 - 32 (8 bit
vector space minus exception vectors). In theory, it can treat MSI
vectors as being CPU-local (multiplying the vector space by the number
of cores), but I am not sure if anyone has implemented that.

Julian

signature.asc

Luke Gorrie

unread,
Nov 26, 2013, 9:16:49 AM11/26/13
to snabb...@googlegroups.com
On 26 November 2013 15:10, Julian Stecklina <jste...@os.inf.tu-dresden.de> wrote:
The immediate problem would be IRQs. These VFs consume I think 3 MSI
vectors per VF. I don't know how Linux handles more than 256 - 32 (8 bit
vector space minus exception vectors). In theory, it can treat MSI
vectors as being CPU-local (multiplying the vector space by the number
of cores), but I am not sure if anyone has implemented that.

If we have our own VF driver that does not use interrupts, are we off the hook on that problem?


Luke Gorrie

unread,
Nov 26, 2013, 9:20:04 AM11/26/13
to snabb...@googlegroups.com
On 26 November 2013 14:48, Javier Guerra Giraldez <jav...@guerrag.com> wrote:
On Tue, Nov 26, 2013 at 6:02 AM, Luke Gorrie <lu...@snabb.co> wrote:
>
> I think you need to specify this at module-load time, like 'modprobe ixgbe
> max_vfs=63'.

hum.... aren't we pulling the device from the driver?  if it's not
there anymore, the error wouldn't be surprising at all.  OTOH, there's
still a ../sriov_totalvfs = 63 file there.

In the case I have in mind we are using the kernel driver for the PF (ixgbe) but we are using our own driver for the VFs (unbinding them from ixgbevf).


Julian Stecklina

unread,
Nov 26, 2013, 9:21:14 AM11/26/13
to snabb...@googlegroups.com
On 11/26/2013 08:49 AM, Javier Guerra Giraldez wrote:
> I tried to activate a few VFs by "echo '3' | sudo tee
> /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs", but got "invalid
> argument" no matter what non-zero number i tried to put.
>
> maybe i have to define some queues or other hardware configuration first?

I think you can only change sriov_numvfs at module load/boot time. Other
than that, you might need to pass pci=realloc on the Linux command line
to allow Linux to make room for all the VFs on the PCI bus. The Linux
kernel log will tell you if this is necessary, but it is error-prone,
as it might confuse the BIOS.

A better solution is to have a mainboard that supports ARI on its PCIe
slots. Check your PCI Express Root Ports with lspci -vv (as root). You
should see ARIFwd+ in DevCap2 and DevCtl2. For some mainboards, ARI is
only available on some PCIe slots.... The main feature in ARI is that it
allows the PCI bus to treat the 'device' part in the BDF numbering as 0
and use the bits for device and function together to name up to 256
functions.

Julian

signature.asc

Julian Stecklina

unread,
Nov 26, 2013, 9:22:35 AM11/26/13
to snabb...@googlegroups.com
On 11/26/2013 12:02 PM, Luke Gorrie wrote:
> It would be nice to be able to have this discussion with NIC hardware
> makers. I've mostly had a cold shoulder so far -- they don't want us
> writing fancy drivers, they want to do it themselves, and sell licenses
> to the results. To me this hardware/software bundling/lock-in agenda
> mostly devalues the hardware.

ACK!

Julian

signature.asc

Julian Stecklina

unread,
Nov 26, 2013, 9:25:17 AM11/26/13
to snabb...@googlegroups.com
Sure. :)

Julian


signature.asc