On Mon, Nov 25, 2013 at 9:54 AM, Luke Gorrie <lu...@snabb.co> wrote:
> I have been pondering allocation schemes. I think the first thing we need is
> more use cases i.e. apps/designs. So I'm inclined to start with dumb
> libraries (buffer/memory/etc) and make the apps/designs explicitly implement
> the memory allocation they want and then see what use cases we end up with.
I've refactored a little, splitting the allocator into three hooks:

memory.dma_alloc => takes a size in bytes, returns the allocated memory and its IO address. Defaults to the existing function, which uses the next two hooks.

memory.allocate_RAM => called by the default memory.dma_alloc(); takes a size in bytes, returns allocated memory. If memory.use_hugetlb is true it defaults to C.allocate_huge_page(), otherwise it points to C.malloc.

memory.ram_to_io_addr => called by the default memory.dma_alloc(); takes a RAM pointer and a size, returns the corresponding IO address. There's no default, but calling memory.set_use_physical_memory() sets it to the existing memory.virtual_to_physical and locks all memory.
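In (very) simplified code, the default wiring is roughly this (inside the memory module; C is the FFI namespace the module already uses):

  -- simplified sketch of how the three hooks fit together
  function memory.dma_alloc (size)
     local ram = memory.allocate_RAM(size)          -- plain RAM
     local io  = memory.ram_to_io_addr(ram, size)   -- device-visible address
     return ram, io
  end

  function memory.allocate_RAM (size)
     if memory.use_hugetlb then
        return C.allocate_huge_page(size)
     else
        return C.malloc(size)
     end
  end

  -- memory.ram_to_io_addr has no default; it gets installed either by
  -- memory.set_use_physical_memory() or by the vfio mapping code (below).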
Now lib.hardware.vfio doesn't require(memory) anymore; instead, at startup lib.hardware.bus checks whether the host has VFIO functionality and, if so, sets memory.ram_to_io_addr to the VFIO mapping code. If not, it calls memory.set_use_physical_memory().
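Something along these lines (is_available() and map_memory_to_iommu() are made-up names for the real check and mapping code in lib.hardware.vfio):

  -- in lib.hardware.bus, at startup
  if vfio.is_available() then            -- e.g. /dev/vfio/vfio is usable
     memory.ram_to_io_addr = vfio.map_memory_to_iommu
  else
     memory.set_use_physical_memory()
  end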
Therefore SnS can now run without root privileges, provided the devices have all been bound to the vfio driver and the /dev/vfio/XX files have been given rw access for the user or group running SnS.

Also, since the IO mappings pin memory, it quickly hits the 64 KiB limit on pinned RAM for non-privileged users. "ulimit -l XX" raises it for new processes started from the current shell (ulimit is a bash builtin, not an executable).
> For example in the NFV [1] application sketch it looks like it will be
> simplest to manage memory manually in a simple application-specific scheme.
> I think I can exploit a 1:1:1 relationship between buffers, virtio RX
> descriptors, and hardware TX descriptors.
I'm not sure how this works: read a buffer from one device, change the 'physical' address of that buffer and notify the other device? As I understand it, b.physical is the "hardware" view of the allocated buffer; just changing it won't make the buffer visible to the other device.

But if both devices can access the same RAM address, then there shouldn't be any problem using the same buffer. If both devices are in the same IOMMU container, they get the same IO address mapping; or if one is a virtio device, then I guess we should find some kind of equivalent addressing.
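For example, with both NICs behind the same container I'd expect something like this to just work (receive/transmit are made-up method names):

  local b = nic_a:receive()   -- b.pointer / b.physical valid for nic_a
  nic_b:transmit(b)           -- also valid for nic_b: same container,
                              -- same IO mapping, nothing to rewrite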
> You know how they say there are three good numbers: zero, one, and infinity?
> well, we started with one central memory pool, now I wonder if we should try
> zero central pools (i.e. apps manage their own memory), and perhaps we find
> a nice solution for infinity central pools along the way (managing numa,
> VMs, etc).
Yes, I think each kind of 'problem domain' might get its own addressing, and maybe its own memory pool. It seems [1] a single VFIO container can't define overlapping maps; but if (for example) NUMA wants to take memory from specific physical pages, then we should allocate from there and then create the IO map; if not, we stick with plain malloc() + IO map.
[1]: linux kernel/drivers/vfio/vfio_iommu_type1.c vfio_dma_do_map()
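As a sketch, such a NUMA-aware domain could install its own allocator and keep reusing whatever ram_to_io_addr the bus set up (allocate_on_node() is a made-up name for "take RAM from this NUMA node"):

  local function numa_dma_alloc (size, node)
     local ram = C.allocate_on_node(size, node)     -- RAM from a specific node
     return ram, memory.ram_to_io_addr(ram, size)   -- then the IO map as usual
  end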
> From the start there is a whimsical project budget:
> - 10,000 lines of source.
> - 1MB executable.
> - 1 second compile ('make' in snabbswitch/src/)
> - 1 minute compile w/ dependencies ('make' in snabbswitch/)
>
> and I wonder if perhaps 1% of lines should be allowed to exceed 80 columns
> ;-)
Well, I've just reduced a few lines... but it's still bigger than before VFIO. A quick wc says 4,846 lines, so we're almost halfway there.
>> and what would be the next step? SR-IOV?
>
>
> Yes!
>
> Well first to get vfio merged onto master in a small number of commits
> please :-)
I'm not too handy with git (I use Fossil for most of my own projects), but the diff is in a mostly clean state now. Should I do a pull request, or a rebase first?
> SR-IOV is intriguing. I would like to know how many VFs can we really have
> in practice? On paper it looks like chur could have 1280 VFs: 20 PFs and 64
> VFs per PF. But does it really work to set that up? (are there some
> system-wide limits in theory or in practice?)
Judging from the kernel docs, it seems SR-IOV is part of the PCIe specification itself, so I guess it should be possible to activate all of them; if there's a point of diminishing returns, only benchmarking will tell.

I tried to activate a few VFs with "echo '3' | sudo tee /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs", but got "invalid argument" no matter what non-zero number I tried. Maybe I have to define some queues or other hardware configuration first?
> And how many IOMMU groups can we have? (Can we also have a separate memory
> map for 1280 VFs? or what's the limit?)
For physical hardware the IOMMU groups are fixed: each device belongs to a single group according to the hardware topology. The only flexibility is choosing which groups are used by which processes, by adding them to a 'container'. The IO mappings are set per container, so each process can be isolated with some devices and without any means to affect the others.
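Conceptually (open_container/add_group/map_dma are made-up wrappers around the real VFIO ioctls):

  local container = vfio.open_container()       -- open("/dev/vfio/vfio")
  vfio.add_group(container, group_a)            -- open("/dev/vfio/<N>") + VFIO_GROUP_SET_CONTAINER
  vfio.add_group(container, group_b)
  -- DMA maps belong to the container, so every device in both groups
  -- sees the same IO addresses:
  vfio.map_dma(container, ram, size, io_addr)   -- VFIO_IOMMU_MAP_DMA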
I don't know how that mixes with SR-IOV. I guess the simplest outcome would be that all VFs belong to the same group as the PF; after all, they sit at the same point of the PCI tree. But it would be much more interesting if there were some flexibility there, e.g. the group splitting into smaller groups with a few VFs each....
> If we really can have a high number of VFs then I think it will be very
> interesting to have a VF driver for SnS. If not then we should consider
> looking at generic NIC multiqueue support instead.
I guess multiqueue is the lower-hanging fruit, but it totally depends on the on-chip capabilities, of which I know next to nothing.
> The use case I have in mind is DMA directly to VMs by assigning either a VF
> to the VM (with SnS in between) or otherwise by assigning a TX/RX queue of
> the PF with a match on e.g. MAC address. I am looking into this in the
> context of nfv.lua - more ideas as the week unfolds...
Yep, it would be great to offload the most common VM case to the chip. Of course, that depends on the chip also being able to route internally between queues, and not only between the wire and the queues....
--
Javier