Indeed, and the number of VFs is limited by the PCIe specification to
65535 per PF (TotalVFs is a 16-bit field).
The device divides its resources among the VFs, so the maximum number
of VFs is determined by the amount of resources available on the device
and how the device's logic is implemented.
The number of VFs exposed to the host is controlled by the host driver
(up to the maximum supported by the device) via writes to the SR-IOV
capability in the device's configuration space.
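To make the layout concrete, here's a rough sketch (in Python, against a
made-up capability blob) of where the VF counts live in the SR-IOV
extended capability; the offsets (InitialVFs at 0x0C, TotalVFs at 0x0E,
NumVFs at 0x10, each 16-bit little-endian) are from the PCIe SR-IOV spec,
everything else is invented for illustration:

```python
import struct

def parse_sriov_vf_counts(cap: bytes) -> dict:
    # Offsets within the SR-IOV extended capability (PCIe spec):
    #   0x0C InitialVFs, 0x0E TotalVFs, 0x10 NumVFs -- all 16-bit LE.
    initial_vfs, total_vfs, num_vfs = struct.unpack_from("<HHH", cap, 0x0C)
    return {"InitialVFs": initial_vfs, "TotalVFs": total_vfs, "NumVFs": num_vfs}

# Fake capability blob: device supports 64 VFs, host driver has
# currently enabled 8 of them via the NumVFs field.
cap = bytearray(0x40)
struct.pack_into("<HHH", cap, 0x0C, 64, 64, 8)
print(parse_sriov_vf_counts(bytes(cap)))
# -> {'InitialVFs': 64, 'TotalVFs': 64, 'NumVFs': 8}
```

On Linux the knob the host driver exposes for this is the per-device
sriov_numvfs file in sysfs; writing N there enables N VFs, up to TotalVFs.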
>A coprocessor is intended to be implicitly immediately available,
>under OS control, to the current processor context, be it threads, OS
>or drivers. That implies huge quota of VF's for all threads plus
>sundry other uses on all guest OS just in case they want one.
That assumes the coprocessor will be used by all processes, which is
rarely the case outside of legacy coprocessors like FPUs. Even there,
most applications didn't actually use floating point, and most major
operating systems have hooks to detect whether an application uses
floating point so they don't need to save the FPRs across context switches.
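Those hooks amount to lazy FPU state handling: leave the FP registers
alone at context-switch time, and only save/restore them when a task
actually traps on its first FP instruction. A toy Python model of the
bookkeeping (class and method names invented for illustration):

```python
class Task:
    def __init__(self, name):
        self.name = name
        self.fpu_state = None    # saved FP registers, if any
        self.used_fpu = False

class LazyFPU:
    """Toy model: no FP save on switch; the first FP instruction a task
    issues after a switch traps and triggers the save/restore."""
    def __init__(self):
        self.owner = None        # task whose state is live in the FPU
        self.saves = 0           # how many real register saves happened

    def context_switch(self, next_task):
        pass                     # deliberately no FP save here

    def fp_trap(self, task):
        # Trap handler for a task's first FP instruction after a switch.
        if self.owner is not task:
            if self.owner is not None:
                self.owner.fpu_state = "regs"   # save old owner's registers
                self.saves += 1
            self.owner = task
        task.used_fpu = True

fpu = LazyFPU()
a, b, c = Task("a"), Task("b"), Task("c")
for t in (a, b, c, a):
    fpu.context_switch(t)
    if t.name != "c":            # task c never touches floating point,
        fpu.fp_trap(t)           # so its switches cost no FP save at all
print(fpu.saves)  # -> 2 (a->b and b->a transitions; c was free)
```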
>And, just guessing at the device internals, implies huge management tables,
>CAMs instead of SRAMs, caches, blah, blah, etc.
Certainly, in many cases CAMs are quite useful, particularly in
networking hardware that performs hardware packet classification
on header fields.
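For a sense of why a CAM (really a TCAM) fits classification: each entry
is a value/mask pair over the header bits, the hardware compares a key
against all entries in parallel, and the highest-priority hit wins. A
software model, with a linear scan standing in for the parallel compare
and fictional 16-bit "header" keys:

```python
# Toy TCAM: entries are (value, mask, action); a key matches an entry
# when (key & mask) == (value & mask). Entry order = priority.
def tcam_lookup(entries, key):
    for value, mask, action in entries:      # first hit wins
        if key & mask == value & mask:
            return action
    return "default"

# Fictional key layout: high byte = protocol, low byte = port.
entries = [
    (0x0650, 0xFFFF, "ipsec"),      # proto 6, port 0x50: exact match
    (0x0600, 0xFF00, "tcp-queue"),  # proto 6, any port: low byte wildcarded
]
print(tcam_lookup(entries, 0x0650))  # -> ipsec
print(tcam_lookup(entries, 0x0623))  # -> tcp-queue (wildcard entry)
print(tcam_lookup(entries, 0x1111))  # -> default (no entry matches)
```

The mask is what makes it "ternary" (0, 1, or don't-care per bit), and is
why TCAMs cost so much more area and power than plain SRAM lookups.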
>The overhead of the async completion signal would likely be much greater
>that the cost of the original coprocessor hash/encrypt.
That, again, depends on the coprocessor. If the amount of work
being offloaded isn't large enough to subsume the slight extra
cost of the virtio interrupt (particularly on CPUs where
interrupt overhead is low - e.g. ARMv8), you probably should
couple the coprocessor more closely to the CPU, much like ARM Neoverse
cores, where the RNDR instruction interacts with an off-CPU random
number generator (via MMIO).
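The break-even is just arithmetic: offloading pays only when the work
moved off the CPU exceeds the submit-plus-completion overhead. With
entirely made-up cycle counts for illustration:

```python
def offload_wins(work_cycles, submit_cycles, irq_cycles):
    """Offload pays off only if the cycles of work moved off the CPU
    exceed the cost of submitting the job plus taking the completion
    interrupt. (All numbers here are illustrative, not measured.)"""
    return work_cycles > submit_cycles + irq_cycles

# A ~500-cycle interrupt path swamps a 300-cycle hash, but is noise
# next to a 50k-cycle bulk encryption.
print(offload_wins(300, 100, 500))     # -> False: keep the hash on-CPU
print(offload_wins(50_000, 100, 500))  # -> True: offload the bulk encrypt
```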
Here's what our chips look like to the kernel/software:
https://doc.dpdk.org/guides-20.05/platform/octeontx2.html
A packet comes in; hardware allocates packet storage from the
NPA (Network Pool Allocator) hardware block and passes it to the
NPC for classification (big CAMs), which queues it to the scheduler.
The scheduler may or may not hand it to a processor or to
one of the many blocks that can be inserted into the processing
flow for a packet (crypto for IPsec, compression, etc.) before
the packet is queued for egress (where shaping occurs) on
a network port.