Hello all,
We are proposing to move SR-IOV functional testing in CI from dedicated hardware-backed nodes to QEMU-emulated SR-IOV PFs on every node.
The goal is to make SR-IOV testing easier to run, scale, and debug, while keeping real hardware available for the rare hardware-specific issues.
Relevant PRs:
- https://github.com/kubevirt/kubevirtci/pull/1600
- https://github.com/kubevirt/kubevirt/pull/16975
With these changes, each cluster node will expose an emulated SR-IOV PF, allowing SR-IOV tests to run without special hardware.
Our plan is to stop maintaining the `kind-sriov` provider after 3 releases.
A few points:
- If we hit bugs that reproduce only on hardware, we will continue using hardware for debugging. These cases are rare.
- The current upstream hardware is old and very specific, and does not reflect what customers typically use.
- Our SR-IOV e2e tests are focused on functional correctness, not performance or hardware-specific capabilities.
- The only functionality we currently test that emulation does not support is `link_state` changes. To handle this, we updated the e2e tests to detect whether they run against emulated or hardware-backed SR-IOV and to skip only the `link_state` mutation path in emulated mode.
- All other tests run unchanged in both modes (emulated and hardware).
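To illustrate the skip logic described above, here is a minimal Go sketch. It is not the code from the PRs: the `SRIOV_EMULATED` environment variable, the `Mode` type, and the `DetectMode` / `ShouldSkip` helpers are hypothetical names chosen for this example, and the actual detection mechanism in the tests may differ.

```go
package main

import "strings"

// Mode says whether the SR-IOV PF under test is real or emulated.
// Hypothetical type, used only for this illustration.
type Mode int

const (
	Hardware Mode = iota
	Emulated
)

// emulatedEnvVar is an assumed switch; the real tests may detect
// emulation differently (for example by inspecting the PF device).
const emulatedEnvVar = "SRIOV_EMULATED"

// DetectMode reads the environment through the supplied lookup
// function (pass os.Getenv in real use) and returns the mode.
// Anything other than an explicit truthy value means Hardware.
func DetectMode(getenv func(string) string) Mode {
	switch strings.ToLower(strings.TrimSpace(getenv(emulatedEnvVar))) {
	case "1", "true", "yes":
		return Emulated
	}
	return Hardware
}

// unsupportedWhenEmulated lists the features skipped in emulated
// mode. Per the proposal, link_state mutation is the only one.
var unsupportedWhenEmulated = map[string]bool{
	"link_state": true,
}

// ShouldSkip reports whether a test path exercising the given
// feature should be skipped in the given mode.
func ShouldSkip(mode Mode, feature string) bool {
	return mode == Emulated && unsupportedWhenEmulated[feature]
}
```

In a Ginkgo test this could be used as `if ShouldSkip(DetectMode(os.Getenv), "link_state") { Skip("link_state changes are not supported on emulated SR-IOV") }`, so that only that one path is skipped and everything else runs in both modes.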
Why this change is beneficial:
- Setup and maintenance cost is high today (special Prow setup, undercloud CNI, etc.).
- It will make both CI and local debugging easier, with no special hardware requirement.
- We currently have only two CI machines for this lane, which is a bottleneck.
- It reduces setup time by ~15-30 minutes.
- It reduces provider sprawl by using a more generic provider path.
- It helps us consume newer Kubernetes versions earlier (e.g. 1.36 beta for DRA work), even before kind support catches up.
Comments and concerns are very welcome.
Thanks