Howdy!
Let us take a moment to recap the "straight line" design concept that is gaining currency in Snabb Switch.
Snabb Switch is replacing special-case NIC offloads with general-purpose SIMD offloads. So far we have replaced NIC memory transfers with SIMD memory transfers and NIC checksums with SIMD checksums. (This applies mostly to the Virtio-net/NFV code.)
There are a couple of reasons we are doing this:
- SIMD offloads work for more diverse workloads than NIC offloads.
- SIMD offloads work the same for all NICs.
This takes us towards a sweet spot with simple code, high performance, and consistency across diverse workloads.
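To make the checksum half of that concrete, here is a minimal sketch of an Internet (one's-complement) checksum done with SSE2 intrinsics. This is illustrative only and not Snabb Switch's actual implementation (the real code also has an AVX2 variant and handles things like pseudo-header folding); it assumes packet-sized buffers so the 32-bit accumulator lanes cannot overflow:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <emmintrin.h>   /* SSE2 */

    /* One's-complement sum of a buffer, 16 bytes per iteration.
       Word order follows the host (little-endian) byte order, so the
       final value needs a byte swap before going into a packet header. */
    static uint16_t cksum_sse2(const uint8_t *data, size_t len)
    {
        __m128i zero = _mm_setzero_si128();
        __m128i sum  = _mm_setzero_si128();   /* four 32-bit partial sums */
        uint64_t acc = 0;

        while (len >= 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)data);
            /* Zero-extend the eight 16-bit words to 32 bits and accumulate. */
            sum = _mm_add_epi32(sum, _mm_unpacklo_epi16(v, zero));
            sum = _mm_add_epi32(sum, _mm_unpackhi_epi16(v, zero));
            data += 16;
            len  -= 16;
        }

        /* Reduce the four lanes, then handle the scalar tail. */
        uint32_t lanes[4];
        _mm_storeu_si128((__m128i *)lanes, sum);
        acc = (uint64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];
        while (len >= 2) { acc += data[0] | (data[1] << 8); data += 2; len -= 2; }
        if (len) acc += data[0];

        /* Fold the carries and take the one's complement. */
        while (acc >> 16) acc = (acc & 0xffff) + (acc >> 16);
        return (uint16_t)~acc;
    }

    int main(void)
    {
        uint8_t pkt[64] = { 0x45, 0x00, 0x00, 0x40 /* rest zero */ };
        printf("checksum = 0x%04x\n", cksum_sse2(pkt, sizeof(pkt)));
        return 0;
    }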
Here is a timeline of relevant hardware and software events. First, here is what has already happened:
- Intel announce goal of 8x SIMD speedup.
- Intel reach 2x with Sandy Bridge (AVX).
- Intel reach 4x with Haswell (AVX2).
- Snabb Switch moves memcpy from NIC to SIMD (SSE2 in glibc).
- Snabb Switch moves checksum from NIC to SIMD (SSE2+AVX2).
Here is what is still to come:
- glibc memcpy upgrades to AVX2 (distros adopt >= 2.20; see the note after this list).
- Intel reach 8x with Skylake/Cannonlake (AVX512; ~2016/2017).
- Older CPUs with slow SIMD die out, both in Snabb Lab and in the wild.
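A note on the glibc memcpy item above: nothing in our code has to change to pick up that improvement. glibc selects its memcpy implementation at load time based on the CPU's features (IFUNC dispatch), so the application side stays a plain memcpy call. A trivial, hypothetical illustration:

    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        /* The "offload" from the caller's point of view is just memcpy.
           glibc picks an SSE2/AVX/AVX2 variant at load time depending on
           what the CPU supports, so a newer glibc on a newer CPU makes
           this copy faster without recompiling or changing this code. */
        static char src[1500] = "packet payload";
        static char dst[1500];
        memcpy(dst, src, sizeof(src));
        printf("copied %zu bytes: %s\n", sizeof(src), dst);
        return 0;
    }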
This is a really nice roadmap! The work we have already done will keep paying us performance dividends as the hardware and software ecosystem moves forward. We sit back and enjoy riding the wave. Our creative energy can be spent on fun things like adding features or moving more work onto SIMD.
The alternative roadmap, based on NIC offloads, would not be so nice. We would have some poorly performing special cases (e.g. tunnelled traffic) and we would be working hard to reduce them by implementing increasingly complex NIC-specific offloads, like the nested VXLAN/NVGRE/Geneve (but not L2TPv3...) offloads in Intel's latest NIC. This would mean inconsistent performance across protocols, workloads, and NICs; complex hardware interfaces propagating through our whole codebase (and colliding with each other as we support more hardware); and more maintenance work every time a supported vendor releases a new NIC.
No thanks :-).
So that is the theory. How about practice?
On Haswell we already seem to be ahead. On our mid-range Haswell (2.4 GHz E5-2620v3) we can run a 10 Gbps iperf between two VMs on the same Snabb Switch. The Snabb Switch is processing 20 Gbps of traffic (each packet looped through the NIC), checksumming all data in both directions, performing all data transfers, and dealing with real 1500-byte packets (not easy 64 KB pseudo-packets). That's all done in software on one CPU core.
Our raw packet-forwarding performance is also comfortably above our target of 10 Gbps full-duplex per core w/ 256-byte average packet size. (NB: This does not depend on offloads when the guest is a packet-forwarding application like a router rather than an endpoint like a server.)
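For a rough sense of what that target means per packet, here is a back-of-the-envelope calculation (my own numbers, assuming a 2.4 GHz core like the E5-2620v3 above and ignoring Ethernet preamble/FCS/inter-frame-gap overhead):

    #include <stdio.h>

    int main(void)
    {
        /* 10 Gbps full duplex = 20 Gbps aggregate through the core.
           With 256-byte average packets that is ~9.77 Mpps, leaving a
           budget of roughly 246 CPU cycles per packet at 2.4 GHz.
           (Real per-frame overhead lowers the packet rate a little,
           so the true budget is slightly more generous.) */
        double bits_per_sec   = 2 * 10e9;
        double bits_per_pkt   = 256 * 8;
        double pkts_per_sec   = bits_per_sec / bits_per_pkt;
        double cpu_hz         = 2.4e9;
        double cycles_per_pkt = cpu_hz / pkts_per_sec;
        printf("%.2f Mpps, ~%.0f cycles per packet\n",
               pkts_per_sec / 1e6, cycles_per_pkt);
        return 0;
    }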
The performance on our older Sandy Bridge lab servers seems to take a hit from the new SIMD code. (I don't have definite numbers yet.) This is actually a _good_ thing. The whole intention of this design is to link our performance to SIMD performance because that is taking off like a rocket. Performance gains from moving to newer microarchitectures and performance drops from moving back to older ones are two sides of the same coin. We do care about the performance on older CPUs, and we need to understand it, but the world moves fast and we develop for tomorrow's CPUs and not yesterday's.
Onward! :-)
-Luke