SIMD FTW, or, Straightline redux

Luke Gorrie

Mar 19, 2015, 5:17:40 AM
to snabb...@googlegroups.com
Howdy!

Let us take a moment to recap the "straight line" design concept that is gaining currency in Snabb Switch.

Snabb Switch is replacing special-case NIC offloads with general-purpose SIMD offloads. So far we have replaced NIC memory transfers with SIMD memory transfers and NIC checksums with SIMD checksums. (This applies mostly to the Virtio-net/NFV code.)
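
To give a flavour of what a "SIMD checksum" means in practice, here is a minimal sketch of an SSE2 inner loop for the Internet checksum. This is an illustration only, not the actual Snabb Switch routine: for brevity it assumes the length is a multiple of 16 bytes and leaves the final byte-swap and complement to the caller.

/* Minimal sketch of an SSE2 checksum inner loop (illustration only).
   Eight 16-bit words are widened to 32 bits per load and accumulated,
   so carries only need folding once at the end. */
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

static uint32_t sum16_sse2(const uint8_t *data, size_t len)
{
    __m128i zero = _mm_setzero_si128();
    __m128i acc  = _mm_setzero_si128();          /* four 32-bit partial sums */
    for (size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        /* zero-extend the 16-bit words to 32 bits and accumulate */
        acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(v, zero));
        acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(v, zero));
    }
    uint32_t lane[4];
    _mm_storeu_si128((__m128i *)lane, acc);
    uint32_t sum = lane[0] + lane[1] + lane[2] + lane[3];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);      /* end-around carry fold */
    return sum;   /* caller byte-swaps and complements to finish */
}

The same loop body can be widened to AVX2 (32-byte loads) on Haswell and AVX512 (64-byte loads) on Skylake without changing the surrounding logic, which is exactly where the per-core speedup comes from.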

There are several reasons we are doing this:

- x86 SIMD performance is increasing 8x per core from SSE2 to AVX512.
- SIMD offloads work for more diverse workloads than NIC offloads.
- SIMD offloads work the same for all NICs.

This takes us towards a sweet spot with simple code, high performance, and consistency across diverse workloads.

Here is a timeline of relevant hardware and software events. First, here is what has already happened:

- Intel announce goal of 8x SIMD speedup.
- Intel reach 2x with Sandy Bridge (AVX).
- Intel reach 4x with Haswell (AVX2).
- Snabb Switch moves memcpy from NIC to SIMD (SSE2 in glibc).
- Snabb Switch moves checksum from NIC to SIMD (SSE2+AVX2).

Here is what is still to come:

- glibc memcpy upgrades to AVX2 (distros adopt >= 2.20).
- Intel reach 8x with Skylake/Cannonlake (AVX512; ~2016/2017).
- Older CPUs with slow SIMD die out, both in Snabb Lab and in the wild.

This is a really nice roadmap! The work we have already done will keep paying us performance dividends as the hardware and software ecosystem moves forward. We can sit back and enjoy riding the wave. Our creative energy can be spent on fun things like adding features or moving more work onto SIMD.

The alternative roadmap, based on NIC offloads, would not be so nice. We would have some poorly performing special cases (e.g. tunnelled traffic) and we would be working hard to reduce these by implementing increasingly complex NIC-specific offloads, like the nested VXLAN/NVGRE/Geneve (but not L2TPv3...) offloads in Intel's latest NIC. This would mean inconsistent performance between different protocols/workloads/NICs, complex hardware interfaces propagating through our whole codebase (and colliding with each other as we support more hardware), and more maintenance work every time a supported vendor releases a new NIC.

No thanks :-).

So that is the theory. How about practice?

On Haswell we already seem to be ahead. On our mid-range Haswell (2.4 GHz E5-2620v3) we can run a 10 Gbps iperf between two VMs on the same Snabb Switch. The Snabb Switch is processing 20 Gbps of traffic (each packet looped through the NIC), checksumming all data in both directions, performing all data transfers, and dealing with real 1500-byte packets (not easy 64 KB pseudo-packets). That's all done in software on one CPU core.

Our raw packet-forwarding performance is also comfortably above our target of 10 Gbps full-duplex per core w/ 256-byte average packet size. (NB: This does not depend on offloads when the guest is a packet-forwarding application like a router rather than an endpoint like a server.)

The performance on our older Sandy Bridge lab servers seems to take a hit from the new SIMD code. (I don't have definite numbers yet.) This is actually a _good_ thing. The whole intention of this design is to link our performance to SIMD performance, because that is taking off like a rocket. Performance gains from moving to newer microarchitectures and performance drops from moving back to older ones are two sides of the same coin. We do care about the performance on older CPUs, and we need to understand it, but the world moves fast and we develop for tomorrow's CPUs and not yesterday's.

Onward! :-)
-Luke


Luke Gorrie

Apr 29, 2015, 3:21:18 PM
to snabb...@googlegroups.com
Following up on the trade-offs we have made with SIMD:

On 19 March 2015 at 10:17, Luke Gorrie <lu...@snabb.co> wrote:
The performance on our older Sandy Bridge lab servers seems to take a hit from the new SIMD code.

I had a really pleasant discussion with a major network operator today. They tested snabbnfv on an Ivy Bridge processor and saw lower iperf performance than expected: 8.7 Gbps of goodput instead of 9.3 Gbps. I took the opportunity to explain how we have moved offloads from the NIC onto SIMD and how this benefits newer CPU models at the expense of older ones.

I am pleased to say that they absolutely agreed with our design. They see it as very valuable to have the freedom to work with any encapsulations that they want, and no problem that we are optimizing for early/late 2015 CPU models.
 
So that is one very positive data point.

Here is how I described our new offload design:

Here is a little bit of background...

"The Overhead of Software Tunneling"

Here they are saying that x86 can get excellent performance when the NIC does TCP checksums (~10 Gbps with one core) but that performance drops when the NIC cannot do the TCP checksum (~2.3 Gbps in 2012).

The NIC traditionally cannot do offloads when encapsulation is used, so encapsulated traffic is slow. The first hack around this is STT, which uses a TCP-like tunnel header to trick the NIC into applying its TCP offloads to encapsulated traffic. The next hacks, coming out now, are network cards that can offload a limited set of tunnel protocols (VXLAN, Geneve, NVGRE). However, nobody has a solution for hardware offload with other encapsulations like L2TPv3, GTP, MPLS, etc.

This is where SIMD comes in. SIMD is the vector part of the x86 instruction set that has been called MMX, SSE, and AVX over the years. SIMD instructions operate on more data than normal ones: 16 bytes per instruction on Sandy/Ivy Bridge, 32 bytes on Haswell, and 64 bytes on Skylake (expected later this year). Intel are massively speeding up the SIMD instructions for scientific/HPC applications -- the question is, is that useful for networking too?
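
To make the width difference concrete, here is a toy copy loop (an illustration only, not glibc's or Snabb Switch's actual memcpy). At 16 bytes per instruction a 1500-byte packet needs roughly 94 vector iterations, at 32 bytes roughly 47, and at 64 bytes roughly 24.

/* Toy sketch: move a payload 32 bytes per iteration with AVX
   registers; an SSE2 version of the same loop would move 16. */
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   /* AVX intrinsics */

static void copy_avx(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {    /* 32 bytes per iteration */
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
    for (; i < len; i++)                /* scalar tail */
        dst[i] = src[i];
}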

We started experimenting with this in Snabb Switch earlier this year and quickly realized that SIMD can take over the offloads that have traditionally been done by the NIC. That is wonderful because the offloads then apply to all protocols and all NICs; you can simply stop worrying about that stuff. However, there is a transition period now while Intel moves from 16-byte SIMD to 64-byte SIMD, and it means that results are better on newer CPUs and worse on older ones. To me this seems like an acceptable trade-off -- but of course it takes some explanation when, like today, you run a simple test on a 1-2 year old machine and don't see line rate.

For what it is worth, here is how your test setup looks on a fairly slow Haswell server (2.4 GHz no turbo):

root@vm1:~# iperf -i2 -t10 -c10.2
------------------------------------------------------------
Client connecting to 10.2, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 10.0.0.1 port 42269 connected with 10.0.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 2.0 sec  2.15 GBytes  9.23 Gbits/sec
[  3]  2.0- 4.0 sec  2.17 GBytes  9.32 Gbits/sec
[  3]  4.0- 6.0 sec  2.17 GBytes  9.31 Gbits/sec
[  3]  6.0- 8.0 sec  2.19 GBytes  9.39 Gbits/sec
[  3]  8.0-10.0 sec  2.19 GBytes  9.39 Gbits/sec
[  3]  0.0-10.0 sec  10.9 GBytes  9.33 Gbits/sec

and with L2TPv3 tunnels enabled in Snabb Switch on both sides:

root@vm1:~# iperf -i2 -t10 -c10.2
------------------------------------------------------------
Client connecting to 10.2, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 10.0.0.1 port 42272 connected with 10.0.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 2.0 sec  2.06 GBytes  8.84 Gbits/sec
[  3]  2.0- 4.0 sec  2.10 GBytes  9.01 Gbits/sec
[  3]  4.0- 6.0 sec  2.10 GBytes  9.01 Gbits/sec
[  3]  6.0- 8.0 sec  2.10 GBytes  9.00 Gbits/sec
[  3]  0.0-10.0 sec  10.4 GBytes  8.97 Gbits/sec

So we are getting basically line rate in both cases when accounting for protocol overhead.
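
For reference, here is the back-of-the-envelope arithmetic behind "accounting for protocol overhead" (the encapsulation byte counts are my rough assumptions, not a measured breakdown). A 1500-byte IP packet carries 1448 bytes of TCP payload once the IP/TCP headers and timestamp options are subtracted, and occupies 1538 bytes on the wire with the Ethernet header, FCS, preamble, and inter-frame gap, so plain TCP tops out around 10 * 1448/1538 = ~9.4 Gbps. Adding roughly 66 bytes per packet of outer Ethernet + IPv6 + L2TPv3 headers stretches that to 1604 bytes on the wire, or about 10 * 1448/1604 = ~9.0 Gbps. Both measured numbers above sit just under those ceilings.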

That is how I think the future of networking looks :-) and, with the Skylake processor and its SIMD upgrade, there will be extra potential for adding more features without impacting performance.