Single ipsec tunnel performance

Charles Vaske

Nov 9, 2015, 1:22:21 AM
to highspeedencryption
Reproducing the pcrypt setup from a few years ago, I'm able to get
~5Gbit/s on a single tunnel. The setup is two machines, each with:

cpu: dual socket Xeon E5 v3 
kernel: 3.13.0-32-generic #57-Ubuntu SMP
net: 2x 10Gb ports from an Intel 82599 NIC

Port 1 on each machine is connected to switch 1, and port 2 is
connected to switch 2, for switch redundancy. The config is a layer 3
fabric, so ECMP already manages individual streams. This limits
single-stream performance to 10Gbit/s between the hosts, but multiple
streams hit 20Gbit/s:

cvaske@surely:~$ iperf -c maeby --print_mss
------------------------------------------------------------
Client connecting to maeby, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local surely port 33664 connected with maeby port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.5 GBytes  9.89 Gbits/sec
[  3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
cvaske@surely:~$ iperf -c maeby --print_mss -P 2
------------------------------------------------------------
Client connecting to maeby, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local surely port 33665 connected with maeby port 5001
[  4] local surely port 33666 connected with maeby port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  11.5 GBytes  9.89 Gbits/sec
[  3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
[  4]  0.0-10.0 sec  11.5 GBytes  9.87 Gbits/sec
[  4] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
[SUM]  0.0-10.0 sec  23.0 GBytes  19.8 Gbits/sec
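The per-flow pinning that ECMP does can be illustrated with a small sketch (the addresses and the use of SHA-256 here are purely illustrative; real stacks use a faster keyed hash over the flow 5-tuple, but the consequence is the same):

```python
import hashlib

def ecmp_path(src, dst, sport, dport, proto, n_paths=2):
    """Pick one of n_paths links by hashing the flow 5-tuple.
    Every packet of a given flow hashes to the same link, so a
    single TCP stream can never use more than one 10Gb port,
    while multiple streams can spread across both."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# The same flow always takes the same path; a second flow with a
# different source port may land on the other link.
p1 = ecmp_path("10.0.0.1", "10.0.0.2", 33664, 5001, "tcp")
p2 = ecmp_path("10.0.0.1", "10.0.0.2", 33664, 5001, "tcp")
assert p1 == p2  # deterministic per flow
```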

Setting up a standard IPsec host-host tunnel according to this setup:


I played with various IRQ affinities using the Intel driver script
from the tarball. There's a version on GitHub here, but I don't know
if it's the same:


I get 1.02 to 1.04Gbit/s, regardless of any IRQ affinities I set.

Next, I set up pcrypt with the following two commands:

modprobe pcrypt
modprobe tcrypt alg="pcrypt(authenc(hmac(sha1),cbc(aes)))" type=3

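After the modprobe, the parallelized template should show up in /proc/crypto. A minimal check of that file's name/value format (the sample text below is illustrative; the driver names on a real system may differ):

```python
def registered_algs(proc_crypto_text):
    """Collect the 'name :' fields from /proc/crypto-style text."""
    algs = set()
    for line in proc_crypto_text.splitlines():
        if line.startswith("name"):
            algs.add(line.split(":", 1)[1].strip())
    return algs

# Illustrative sample; on a live host use open("/proc/crypto").read()
sample = """\
name         : pcrypt(authenc(hmac(sha1),cbc(aes)))
driver       : pcrypt(authenc(hmac(sha1-ssse3),cbc-aes-aesni))
"""
print("pcrypt active:",
      any(a.startswith("pcrypt") for a in registered_algs(sample)))
```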
After stopping and starting ipsec, I get ~4.5Gbit/s:

cvaske@surely$ iperf -c maeby --print_mss
------------------------------------------------------------
Client connecting to maeby, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local surely port 33661 connected with maeby port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.69 GBytes  4.89 Gbits/sec
[  3] MSS size 8890 bytes (MTU 8930 bytes, unknown interface)

The iperf server process is taking about 90% CPU, and kworker threads
~30% CPU. The iperf client process is taking about 60% CPU, and
kworker threads are taking about 16% CPU each.

As I experimented with different IRQ affinities, I was only able to
decrease performance. Here are the results from various settings:

local -- spread affinities to processors on the same NUMA node as the NIC
proc0 -- all affinities to processor 0
all -- spread interrupts along all processors
remote -- spread affinities along processors on the remote NUMA node (should be slower)
-x -- also set the XPS transmit affinities
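For reference, these settings boil down to writing hex CPU bitmasks into /proc/irq/&lt;n&gt;/smp_affinity and, for XPS, /sys/class/net/&lt;dev&gt;/queues/tx-&lt;q&gt;/xps_cpus. A sketch that just computes and prints the masks (the IRQ number and device name are placeholders, not the actual test hosts'):

```python
def cpu_mask(cpus):
    """Hex bitmask for a list of CPU ids, in the format the
    kernel expects in smp_affinity and xps_cpus files."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return format(mask, "x")

# Placeholder values: IRQ 71 and eth2 tx queue 0 are illustrative.
local_node = [0, 2, 4, 6]  # e.g. CPUs on the NIC's NUMA node
print(f"echo {cpu_mask(local_node)} > /proc/irq/71/smp_affinity")
print(f"echo {cpu_mask(local_node)} > "
      f"/sys/class/net/eth2/queues/tx-0/xps_cpus")
```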

After testing these settings, even clearing out the XPS settings that the -x option had set wouldn't improve performance past an average of 3.5Gbit/s, so I must have messed something up there.

The data for the plot, in R format:

iperf.10sec <- list("local"=c(4.61, 4.67, 4.57, 4.62, 4.59, 4.94, 4.84, 4.40, 4.60),
                    "proc0"=c(4.53, 4.99, 4.57, 4.48, 4.48, 4.66, 4.53, 4.39),
                    "all"=c(4.36, 4.62, 4.73, 4.66, 4.56, 4.69, 4.60, 4.45, 4.67, 4.55),
                    "all (5 threads)"=c(4.55, 4.73, 4.13, 4.64, 4.66, 4.14, 4.35, 4.46, 4.66, 4.59),
                    "remote"=c(3.54, 3.56, 3.51, 3.59, 3.51, 3.49, 3.57, 3.55, 3.54, 3.49),
                    "local (5 threads)"=c(4.39, 4.47, 4.31, 4.39, 4.38, 4.50, 4.39, 4.40, 4.57, 4.27),
                    "local (try 2)"=c(4.28, 4.43, 4.27, 4.32, 4.29, 4.30, 4.03, 4.17, 4.31, 4.28),
                    "local -x"=c(3.61, 3.60, 3.54, 3.55, 3.50, 3.56, 3.57, 3.55, 3.58, 3.52),
                    "proc0 -x"=c(3.55, 3.53, 3.45, 3.50, 3.55, 3.50, 3.48, 3.63, 3.56, 3.50),
                    "all -x"=c(3.72, 3.68, 3.57, 3.64, 3.53, 3.68, 3.73, 3.61, 3.66, 3.60))
par(mar=c(5.5,7.5,1,1))
boxplot(rev(iperf.10sec), xlab="Gbit/s", horizontal=TRUE, las=1)
dev.copy(png, file="pcrypt.png")
dev.off()

4.5Gbit/s seems like a good starting point for scaling toward 10Gbit/s, and there appears to be enough CPU headroom for that, if we can somehow set up separate tunnels. That seems like a reasonable next step.

Alan Hannan

Nov 9, 2015, 8:34:55 AM
to highspeedencryption
Chris - you are awesome and thank you for setting up this test and doing this!

Alan Hannan

Nov 9, 2015, 8:38:42 AM
to highspeedencryption
One issue the literature flags is out-of-order packet delivery. If we aren't careful, we might speed up the packets but create huge memory requirements and overhead for the end-to-end clients in reordering packets for the application.

With ECMP, is the traffic hashed on a deterministic basis, do you know?

What is making the ECMP decisions? The Linux box itself, or the router/switch on either side? And do you know how it's doing multi-path selection?


Charles Vaske

Nov 9, 2015, 10:52:58 AM
to highspeedencryption
Since there's only a single switch hop between the two nodes, I think only the Linux kernel has a chance to make the ECMP decisions, which I believe are based on a hash of IPs and ports. It appears to be deterministic, at least from small amounts of tcpdump monitoring.