Reproducing the pcrypt setup from a few years ago, I'm able to get
~5Gbit/s on a single tunnel. The setup is two machines, each with:
cpu: dual socket Xeon E5 v3
kernel: 3.13.0-32-generic #57-Ubuntu SMP
net: 2x 10Gb ports from an Intel 82599 NIC
Port 1 on each machine is connected to switch 1, and port 2 is
connected to switch 2, for switch redundancy. The config is a layer 3
fabric, so ECMP already spreads individual streams across paths. This
limits a single stream to 10 Gbit/s between the hosts, but multiple
streams hit 20 Gbit/s:
cvaske@surely:~$ iperf -c maeby --print_mss
------------------------------------------------------------
Client connecting to maeby, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 3] local surely port 33664 connected with maeby port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 11.5 GBytes 9.89 Gbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
cvaske@surely:~$ iperf -c maeby --print_mss -P 2
------------------------------------------------------------
Client connecting to maeby, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 3] local surely port 33665 connected with maeby port 5001
[ 4] local surely port 33666 connected with maeby port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 11.5 GBytes 9.89 Gbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
[ 4] 0.0-10.0 sec 11.5 GBytes 9.87 Gbits/sec
[ 4] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
[SUM] 0.0-10.0 sec 23.0 GBytes 19.8 Gbits/sec
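The single-stream ceiling comes from ECMP's flow hashing: each 5-tuple maps to exactly one path, so one TCP stream can never use both ports at once, while the two streams in the -P 2 run can land on different paths. A toy illustration of the idea (the addresses, the two-path modulus, and using cksum as the hash are all stand-in assumptions, not what the fabric actually does):

```shell
# Two flows differing only in source port, as in the iperf -P 2 run above.
flow1="10.0.0.1 10.0.0.2 tcp 33665 5001"   # hypothetical addresses
flow2="10.0.0.1 10.0.0.2 tcp 33666 5001"

# Hash the 5-tuple and pick one of two paths (toy stand-in for the switch ASIC).
path() { printf '%s' "$1" | cksum | awk '{ print $1 % 2 }'; }

echo "flow1 -> path $(path "$flow1")"
echo "flow2 -> path $(path "$flow2")"
```

The same flow always hashes to the same path, which is why a single stream tops out at one port's worth of bandwidth while aggregate throughput scales with stream count.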
I set up a standard IPsec host-to-host tunnel according to this setup:
I played with various IRQ affinities using the Intel driver script
from the tarball. There's a version on GitHub here, but I don't know
if it's the same:
I get 1.02-1.04 Gbit/s, regardless of any IRQ affinities I set.
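For reference, the affinity script is ultimately just writing CPU bitmasks, and the same thing can be done by hand. A minimal sketch (the IRQ number 123 and the choice of CPUs 0-3 are hypothetical; a real system's NIC queue IRQs are listed in /proc/interrupts):

```shell
# Build a hex bitmask selecting CPUs 0-3 (bits 0 through 3 set).
mask=$(printf '%x' $(( (1 << 0) | (1 << 1) | (1 << 2) | (1 << 3) )))
echo "$mask"    # f

# Writing it takes root; IRQ 123 is a placeholder for one of the NIC's queues:
# echo "$mask" > /proc/irq/123/smp_affinity
```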
Next, I set up pcrypt with the following two commands:
modprobe pcrypt
modprobe tcrypt alg="pcrypt(authenc(hmac(sha1),cbc(aes)))" type=3
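Note that tcrypt is a test module whose load is expected to "fail" after doing its work, so an error from the second modprobe is normal. A hedged way to confirm the parallelized instance actually registered (requires root, and the exact /proc/crypto formatting varies by kernel):

```shell
modprobe pcrypt
modprobe tcrypt alg="pcrypt(authenc(hmac(sha1),cbc(aes)))" type=3 2>/dev/null || true

# The instantiated algorithm should now show up in the crypto registry:
grep -B1 -A5 'pcrypt(authenc' /proc/crypto
```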
After stopping and starting ipsec, I get ~4.5 Gbit/s:
cvaske@surely$ iperf -c maeby --print_mss
------------------------------------------------------------
Client connecting to maeby, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 3] local surely port 33661 connected with maeby port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 5.69 GBytes 4.89 Gbits/sec
[ 3] MSS size 8890 bytes (MTU 8930 bytes, unknown interface)
The iperf server process is taking about 90% CPU, and kworker threads
~30% CPU. The iperf client process is taking about 60% CPU, and
kworker threads are taking about 16% CPU each.
As I experimented with different IRQ affinities, I was only able to
decrease performance. Here are the results from the various settings:

local -- spread affinities across processors on the same NUMA node as the NIC
proc0 -- pin all affinities to processor 0
all -- spread interrupts across all processors
remote -- spread affinities across processors on the remote NUMA node (should be slower)
-x -- also set the XPS transmit affinities
After testing these settings, even clearing the XPS settings that the -x option had set wouldn't improve performance past an average of 3.5 Gbit/s, so I must have messed something up there. The data for the plot, in R format:
iperf.10sec <- list("local"=c(4.61, 4.67, 4.57, 4.62, 4.59, 4.94, 4.84, 4.40, 4.60),
                    "proc0"=c(4.53, 4.99, 4.57, 4.48, 4.48, 4.66, 4.53, 4.39),
                    "all"=c(4.36, 4.62, 4.73, 4.66, 4.56, 4.69, 4.60, 4.45, 4.67, 4.55),
                    "all (5 threads)"=c(4.55, 4.73, 4.13, 4.64, 4.66, 4.14, 4.35, 4.46, 4.66, 4.59),
                    "remote"=c(3.54, 3.56, 3.51, 3.59, 3.51, 3.49, 3.57, 3.55, 3.54, 3.49),
                    "local (5 threads)"=c(4.39, 4.47, 4.31, 4.39, 4.38, 4.50, 4.39, 4.40, 4.57, 4.27),
                    "local (try 2)"=c(4.28, 4.43, 4.27, 4.32, 4.29, 4.30, 4.03, 4.17, 4.31, 4.28),
                    "local -x"=c(3.61, 3.60, 3.54, 3.55, 3.50, 3.56, 3.57, 3.55, 3.58, 3.52),
                    "proc0 -x"=c(3.55, 3.53, 3.45, 3.50, 3.55, 3.50, 3.48, 3.63, 3.56, 3.50),
                    "all -x"=c(3.72, 3.68, 3.57, 3.64, 3.53, 3.68, 3.73, 3.61, 3.66, 3.60))
par(mar=c(5.5, 7.5, 1, 1))
boxplot(rev(iperf.10sec), xlab="Gbit/s", horizontal=TRUE, las=1)
dev.copy(png, file="pcrypt.png")
dev.off()
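For a quick numeric summary without R, the per-setting means can be computed in the shell. For example, the first "local" series (numbers copied from the list above):

```shell
printf '%s\n' 4.61 4.67 4.57 4.62 4.59 4.94 4.84 4.40 4.60 |
  awk '{ s += $1; n++ } END { printf "mean %.2f Gbit/s over %d runs\n", s/n, n }'
# mean 4.65 Gbit/s over 9 runs
```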
4.5 Gbit/s seems like a good starting point for scaling toward 10 Gbit/s, and there appears to be enough CPU left over for that, if we can somehow set up separate tunnels. That seems like a reasonable next step.
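One hedged sketch of what "separate tunnels" could look like, assuming an ipsec.conf-style keying daemon (all addresses are placeholders, and whether flows actually spread across the SAs still depends on routing and ECMP hashing):

```
# Two host-to-host conns between distinct secondary addresses, so each
# SA gets its own crypto processing.
conn tunnel1
    left=10.0.1.1
    right=10.0.1.2
    authby=secret
    auto=start

conn tunnel2
    left=10.0.2.1
    right=10.0.2.2
    authby=secret
    auto=start
```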