Reproducing the pcrypt setup from a few years ago, I'm able to get
~5Gbit/s on a single tunnel. The setup is two machines, each with:
cpu: dual socket Xeon E5 v3
kernel: 3.13.0-32-generic #57-Ubuntu SMP
net: 2x 10Gb ports from an Intel 82599 NIC
Port 1 on each machine is connected to switch 1, and port 2 is
connected to switch 2, for switch redundancy. The config is a layer 3
fabric, so ECMP already spreads individual streams across paths. This
limits a single stream to 10 Gbit/s between the hosts, but multiple
streams hit 20 Gbit/s:
cvaske@surely:~$ iperf -c maeby --print_mss
------------------------------------------------------------
Client connecting to maeby, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 3] local surely port 33664 connected with maeby port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 11.5 GBytes 9.89 Gbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
cvaske@surely:~$ iperf -c maeby --print_mss -P 2
------------------------------------------------------------
Client connecting to maeby, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 3] local surely port 33665 connected with maeby port 5001
[ 4] local surely port 33666 connected with maeby port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 11.5 GBytes 9.89 Gbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
[ 4] 0.0-10.0 sec 11.5 GBytes 9.87 Gbits/sec
[ 4] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
[SUM] 0.0-10.0 sec 23.0 GBytes 19.8 Gbits/sec
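The single-stream ceiling comes from ECMP's flow hashing: each 5-tuple maps to exactly one path, so one TCP stream can never use both ports at once, while the two streams in the -P 2 run can land on different paths. A toy illustration of the idea (the addresses, the two-path modulus, and using cksum as the hash are all stand-in assumptions, not what the fabric actually does):

```shell
# Two flows differing only in source port, as in the iperf -P 2 run above.
flow1="10.0.0.1 10.0.0.2 tcp 33665 5001"   # hypothetical addresses
flow2="10.0.0.1 10.0.0.2 tcp 33666 5001"

# Hash the 5-tuple and pick one of two paths (toy stand-in for the switch ASIC).
path() { printf '%s' "$1" | cksum | awk '{ print $1 % 2 }'; }

echo "flow1 -> path $(path "$flow1")"
echo "flow2 -> path $(path "$flow2")"
```

The same flow always hashes to the same path, which is why a single stream tops out at one port's worth of bandwidth while aggregate throughput scales with stream count.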
I set up a standard IPsec host-to-host tunnel according to this setup:
I played with various IRQ affinities using the Intel driver script
from the tarball. There's a version on GitHub here, but I don't know
if it's the same:
I get 1.02-1.04 Gbit/s, regardless of any IRQ affinities I set.
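For reference, the affinity script is ultimately just writing CPU bitmasks, and the same thing can be done by hand. A minimal sketch (the IRQ number 123 and the choice of CPUs 0-3 are hypothetical; a real system's NIC queue IRQs are listed in /proc/interrupts):

```shell
# Build a hex bitmask selecting CPUs 0-3 (bits 0 through 3 set).
mask=$(printf '%x' $(( (1 << 0) | (1 << 1) | (1 << 2) | (1 << 3) )))
echo "$mask"    # f

# Writing it takes root; IRQ 123 is a placeholder for one of the NIC's queues:
# echo "$mask" > /proc/irq/123/smp_affinity
```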
Next, I set up pcrypt with the following two commands:
modprobe pcrypt
modprobe tcrypt alg="pcrypt(authenc(hmac(sha1),cbc(aes)))" type=3
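Note that tcrypt is a test module whose load is expected to "fail" after doing its work, so an error from the second modprobe is normal. A hedged way to confirm the parallelized instance actually registered (requires root, and the exact /proc/crypto formatting varies by kernel):

```shell
modprobe pcrypt
modprobe tcrypt alg="pcrypt(authenc(hmac(sha1),cbc(aes)))" type=3 2>/dev/null || true

# The instantiated algorithm should now show up in the crypto registry:
grep -B1 -A5 'pcrypt(authenc' /proc/crypto
```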
After stopping and starting ipsec, I get ~4.5 Gbit/s:
cvaske@surely$ iperf -c maeby --print_mss
------------------------------------------------------------
Client connecting to maeby, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 3] local surely port 33661 connected with maeby port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 5.69 GBytes 4.89 Gbits/sec
[ 3] MSS size 8890 bytes (MTU 8930 bytes, unknown interface)
The iperf server process is taking about 90% CPU, and kworker threads
~30% CPU. The iperf client process is taking about 60% CPU, and
kworker threads are taking about 16% CPU each.
As I experimented with different IRQ affinities, I was only able to
decrease performance. Here are the results from the various settings:

local -- spread affinities across processors on the same NUMA node as the NIC
proc0 -- pin all affinities to processor 0
all -- spread interrupts across all processors
remote -- spread affinities across processors on the remote NUMA node (should be slower)
-x -- also set the XPS transmit affinities
After testing these settings, even clearing the XPS settings that the -x option had set wouldn't improve performance past an average of 3.5 Gbit/s, so I must have messed something up there. The data for the plot, in R format:
iperf.10sec <- list("local"=c(4.61, 4.67, 4.57, 4.62, 4.59, 4.94, 4.84, 4.40, 4.60),
                    "proc0"=c(4.53, 4.99, 4.57, 4.48, 4.48, 4.66, 4.53, 4.39),
                    "all"=c(4.36, 4.62, 4.73, 4.66, 4.56, 4.69, 4.60, 4.45, 4.67, 4.55),
                    "all (5 threads)"=c(4.55, 4.73, 4.13, 4.64, 4.66, 4.14, 4.35, 4.46, 4.66, 4.59),
                    "remote"=c(3.54, 3.56, 3.51, 3.59, 3.51, 3.49, 3.57, 3.55, 3.54, 3.49),
                    "local (5 threads)"=c(4.39, 4.47, 4.31, 4.39, 4.38, 4.50, 4.39, 4.40, 4.57, 4.27),
                    "local (try 2)"=c(4.28, 4.43, 4.27, 4.32, 4.29, 4.30, 4.03, 4.17, 4.31, 4.28),
                    "local -x"=c(3.61, 3.60, 3.54, 3.55, 3.50, 3.56, 3.57, 3.55, 3.58, 3.52),
                    "proc0 -x"=c(3.55, 3.53, 3.45, 3.50, 3.55, 3.50, 3.48, 3.63, 3.56, 3.50),
                    "all -x"=c(3.72, 3.68, 3.57, 3.64, 3.53, 3.68, 3.73, 3.61, 3.66, 3.60))
par(mar=c(5.5, 7.5, 1, 1))
boxplot(rev(iperf.10sec), xlab="Gbit/s", horizontal=TRUE, las=1)
dev.copy(png, file="pcrypt.png")
dev.off()
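For a quick numeric summary without R, the per-setting means can be computed in the shell. For example, the first "local" series (numbers copied from the list above):

```shell
printf '%s\n' 4.61 4.67 4.57 4.62 4.59 4.94 4.84 4.40 4.60 |
  awk '{ s += $1; n++ } END { printf "mean %.2f Gbit/s over %d runs\n", s/n, n }'
# mean 4.65 Gbit/s over 9 runs
```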
4.5 Gbit/s seems like a good starting point for scaling toward 10 Gbit/s, and there appears to be enough CPU left over for that, if we can somehow set up separate tunnels. That seems like a reasonable next step.
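One hedged sketch of what "separate tunnels" could look like, assuming an ipsec.conf-style keying daemon (all addresses are placeholders, and whether flows actually spread across the SAs still depends on routing and ECMP hashing):

```
# Two host-to-host conns between distinct secondary addresses, so each
# SA gets its own crypto processing.
conn tunnel1
    left=10.0.1.1
    right=10.0.1.2
    authby=secret
    auto=start

conn tunnel2
    left=10.0.2.1
    right=10.0.2.2
    authby=secret
    auto=start
```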