Even though this driver implements stateless offloads - TXCSUM, RXCSUM, TSO, LRO - just like the original FreeBSD one (https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#stateless-offloads), the underlying ENA device does NOT implement RXCSUM or TSO (see amzn/amzn-drivers#29). It also looks like the LRO logic never gets activated, judging by the observed values of the relevant tracepoints. We use netchannels, so maybe it does not matter as much.
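For context, this is the kind of tracepoint-based check I mean, assuming OSv's TRACEPOINT macro from <osv/trace.hh>; the tracepoint names and the call site below are illustrative, not necessarily the ones the driver actually defines:

#include <osv/trace.hh>

// Illustrative tracepoints only -- the driver's real ones may be named and
// placed differently.
TRACEPOINT(trace_ena_rx_lro_accepted, "qid=%d", int);
TRACEPOINT(trace_ena_rx_lro_bypassed, "qid=%d", int);

// In the RX cleanup path, roughly:
//
//   if (lro_enabled && tcp_lro_rx(&rx_ring->lro, m, 0) == 0) {
//       trace_ena_rx_lro_accepted(qid);   // packet merged by LRO
//   } else {
//       trace_ena_rx_lro_bypassed(qid);   // packet goes straight to if_input
//       (*ifp->if_input)(ifp, m);
//   }
//
// If trace_ena_rx_lro_accepted never fires under a TCP_STREAM workload,
// the LRO logic is effectively dead code on this device.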
OSv running on a t3.nano instance reports this:
[D/22 ena]: Elastic Network Adapter (ENA)ena v2.6.3
[D/22 ena]: LLQ is not supported. Using the host mode policy.
[D/22 ena]: ena_attach: set max_num_io_queues to 2
[D/22 ena]: Enable only 3 MSI-x (out of 9), reduce the number of queues
[D/22 ena]: device offloads (caps): TXCSUM=2, TXCSUM_IPV6=0, TSO4=0, TSO6=0, RXCSUM=0, RXCSUM_IPV6=0, LRO=1, JUMBO_MTU=1
...
[D/22 ena]: ena_update_hwassist: CSUM_IP=1, CSUM_UDP=4, CSUM_TCP=2, CSUM_UDP_IPV6=0, CSUM_TCP_IPV6=0, CSUM_TSO=0
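For context, the hwassist line above is just the FreeBSD-style CSUM_* bits that the driver derives from the advertised caps. A minimal, self-contained sketch of that mapping; the helper name and signature are mine, only the flag names and values come from the log:

#include <cstdint>

// CSUM_* values as printed in the log above (they mirror the FreeBSD
// mbuf hwassist flags); the helper itself is illustrative, not the
// driver's actual code.
constexpr uint64_t CSUM_IP  = 0x0001;
constexpr uint64_t CSUM_TCP = 0x0002;
constexpr uint64_t CSUM_UDP = 0x0004;

// With TXCSUM advertised but TSO4/TSO6 and the IPv6 checksums all zero,
// the hwassist mask reduces to CSUM_IP | CSUM_TCP | CSUM_UDP -- exactly
// the 1/2/4 bits in the ena_update_hwassist line.
static uint64_t hwassist_from_caps(bool txcsum)
{
    return txcsum ? (CSUM_IP | CSUM_TCP | CSUM_UDP) : 0;
}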
Can anyone confirm whether the device indeed lacks RXCSUM and TSO? The issue I cite above was opened in 2017 and has never been updated. If so, how much of a performance impact does it have?
Relatedly, I have since improved the driver a bit. Mainly, I changed the "cleanup" logic (mostly the RX handling) so that each worker thread and its corresponding MSI-X vector are pinned to a single vCPU. That seems to reduce the number of IPIs, and in some workloads I see performance improve by 5-10%.
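A rough sketch of the thread side of that change, assuming one cleanup thread per IO queue and OSv's sched::thread::make()/attr() interface; the helper, the names, and the queue-to-CPU mapping are illustrative rather than the exact driver code (the matching MSI-X vector is then routed to the same vCPU):

#include <osv/sched.hh>
#include <functional>
#include <string>

// Create a per-queue RX/TX cleanup worker pinned to one vCPU so the
// interrupt, the cleanup work, and the wake-ups all stay on the same CPU
// and no cross-CPU IPI is needed to kick the worker.
// The name and the qid -> CPU mapping below are illustrative.
static sched::thread* make_pinned_cleanup_thread(int qid, std::function<void()> work)
{
    auto* cpu = sched::cpus[qid % sched::cpus.size()];
    auto* t = sched::thread::make(std::move(work),
                                  sched::thread::attr()
                                      .pin(cpu)
                                      .name("ena-cleanup-" + std::to_string(qid)));
    t->start();
    return t;
}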
I have also run more tests with iperf3 and netperf:
netperf -H 172.31.89.238 -t TCP_STREAM -l 5 -- -m 65536
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.31.89.238 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 65536  16384  65536    5.01     3776.40
iperf3 -t 5 -c 172.31.93.118
Connecting to host 172.31.93.118, port 5201
[ 5] local 172.31.90.167 port 55674 connected to 172.31.93.118 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 444 MBytes 3.72 Gbits/sec 5901 199 KBytes
[ 5] 1.00-2.00 sec 421 MBytes 3.54 Gbits/sec 5529 147 KBytes
[ 5] 2.00-3.00 sec 464 MBytes 3.89 Gbits/sec 5923 157 KBytes
[ 5] 3.00-4.00 sec 440 MBytes 3.69 Gbits/sec 6117 158 KBytes
[ 5] 4.00-5.00 sec 450 MBytes 3.78 Gbits/sec 6686 260 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-5.00 sec 2.17 GBytes 3.72 Gbits/sec 30156 sender
[ 5] 0.00-5.03 sec 2.16 GBytes 3.69 Gbits/sec receiver
With iperf3 I typically see a relatively high number of retransmits (the Retr column). Do you think this indicates some sort of bottleneck on the OSv side?
Relatedly, with both iperf3 and netperf I never see OSv exceed the 4 Gbits/s mark, so it never approaches the NIC bandwidth limit (the maximum for a t3.nano is 5 Gbits/s).
Finally, all the tests were conducted with the clients (wrk, iperf3, netperf, etc.) running on a t3.micro Ubuntu instance deployed in the same availability zone (us-east-1f) and the same VPC. I have also had no chance to compare against a Linux guest, so I have no idea whether these results are half decent or not.
Any input is highly appreciated.