Hi,
I've written a UDP server and client to evaluate the maximum performance I can get between 2 servers, each with 2 x dual-port ConnectX-3 NICs. The servers are directly connected (no switch) and the links run as 40 Gb Ethernet with a 9000 byte MTU (i.e. a total of 4 x 40 Gb links). My eventual application needs to receive around 23 Gb/s per link (92 Gb/s total), but I'm seeing inconsistent results in my UDP transmit performance.
I'm using VMA offloading for the Tx and Rx and that works nicely for a single UDP stream on a single NIC. My packet size is 8200 bytes.
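As a rough sanity check on those figures, the target bit rate can be converted into a packet rate. This is a minimal sketch in Python, assuming the 8200 bytes are UDP payload and ignoring Ethernet/IP/UDP header overhead; the function name is made up for illustration.

```python
def packets_per_second(rate_gbps, payload_bytes):
    """Packets per second needed to carry rate_gbps of payload.

    Assumption: rate_gbps counts payload bits only (1 Gb = 1e9 bits);
    header and framing overhead are ignored.
    """
    bits_per_packet = payload_bytes * 8
    return rate_gbps * 1e9 / bits_per_packet


if __name__ == "__main__":
    per_link = packets_per_second(23, 8200)
    print(f"per link: {per_link:.0f} packets/s")
    print(f"4 links:  {4 * per_link:.0f} packets/s")
```

Under those assumptions, 23 Gb/s with 8200 byte packets works out to roughly 350,000 packets per second per link, or about 1.4 million packets per second across all four links.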
When I try to run 2, 3 or 4 UDP streams on the other links between the 2 servers, I see substantial variation in my Tx performance. I'm binding each UDP sender to a separate CPU core and only allocating memory from the NUMA node adjacent to that CPU, so I believe everything is set up quite optimally in that respect.
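To make the per-stream pinning concrete, here is a minimal sketch in Python. It is an illustration only, not the actual sender: the helper names `pin_to_core` and `make_sender` are hypothetical, it uses the standard `os.sched_setaffinity` call (Linux only), and it does not go through VMA. NUMA-local memory allocation is not shown; that is typically handled outside the process, e.g. with `numactl --cpunodebind=N --membind=N`, or with libnuma.

```python
import os
import socket


def pin_to_core(core):
    """Restrict the calling process to a single CPU core (Linux only).

    Returns the resulting affinity set so the caller can verify it.
    """
    os.sched_setaffinity(0, {core})  # pid 0 means the calling process
    return os.sched_getaffinity(0)


def make_sender(local_ip):
    """Create a UDP socket bound to the local address of one link.

    Binding to the interface address makes the stream leave through
    that link. Port 0 lets the kernel pick the source port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((local_ip, 0))
    return sock
```

One sender process per link would call `pin_to_core` with a core on the NUMA node closest to that NIC, then `make_sender` with that link's local IP address, before entering its send loop.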
Does libvma have internal resources that would be shared across threads, processes, ports or event devices? For example, is there only one internal thread per server?
Thanks in advance,
Andrew