Hi Yacqub,
I ran some iperf3 tests of my own on your nodes. First I had amd226, amd227, and amd238 all talking to each other, each acting as server and client in turn, then I had amd218, amd226, and amd241 all talking to each other in the same manner. Unfortunately, while I saw the same retries and low throughput you reported across various permutations of node pairs, nothing stood out as a specific smoking gun (such as one single node being the problem). I will mention that overall, the tests involving amd227 and amd238 had the lowest throughput of all of them. I re-ran iperf3 tests between those two nodes specifically while watching traffic statistics on the switch, and I didn't see any appreciable signs of congestion or throttling on the switch even while the tests showed numerous retries and low throughput.
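For reference, the pairwise tests I ran were along these lines (hostnames are from your cluster; the flags are standard iperf3 options, and the specific stream count and duration are just what I happened to use, not magic numbers):

```shell
# On the "server" node for a given pair, e.g. amd227:
iperf3 -s

# On the "client" node, e.g. amd238:
#   -c  connect to the server host
#   -P  run several parallel streams
#   -t  test duration in seconds
#   -i  reporting interval in seconds
iperf3 -c amd227 -P 4 -t 30 -i 5
# The "Retr" column in the per-interval output is where the
# retransmission counts show up.
```

I then swapped the server and client roles and repeated, so each node in the trio got tested in both directions.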
Have you played around with kernel parameter tuning at all? Specifically, the kernel's maximum socket buffer sizes (net.core.rmem_max and net.core.wmem_max) and the TCP socket buffer autotuning ranges (net.ipv4.tcp_rmem and net.ipv4.tcp_wmem). You should also look into the various tuning options within iperf3 that can help with higher-bandwidth testing. While all of these retries could be a sign of NIC or cable problems, it would be worthwhile to eliminate or mitigate OS-level interference first and then work from there. You should also look into core pinning on both the client and server, either using the -A flag or with something like numactl. Even though these nodes are all the same hwtype, loaded with the same OS, and connected to the same switch, the OSes on each node could be doing considerably different things.
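To make that concrete, here's a sketch of the kind of tuning I mean (run as root; the buffer sizes are illustrative starting points for 10G+ links, not tuned recommendations, and the CPU numbers in the pinning examples are placeholders you'd pick based on which NUMA node the NIC hangs off of):

```shell
# Raise the kernel's hard ceilings on socket buffer sizes.
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728

# TCP autotuning ranges: min, default, max (bytes).
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"

# Then re-test with an explicit larger window and CPU affinity.
#   -w  socket buffer / window size
#   -A n,m  pin the client to CPU n and ask the server to pin to CPU m
iperf3 -c amd227 -w 64M -A 2,2 -t 30

# Or pin via numactl, keeping CPU and memory on one NUMA node:
numactl --cpunodebind=0 --membind=0 iperf3 -c amd227 -P 4 -t 30
```

Note that sysctl -w changes don't survive a reboot; if they help, you'd want to persist them in /etc/sysctl.d/. And since these nodes are the same hwtype, it's worth diffing the current sysctl values across them first (sysctl net.ipv4.tcp_rmem on each node) to see whether they've drifted.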
Sorry I don't have a more concrete answer for you, but hopefully my suggestions above can help you track this down.
Best,
- Aleks