Ah, that picture is very cool and helpful! I checked that all of the nodes reserved for my experiment are under the same leaf, so I guess in that case the iperf run really is pushing double the reported traffic through a single leaf. Could that level of traffic (combined with traffic from other experiments) be overloading the leaf?
I just ran a similar setup on the Clemson cluster and observed similar results, so I suspect the source of the inconsistency lies somewhere in the Linux networking stack. Has anyone tried similar throughput tests and seen the same behavior?
There's nothing inherently problematic with the stack at these data rates (the linux stack is really inefficient in packet parsing, but 40G TCP is pretty low PPS).
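Rough arithmetic, assuming standard 1500-byte frames (a sketch only; headers and ACKs shift it a bit):

  10 Gbit/s ÷ (1500 bytes × 8) ≈ 830k packets/sec
  40 Gbit/s ÷ (1500 bytes × 8) ≈ 3.3M packets/sec

versus the ~14.9M packets/sec of 64-byte line-rate 10G traffic that actually stresses the kernel's per-packet overhead.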
Are you using both interfaces on the middlebox, or just one?
Have you disabled segmentation offload on the NICs? (You almost certainly should - otherwise the middlebox may end up reassembling and then breaking up the packets again, which at the very least can have strange effects on jitter).
If you run the flow for much longer (say 3-5 minutes instead of 30 seconds), are the results more consistent?
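For reference, a rough sketch of those two checks (eth1 here is a placeholder; substitute whatever your experiment interfaces are called):

  # on sender, middlebox, and receiver: show, then disable TSO/GSO (transmit)
  # and GRO/LRO (receive coalescing)
  sudo ethtool -k eth1
  sudo ethtool -K eth1 tso off gso off gro off lro off

  # longer run, 5 minutes with per-second intervals (iperf3 -s on the receiver as before)
  iperf3 -c 192.200.0.1 -t 300 -i 1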
> There's nothing inherently problematic with the stack at these data rates (the linux stack is really inefficient in packet parsing, but 40G TCP is pretty low PPS).

I see. My background is not in networking, so I didn't have a good sense of what a standard configuration can handle.
> If you run the flow for much longer (say 3-5 minutes instead of 30 seconds), are the results more consistent?

I tried this both before and after disabling segmentation offload on the NICs. The results are still consistently lower than expected, around 7 Gbps throughput with many retransmits.
On Sat, Jun 24, 2017 at 1:06 PM, Jeff Helt <jmhe...@gmail.com> wrote:

>> There's nothing inherently problematic with the stack at these data rates (the linux stack is really inefficient in packet parsing, but 40G TCP is pretty low PPS).
>
> I see. My background is not in networking, so I didn't have a good sense of what a standard configuration can handle.

I'm trying to understand your topology - we don't usually use kernel *routing*, so maybe it's particularly poor (we stick generally to bridging). (The question of forwarding throughput is basically one of memory bandwidth divided by the number of memory copies the kernel decides to do for certain operations.) You just have 3 nodes, and the middlebox has ipv4 forwarding enabled, and just has routes in the host route table for each network? Or are you using some other feature of iproute2 to do the forwarding?
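I.e., I'd expect the middlebox side to be essentially just this (a sketch, assuming /24s on the two links; the connected routes come from the addresses themselves):

  # middlebox: enable forwarding
  sudo sysctl -w net.ipv4.ip_forward=1

  # sender: reach the far subnet via the middlebox
  sudo ip route add 192.200.0.0/24 via 192.100.0.2

  # receiver: and the reverse
  sudo ip route add 192.100.0.0/24 via 192.200.0.2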
I'm going to run a few tests to try to get a baseline on basic kernel routing performance, hopefully in the next hour or so, which may shed some light on the performance differences in the types of forwarding the kernel can do.
It's definitely a packet handling performance problem on the middlebox host - if you use a bridge you can do slightly better (7.73Gbps versus 6.64Gbps for routing on my sliver), but it's still not optimal. If you set the MTU to 9000 you can max out the connection without any problems at all, as that obviously reduces the packet load by a significant factor on the middlebox host (in both the bridging and routing cases).
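A minimal sketch of the bridging setup I mean, with placeholder interface names (note that with a bridge the sender and receiver end up on the same L2 segment, so the two-subnet addressing isn't needed):

  # middlebox: enslave both experiment NICs to a bridge
  sudo ip link add name br0 type bridge
  sudo ip link set eth1 master br0
  sudo ip link set eth2 master br0
  sudo ip link set br0 up

  # jumbo frames have to be set on every hop (endpoints too)
  sudo ip link set dev eth1 mtu 9000
  sudo ip link set dev eth2 mtu 9000
  sudo ip link set dev br0 mtu 9000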
Ultimately, it seems like the cores on these systems are simply not fast enough to handle 10G routing with the default kernel implementation (since it's generally a single-core operation, the core MHz matters more than you might like).
[ 3] 0.0-30.0 sec 68.2 GBytes 19.5 Gbits/sec
[ 3] 0.0-30.0 sec 62.7 GBytes 17.9 Gbits/sec
Strange because now I am seeing around 9.3-9.4 Gbps throughput on the Wisconsin and Emulab "d430" nodes, but with occasional significant drops, such as in the output below.

[ 4] 109.00-110.00 sec 1.09 GBytes 9.38 Gbits/sec 0 1.26 MBytes
[ 4] 110.00-111.00 sec 1.09 GBytes 9.37 Gbits/sec 0 1.57 MBytes
[ 4] 111.00-112.00 sec 894 MBytes 7.50 Gbits/sec 284 775 KBytes
[ 4] 112.00-113.00 sec 1.09 GBytes 9.40 Gbits/sec 0 1.02 MBytes
[ 4] 113.00-114.00 sec 1.09 GBytes 9.38 Gbits/sec 0 1.26 MBytes
[ 4] 114.00-115.00 sec 1.09 GBytes 9.35 Gbits/sec 11 1.21 MBytes
[ 4] 115.00-116.00 sec 1.09 GBytes 9.38 Gbits/sec 0 1.50 MBytes
[ 4] 116.00-117.00 sec 660 MBytes 5.54 Gbits/sec 271 962 KBytes
[ 4] 117.00-118.00 sec 1.09 GBytes 9.38 Gbits/sec 0 1.34 MBytes

I tried correlating the output of dropwatch running on the middlebox with the drops, but I haven't been able to draw any conclusions yet. Perhaps there are some background processes running that I am not properly controlling for.
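For reference, I was running dropwatch roughly like this (with kernel symbol lookup); the perf skb:kfree_skb tracepoint would be another way to see where packets are being dropped:

  sudo dropwatch -l kas
  dropwatch> start
  # ... run the iperf test ...
  dropwatch> stop

  # alternative: record drop locations system-wide for 10 seconds
  sudo perf record -e skb:kfree_skb -a -- sleep 10
  sudo perf script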
> Ultimately, it seems like the cores on these systems are simply not fast enough to handle 10G routing with the default kernel implementation (since it's generally a single-core operation, the core MHz matters more than you might like).

I also find it strange that top reports a CPU usage of less than 10% (and often less than 5%) while running a test with iperf. If the performance was CPU bound, wouldn't we expect to see at least one core maxed out?
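For what it's worth, these are the views I've been checking; top's %Cpu(s) summary line averages across all cores, so I've also been looking per core and at softirq activity (mpstat is from the sysstat package):

  # per-core utilization once per second; softirq time shows up in the %soft column
  mpstat -P ALL 1

  # or press '1' inside top to expand the per-CPU lines

  # per-CPU softirq counters (the NET_RX / NET_TX rows are the NIC work)
  watch -d 'cat /proc/softirqs'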
4516 nbastin 20 0 10076 1492 856 S 25.5 0.0 0:02.54 iperf3
%Cpu(s): 0.0 us, 0.7 sy, 0.0 ni, 99.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
That's while forwarding 60+Gbps of traffic... If I look at my cloudlab reservation while routing your iperf traffic it's similarly useless:

%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st

What image are you using on the d430? (the kernel in this Centos 7.1 is pretty old - newer ones have some threading optimizations for IP forwarding). Also the d430 has a much newer 10G card that might have a more efficient driver.
> What image are you using on the d430? (the kernel in this Centos 7.1 is pretty old - newer ones have some threading optimizations for IP forwarding). Also the d430 has a much newer 10G card that might have a more efficient driver.

I was testing with the same CentOS 7.1 image, but perhaps the newer NIC explains the difference. The results I showed from the Wisconsin nodes in that post were also after doing a system update, which included a kernel upgrade from 3.10.0-327.10.1 to 3.10.0-514.21.2. I haven't checked the diffs between those two releases, but things may be starting to make sense now.
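For anyone comparing setups, these are probably the relevant bits to record (eth1 is a placeholder interface name):

  uname -r                 # kernel version
  ethtool -i eth1          # NIC driver and firmware version
  lspci | grep -i ethernet # card model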
> That's while forwarding 60+Gbps of traffic... If I look at my cloudlab reservation while routing your iperf traffic it's similarly useless:
>
> %Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st

Ah, yes, this is what I was seeing on the middlebox node. Is there a linux utility to view core usage by the kernel?
>> That's while forwarding 60+Gbps of traffic... If I look at my cloudlab reservation while routing your iperf traffic it's similarly useless:
>>
>> %Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
>
> Ah, yes, this is what I was seeing on the middlebox node. Is there a linux utility to view core usage by the kernel?

Not really in a raw hardware environment (one of the essential uses of a VM environment is the ability to troubleshoot kernel resource usage, since you have an outer environment to inspect it from).
I was trying to understand last night what exactly is consuming CPU but not being captured by programs like top. I noticed that there are a number of ksoftirqd threads consuming CPU on the middlebox while testing throughput, which I believe are handling the transmit and receive interrupts from the NIC.
If I turn off irqbalance and set the interrupt smp affinity for all of the NIC-related interrupts to be a single processor, then only one ksoftirqd process handles all of the work. However, even with these settings, the single running ksoftirqd process is not showing 100% usage for a single core (the maximum I’ve seen is around 50%).
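Roughly what I did, for reference (the IRQ numbers are whatever /proc/interrupts lists for the NIC; eth1 is a placeholder):

  # stop irqbalance so it doesn't rewrite the affinities
  sudo systemctl stop irqbalance

  # find the NIC's IRQ numbers
  grep eth1 /proc/interrupts

  # pin each one to CPU0 (smp_affinity takes a hex CPU bitmask; 1 = CPU0)
  echo 1 | sudo tee /proc/irq/<irq-number>/smp_affinity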
So what is the part that isn't being captured by top? If it's not the interrupt handling, is it the context switches? Or is there something else I'm not thinking of?
I am trying to run a few experiments using the topology below. No vlan tagging is used.

Sender (192.100.0.1) -- (192.100.0.2) Middlebox (192.200.0.2) -- (192.200.0.1) Receiver

I am trying to measure the baseline throughput using iperf3 sending TCP traffic from the sender to the receiver, but I am noticing some highly variable results. For instance, the following two blocks of output were taken from runs within a minute of each other.

Low throughput:

[ 4] local 192.100.0.1 port 44437 connected to 192.200.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 27.00-28.00 sec 864 MBytes 7.25 Gbits/sec 394 460 KBytes
[ 4] 28.00-29.00 sec 874 MBytes 7.33 Gbits/sec 186 546 KBytes
[ 4] 29.00-30.00 sec 916 MBytes 7.69 Gbits/sec 502 444 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-30.00 sec 24.9 GBytes 7.12 Gbits/sec 9087 sender
[ 4] 0.00-30.00 sec 24.8 GBytes 7.11 Gbits/sec receiver
Expected throughput:
[ 4] local 192.100.0.1 port 44422 connected to 192.200.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 27.00-28.00 sec 1.08 GBytes 9.27 Gbits/sec 96 969 KBytes
[ 4] 28.00-29.00 sec 1.09 GBytes 9.35 Gbits/sec 0 1.31 MBytes
[ 4] 29.00-30.00 sec 1.09 GBytes 9.35 Gbits/sec 58 1.13 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-30.00 sec 32.0 GBytes 9.17 Gbits/sec 1373 sender
[ 4] 0.00-30.00 sec 32.0 GBytes 9.17 Gbits/sec receiver
The middlebox isn't under any load during any experiments and is configured to forward traffic using iproute2. No specialized network-related software is installed.
Any idea what could be going on here or steps to further debug? Appreciate any help.
Jeff