Inconsistent throughput between Wisconsin c220g1 nodes


Jeff Helt

Jun 23, 2017, 3:58:42 PM
to cloudlab-users
I am trying to run a few experiments using the topology below. No vlan tagging is used.

Sender (192.100.0.1) -- (192.100.0.2) Middlebox (192.200.0.2) -- (192.200.0.1) Receiver

I am trying to measure the baseline throughput using iperf3 sending TCP traffic from the sender to the receiver, but I am noticing some highly variable results. For instance, the following two blocks of output were taken from runs within a minute of each other.
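
For reference, the test itself is just a plain single-stream iperf3 TCP run, roughly:

# on the receiver
$ iperf3 -s
# on the sender (single TCP stream, 30-second run)
$ iperf3 -c 192.200.0.1 -t 30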

Low throughput:

[  4] local 192.100.0.1 port 44437 connected to 192.200.0.1 port 5201

[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd

[  4]  27.00-28.00  sec   864 MBytes  7.25 Gbits/sec  394    460 KBytes       

[  4]  28.00-29.00  sec   874 MBytes  7.33 Gbits/sec  186    546 KBytes       

[  4]  29.00-30.00  sec   916 MBytes  7.69 Gbits/sec  502    444 KBytes       

- - - - - - - - - - - - - - - - - - - - - - - - -

[ ID] Interval           Transfer     Bandwidth       Retr

[  4]   0.00-30.00  sec  24.9 GBytes  7.12 Gbits/sec  9087             sender

[  4]   0.00-30.00  sec  24.8 GBytes  7.11 Gbits/sec                  receiver


Expected throughput:


[  4] local 192.100.0.1 port 44422 connected to 192.200.0.1 port 5201

[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd

[  4]  27.00-28.00  sec  1.08 GBytes  9.27 Gbits/sec   96    969 KBytes       

[  4]  28.00-29.00  sec  1.09 GBytes  9.35 Gbits/sec    0   1.31 MBytes       

[  4]  29.00-30.00  sec  1.09 GBytes  9.35 Gbits/sec   58   1.13 MBytes       

- - - - - - - - - - - - - - - - - - - - - - - - -

[ ID] Interval           Transfer     Bandwidth       Retr

[  4]   0.00-30.00  sec  32.0 GBytes  9.17 Gbits/sec  1373             sender

[  4]   0.00-30.00  sec  32.0 GBytes  9.17 Gbits/sec                  receiver


The middlebox isn't under any load during any experiments and is configured to forward traffic using iproute2. No specialized network-related software is installed.
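
For completeness, the forwarding setup is just kernel IP forwarding on the middlebox plus a static route on each end host - roughly the following (the interface choices and /24 masks are an approximation of my config):

# on the middlebox: make sure IPv4 forwarding is on (the base image already enables it)
$ sudo sysctl -w net.ipv4.ip_forward=1
# on the sender: reach the receiver's subnet via the middlebox
$ sudo ip route add 192.200.0.0/24 via 192.100.0.2
# on the receiver: reach the sender's subnet via the middlebox
$ sudo ip route add 192.100.0.0/24 via 192.200.0.2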

Any idea what could be going on here or steps to further debug? Appreciate any help.

Jeff

Mike Hibler

Jun 23, 2017, 4:10:41 PM
to Jeff Helt, cloudlab-users
What is the experiment name?

Jeff Helt

Jun 23, 2017, 4:18:02 PM
to cloudlab-users, jmhe...@gmail.com
Apologies for not including it. It is jhelt-QV26723

Nicholas Bastin

Jun 23, 2017, 7:47:02 PM
to Jeff Helt, cloudlab-users
Since I'm just that kind of crazy, I took the wisconsin advertisement and fed it through geni-lib and created a dotfile rendering of the infrastructure topology.  I've saved the rendered version as a PDF in two layouts - the first page is a radial layout that has a nice view of the core but is largely useless for seeing where specific experiment nodes are connected, and the second "page" is a fully-exploded hierarchical view where all node names are clear.  You can find this PDF at:

https://www.dropbox.com/s/g9mu8qrcneb8oik/cloudlab-wisc-ad.pdf?dl=0

Nodes with dotted ovals were reserved at the time I created the diagram (so you can get some idea of possible contention).  Leaf-spine links at Wisconsin are 40G, while node-leaf edges are 10G.  I pruned the "internet" node and all connections to it out of the graph in order to reduce the clutter.  There are still some orphaned nodes and a few other oddities that perhaps someone from Cloudlab can explain.

You can find the full dotfile here for your own rendering pleasure:


--
Nick

Leigh Stoller

Jun 23, 2017, 7:51:08 PM
to Nicholas Bastin, Jeff Helt, cloudlab-users
> Since I'm just that kind of crazy, I took the wisconsin advertisement and
> fed it through geni-lib and created a dotfile rendering of the
> infrastructure topology. I've saved the rendered version as a PDF in two
> layouts - the first page is a radial layout that has a nice view of the
> core but is largely useless for seeing where specific experiment nodes
> are connected, and the second "page" is a fully-exploded hierarchical
> view where all node names are clear. You can find this PDF at:
>
> https://www.dropbox.com/s/g9mu8qrcneb8oik/cloudlab-wisc-ad.pdf?dl=0

Cool picture! Don't suppose you would be willing to provide more details
about how you did this with geni-lib? :-)

Leigh




Nicholas Bastin

Jun 23, 2017, 7:54:50 PM
to Leigh Stoller, Jeff Helt, cloudlab-users
This will end up getting merged into geni-lib, but the very-hacky-basic code in my branch is here:


I had to add interface_ref inspection to AdLink objects, since it wasn't there, but otherwise it's pretty stock.  There's some ghastly URN parsing hack in there that I need to replace with the real URN module.. :-)

--
Nick

Jeff Helt

Jun 23, 2017, 8:23:48 PM
to cloudlab-users, lbst...@gmail.com, jmhe...@gmail.com
Ah, that picture is very cool and helpful! I checked that all of the nodes reserved for my experiment are under the same leaf, so I guess in that case the iperf test is really pushing double the reported traffic through a single leaf. Could that level of traffic (combined with traffic from other experiments) be overloading the leaf?

Jeff

Nicholas Bastin

Jun 23, 2017, 8:47:08 PM
to Jeff Helt, cloudlab-users, Leigh Stoller
On Fri, Jun 23, 2017 at 8:23 PM, Jeff Helt <jmhe...@gmail.com> wrote:
Ah, that picture is very cool and helpful! I checked that all of the nodes reserved for my experiment are under the same leaf, so I guess in that case the iperf test is really pushing double the reported traffic through a single leaf. Could that level of traffic (combined with traffic from other experiments) be overloading the leaf?

I would really not think so, but someone from Wisconsin might have to answer that question.  In general if your nodes are on the same leaf I would expect you to have full bandwidth available regardless of other experiments, as there shouldn't be any internal contention in the switch until your packets try to go to the spine.

If you just do iperf to/from the hosts with the middlebox as the endpoint (instead of forwarding through the middlebox), do you see similar inconsistencies in performance?
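
Something along these lines, so the middlebox is just terminating TCP rather than forwarding it:

# run the server on the middlebox
$ iperf3 -s
# from the sender, target the middlebox's sender-side address
$ iperf3 -c 192.100.0.2 -t 30
# (and likewise from the receiver to 192.200.0.2)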

--
Nick 

Jeff Helt

Jun 23, 2017, 8:53:44 PM
to Nicholas Bastin, Leigh Stoller, cloudlab-users
Right, that makes sense to me.

Nope, I see much more consistent throughput (around 9.4Gbps) between the middlebox and either of the two hosts.

Jeff Helt

Jun 24, 2017, 11:50:50 AM
to cloudlab-users, nick....@gmail.com, lbst...@gmail.com
I just ran a similar setup on the Clemson cluster and observed similar results, so I wonder if the source of the inconsistency lies somewhere in the linux networking stack. Has anyone else tried similar throughput tests and observed the same behavior?

Jeff

Nicholas Bastin

Jun 24, 2017, 12:17:28 PM
to Jeff Helt, cloudlab-users, Leigh Stoller
On Sat, Jun 24, 2017 at 11:50 AM, Jeff Helt <jmhe...@gmail.com> wrote:
I just ran a similar setup on the Clemson cluster and observed similar results, so I wonder if the source of the inconsistency lies somewhere in the linux networking stack. Has anyone else tried similar throughput tests and observed the same behavior?

There's nothing inherently problematic with the stack at these data rates (the linux stack is really inefficient in packet parsing, but 40G TCP is pretty low PPS).  iperf3 is reporting packet loss (well, retransmissions, which we can hope are due to packet loss and not some wacky software bug), which to me says something else is wrong somewhere - either in the NIC parameters, queue configurations, or interrupt tuning (or something else I'm not thinking of off the top of my head).

Are you using both interfaces on the middlebox, or just one?  Have you disabled segmentation offload on the NICs?  (You almost certainly should - otherwise the middlebox may end up reassembling and then breaking up the packets again, which at the very least can have strange effects on jitter).  The non-deterministic nature of the problem makes me inclined to think it's more likely to be related to interrupts and processor affinity for varying tasks, or at least some intermittent background task, but that's still just a stab in the dark.
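
On Linux that's an ethtool toggle per interface - roughly the following, with a placeholder interface name (the exact set of offload features varies by driver):

# show the current offload settings
$ sudo ethtool -k eth1
# turn off TCP segmentation, generic segmentation, and generic receive offload
$ sudo ethtool -K eth1 tso off gso off gro off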

If you run the flow for much longer (say 3-5 minutes instead of 30 seconds), are the results more consistent?

--
Nick

Jeff Helt

Jun 24, 2017, 1:06:28 PM
to cloudlab-users, jmhe...@gmail.com, lbst...@gmail.com
There's nothing inherently problematic with the stack at these data rates (the linux stack is really inefficient in packet parsing, but 40G TCP is pretty low PPS).
 
I see. My background is not in networking, so didn't have a good sense of what a standard configuration can handle.

Are you using both interfaces on the middlebox, or just one?
 
When testing on the Wisconsin cluster, the middlebox uses both interfaces. However, when I tested on the Clemson cluster, I configured the middlebox to use only one (since that's all that's available). The two setups seemed to yield similar results.

Have you disabled segmentation offload on the NICs?  (You almost certainly should - otherwise the middlebox may end up reassembling and then breaking up the packets again, which at the very least can have strange effects on jitter).

I hadn't, but I just did, and the results were more or less the same. In fact, if anything, I am seeing fewer (nearly zero) time segments in which I see the expected/desired throughput of 9+ Gbps after disabling it.
 
If you run the flow for much longer (say 3-5 minutes instead of 30 seconds), are the results more consistent?

I tried this both before and after disabling segmentation offload on the NICs. The results are still consistently lower than expected, around 7 Gbps throughput with many retries.

Jeff

Nicholas Bastin

Jun 24, 2017, 1:57:48 PM
to Jeff Helt, cloudlab-users, Leigh Stoller
On Sat, Jun 24, 2017 at 1:06 PM, Jeff Helt <jmhe...@gmail.com> wrote:
There's nothing inherently problematic with the stack at these data rates (the linux stack is really inefficient in packet parsing, but 40G TCP is pretty low PPS).
 
I see. My background is not in networking, so didn't have a good sense of what a standard configuration can handle.

I'm trying to understand your topology - we don't usually use kernel *routing*, so maybe it's particularly poor (we stick generally to bridging).  (The question of forwarding throughput is basically one of memory bandwidth divided by the number of memory copies the kernel decides to do for certain operations).  You just have 3 nodes, and the middlebox has ipv4 forwarding enabled, and just has routes in the host route table for each network?  Or are you using some other feature of iproute2 to do the forwarding?
 
If you run the flow for much longer (say 3-5 minutes instead of 30 seconds), are the results more consistent?

I tried this both before and after disabling segmentation offload on the NICs. The results are still consistently lower than expected, around 7 Gbps throughput with many retries.

I'm going to run a few tests to try to get a baseline on basic kernel routing performance, hopefully in the next hour or so, which may shed some light on the performance differences in the types of forwarding the kernel can do.

--
Nick 

Jeff Helt

Jun 24, 2017, 2:08:53 PM
to cloudlab-users, jmhe...@gmail.com, lbst...@gmail.com


On Saturday, June 24, 2017 at 1:57:48 PM UTC-4, Nicholas Bastin wrote:
On Sat, Jun 24, 2017 at 1:06 PM, Jeff Helt <jmhe...@gmail.com> wrote:
There's nothing inherently problematic with the stack at these data rates (the linux stack is really inefficient in packet parsing, but 40G TCP is pretty low PPS).
 
I see. My background is not in networking, so didn't have a good sense of what a standard configuration can handle.

I'm trying to understand your topology - we don't usually use kernel *routing*, so maybe it's particularly poor (we stick generally to bridging).  (The question of forwarding throughput is basically one of memory bandwidth divided by the number of memory copies the kernel decides to do for certain operations).  You just have 3 nodes, and the middlebox has ipv4 forwarding enabled, and just has routes in the host route table for each network?  Or are you using some other feature of iproute2 to do the forwarding?

Yes, that's correct. I believe I've made the cloudlab profile public. It's called "dataplane-docker" (ignore the docker part, I haven't gotten there yet) and a link is here: https://www.cloudlab.us/p/PSI/dataplane-docker/9.

Cloudlab automatically configures routes and adds hosts-file entries on the sender and receiver for each other, and the base CentOS box has ip forwarding enabled by default, so you should be able to replicate this by simply running iperf3 -c receiver on the client after starting the iperf3 server on the receiver.

I'm going to run a few tests to try to get a baseline on basic kernel routing performance, hopefully in the next hour or so, which may shed some light on the performance differences in the types of forwarding the kernel can do.

Amazing, thank you. Let me know if you have any questions or uncover something interesting. I will continue to do the same.

Jeff

Mike Hibler

Jun 24, 2017, 2:16:50 PM
to Jeff Helt, cloudlab-users, lbst...@gmail.com
Another data point would be the Emulab "d430" nodes. They have multiple 10Gb
interfaces and are all connected to a single core switch.


Nicholas Bastin

Jun 24, 2017, 3:53:36 PM
to Jeff Helt, cloudlab-users, Leigh Stoller
It's definitely a packet handling performance problem on the middlebox host - if you use a bridge you can do slightly better (7.73Gbps versus 6.64Gbps for routing on my sliver), but it's still not optimal.  If you set the MTU to 9000 you can max out the connection without any problems at all, as that obviously reduces the packet load by a significant factor on the middlebox host (in both the bridging and routing cases).
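
(That just means something like the following on every interface in the path - sender, both middlebox ports, and receiver - with a placeholder interface name:)

$ sudo ip link set dev eth1 mtu 9000
# verify the new MTU took effect
$ ip link show eth1 | grep mtu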

Ultimately, it seems like the cores on these systems are simply not fast enough to handle 10G routing with the default kernel implementation (since it's generally a single-core operation, the core Mhz matters more than you might like).  That being said, there's plenty of inefficiencies in the kernel implementation and the machine is definitely capable of forwarding at line rate if you don't use the kernel facility.

--
Nick

Jeff Helt

Jun 24, 2017, 4:38:59 PM
to cloudlab-users, jmhe...@gmail.com, lbst...@gmail.com


On Saturday, June 24, 2017 at 3:53:36 PM UTC-4, Nicholas Bastin wrote:
It's definitely a packet handling performance problem on the middlebox host - if you use a bridge you can do slightly better (7.73Gbps versus 6.64Gbps for routing on my sliver), but it's still not optimal.  If you set the MTU to 9000 you can max out the connection without any problems at all, as that obviously reduces the packet load by a significant factor on the middlebox host (in both the bridging and routing cases).

Strange because now I am seeing around 9.3-9.4 Gbps throughput on the Wisconsin and Emulab "d430" nodes but with occasional significant drops, such as in the output below.

[  4] 109.00-110.00 sec  1.09 GBytes  9.38 Gbits/sec    0   1.26 MBytes       
[  4] 110.00-111.00 sec  1.09 GBytes  9.37 Gbits/sec    0   1.57 MBytes       
[  4] 111.00-112.00 sec   894 MBytes  7.50 Gbits/sec  284    775 KBytes       
[  4] 112.00-113.00 sec  1.09 GBytes  9.40 Gbits/sec    0   1.02 MBytes       
[  4] 113.00-114.00 sec  1.09 GBytes  9.38 Gbits/sec    0   1.26 MBytes       
[  4] 114.00-115.00 sec  1.09 GBytes  9.35 Gbits/sec   11   1.21 MBytes       
[  4] 115.00-116.00 sec  1.09 GBytes  9.38 Gbits/sec    0   1.50 MBytes       
[  4] 116.00-117.00 sec   660 MBytes  5.54 Gbits/sec  271    962 KBytes       
[  4] 117.00-118.00 sec  1.09 GBytes  9.38 Gbits/sec    0   1.34 MBytes

I tried correlating the output of dropwatch running on the middlebox with the drops, but I haven't been able to draw any conclusions yet. Perhaps there are some background processes running that I am not properly controlling for.

Ultimately, it seems like the cores on these systems are simply not fast enough to handle 10G routing with the default kernel implementation (since it's generally a single-core operation, the core Mhz matters more than you might like).
 
I also find it strange that top reports a CPU usage of less than 10% (and often less than 5%) while running a test with iperf. If the performance was CPU bound, wouldn't we expect to see at least one core maxed out?

Jeff  

Nicholas Bastin

Jun 24, 2017, 4:40:08 PM
to Jeff Helt, cloudlab-users, Leigh Stoller
I set up some comparable experiments on a VTS site with unbounded interface speeds, just to get an idea of the raw CPU speed possible on something like this.  The device in question has an E5-...@2.10Ghz, so a little older and slower than the wisconsin CPUs, but largely the same architecture.

In the first topology I took two simple hosts connected via an OVS bridge:

[  3]  0.0-30.0 sec  68.2 GBytes  19.5 Gbits/sec
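
(Just a plain OVS bridge with one port per host - roughly the following, with placeholder port names:)

$ sudo ovs-vsctl add-br br0
$ sudo ovs-vsctl add-port br0 host1-eth
$ sudo ovs-vsctl add-port br0 host2-eth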


In the second topology I took those hosts and connected them to a network namespace acting as a linux kernel router instead of a bridge:

[  3]  0.0-30.0 sec  62.7 GBytes  17.9 Gbits/sec
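
(Roughly the following shape, with one veth pair per host link - the names are placeholders and the addresses just mirror Jeff's topology:)

# namespace that plays the router role
$ sudo ip netns add router
# one veth pair per host, with one end moved into the namespace
$ sudo ip link add h1 type veth peer name r-h1
$ sudo ip link set r-h1 netns router
$ sudo ip netns exec router ip addr add 192.100.0.2/24 dev r-h1
$ sudo ip netns exec router ip link set r-h1 up
# (second host attached the same way on 192.200.0.0/24)
# enable forwarding inside the namespace
$ sudo ip netns exec router sysctl -w net.ipv4.ip_forward=1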


I did these runs a number of times, and they all land roughly around the same numbers - routing in the kernel is definitely more expensive than bridging (unsurprisingly), and this is similar to the performance difference we see on the Cloudlab nodes as well.

If you want a high-throughput single TCP flow using the kernel forwarding on this hardware, it seems that you'll have to stick to a larger MTU.  I suspect that you could trivially forward at the full 10G using pcap on the middlebox, but I haven't had a chance to test that.  Barring that, the 82599ES supports both DPDK and netmap, and you can get close to line-rate forwarding (even with tiny frames) with those.

--
Nick

Nicholas Bastin

Jun 24, 2017, 4:53:28 PM
to Jeff Helt, cloudlab-users, Leigh Stoller
On Sat, Jun 24, 2017 at 4:38 PM, Jeff Helt <jmhe...@gmail.com> wrote:
Strange because now I am seeing around 9.3-9.4 Gbps throughput on the Wisconsin and Emulab "d430" nodes but with occasional significant drops, such as in the output below.

[  4] 109.00-110.00 sec  1.09 GBytes  9.38 Gbits/sec    0   1.26 MBytes       
[  4] 110.00-111.00 sec  1.09 GBytes  9.37 Gbits/sec    0   1.57 MBytes       
[  4] 111.00-112.00 sec   894 MBytes  7.50 Gbits/sec  284    775 KBytes       
[  4] 112.00-113.00 sec  1.09 GBytes  9.40 Gbits/sec    0   1.02 MBytes       
[  4] 113.00-114.00 sec  1.09 GBytes  9.38 Gbits/sec    0   1.26 MBytes       
[  4] 114.00-115.00 sec  1.09 GBytes  9.35 Gbits/sec   11   1.21 MBytes       
[  4] 115.00-116.00 sec  1.09 GBytes  9.38 Gbits/sec    0   1.50 MBytes       
[  4] 116.00-117.00 sec   660 MBytes  5.54 Gbits/sec  271    962 KBytes       
[  4] 117.00-118.00 sec  1.09 GBytes  9.38 Gbits/sec    0   1.34 MBytes

I tried correlating the output of dropwatch running on the middlebox with the drops, but I haven't been able to draw any conclusions yet. Perhaps there are some background processes running that I am not properly controlling for.

What image are you using on the d430?  (the kernel in this Centos 7.1 is pretty old - newer ones have some threading optimizations for IP forwarding).  Also the d430 has a much newer 10G card that might have a more efficient driver.
 
Ultimately, it seems like the cores on these systems are simply not fast enough to handle 10G routing with the default kernel implementation (since it's generally a single-core operation, the core Mhz matters more than you might like).
 
I also find it strange that top reports a CPU usage of less than 10% (and often less than 5%) while running a test with iperf. If the performance was CPU bound, wouldn't we expect to see at least one core maxed out?

The client/server nodes are not CPU bound - their job is pretty trivial, really.  Of course also note that total % CPU is divided by the number of CPUs (so a process using 30% of one CPU will still register as less than 1% total CPU usage).  If I run a 4-thread iperf on Cloudlab I still don't get much (24-26% across the run duration):

 4516 nbastin   20   0   10076   1492    856 S  25.5  0.0   0:02.54 iperf3                                                                                                    


for the process, and of course the system is still just running 0.7%:

%Cpu(s):  0.0 us,  0.7 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st


The middlebox node is CPU bound on a single core (for most of the operation - some few DMA copies are done in separate kthreads), but that is kernel time that isn't accounted for in CPU usage metrics.  Our VTS boxes very rarely see any CPU usage at all in things like top:

%Cpu(s):  0.1 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

That's while forwarding 60+Gbps of traffic...  If I look at my cloudlab reservation while routing your iperf traffic it's similarly useless:

%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st


--
Nick

Jeffrey Helt

Jun 24, 2017, 5:16:59 PM
to Nicholas Bastin, cloudlab-users, Leigh Stoller

What image are you using on the d430?  (the kernel in this Centos 7.1 is pretty old - newer ones have some threading optimizations for IP forwarding).  Also the d430 has a much newer 10G card that might have a more efficient driver.

I was testing with the same CentOS 7.1 image, but perhaps the newer NIC explains the difference. The results I showed from the Wisconsin nodes in that post were also after doing a system update, which included a kernel upgrade from 3.10.0-327.10.1 to 3.10.0-514.21.2. I haven’t checked the diffs between those two releases, but things may be starting to make sense now.

That's while forwarding 60+Gbps of traffic...  If I look at my cloudlab reservation while routing your iperf traffic it's similarly useless:

%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st

Ah, yes, this is what I was seeing on the middlebox node. Is there a linux utility to view core usage by the kernel?

Thanks for all of your help in figuring this out!

Jeff

 

Leigh Stoller

Jun 24, 2017, 5:26:13 PM
to Jeffrey Helt, Nicholas Bastin, cloudlab-users
Excellent thread. I will pin this so that we can refer others to it in the future.

Leigh 

--
Sent from my mobile device.

Eric Eide

Jun 24, 2017, 5:36:43 PM
to Jeffrey Helt, CloudLab Users
Jeffrey Helt <jmhe...@gmail.com> writes:

> I was testing with the same CentOS 7.1 image, but perhaps the newer NIC
> explains the difference.

Let me mention that the CENTOS7-64-STD disk image is newer, CentOS 7.3.

I don't know if it will be better for you, but surely it is newer!

Eric.

--
-------------------------------------------------------------------------------
Eric Eide <ee...@cs.utah.edu> . University of Utah School of Computing
http://www.cs.utah.edu/~eeide/ . +1 (801) 585-5512 voice, +1 (801) 581-5843 FAX

Jeff Helt

Jun 24, 2017, 5:57:08 PM
to cloudlab-users, jmhe...@gmail.com
Thanks for the tip! That image isn't listed in the Cloudlab menu. Slightly off topic, but where can I see the available images? This link on the Utah site seems to be broken but was the only one I could find.

Thanks again everyone!

Jeff

Eric Eide

Jun 24, 2017, 6:26:34 PM
to Jeff Helt, cloudlab-users
Jeff Helt <jmhe...@gmail.com> writes:

> Thanks for the tip! That image isn't listed in the Cloudlab menu.

Yeah... sorry :-(. I reported an issue about this in our issue-tracking
system, so this should be fixed soon.

Nicholas Bastin

Jun 24, 2017, 9:07:01 PM
to Jeffrey Helt, cloudlab-users, Leigh Stoller
On Sat, Jun 24, 2017 at 5:16 PM, Jeffrey Helt <jmhe...@gmail.com> wrote:

What image are you using on the d430?  (the kernel in this Centos 7.1 is pretty old - newer ones have some threading optimizations for IP forwarding).  Also the d430 has a much newer 10G card that might have a more efficient driver.

I was testing with the same CentOS 7.1 image, but perhaps the newer NIC explains the difference. The results I showed from the Wisconsin nodes in that post were also after doing a system update, which included a kernel upgrade from 3.10.0-327.10.1 to 3.10.0-514.21.2. I haven’t checked the diffs between those two releases, but things may be starting to make sense now.

If you want a tool that can capture a bunch of system information such that you can compare between the two (esp. after running), you can try our "uhgetconf" utility, which grabs a ton of useful system info and writes it into a JSON file:


(You just need python to run it, which your nodes have, but if you yum install pciutils first you'll get more information in the output about the PCI topology).  This can give you useful info between systems, but also between software updates on the same system (even if just a kernel, you can see changes to ioports and modules that might be relevant and easy to refer to later).  You need to run it with root privileges to get maximum data.

We catalogue these files for all of our experiments (at start, after each run, and right before we destroy it) so we can answer questions about a system that may not be running any longer when we're looking at the result data (probably not relevant here, but it can save you a lot of experiment re-setups at a later date just to answer simple questions that come up later).
That's while forwarding 60+Gbps of traffic...  If I look at my cloudlab reservation while routing your iperf traffic it's similarly useless:

%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st

Ah, yes, this is what I was seeing on the middlebox node. Is there a linux utility to view core usage by the kernel?

Not really in a raw hardware environment (one of the essential uses of a VM environment is the ability to troubleshoot kernel resource usage, since you have an outer environment to inspect it from).  One thing you can do is compare /proc/interrupts between runs (uhgetconf captures this information as well) and see whether anything you're doing is triggering an unusual number of interrupts, and whether they are balanced evenly across cores and sockets (whether you want an even or uneven balance depends on a lot of factors, but collecting the information is a place to start).
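
(A quick manual version of that is to snapshot it around a run and diff, e.g.:)

$ cat /proc/interrupts > /tmp/irq.before
# ... run the iperf3 test ...
$ cat /proc/interrupts > /tmp/irq.after
# columns are per-CPU counts, so large deltas concentrated in one column
# mean a single core is absorbing most of the NIC interrupts
$ diff /tmp/irq.before /tmp/irq.after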

You can also rebuild the kernel with finer grained *_TICKCOUNT_* config options for a variety of modules, but that additional accounting comes at the expense of performance (of course.. :-)).

You could also first try to update to a far more recent kernel, where the internal network forwarding modules support using more cores.  You'll have to get into the 4.4+ tree to get most of the updates that would matter, and I'm not sure how that is managed in CentOS.

--
Nick

Jeffrey Helt

Jun 25, 2017, 9:16:02 AM
to Nicholas Bastin, cloudlab-users, Leigh Stoller
That's while forwarding 60+Gbps of traffic...  If I look at my cloudlab reservation while routing your iperf traffic it's similarly useless:

%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st

Ah, yes, this is what I was seeing on the middlebox node. Is there a linux utility to view core usage by the kernel?

Not really in a raw hardware environment (one of the essential uses of a VM environment is the ability to troubleshoot kernel resource usage, since you have an outer environment to inspect it from).

I was trying to understand last night what exactly is consuming CPU but not being captured by programs like top. I noticed that there are a number of ksoftirqd threads consuming CPU on the middlebox while testing throughput, which I believe are handling the transmit and receive interrupts from the NIC. If I turn off irqbalance and set the interrupt smp affinity for all of the NIC-related interrupts to be a single processor, then only one ksoftirqd process handles all of the work. However, even with these settings, the single running ksoftirqd process is not showing 100% usage for a single core (the maximum I’ve seen is around 50%).
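
For reference, I did that roughly as follows (the IRQ number is just an example - the real ones come from whatever /proc/interrupts lists for the NIC queues):

# stop the daemon that rebalances interrupts across cores
$ sudo systemctl stop irqbalance
# pin a NIC queue's IRQ (e.g. IRQ 90) to CPU 0; the value is a hex CPU bitmask
$ echo 1 | sudo tee /proc/irq/90/smp_affinity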

So what is the part that isn’t being captured by top? If it’s not the interrupt handling, is it the context switches? Or is there something else I’m not thinking of?

Jeff

Nicholas Bastin

Jun 25, 2017, 9:33:03 AM
to Jeffrey Helt, cloudlab-users, Leigh Stoller
On Sun, Jun 25, 2017 at 9:15 AM, Jeffrey Helt <jmhe...@gmail.com> wrote:
I was trying to understand last night what exactly is consuming CPU but not being captured by programs like top. I noticed that there are a number of ksoftirqd threads consuming CPU on the middlebox while testing throughput, which I believe are handling the transmit and receive interrupts from the NIC.

That is the case.  (And other interrupts, but in your case 99%+ of them will be from the NIC).
 
If I turn off irqbalance and set the interrupt smp affinity for all of the NIC-related interrupts to be a single processor, then only one ksoftirqd process handles all of the work. However, even with these settings, the single running ksoftirqd process is not showing 100% usage for a single core (the maximum I’ve seen is around 50%).

So what is the part that isn’t being captured by top? If it’s not the interrupt handling, is it the context switches? Or is there something else I’m not thinking of?

All the memory copies within the kernel (e.g. packet parsing into deep sk_buff structures), and the actual act of constructing new packets for routing purposes (the extra efficiency from bridging is that there's no need to recreate a new layer 2 header - TTL rewrite is probably inconsequential since the kernel parses the field anyhow).  I'm a little surprised you see as much as 50% - that does actually lend some credence to the idea that the newer chipset on the d430 is driving some efficiency, as it has better interrupt coalescing than the 82599 does.

One of the primary benefits of using something like DPDK or netmap (on top of not doing unnecessary packet parsing) is that they batch process packets from the NIC, reducing interrupt load by as much as an order of magnitude.

--
Nick

Michael Blodgett

Jun 27, 2017, 2:15:37 PM
to Jeff Helt, cloudlab-users
Catching up on this weekend's mail - in the past I've done a lot more work with the CentOS6/Intel X520 combination, so I wanted to verify some numbers there.  For your three-node topology, with the CentOS6/Intel X520 combination and the default settings, your bottleneck is going to be a CPU on the iperf3 node that is receiving the test traffic.  The default rx ring is 512, and occasionally incoming packets will get dropped for lack of ring buffer space; you'll note the rx_missed_errors and the attempts at flow control, the latter being on by default in the OS but disabled at the switch.

[  4]  63.00-64.00  sec  1.10 GBytes  9.42 Gbits/sec    0   2.97 MBytes
[  4]  64.00-65.00  sec  1.10 GBytes  9.41 Gbits/sec   94   2.34 MBytes
[  4]  65.00-66.00  sec  1.10 GBytes  9.41 Gbits/sec    0   2.50 MBytes
[  4]  66.00-67.00  sec  1.10 GBytes  9.42 Gbits/sec    0   2.61 MBytes
[  4]  67.00-68.00  sec  1.09 GBytes  9.41 Gbits/sec    0   2.68 MBytes
[  4]  68.00-69.00  sec  1.10 GBytes  9.42 Gbits/sec    0   2.85 MBytes
[  4]  69.00-70.00  sec  1.10 GBytes  9.42 Gbits/sec    0   2.85 MBytes
[  4]  70.00-71.00  sec  1.10 GBytes  9.42 Gbits/sec    0   2.85 MBytes
[  4]  71.00-72.00  sec  1.10 GBytes  9.42 Gbits/sec    0   2.85 MBytes
[  4]  72.00-73.00  sec  1.09 GBytes  9.39 Gbits/sec  170   2.13 MBytes
[  4]  73.00-74.00  sec  1.10 GBytes  9.42 Gbits/sec    0   2.26 MBytes
[  4]  74.00-75.00  sec  1.09 GBytes  9.41 Gbits/sec    0   2.36 MBytes
[  4]  75.00-76.00  sec  1.10 GBytes  9.42 Gbits/sec    0   2.43 MBytes
[  4]  76.00-77.00  sec  1.10 GBytes  9.42 Gbits/sec    0   2.49 MBytes

mblodget@node0:~ % sudo /sbin/ethtool -S eth2 | egrep "rx_missed|flow"
     fdir_overflow: 0
     rx_missed_errors: 6490
     tx_flow_control_xon: 79
     rx_flow_control_xon: 0
     tx_flow_control_xoff: 160
     rx_flow_control_xoff: 0

Increase the rx_ring with something like 'sudo /sbin/ethtool -G eth2 rx 2048'  and you should run without dropping packets.  I'll take a look with CENTOS7-64-STD but wanted to get this datapoint out there. 
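
(For reference, the matching query flag shows both the hardware maximums and the currently configured ring sizes, so you can confirm the change took:)

$ sudo /sbin/ethtool -g eth2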

Mike



Jeff Helt

Jun 29, 2017, 11:59:33 AM
to cloudlab-users, jmhe...@gmail.com
I ended up switching to the Ubuntu 16 image for my tests because of its much newer kernel (4.4), so I haven't tried this running CentOS 7. However, with the newer software, the default rx ring size of 512 is not the source of the throughput variability. For instance, even when observing a large number of retries and reduced throughput on the path from sender to receiver, I see:

$ ethtool -S enp129s0f0 | egrep "rx_missed|flow"
     fdir_overflow: 0
     rx_missed_errors: 0
     tx_flow_control_xon: 0
     rx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_flow_control_xoff: 0

Did you ever end up trying this on CentOS 7?
