Achieved 10Gbit/s bidirectional routing

Jesper Dangaard Brouer

Jul 15, 2009, 1:20:20 PM

I'm giving a talk at LinuxCon, about 10Gbit/s routing on standard
hardware running Linux.

http://linuxcon.linuxfoundation.org/meetings/1585
https://events.linuxfoundation.org/lc09o17

I'm getting some really good 10Gbit/s bidirectional routing results
with Intel's latest 82599 chip. (I got two pre-release engineering
samples directly from Intel; thanks, Peter.)

The setup is a Core i7-920 with the memory tuned according to the RAM's
X.M.P. settings (DDR3-1600MHz); note that this also increases the QPI to
6.4GT/s. (Motherboard: ASUS P6T6 WS Revolution.)

With big 1514-byte packets, I can basically do 10Gbit/s wirespeed
bidirectional routing.

Note that bidirectional routing means that we actually have to move
approx. 40Gbit/s through memory and in and out of the interfaces.
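
As a rough cross-check of these figures, the sketch below computes the
theoretical wire-speed packet rate for 1514-byte frames on a 10Gbit/s
link. It assumes the standard Ethernet per-frame overhead (4-byte FCS,
8-byte preamble/SFD, 12-byte inter-frame gap) and that the 1514 bytes
exclude the FCS. The function names are illustrative only.

```python
# Assumed Ethernet per-frame overhead on the wire, in bytes.
FCS = 4            # frame check sequence
PREAMBLE_SFD = 8   # preamble + start-of-frame delimiter
IFG = 12           # minimum inter-frame gap


def wire_bytes(frame_len):
    """Bytes one frame occupies on the wire, given its length without FCS."""
    return frame_len + FCS + PREAMBLE_SFD + IFG


def max_pps(link_bps, frame_len):
    """Theoretical packets per second at full line rate."""
    return link_bps / (wire_bytes(frame_len) * 8)


def max_frame_gbps(link_bps, frame_len):
    """Line-rate throughput counting only frame bytes (no FCS/preamble/IFG)."""
    return max_pps(link_bps, frame_len) * frame_len * 8 / 1e9


if __name__ == "__main__":
    link = 10e9
    per_dir = max_frame_gbps(link, 1514)
    print(f"{max_pps(link, 1514):,.0f} packets/s per direction per port")
    print(f"{per_dir:.2f} Gbit/s of frame data per direction per port")
    # Two ports, each receiving and transmitting at line rate.
    print(f"{4 * per_dir:.2f} Gbit/s upper bound for the sum of all four counters")
```

Whether a given monitoring tool counts the FCS or not shifts the bound
slightly, so treat the result as an approximate ceiling rather than an
exact target.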

Formatted quick view using 'ifstat -b'

 eth31-in   eth31-out    eth32-in   eth32-out
     9.57 +      9.52 +      9.51 +      9.60 = 38.20 Gbit/s
     9.60 +      9.55 +      9.52 +      9.62 = 38.29 Gbit/s
     9.61 +      9.53 +      9.52 +      9.62 = 38.28 Gbit/s
     9.61 +      9.53 +      9.54 +      9.62 = 38.30 Gbit/s

[Adding an extra NIC]

Another observation is that I'm hitting some kind of bottleneck on the
PCI-express switch. Adding an extra NIC in a PCIe slot connected to
the same PCIe switch does not scale beyond 40Gbit/s collective
throughput.

But I happen to have a special motherboard, the ASUS P6T6 WS Revolution,
which has an additional PCIe switch chip, NVIDIA's NF200.

Connecting two dual port 10GbE NICs via two different PCI-express
switch chips, makes things scale again! I have achieved a collective
throughput of 66.25 Gbit/s. This result is also affected by the fact
that my pktgen machines cannot keep up, and I'm getting closer to the
memory bandwidth limits.

FYI: I found a really good reference explaining the PCI-express
architecture, written by Intel:

http://download.intel.com/design/intarch/papers/321071.pdf

I'm not sure how to explain the PCI-express chip bottleneck I'm
seeing, but my guess is that I'm limited by the number of outstanding
packets/DMA-transfers and the latency for the DMA operations.

Does anyone have datasheets on the X58 and NVIDIA's NF200 PCI-express
chips, that can tell me the number of outstanding transfers they
support?

--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network developer
Cand. Scient Datalog / MSc.
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer


Bill Fink

Jul 15, 2009, 11:30:11 PM

We've achieved 70 Gbps aggregate unidirectional TCP performance from
one P6T6 based system to another. We figured out in our case that
we were being limited by the interconnect between the Intel X58 and
Nvidia NF200 chips. The first 2 PCIe 2.0 slots are directly off the
Intel X58 and get the full 40 Gbps throughput from the dual-port
Myricom 10-GigE NICs we have installed in them. But the other
3 PCIe 2.0 slots are on the Nvidia NF200 chip, and I discovered
through googling that the link between the X58 and NF200 chips
only operates at PCIe x16 _1.0_ speed, which limits the possible
aggregate throughput of the last 3 PCIe 2.0 slots to only 32 Gbps.
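
The 32 Gbps figure follows from the PCIe link parameters. Here is a
minimal sketch of that arithmetic, assuming the published per-lane
signalling rates (2.5 GT/s for PCIe 1.x, 5 GT/s for PCIe 2.0) and the
8b/10b line encoding both generations use. It ignores packet and
protocol overhead, so achievable throughput is lower than these numbers.
The helper name is illustrative.

```python
# Per-lane signalling rate in GT/s; PCIe 1.x and 2.0 both use 8b/10b encoding.
GT_PER_LANE = {"1.0": 2.5, "2.0": 5.0}
ENCODING_EFFICIENCY = 8 / 10


def pcie_gbps(gen, lanes):
    """Raw data rate per direction in Gbit/s, before protocol overhead."""
    return GT_PER_LANE[gen] * ENCODING_EFFICIENCY * lanes


if __name__ == "__main__":
    print("x16 PCIe 1.0:", pcie_gbps("1.0", 16), "Gbit/s per direction")
    print("x16 PCIe 2.0:", pcie_gbps("2.0", 16), "Gbit/s per direction")
    print("x8  PCIe 2.0:", pcie_gbps("2.0", 8), "Gbit/s per direction")
```

An x16 link at 1.0 speed therefore carries no more than a single x8 link
at 2.0 speed, which is why several dual-port 10-GigE NICs behind it
cannot all run at line rate.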

This was clearly seen in our nuttcp testing:

[root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -xc3/3 -p5008 192.168.8.11
n2: 11505.2648 MB / 10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT
n3: 11727.4489 MB / 10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT
n4: 11770.1250 MB / 10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT
n5: 11837.9320 MB / 10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT
n6: 9096.8125 MB / 10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT
n7: 9100.1211 MB / 10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT
n8: 9095.6179 MB / 10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT
n9: 9075.5472 MB / 10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT

This used 4 dual-port Myricom 10-GigE NICs. We also tested with
a fifth dual-port 10-GigE NIC, but the aggregate throughput stayed
at about 70 Gbps, due to the performance bottleneck between the
X58 and NF200 chips.
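
To make the split visible, the per-stream rates can be summed by group.
The sketch below parses the nuttcp summary lines from the run above. The
grouping of streams n2 to n5 onto the X58 slots and n6 to n9 onto the
slots behind the Nvidia switch chip is an assumption inferred from the
rates, not something the output itself states.

```python
import re

# nuttcp summary lines copied from the run above.
OUTPUT = """\
n2: 11505.2648 MB / 10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT
n3: 11727.4489 MB / 10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT
n4: 11770.1250 MB / 10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT
n5: 11837.9320 MB / 10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT
n6: 9096.8125 MB / 10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT
n7: 9100.1211 MB / 10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT
n8: 9095.6179 MB / 10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT
n9: 9075.5472 MB / 10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT
"""

LINE_RE = re.compile(r"^(n\d+):\s+[\d.]+ MB /\s+[\d.]+ sec =\s+([\d.]+) Mbps")


def parse_mbps(text):
    """Return {stream_id: Mbps} for every nuttcp summary line in text."""
    rates = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rates[match.group(1)] = float(match.group(2))
    return rates


def total_gbps(rates, streams):
    """Sum the given streams and convert Mbps to Gbps."""
    return sum(rates[s] for s in streams) / 1000.0


if __name__ == "__main__":
    rates = parse_mbps(OUTPUT)
    x58 = ["n2", "n3", "n4", "n5"]     # assumed: NICs in the two X58 slots
    switch = ["n6", "n7", "n8", "n9"]  # assumed: NICs behind the Nvidia chip
    print(f"X58 slots:    {total_gbps(rates, x58):.2f} Gbps")
    print(f"switch slots: {total_gbps(rates, switch):.2f} Gbps")
    print(f"aggregate:    {total_gbps(rates, list(rates)):.2f} Gbps")
```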

-Bill

Jesper Dangaard Brouer

Jul 16, 2009, 5:40:13 AM

Correcting myself, according to Bill's info below.

It does not scale when adding an extra NIC to the same NVIDIA NF200 PCIe
switch chip (the reason is explained below by Bill).

This definitely explains the bottlenecks I have seen! Thanks!

Yes, it seems to scale when installing the two NICs in the first two
slots, both connected to the X58. By overclocking the RAM and CPU a
bit, I can match my pktgen machines' speed, which gives a collective
throughput of 67.95 Gbit/s.

   eth33         eth34         eth31         eth32
in     out    in     out    in     out    in     out
7.54 + 9.58 + 9.56 + 7.56 + 7.33 + 9.53 + 9.50 + 7.35 = 67.95 Gbit/s

Now I just need a faster generator machine, to find the next bottleneck ;-)

> This was clearly seen in our nuttcp testing:
>
> [root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -xc3/3 -p5008 192.168.8.11
> n2: 11505.2648 MB / 10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT
> n3: 11727.4489 MB / 10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT
> n4: 11770.1250 MB / 10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT
> n5: 11837.9320 MB / 10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT
> n6: 9096.8125 MB / 10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT
> n7: 9100.1211 MB / 10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT
> n8: 9095.6179 MB / 10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT
> n9: 9075.5472 MB / 10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT
>
> This used 4 dual-port Myricom 10-GigE NICs. We also tested with
> a fifth dual-port 10-GigE NIC, but the aggregate throughput stayed
> at about 70 Gbps, due to the performance bottleneck between the
> X58 and NF200 chips.

These are also excellent results!

Thanks a lot Bill !!!

--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network developer
Cand. Scient Datalog / MSc.
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer


Bill Fink

Jul 16, 2009, 11:40:13 AM

We also achieved nearly 80 Gbps in bidirectional TCP tests (40 Gbps
simultaneously in each direction):

[root@i7raid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & ./nuttcp-6.2.6 -In3 -r -xc0/0 -p5002 192.168.2.11 & ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & ./nuttcp-6.2.6 -In5 -r -xc1/1 -p5004 192.168.4.11 & ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & ./nuttcp-6.2.6 -In7 -r -xc2/2 -p5006 192.168.6.11 & ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & ./nuttcp-6.2.6 -In9 -r -xc3/3 -p5008 192.168.8.11
n2: 11542.6250 MB / 10.07 sec = 9619.9920 Mbps 44 %TX 51 %RX 0 retrans 0.12 msRTT
n3: 11543.7143 MB / 10.06 sec = 9622.2153 Mbps 41 %TX 49 %RX 0 retrans 0.15 msRTT
n4: 11622.8125 MB / 10.05 sec = 9701.0296 Mbps 43 %TX 51 %RX 0 retrans 0.10 msRTT
n5: 11523.6875 MB / 10.03 sec = 9638.8883 Mbps 43 %TX 50 %RX 0 retrans 0.15 msRTT
n6: 11608.0141 MB / 10.04 sec = 9695.7388 Mbps 43 %TX 50 %RX 0 retrans 0.10 msRTT
n7: 11580.1250 MB / 10.04 sec = 9679.3910 Mbps 43 %TX 50 %RX 0 retrans 0.13 msRTT
n8: 11608.0000 MB / 10.06 sec = 9678.7596 Mbps 42 %TX 50 %RX 0 retrans 0.10 msRTT
n9: 11553.3750 MB / 10.05 sec = 9643.7296 Mbps 45 %TX 50 %RX 0 retrans 0.11 msRTT

This was using 2 dual-port 10-GigE NICs in the first two PCIe 2.0 slots.
We are using an Intel i7 965 quad-core 3.2 GHz Nehalem processor
(overclocked to 3.4 GHz) and 2000 MHz DDR3 memory. Adding an additional
dual-port 10-GigE NIC on the Nvidia NF200 chip does only marginally
better, as it appears we are basically CPU limited at this point for
this test (the sum of the TX and RX CPU utilization for each pair of
10-GigE interfaces is about 93%).
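
The 93% figure can be reproduced from the output above. The sketch
below assumes that each pair of streams pinned to the same CPU (via the
-xc option) consists of one transmitting stream and one receiving stream
(the one started with -r), so the local load per pair is the %TX of the
first plus the %RX of the second. That pairing is an interpretation of
the command line, not something nuttcp reports directly.

```python
import re

# nuttcp summary lines copied from the bidirectional run above.
OUTPUT = """\
n2: 11542.6250 MB / 10.07 sec = 9619.9920 Mbps 44 %TX 51 %RX 0 retrans 0.12 msRTT
n3: 11543.7143 MB / 10.06 sec = 9622.2153 Mbps 41 %TX 49 %RX 0 retrans 0.15 msRTT
n4: 11622.8125 MB / 10.05 sec = 9701.0296 Mbps 43 %TX 51 %RX 0 retrans 0.10 msRTT
n5: 11523.6875 MB / 10.03 sec = 9638.8883 Mbps 43 %TX 50 %RX 0 retrans 0.15 msRTT
n6: 11608.0141 MB / 10.04 sec = 9695.7388 Mbps 43 %TX 50 %RX 0 retrans 0.10 msRTT
n7: 11580.1250 MB / 10.04 sec = 9679.3910 Mbps 43 %TX 50 %RX 0 retrans 0.13 msRTT
n8: 11608.0000 MB / 10.06 sec = 9678.7596 Mbps 42 %TX 50 %RX 0 retrans 0.10 msRTT
n9: 11553.3750 MB / 10.05 sec = 9643.7296 Mbps 45 %TX 50 %RX 0 retrans 0.11 msRTT
"""

LINE_RE = re.compile(r"^(n\d+):.*?(\d+) %TX (\d+) %RX")

# Assumed pairing: (transmitting stream, receiving "-r" stream) per CPU.
PAIRS = [("n2", "n3"), ("n4", "n5"), ("n6", "n7"), ("n8", "n9")]


def parse_cpu(text):
    """Return {stream_id: (tx_percent, rx_percent)} from nuttcp summary lines."""
    cpu = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            cpu[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return cpu


def pair_load(cpu, tx_stream, rx_stream):
    """Local CPU load for a pair: %TX of the sender plus %RX of the receiver."""
    return cpu[tx_stream][0] + cpu[rx_stream][1]


if __name__ == "__main__":
    cpu = parse_cpu(OUTPUT)
    for tx_stream, rx_stream in PAIRS:
        load = pair_load(cpu, tx_stream, rx_stream)
        print(f"{tx_stream}+{rx_stream}: {load}%")
```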

-Bill

Willy Tarreau

Jul 17, 2009, 4:40:12 PM

Hey guys, those are really nice numbers. Since TCP splicing appeared in the
kernel (once we got it fixed), I achieved 10 Gbps of HTTP proxying using
haproxy with very low CPU usage (about 20% of a Core2Duo 2.66 GHz).

Before buying the machines, I had been wandering around with the NICs
donated by Myricom in order to try to find a machine capable of supporting
this. My conclusion was that a lot of machines had difficulties getting
above 3.5, 4.7 and 6.5 Gbps of output traffic (those 3 numbers were always
the same, depending on the chipsets). There clearly was a bandwidth
limitation imposed by the chipset.

So I waited for the X38 and AM780FX chipsets to become available and
bought 3 machines (1 C2D, 1 AMD X2, 1 AMD X4). Those ones have no problem
with 10 Gbps of forwarded traffic (20 Gbps of total bus bandwidth), even
with 1500-byte frames, but I don't know how high they can go; maybe
they will saturate slightly above that.

Unfortunately, I only have 5 NICs in 3 machines and no switch (and CX4
is hard to find these days), so I'm probably stuck at 10 Gbps max.

Interestingly, I had the impression that forwarding data with TCP
splicing costs less CPU than IP forwarding, because the NICs can do
LRO.

Also, I know a French service provider who uses haproxy on Core i7
machines and who has already reached 5 Gbps of sustained traffic
with recent Intel dual-port NICs (though I'm not sure exactly which
ones). This is with very little CPU usage too, less than 2-3% user
and 15% system+softirq. On previous machines (quad-core Xeons), it
was impossible to go beyond 3 Gbps; it looked like the chipset was
the limiting factor there too (though I don't precisely remember which
one it was).

I really blamed the NICs because this guy's machine was about 4 times
more powerful than mine, but apparently it was just a chipset issue.

I also happen to have a customer who recently received a few Sun NXGE,
mounted in Sun x2100-m2 using an nvidia chipset which I tested OK at
10 Gbps with my myri10GE NICs. I'll try to see if I can run some tests
there, as Davem once said those NICs are really good too.

All in all, I find it really cool that our beloved OS scales that
well with the hardware :-)

Regards,
Willy

Bill Fink

Jul 17, 2009, 7:40:07 PM

Yes, I am quite impressed that the Linux kernel and TCP/IP network
stack perform amazingly well at these multi-10-GigE speeds. I was
especially interested in Jesper's IP forwarding results, as we haven't
tested that yet ourselves, and one of the intended applications of
these systems is as a multi-10-GigE firewall, so that's looking very
encouraging at this point.

-Bill

Jesper Dangaard Brouer

Jul 18, 2009, 3:20:08 AM

Nice, but I think we have a bug with the measured CPU usage. Eric
Dumazet did a fix, but he also pointed out in a later mail that it
seems it is not completely fixed yet...

> Before buying the machines, I had been wandering around with the NICs
> donated by Myricom in order to try to find a machine capable of supporting
> this. My conclusion was that a lot of machines had difficulties getting
> above 3.5, 4.7 and 6.5 Gbps of output traffic (those 3 numbers were always
> the same, depending on the chipsets). There clearly was a bandwidth
> limitation imposed by the chipset.
>
> So I waited for the X38 and AM780FX chipsets to become available and
> bought 3 machines (1 C2D, 1 AMD X2, 1 AMD X4). Those ones have no problem
> with 10 Gbps of forwarded traffic (20 Gbps of total bus bandwidth), even
> with 1500-byte frames, but I don't know how high they can go; maybe
> they will saturate slightly above that.

My experience is also that the AMDs can easily do 10Gbit/s forwarding,
but they suffer when doing bidirectional forwarding...

> Unfortunately, I only have 5 NICs in 3 machines and no switch (and CX4
> is hard to find these days), so I'm probably stuck at 10 Gbps max.

We are a fiber company, so I'm using our spare 10G optics, but I'm
limited by our supply of SFP+ currently.

I'll be getting two 6-port 10GbE NICs (82599-based, PCIe2 x16) in August,
so it will be interesting to see how high we can go! :-)

> Interestingly, I had the impression that forwarding data with TCP
> splicing costs less CPU than IP forwarding, because the NICs can do
> LRO.
>
> Also, I know a French service provider who uses haproxy on Core i7
> machines and who has already reached 5 Gbps of sustained traffic
> with recent Intel dual-port NICs (though I'm not sure exactly which
> ones). This is with very little CPU usage too, less than 2-3% user
> and 15% system+softirq. On previous machines (quad-core Xeons), it
> was impossible to go beyond 3 Gbps; it looked like the chipset was
> the limiting factor there too (though I don't precisely remember which
> one it was).
>
> I really blamed the NICs because this guy's machine was about 4 times
> more powerful than mine, but apparently it was just a chipset issue.
>
> I also happen to have a customer who recently received a few Sun NXGE,
> mounted in Sun x2100-m2 using an nvidia chipset which I tested OK at
> 10 Gbps with my myri10GE NICs. I'll try to see if I can run some tests
> there, as Davem once said those NICs are really good too.

The Sun NIU NIC has to use several hardware queues to achieve 10GbE.
I'm currently using these as generators, and that's one of my limiting
factors.

> All in all, I find it really cool that our beloved OS scales that
> well with the hardware :-)

Yes, it's really amazing how well the Linux net stack scales. I think
the primary thanks for this effort go to DaveM's multiqueue changes and
Eric Dumazet's tuning.

PS: I'll be offline until Tuesday.

--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network developer
Cand. Scient Datalog / MSc.
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer
