I'm still designing my next iteration of expansion for my cluster
Aeolus, and still revisiting what connection technology to use. IB
seems the least expensive, as long as you buy all you're going to buy
right up front, and you don't want to make a cluster of more than 36
nodes (ever). Expansion starts getting quite expensive after that
point, especially if one maintains 100% non-blocking.
For those reasons, I was leaning toward 10GbE if there was a
cost-effective solution available. I thought it was a lost cause
until I spoke to SolarFlare. They have a 10GbE NIC that they say can
do <8 microsecond latency with "standard drivers", priced at around
$300/card. This puts 10GbE in the cost-competitive ballpark. As
such, I'm seriously considering it, and was wondering if anyone had
any experience with this configuration, or any pros/cons to consider.
My use case:
1) PVFS data over IP
2) MPI jobs (currently run over 1GbE TCP)
3) High speed connectivity across campus to another cluster, primarily
for sharing data/home directories/licenses/etc. NOT to run MPI jobs
split across the clusters
For technical, fiscal, and political reasons, #3 is beyond the possibility
of a native IB connection. I know I can get a Mellanox IB switch with
2 10GbE ports and use the 10GbE for cluster IP connectivity outside.
My current pro/con list:
Pro IB:
Faster
Somewhat lower latency
Cheaper up front
Con IB
More of a management pain
Would have to run IPoIB and thus 2 IP networks and the pains
associated with that
Would need a media converter to run to 10GbE for outside-cluster connectivity
Would be a much larger pain to expand beyond the single-switch
capacity and rapidly become considerably more costly
Pro 10GbE:
Single IP network (can kickstart across it, etc)
"normal" MPI interconnect
Supports high-speed PVFS2 natively
Easily expands inexpensively
Easily interconnects to other cluster / resources
No painful/proprietary management requirements
Cons 10GbE:
Slightly lower bandwidth and slightly higher latency
Slightly more expensive up front
Any suggestions?
--Jim
A bit in line with the topic of 10GbE NICs, I was wondering whether
you know of servers with 10GbE on board (replacing the 1GbE).
Are there any plans from hardware manufacturers
to put 10GbE on motherboards soon?
Thanks,
Vlad
> Hi all:
>
> I'm still designing my next iteration of expansion for my cluster
> Aeolus, and still revisiting what connection technology to use. IB
> seems the least expensive, as long as you buy all you're going to buy
> right up front, and you don't want to make a cluster of more than 36
> nodes (ever). Expansion starts getting quite expensive after that
> point, especially if one maintains 100% non-blocking.
>
I'm a big fan of Ethernet because of administrative simplicity,
demonstrated interoperability, and extension into the network beyond
your cluster. Having said that, much depends on your particular
applications; if they are latency sensitive, a true HPC network is
essential.
We have three different clusters at SDSC that go about this using
variations of technology.
Triton is 10Gb Myrinet with transparent bridging to 10GbE. Cluster apps use
MX for low-latency messaging and TCP for off-cluster communication. Every
node has both an MX "address" and a 10GbE TCP/IP address.
iperf between nodes over the TCP layer achieves 98% of wire speed. The
Myricom switch has 32 10GbE ports, 16 of which are in a channel bond to
the "outside world". Triton was designed to be "on the network," but
Myricom is out of the large-switch business, unfortunately.
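That 98% is close to what framing overhead alone allows with jumbo frames; a rough back-of-the-envelope sketch in Python (generic Ethernet/IP/TCP header sizes, not anything measured on Triton):

    # Rough ceiling on TCP goodput over 10GbE from framing overhead alone.
    # Per-frame overhead: preamble 8 + Ethernet header 14 + FCS 4 + inter-frame gap 12.
    # Per-packet headers: IP 20 + TCP 20 + 12 (timestamps).
    LINE_RATE_GBPS = 10.0
    FRAME_OVERHEAD = 8 + 14 + 4 + 12
    IP_TCP_HEADERS = 20 + 20 + 12

    for mtu in (1500, 9000):
        goodput_fraction = (mtu - IP_TCP_HEADERS) / (mtu + FRAME_OVERHEAD)
        print(f"MTU {mtu}: ~{goodput_fraction * LINE_RATE_GBPS:.2f} Gb/s "
              f"({goodput_fraction:.1%} of line rate)")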
Trestles is a QDR IB connected cluster. Attached to the central IB switch
are Mellanox Ethernet bridges with 12 10GbE SFP+ ports each. The bridges
provide virtual 10GbE adapters. Nodes have special drivers that connect
over IB to their virtual adapter. This works pretty well, but frame size
is limited to 4K (not the usual 10GbE jumbos). That seems to be a software
limitation. In their high-end switches, Mellanox uses a similar approach
to make the ports dual-purpose (either 10GbE or QDR IB).
Gordon uses I/O nodes, each with 2x QDR and 2x 10GbE. They function as
Lustre routers but not IP routers, so the WAN connectivity from compute
nodes is more limited.
All of these clusters access Lustre file systems that are connected to a
pair of Arista chassis. In particular, we have 64 OSSes connected @ 2 x
10GbE each. The OSSes have 36 2 TB enterprise SAS drives and deliver
roughly 2 GB/sec to/from disk. Triton is connected into this switching
complex @ 16x10GbE, Trestles @ 24x10GbE, and Gordon @ 128x10GbE. In other
words, we run a lot of 10GbE (and in 3-4 years this would likely be
replaced by 40GbE). Ethernet gives us the isolation that we want, more
than acceptable I/O performance, and known interoperability.
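As a rough sanity check on those numbers (simple arithmetic on the figures above, with an assumed ~95% TCP efficiency):

    # Is 2 x 10GbE per OSS a reasonable match for ~2 GB/sec of disk bandwidth?
    links_per_oss = 2
    link_gbps = 10
    tcp_efficiency = 0.95              # assumed, see the goodput sketch above
    net_ceiling_GBps = links_per_oss * link_gbps * tcp_efficiency / 8
    print(f"network ceiling per OSS: ~{net_ceiling_GBps:.2f} GB/s vs ~2 GB/s from disk")
    print(f"aggregate disk bandwidth across 64 OSSes: ~{64 * 2} GB/s")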
>
> For those reasons, I was leaning toward 10GbE if there was a
> cost-effective solution available. I thought it was a lost cause
> until I spoke to SolarFlare. They have a 10GbE NIC that they say can
> do <8 microsecond latency with "standard drivers", priced at around
> $300/card. This puts 10GbE in the cost-competitive ballpark. As
> such, I'm seriously considering it, and was wondering if anyone had
> any experience with this configuration, or any pros/cons to consider.
>
> My use case:
>
> 1) PVFS data over IP
>
That should work well over 10GbE.
> 2) MPI jobs (currently run over 1GbE TCP)
>
10GbE will be better than 1GbE -- but, as above, the need for low latency
is dictated by the apps.
> 3) High speed connectivity across campus to another cluster, primarily
> for sharing data/home directories/licenses/etc. NOT to run MPI jobs
> split across the clusters
>
We do that too, and it is one of the key reasons for doing 10G. Years ago,
I wrote an NSF MRI grant to build a multi-rail 10GbE network on our campus
(Quartzite). We have 60 cross-campus fibers that meet in an E1200 switch
at Calit2. From SDSC, we have 5x10GbE into that network. The -most-
difficult part is finding the fibers.
We run passive DWDM so our 5x10GbE is actually a single fiber pair. This
was expensive 5 years ago, but is not expensive today. That network
(Quartzite) also bridges to our campus network.
>
> For technical, fiscal, and political reasons, #3 is beyond the possibility
> of a native IB connection. I know I can get a Mellanox IB switch with
> 2 10GbE ports and use the 10GbE for cluster IP connectivity outside.
>
> My current pro/con list:
> Pro IB:
> Faster
> Somewhat lower latency
> Cheaper up front
>
> Con IB
> More of a management pain
> Would have to run IPoIB and thus 2 IP networks and the pains
> associated with that
>
I don't view 2 IP networks as very problematic; we run multiple IP networks
on Triton.
> Would need a media converter to run to 10GbE for outside-cluster
> connectivity
>
With Trestles, our MTU size was limited.
> Would be a much larger pain to expand beyond the single-switch
> capacity and rapidly become considerably more costly
>
Only if you really require full bisection. Much depends on how big you want
to get.
Small port counts - IB is cheaper.
Large port counts - IB and 10GbE are roughly the same price on the fabric
side, approximately $1K/port in full-bisection configs. Our large Arista
switches and our large Mellanox IB switches are nearly identical in price.
Our large Myricom switch (layer 2 only) was very modest in price ($300/port
for MX).
>
> Pro 10GbE:
> Single IP network (can kickstart across it, etc)
> "normal" MPI interconnect
> Supports high-speed PVFS2 natively
> Easily expands inexpensively
>
Again depends on ultimate port count.
> Easily interconnects to other cluster / resources
> No painful/proprietary management requirements
>
> Cons 10GbE:
> Slightly lower bandwidth and slightly higher latency
> Slightly more expensive up front
>
> Any suggestions?
>
How large do you want to go on your cluster?
-P
>
> --Jim
>
>
--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)
Since it is a new product, you will need to work out all the bugs that come
with new software. Hope that you have the time to deal with new features.
How big is your cluster?
Whether the switch is 10GbE or IB, you will need to consider its size;
when you grow, you may need to change the chassis.
Since you seem to be happy with 10GbE, consider QDR with 75% blocking; it
will save some switch cost, and you can always reduce the blocking in the
future.
There are IB-to-IP gateways, so bridging IB with IP should not be a problem
today.
My 2c
Sent from my iPhone
Sun has 10GbE with its SPARC servers.
Intel seems to have 10GbE on board with its next-generation chips.
-LT
Sent from my iPhone
Lots of good info, thanks Phil.
I think I'll answer the last question first: I'm not sure how large
the cluster will grow to. Today it's 24 compute nodes + 3 PVFS2 I/O
nodes + 1 head node. We ARE buying more hardware, but probably not
going to add more than 12 nodes at present. So it's very likely that we'll
outgrow a single 36-port IB switch (which seems to be the largest they
come before going to leaf+spine models). However, I can get 48-port
10GbE switches for around $7k (whereas my 36-port IB switch is about
$8.2k). I also figured that as the 10GbE network expands, we can monitor
the uplinks and bond more channels as necessary. I don't foresee
needing to go beyond 100 ports for quite a while (at which point this
technology would probably be replaced anyway). If we do upgrade 100%
of our existing cluster plus our new additions, the 10GbE solution can
still do that with one switch, with a little room for expansion.
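Just to put those switch prices on a per-port basis (switch cost only, NICs and cables extra; a quick sketch using the ballpark figures above):

    # Per-port cost of the switch alone, from the ballpark prices quoted above.
    switches = {"48-port 10GbE": (7000, 48), "36-port IB": (8200, 36)}
    for name, (price, ports) in switches.items():
        print(f"{name}: about ${price / ports:.0f} per port for the switch alone")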
One of my understandings is that IB does not "trunk" like Ethernet.
That is, if I have 20 clients generating packets to a single
destination, each at 1/20th of the link rate, their packets
will all be interwoven on the output link (e.g., one can easily just use
a single 10GbE link for the "uplink" as long as the total input raw
bandwidth is less than 10GbE). However, I'm told that with IB this is not
OK; it's not about bandwidth, it's about hosts. I've been told that IB
runs as circuit-switched whereas Ethernet is packet-switched. I don't
know if I've been correctly informed. In any case, that's the reason
I've been given for why oversubscription is bad on IB but acceptable
on Ethernet.
As to apps, I don't have any data points as to how latency-sensitive
they are. We've only really ever had Ethernet (well, we had 6 nodes
in a blade enclosure at a remote site with internal DDR IB, and most
of the jobs didn't seem to be notably different on that vs. the
GigE-connected main cluster). In general, the jobs we presently
predominantly run are not communication-intensive; they tend to be more
"compute timestep, sync, compute next timestep," where there's little
communication during the compute steps. I can tell you that we're
saturating our IP links for PVFS presently.
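One quick way to get a latency data point on the existing network is a small MPI ping-pong between two nodes; a minimal sketch with mpi4py (assuming mpi4py is installed; the hostnames in the comment are placeholders):

    # Run with one rank on each of two nodes, e.g.:
    #   mpirun -np 2 -host nodeA,nodeB python pingpong.py   (hostnames are placeholders)
    from mpi4py import MPI
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    iters = 10000
    buf = bytearray(8)            # small message, so latency dominates over bandwidth

    comm.Barrier()
    start = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = time.perf_counter() - start

    if rank == 0:
        print(f"average one-way latency: {elapsed / (2 * iters) * 1e6:.1f} microseconds")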
Recent conversations with SolarFlare and LG-Ericsson seem to suggest
that even on a small deployment, 10GbE will be notably less expensive
(~$150-$350 per port/node, including NIC) than IB without the IB<->IP
bridge (which, if I implement it using Mellanox purpose-designed equipment,
would add another $5-10k).
So, with the current pricing info, 10GbE looks more attractive.
Anything I'm missing?
--Jim
10GbE should work nicely with PVFS2. I checked with the PVFS developers
a few months ago, and they are interested in optimizing PVFS2 for 10GbE
OS-bypass (i.e., VFIO and/or SolarFlare's userspace network stack). 10GbE
should give you higher bandwidth and much lower latency than your existing
Gigabit Ethernet network.
Note that VFIO is going to be upstreamed, but it is currently not in the
official Linux tree yet... See also the discussion we had a few months ago:
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2012-January/056149.html
Rayson
=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/
Scalable Grid Engine Support Program
http://www.scalablelogic.com/
> Lots of good info, thanks Phil.
>
> I think I'll answer the last question first: I'm not sure how large
> the cluster will grow to. Today it's 24 compute nodes + 3 PVFS2 I/O
> nodes + 1 head node. We ARE buying more hardware, but probably not
> going to add more than 12 nodes at present. So it's very likely that we'll
> outgrow a single 36-port IB switch (which seems to be the largest they
> come before going to leaf+spine models). However, I can get 48-port
> 10GbE switches for around $7k (whereas my 36-port IB switch is about
> $8.2k). I also figured that as the 10GbE network expands, we can monitor
> the uplinks and bond more channels as necessary. I don't foresee
> needing to go beyond 100 ports for quite a while (at which point this
> technology would probably be replaced anyway). If we do upgrade 100%
> of our existing cluster plus our new additions, the 10GbE solution can
> still do that with one switch, with a little room for expansion.
>
> One of my understandings is that IB does not "trunk" like Ethernet.
> That is, if I have 20 clients generating packets to a single
> destination, each at 1/20th of the link rate, their packets
> will all be interwoven on the output link (e.g., one can easily just use
> a single 10GbE link for the "uplink" as long as the total input raw
> bandwidth is less than 10GbE).
This is incorrect -- networks are like plumbing pipes.
If you really have 20 clients sending to a single destination, the
congestion is at the endpoint, not in the network.
1. The big difference between IB and Ethernet is that IB supports multiple
paths between two destinations. Ethernet (with spanning tree turned on to
remove loops) supports only a single path.
> However, I'm told that with IB this is not
> OK; it's not about bandwidth, it's about hosts. I've been told that IB
> runs as circuit-switched whereas Ethernet is packet-switched.
2. Sort of. IB switch fabrics are wormhole routed. Suppose there are 3
switch chips between you and your destination. A "tracer" packet between
source and destination creates a short-lived circuit; when the circuit is
complete, the packet data is sent, and then the circuit is torn down.
Ethernet fabrics are (usually) store-and-forward: your packet is sent to
the first switch chip and stored there; when that switch chip is able to
send to the next hop, it sends, and so on. In advanced Ethernet fabrics
(Gnodal, Arista, and others) there is in reality a fusion between
Ethernet's store-and-forward and so-called wormhole-routed fabrics like IB
(and Myrinet and others, too).
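To make the difference concrete, here is a toy latency model (illustrative numbers only, not measurements of any of the fabrics mentioned; the 0.5 us per-hop switch delay is an assumed value): store-and-forward pays full packet serialization at every hop, while cut-through/wormhole fabrics pay it roughly once plus a small per-hop delay.

    # Toy model of store-and-forward vs. cut-through latency across a fabric.
    def serialize_us(pkt_bytes, link_gbps):
        return pkt_bytes * 8 / (link_gbps * 1e3)   # microseconds to put the packet on the wire

    def store_and_forward_us(pkt_bytes, hops, link_gbps=10, per_hop_us=0.5):
        return hops * (serialize_us(pkt_bytes, link_gbps) + per_hop_us)

    def cut_through_us(pkt_bytes, hops, link_gbps=10, per_hop_us=0.5):
        return serialize_us(pkt_bytes, link_gbps) + hops * per_hop_us

    for pkt in (64, 1500, 9000):
        print(f"{pkt:>5} B, 3 hops: store-and-forward ~{store_and_forward_us(pkt, 3):.2f} us, "
              f"cut-through ~{cut_through_us(pkt, 3):.2f} us")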
> I don't
> know if I've been correctly informed. In any case, that's the reason
> I've been given for why oversubscription is bad on IB but acceptable
> on Ethernet.
>
Down deep, the problem is that while IB fabrics have multiple paths between
destinations, poor routing algorithms can actually create hotspots in the
fabric. This was more acute several years ago, less acute now. On an
Ethernet fabric there is only one route, but that route can contain bonded
links. In general, LACP algorithms on Ethernet do a pretty good job of
balancing traffic across all the links in a channel bond (OK, they usually
do a good job).
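The balancing is typically per-flow: a layer-3+4 style hash of addresses and ports picks a member link, so any single flow stays on (and is limited to) one link while many flows spread across the bond. A rough sketch of the idea (not the actual kernel hash; addresses and ports below are made up):

    # Per-flow link selection in a channel bond, roughly how an L3+L4 hash policy behaves.
    from collections import Counter
    import random

    N_LINKS = 4
    def pick_link(src, dst, sport, dport):
        return hash((src, dst, sport, dport)) % N_LINKS   # one flow -> one member link

    random.seed(0)
    flows = [(f"10.1.0.{random.randint(1, 50)}", "10.1.0.100",
              random.randint(1024, 65535), 12345) for _ in range(1000)]
    print(Counter(pick_link(*f) for f in flows))           # flows spread across the 4 links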
>
> As to apps, I don't have any data points as to how latency-sensitive
> they are. We've only really ever had Ethernet (well, we had 6 nodes
> in a blade enclosure at a remote site with internal DDR IB, and most
> of the jobs didn't seem to be notably different on that vs. the
> GigE-connected main cluster). In general, the jobs we presently
> predominantly run are not communication-intensive; they tend to be more
> "compute timestep, sync, compute next timestep," where there's little
> communication during the compute steps. I can tell you that we're
> saturating our IP links for PVFS presently.
>
> Recent conversations with SolarFlare and LG-Ericsson seem to suggest
> that even on a small deployment, 10GbE will be notably less expensive
> (~$150-$350 per port/node, including NIC) than IB without the IB<->IP
> bridge (which, if I implement it using Mellanox purpose-designed equipment,
> would add another $5-10k).
>
> So, with the current pricing info, 10GbE looks more attractive.
> Anything I'm missing?
>
I'm a big fan of Ethernet, but use low-latency HPC fabrics when it makes
sense. Ethernet "self-configures" and has decades of demonstrated
interoperability across vendors and the like.
Arista shattered the price barrier on 10GbE a few years ago, and their
52-port switch (7050 series) runs for academics at under $300/port. I'm
purchasing one of these for one of my projects. Other companies have also
joined the fray for low-cost 10GbE (and that's why I'm a fan -- commodity
pressures push down the cost as ports/year sold rise). While we use IB,
there is in reality only a single vendor.
-P