Paper Title | A Scalable, Commodity Data Center Network Architecture |
Author(s) | Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat |
Date | 2008 |
Novel Idea | As clusters built from commodity hardware become more popular, it has become apparent that inter-node bandwidth is often a major bottleneck. The authors identify inefficiencies in the traditional networking topology used in these clusters as a primary cause, and propose an alternative 'Fat Tree' topology which will allow any node to communicate with any other at full capacity. |
Main Result(s) |
They identify oversubscription as the core inefficiency present in traditional topologies, defining it as the ratio of the aggregate bandwidth of the hosts to the bisection bandwidth of the network; a larger ratio means less of the hosts' bandwidth can actually cross the network at once. They propose a topology, the 'Fat Tree', which can in theory provide a 1:1 oversubscription ratio. It does this by providing a more densely connected network, such that traffic out of a given pod of nearby nodes can be routed through the network in a multitude of ways. Another primary issue with traditional network topologies is that the top-level switches must be expensive, high-end 10 GigE machines. In the Fat Tree layout, a large number of commodity switches, each responsible for many fewer nodes, defrays the cost considerably: the price per port per bit/sec does not grow linearly with switch class, so using more, cheaper machines yields substantial savings. Additionally, power usage and heat generation per port per bit/sec are not linear either, so Fat Tree deployments waste far less energy for the same performance than those using fewer, higher-end switches. The Fat Tree topology comes with some challenges. For one, standard IP routing protocols tend to choose one path between two points and stick to it, which would immediately negate the advantage conferred by the Fat Tree layout. To offset this, the authors introduce their own routing scheme, which uses the ID byte of each destination host as a source of deterministic entropy for spreading connections evenly over the available paths. Another challenge is the extremely large number of physical connections required, especially as clusters become large. They present a fairly reasonable-looking model of how the cables can be bundled and arranged to minimize the awkwardness of this, but it is ultimately an unavoidable downside of having each switch at each level connect to k/2 switches at the adjacent level (where k is the number of ports per switch, which is also the number of pods). |
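To make the scale argument concrete, here is a minimal back-of-the-envelope sketch in Python (my own; the function and variable names are illustrative, not from the paper) showing how the host, switch, and oversubscription numbers fall out of the switch port count k:

# Sizing a k-ary fat tree built from k-port switches (illustrative sketch).
# k pods, each with k/2 edge and k/2 aggregation switches; (k/2)^2 core
# switches; and k^3/4 hosts in total.
def fat_tree_stats(k, link_gbps=1.0):
    assert k % 2 == 0, "k must be even"
    hosts = k ** 3 // 4
    edge = aggregation = k * (k // 2)
    core = (k // 2) ** 2
    # Total core-layer capacity is (k/2)^2 switches * k ports = k^3/4 links,
    # exactly one per host, which is why the oversubscription ratio is 1:1.
    oversubscription = (hosts * link_gbps) / (core * k * link_gbps)
    return hosts, edge + aggregation + core, oversubscription

print(fat_tree_stats(48))  # (27648, 2880, 1.0) with 48-port switches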
Impact | I never have anything interesting to write in this section. |
Evidence |
They evaluate their work on a small virtual cluster running on an even smaller set of actual machines. While their data seems to solidly back up their claims about the theoretical superiority of the Fat Tree topology, it leaves many unanswered questions about the real-world feasibility of implementing it. For instance, they do not address how hard it will be for cluster operators to obtain or create the router-level changes that Fat Trees require (I presume using Click routers at large scale is not feasible in the real world). |
Prior Work | They use the Click modular router, described in Kohler, Morris, Chen, Jannotti (hi JJ!), and Kaashoek (2000), to test the viability of the routing changes they need to make. |
Reproducibility | They provide detailed algorithms for just about everything; reproducing their virtual experiment should be very possible. |
Question | It seems like a big part of what makes this awkward is the excessive cabling everywhere - maybe I'm wrong and this is really not a big issue, but I kind of expect it would be. If so, I wonder if it's possible to do some reasoning about the kinds of flows that happen, and strategically remove top layer connections between pods and Core routers that service only pods this pod will talk to infrequently, or second-layer connections between edge routers and aggregation routers they will rarely use. This does require a lot of prior knowledge about the nature of the computation the cluster will be used for though... |
Criticism |
A mild criticism of the organization of the paper: it frequently references later results in the opening sections without enough context for the reader to understand what's going on without having read the later sections. |
Ideas for further work | Like, actually building one, and stuff. |
Authors: Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat
Date: SIGCOMM 2008
Novel idea: The authors rethink conventional data center network architecture and develop a novel system that improves inter-node communication bandwidth within a data center.
Main result(s): The authors modify existing router software to take advantage of a Clos topology known as a fat-tree. This modified system achieves significant cost savings for the bandwidth delivered, in comparison to previous systems.
Impact: This system does not require developers to make any major changes to their programs; therefore, it has the potential to add efficiency to a wide array of existing applications at low upfront cost.
Evidence: The authors evaluate their system on a cluster of 20 switches and 16 end hosts and run a variety of benchmarks.
Prior work: The authors base their work on ideas introduced by Charles Clos in 1953 in the field of telephone switching.
Reproducibility: Although no source code is available, the authors describe their modifications to core algorithms extensively and provide several examples of pseudocode; therefore, their work should be fairly reproducible.
Criticism: If I were a system administrator, I would be hesitant to deploy such experimental changes to well tested systems such as routing.
A Scalable, Commodity Data Center Network Architecture
Authors
Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat
Date
SIGCOMM '08, August 2008
Novel Idea
The paper proposes an architecture based on commodity Ethernet
switches that provides full bisection bandwidth in a scalable manner.
Main Results
The fat-tree architecture allows up to ~27,000 hosts (27,648 with
48-port switches) to communicate at the full bandwidth of their
network interfaces. It requires no changes to the hosts, but does
require some changes to the switches.
Impact
The impact is on the cost of building data centers in which nodes can
potentially communicate at the full speed of their network
interfaces. Part of this comes from the solution being based on
commodity equipment (with no or few modifications).
Evidence
The authors present cost estimates, and describe the architecture and
the mechanisms that should be incorporated into the switches to
improve performance in some cases (fairly distributing routes among
ports).
To that end, they describe a two-level routing table (which they note
can be implemented in hardware with CAMs), flow classification and
reassignment, and finally the scheduling of large flows (to minimize
route conflicts).
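To make the two-level lookup concrete, here is a minimal Python sketch (my own; the table contents and addresses are illustrative, not taken from the paper). Intra-pod destinations terminate on a prefix match, while inter-pod traffic falls through to a suffix match on the low-order host byte, spreading flows across the upward ports:

import ipaddress

# Illustrative two-level table for one switch: entries with a port
# terminate the lookup; the /0 entry falls through to the suffix table.
PREFIXES = [
    ("10.2.0.0/24", 0),   # same-pod subnet -> downward port 0
    ("10.2.1.0/24", 1),   # same-pod subnet -> downward port 1
    ("0.0.0.0/0", None),  # everything else -> consult the suffix table
]
SUFFIXES = {2: 2, 3: 3}   # low-order host byte -> upward port

def lookup(dst_ip):
    dst = int(ipaddress.IPv4Address(dst_ip))
    for prefix, port in PREFIXES:              # first level: longest prefix
        net = ipaddress.ip_network(prefix)
        shift = 32 - net.prefixlen
        if dst >> shift == int(net.network_address) >> shift:
            if port is not None:
                return port
            break                              # fall through to second level
    return SUFFIXES.get(dst & 0xFF)            # second level: suffix match

print(lookup("10.2.1.7"))   # intra-pod destination -> port 1
print(lookup("10.4.1.2"))   # inter-pod: host byte 2 -> upward port 2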
The paper also proposes a way to pack the equipment in order to tackle
the high demand for wiring.
Prior Work
They base their work on techniques for building scalable
interconnection networks for MPPs. They mention some vendors that
used fat-trees in such systems (Thinking Machines, SGI), and also
note that Myrinet/InfiniBand switches use fat-trees (though they
deliberately avoid Myrinet/InfiniBand, giving preference to commodity
Ethernet equipment).
Competitive Work
The authors mention toroidal interconnect networks for MPPs, which
could potentially be useful for cluster environments, but they argue
that the associated wiring cost is prohibitive, as is scattering
routing decisions across the nodes.
Finally, they cite OSPF2 and ECMP, which they rule out because of
packet reordering (caused by round-robin or random port choosing),
growth in routing table size (when prefixes are split; see
"Questions + Criticism" below), and forwarding decisions that are
oblivious to bandwidth (the hash-based scheme depends only on
source/destination addresses).
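As a toy illustration of the bandwidth-obliviousness point (my own sketch, not code from the paper): a static hash over a flow's endpoints picks the same port regardless of load, so two large flows can collide on one uplink while others sit idle.

import hashlib

def ecmp_port(src, dst, num_ports=4):
    # Static, load-unaware choice: depends only on the flow's endpoints.
    digest = hashlib.md5(f"{src}->{dst}".encode()).digest()
    return digest[0] % num_ports

# Two large flows may hash to the same uplink no matter how busy it is:
print(ecmp_port("10.0.1.2", "10.4.1.2"))
print(ecmp_port("10.0.1.3", "10.2.0.9"))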
Reproducibility
The difficulty lies in the technical details involved in the general
architecture and in the particular implementation the authors have
chosen. However, the essential information, such as the flow
classification algorithm, the flow scheduler methodology, the
two-level routing, and the algorithm for producing the switch routing
tables, is provided, except for a few small parameter details
(phrases such as "[...] every few seconds [...]" appear in the text).
Overall, I would say that the experiments are reproducible.
Questions + Criticism
The paper presents heat dissipation and power demands normalized to
the aggregate bandwidth of the switches. I think showing those
numbers on an absolute scale is important, because the fat-tree
architecture uses more switches in total.
Along the same lines, how much would the system consume and dissipate
when we consider the TCO over time?
Still on the subject of heat: how much does the packaging affect this
metric? It appears to be mostly about reducing cable length, but
cables look pretty cheap compared to the electrical power spent over
the months to cool the system down.
A small technical question: how exactly does the technique for
splitting prefixes work in ECMP?
Ideas for Further Work
The first idea is measuring power consumption and heat dissipation
more carefully, but the authors mention that this is already ongoing
work.
It would also be interesting to measure the impact of the fat-tree
architecture on a MapReduce system (in the shuffle between the map
and reduce phases).