Hedera is a scalable, dynamic flow scheduling system that adaptively schedules a multi-stage switching fabric to efficiently utilize aggregate network resources. The core idea is to be "dynamic," in contrast to the traditional ECMP multipath routing algorithm. However, the paper states that Hedera is not a replacement for ECMP, since ECMP performs very well with large numbers of flows, for example when MapReduce is doing its shuffle. Hedera complements ECMP when a varying number of large flows is present. Since ECMP performs poorly in that situation, Hedera advocates using a dynamic, centralized view to control the allocation of flow paths in order to maximize the utilization of network bandwidth resources.
In the Background section, the paper points out ECMP's weakness in dealing with small numbers of flows and characterizes data center traffic patterns. Figure 2 demonstrates two types of ECMP collisions, in which two flows hash to the same core switch or aggregation switch. The scheduling could clearly be better, since idle paths remain unallocated. Figure 3 shows that when flows are sparsely distributed across the data center, ECMP can lose a relatively large fraction of bisection bandwidth because of hash collisions; as the number of flows grows, the loss becomes reasonable because random hashing can then truly balance the load.
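The collision behavior in Figure 2 is easy to reproduce in a toy model. Below is an illustrative sketch (not the hash any real switch uses; hardware ECMP hashes are vendor-specific): a flow's 5-tuple deterministically picks one of the equal-cost core switches, so two large flows can land on the same core link and share it while other cores sit idle.

```python
import hashlib

def ecmp_core(flow_5tuple, n_cores):
    """Deterministically map a flow's 5-tuple to one of n_cores
    equal-cost core switches (illustrative MD5-based hash)."""
    digest = hashlib.md5(repr(flow_5tuple).encode()).digest()
    return digest[0] % n_cores

# Two distinct large flows may hash to the same core switch; when they
# do, each gets roughly half the core link's bandwidth even though
# other core switches have spare capacity.
flow_a = ("10.0.0.1", 5001, "10.0.1.1", 80, "tcp")
flow_b = ("10.0.0.2", 5002, "10.0.1.2", 80, "tcp")
core_a = ecmp_core(flow_a, 4)
core_b = ecmp_core(flow_b, 4)
```

With only two flows and four cores the collision probability is 1/4, which matches the intuition from Figure 3 that hashing hurts most when a few large flows dominate.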
Hedera has a main control loop, and each iteration has three steps: flow detection, natural demand estimation, and dynamic path allocation. The loop runs every 5 seconds. The paper discusses Hedera's behavior on a fat-tree topology. The two major components are the natural demand estimator and the scheduler (path allocator).
I think the paper does not do a good job of explaining the demand estimator, which is where my major confusion lies. In Section 4.2, the paper says a lot about the demand matrix (which is really simple to understand) but little about the estimation algorithm. Although pseudocode is presented, the explanation is thin compared to the scheduler (Simulated Annealing) section.
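To make the estimator concrete, here is a simplified sketch of the idea from Section 4.2 as I understand it. It alternates a source phase (share each sender's leftover capacity equally among its unconverged flows) and a receiver phase (cap flows at oversubscribed receivers and mark them converged) until a fixed point. The equal-share cap in the receiver phase is a simplification of the paper's max-min computation; capacities are normalized to 1.0 per host link, and all names are my own.

```python
def estimate_demands(flows, n_hosts, max_iters=100):
    """flows[i][j] = number of flows from host i to host j.
    Returns estimated per-flow demands keyed by (src, dst, index)."""
    demand, converged = {}, {}
    for i in range(n_hosts):
        for j in range(n_hosts):
            for f in range(flows[i][j]):
                demand[(i, j, f)] = 0.0
                converged[(i, j, f)] = False

    for _ in range(max_iters):
        changed = False
        # Source phase: split each sender's remaining capacity equally
        # among its not-yet-converged flows.
        for i in range(n_hosts):
            fixed = sum(demand[k] for k in demand if k[0] == i and converged[k])
            free = [k for k in demand if k[0] == i and not converged[k]]
            if free:
                share = max(0.0, 1.0 - fixed) / len(free)
                for k in free:
                    if abs(demand[k] - share) > 1e-9:
                        demand[k] = share
                        changed = True
        # Receiver phase: if a receiver is oversubscribed, cap its flows
        # at an equal share and mark the capped flows converged
        # (simplification of the paper's receiver-limited computation).
        for j in range(n_hosts):
            inflow = [k for k in demand if k[1] == j]
            if sum(demand[k] for k in inflow) > 1.0 + 1e-9:
                share = 1.0 / len(inflow)
                for k in inflow:
                    if demand[k] > share:
                        demand[k] = share
                        converged[k] = True
                        changed = True
        if not changed:
            break
    return demand
```

For example, with two flows from host 0 to host 1 and one flow from host 2 to host 1, the receiver link is the bottleneck and each flow converges to a demand of 1/3.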
For the scheduling (placement) module, Global First Fit is straightforward, and its time complexity depends on the number of possible paths, which for a fat tree is (k/2)^2. The Simulated Annealing method is very interesting: it brings this classic technique into the data center networking field, which is the first time I have seen it in this area (previously I had only seen it in computer-vision-related work). The idea is also easy to understand. It first chooses an initial state, then generates neighbors and computes their energy. There are three different types of neighbor generators, selected at random so the algorithm will not get stuck in particular situations. The algorithm moves to lower-energy neighbors, or to higher-energy neighbors with a certain probability, and this cycle continues until the temperature reaches zero. An interesting problem arises when the flows change, since that yields different initial states, so this should be considered in the SA method. The time complexity of SA is related to the average number of large flows per host, because the algorithm operates on the assignment of core switches, which is a one-to-one mapping to destination hosts.
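The loop above can be sketched as a generic annealer over core-switch assignments. This is a minimal illustration, not the paper's implementation: the state maps each destination host to a core switch, `energy` is a caller-supplied placeholder for the paper's cost (demand exceeding link capacities), and I use a single neighbor generator where the paper randomly picks among three.

```python
import math
import random

def simulated_annealing(hosts, cores, energy, temp=1000.0, cooling=0.95,
                        min_temp=1e-3, rng=random):
    """Anneal an assignment of destination hosts to core switches.
    energy(state) should return a cost to minimize (lower is better)."""
    # Initial state: each destination host gets a random core switch.
    state = {h: rng.choice(cores) for h in hosts}
    e = energy(state)
    best, best_e = dict(state), e
    while temp > min_temp:
        # Neighbor move: reassign one randomly chosen host's core.
        h = rng.choice(hosts)
        old = state[h]
        state[h] = rng.choice(cores)
        ne = energy(state)
        # Accept downhill moves always; accept uphill moves with
        # probability exp(-(ne - e) / temp) to escape local minima.
        if ne <= e or rng.random() < math.exp(-(ne - e) / temp):
            e = ne
            if e < best_e:
                best, best_e = dict(state), e
        else:
            state[h] = old  # reject: revert the move
        temp *= cooling  # geometric cooling schedule
    return best, best_e
```

A toy energy function, such as the number of hosts colliding on the same core switch, is enough to watch the annealer converge toward a collision-free assignment.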
The Implementation section describes how the testbed was built. In order to simulate data center conditions, they also built a simulator that models TCP flow behavior in discrete time ticks. Interestingly, the behaviors of AIMD, slow start, and TCP Reno are all considered.
The resulting graphs in Section 6 clearly demonstrate the comparison between Hedera and ECMP. The major metric is average bisection bandwidth. In all communication patterns, performance ranks as SA > GFF > ECMP. Even in the data shuffle scenario, where ECMP cannot be blamed for its performance, Hedera also does well. Since the system is built on OpenFlow, the SDN control loop overhead is measured, and the paper argues that the control loop will not consume much time: the major delay occurs in the controller, at around 100 ms, far shorter than the 5-second scheduling period.
Overall, this is a very good paper on network resource allocation whose ideas could be borrowed. The highlights are the SA algorithm, the dynamic view, and the combination of SDN and the demand estimator. If you are doing research in the network resource sharing area, this paper is definitely a must-read!
Questions
1. Why is the demand estimator split into two parts (source and destination), and why does the paper say it increases flow capacities at the sources and decreases them at the receivers? Why not increase at the receivers and decrease at the sources?
2. The scheduling cycle is five seconds, using a polling mechanism, but the paper does not explain in detail why an interval of 5 seconds was selected. Is this interval the result of experiments with different options?
Paper Title
Hedera: Dynamic Flow Scheduling for Data Center Networks
Authors
Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat.
Date
In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI '10), April 2010.
Novel Idea
Replaces static load balancing by using commodity switches and unmodified hosts to implement a scalable dynamic flow scheduling system to effectively utilize aggregate network resources.
Main Results
- A central scheduler with global knowledge of active flows outperforms hash-based ECMP load balancing.
- Simulated Annealing almost always outperforms Global First Fit and delivers near optimal bisection bandwidth.
- Global First Fit and Simulated Annealing significantly outperform static hashing (ECMP) and achieve near optimal bisection bandwidth of the network (15.4Gb/s goodput). The performance improves as the level of local communication increases.
Impact
Hedera ameliorates the issue of having a priori unknown workloads; its central scheduler delivers high bandwidth without requiring protocol changes and thereby reduces inter-rack network bottlenecks.
Question
What would be a worst-case scenario for Hedera? Maybe a situation where we are forced to change protocols at the host level, forcing the centralized scheduler to have a tainted, partial or limited view of active flows?
It seems that data centers could be susceptible to down time when upgrades are performed on them (i.e.: new protocols installed at the host level). Can we avoid this?
Paper Title
Hedera: Dynamic Flow Scheduling for Data Center Networks
Authors
Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang and Amin Vahdat
Date
NSDI'10
Novel Idea
This paper presents a dynamic flow scheduling system called Hedera. The main idea is to use a global view of routing and traffic matrix to estimate the bandwidth share for each flow, and then compute and allocate paths for flows. The authors propose two scheduling algorithms: Global First Fit and Simulated Annealing, and a demand estimation scheme that can estimate the ideal bandwidth share for each flow.
Main Result
Their results show that Simulated Annealing outperforms Global First Fit in most cases, and both of them outperform the hash-based scheduling scheme. They also find that the performance gain depends on the characteristics of the flows, such as rate and duration. In the simulation results, the authors demonstrate that both demand estimation and Simulated Annealing are efficient enough on large networks with many flows (although my first impression was that they would be pretty slow).
Evidence
Hedera is tested on a physical fat-tree OpenFlow network with 16 hosts. They also ran simulations to test the algorithms' scalability on large networks with many flows.
Prior Work
Equal-Cost Multi-Path forwarding is a similar scheme that also takes advantage of multiple paths in data centers. The main difference is that it uses hashing to spread flows across paths, while Hedera first estimates each flow's demand and then allocates paths based on that.
Reproducibility
The results in this paper should be easy to reproduce. All algorithms are well described with formal definitions, and they should be easy to implement on OpenFlow controllers.
Title: Hedera: Dynamic Flow Scheduling for Data Center Networks
Authors: Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat
Novel Idea: This paper proposes Hedera, a dynamic flow scheduling system for adaptive scheduling in the multi-stage switch topologies found in data centers. Flow information is collected from switches, non-conflicting paths are computed for flows, and traffic is rerouted accordingly, with the goal of maximizing aggregate network utilization at minimal scheduler overhead.
Main Results: Hedera was fully implemented on the PortLand testbed and was found to deliver 96% of optimal bisection bandwidth and up to 113% better than static load-balancing methods. The central scheduler with global knowledge of active flows outperforms the ECMP load-balancing, but the performance gains depend on the rates and duration of the flows in the network (larger gains were achieved in the case of many large data transfers).
Impact: The work of this paper has a significant impact on the performance scalability of data centers. Existing data centers use static single-path forwarding which often results in collisions in switch buffers and inefficient switch utilization. Hedera offers a solution to this problem and manages to outperform static load-balancing methods.
Prior Work: Hedera complements ECMP [21] by supplementing default ECMP behavior for communication patterns that cause problems to ECMP. Also, Hedera augments the PortLand routing and fault tolerance protocols [29]. Using the standard PortLand mechanisms, the Hedera scheduler is aware of failures and can reroute flows that are mapped to failed components. Furthermore, the implementation and testing of the scheduling algorithms was done using the fat-tree network that was described in a previous work by the same authors [3].
Competitive Work: VL2 [16] and Monsoon [17] proposed using Valiant Load Balancing, but this technique can cause bandwidth losses due to long-term collisions, as was shown in this paper. SEATTLE [25], DCell [19] and BCube [18] were other related works on data center networks, but did not address multi-pathing. Although TeXCP [23] and MATE [10] explored multi-pathing environments, they differ from this work as they used distributed traffic engineering, while here the design is based on a central scheduler. Other works on centralized router control were namely Ethane [6], RCP [5], the 4D architecture [15] and Tesseract [35]. Finally, there have been works on virtual switching fabrics and Clos networks, but they did not employ commodity switches to build a multi-level switch architecture ([33], [32], [30]).
Evidence: The authors generated data center traffic patterns that stress and saturate the network in order to compare Hedera's performance to current hash-based multipath forwarding schemes. They reported bisection bandwidth losses for different numbers of TCP flows per host on a k=48 fat tree using ECMP, and found that hash collisions can reduce the network's bisection bandwidth by 60.8% on average if each host transfers an equal amount of data to all remote hosts one at a time, while the respective reduction in the case of parallel communication is only 2.5%. They presented an example of TCP bandwidth demand estimation in a hypothetical equilibrium state of the network, for 4 hosts connected by a non-blocking topology. The Global First Fit and Simulated Annealing algorithms were then analyzed and their time and space complexities reported. Hedera was later evaluated through simulations using the communication patterns created (Stride, Staggered Prob and Random), and results on average bisection bandwidth were reported for each pattern and each algorithm (ECMP, Global First Fit, Simulated Annealing, and use of a non-blocking switch). Global First Fit and Simulated Annealing were found to outperform static hashing (ECMP) and achieve near-optimal bisection bandwidth (15.4Gb/s goodput). The authors also performed an all-to-all in-memory data shuffle using their testbed, an expensive but necessary operation for many MapReduce/Hadoop workloads, and found that centralized flow scheduling performs much better than static ECMP hash-based routing (39% better bisection bandwidth). They further reported the aggregate bisection bandwidth achieved when running the benchmark suite on a simulated fat-tree network with 8,192 hosts for all the simulated algorithms.
Finally, they compared their algorithms against a hypothetical non-blocking switch for the entire data center and against static ECMP hashing, and found that their dynamic placement algorithms significantly outperform static ECMP hashing (Global First Fit and Simulated Annealing performed close to optimal). They further presented a complexity analysis of demand estimation and Simulated Annealing, and also discussed the total control overhead of the centralized scheduling design.
Criticism: Overall, the authors performed a considerable amount of experimentation and supported their claims with a wide variety of results. They didn't limit their results to average bisection bandwidth alone, but also investigated other aspects of the proposed solution, such as its complexity and overhead, as well as how traffic variations and different algorithms can yield significantly different results. They found that Hedera can outperform state-of-the-art hash-based ECMP load balancing and demonstrated how its performance depends on the rates and durations of the flows in the network. They showed that Simulated Annealing almost always outperforms Global First Fit and can achieve near-optimal bisection bandwidth for a wide range of communication patterns. Finally, the findings of this paper showed that dynamic flow scheduling can deliver considerable bandwidth gains at moderate cost. Overall, the experimental process in this paper is thorough and the main findings are significant.
Question: In the paper, the authors assume a threshold rate over which a flow is considered large (100Mbps, which corresponds to 10% of each host's 1GigE link). How was this threshold chosen? Should it be based on experiments on the traffic rates of their network, or is it OK for it to be chosen arbitrarily?
Hedera: Dynamic Flow Scheduling for Data Center Networks
Authors: Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat.
Date: Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI '10), 2010
Novel Idea:
This paper proposes Hedera, an alternative to the traditional ECMP flow scheduling used in data centers, one that is more adaptable to network conditions and flow sizes.
Main Results: First of all, they examined how the currently used ECMP falls short in certain scenarios, then proposed a higher-level design improvement. They then went into the details of the algorithm design, including the flow demand estimator, and later described the implementation details and benchmarking. They made use of a central controller, as in OpenFlow, which has global knowledge of active flows, and their experiments showed that this approach outperforms state-of-the-art ECMP load balancing. Because of the way Hedera is designed, its performance gains depend on the rates and durations of the flows in the network.
In their implementation of the demand estimator, they chose a sparse-matrix representation and parallelism to improve performance.
They compared two types of scheduling algorithms, Global First Fit and Simulated Annealing, and with good reasoning chose to go with Simulated Annealing.
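Of the two, Global First Fit is the simpler to picture. Here is a minimal sketch under my own naming (not the paper's code): scan a large flow's candidate paths in a fixed order and reserve the first one whose links all have enough spare capacity.

```python
def global_first_fit(flow_demand, candidate_paths, residual):
    """Greedily place one flow.
    candidate_paths: ordered list of paths, each a tuple of link ids
                     (e.g. the (k/2)^2 core paths in a fat tree).
    residual: dict mapping link id -> remaining capacity (mutated)."""
    for path in candidate_paths:
        # First fit: take the first path that can carry the full demand.
        if all(residual[link] >= flow_demand for link in path):
            for link in path:
                residual[link] -= flow_demand  # reserve capacity
            return path
    return None  # no fitting path; the flow stays on its default route
```

Note that, as the other reviews point out, this greedy placement works well when there are few large flows but does not guarantee every flow gets a dedicated path, which is what motivates the Simulated Annealing alternative.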
Evidence:
They did extensive benchmarking on both a testbed of 16 hosts and a virtualized environment, and also performed an all-to-all data shuffle on their testbed. The evidence shows that Hedera outperforms state-of-the-art ECMP.
Prior work: A lot of work has been done on load balancing, including the baseline ECMP and also VLB. This work also employs an OpenFlow central controller to control the routing entries in several switches.
Reproducibility:
Fairly high. The description of the implementation is pretty thorough and the concepts are well explained.
Hedera: Dynamic Flow Scheduling for Data Center Networks
By Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat
Novel Idea: A dynamic, scalable load balancer that efficiently balances flows across a multi-stage switching fabric. Hedera first gets flow information from switches, then computes non-conflicting paths, then routes traffic accordingly. Hedera has a much more general view of the network and can identify bottlenecks not visible to the individual switches.
Main Result: The load balancer delivers 96% of optimal bandwidth and performs up to 113% better than static load balancers. They test both Global First Fit and Simulated Annealing and compare them to ECMP. Global First Fit works well when there aren't that many flows, but it does not guarantee that every flow will be assigned a path. Simulated Annealing probabilistically searches for the most efficient placement. Simulated Annealing was around before the paper; their addition was limiting the search space by assigning a core switch to each destination host, which ensures quick convergence and a near-optimal assignment of flows.
Evidence: They use traffic patterns based on prior work to test their load balancer. The traffic consists of many small TCP communications as well as larger flows such as backups and MapReduce traffic. Simulated Annealing almost always outperforms Global First Fit and achieves near-optimal bisection bandwidth.
Prior work: VLB and ECMP.
Competitive work: VL2 and Monsoon, TeXCP and MATE.
Reproducibility: With enough time, this could be reproduced.