Hedera is a scalable, dynamic flow scheduling system that adaptively schedules a multi-stage switching fabric to efficiently utilize aggregate network resources. The core idea is to be "dynamic," in contrast to the traditional ECMP multipath routing algorithm. However, the paper states that Hedera is not a replacement for ECMP, since ECMP performs very well with large numbers of flows, for example when MapReduce is doing its shuffle. Hedera complements ECMP when a varying number of large flows is present. Since ECMP performs poorly in that situation, Hedera advocates using a dynamic, centralized view to control the allocation of flow paths in order to maximize the utilization of network bandwidth resources.
In the Background section, the paper points out ECMP's weakness in dealing with small numbers of flows and characterizes data center traffic patterns. Figure 2 demonstrates two types of ECMP collisions, in which two flows hash to the same core switch or aggregation switch. The scheduling could clearly be better, since idle paths remain unallocated. Figure 3 shows that when flows are sparsely distributed across the data center, ECMP can lose a relatively large fraction of bisection bandwidth because of hash collisions; as the number of flows grows, the loss becomes reasonable because random hashing can then truly balance the load.
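The collision behavior in Figure 2 is easy to reproduce in a toy model. Below is an illustrative sketch (not the hash any real switch uses; hardware ECMP hashes are vendor-specific): a flow's 5-tuple deterministically picks one of the equal-cost core switches, so two large flows can land on the same core link and share it while other cores sit idle.

```python
import hashlib

def ecmp_core(flow_5tuple, n_cores):
    """Deterministically map a flow's 5-tuple to one of n_cores
    equal-cost core switches (illustrative MD5-based hash)."""
    digest = hashlib.md5(repr(flow_5tuple).encode()).digest()
    return digest[0] % n_cores

# Two distinct large flows may hash to the same core switch; when they
# do, each gets roughly half the core link's bandwidth even though
# other core switches have spare capacity.
flow_a = ("10.0.0.1", 5001, "10.0.1.1", 80, "tcp")
flow_b = ("10.0.0.2", 5002, "10.0.1.2", 80, "tcp")
core_a = ecmp_core(flow_a, 4)
core_b = ecmp_core(flow_b, 4)
```

With only two flows and four cores the collision probability is 1/4, which matches the intuition from Figure 3 that hashing hurts most when a few large flows dominate.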
Hedera has a main control loop, and each iteration has three steps: flow detection, natural demand estimation, and dynamic path allocation. The loop runs every 5 seconds. The paper discusses Hedera's behavior on a fat-tree topology. The two major components are the natural demand estimator and the scheduler (path allocator).
I think the paper does not do a good job of explaining the demand estimator, which is where my major confusion lies. In Section 4.2, the paper says a lot about the demand matrix (which is really simple to understand) but little about the estimation algorithm. Although pseudocode is presented, the explanation is thin compared to the scheduler (Simulated Annealing) section.
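To make the estimator concrete, here is a simplified sketch of the idea from Section 4.2 as I understand it. It alternates a source phase (share each sender's leftover capacity equally among its unconverged flows) and a receiver phase (cap flows at oversubscribed receivers and mark them converged) until a fixed point. The equal-share cap in the receiver phase is a simplification of the paper's max-min computation; capacities are normalized to 1.0 per host link, and all names are my own.

```python
def estimate_demands(flows, n_hosts, max_iters=100):
    """flows[i][j] = number of flows from host i to host j.
    Returns estimated per-flow demands keyed by (src, dst, index)."""
    demand, converged = {}, {}
    for i in range(n_hosts):
        for j in range(n_hosts):
            for f in range(flows[i][j]):
                demand[(i, j, f)] = 0.0
                converged[(i, j, f)] = False

    for _ in range(max_iters):
        changed = False
        # Source phase: split each sender's remaining capacity equally
        # among its not-yet-converged flows.
        for i in range(n_hosts):
            fixed = sum(demand[k] for k in demand if k[0] == i and converged[k])
            free = [k for k in demand if k[0] == i and not converged[k]]
            if free:
                share = max(0.0, 1.0 - fixed) / len(free)
                for k in free:
                    if abs(demand[k] - share) > 1e-9:
                        demand[k] = share
                        changed = True
        # Receiver phase: if a receiver is oversubscribed, cap its flows
        # at an equal share and mark the capped flows converged
        # (simplification of the paper's receiver-limited computation).
        for j in range(n_hosts):
            inflow = [k for k in demand if k[1] == j]
            if sum(demand[k] for k in inflow) > 1.0 + 1e-9:
                share = 1.0 / len(inflow)
                for k in inflow:
                    if demand[k] > share:
                        demand[k] = share
                        converged[k] = True
                        changed = True
        if not changed:
            break
    return demand
```

For example, with two flows from host 0 to host 1 and one flow from host 2 to host 1, the receiver link is the bottleneck and each flow converges to a demand of 1/3.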
For the scheduling (placement) module, Global First Fit is straightforward, and its time complexity depends on the number of possible paths, which for a fat tree is (k/2)^2. The Simulated Annealing method is very interesting: it brings this classic technique into the data center networking field, which is the first time I have seen it in this area (previously I had only seen it in computer-vision-related work). The idea is also easy to understand. It first chooses an initial state, then generates neighbors and computes their energy. There are three different types of neighbor generators, selected at random so the algorithm will not get stuck in particular situations. The algorithm moves to lower-energy neighbors, or to higher-energy neighbors with a certain probability, and this cycle continues until the temperature reaches zero. An interesting problem arises when the flows change, since that yields different initial states, so this should be considered in the SA method. The time complexity of SA is related to the average number of large flows per host, because the algorithm operates on the assignment of core switches, which is a one-to-one mapping to destination hosts.
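The loop above can be sketched as a generic annealer over core-switch assignments. This is a minimal illustration, not the paper's implementation: the state maps each destination host to a core switch, `energy` is a caller-supplied placeholder for the paper's cost (demand exceeding link capacities), and I use a single neighbor generator where the paper randomly picks among three.

```python
import math
import random

def simulated_annealing(hosts, cores, energy, temp=1000.0, cooling=0.95,
                        min_temp=1e-3, rng=random):
    """Anneal an assignment of destination hosts to core switches.
    energy(state) should return a cost to minimize (lower is better)."""
    # Initial state: each destination host gets a random core switch.
    state = {h: rng.choice(cores) for h in hosts}
    e = energy(state)
    best, best_e = dict(state), e
    while temp > min_temp:
        # Neighbor move: reassign one randomly chosen host's core.
        h = rng.choice(hosts)
        old = state[h]
        state[h] = rng.choice(cores)
        ne = energy(state)
        # Accept downhill moves always; accept uphill moves with
        # probability exp(-(ne - e) / temp) to escape local minima.
        if ne <= e or rng.random() < math.exp(-(ne - e) / temp):
            e = ne
            if e < best_e:
                best, best_e = dict(state), e
        else:
            state[h] = old  # reject: revert the move
        temp *= cooling  # geometric cooling schedule
    return best, best_e
```

A toy energy function, such as the number of hosts colliding on the same core switch, is enough to watch the annealer converge toward a collision-free assignment.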
The Implementation section describes how the testbed was built. In order to simulate data center conditions, they also built a simulator that models TCP flow behavior in discrete time ticks. Interestingly, the behaviors of AIMD, slow start, and TCP Reno are all considered.
The resulting graphs in Section 6 clearly demonstrate the comparison between Hedera and ECMP. The major metric is average bisection bandwidth. In all communication patterns, performance ranks as SA > GFF > ECMP. Even in the data shuffle scenario, where ECMP cannot be blamed for its performance, Hedera also does well. Since the system is built on OpenFlow, the SDN control loop overhead is measured, and the paper argues that the control loop will not consume much time: the major delay occurs in the controller, at around 100 ms, far shorter than the 5-second scheduling period.
Overall, this is a very good paper on network resource allocation whose ideas could be borrowed. The highlights are the SA algorithm, the dynamic view, and the combination of SDN and the demand estimator. If you are doing research in the network resource sharing area, this paper is definitely a must-read!
Questions
1. Why is the demand estimator split into two parts (source and destination), and why does the paper say it increases flow capacities at the sources and decreases them at the receivers? Why not increase at the receivers and decrease at the sources?
2. The scheduling cycle is five seconds, using a polling mechanism, but the paper does not explain in detail why an interval of 5 seconds was selected. Is this interval the result of experiments with different options?
Paper Title
Hedera: Dynamic Flow Scheduling for Data Center Networks
Authors
Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat.
Date
In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI '10), April 2010.
Novel Idea
Replaces static load balancing by using commodity switches and unmodified hosts to implement a scalable dynamic flow scheduling system to effectively utilize aggregate network resources.
Main Results
- A central scheduler with global knowledge of active flows outperforms hash-based ECMP load balancing.
- Simulated Annealing almost always outperforms Global First Fit and delivers near optimal bisection bandwidth.
- Global First Fit and Simulated Annealing significantly outperform static hashing (ECMP) and achieve near optimal bisection bandwidth of the network (15.4Gb/s goodput). The performance improves as the level of local communication increases.
Impact
Hedera ameliorates the issue of having a priori unknown workloads; its central scheduler delivers high bandwidth without requiring protocol changes and thereby reduces inter-rack network bottlenecks.
Question
What would be a worst-case scenario for Hedera? Maybe a situation where we are forced to change protocols at the host level, forcing the centralized scheduler to have a tainted, partial or limited view of active flows?
It seems that data centers could be susceptible to down time when upgrades are performed on them (i.e.: new protocols installed at the host level). Can we avoid this?
Paper Title
Hedera: Dynamic Flow Scheduling for Data Center Networks
Authors
Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang and Amin Vahdat
Date
NSDI'10
Novel Idea
This paper presents a dynamic flow scheduling system called Hedera. The main idea is to use a global view of routing and traffic matrix to estimate the bandwidth share for each flow, and then compute and allocate paths for flows. The authors propose two scheduling algorithms: Global First Fit and Simulated Annealing, and a demand estimation scheme that can estimate the ideal bandwidth share for each flow.
Main Result
Their results show that Simulated Annealing outperforms Global First Fit in most cases, and both of them outperform the hash-based scheduling scheme. They also find that the performance gain depends on the characteristics of the flows, such as rate and duration. In the simulation results, the authors demonstrate that both demand estimation and Simulated Annealing are efficient enough on large networks with many flows (although my first impression was that they would be pretty slow).
Evidence
Hedera is tested on a physical fat-tree OpenFlow network with 16 hosts. They also ran simulations to test the algorithms' scalability on large networks with many flows.
Prior Work
Equal-Cost Multi-Path forwarding is a similar scheme that also takes advantage of multiple paths in data centers. The main difference is that it uses hashing to spread flows across paths, while Hedera first estimates each flow's demand and then allocates paths based on that.
Reproducibility
The results in this paper should be easy to reproduce. All algorithms are well described with formal definitions, and they should be easy to implement on OpenFlow controllers.
Title: Hedera: Dynamic Flow Scheduling for Data Center Networks
Authors: Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat
Novel Idea: This paper proposes Hedera, a dynamic flow scheduling system for adaptive scheduling in the multi-stage switch topologies found in data centers. Flow information is collected from switches, non-conflicting paths are computed for flows, and traffic is rerouted accordingly, with the goal of maximizing aggregate network utilization at minimal scheduler overhead.
Main Results: Hedera was fully implemented on the PortLand testbed and was found to deliver 96% of optimal bisection bandwidth and up to 113% better than static load-balancing methods. The central scheduler with global knowledge of active flows outperforms the ECMP load-balancing, but the performance gains depend on the rates and duration of the flows in the network (larger gains were achieved in the case of many large data transfers).
Impact: The work of this paper has a significant impact on the performance scalability of data centers. Existing data centers use static single-path forwarding which often results in collisions in switch buffers and inefficient switch utilization. Hedera offers a solution to this problem and manages to outperform static load-balancing methods.
Prior Work: Hedera complements ECMP [21] by supplementing default ECMP behavior for communication patterns that cause problems to ECMP. Also, Hedera augments the PortLand routing and fault tolerance protocols [29]. Using the standard PortLand mechanisms, the Hedera scheduler is aware of failures and can reroute flows that are mapped to failed components. Furthermore, the implementation and testing of the scheduling algorithms was done using the fat-tree network that was described in a previous work by the same authors [3].
Competitive Work: VL2 [16] and Monsoon [17] proposed using Valiant Load Balancing, but this technique can cause bandwidth losses due to long-term collisions, as was shown in this paper. SEATTLE [25], DCell [19] and BCube [18] were other related works on data center networks, but did not address multi-pathing. Although TeXCP [23] and MATE [10] explored multi-pathing environments, they differ from this work as they used distributed traffic engineering, while here the design is based on a central scheduler. Other works on centralized router control were namely Ethane [6], RCP [5], the 4D architecture [15] and Tesseract [35]. Finally, there have been works on virtual switching fabrics and Clos networks, but they did not employ commodity switches to build a multi-level switch architecture ([33], [32], [30]).
Evidence: The authors generated data center traffic patterns that stress and saturate the network in order to compare Hedera's performance to current hash-based multipath forwarding schemes. They reported bisection bandwidth losses for different numbers of TCP flows per host on a k=48 fat tree using ECMP, and found that hash collisions can reduce the network's bisection bandwidth by 60.8% on average if each host transfers an equal amount of data to all remote hosts one at a time, while the respective reduction in the case of parallel communication is only 2.5%. They presented an example of TCP bandwidth demand estimation in a hypothetical equilibrium state of the network, for 4 hosts connected by a non-blocking topology. The Global First Fit and Simulated Annealing algorithms were then analyzed and their time and space complexities reported. Hedera was later evaluated through simulations using the communication patterns created (Stride, Staggered Prob and Random), and results on average bisection bandwidth were reported for each pattern and each algorithm (ECMP, Global First Fit, Simulated Annealing, and use of a non-blocking switch). Global First Fit and Simulated Annealing were found to outperform static hashing (ECMP) and achieve near-optimal bisection bandwidth (15.4Gb/s goodput). The authors also performed an all-to-all in-memory data shuffle using their testbed, an expensive but necessary operation for many MapReduce/Hadoop workloads, and found that centralized flow scheduling performs much better than static ECMP hash-based routing (39% better bisection bandwidth). They further reported the aggregate bisection bandwidth achieved when running the benchmark suite on a simulated fat-tree network with 8,192 hosts for all the simulated algorithms.
Finally, they compared their algorithms against a hypothetical non-blocking switch for the entire data center and against static ECMP hashing, and found that their dynamic placement algorithms significantly outperform static ECMP hashing (Global First Fit and Simulated Annealing performed close to optimal). They further presented a complexity analysis of demand estimation and Simulated Annealing, and also discussed the total control overhead of the centralized scheduling design.
Criticism: Overall, the authors performed a considerable amount of experimentation and supported their claims with a wide variety of results. They didn't limit their results to average bisection bandwidth alone, but also investigated other aspects of the proposed solution, such as its complexity and overhead, as well as how traffic variations and different algorithms can yield significantly different results. They found that Hedera can outperform state-of-the-art hash-based ECMP load balancing and demonstrated how its performance depends on the rates and durations of the flows in the network. They showed that Simulated Annealing almost always outperforms Global First Fit and can achieve near-optimal bisection bandwidth for a wide range of communication patterns. Finally, the findings of this paper showed that dynamic flow scheduling can deliver considerable bandwidth gains at moderate cost. Overall, the experimental process in this paper is thorough and the main findings are significant.
Question: In the paper, the authors assume a threshold rate over which a flow is considered large (100Mbps, which corresponds to 10% of each host's 1GigE link). How was this threshold chosen? Should it be based on experiments on the traffic rates of their network, or is it OK for it to be chosen arbitrarily?
Hedera: Dynamic Flow Scheduling for Data Center Networks
Authors: Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat.
Date: Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI '10), 2010
Novel Idea:
This paper proposes Hedera, an alternative to the traditional ECMP flow scheduling used in data centers, one that is more adaptable to network conditions and flow sizes.
Main Results: First of all, they examined how the currently used ECMP falls short in certain scenarios, then proposed a higher-level design improvement. They then went into the details of the algorithm design, including the flow demand estimator, and later described the implementation details and benchmarking. They made use of a central controller, as in OpenFlow, which has global knowledge of active flows, and their experiments showed that this approach outperforms state-of-the-art ECMP load balancing. Because of the way Hedera is designed, its performance gains depend on the rates and durations of the flows in the network.
In their implementation of the demand estimator, they chose a sparse-matrix representation and parallelism to improve performance.
They compared two types of scheduling algorithms, Global First Fit and Simulated Annealing, and with good reasoning chose to go with Simulated Annealing.
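Of the two, Global First Fit is the simpler to picture. Here is a minimal sketch under my own naming (not the paper's code): scan a large flow's candidate paths in a fixed order and reserve the first one whose links all have enough spare capacity.

```python
def global_first_fit(flow_demand, candidate_paths, residual):
    """Greedily place one flow.
    candidate_paths: ordered list of paths, each a tuple of link ids
                     (e.g. the (k/2)^2 core paths in a fat tree).
    residual: dict mapping link id -> remaining capacity (mutated)."""
    for path in candidate_paths:
        # First fit: take the first path that can carry the full demand.
        if all(residual[link] >= flow_demand for link in path):
            for link in path:
                residual[link] -= flow_demand  # reserve capacity
            return path
    return None  # no fitting path; the flow stays on its default route
```

Note that, as the other reviews point out, this greedy placement works well when there are few large flows but does not guarantee every flow gets a dedicated path, which is what motivates the Simulated Annealing alternative.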
Evidence:
They did extensive benchmarking on both a testbed of 16 hosts and a virtualized environment, and also performed an all-to-all data shuffle on their testbed. The evidence shows that Hedera outperforms state-of-the-art ECMP.
Prior work: A lot of work has been done on load balancing, including the baseline ECMP and also VLB. This work also employs an OpenFlow central controller to control the routing entries in several switches.
Reproducibility:
Fairly high. The description of the implementation is pretty thorough and the concepts are well explained.
Hedera: Dynamic Flow Scheduling for Data Center Networks
By Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat
Novel Idea: A dynamic, scalable load balancer that efficiently balances flows across a multi-stage switching fabric. Hedera first gets flow information from switches, then computes non-conflicting paths, then routes traffic accordingly. Hedera has a much more general view of the network and can identify bottlenecks not visible to the individual switches.
Main Result: The load balancer delivers 96% of optimal bandwidth and performs up to 113% better than static load balancers. They test both Global First Fit and Simulated Annealing and compare them to ECMP. Global First Fit works well when there aren't that many flows, but it does not guarantee that every flow will be assigned a path. Simulated Annealing probabilistically searches for the most efficient placement. Simulated Annealing was around before the paper; their addition was limiting the search space by assigning a core switch to each destination host, which ensures quick convergence and a near-optimal assignment of flows.
Evidence: They use traffic patterns based on prior work to test their load balancer. The traffic consists of many small TCP communications as well as larger flows such as backups and MapReduce traffic. Simulated Annealing almost always outperforms Global First Fit and achieves near-optimal bisection bandwidth.
Prior work: VLB and ECMP.
Competitive work: VL2 and Monsoon, TeXCP and MATE.
Reproducibility: With enough time, this could be reproduced.