Paper Title |
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center |
Author(s) |
Benjamin Hindman, Andy Konwinski, Matei Zaharia, |
Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica |
Date | 2010 |
Novel Idea | As cluster computing becomes more popular, individual organizations increasingly want to run many different kinds of distributed jobs, but may only have access to a single cluster. Mesos addresses this by acting as a resource allocator for multiple distributed processing frameworks running on the same cluster. |
Main Result(s) |
To run on Mesos, a framework must be modified to use the Mesos API. Once that is done, Mesos allocates resources by sending resource offers to the frameworks. A framework either rejects an offer or accepts it and specifies a task; Mesos then instructs the node whose resources were offered to run that task for the accepting framework. Mesos distinguishes between long and short jobs, and guarantees frameworks running long jobs that it will not kill them as long as they stay within a certain resource quota. Short jobs, however, may be killed if Mesos thinks it can allocate their resources more efficiently, or if it suspects they are misrepresented long jobs. Long jobs can also be killed if they exceed their guaranteed quota. Mesos allocates resources using a scheme called Dominant Resource Fairness (DRF). DRF attempts to give each framework the same fraction of its particular dominant resource - that is, whatever resource each framework needs most of (CPU, RAM, network bandwidth, etc.). So a DRF-fair result might be to give one framework 80% of the available CPU and another 80% of the available RAM. |
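The DRF allocation described above can be sketched as a progressive-filling loop: repeatedly grant one task to the framework with the lowest dominant share. The cluster capacities and per-task demands below are illustrative numbers of my own, not figures from the paper.

```python
# Hedged sketch of Dominant Resource Fairness via progressive filling.
# All capacities and demand vectors are made-up illustrative numbers.

CAPACITY = {"cpu": 9.0, "ram": 18.0}  # total cluster resources

# Per-task demand vectors for two hypothetical frameworks.
demands = {
    "A": {"cpu": 1.0, "ram": 4.0},   # RAM-heavy: dominant resource is RAM
    "B": {"cpu": 3.0, "ram": 1.0},   # CPU-heavy: dominant resource is CPU
}

allocated = {f: {r: 0.0 for r in CAPACITY} for f in demands}

def dominant_share(framework):
    """Largest fraction of any single resource this framework holds."""
    return max(allocated[framework][r] / CAPACITY[r] for r in CAPACITY)

def fits(framework):
    """Can the cluster still accommodate one more of this framework's tasks?"""
    used = {r: sum(allocated[f][r] for f in demands) for r in CAPACITY}
    return all(used[r] + demands[framework][r] <= CAPACITY[r] for r in CAPACITY)

# Progressive filling: repeatedly give one task to the framework with the
# lowest dominant share, until no further task fits anywhere.
while True:
    runnable = [f for f in demands if fits(f)]
    if not runnable:
        break
    f = min(runnable, key=dominant_share)
    for r in CAPACITY:
        allocated[f][r] += demands[f][r]

print(allocated)
```

With these numbers the loop converges to both frameworks holding the same dominant share: A ends up with 3 CPUs and 12 GB of RAM, B with 6 CPUs and 2 GB, so each holds two-thirds of its dominant resource.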
Impact | I never have anything interesting to write in this section. |
Evidence |
They show that Mesos is able to keep cluster utilization at 100% with a random assortment of jobs on a few frameworks, but they don't test many different kinds of loads or kinds of frameworks. Instead they have a lengthy section of mathematical proofs of scheduling guarantees under a variety of conditions, though it would be nice if these were borne out in the lab. They also support the kind of specialized framework they hope to promote with Mesos by showing how their own framework, Spark, can use its specialized knowledge of certain data-intensive problems to perform jobs several orders of magnitude faster than Hadoop. |
Prior Work | They specifically mention Quincy as an inspiration, and indirectly imply that it was the source of their task-killing mechanism. |
Reproducibility | They don't really discuss the mechanism by which Mesos keeps nodes decoupled from the frameworks enough to control where jobs run, yet connected enough that the master doesn't become a bottleneck. It seems like there are some nontrivial issues here that would make reproducing this quite a job. |
Question | How does DRF avoid unnecessarily limiting access to uncontested resources? If we add a third framework to the example that needs a 1-gigabit Ethernet connection per job but barely any RAM and only a tiny fraction of a CPU, won't DRF cap it at 80% of the available network capacity, even though no one else is competing for it? |
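One way to probe this question is to simulate DRF's progressive-filling rule with a hypothetical third, network-bound framework. The simplified model and all numbers below are my own assumptions, not the paper's; it suggests that once the contended resources are exhausted, the network-bound framework keeps receiving tasks, so uncontested capacity is not stranded.

```python
# Toy progressive-filling DRF with a hypothetical "net" resource.
# Capacities and demands are made-up numbers for illustration only.

CAPACITY = {"cpu": 10.0, "ram": 10.0, "net": 10.0}

demands = {
    "A": {"cpu": 2.0, "ram": 1.0, "net": 0.0},   # CPU-heavy
    "B": {"cpu": 1.0, "ram": 2.0, "net": 0.0},   # RAM-heavy
    "C": {"cpu": 0.1, "ram": 0.1, "net": 1.0},   # network-bound
}

allocated = {f: {r: 0.0 for r in CAPACITY} for f in demands}

def dominant_share(f):
    """Largest fraction of any single resource this framework holds."""
    return max(allocated[f][r] / CAPACITY[r] for r in CAPACITY)

def fits(f):
    """Can the cluster still fit one more of this framework's tasks?"""
    used = {r: sum(allocated[g][r] for g in demands) for r in CAPACITY}
    return all(used[r] + demands[f][r] <= CAPACITY[r] for r in CAPACITY)

# Keep granting one task to the fitting framework with the lowest
# dominant share; frameworks whose tasks no longer fit drop out.
while any(fits(f) for f in demands):
    f = min((g for g in demands if fits(g)), key=dominant_share)
    for r in CAPACITY:
        allocated[f][r] += demands[f][r]

# Once A and B can no longer fit (CPU and RAM exhausted), C alone keeps
# receiving tasks until the network capacity itself runs out.
print(round(allocated["C"]["net"] / CAPACITY["net"], 2))  # → 1.0
```

Under this toy model C ends up with 100% of the network, because progressive filling only stops granting a framework tasks when its tasks no longer fit, not when its dominant share passes the others'. Whether the Mesos implementation behaves this way in practice is not something the paper's evaluation settles.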
Criticism | Is a scheduler that requires custom frameworks really practical? They have implemented Hadoop and a few others themselves, but as those frameworks evolve, these implementations won't evolve with them. Can Mesos really gain the traction it would need to be supported by every major distributed computation framework? Or can it make it easy enough to do the modifications in house that using Mesos doesn't mean forgoing most of the available frameworks? |
Ideas for further work | I wonder if it is possible to make something almost as effective that does all the resource management at the system level, without having to interface with the frameworks at all. You lose a lot of easy customizability, but it may be possible to extrapolate the necessary data by simply observing jobs created by various processes, rather than by asking them explicitly what they need, and you gain a lot in developer time and end-user usefulness if you support frameworks that have never heard of you. |
Authors: Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
Date: 2010, UC Berkeley Tech Report
Novel idea: The authors elegantly summarize the purpose of Mesos: "to enable fine-grained sharing between multiple cluster computing frameworks, while giving these frameworks enough control to achieve placement goals such as data locality."
Main result: The authors provide an implementation that delivers on the above promise by offering frameworks resources and allowing them to choose which computations to run. In practice, this model achieved near-optimal locality when sharing a cluster among diverse frameworks, in addition to offering scalability and fault-tolerance benefits.
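The offer-based control flow described above can be sketched with a toy master and framework. The class names, fields, and the `resource_offer` callback below are illustrative stand-ins of my own, not the real Mesos API.

```python
# Hedged sketch of the Mesos resource-offer model. A framework is offered
# a node's free resources and decides whether to run a task there, e.g.
# to preserve data locality. All names here are hypothetical.

class Offer:
    def __init__(self, node, resources):
        self.node = node
        self.resources = resources  # e.g. {"cpu": 4, "mem_gb": 8}

class Framework:
    """A framework decides which offers to accept, e.g. for data locality."""
    def __init__(self, name, preferred_nodes, task_demand):
        self.name = name
        self.preferred_nodes = preferred_nodes  # nodes holding its input data
        self.task_demand = task_demand

    def resource_offer(self, offer):
        # Decline offers on nodes that don't hold our input data.
        if offer.node not in self.preferred_nodes:
            return None
        # Accept only if the offered resources cover one task's demand.
        if all(offer.resources[r] >= need for r, need in self.task_demand.items()):
            return {"run_on": offer.node, "uses": self.task_demand}
        return None

def master_offer(frameworks, offer):
    """The master presents free resources to frameworks in turn; the first
    acceptance results in a task launched on the offered node."""
    for fw in frameworks:
        task = fw.resource_offer(offer)
        if task is not None:
            return fw.name, task
    return None, None  # everyone declined; the resources stay free

hadoop = Framework("hadoop", {"node1", "node2"}, {"cpu": 2, "mem_gb": 4})
mpi = Framework("mpi", {"node3"}, {"cpu": 4, "mem_gb": 2})

winner, task = master_offer([hadoop, mpi], Offer("node3", {"cpu": 4, "mem_gb": 8}))
print(winner, task)  # hadoop declines for lack of locality; mpi accepts
```

The point of the design, as the review notes, is that placement decisions stay with the frameworks: the master only needs to know what is free, not why a framework wants a particular node.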
Impact: Practical concerns often necessitate the utilization of multiple disparate systems. Being able to run these various systems on a single set of compute resources may provide significant cost savings.
Evidence: The authors ported MPI and Hadoop to Mesos, and they also built a new framework on top of Mesos called Spark. Mesos was able to keep cluster utilization at 100% without affecting the standalone running time of each framework. They also measured the data locality these frameworks achieved while running on top of Mesos. They additionally scaled Mesos up to 50,000 emulated nodes on EC2 to demonstrate its scalability.
Prior work: Mesos obviously builds on the successes of distributed compute frameworks such as Hadoop and MPI. As far as their approach is concerned, the authors cite inspiration from prior work on microkernels, exokernels, and hypervisors.
Competitive work: Mesos is similar to HPC and grid schedulers; however, it gives frameworks flexibility when it comes to storing and retrieving data on the cluster. Mesos also shares common goals with virtual machine clouds such as EC2; however, the abstraction in virtual machine clouds (the VM) is too coarse-grained to serve the needs of distributed computing frameworks, compared with the resource offers Mesos employs.
Reproducibility: Mesos is open source software, and is available via github.com/mesos. Given adequate compute resources it should be possible to reproduce the authors' results.
Praise: (Yet again.) The authors exploit operating system container technologies, specifically Linux containers and Solaris Projects, to achieve isolation. It's great to see the authors of distributed systems working with existing OS mechanisms rather than reinventing the wheel.
Ideas for further work: The authors mention that although Mesos currently isolates CPU cores and memory, they plan to extend their implementation to isolate network and IO using new features in Linux kernel 2.6.33. The same should be possible using Solaris' Project Crossbow.