Is it possible to run Sparrow on other MapReduce simulators?


P Tursun

Sep 9, 2015, 10:13:27 AM
to Sparrow Users
Hi:

I was trying to run Sparrow, and I was wondering if it is possible to run it on MapReduce simulators. If yes, how significant would the changes to the code be?


Thanks so much.

Kay Ousterhout

Sep 9, 2015, 6:26:44 PM
to sparrow-sch...@googlegroups.com, P Tursun
Hi,

Sparrow was written to work with Apache Spark, so integrating it with MapReduce would likely be a significant amount of work. I don't know much about MapReduce simulators, but it's possible that making it work with a simulator (rather than with a running MapReduce cluster) would be less work. If you're just interested in simulating Sparrow's performance for a particular workload, we have simulators that model various scenarios available here: https://github.com/radlab/sparrow/tree/master/simulation.

-Kay


Lou

Sep 11, 2015, 5:55:04 PM
to Sparrow Users, pazi...@gmail.com

Hi all,

Here are my two cents on the topic; I hope (some of) it makes sense.

First, assume that the MapReduce simulator in question is SLS (the YARN Scheduler Load Simulator). To add Sparrow schedulers to such a simulator, the high-level approach would be to replace YARN's resource manager with Sparrow's scheduler nodes, since the two work very differently. One thing in your favor: from a scheduling viewpoint, YARN is essentially a monolithic scheduler, so it is comparatively easy to swap its scheduler out for Sparrow schedulers. Beyond that, the low-level changes include, but are not limited to:

1) supporting the communication protocol between Sparrow node monitors and YARN containers (running on various nodes), or between a YARN NodeManager and Sparrow schedulers; note that Apache Thrift is a fine choice for a proof of concept, though a better choice can always be made later (see the sketch after this list);

2) if you are interested in evaluating against realistic workloads, designing and building a way to generate task DAGs with different dependency constraints;

3) reading Sparrow's source carefully up front, since some of the coding in it is nontrivial and will otherwise wear you down along the way without your noticing;

4) reconciling the resource models, since YARN is container-oriented while Sparrow/Spark target physical or virtual machines; and

5) whatever else has not been mentioned above.
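
To make point 1 concrete, here is a minimal sketch of what the RPC surface between a Sparrow-style scheduler and a per-node monitor might look like, in Java. This is illustrative only, not Sparrow's actual Thrift-generated API; every name in it (TaskSpec, NodeMonitorService, SchedulerService, enqueueReservation, getTask) is made up for this example.

// Hypothetical RPC surface between a Sparrow-style scheduler and a
// per-node monitor. Names and signatures are illustrative; the real
// Sparrow codebase defines its services with Apache Thrift.

// Description of one task; in a real SLS port this would carry the
// YARN container launch context instead.
class TaskSpec {
    final String jobId;
    final String taskId;
    TaskSpec(String jobId, String taskId) {
        this.jobId = jobId;
        this.taskId = taskId;
    }
}

// Implemented by each node monitor; called by schedulers.
interface NodeMonitorService {
    // Place a "virtual reservation" for one task of a job in this
    // node's queue (the push-based probe).
    void enqueueReservation(String schedulerAddress, String jobId);

    // Current queue length, if a probe wants load information.
    int getQueueLength();
}

// Implemented by each scheduler; called back by node monitors.
interface SchedulerService {
    // Invoked when a reservation reaches the front of a node's queue.
    // The scheduler answers with a concrete task, or null if all of the
    // job's tasks were already launched elsewhere (late binding).
    TaskSpec getTask(String jobId, String nodeId);
}

In an SLS port, the node-monitor side would presumably be driven by the simulator's clock and its simulated NodeManagers rather than by a real Thrift transport, and that rewiring is where much of the effort would likely go.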

In addition, as shown in the Sparrow paper, it works best for short-lived jobs, e.g. query-style data processing via Shark (the predecessor of Spark SQL). When it comes to batch jobs (e.g. MapReduce or multi-stage jobs), some of its fundamentals may not be the best fit, whether in theory or in practice, e.g. batch sampling and/or the push mechanism behind its virtual reservations.
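
For readers who have not seen the mechanism being referred to: with batch sampling, a scheduler placing a job of m tasks probes d*m candidate workers at once and puts the tasks on the m least-loaded members of the probed set, instead of probing d workers per task independently. Below is a toy Java illustration of that placement rule; it is not code from Sparrow, and the worker count, queue lengths, and job size are made-up inputs (only the probe ratio d = 2 comes from the paper).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy illustration of Sparrow-style batch sampling: to place m tasks,
// probe d*m randomly chosen workers and pick the m shortest queues.
public class BatchSamplingDemo {
    public static void main(String[] args) {
        int numWorkers = 100;   // made-up cluster size
        int m = 4;              // tasks in the job (made up)
        int d = 2;              // probe ratio; the Sparrow paper uses 2
        Random rng = new Random(42);

        // Made-up per-worker queue lengths standing in for real load.
        int[] queueLength = new int[numWorkers];
        for (int w = 0; w < numWorkers; w++) {
            queueLength[w] = rng.nextInt(10);
        }

        // Probe d*m distinct workers chosen uniformly at random.
        List<Integer> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            workers.add(w);
        }
        Collections.shuffle(workers, rng);
        List<Integer> probed = workers.subList(0, d * m);

        // Place the m tasks on the m least-loaded probed workers.
        probed.sort((a, b) -> Integer.compare(queueLength[a], queueLength[b]));
        for (int i = 0; i < m; i++) {
            int w = probed.get(i);
            System.out.printf("task %d -> worker %d (queue length %d)%n",
                    i, w, queueLength[w]);
        }
    }
}

The catch for MapReduce-style work is what the paragraph above describes: a long batch task holds its slot long enough that these point-in-time queue snapshots go stale, so the sampled placement can lose much of its advantage.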

One more thing: building on the diligent efforts of the Spark development team, one may consider running Spark on YARN, in either yarn-cluster or yarn-client mode, on a cluster of one's own. For a lightweight virtualized environment, Google's Kubernetes might be of interest. Willing to add a bit of sugar on top? Then just go for Mesos, on which Sparrow schedulers may be more than happy to chip in.

Cheers,

Lou
