How to run Spark in Sparrow mode?


Lee hu

Nov 1, 2015, 8:32:29 AM
to Sparrow Users
I am interested in trying Sparrow because it is built on a novel idea. After reading the source code (https://github.com/kayousterhout/spark/tree/sparrow), I still do not know how to run Spark in Sparrow mode.


I understand the principle of Sparrow: there is a scheduler on every node. But I cannot find any code or script that launches the SparrowExecutorBackend, so I want to know how Spark+Sparrow launches the executors.


In standalone mode, the worker receives a command containing the main class (StandaloneExecutorBackend) and launches a new process as the executor, but there seems to be no equivalent way to launch SparrowExecutorBackend. How can I launch it?


Lou

Nov 1, 2015, 3:42:52 PM
to Sparrow Users
Hi there,

See my short comments below.

> I understand the principle of Sparrow: there is a scheduler on every node. But I cannot find any code or script that launches the SparrowExecutorBackend, so I want to know how Spark+Sparrow launches the executors.

The scripts under "/sparrow/deploy/ec2" might be what you are looking for.

To my understanding, Sparrow backends act as the communication layer between frontends and Spark executors. Furthermore, the number of frontends using Shark (a predecessor of Spark SQL) equals the number of backends on each worker; this is a limitation of the current implementation.

Yes, from the viewpoint of scheduling performance for short-lived, sub-second jobs, Sparrow is genuinely innovative as a pull-based job scheduler.

Hope this helps, and good luck.

Cheers,
Lou

Lee hu

Nov 1, 2015, 8:34:13 PM
to Sparrow Users
Thank you very much! I found those scripts.

On Monday, November 2, 2015 at 4:42:52 AM UTC+8, Lou wrote:

Lee hu

Nov 1, 2015, 10:49:30 PM
to Sparrow Users
Some things are still not clear to me. As I understand it, the Spark worker can be viewed as the node monitor in Sparrow. What we need to launch is:

spark master + spark worker // via /spark/bin/start-all.sh

SparrowExecutorBackend // via /spark/run spark.scheduler.sparrow.SparrowExecutorBackend

Sparrow frontend // e.g. /shark/run shark.sparrow.SparrowTPCHRunner args1 args2


Is this right? 



On Monday, November 2, 2015 at 4:42:52 AM UTC+8, Lou wrote:

Kay Ousterhout

Nov 2, 2015, 12:06:43 AM
to sparrow-sch...@googlegroups.com
You don't need to start the Spark master / worker, but the rest is correct. When you run the SparrowTPCHRunner, it creates its own master that communicates directly with the SparrowExecutorBackends to launch jobs.

-Kay


Lee hu

Nov 2, 2015, 5:16:06 AM
to Sparrow Users
Thanks! But why can I not find any code in Spark/Shark that implements SchedulerService.Iface or calls SchedulerThrift, which would launch a Scheduler server?

I had supposed that starting the SparkExecutorBackend would start a Scheduler.

On Monday, November 2, 2015 at 1:06:43 PM UTC+8, Kay Ousterhout wrote:

Lou

Nov 2, 2015, 6:21:07 AM
to Sparrow Users
I think when you start the Sparrow master using its daemon, the per-job scheduler and the node monitor get started as well. Moreover, the order for starting Sparrow to work with Spark (0.7) is "Sparrow master -> backends -> frontends". The Sparrow backend on each machine listens for task launch requests from a Sparrow master, and it also passes messages from the Spark executor back to the Sparrow master. The Sparrow node monitor can be considered a local scheduler. A rough sketch of the startup order is below.
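
A minimal sketch of that startup order, reusing the run commands quoted elsewhere in this thread (the exact script paths and arguments on your cluster are assumptions; see the scripts under sparrow/deploy/ec2 for the real deployment logic):

# 1. Sparrow master: the daemon starts the scheduler and node monitor on each node
./sparrow/deploy/ec2/template/start_sparrow.sh

# 2. Backends: one SparrowExecutorBackend per node
/spark/run spark.scheduler.sparrow.SparrowExecutorBackend

# 3. Frontends: e.g. the Shark TPC-H runner
/shark/run shark.sparrow.SparrowTPCHRunner args1 args2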

To use Sparrow with Shark, there are a few additional programs to install, e.g. HDFS, Hive, and dbgen.

Best,
Lou

---

Kay Ousterhout

Nov 2, 2015, 1:07:21 PM
to sparrow-sch...@googlegroups.com
Sorry, you're right -- you need to start Sparrow first (this script can help with that: https://github.com/radlab/sparrow/blob/master/deploy/ec2/template/start_sparrow.sh); it serves as an intermediary between the Spark driver and the Spark executors.

-Kay


Lee hu

Nov 2, 2015, 7:16:16 PM
to Sparrow Users
Yes, now I understand how the scheduler is launched. But since the SparrowExecutorBackend object does not use any arguments, why do we provide properties like these:

/root/spark/run -Dspark.scheduler=sparrow \
  -Dspark.master.port=7077 \
  -Dspark.hostname=$HOSTNAME \
  -Dspark.serializer=spark.KryoSerializer \
  -Dspark.driver.host=$name \
  -Dspark.driver.port=60500 \
  -Dsparrow.app.name=$id \
  -Dsparrow.app.port=$port \
  -Dspark.httpBroadcast.uri=http://$ip:33624 \
  spark.scheduler.sparrow.SparrowExecutorBackend


I am especially puzzled by the master port.

When I run the above command, it always prints a warning that it cannot send messages to the BlockManagerMaster.


Thanks!


On Tuesday, November 3, 2015 at 2:07:21 AM UTC+8, Kay Ousterhout wrote:

Lee hu

Nov 2, 2015, 7:17:43 PM
to Sparrow Users
Thanks very much! Now I understand the code better.

On Monday, November 2, 2015 at 7:21:07 PM UTC+8, Lou wrote:

Lee hu

Nov 3, 2015, 11:39:53 PM
to Sparrow Users
I can run the default Spark example (grepping 'a' and 'b' from a text file) now, but I have some questions (hopefully helpful for others):

Here is how I proceed:

1. Start the SparrowDaemon: this launches a Scheduler and a NodeMonitor on every node. The Scheduler learns about all the NodeMonitors from a configuration file and listens on the default port 20503, waiting for frontends to submit jobs. The NodeMonitor listens on the default port 20501, waiting for NodeMonitorClients to register. The Scheduler communicates with the NodeMonitor over an internal port, 20502. (A quick port check is sketched after this list.)


2. Start the SparkExecutorBackend: this launches a SparrowBackend, which tries to register with the Spark driver at spark.driver.host:spark.driver.port and then registers with the NodeMonitor.


3. Start the SparrowFrontend: this can be used by any Spark/Shark application whose masterUrl looks like spa...@scheduler.hostname:20503.
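
As a sanity check after step 1, I verify that the daemon's three ports are listening. A rough sketch (assumes nc is installed and the default ports are unchanged):

# 20501 = NodeMonitor (backend-facing), 20502 = internal, 20503 = Scheduler (frontend-facing)
for port in 20501 20502 20503; do
  nc -z localhost $port && echo "port $port is listening"
done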


My questions are:

1. In step 2, the SparkExecutorBackend needs to register with the driver before registering with the NodeMonitor, but the driver only starts in step 3, so there is a risk that when the SparrowFrontend launches a task, the SparrowBackend has not yet registered with the NodeMonitor. I use a Thread.sleep before launching the job, so that by the time the frontend submits the job the backend has registered with the driver and the NodeMonitor, but this is not practical in a real deployment; see the sketch below.
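
What I would rather do than a fixed sleep is poll before submitting. A rough sketch ($port is the value passed via -Dsparrow.app.port to the backend; checking it only shows that the backend process is up, not that registration has completed, so this is just a heuristic):

# wait until the backend accepts connections on its app port, then submit
until nc -z localhost $port; do
  sleep 1
done
/shark/run shark.sparrow.SparrowTPCHRunner args1 args2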


2. If multiple users launch jobs in parallel, how does one SparrowBackend deal with multiple Spark drivers? Or do we just launch more SparrowBackends on one node? The paper says there is just one SparrowBackend on every node.
I can understand the default Sparrow examples, because their backends never need to communicate with one single component like the driver in Spark.


I may be misunderstanding something; any help is appreciated. Thanks!

On Monday, November 2, 2015 at 7:21:07 PM UTC+8, Lou wrote:

Lou

Nov 4, 2015, 6:08:32 PM
to Sparrow Users
Hi there,

See my short comments below (after giving some interviews and reading an awesome scientific paper on a long day).

According to my understanding, in step 2, when starting a backend you need to "register" it with a particular frontend by using the frontend's hostname and application name. Although you may see some warning messages in the window between starting a backend and a frontend, that is okay.


> 2. If multiple users launch jobs in parallel, how does one SparrowBackend deal with multiple Spark drivers? Or do we just launch more SparrowBackends on one node? The paper says there is just one SparrowBackend on every node. I can understand the default Sparrow examples, because their backends never need to communicate with one single component like the driver in Spark.

In general, when you use the Spark standalone cluster manager, only one executor is launched per machine/worker by default. In Sparrow, if there are multiple Spark drivers (e.g. Shark) submitting jobs, then the number of executors on each worker has to equal the number of frontends. In addition, as mentioned in the Sparrow paper, one may consider allocating each such frontend to a single machine that also runs a Sparrow scheduler, in order to minimize the scheduling latency of submitting query jobs.

Last but not least, it is not always easy to play around in a world designed by others, but it is great that we have at least given it a try. It also seems to me that much of the later work that uses Sparrow as a reference framework for comparison does not report any of these practical issues. What on earth...

Best regards,
Lou

lihu

Nov 4, 2015, 7:30:56 PM
to Sparrow Users
Your reply is very helpful to me, thanks very much!

I have read almost all of the code in Sparrow and found that it is not quite the same as the paper, so I got confused. I also noticed some other works based on Sparrow that obtained good results, so I thought maybe my understanding was wrong? But the code is just there.
In fact, I have read the Spark code a lot and understand its design well. A single executor per node is simply not compatible with many drivers running in parallel unless something new is changed. I thought I might have missed something, just as I first missed starting the SparrowDaemon.

Your answer verifies that my understanding is not wrong. Thanks again!

On Thu, Nov 5, 2015 at 7:08 AM, Lou <ylu...@gmail.com> wrote:
--
Best Wishes!

Li Hu(李浒) | Graduate Student

Institute for Interdisciplinary Information Sciences(IIIS)
Tsinghua University, China

Kay Ousterhout

Nov 4, 2015, 7:46:23 PM
to sparrow-sch...@googlegroups.com
You need to start one SparkExecutorBackend for each Spark driver (this is because the Spark driver manages the metadata for that backend, e.g., which RDDs are stored on that ExecutorBackend). But all of the SparkExecutorBackends on one machine communicate with the same SparrowDaemon, which decides how many tasks each can be running at a given time.

This is indeed a practical issue with Sparrow  -- Sparrow distributes the scheduling state for the Spark driver, but not the rest of the state (e.g., which blocks are stored where), so each SparkExecutorBackend still needs to communicate with a single driver.
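
Concretely, on each machine you might do something like the following (a sketch only: the host names, ports, and app-name scheme are made up, and the -D properties follow the run command quoted earlier in the thread):

# launch one SparkExecutorBackend per Spark driver; all of them talk to
# the single SparrowDaemon already running on this node
i=0
for driver in driver-host-1:60500 driver-host-2:60500; do
  host=${driver%%:*}
  port=${driver##*:}
  /spark/run -Dspark.scheduler=sparrow \
    -Dspark.driver.host=$host -Dspark.driver.port=$port \
    -Dsparrow.app.name=app$i -Dsparrow.app.port=$((20101 + i)) \
    spark.scheduler.sparrow.SparrowExecutorBackend &
  i=$((i + 1))
done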


Lee hu

Nov 4, 2015, 7:57:03 PM
to Sparrow Users
Thanks for your reply.

I understand this now. Reading the Sparrow code gave me a better understanding of both scheduling and Spark; thanks for your work!
I may try to solve the problem of sharing the block store information if I need to.


On Thursday, November 5, 2015 at 8:46:23 AM UTC+8, Kay Ousterhout wrote:

Lou

Nov 5, 2015, 7:41:52 AM
to Sparrow Users
Thanks for the discussion and explanations, which helped me understand both Sparrow and Spark better. 

More interestingly, @Lee, keep us posted on your new findings and potential contributions to Sparrow, if you are interested.

Cheers,
Lou

Lee hu

Nov 5, 2015, 11:03:41 PM
to Sparrow Users
Hmm, I will post new findings here if I find any :-D

On Thursday, November 5, 2015 at 8:41:52 PM UTC+8, Lou wrote: