Spark-Sparrow Cluster Setup


gkumar7

Apr 4, 2016, 12:16:43 AM
to Sparrow Users
I would like to run various Spark workloads using the Sparrow distributed scheduling model and compare them with the traditional Spark approach. In the Sparrow model, it seems that each frontend acts as a scheduler, while the backends can be considered the workers.

Therefore, for example, to compare against a Spark cluster of 1 master and 8 slaves, an "analogous" setup with Sparrow would consist of 8 frontends and 8 backends. Would this be correct?

Thank you for the assistance.

Lou

Apr 4, 2016, 4:06:40 PM
to Sparrow Users

Hi there, 


Just two short notes here:


1. For a comparison between Sparrow and Spark, my (educated) guess would be that the cluster should have more than a hundred nodes, or even far more; otherwise the decentralized approach won't get a chance to show its performance advantage. Moreover, I don't think the Sparrow setup you described would give a fair comparison against Spark, e.g. because too many schedulers are involved.


2. Essentially, what (else) distinguishes the Spark cluster scheduler from Sparrow is the scheduling mechanism each of them adopts. Sparrow's combination of virtual reservations and batch sampling gives very good scheduling performance when only a single resource type is considered (a small sketch follows below).
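To make the batch-sampling idea concrete, here is a minimal standalone sketch (my own illustration, not code from the Sparrow repo; the worker names and queue lengths are made up): for a job of m tasks it probes d*m randomly chosen workers and places the tasks on the m least-loaded probed workers. The late-binding / virtual-reservation step is omitted for brevity.

import scala.util.Random

object BatchSamplingSketch {
  // For a job of numTasks tasks, probe probeRatio * numTasks random workers
  // and place the tasks on the probed workers with the shortest queues.
  def scheduleJob(numTasks: Int,
                  queueLengths: Map[String, Int],
                  probeRatio: Int = 2): Seq[String] = {
    val probed = Random.shuffle(queueLengths.keys.toSeq).take(probeRatio * numTasks)
    probed.sortBy(queueLengths).take(numTasks)   // least-loaded probed workers
  }

  def main(args: Array[String]): Unit = {
    // Made-up cluster state: 16 workers with random queue lengths.
    val queues = (1 to 16).map(i => s"worker-$i" -> Random.nextInt(5)).toMap
    println(scheduleJob(numTasks = 4, queues))
  }
}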


Btw, my understanding of the Sparrow backend is that it acts as the communication layer between the Sparrow node monitor and the Spark executors, and there might also be a correlation between the number of Spark executors and the number of Sparrow frontends (i.e. Shark), at least in the current GitHub version, which has sat untouched for two-plus years.


Best,

Lou

Kay Ousterhout

Apr 4, 2016, 4:10:35 PM
to sparrow-sch...@googlegroups.com
There's no single "analogous" setup with Sparrow. With Sparrow, performance will be the same if all jobs are scheduled from a single frontend as it would be if each job were scheduled using its own frontend.  Each Sparrow frontend corresponds to a Spark driver, so all jobs that need to share data should be scheduled from the same frontend.
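To make that concrete, here is a minimal sketch of a single driver program (assuming a working Spark-on-Sparrow deployment; the application name and input path are made up). Both jobs below read the same cached RDD, and only this driver knows where its partitions live, so both jobs have to be scheduled from this one frontend:

import org.apache.spark.{SparkConf, SparkContext}

object SharedDataFrontend {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-data-frontend"))

    // Cached partitions are tracked by this driver only, so every job that
    // reads `shared` is scheduled from this driver, i.e. this Sparrow frontend.
    val shared = sc.textFile("hdfs:///tmp/input.txt").cache()   // made-up path

    val errorCount = shared.filter(_.contains("ERROR")).count() // job 1
    val totalChars = shared.map(_.length.toLong).reduce(_ + _)  // job 2

    println(s"errors=$errorCount chars=$totalChars")
    sc.stop()
  }
}

Independent applications, each with its own driver like the one above, are what can use separate Sparrow frontends.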

-Kay


Geet Kumar

Apr 4, 2016, 4:32:20 PM
to sparrow-sch...@googlegroups.com

Ah okay, the reason I was not certain is that I was under the impression that the number of frontends corresponds to the number of Sparrow schedulers and the number of backends corresponds to the number of workers (according to figure 6 in the paper).

But since a single job should use a single frontend, is there a way to set the number of schedulers being used?

Or are all tasks of a given job being distributed to workers by a single scheduler?


Kay Ousterhout

Apr 4, 2016, 4:35:12 PM
to sparrow-sch...@googlegroups.com
You're correct that the number of frontends corresponds to the number of Sparrow schedulers and the number of backends corresponds to the number of workers.  If you're only running one job, there won't be any benefits of Sparrow, because Sparrow allows *different* jobs to be scheduled by different frontends (but each job needs to be scheduled by a single frontend).

Geet Kumar

Apr 4, 2016, 4:39:54 PM
to sparrow-sch...@googlegroups.com

Sounds good, thank you for the clarification!

Henrique Grando

Jul 5, 2016, 10:47:02 PM
to Sparrow Users
Hi Kay,

Could you just clarify some ideas for me? When you said "If you're only running one job, there won't be any benefits of Sparrow, because Sparrow allows *different* jobs to be scheduled by different frontends", by job did you mean the set of tasks in a Spark stage? That is, would the improvement from distributed scheduling in Sparrow only occur when you have jobs (stages) that can be executed in parallel within an application? If that is correct, is it safe to assume that such a scenario would only occur when input data (from HDFS or any other source) is being fetched into more than one initial RDD? With all of that said, is there any way we can make simple applications like Grep and WordCount take advantage of the distributed scheduling on Sparrow?

Thank you so much for your attention,

Henrique



Kay Ousterhout

Jul 11, 2016, 12:08:26 AM
to sparrow-sch...@googlegroups.com
Hi Henrique,

Right now, simple applications like Grep and WordCount cannot take advantage of the distributed scheduling on Sparrow.  Spark is currently architected such that metadata about where each partition of each RDD is stored is maintained on a single, centralized Spark driver.  The Spark driver also runs the scheduler (e.g., Sparrow).  Sparrow can only be used when you have multiple Spark drivers (in which case each driver can use its own Sparrow scheduler).

This is not a fundamental limitation of Sparrow; it's a limitation of the current Spark architecture.  If you wanted to distribute scheduling for the stages in a single job, or for tasks within a single stage, you'd need to decouple the part of the driver that handles scheduling (so it could be distributed) from the part of the driver that keeps track of where RDDs are stored.
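As a purely illustrative sketch of that decoupling (these trait and class names are hypothetical, not Spark internals): the location-tracking service stays in one place while many task placers, one per Sparrow frontend, consult it.

// Hypothetical interfaces, for illustration only -- not Spark's actual API.
trait BlockLocationService {
  // Hosts that currently hold a given RDD partition.
  def locationsOf(rddId: Int, partition: Int): Seq[String]
}

trait TaskPlacer {
  // Choose a host for a task, preferring hosts that already hold its data.
  def place(taskId: Long, preferredHosts: Seq[String]): String
}

// Today both roles live in the single Spark driver. If they were decoupled,
// many TaskPlacer instances (one per Sparrow frontend) could share one
// BlockLocationService and schedule tasks of the same job in parallel.
class DistributedStageScheduler(locations: BlockLocationService,
                                placer: TaskPlacer) {
  def scheduleTask(taskId: Long, rddId: Int, partition: Int): String =
    placer.place(taskId, locations.locationsOf(rddId, partition))
}

object DecouplingDemo {
  def main(args: Array[String]): Unit = {
    // Stub implementations, just so the sketch runs end to end.
    val locations = new BlockLocationService {
      def locationsOf(rddId: Int, partition: Int) = Seq(s"host-${partition % 4}")
    }
    val placer = new TaskPlacer {
      def place(taskId: Long, preferredHosts: Seq[String]) = preferredHosts.head
    }
    val scheduler = new DistributedStageScheduler(locations, placer)
    println(scheduler.scheduleTask(taskId = 1L, rddId = 0, partition = 7))
  }
}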

-Kay


Henrique Grando

Jul 12, 2016, 6:33:42 PM
to Sparrow Users
Thank you, Kay! That clarified a lot.

Bishwajit Saha

May 10, 2017, 9:22:27 AM
to Sparrow Users
While trying to build the Sparrow-integrated Spark with the command

sbt/sbt package assembly

I am facing errors like the ones below.
[info] Compiling 1 Scala source to /home/bishwa/Documents/spark-sparrow/project/project/target/scala-2.9.2/sbt-0.12/classes...
[error] error while loading CharSequence, class file '/usr/lib/jvm/java-8-oracle/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken
[error] (bad constant pool tag 18 at byte 10)
[error] error while loading Comparator, class file '/usr/lib/jvm/java-8-oracle/jre/lib/rt.jar(java/util/Comparator.class)' is broken
[error] (bad constant pool tag 18 at byte 20)
[error] two errors found
[error] (compile:compile) Compilation failed
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? 

I don't understand what the problem is. Is there a problem with the version of Scala or Java?
I am using the Java version below.
 
 java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

and the Scala version below.
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL 

Can anyone please help me with this issue?

Any help will be greatly appreciated.
Thanks in advance.