Testing on EC2

Jonathan Mace

Dec 9, 2011, 1:31:40 PM
to spark...@googlegroups.com
Greetings,

In order to test things like speedup for my algorithm, I need to ensure that Spark fully utilises the resources available to it - i.e. if I'm running a 16-node cluster, I need to ensure that Spark distributes RDDs evenly across all 16. Is there an easy way of checking or configuring this?

Jon

Jonathan Mace

Dec 9, 2011, 1:38:23 PM
to spark...@googlegroups.com
I should add that the reason I ask is that even though I have an RDD of 256 objects, they are only being distributed across 8 of the 16 nodes in my cluster. I am using the Mesos web UI to verify this.

Jon

Matei Zaharia

Dec 9, 2011, 2:16:08 PM
to spark...@googlegroups.com
Hi Jon,

You can change the level of parallelism by passing a second argument to parallelize(), textFile(), etc. For example, sc.parallelize(List(1,2,3,4,5,6), 4) will create 4 partitions. The default number of partitions is 8.

Alternatively, you can set the Java system property spark.default.parallelism to the desired number before creating your Spark context. Then this will be used as the default in all operations.
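For instance, here's a minimal sketch of both approaches (using "local" mode and made-up sizes just for illustration; on your cluster you'd pass your Mesos master URL instead):

  System.setProperty("spark.default.parallelism", "64")  // must be set before creating the context
  val sc = new SparkContext("local", "ParallelismTest")
  val a = sc.parallelize(1 to 256)       // picks up 64 partitions from the property
  val b = sc.parallelize(1 to 256, 16)   // explicit second argument: 16 partitions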

Matei

Jonathan Mace

Dec 9, 2011, 3:46:44 PM
to spark...@googlegroups.com
Thanks Matei,

Would you recommend I have the same number of partitions as CPUs, or should I use some multiple of the number of CPUs? Additionally, the size of my collection is completely under my control, and for performance reasons I would prefer to have 1 element of the collection per partition - is this wise? Will they be evenly distributed among partitions?

Cheers,
Jon

Matei Zaharia

Dec 9, 2011, 5:01:07 PM
to spark...@googlegroups.com
It's usually better to have more partitions than CPUs for load balancing. Something like 4x more is probably a good start. It should also be fine to have one element per partition.
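For example (the node and core counts here are placeholders for whatever your cluster actually has):

  val totalCores = 16 * 4                                  // e.g. 16 nodes with 4 cores each
  val data = sc.parallelize(1 to 100000, 4 * totalCores)   // roughly 4 partitions per core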

Matei

Raja Cherukuri

Dec 15, 2011, 2:42:32 PM
to spark...@googlegroups.com
Hi Matei,
I set up a Mesos master/slave on a test machine and ran the example:
/usr/local/spark/run spark.examples.SparkPi mas...@127.0.0.1:5050

It runs and completes the tasks the first time. The second time I run the application, it simply hangs.

The Mesos UI shows:

Active Frameworks

ID                   User  Name     Running Tasks  CPUs  MEM      Max Share  Connected
201112151121-0-0000  root  SparkPi  0              0     24.0 GB  0.79       2011-12-15 11:26:42
201112151121-0-0001  root  SparkPi  0              0     0.0 MB   0.00       2011-12-15 11:27:52
What does this mean? I downloaded the latest Spark/Mesos.

- Raja

Matei Zaharia

Dec 15, 2011, 3:26:21 PM
to spark...@googlegroups.com
This happens because the second framework is still considered "active" -- there's a timeout in Mesos to decide when a framework has gone away, and right now the default value for that timeout is too high. You can fix it by launching your mesos-master as follows:

mesos-master --failover_timeout=0

Matei

Raja Cherukuri

Dec 15, 2011, 4:52:27 PM
to spark...@googlegroups.com
Strangely, that option doesn't work.
Only when I run deploy/stop-slaves and deploy/start-slaves does the next task continue from its hung state to completion.

- Raja



Raja Cherukuri

Dec 15, 2011, 5:06:12 PM
to spark...@googlegroups.com, Matei Zaharia
Hi Matei,
This option doesn't work.
Here is how it comes up:

Starting master on raja-server.net
ssh -i /root/.ssh/kroot_id_rsa -o StrictHostKeyChecking=no -o ConnectTimeout=2 raja-server.net /usr/local/mesos/deploy/mesos-daemon mesos-master --failover_timeout=0  </dev/null >/dev/null



Only after a restart of the slaves does the second task continue, and I see resources released in the UI:

Active Frameworks

ID                   User  Name     Running Tasks  CPUs  MEM      Max Share  Connected
201112151349-0-0001  raja  SparkLR  0              0     0.0 MB   0.00       2011-12-15 13:50:42
201112151349-0-0002  raja  SparkLR  0              0     0.0 MB   0.00       2011-12-15 13:55:11
201112151349-0-0000  raja  SparkLR  0              0     0.0 MB   0.00       2011-12-15 13:50:35
201112151349-0-0003  raja  SparkLR  0              0     24.0 GB  0.79       2011-12-15 13:55:16



Is this in any way related to the fact that I am running the master and slave on the same machine for my initial testing?

Thank You

Raja






Matei Zaharia

Dec 16, 2011, 1:46:10 AM
to Raja Cherukuri, spark...@googlegroups.com
This might be because mesos-daemon doesn't pass the option through correctly. Try editing conf/mesos.conf and adding the option there (add a line with failover_timeout=0).
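For reference, a sketch of what that file would contain, assuming it takes one option per line in flag=value form (without the leading dashes):

  failover_timeout=0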

Raja Cherukuri

Dec 16, 2011, 8:47:45 AM
to spark...@googlegroups.com
Matei,
That didn't work either; only a restart of the slaves allows me to re-run the job (stop-slaves and start-slaves).

I added an additional machine as another slave. Now both slaves run the tasks and return results (even if one of them goes down, mesos-master resends the TID to the other slave and completes it).
By default, the job seems to be abandoned when at least 4 TIDs fail.
Is this configurable?

Thank You

Raja



Matei Zaharia

Dec 16, 2011, 8:51:29 AM
to spark...@googlegroups.com
Can you tell me which version of Mesos you built, and maybe try running mesos-master by hand? I've tested this and it does seem to work. The only possible reason I can think of that might cause it not to work is if your old job stays around (i.e. you don't Ctrl-C it). However, if you really have only one SparkLR job running, then there should be only one framework shown in the web UI, and tasks from old jobs should be cleaned up.

For the 4 failures/task, the setting isn't yet configurable, but you can easily change it in SimpleJob.scala (it's a constant called MAX_TASK_FAILURES). Feel free to make it get loaded from a system property if you prefer.
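A minimal sketch of what that change could look like in SimpleJob.scala (the property name spark.task.maxFailures here is just an illustration, not an existing setting):

  val MAX_TASK_FAILURES = System.getProperty("spark.task.maxFailures", "4").toInt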

Matei

Raja Cherukuri

Dec 16, 2011, 9:03:59 AM
to spark...@googlegroups.com
Matei,
I downloaded the latest from GitHub (on 12/14/2011).
The only information I see in the log files about the build version is the date when I built it:
Build: 2011-12-14 13:38:50 by root
Is there version/build info that comes out when I pass a -version flag (so you could know the exact build)? If not, can you add this? I can re-build and see how it works.

Thank You

Raja


Matei Zaharia

Dec 16, 2011, 9:06:10 AM
to spark...@googlegroups.com
Oh, Mesos development is now happening in Apache SVN, so you should get it with:

svn checkout https://svn.apache.org/repos/asf/incubator/mesos/trunk mesos

Hopefully this version will contain the fix.
