Oh, Mesos development is now happening in Apache SVN, so you should get it with:
svn checkout https://svn.apache.org/repos/asf/incubator/mesos/trunk mesos
Hopefully this version will contain the fix.
On Dec 16, 2011, at 3:03 PM, Raja Cherukuri wrote:
> Matei,
> Downloaded the latest from github ( on 12/14/2011 ).
> The only information I see in the log files about the build version is the date when I built it:
> Build: 2011-12-14 13:38:50 by root
> Do you have version/build info that is printed when I pass a -version flag (so the
> exact build can be identified)? If not, could you add it? I can rebuild and see how it works.
>
> Thank You
>
> Raja
>
> From: Matei Zaharia <ma...@eecs.berkeley.edu>
> To: spark...@googlegroups.com
> Sent: Friday, December 16, 2011 5:51 AM
> Subject: Re: spark with mesos test...
>
> Can you tell me which version of Mesos you built, and maybe try running mesos-master by hand? I've tested this and it does seem to work. The only reason I can think of that it might not work is if your old job stays around (i.e. you didn't Ctrl-C it). However, if you really have only one SparkLR job running, there should be only one framework shown in the web UI, and tasks from old jobs should be cleaned up.
>
> As for the four failures per task: the setting isn't configurable yet, but you can easily change it in SimpleJob.scala (it's a constant called MAX_TASK_FAILURES). Feel free to make it get loaded from a system property if you prefer.
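> The change Matei suggests could be sketched like this (a minimal, hypothetical example; the property name spark.task.maxFailures is made up for illustration and is not an existing setting):

```scala
object SimpleJobConfig {
  // Hypothetical sketch: read the task-failure limit from a JVM system
  // property, falling back to the hard-coded default of 4 that SimpleJob.scala
  // currently uses. An override could then be passed on the JVM command line,
  // e.g. -Dspark.task.maxFailures=8 (property name is an assumption).
  val maxTaskFailures: Int =
    Option(System.getProperty("spark.task.maxFailures"))
      .map(_.toInt)
      .getOrElse(4)
}
```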
>
> Matei
>
> On Dec 16, 2011, at 2:47 PM, Raja Cherukuri wrote:
>
>> Matei,
>> That didn't work either; only a restart of the slaves (stop-slaves and start-slaves) lets me re-run the job.
>>
>> I added an additional machine as another slave.
>> Now both slaves run the tasks and return results. (Even if one of them goes down, mesos-master resends the
>> TID to the other slave and completes it.)
>> By default, the job seems to be abandoned once at least 4 TIDs have failed.
>> Is this configurable?
>>
>> Thank You
>>
>> Raja
>>
>> From: Matei Zaharia <ma...@eecs.berkeley.edu>
>> To: Raja Cherukuri <rche...@ymail.com>
>> Cc: "spark...@googlegroups.com" <spark...@googlegroups.com>
>> Sent: Thursday, December 15, 2011 10:46 PM
>> Subject: Re: spark with mesos test...
>>
>> It might be because mesos-daemon doesn't pass the option through correctly. Try editing conf/mesos.conf and adding the option there (a line containing failover_timeout=0).
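>> For reference, a sketch of what conf/mesos.conf would contain after that edit (path relative to the Mesos install directory, one key=value option per line, as described above):

```
# conf/mesos.conf
failover_timeout=0
```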
>>
>> On Dec 15, 2011, at 11:06 PM, Raja Cherukuri wrote:
>>
>>> Hi Matei,
>>> This option doesn't work.
>>> Here is how it comes up:
>>>
>>> Starting master on raja-server.net
>>> ssh -i /root/.ssh/kroot_id_rsa -o StrictHostKeyChecking=no -o ConnectTimeout=2 raja-server.net /usr/local/mesos/deploy/mesos-daemon mesos-master --failover_timeout=0 </dev/null >/dev/null
>>>
>>>
>>>
>>> Only after a restart of the slaves does the second task continue, and I see resources released in the UI:
>>> Active Frameworks
>>>
>>> ID                   User  Name     Running Tasks  CPUs  MEM      Max Share  Connected
>>> 201112151349-0-0001  raja  SparkLR  0              0     0.0 MB   0.00       2011-12-15 13:50:42
>>> 201112151349-0-0002  raja  SparkLR  0              0     0.0 MB   0.00       2011-12-15 13:55:11
>>> 201112151349-0-0000  raja  SparkLR  0              0     0.0 MB   0.00       2011-12-15 13:50:35
>>> 201112151349-0-0003  raja  SparkLR  0              0     24.0 GB  0.79       2011-12-15 13:55:16
>>>
>>>
>>>
>>> Is this in any way related to the fact that I am running the master and slave on the same machine for my initial testing?
>>>
>>> Thank You
>>>
>>> Raja
>>>