retrying when a node fails?

erich

Oct 23, 2009, 2:15:38 PM
to Disco-development
I have a pretty simple wordcount job. All nodes can see all of the
input files, which are specified as full paths on a shared directory.
I'm running Disco 0.2.3.

If a node is down, the Disco master still assigns a task to that
node, then discovers that the node is down and blacklists it:

>2009/10/23 12:38:31 node008
>
>WARN: Node failure: "Couldn't connect to node008 (timeout). Node blacklisted temporarily."

I think it then says the work is getting rescheduled:

>2009/10/23 12:38:31 master
>
>map:0 assigned to node009
>
>2009/10/23 12:38:31 master
>
>map:0 added to waitlist


Then the work assigned to the working node finishes, and I get a
second "added to waitlist" message:

>2009/10/23 12:39:31 master
>
>map:0 added to waitlist
>
>2009/10/23 12:39:10 master
>
>Received results from map:1 @ node009.

But map:0 never runs, and the master decides that it needs to kill
the job:

>2009/10/23 12:39:31 master
>
>WARN: Job killed
>
>2009/10/23 12:39:31 master
>
>ERROR: Job terminated due to the previous errors
>
>2009/10/23 12:39:31 master
>
>ERROR: Master terminated the job: Job failed on all available nodes

So, is this just a corner case when there are very few work items,
some sort of other known issue, or am I just setting something up
wrong? I was hoping that the work would get rescheduled on the node
that's up. Indeed, if I rerun the task both work items are assigned
to the non-blacklisted node. Is there anything I need to do to
indicate that a chunk of work is runnable on any node, or should be
retried, or something like that?
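
To make the behavior I was expecting concrete, here is a toy sketch in plain Python. It is not Disco code and does not use the Disco API: `schedule`, `is_up`, and the task and node lists are made-up names for illustration only. The idea is that a task which fails on a down node gets retried on any node that is not blacklisted, and the job is only killed once a task has no nodes left to try.

```python
def schedule(tasks, nodes, is_up):
    """Toy model of retry-on-node-failure (hypothetical, not Disco's scheduler).

    tasks: list of task names, e.g. ["map:0", "map:1"]
    nodes: list of node names, e.g. ["node008", "node009"]
    is_up: callable(node) -> bool, standing in for a connection attempt
    Returns a dict mapping each task to the node that ran it.
    """
    blacklist = set()   # nodes found to be down
    placement = {}      # task -> node that completed it
    for task in tasks:
        tried = set()   # nodes already attempted for this task
        while task not in placement:
            candidates = [n for n in nodes
                          if n not in blacklist and n not in tried]
            if not candidates:
                # Only now give up: nothing left to retry on.
                raise RuntimeError("%s failed on all available nodes" % task)
            node = candidates[0]
            tried.add(node)
            if is_up(node):
                placement[task] = node
            else:
                blacklist.add(node)  # retry the same task elsewhere
    return placement
```

With node008 down and node009 up, calling `schedule(["map:0", "map:1"], ["node008", "node009"], lambda n: n != "node008")` in this toy model places both tasks on node009, which matches what I see when I rerun the job.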


