erich
Oct 23, 2009, 2:15:38 PM
to Disco-development
I have a pretty simple wordcount job running on Disco 0.2.3. All nodes
can see all files; the inputs are just given as full paths on the
shared directory.
If a node is down, the Disco master still assigns a task to that
node, then figures out the node is down and blacklists it:
>2009/10/23 12:38:31 node008
>
>WARN: Node failure: "Couldn't connect to node008 (timeout). Node blacklisted temporarily."
I think it then says the work is getting rescheduled:
>2009/10/23 12:38:31 master
>
>map:0 assigned to node009
>
>2009/10/23 12:38:31 master
>
>map:0 added to waitlist
Then the work assigned to the working node finishes, and I get a
second "added to waitlist" message:
>2009/10/23 12:39:31 master
>
>map:0 added to waitlist
>
>2009/10/23 12:39:10 master
>
>Received results from map:1 @ node009.
But map:0 never runs, and the master decides that it needs to kill
the job:
>2009/10/23 12:39:31 master
>
>WARN: Job killed
>
>2009/10/23 12:39:31 master
>
>ERROR: Job terminated due to the previous errors
>
>2009/10/23 12:39:31 master
>
>ERROR: Master terminated the job: Job failed on all available nodes
So, is this just a corner case when there are very few work items,
some sort of other known issue, or am I just setting something up
wrong? I was hoping that the work would get rescheduled on the node
that's up. Indeed, if I rerun the task, both work items are assigned
to the non-blacklisted node. Is there anything I need to do to
indicate that a chunk of work is runnable on any node, or should be
retried, or something like that?
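
For illustration only, the rescheduling behavior being asked about can be sketched as a toy model in plain Python. This is not Disco's scheduler and uses no Disco API; the function name `schedule` and its arguments are hypothetical, and the node and task names are taken from the log excerpts above. The idea is that a task should only fail once every non-blacklisted node has been tried.

```python
def schedule(tasks, nodes, down):
    """Assign each task to a live node, blacklisting nodes that do not respond.

    Toy model only: `tasks`, `nodes` and `down` are hypothetical inputs,
    and this does not reflect how the Disco master is implemented.
    """
    blacklist = set()   # nodes found to be unreachable so far
    assignment = {}     # task -> node that ended up running it
    for task in tasks:
        while True:
            # Only nodes that have not been blacklisted are candidates.
            candidates = [n for n in nodes if n not in blacklist]
            if not candidates:
                # The task fails only when no usable node is left at all.
                raise RuntimeError("%s failed on all available nodes" % task)
            node = candidates[0]
            if node in down:
                # Connection timed out: blacklist the node and retry elsewhere.
                blacklist.add(node)
                continue
            assignment[task] = node
            break
    return assignment


result = schedule(["map:0", "map:1"], ["node008", "node009"], down={"node008"})
print(result)
```

In this model map:0 is first tried on node008, the node is blacklisted, and both tasks end up on node009, which matches what happens when the job is rerun by hand.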