Have you tried fixing the nr_maps and nr_reduces parameters in new_job()?
By specifying nr_maps = K, you can limit the number of concurrent map
processes to K. You get a corresponding effect with nr_reduces = K but as
nr_reduces specifies also the number of partitions, modifying it might not
be that straightforward.
If nr_maps doesn't fix your issue, we can consider adding a new parameter
that specifies the limit.
> What do you think? Is this something that is useful to others as well?
Overall I think that there's a lot of room for improvement on the
scheduler side. The current FIFO policy is in many cases good enough but I
can easily imagine many scenarios where it is not.
It's good to hear about cases which are not handled adequately by the
current scheduler. I'm happy to improve it based on real use cases.
Ville
Have you tried fixing the nr_maps and nr_reduces parameters in new_job()?
By specifying nr_maps = K, you can limit the number of concurrent map
processes to K. You get a corresponding effect with nr_reduces = K but as
nr_reduces specifies also the number of partitions, modifying it might not
be that straightforward.
If nr_maps doesn't fix your issue, we can consider adding a new parameter
that specifies the limit.
On Thu, 30 Jul 2009, Eran Sandler wrote:
>> On Thu, Jul 30, 2009 at 5:38 AM, Ville Tuulos <tuu...@gmail.com> wrote:
>>
>> Have you tried fixing the nr_maps and nr_reduces parameters in
>> new_job()? By specifying nr_maps = K, you can limit the number of
>> concurrent map processes to K. You get a corresponding effect with
>> nr_reduces = K but as nr_reduces specifies also the number of
>> partitions, modifying it might not be that straightforward.
>>
>> If nr_maps doesn't fix your issue, we can consider adding a new
>> parameter that specifies the limit.
>
> nr_maps are managable but nr_reduces will eventually consume all of the worker processes and if I
> play around with that the result is not really optimal :-)
>
> I was thinking more of a hard limit to the maximum # of worker processes a given job can take
> including map and reduce.
Right. Luckily adding such a parameter is not too complicated.
The find_values() function in handle_work.erl finds the maximum number of
workers from the nr_maps and nr_reduces parameters now. It could use any
other (new) parameter as well.
Sounds like something that we could add to 0.2.3.
> Also, another issue which I usually end up spending most of the time is the last reduce... I saw
> a presentation from Google about their implementation of Map-Reduce and if you have a bit more
> machines, its sometimes better to run some of the last reduces on more than one machine
> (specifically if the hardware specs are different) and the first one to finish kills of the other
> runs. But that's usually wasteful in a resource constraint environment such as the one I have...
> :-)
Not a bad idea if you have a heterogeneous environment. I don't know how
much it would help if machines are identical. Let's keep the idea in mind (:
Ville