Number of Maps and Reduces


Brian Eoff

May 16, 2011, 2:49:00 PM
to mr...@googlegroups.com
I apologize in advance for this question, but I have been searching through the documentation, and I cannot find a satisfactory answer.

How do I set the number of mappers and reducers using mrjob?

Thanks in advance.

- Brian

Dave Marin

May 16, 2011, 3:20:53 PM
to mr...@googlegroups.com
To set the number of maps/reduces to run, you can use --jobconf to pass the appropriate Hadoop options. For example:

mr_your_job.py --jobconf mapred.map.tasks=23 --jobconf mapred.reduce.tasks=42
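If you'd rather not repeat those flags on every run, mrjob can also pick up jobconf settings from a config file; a minimal mrjob.conf sketch (assuming the same Hadoop option names as above):

```yaml
# mrjob.conf (sketch): jobconf options applied to every EMR run
runners:
  emr:
    jobconf:
      mapred.map.tasks: 23
      mapred.reduce.tasks: 42
```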

I believe that actually setting the maximum number of mappers and reducers happens when Hadoop is started up. There's a way to do it with bootstrap actions on EMR, but we haven't yet built support for those (see https://github.com/Yelp/mrjob/issues/69).

-Dave
--

Yelp is looking to hire great engineers! See http://www.yelp.com/careers.

Jesse Shieh

May 27, 2011, 1:58:27 PM
to mrjob
As a manual hack to get bootstrap actions to work, I edited emr.py:745
to something like this:

    args['bootstrap_actions'] = [botoemr.BootstrapAction(
        'configure_hadoop',
        's3://elasticmapreduce/bootstrap-actions/configure-hadoop',
        ['-s', 'mapred.tasktracker.map.tasks.maximum=1',
         '-s', 'mapred.map.tasks=5'])]

    if self._master_bootstrap_script:
        args['bootstrap_actions'].extend([botoemr.BootstrapAction(
            'master', self._master_bootstrap_script['s3_uri'], [])])

It appears, however, that setting mapred.map.tasks here has no
effect. Also, passing mapred.map.tasks through --jobconf seems to get
ignored too. Any idea why this parameter is ignored? Does mrjob do
something special to calculate the number of mappers?
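One plausible explanation (a sketch, assuming Hadoop's usual FileInputFormat behavior; the function below is illustrative and not part of mrjob or Hadoop): Hadoop treats mapred.map.tasks as a hint only, and derives the actual number of map tasks from the number of input splits.

```python
import math

def estimated_map_tasks(input_bytes, block_size=128 * 1024 * 1024,
                        min_split=1, max_split=None):
    """Roughly one map task per input split (illustrative arithmetic)."""
    if max_split is None:
        max_split = block_size
    # Hadoop clamps the split size between the min and max split settings.
    split_size = max(min_split, min(max_split, block_size))
    return max(1, math.ceil(input_bytes / split_size))

# e.g. a 1 GB input with 128 MB blocks yields 8 map tasks,
# regardless of what mapred.map.tasks requests:
print(estimated_map_tasks(1024 ** 3))  # -> 8
```

mapred.tasktracker.map.tasks.maximum, by contrast, is a cluster-startup setting, which is why it only takes effect via a bootstrap action rather than per-job configuration.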

Jesse


Dave Marin

May 27, 2011, 2:35:28 PM
to mr...@googlegroups.com
Oh, hey, you work with Matt Tai, right? :)

mrjob definitely doesn't do anything special. Maybe it's something
that doesn't work right on Hadoop 0.18? If you're using mrjob 0.2.6,
you could try adding --hadoop-version 0.20 and see if that makes
things any better.

-Dave

Jesse Shieh

May 27, 2011, 2:46:33 PM
to mr...@googlegroups.com
Oh, yeah I do!  He's definitely got us totally hooked on MRJob.  It makes our lives so much easier!  

I'll try hadoop 0.20 and let you know how it goes.  Thanks for your help and a great open source project!

Jesse
--
Jesse Shieh | Co-Founder | Adku | www.adku.com | c: 213-537-7379 

Dave Marin

May 27, 2011, 2:50:04 PM
to mr...@googlegroups.com
:)

-Dave

Shivkumar Shivaji

May 27, 2011, 3:01:22 PM
to mr...@googlegroups.com
I looked at this briefly before. These options are set by Amazon's Elastic MapReduce framework (not mrjob). Refer to http://s3.amazonaws.com/awsdocs/ElasticMapReduce/latest/emr-dg.pdf for details. They recommend using different options for the number of mappers, and they seem to imply that you don't need to set the number of mappers unless you are doing something really different. I found that they even have a sensible default for the number of reducers. That document should be close to answering your questions.

Shiv

Excerpt from the EMR documentation:

When your job flow runs, Hadoop determines the number of mapper and reducer tasks needed to process
the data. Larger job flows should have more tasks for better resource use and shorter processing time.
Typically, an Elastic MapReduce job flow remains the same size during the entire job flow; you set the
number of tasks when you create the job flow. When you resize a running job flow, you can vary the
processing during the job flow execution. Therefore, instead of using a fixed number of tasks, you can
vary the number of tasks during the life of the job flow. There are two configuration options to help set
the ideal number of tasks. They are:
• mapred.map.tasksperslot
• mapred.reduce.tasksperslot
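The advantage of the tasksperslot options over a fixed mapred.map.tasks is that the task count scales when the job flow is resized. A rough sketch of the arithmetic (illustrative only, not EMR's exact formula; the function name and parameters are made up for this example):

```python
def tasks_from_tasksperslot(tasksperslot, slots_per_node, nodes):
    """Total tasks scale with cluster size instead of being fixed."""
    return tasksperslot * slots_per_node * nodes

# e.g. 2 map tasks per slot, 4 map slots per node, 10 nodes:
print(tasks_from_tasksperslot(2, 4, 10))  # -> 80
# Resizing to 20 nodes doubles the task count with no config change:
print(tasks_from_tasksperslot(2, 4, 20))  # -> 160
```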

Jesse Shieh

May 27, 2011, 3:24:52 PM
to mr...@googlegroups.com
Thanks Shiv!  That helps a lot.  I'll de-prioritize Hadoop 0.20 then and find my way around EMR-land.  You guys are awesome.