Performance tuning in Map Reduce


Priya Chatterjee

Oct 1, 2012, 4:56:10 AM
to chenn...@googlegroups.com
What steps should be taken to tune the performance of MapReduce jobs in Hadoop?

Senthil Kumar

Oct 1, 2012, 6:02:47 AM
to chenn...@googlegroups.com

Hi


Some of the steps I usually follow:

1. Increase the DFS block size (dfs.block.size) to 256 MB or 512 MB.

2. Set mapred.min.split.size = dfs.block.size (don't raise it beyond the block size just to get fewer maps).

3. mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution - false

4. mapred.job.reuse.jvm.num.tasks - -1 (reuse JVMs across tasks)

5. mapred.child.java.opts - give the child tasks more heap

6. io.sort.mb - check the task logs for maps that spill more than once, and increase it if they do

7. mapred.compress.map.output - true

8. Use a combiner, if possible.

9. Total slots (map + reduce) > number of cores.

10. Increase the ulimit for the user running the tasks.

11. Use the correct Writable type for your data. Text is very expensive; reuse Writable objects.

12. Implement raw comparators (RawComparator) for your custom key format.

13. Create objects in the setup() method of your Mapper/Reducer class and reuse them.

14. mapred.reduce.slowstart.completed.maps - 0.8

15. tasktracker.http.threads - 2 x number of cores
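Most of the settings above live in mapred-site.xml. A sketch of what that file could look like (Hadoop 1.x property names from the list above; the 1 GB child heap is an illustrative value, tune it for your own cluster):

```xml
<!-- mapred-site.xml: sketch of the map/reduce-side settings above -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>  <!-- reuse JVMs for an unlimited number of tasks -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>  <!-- illustrative heap size -->
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.8</value>
</property>
```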


Senthil

Senthil Kumar

Oct 1, 2012, 7:33:56 AM
to chenn...@googlegroups.com

Please use a monitoring tool like Ganglia to identify the bottleneck.



Pavan Kulkarni

Oct 1, 2012, 12:19:43 PM
to chenn...@googlegroups.com
Also try to tune mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum according to the memory you have available. Each task takes the memory you set for the child JVM, so multiply that by the number of slots; for best performance the slots together should consume roughly 80% of the node's memory.

Also tune:

mapred.map.tasks = input data size / split size

mapred.reduce.tasks = (0.95 to 1.75) * number of nodes * mapred.tasktracker.reduce.tasks.maximum
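A quick sketch of the slot-sizing arithmetic above (the 16 GB node, 1 GB child heap, and slot counts are illustrative figures, not from the thread):

```python
# Pick slot counts so that slots * child-JVM heap lands near 80%
# of the node's RAM, as suggested above.
node_ram_mb = 16 * 1024          # assume a 16 GB worker node
child_heap_mb = 1024             # e.g. mapred.child.java.opts = -Xmx1024m
map_slots = 9                    # mapred.tasktracker.map.tasks.maximum
reduce_slots = 4                 # mapred.tasktracker.reduce.tasks.maximum

used_mb = (map_slots + reduce_slots) * child_heap_mb
utilisation = used_mb / node_ram_mb
print(f"{used_mb} MB of {node_ram_mb} MB -> {utilisation:.0%} of RAM")
# 13 slots * 1024 MB = 13312 MB, about 81% of 16384 MB
```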


--
With Regards,
Pavan Kulkarni

Senthil Kumar

Oct 1, 2012, 10:04:45 PM
to chenn...@googlegroups.com
Well said, Pavan.

But increasing the split size beyond the block size will reduce performance.
For example:
         Block size - 256 MB
         Split size - 512 MB

This halves the number of mappers, but each split now has to contain two blocks, and while one block may be local, the other usually comes from a different datanode. Because the blocks of a split sit in different physical locations, we cannot take advantage of data locality. So it is better to keep the split size at the default block size and reuse the JVM.


-Senthil

Bini

Oct 2, 2012, 2:44:24 PM
to chenn...@googlegroups.com
Adding a few points:

1. Try to use physical machines if your VMs are not performing well.
2. Use multiple disks to store the HDFS data.
3. Tune the maximum number of maps by trial and error for your application.
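Point 2 maps to dfs.data.dir in hdfs-site.xml: a comma-separated list of directories, ideally one per physical disk, so the datanode round-robins its blocks across spindles. The mount points below are made up:

```xml
<!-- hdfs-site.xml: spread datanode storage across disks
     (illustrative paths) -->
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
```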

Pavan Kulkarni

Oct 2, 2012, 2:49:59 PM
to chenn...@googlegroups.com
Yes Senthil, what you said is correct.

It is like block size : HDFS :: split size : MapReduce.
Ex: block size 256 MB, split size 512 MB.
Then the mapper has to fetch the remaining 256 MB from another node over the network, which
is not optimal use of data locality (Hadoop's strongest point). Hence, always keep both the same.
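The arithmetic behind the two examples above, sketched out (the 10 GB input size is an illustrative figure, not from the thread):

```python
# How split size drives mapper count and likely-remote reads.
MB = 1024 * 1024
input_size = 10 * 1024 * MB      # assume 10 GB of input
block_size = 256 * MB            # dfs.block.size

for split_size in (256 * MB, 512 * MB):
    mappers = input_size // split_size
    blocks_per_split = split_size // block_size
    # With split == block, every mapper can read its single block locally.
    # With split == 2 blocks, at most one block is local; the other is
    # usually streamed over the network from another datanode.
    remote_blocks = blocks_per_split - 1
    print(f"split={split_size // MB} MB: {mappers} mappers, "
          f"{remote_blocks} likely-remote block(s) per split")
```

So doubling the split size halves the mapper count, but at the price of one probably-remote block read per split.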

siva

Oct 4, 2012, 9:37:05 AM
to chenn...@googlegroups.com
 

Hi Senthil & Pavan,

Is it possible to start the reduce tasks only after all the map tasks (including the parallel copying) have completed?

If yes, how, and what will it contribute to performance?

Regards,
Sivakumar
91 9048360190

 



Senthil Kumar

Oct 5, 2012, 2:18:44 AM
to chenn...@googlegroups.com
Siva,

Yes, it is possible to start the reduce tasks only after all the map tasks:
mapred.reduce.slowstart.completed.maps
The above parameter does exactly that. Its default value is 0.05 (5%).
Increasing it to around 0.8 (80%) improves throughput as well as performance.

How does it improve MR performance?
     With the default value of 0.05, the scheduler allocates reduce slots after only 5% of the map tasks have completed, and those slots then sit mostly idle.
     The slots are used more effectively if they are allocated only after 80% of the maps have completed.

Suggestion: never set it above 0.9.
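A quick sketch of what the threshold means for the scheduler (the 1,000-map job is an assumed figure):

```python
import math

# The scheduler starts allocating reduce slots once
# slowstart * total_maps map tasks have finished.
total_maps = 1000  # illustrative job size

for slowstart in (0.05, 0.8):
    maps_before_reduce = math.ceil(slowstart * total_maps)
    print(f"slowstart={slowstart}: reducers start after "
          f"{maps_before_reduce} of {total_maps} maps finish")
```

With the default, reducers in this job would grab slots after just 50 maps; at 0.8 they wait for 800, freeing those slots for map work in the meantime.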

Senthil

ksr

Mar 30, 2013, 3:57:00 AM
to chenn...@googlegroups.com
In which file do we need to set the property "mapred.reduce.slowstart.completed.maps"?

Senthil Kumar

Apr 4, 2013, 8:04:00 AM
to chenn...@googlegroups.com
Sorry for the delayed reply. You can set it in mapred-site.xml, or per job as part of the Configuration object.
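For the mapred-site.xml route, the entry would look like this (0.8 is the value suggested earlier in the thread):

```xml
<!-- mapred-site.xml: start reducers only after 80% of maps finish -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.8</value>
</property>
```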