Performance tuning in Map Reduce


Priya Chatterjee

Oct 1, 2012, 4:56:10 AM
to chenn...@googlegroups.com
What steps should be taken to tune the performance of MapReduce jobs in Hadoop?

Senthil Kumar

Oct 1, 2012, 6:02:47 AM
to chenn...@googlegroups.com

Hi


Some of the steps I usually follow:

1. Increase the DFS block size (dfs.block.size) to 256 MB or 512 MB.

2. Set mapred.min.split.size = dfs.block.size (don't raise it beyond the block size just to get fewer maps).

3. mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution - false

4. mapred.job.reuse.jvm.num.tasks - -1 (reuse JVMs across tasks)

5. mapred.child.java.opts - give the child tasks more heap

6. io.sort.mb - check the task logs for maps that spill more than once, and increase it if they do

7. mapred.compress.map.output - true

8. Use a combiner, if possible.

9. Total slots (map + reduce) > number of cores.

10. Increase the ulimit for the user running the tasks.

11. Use the correct Writable type for your data. Text is very expensive; reuse Writable objects.

12. Implement raw comparators (RawComparator) for your custom key format.

13. Create objects in the setup() method of your Mapper/Reducer class and reuse them.

14. mapred.reduce.slowstart.completed.maps - 0.8

15. tasktracker.http.threads - 2 x number of cores
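Most of the settings above live in mapred-site.xml. A sketch of what that file could look like (Hadoop 1.x property names from the list above; the 1 GB child heap is an illustrative value, tune it for your own cluster):

```xml
<!-- mapred-site.xml: sketch of the map/reduce-side settings above -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>  <!-- reuse JVMs for an unlimited number of tasks -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>  <!-- illustrative heap size -->
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.8</value>
</property>
```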


Senthil

Senthil Kumar

Oct 1, 2012, 7:33:56 AM
to chenn...@googlegroups.com

Please use a monitoring tool like Ganglia to identify the bottleneck.



Pavan Kulkarni

Oct 1, 2012, 12:19:43 PM
to chenn...@googlegroups.com
Also try to tune mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum according to the memory you have available. Each task takes the memory you set for the child JVM, so multiply that by the number of slots; for best performance the slots together should consume roughly 80% of the node's memory.

Also tune:

mapred.map.tasks = input data size / split size

mapred.reduce.tasks = (0.95 to 1.75) * number of nodes * mapred.tasktracker.reduce.tasks.maximum
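A quick sketch of the slot-sizing arithmetic above (the 16 GB node, 1 GB child heap, and slot counts are illustrative figures, not from the thread):

```python
# Pick slot counts so that slots * child-JVM heap lands near 80%
# of the node's RAM, as suggested above.
node_ram_mb = 16 * 1024          # assume a 16 GB worker node
child_heap_mb = 1024             # e.g. mapred.child.java.opts = -Xmx1024m
map_slots = 9                    # mapred.tasktracker.map.tasks.maximum
reduce_slots = 4                 # mapred.tasktracker.reduce.tasks.maximum

used_mb = (map_slots + reduce_slots) * child_heap_mb
utilisation = used_mb / node_ram_mb
print(f"{used_mb} MB of {node_ram_mb} MB -> {utilisation:.0%} of RAM")
# 13 slots * 1024 MB = 13312 MB, about 81% of 16384 MB
```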


--
With Regards,
Pavan Kulkarni

Senthil Kumar

Oct 1, 2012, 10:04:45 PM
to chenn...@googlegroups.com
Well said, Pavan.

But increasing the split size beyond the block size will reduce performance.
For example:
         Block size - 256 MB
         Split size - 512 MB

This halves the number of mappers, but each split now has to contain two blocks, and while one block may be local, the other usually comes from a different datanode. Because the blocks of a split sit in different physical locations, we cannot take advantage of data locality. So it is better to keep the split size at the default block size and reuse the JVM.


-Senthil

Bini

Oct 2, 2012, 2:44:24 PM
to chenn...@googlegroups.com
Adding a few points:

1. Try to use physical machines if your VMs are not performing well.
2. Use multiple disks to store the HDFS data.
3. Tune the maximum number of maps by trial and error for your application.
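Point 2 maps to dfs.data.dir in hdfs-site.xml: a comma-separated list of directories, ideally one per physical disk, so the datanode round-robins its blocks across spindles. The mount points below are made up:

```xml
<!-- hdfs-site.xml: spread datanode storage across disks
     (illustrative paths) -->
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
```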

Pavan Kulkarni

Oct 2, 2012, 2:49:59 PM
to chenn...@googlegroups.com
Yes Senthil, what you said is correct.

It is like block size : HDFS :: split size : MapReduce.
Ex: block size 256 MB, split size 512 MB.
Then the mapper has to fetch the remaining 256 MB from another node over the network, which
is not optimal use of data locality (Hadoop's strongest point). Hence, always keep both the same.
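The arithmetic behind the two examples above, sketched out (the 10 GB input size is an illustrative figure, not from the thread):

```python
# How split size drives mapper count and likely-remote reads.
MB = 1024 * 1024
input_size = 10 * 1024 * MB      # assume 10 GB of input
block_size = 256 * MB            # dfs.block.size

for split_size in (256 * MB, 512 * MB):
    mappers = input_size // split_size
    blocks_per_split = split_size // block_size
    # With split == block, every mapper can read its single block locally.
    # With split == 2 blocks, at most one block is local; the other is
    # usually streamed over the network from another datanode.
    remote_blocks = blocks_per_split - 1
    print(f"split={split_size // MB} MB: {mappers} mappers, "
          f"{remote_blocks} likely-remote block(s) per split")
```

So doubling the split size halves the mapper count, but at the price of one probably-remote block read per split.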

siva

Oct 4, 2012, 9:37:05 AM
to chenn...@googlegroups.com
 

Hi Senthil & Pavan,

Is it possible to start the reduce tasks only after all the map tasks (including the parallel copying) have completed?

If yes, how, and what will it contribute to performance?

Regards,
Sivakumar
91 9048360190

 



Senthil Kumar

Oct 5, 2012, 2:18:44 AM
to chenn...@googlegroups.com
Siva,

Yes, it is possible to start the reduce tasks only after all the map tasks:
mapred.reduce.slowstart.completed.maps
The above parameter does exactly that. Its default value is 0.05 (5%).
Increasing it to around 0.8 (80%) improves throughput as well as performance.

How does it improve MR performance?
     With the default value of 0.05, the scheduler allocates reduce slots after only 5% of the map tasks have completed, and those slots then sit mostly idle.
     The slots are used more effectively if they are allocated only after 80% of the maps have completed.

Suggestion: never set it above 0.9.
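A quick sketch of what the threshold means for the scheduler (the 1,000-map job is an assumed figure):

```python
import math

# The scheduler starts allocating reduce slots once
# slowstart * total_maps map tasks have finished.
total_maps = 1000  # illustrative job size

for slowstart in (0.05, 0.8):
    maps_before_reduce = math.ceil(slowstart * total_maps)
    print(f"slowstart={slowstart}: reducers start after "
          f"{maps_before_reduce} of {total_maps} maps finish")
```

With the default, reducers in this job would grab slots after just 50 maps; at 0.8 they wait for 800, freeing those slots for map work in the meantime.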

Senthil

ksr

Mar 30, 2013, 3:57:00 AM
to chenn...@googlegroups.com
In which file do we need to set the property "mapred.reduce.slowstart.completed.maps"?

Senthil Kumar

Apr 4, 2013, 8:04:00 AM
to chenn...@googlegroups.com
Sorry for the delayed reply. You can set it in mapred-site.xml, or per job as part of the Configuration object.
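For the mapred-site.xml route, the entry would look like this (0.8 is the value suggested earlier in the thread):

```xml
<!-- mapred-site.xml: start reducers only after 80% of maps finish -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.8</value>
</property>
```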