Want to increase the number of Maps and Reducers in Pig


Sivaji

Oct 2, 2014, 2:26:09 AM
to chenn...@googlegroups.com
Hi Everyone,

I am trying to process 10 million records in Pig, applying some grouping and aggregate functions. It takes more than 7 hours to process this data, and the UI shows 4 maps and 1 reducer.

How can I increase the number of maps and reducers to optimize this process? Can anyone explain how to do this? Thank you.

Alex M

Oct 2, 2014, 4:03:49 AM
to chenn...@googlegroups.com

The number of mappers is determined by your InputFormat. If you are using PigStorage, FileInputFormat will allocate at least 1 mapper for each file. If a file is large, FileInputFormat will split it into smaller chunks. You can control this process with two Hadoop settings: "mapred.min.split.size" and "mapred.max.split.size".

To know more please follow the links. 
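For example, to target roughly 256 MB splits from a Pig script (a sketch; the values are in bytes, and these are the pre-YARN mapred.* property names, so verify them against your Hadoop version):

```pig
-- 268435456 bytes = 256 MB; setting min and max equal pins the split size.
SET mapred.min.split.size 268435456;
SET mapred.max.split.size 268435456;
-- Keep Pig from combining small splits back into fewer mappers (optional).
SET pig.noSplitCombination true;
```

With a 1 GB input file, this would yield about 4 mappers; smaller split sizes yield more.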

Sivaji

Oct 2, 2014, 6:22:53 AM
to chenn...@googlegroups.com
Thanks Alex,

I am using PigStorage() only. I am using the code below; will it work to speed up my process?


SET default_parallel 200;
SET output.compression.enabled true;
SET output.compression.codec com.hadoop.compression.lzo.LzopCodec;
SET mapred.min.split.size 256000;
SET mapred.max.split.size 256000;
SET pig.noSplitCombination true; 
SET mapred.max.jobs.per.node 1;

Can you please explain what each of the above settings does?

Thanks.
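For reference, here is the same script with a comment on each line (my understanding of these pre-YARN properties; check the behaviour against your own Hadoop distribution):

```pig
-- Request up to 200 reduce tasks for operators without a PARALLEL clause.
SET default_parallel 200;
-- Compress job output with LZO to cut I/O.
SET output.compression.enabled true;
SET output.compression.codec com.hadoop.compression.lzo.LzopCodec;
-- Force splits of about 256 KB (values are in bytes). Note that 256000
-- is 256 KB, not 256 MB, which may create far more mappers than intended.
SET mapred.min.split.size 256000;
SET mapred.max.split.size 256000;
-- Stop Pig from combining small splits back into fewer mappers.
SET pig.noSplitCombination true;
-- Limit concurrent jobs per node (non-standard property; check whether
-- your distribution actually honours it).
SET mapred.max.jobs.per.node 1;
```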

Sivaji

Oct 6, 2014, 10:19:26 AM
to chenn...@googlegroups.com
Hi Everyone,

In my organisation we are using a 6-node cluster, and I am running some Pig scripts on more than 10 million records. I want to set the number of mappers and reducers.

How can I identify the maximum number of reducers I can set in my Pig script? What split size gives the best mapper performance? I am using the code below to set maps and reducers:

SET default_parallel 200;
SET mapred.min.split.size 256000;
SET mapred.max.split.size 256000;
SET pig.noSplitCombination true;
SET mapred.max.jobs.per.node 1; 

Thanks
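On sizing default_parallel: a common rule of thumb from the Hadoop MapReduce tutorial is roughly 0.95 * (number of worker nodes) * (reduce slots per node, i.e. mapred.tasktracker.reduce.tasks.maximum). On a 6-node cluster with one master, that leaves 5 workers; assuming the old default of 2 reduce slots per node (an assumption, check your cluster config), that suggests a value far smaller than 200:

```pig
-- 0.95 * 5 workers * 2 reduce slots/node = 9.5, so about 9 reducers
-- (a sketch; verify mapred.tasktracker.reduce.tasks.maximum on your cluster).
SET default_parallel 9;
```

Oversubscribing reducers well past the cluster's slot count mostly adds scheduling overhead rather than speed.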

Subrata Biswas

Oct 7, 2014, 1:24:23 PM
to chenn...@googlegroups.com
Hi Sivaji,
You cannot set the number of mappers directly; it depends on the number of input splits / the block size.
Assuming the input split size comes out equal to the HDFS block size (or is derived from it), if your block size is 32 MB and your file size is 1 GB, the number of mappers will be 1*1024/32 = 32.
Now, you have a 6-node cluster; assuming one node is the master (NN and JT), that leaves 5 data nodes. Each node will then launch 32/5 = 6 maps, and 2 of the nodes will launch 7 maps, since 2 splits are still remaining. One more thing to consider: if you have set a low maximum for concurrent mappers per data node (say 5), your MapReduce job will take a big performance hit, so you need to look at that as well.

Also, some sample files come with Hadoop and reside in the hadoop lib folder; you can use them for benchmarking. Please have a look in the Hadoop documentation for those sample utilities. One of the sample jars is specifically for benchmarking your Hadoop cluster.

Please do not expect the exact commands and file names from me; better to start looking into those files, or follow a good Hadoop book, where you will find a direct and precise answer.

Regards
Subrata.

--
You received this message because you are subscribed to the Google Groups "Hadoop Users Group (HUG) Chennai" group.
