Is there a way to specify the number of mappers to be used when running a cascading job?


Gnanesh Gujulva

Apr 20, 2015, 5:26:04 PM
to cascadi...@googlegroups.com
Hi,
  I am trying to run a join of two data sets using Cascading. The join takes a long time when run on a Hadoop cluster. On investigating, I found that the Cascading job spawns only one mapper even though the input data is large; I believe that is the reason processing is slow. Is there a way to programmatically specify the number of mappers/reducers in Cascading? I tried the following, but it still creates only one mapper.

    Properties properties = new Properties();
    properties.setProperty("mapreduce.job.maps", "21");
    AppProps.setApplicationJarClass( properties, Main.class );
    Hadoop2MR1FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

Anyone have success in configuring the number of mappers/reducers from Cascading API?
Thanks
Gnanesh

Cascading API version: 2.6.0
Hadoop platform: Hortonworks 2.6

Krishnan K

Apr 20, 2015, 5:35:33 PM
to cascadi...@googlegroups.com

Yes Gnanesh, you can use the following property to set the number of reducers:

    Properties properties = new Properties();
    properties.setProperty( "mapreduce.job.reduces", "1000" );

    Flow flow = new Hadoop2MR1FlowConnector( properties ).connect.....


For the number of mappers, you can adjust your split size.
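
For example, a minimal sketch, assuming the Hadoop 2 split-size property name (on Hadoop 1 the equivalent is "mapred.max.split.size"):

    Properties properties = new Properties();
    // One mapper is spawned per input split, so capping the split size
    // below the block size increases the mapper count (67108864 bytes = 64 MB)
    properties.setProperty( "mapreduce.input.fileinputformat.split.maxsize", "67108864" );
    Hadoop2MR1FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );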

Regards


Gnanesh Gujulva

Apr 21, 2015, 9:19:46 AM
to cascadi...@googlegroups.com
Thanks Cyberlord. I tried the suggested settings for both reducers and mappers, but they do not seem to change the number of mappers or reducers being spawned. Is there anything else I can try? What should I look for in the logs to see whether these properties are being passed on to the Hadoop job controller?
-Gnanesh

Ken Krugler

Apr 21, 2015, 9:59:00 AM
to cascadi...@googlegroups.com
Just to confirm, you're sure your job is being run on the cluster, and not locally?

E.g. you're not using a local file as one of the inputs to the job.

-- Ken


--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

Gnanesh Gujulva

Apr 21, 2015, 10:14:06 AM
to cascadi...@googlegroups.com
Yes, I am running the job on a cluster. I am submitting the job through Oozie.

Antonios Chalkiopoulos

Apr 22, 2015, 4:42:24 AM
to cascadi...@googlegroups.com
Gnanesh,

You cannot force the number of mappers (you can provide a hint, but it's up to the MapReduce framework to decide whether to use that hint).

Usually when reading two datasets in order to join them, you automatically get ONE mapper per block. That means for every 128 MB (if that's your block size) you get one mapper.
The mappers are just reading the data in, so they should be pretty fast. To give you an indication: when I'm joining 3 GB with another 3 GB, it usually takes 46 mappers a total of 26 seconds to read the whole data in.

Where the trouble starts with joining is in the reduce phase. You may even end up with hundreds of reducers, taking many hours to join the datasets.

You can control the number of reducers, though (as others indicated). However, that does not guarantee that the job will complete in a timely manner.

It matters a lot what your data looks like - and what type of join you use!
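
For illustration, a rough sketch of the two common choices in Cascading (pipe and field names here are placeholders; note that the non-join field names on the two sides must not collide, or declared fields must be passed to the constructor):

    // Reduce-side join: handles inputs of any size, but shuffles all the data
    Pipe joined = new CoGroup( lhs, new Fields( "id" ), rhs, new Fields( "user_id" ), new InnerJoin() );

    // Map-side join: skips the reduce phase entirely, but the right-hand
    // side must be small enough to fit in memory on each mapper
    Pipe joinedFast = new HashJoin( large, new Fields( "id" ), small, new Fields( "user_id" ), new InnerJoin() );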

- Antonios

Pushpender Garg

Apr 22, 2015, 10:15:29 AM
to cascadi...@googlegroups.com
Try adjusting the split size for mappers, as others have mentioned.
To set the number of reducers, the property on Hadoop 1 is "mapred.reduce.tasks"; on Hadoop 2 it is "mapreduce.job.reduces". This should work at the flow level as well as at the flow-step level (getStepConfigDef on a pipe), as in the sketch below.
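
A minimal sketch of both levels (assuming a pipe assembly is already in scope, and assuming ConfigDef's two-argument setProperty, which defaults to REPLACE mode):

    // Flow level: applies to every step in the flow
    Properties properties = new Properties();
    properties.setProperty( "mapreduce.job.reduces", "50" );
    FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

    // Step level: applies only to the map-reduce step this pipe is planned into
    pipe.getStepConfigDef().setProperty( "mapreduce.job.reduces", "10" );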


Supreet Oberoi

Apr 23, 2015, 3:04:42 PM
to cascadi...@googlegroups.com
We verified that the Cascading job is being executed in local mode when submitted by Oozie (the syslogs show LocalJobRunner references).

For Oozie experts -- is there something specific that needs to be done when launching a job with Oozie to ensure that it does not run in local mode?

-supreet

Gnanesh Gujulva

Apr 29, 2015, 8:53:15 AM
to cascadi...@googlegroups.com
Update: 
We found a workaround for the issue noted below, i.e. Cascading jobs submitted through Oozie running in local mode. It looks like, for some reason, the Hadoop configuration is not imported automatically when submitting Cascading jobs through Oozie. Here is a code snippet that works around the issue:
    Properties properties = new Properties();

    // Load the cluster's site configuration explicitly, since Oozie does not
    // appear to put these files on the classpath
    Configuration conf = new Configuration();
    conf.addResource( new Path( "/etc/hadoop/conf/mapred-site.xml" ) );
    conf.addResource( new Path( "/etc/hadoop/conf/yarn-site.xml" ) );
    conf.addResource( new Path( "/etc/hadoop/conf/core-site.xml" ) );
    conf.addResource( new Path( "/etc/hadoop/conf/hdfs-site.xml" ) );

    // Copy the loaded Hadoop configuration into the Cascading properties
    Hadoop2MR1Planner.copyConfiguration( properties, conf );

    AppProps.setApplicationJarClass( properties, Main.class );
    FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

After importing the configuration, the Oozie job used the resource manager to submit the job, and I could see more than one mapper being spawned.

Andre Kelpe

Apr 29, 2015, 9:34:20 AM
to cascadi...@googlegroups.com
Thanks for sharing this. It looks like Oozie tries to be smart and is actually causing more harm than good. Could you open a bug upstream and check whether that is the desired behavior?

- André


Chris K Wensel

Apr 29, 2015, 12:00:15 PM
to cascadi...@googlegroups.com
If those files are not on the classpath, you will need to provide them explicitly, as you did. This is how Hadoop's Configuration class works.
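
In other words (a minimal sketch, not from the original reply):

    // With the *-site.xml files on the classpath, this alone picks them up
    Configuration conf = new Configuration();

    // Without them on the classpath, each resource must be added by hand,
    // as in the workaround snippet earlier in the thread
    conf.addResource( new Path( "/etc/hadoop/conf/mapred-site.xml" ) );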

ckw


Gnanesh Gujulva

Apr 30, 2015, 3:05:35 PM
to cascadi...@googlegroups.com
We were able to run Java apps on Hadoop without explicitly importing the Hadoop configuration. Our platform engineers had to copy the Hadoop configuration files to the Oozie share lib folder on HDFS, i.e. /user/oozie/share/lib/...