Scalability of Cascading jobs


Lekhnath Bhusal

Nov 10, 2012, 2:50:41 AM
to cascadi...@googlegroups.com
Hi folks,

I started using Cascading a few weeks ago and have built a data analytics engine on top of it. Most of the code I have written to date has run in a pseudo-distributed environment.
Now that I need to run the application on real data, I have to run the jobs on a cluster. With the same data, the jobs take almost the same time in pseudo-distributed mode as on a 6-node cluster.
As a very simple example, I have a validation job that checks individual tuple fields against regular expression patterns. When I run the job in the distributed environment, it does not scale well. Even with a large number of mappers, the individual mappers run much more slowly there than in pseudo-distributed mode.

Is anything missing in my configuration? When I run similar jobs in pure MapReduce, they scale well.

Thanks,
Lekhnath

Chris K Wensel

Nov 10, 2012, 4:22:36 PM
to cascadi...@googlegroups.com
Cascading will scale fine if you set the proper Hadoop properties on your Flows. Cascading makes no attempt to set any defaults. The first place to look would likely be the num reducers setting.
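
A minimal sketch of what setting such a property might look like, written in plain Java so it runs without Cascading on the classpath. It only builds the properties object. `mapred.reduce.tasks` is the Hadoop 1.x name for the number of reduce tasks; the class name, the helper `withReducers`, and the value 12 are hypothetical, and the final hand-off to a Cascading `FlowConnector` is an assumption shown only as a comment.

```java
import java.util.Properties;

public class FlowProperties {
    // Hypothetical helper: builds the properties that would be handed to the flow connector.
    // "mapred.reduce.tasks" is the Hadoop 1.x property for the number of reduce tasks.
    static Properties withReducers(int numReducers) {
        if (numReducers < 1) {
            throw new IllegalArgumentException("numReducers must be at least 1");
        }
        Properties properties = new Properties();
        properties.setProperty("mapred.reduce.tasks", Integer.toString(numReducers));
        return properties;
    }

    public static void main(String[] args) {
        // 12 is an illustrative value only (for example, two reducers per node on six nodes).
        Properties properties = withReducers(12);

        // Assumption: with Cascading on the classpath, these properties would be passed
        // to the connector before planning the flow, e.g. new FlowConnector(properties).
        System.out.println(properties.getProperty("mapred.reduce.tasks"));
    }
}
```

The right reducer count depends on the cluster and the data, so the value should be treated as something to tune per flow rather than a recommendation.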


ckw



Lekhnath Bhusal

Nov 13, 2012, 1:59:34 AM
to cascadi...@googlegroups.com
Thanks for the response. 
I found that one of my functions was emitting a large number of tuples for each input tuple in the stream, which made its output explode. The copy overhead was killing all scalability. After fixing that function, things work fine.
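
For anyone chasing a similar problem, here is a rough sketch of how the fan-out of a function could be measured on a small sample before running on the cluster. It is plain Java, not Cascading code: the class `FanOutCheck`, the method `averageFanOut`, and the whitespace-splitting function are all hypothetical stand-ins, not the function from this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class FanOutCheck {
    // Returns the average number of outputs emitted per input over a sample.
    // A value far above 1 suggests the function multiplies the data it passes downstream.
    static <I, O> double averageFanOut(List<I> sample, Function<I, List<O>> fn) {
        if (sample.isEmpty()) {
            return 0.0;
        }
        long emitted = 0;
        for (I input : sample) {
            emitted += fn.apply(input).size();
        }
        return (double) emitted / sample.size();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("a b c", "d e", "f");

        // Hypothetical function: emits one output per whitespace-separated token.
        Function<String, List<String>> splitter = line -> Arrays.asList(line.split(" "));

        System.out.println(averageFanOut(sample, splitter));
    }
}
```

A ratio like this only flags a candidate; whether the extra tuples are actually the bottleneck still has to be confirmed on the real job.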

Thanks
Lekhnath

Pat Ferrel

Nov 13, 2012, 12:03:36 PM
to cascadi...@googlegroups.com, bixo...@yahoogroups.com
How can I take many input files (actual HDFS part-xxxx files) and end up with one file in a new sink at the end of the flow? My flow seems to default to as many output files as there were original input files.

It seems I should be telling Cascading to use a single reducer somewhere (in the last step?).

Pat Ferrel

Nov 13, 2012, 1:59:17 PM
to bixo...@yahoogroups.com, cascadi...@googlegroups.com
I'm trying to use the Cascading Scheme to specify a single output file for a sink. I've tried both setNumSinkParts(1) and the constructor param. I'm using Cascading 1.2.5 in Bixo. I also tried a SequenceFile Scheme with setNumSinkParts(1), and that didn't work either. Am I doing something wrong?

Relevant code:

        Scheme indexScheme = new TextLine( IndexDatum.FIELDS, 1 );
        indexScheme.setNumSinkParts( 1 ); // one index file

        // one preference output file --> doesn't work!
        Scheme outputScheme = new TextLine( new Fields( "hashed person id", "hashed preference id" ), 1 );
        // outputScheme.setNumSinkParts( 1 ); // one preference output file --> doesn't work, so trying the constructor param
        Tap outputSink = new Hfs( outputScheme, prefSinkPath.toString() );

        // there are two output sinks, both with Schemes that setNumSinkParts(1)
        sinkMap.put( writeHashedIds.getName(), outputSink );

        Flow flow = flowConnector.connect( source, sinkMap, tailPipes.toArray( new Pipe[ tailPipes.size() ] ) );
 
I guess it's back to specifying a single reducer in Hadoop?


BTW sorry to cross-post cascading questions to bixo-dev but I get no response on the cascading list.


On Nov 13, 2012, at 9:09 AM, Vivek Magotra <vm....@gmail.com> wrote:

 

Hi Pat,

In Cascading, you can specify :

Alternatively, if you set up just one reduce task, you'll end up with one part file.

Vivek

