Scalability of Cascading jobs

Lekhnath Bhusal

Nov 10, 2012, 2:50:41 AM

to cascadi...@googlegroups.com

Hi folks,

I started using Cascading a few weeks ago and have built a data analytics engine on top of it. Most of the code I have written to date has run in a pseudo-distributed environment.

Now I need to run the application on real data, so I have to run the jobs on a cluster. With the same data, the jobs take almost the same time in pseudo-distributed mode as on a 6-node cluster.

As a very simple example, I have a validation job that checks individual tuple fields against regular expression patterns. When I run the job in the distributed environment, it does not scale well. Even with a large number of mappers, the individual mappers run much slower there than in pseudo-distributed mode.

Is there anything missing in my configuration? When I run similar jobs in pure MapReduce, they scale well.

Thanks,

Lekhnath

Chris K Wensel

Nov 10, 2012, 4:22:36 PM

to cascadi...@googlegroups.com

Cascading will scale fine if you set the proper Hadoop properties on your Flows. Cascading makes no attempt to set any defaults. The first place to look would likely be the num reducers setting.

http://docs.cascading.org/cascading/2.1/userguide/html/ch03s08.html
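As a rough illustration of the point above, the reducer count can be carried in the property set that is given to the flow connector. The sketch below uses only java.util.Properties so that it runs on its own. The class name ReducerProps, the helper withReducers, and the value 12 are hypothetical; "mapred.reduce.tasks" is the classic Hadoop job property for the number of reduce tasks. How the properties reach the flow depends on the Cascading version, so the connector calls appear only as comments.

```java
import java.util.Properties;

// Sketch only: builds the property set that would be passed to a Cascading flow connector.
final class ReducerProps {
    // Classic (Hadoop 1.x era) job property for the number of reduce tasks.
    static final String NUM_REDUCERS_KEY = "mapred.reduce.tasks";

    // Returns a copy of base with the reducer count set; base itself is not modified.
    static Properties withReducers(Properties base, int numReducers) {
        if (numReducers < 0) {
            throw new IllegalArgumentException("numReducers must be >= 0");
        }
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty(NUM_REDUCERS_KEY, Integer.toString(numReducers));
        return props;
    }

    public static void main(String[] args) {
        // 12 is an arbitrary illustrative value, not a recommendation.
        Properties props = withReducers(new Properties(), 12);

        // Assumption: in a real application the properties would be handed to the connector,
        // e.g. new HadoopFlowConnector(props) in Cascading 2.x
        // or   new FlowConnector(props)       in Cascading 1.x.
        System.out.println(NUM_REDUCERS_KEY + " = " + props.getProperty(NUM_REDUCERS_KEY));
    }
}
```

A sensible value depends on the cluster and the data, so treat the number here as a placeholder to tune.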

ckw

--

Chris K Wensel

ch...@concurrentinc.com

http://concurrentinc.com

Lekhnath Bhusal

Nov 13, 2012, 1:59:34 AM

to cascadi...@googlegroups.com

Thanks for the response.

I found that one of my functions was emitting a large number of tuples for each input tuple in the stream, causing its output to explode. The copy overhead was killing all scalability. After fixing that function, things work fine.
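One way to spot this kind of blow-up before running on a cluster is to measure how many records a function emits per input record on a small sample. The sketch below is plain Java and does not use the Cascading API. The class FanOut, the method ratio, and the sample "explode" function are all hypothetical stand-ins, not the function from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Sketch only: measures the average number of output records per input record.
final class FanOut {
    // Applies fn to every input and returns (total records emitted) / (number of inputs).
    static <T, R> double ratio(List<T> inputs, Function<T, List<R>> fn) {
        if (inputs.isEmpty()) {
            return 0.0;
        }
        long emitted = 0;
        for (T in : inputs) {
            emitted += fn.apply(in).size();
        }
        return (double) emitted / inputs.size();
    }

    public static void main(String[] args) {
        // Hypothetical function: emits one record per character of its input.
        Function<String, List<String>> explode = s -> {
            List<String> out = new ArrayList<>();
            for (char c : s.toCharArray()) {
                out.add(String.valueOf(c));
            }
            return out;
        };
        // A ratio far above 1 means the downstream steps see far more data than was read.
        System.out.println("fan-out ratio: " + ratio(Arrays.asList("abcd", "ef"), explode));
    }
}
```

With N input tuples and an average fan-out of k, everything downstream has to copy and process roughly N × k tuples, which is why a large k can erase the benefit of adding nodes.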

Thanks

Lekhnath

Pat Ferrel

Nov 13, 2012, 12:03:36 PM

to cascadi...@googlegroups.com, bixo...@yahoogroups.com

How can I take many input files (actual HDFS part-xxxx files) and end up with one file in a new sink at the end of the flow? My flow seems to default to as many output files as there were input files.

It seems like I should be telling Cascading to use a single reducer somewhere (in the last step?).

Pat Ferrel

Nov 13, 2012, 1:59:17 PM

to bixo...@yahoogroups.com, cascadi...@googlegroups.com

I'm trying to use the Cascading Scheme to specify a single output file for a sink. I've tried setNumSinkParts(1) and the constructor parameter. I'm using Cascading 1.2.5 in Bixo. I also tried a SequenceFile Scheme with setNumSinkParts(1), and that didn't work either. Am I doing something wrong?

Relevant code:

// Index sink: ask for a single part file, via both the constructor and the setter.
Scheme indexScheme = new TextLine( IndexDatum.FIELDS, 1 );
indexScheme.setNumSinkParts(1); // one index file

// Preference sink: one output file requested through the constructor --> doesn't work!
Scheme outputScheme = new TextLine( new Fields("hashed person id", "hashed preference id"), 1 );
// outputScheme.setNumSinkParts(1); // one preference output file --> doesn't work, so trying the constructor param

Tap outputSink = new Hfs( outputScheme, prefSinkPath.toString() );

// There are two output sinks, both with Schemes that call setNumSinkParts(1).
sinkMap.put( writeHashedIds.getName(), outputSink );

Flow flow = flowConnector.connect( source, sinkMap, tailPipes.toArray( new Pipe[tailPipes.size()] ) );

I guess it's back to specifying a single reducer in Hadoop?

BTW, sorry to cross-post Cascading questions to bixo-dev, but I get no response on the Cascading list.

On Nov 13, 2012, at 9:09 AM, Vivek Magotra <vm....@gmail.com> wrote:

Hi Pat,

In Cascading, you can specify the number of sink parts on the Scheme:

http://docs.cascading.org/cascading/1.2/javadoc/cascading/scheme/Scheme.html#setNumSinkParts(int)

Alternatively, if you set up just one reduce task, you'll end up with one part file.

Vivek
