Cascading etiquette: Constructor vs. JobConf


JPatrick Davenport

Aug 19, 2012, 9:56:39 PM
to cascadi...@googlegroups.com
Hello,
I'm in the process of trying to run 15 distinct, somewhat complex filters against the same data set. Right now I read in the data set and split the pipe across the filters (these are more than normal filters in the Cascading sense; they are subassemblies, but they ultimately filter the data). Each of the filters has a dynamic threshold that I have to pass to it. I'm using the constructor, since the threshold is a String or a double. Is this okay?

I assume that since the subassemblies, filters, and aggregators all implement Serializable, my thresholds would get happily serialized and sent to Hadoop when the job is run. I can also see putting the thresholds into the job config and pulling them out later. Which is the preferred way?
To give an example:

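import cascading.operation.BaseOperation;
import cascading.operation.Filter;
import cascading.tuple.Tuple;

class MyExample extends BaseOperation<Tuple> implements Filter<Tuple> {
    // Set once at construction; serialized along with the operation.
    private final String threshold;

    public MyExample(final String threshold) {
        this.threshold = threshold;
    }
    // filter code goes below.
}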

Thanks,
JPD

Bertrand Dechoux

Aug 20, 2012, 2:08:55 AM
to cascadi...@googlegroups.com
Hi,

My own opinion is that in your context using the constructor is indeed a good idea.

It is one of the advantages of Cascading that you can construct your flow and then send it to be processed by the nodes. You don't need to create classes (static configuration) and then pull everything from the job conf with custom keys that have to be kept distinct.

Of course, you have to be careful and limit the amount of data that is transferred within the job conf. You would not want to insert a data file within it. But the constraint is more about the size of the job conf and not really about the flow itself (which is only a part of the job conf).

Regards

Bertrand
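
To illustrate the contrast described above, here is a minimal sketch in plain Java. It does not use Cascading or Hadoop: java.util.Properties stands in for the job conf, and the class names, the key pattern "filter.N.threshold", and the threshold values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class ThresholdPassing {
    // Constructor style: each instance carries its own threshold.
    static final class CtorFilter {
        private final double threshold;

        CtorFilter(double threshold) {
            this.threshold = threshold;
        }

        boolean isRemove(double value) {
            return value < threshold;
        }
    }

    // Config style: the threshold is looked up by key in a shared config.
    // Properties is only a stand-in for a job conf (assumption).
    static final class ConfFilter {
        private final String key;

        ConfFilter(String key) {
            this.key = key;
        }

        boolean isRemove(double value, Properties conf) {
            return value < Double.parseDouble(conf.getProperty(key));
        }
    }

    // One filter per threshold; no key naming scheme is needed.
    static List<CtorFilter> buildCtorFilters(double[] thresholds) {
        List<CtorFilter> filters = new ArrayList<>();
        for (double t : thresholds) {
            filters.add(new CtorFilter(t));
        }
        return filters;
    }

    // One distinct key per filter; a clash would silently overwrite a value.
    static Properties buildConf(double[] thresholds) {
        Properties conf = new Properties();
        for (int i = 0; i < thresholds.length; i++) {
            conf.setProperty("filter." + i + ".threshold", Double.toString(thresholds[i]));
        }
        return conf;
    }

    public static void main(String[] args) {
        double[] thresholds = {0.25, 0.5, 0.75};
        List<CtorFilter> ctorFilters = buildCtorFilters(thresholds);
        Properties conf = buildConf(thresholds);
        ConfFilter confFilter = new ConfFilter("filter.2.threshold");

        System.out.println("ctor filter removes 0.6: " + ctorFilters.get(2).isRemove(0.6));
        System.out.println("conf filter removes 0.6: " + confFilter.isRemove(0.6, conf));
    }
}
```

Note that the config-style filter still has to learn its key from somewhere, so with 15 filters the key naming scheme becomes one more thing to keep consistent.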

Koert Kuipers

Aug 20, 2012, 11:11:51 AM
to cascadi...@googlegroups.com
We put it in the constructors as well. Indeed, the limitation is that it is Java-serialized, which isn't the most efficient, so don't try to ship enormous data structures this way.
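
As a rough way to see the size difference being discussed, the following sketch serializes a small object and a large array with standard Java serialization and compares the byte counts. It uses only java.io; the ThresholdHolder class and its values are hypothetical stand-ins for a constructor-configured filter, and it does not reproduce how Cascading itself packages a flow.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializedSize {
    // Hypothetical stand-in for a filter configured through its constructor.
    static final class ThresholdHolder implements Serializable {
        private static final long serialVersionUID = 1L;
        final String label;
        final double threshold;

        ThresholdHolder(String label, double threshold) {
            this.label = label;
            this.threshold = threshold;
        }
    }

    // Serialize any Serializable object to a byte array.
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    // Read back an object written by serialize().
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // A 10-character string and a double, as in the question.
        byte[] smallBytes = serialize(new ThresholdHolder("abcdefghij", 0.75));
        // A deliberately large structure, the kind not worth shipping this way.
        byte[] bigBytes = serialize(new double[1_000_000]);

        System.out.println("small object: " + smallBytes.length + " bytes");
        System.out.println("1M doubles:   " + bigBytes.length + " bytes");
    }
}
```

The small object comes out well under a kilobyte, while the array of a million doubles takes at least 8 MB, since each double is written as 8 bytes.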



JPatrick Davenport

Aug 20, 2012, 9:50:52 PM
to cascadi...@googlegroups.com
Thank you all for your responses. It's nice to see that the constructor approach is a good idea.

As for the concern about shipping a lot of data, I'm only passing a string of about 10 characters and a double, so I think it should be safe.

Thanks,
JPD