Hello,
I'm trying to run 15 distinct, somewhat complex filters against the same data set. Right now I read in the data set and split the pipe across the filters. (These are more than filters in the Cascading sense: they are subassemblies, but they ultimately filter the data.) Each filter has a dynamic threshold that I have to pass to it. I'm passing it through the constructor, since the threshold is a String or a double. Is this okay?
I assume that since the subassemblies, filters, and aggregators all implement Serializable, my thresholds will be serialized along with them and sent to Hadoop when the job runs. I could also put the thresholds into the job config and pull them out later. Which is the preferred way?
To give an example:
class MyExample extends BaseOperation<Tuple> implements Filter<Tuple> {
    private final String threshold;

    public MyExample(final String threshold) {
        this.threshold = threshold;
    }

    // filter code goes below.
}
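To make the serialization assumption concrete, here is a minimal sketch in plain Java with no Cascading or Hadoop classes. ThresholdRoundTrip, ThresholdHolder, and roundTrip are hypothetical names made up for this illustration; ThresholdHolder only stands in for a filter like MyExample. It shows that a field set in the constructor survives a standard Java serialization round trip. It does not show how Cascading itself ships operations to the cluster.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ThresholdRoundTrip {

    // Hypothetical stand-in for a filter such as MyExample: a Serializable
    // object whose only state is the threshold passed to the constructor.
    static class ThresholdHolder implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String threshold;

        ThresholdHolder(final String threshold) {
            this.threshold = threshold;
        }

        String getThreshold() {
            return threshold;
        }
    }

    // Serializes the object to a byte array and reads it back, which is the
    // kind of round trip assumed to happen before the job runs remotely.
    static ThresholdHolder roundTrip(final ThresholdHolder original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ThresholdHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        ThresholdHolder copy = roundTrip(new ThresholdHolder("0.75"));
        System.out.println(copy.getThreshold()); // prints 0.75
    }
}
```

If this holds for the real filters, the main caveat is that every field has to be Serializable itself or be marked transient. A String or double threshold satisfies that.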
Thanks,
JPD