Hi Everyone,
I am working on an outlier-filtering problem. I have implemented an approach, but it is not performing well. Here is what I do:
Step 1:
My input tuples are in inputPipe -> (x, y, z, value)
I create a copy of inputPipe and call it quartilePipe.
quartilePipe outputs tuples of the form -> (x, y, z, quartile25, quartile75)
Step 2:
I use a CoGroup to join quartilePipe with the original inputPipe, which gives tuples of the form outputPipe -> (x, y, z, value, quartile25, quartile75)
Now I apply a custom filter that does the following for each tuple:
Calculate:
IQR = quartile75 - quartile25
LOW = quartile25 - 1.5 * IQR
HIGH = quartile75 + 1.5 * IQR
Filter out the tuple if value > HIGH or value < LOW.
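To make the filter rule concrete, here is a minimal plain-Java sketch of the check. It uses no Cascading classes; the class name IqrFilter and its method names are hypothetical and chosen only for illustration. In my actual job the same logic sits inside the custom filter.

```java
// Illustrative sketch only: IqrFilter and its methods are hypothetical names,
// not part of Cascading or any other library.
final class IqrFilter {
    private IqrFilter() {}

    /** LOW = quartile25 - 1.5 * IQR, where IQR = quartile75 - quartile25. */
    static double low(double quartile25, double quartile75) {
        double iqr = quartile75 - quartile25;
        return quartile25 - 1.5 * iqr;
    }

    /** HIGH = quartile75 + 1.5 * IQR, where IQR = quartile75 - quartile25. */
    static double high(double quartile25, double quartile75) {
        double iqr = quartile75 - quartile25;
        return quartile75 + 1.5 * iqr;
    }

    /** Returns true when the tuple should be filtered out. */
    static boolean isOutlier(double value, double quartile25, double quartile75) {
        return value > high(quartile25, quartile75)
            || value < low(quartile25, quartile75);
    }

    public static void main(String[] args) {
        // With quartile25 = 10 and quartile75 = 20: IQR = 10, LOW = -5, HIGH = 35.
        System.out.println(isOutlier(36.0, 10.0, 20.0)); // above HIGH
        System.out.println(isOutlier(35.0, 10.0, 20.0)); // exactly HIGH, kept
        System.out.println(isOutlier(-5.5, 10.0, 20.0)); // below LOW
    }
}
```

Values exactly equal to LOW or HIGH are kept, since the comparisons are strict.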
Unfortunately, this method is not performing very well. The amount of data we have is very large. I believe the problem is with the CoGroup, but I am not sure how to optimize it. I did not use a HashJoin instead of the CoGroup because the data in quartilePipe can itself be large and may not fit into memory.
Thanks a lot.
Regards,
Himanshu