Optimizing Outlier filtering using Cascading

14 views
Skip to first unread message

HIMANSHU VERMA

unread,
Oct 10, 2017, 10:01:12 AM10/10/17
to cascading-user
Hi Everyone,
I am currently working on a problem of outlier filtering. I have implemented a way but it is not performing well. Here is my approach:

Step 1:
I have input tuples in  inputPipe  -> (x, y, z, value)
I create a copy of this inputPipe and call it quartilePipe
I use groupby(groupby key fields -> x,y,z) and a custom aggregator on quartilePipe to compute the quartiles (25% and 75%) (Quartile calculation is quite optimized and uses the similar method as mentioned https://mahout.apache.org/docs/0.13.1-SNAPSHOT/javadocs/org/apache/mahout/math/stats/OnlineSummarizer.html)

output tuples quartilePipe -> (x, y, z, quartile25, quartile75)

Step2: 
I use a CoGroup to to join quartilePipe and original inputPipe to get tuples in form of  outputPipe-> (x, y, z, value, quartile25, quartile75)
Now, I apply a custom filter and do the following for each:
 calculate,
 IQR = quartile75 - quartile25
LOW = quartile25 - 1.5*IQR
HIGH = quartile75 + 1.5 * IQR

filter out tuples if value>HIGH or value <LOW


Unfortunately,
this method is not performing very well. Amount of data which we have is humongous. I believe that the problem is with the CoGroup but not sure how to optimize it. I didnt use HashJoin instead of CoGroup as the data in quartilePipe itself can be a lot and may not fit into memory.

Thanks a lot.

Regards,
Himanshu

Ken Krugler

unread,
Oct 10, 2017, 6:18:16 PM10/10/17
to cascadi...@googlegroups.com
Hi Himanshu,

The interesting question is how many unique keys (combinations of x, y, z) you’ve got, versus total number of input tuples.

If this is reasonably small (e.g. < 10M, depends a bit on the size of the key) then a HashJoin should work.

Note that (a) you want to put the quartilePipe on the right side of the HashJoin, and (b) you can watch the logs and if you see spills to disk happening during the HashJoin, you can increase the number of values that you keep in memory before spilling (max for this will be limited by map task memory)

— Ken

--------------------------------------------

HIMANSHU VERMA

unread,
Oct 11, 2017, 10:21:51 AM10/11/17
to cascading-user
Hi Ken,
Thanks for your response. In actual problem, the key is seven dimensional (a,b,c,d,e,f,g). Number of such combinations may be less than 10M today but it will eventually grow. So, even if HashJoin works today, it may fail in future. 

Ken Krugler

unread,
Oct 11, 2017, 10:51:59 AM10/11/17
to cascadi...@googlegroups.com
Hi Himanshu,

If you can’t/don’t want to use HashJoin, then you either are left with optimizing your use of CoGroup, or (potentially) trying some helicopter stunts.

E.g. don’t split the stream, do the following…

1. Group on the key fields
2. Create a custom buffer operator that calculates the two quartiles you need.
3. This buffer operator emits all incoming tuples with special values for the quartiles, and (at the end) the two actual values you need.
4. Do a second group on the same key fields, and specify sorting of the quartile values such that your actual calculated values are first.
5. Use another custom buffer that grabs the first incoming tuple (the one with the real values) and applies it to all subsequent tuples in the group.

This last step can do the calculation and the filtering at the same time.

— Ken


--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/be5e8bb1-cefb-4617-8290-931854de9348%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages