Filtering Groups

kdel...@liveramp.com

Jun 30, 2016, 6:01:13 PM

to cascading-user

A common challenge I run into: I have the result of a GroupBy, and for each group I want to pass either all of the values to the output or none of them, but I don't know which until I've seen every value in the group. What's the best way to do this? I know I cannot reset the arguments iterator from bufferCall (to iterate twice), nor can I chain Buffers together, and caching a group's values in memory will usually result in running out of memory. What do you suggest?

-Kevin

Chris K Wensel

Jun 30, 2016, 6:41:41 PM

to cascadi...@googlegroups.com

if i understand correctly,

just put an upstream Function in play to flag the tuple as undesirable, then secondary sort on the flag. if the flag shows up at the top of the iterator in the buffer, discard the iterator.

also, look at how AggregateBy works. you could build a Function that when it sees the undesirable grouping, sends a flag across the wire, but also begins aggressively dropping any new tuples found in the grouping.

do the first option first, optimize with the second.

ckw
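
The first option above can be sketched in plain Java, without the Cascading API. This is only an illustration under assumptions: the `Row` class, the `FLAG_FIRST` comparator, and the `filterGroup` and `run` methods are hypothetical names, the comparator stands in for the secondary sort, and `filterGroup` stands in for the Buffer. It relies on the flag being computable per tuple by the upstream step.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class FlagFirstFilter {
    /** Hypothetical tuple: grouping key, payload, and the flag set upstream. */
    static final class Row {
        final String key;
        final String value;
        final boolean undesirable;

        Row(String key, String value, boolean undesirable) {
            this.key = key;
            this.value = value;
            this.undesirable = undesirable;
        }
    }

    /** Stand-in for the secondary sort: flagged rows sort to the front of the group. */
    static final Comparator<Row> FLAG_FIRST =
            Comparator.comparing((Row r) -> !r.undesirable);

    /**
     * Stand-in for the Buffer. Reads the group's iterator once and streams
     * rows to the emitter, so this method never holds the group in memory.
     */
    static void filterGroup(Iterator<Row> sortedGroup, Consumer<Row> emit) {
        if (!sortedGroup.hasNext()) {
            return;
        }
        Row first = sortedGroup.next();
        if (first.undesirable) {
            return; // a flag at the top means the whole group is discarded
        }
        emit.accept(first);
        while (sortedGroup.hasNext()) {
            emit.accept(sortedGroup.next());
        }
    }

    /** Demo helper: sorts one group in memory (demo only) and filters it. */
    static List<String> run(List<Row> group) {
        List<Row> sorted = new ArrayList<>(group);
        sorted.sort(FLAG_FIRST);
        List<String> out = new ArrayList<>();
        filterGroup(sorted.iterator(), r -> out.add(r.value));
        return out;
    }
}
```

Because flagged rows sort first, inspecting a single row is enough to decide the fate of the whole group, and the rest of the iterator is either streamed through or skipped.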

—

Chris K Wensel

ch...@wensel.net

Ken Krugler

Jun 30, 2016, 6:56:21 PM

to cascadi...@googlegroups.com

Hi Kevin,

One approach is to split your stream, and on the RHS create a custom Buffer that emits only the grouping field(s) for groups that pass your test.

Then HashJoin that against the original data, with a RightJoin, and you’ll retain only the records that you want.

— Ken

--------------------------

Ken Krugler

+1 530-210-6378

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Cassandra & Solr

Kevin Delgado

Jul 6, 2016, 1:32:26 PM

to cascading-user

But this only works if the flag can be applied independently of the other tuples in the group. I am asking about the situation where I cannot tell whether a group should be flagged for deletion until I have seen all the values in that group. For example, if I have a stream of records with two types of ids, id1 and id2, I need to group all the records by id1, and then delete EVERY record with the same id1 if ANY two records with that id1 have duplicate id2. A simple upstream Function will not work because I cannot determine whether an id2 is a duplicate within a group of id1 without seeing all the tuples for that id1. I know I can use two GroupBy and Buffer calls (the first for flagging, the second for filtering), but is there a way to do it with just one?

Ken Krugler

Jul 6, 2016, 1:44:06 PM

to cascadi...@googlegroups.com

Unless you can buffer each group in memory, you don’t really have another good option.

Note though that the HashJoin I’d mentioned as the second step is a map-side operation, so you’re only doing one reduce.

— Ken
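
For the id1/id2 example, the split-then-join approach can be sketched in plain Java without the Cascading API. Everything here is an assumption made for illustration: the `Rec` class and the `passingKeys` and `filter` methods are hypothetical, the in-memory sort stands in for a GroupBy on id1 with a secondary sort on id2, and the key set stands in for the side of the join that is held in memory. It assumes the set of passing id1 values is small enough to fit in memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GroupKeyJoinFilter {
    /** Hypothetical record carrying the two ids from the example. */
    static final class Rec {
        final String id1;
        final String id2;

        Rec(String id1, String id2) {
            this.id1 = id1;
            this.id2 = id2;
        }
    }

    /**
     * Pass 1, the "Buffer" branch: walk records ordered by (id1, id2) and
     * collect every id1 whose group contains no repeated id2. The sort below
     * stands in for the GroupBy; the loop itself only tracks the previous record.
     */
    static Set<String> passingKeys(List<Rec> records) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing((Rec r) -> r.id1)
                .thenComparing((Rec r) -> r.id2));

        Set<String> passing = new HashSet<>();
        Rec prev = null;
        boolean groupOk = true;
        for (Rec r : sorted) {
            if (prev != null && !prev.id1.equals(r.id1)) {
                if (groupOk) {
                    passing.add(prev.id1); // previous group ended without a duplicate
                }
                groupOk = true;
            } else if (prev != null && prev.id2.equals(r.id2)) {
                groupOk = false; // adjacent duplicate id2 within the same id1
            }
            prev = r;
        }
        if (prev != null && groupOk) {
            passing.add(prev.id1); // close the final group
        }
        return passing;
    }

    /** Pass 2, the join: stream the original records and keep those whose id1 passed. */
    static List<Rec> filter(List<Rec> records) {
        Set<String> keys = passingKeys(records);
        List<Rec> out = new ArrayList<>();
        for (Rec r : records) {
            if (keys.contains(r.id1)) {
                out.add(r);
            }
        }
        return out;
    }
}
```

Sorting on id2 within each id1 makes duplicates adjacent, so the first pass never needs to hold a whole group; the second pass is a simple lookup per record, which is why it needs no further grouping step.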

Kevin Delgado

Jul 6, 2016, 1:51:19 PM

to cascading-user

Thanks Ken, I wasn't clear on what the advantage was to the HashJoin as the second step when you originally posted that, but now I see that it is because it is done map-side.

-Kevin
