AggregateBy and sort support


Ken Krugler

Jun 1, 2012, 1:38:20 PM
to cascadi...@googlegroups.com
Hi Chris,

While looking into implementing a FirstBy, I was expecting that I could specify sorting fields in the AggregateBy constructor.

This would then let me efficiently use First as my aggregator.

But AggregateBy currently doesn't let you specify sorting.

In the 1.2.5 code it looks like this would be a trivial change, as it would just get used in the initialize() method when setting up the GroupBy:

   Pipe pipe = new GroupBy( name, functions, groupingFields );

-- Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Chris K Wensel

Jun 4, 2012, 9:55:43 AM
to cascadi...@googlegroups.com
good catch. will make something available soon.

ckw

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.


Oscar Boykin

Jun 4, 2012, 1:14:52 PM
to cascadi...@googlegroups.com
I'm confused here.

How can sorting be handled in any reasonable way with AggregateBy? The mappers are going to do aggregation before sending to the reducers. Am I missing the use case?

--
Oscar Boykin :: @posco :: https://twitter.com/intent/user?screen_name=posco



Chris K Wensel

Jun 4, 2012, 1:21:29 PM
to cascadi...@googlegroups.com
actually, I haven't even had time to ponder this at all. so my _fix_ might be simply updating the javadoc..

fwiw, been a tad busy, new .org site is up (and .com) and we will be pushing out 2.0 tomorrow if all goes well..

ckw

--
Chris K Wensel
ch...@concurrentinc.com
http://concurrentinc.com

Ken Krugler

Jun 4, 2012, 2:31:20 PM
to cascadi...@googlegroups.com
I might also be confused, but here's the thinking…

I'm using a First to find the first (sorted) entry in each group, so I currently do a GroupBy with a sort field, followed by the First.

It would be more efficient to not shuffle a bunch of tuples from the map to the reduce where they won't be getting used.

So I was planning on implementing a FirstBy, where the Functor is given the sort field(s), and uses those for comparison to pre-discard anything that doesn't sort first.

But AggregateBy always does a GroupBy with no sorting field specified, so the reduce operation can't do a First to complete the operation.
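A minimal sketch of the FirstBy idea described above, written in plain Java rather than against Cascading's API. Every name here (FirstBySketch, Row, partialFirst, mergeFirst) is hypothetical and only illustrates the shape of the computation: each map-side split discards everything except the row that sorts first in its group, and the reduce side applies the same comparison to the survivors.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from Cascading.
public class FirstBySketch {

    /** Hypothetical stand-in for a tuple: a grouping key plus one sort field. */
    public static final class Row {
        public final String group;
        public final int sortValue;

        public Row(String group, int sortValue) {
            this.group = group;
            this.sortValue = sortValue;
        }
    }

    /** Keeps whichever row sorts first; on a tie the row already held wins. */
    static Row first(Row held, Row candidate, Comparator<Row> order) {
        return order.compare(held, candidate) <= 0 ? held : candidate;
    }

    /** Map side: reduce one input split to at most one row per group. */
    public static Map<String, Row> partialFirst(List<Row> split, Comparator<Row> order) {
        Map<String, Row> best = new HashMap<>();
        for (Row row : split) {
            best.merge(row.group, row, (held, candidate) -> first(held, candidate, order));
        }
        return best;
    }

    /** Reduce side: combine the per-split survivors with the same rule. */
    public static Map<String, Row> mergeFirst(List<Map<String, Row>> partials,
                                              Comparator<Row> order) {
        Map<String, Row> best = new HashMap<>();
        for (Map<String, Row> partial : partials) {
            for (Row row : partial.values()) {
                best.merge(row.group, row, (held, candidate) -> first(held, candidate, order));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Comparator<Row> order = Comparator.comparingInt(r -> r.sortValue);

        // Two splits, as two mappers might see them.
        List<Row> split1 = Arrays.asList(new Row("a", 5), new Row("a", 2), new Row("b", 9));
        List<Row> split2 = Arrays.asList(new Row("a", 3), new Row("b", 4));

        Map<String, Row> merged = mergeFirst(
                Arrays.asList(partialFirst(split1, order), partialFirst(split2, order)),
                order);

        System.out.println("a -> " + merged.get("a").sortValue); // prints "a -> 2"
        System.out.println("b -> " + merged.get("b").sortValue); // prints "b -> 4"
    }
}
```

In this sketch only two rows per split cross from the map side to the reduce side, and the reduce side compares rows directly instead of relying on sorted input.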

Or is there another, better way to handle this?

Thanks,

-- Ken


Oscar Boykin

Jun 4, 2012, 5:11:31 PM
to cascadi...@googlegroups.com
I guess I'd implement a MaxBy which would use the selected fields'
comparator to do the comparison.

We did something like this in scalding (I guess you can do the mental
translation from scala to java for this very java-like code):

https://github.com/twitter/scalding/blob/master/src/main/scala/com/twitter/scalding/Operations.scala#L253

Ken Krugler

Jun 4, 2012, 5:41:48 PM
to cascadi...@googlegroups.com
On Jun 4, 2012, at 2:11pm, Oscar Boykin wrote:

> I guess I'd implement a MaxBy which would use the selected fields'
> comparator to do the comparison.

Has anybody compared the performance of Hadoop sort + pick first (Filter) versus Max?

I was assuming that the former would be faster.
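For reference, the two approaches being compared select the same element; the question is only which is cheaper inside a Hadoop job, where the shuffle sorts map output anyway. The sketch below is plain Java with hypothetical names (SortFirstVsMin, sortThenFirst, singlePassMin) and shows the equivalence only. It makes no claim about actual Hadoop or Cascading performance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration only; these names are not part of Hadoop or Cascading.
public class SortFirstVsMin {

    /** Sorts a copy of the (non-empty) list and returns its first element. */
    public static <T> T sortThenFirst(List<T> values, Comparator<? super T> order) {
        List<T> sorted = new ArrayList<>(values);
        sorted.sort(order); // stable sort, so the earliest of equal elements stays first
        return sorted.get(0);
    }

    /** Scans the (non-empty) list once, keeping the smallest element seen so far. */
    public static <T> T singlePassMin(List<T> values, Comparator<? super T> order) {
        T best = values.get(0);
        for (T value : values.subList(1, values.size())) {
            if (order.compare(value, best) < 0) { // strict, so the earliest of equal elements wins
                best = value;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(7, 3, 9, 3, 5);
        Comparator<Integer> order = Comparator.naturalOrder();

        System.out.println(sortThenFirst(values, order)); // prints 3
        System.out.println(singlePassMin(values, order)); // prints 3
    }
}
```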

-- Ken