[Scalding] - Using the value of the group key in the reduce operation


Neta

Dec 19, 2013, 2:49:23 PM
to cascadi...@googlegroups.com
Hi,

I'm writing a Scalding job that needs to group by some key and then use the key's value in the reduce operation. How can this be done?

For example, if I calculate a histogram I can use constant width=0.1 and write:

    .groupBy('group){_.histogram('value->'hist, 0.1)}

But I'd like the width to depend on the values in the stream and write something like:

    .groupBy(('group,'width)){_.histogram('value->'hist, 'width)}

In vanilla MapReduce I have access to the key:

    void reduce(K2 key,
                Iterator<V2> values,
                ...

(I know I can create a Histogram from each value using a map operation and then reduce the Histograms, but I'd like to know if there's a more elegant way.)

Thanks.
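
The workaround mentioned in the parenthetical above could look roughly like this, written against plain Scala collections rather than the Scalding API. `HistogramSketch`, `Hist`, `single`, `merge`, and `histograms` are hypothetical names, and the bin-index-to-count map is an assumed representation, not Scalding's own histogram type.

```scala
object HistogramSketch {
  // Assumed representation: bin index -> number of values that fell in that bin.
  type Hist = Map[Long, Long]

  // Map step: one value becomes a single-bin histogram, using that record's own width.
  def single(value: Double, width: Double): Hist =
    Map(math.floor(value / width).toLong -> 1L)

  // Reduce step: add counts bin by bin. This is associative, so it can run as a reduce.
  def merge(a: Hist, b: Hist): Hist =
    b.foldLeft(a) { case (acc, (bin, n)) => acc.updated(bin, acc.getOrElse(bin, 0L) + n) }

  // Rows are (group, width, value); group by (group, width), then map and reduce.
  def histograms(rows: Seq[(String, Double, Double)]): Map[(String, Double), Hist] =
    rows
      .groupBy { case (g, w, _) => (g, w) }
      .map { case (key, rs) =>
        key -> rs.map { case (_, w, v) => single(v, w) }.reduce((a, b) => merge(a, b))
      }
}
```

The width travels with every record here, which is why the reduce step never needs to see the key; the cost is one small map allocated per value.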



David Shimon

Jan 14, 2014, 3:59:23 AM
to cascadi...@googlegroups.com
Hi,


Recently, @johnynek opened a pull request, "add mapGroup to the typed API".

    def mapGroup[V](smfn : (K, Iterator[T]) => Iterator[V]): This[K, V]

This method was added to the KeyedListLike trait, so you can now use the key as well.

(Another reason to use the Typed API...)
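
To illustrate the shape of that signature without depending on Scalding, here is a minimal plain-Scala sketch. The `mapGroup` below is a local stand-in over standard collections, not the library method; `MapGroupSketch`, `binCounts`, and the (group, width, value) row layout are assumptions made for the example.

```scala
object MapGroupSketch {
  // Local stand-in with the same function shape as the quoted signature:
  // the function receives the key together with an iterator over that key's values.
  def mapGroup[K, T, V](grouped: Map[K, Seq[T]])(fn: (K, Iterator[T]) => Iterator[V]): Map[K, Seq[V]] =
    grouped.map { case (k, vs) => k -> fn(k, vs.iterator).toList }

  // Rows are (group, width, value). The key is (group, width), so the bin width
  // is read from the key inside the per-group function instead of being a constant.
  def binCounts(rows: Seq[(String, Double, Double)]): Map[(String, Double), Seq[(Long, Long)]] = {
    val grouped: Map[(String, Double), Seq[Double]] =
      rows.groupBy { case (g, w, _) => (g, w) }.map { case (k, rs) => k -> rs.map(_._3) }

    mapGroup(grouped) { (key, values) =>
      val width = key._2 // the group key is visible here
      values.toList
        .groupBy(v => math.floor(v / width).toLong)
        .map { case (bin, vs) => (bin, vs.size.toLong) }
        .toList
        .sortBy(_._1)
        .iterator
    }
  }
}
```

Each result entry is a list of (bin index, count) pairs sorted by bin index. In a real job the grouping would be done by the framework; only the body of the function passed to `mapGroup` would carry over.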

Oscar Boykin

Jan 14, 2014, 10:39:20 AM
to cascadi...@googlegroups.com
In the fields API you should be able to access key fields in the reduce operations. Did this fail for you?
--
Oscar Boykin :: @posco :: http://twitter.com/posco

Neta

Jan 14, 2014, 11:46:03 AM
to cascadi...@googlegroups.com
Thanks, David.

Neta

Jan 14, 2014, 11:52:10 AM
to cascadi...@googlegroups.com
Oscar, in this example the compilation fails (a Symbol is given where a Double is expected).

I didn't see how I can access the key, but I might have missed something.

Thanks.


m.orazow

May 6, 2014, 2:48:31 AM
to cascadi...@googlegroups.com
Hey Oscar,

Could you please give some pointers on how to use the group key in the Fields API?

Best


Søren Holbech

Jul 15, 2016, 2:56:26 AM
to cascading-user
I could really use an example of this as well. See this snippet for the use case/problem:

pipe.groupBy('key){
  _.reduce('value){ (left:Int,right:Int) =>
    (left + right) match {
      case tooMuch if tooMuch > threshold => { println(s"Overflow in ${ key? }"); 0}
      case ok => ok
    }
  }
}