strategy for groupByKey/reduceByKey


Bach Bui

Jan 25, 2013, 5:34:58 PM1/25/13
to spark...@googlegroups.com
Hi,

I'm trying to understand the groupByKey operation. Does the result for a given key reside on one machine? It seems to me that since the result is an RDD, multiple slaves may hold different groups for the same key. I have a similar question about reduceByKey.

And another question about the reduce operation: does it run only on the driver?

Thanks.

Reynold Xin

Jan 25, 2013, 5:39:06 PM1/25/13
to spark...@googlegroups.com
On Fri, Jan 25, 2013 at 2:34 PM, Bach Bui <bui.d...@gmail.com> wrote:
Hi,

I'm trying to understand the groupByKey operation. Does the result for a given key reside on one machine? It seems to me that since the result is an RDD, multiple slaves may hold different groups for the same key. I have a similar question about reduceByKey.

There is a "shuffle" in groupByKey, i.e. keys get redistributed across workers. Each worker ends up with a subset of the keys, and all values for the same key go to the same worker node.
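This isn't Spark code, just a plain-Scala sketch of the guarantee Reynold describes: a local `groupBy` mirrors what the shuffle provides across workers, in that every value for a given key ends up together in one place.

```scala
// Plain Scala collections (no cluster) illustrating groupByKey's guarantee:
// after the shuffle, all values for one key sit together on one worker.
val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5))

// Local groupBy mirrors the shuffle: key -> all of that key's values.
val grouped: Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

println(grouped("a")) // List(1, 3, 5)
```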
 

And another question about the reduce operation: does it run only on the driver?


reduce assumes the function is communicative and associative. It is run on each worker, and the results of the reduce from all workers are then sent to the master to perform the final reduce.
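A small plain-Scala sketch (no cluster) of how the two-level reduce composes: each "partition" stands in for a worker's local data, and for a commutative and associative function the per-partition reduce followed by a final reduce of the partials matches a single sequential reduce.

```scala
// Sketch of a distributed reduce with plain Scala collections:
// each "worker" reduces its partition, the driver reduces the partials.
val data = (1 to 10).toSeq
val partitions = data.grouped(3).toSeq  // pretend each chunk is one worker's partition

val partials = partitions.map(_.reduce(_ + _)) // per-worker local reduce
val total    = partials.reduce(_ + _)          // final reduce on the driver

// For a commutative and associative function, the two-level
// reduce gives the same answer as one sequential reduce.
assert(total == data.reduce(_ + _))
println(total) // 55
```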

 
Thanks.

Bach Bui

Jan 25, 2013, 5:44:31 PM1/25/13
to spark...@googlegroups.com
Thanks, Reynold, for the very concise answer.

Is there any optimization in groupByKey/reduceByKey over which worker does what? For example, if the values of a key are already on one machine, should the worker for that key be that machine?

Reynold Xin

Jan 25, 2013, 5:47:55 PM1/25/13
to spark...@googlegroups.com
There is a map-side reduce/aggregation before the shuffle for reduceByKey/groupByKey. It certainly helps a lot in reduceByKey if the reduce function actually "reduces" the amount of data.

In groupByKey, map-side aggregation won't help you much because the "reduce" function is simply adding items of the same key to a list. 
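To make the contrast concrete, here is a plain-Scala sketch (no cluster) of one worker's partition: a reduceByKey-style map-side combine emits one record per key, while a groupByKey-style combine only builds lists, so every value still has to cross the network.

```scala
// One worker's partition of key-value pairs.
val partition = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5))

// reduceByKey-style map-side combine (here: sum): one record per key
// leaves the node, so the shuffle shrinks from 5 records to 2.
val combined: Map[String, Int] =
  partition.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }
println(combined.size) // 2

// groupByKey-style "combine" just collects values into lists: all 5
// values still have to be shuffled across the network.
val listed: Map[String, Seq[Int]] =
  partition.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
println(listed.values.map(_.size).sum) // 5
```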

--
Reynold Xin, AMPLab, UC Berkeley

Mark Hamstra

Jan 25, 2013, 8:16:04 PM1/25/13
to spark...@googlegroups.com
reduce assumes the function is communicative and associative

I believe you mean 'commutative'.  And is that really a requirement, or is ordering preserved in Spark's reduce?  If ordering is not preserved, then the docs should really be updated to include the commutative requirement for reduce in addition to associativity.