On Thu, Nov 26, 2009 at 11:51 PM, Esé <opusdp...@gmail.com> wrote:
> Hey folks,
>
> I have an interesting question regarding data partitioning via keys.
> To simplify the problem down to its essentials, let's say I am trying
> to find the # of unique viewers per channel from a stream of tuples
[...]
> However, I am wondering whether it's possible to get around this issue
> by somehow spreading the tuple aggregation to multiple reducers as an
> intermediate step - i.e. for example via grouping by (channelId + some
> other key), summing up by channel id and *then* as a final step,
> grouping by channelid for the final sum.
>
> Just wondering if this makes sense or is there some easy solution or
> primitive in cascading I can use.

The simplest thing you can do here is add a hash value to every row,
derived from the visitor id: (c, v, h), where, for example,
h = v % SOME_MAGICAL_NUMBER_N.

Then group by (c, h) and count the unique visitors in each group, then
group by (c) and sum those counts. Because h depends only on v, every
visitor lands in exactly one (c, h) group per channel, so the partial
counts add up to the correct per-channel total. Experiment with
SOME_MAGICAL_NUMBER_N for the best result.

Note that with regular Hadoop, without Cascading, you could just write
a custom partitioner for step #1. I wondered some time ago whether
there are any plans to introduce custom partitioning schemes to
Cascading:
http://groups.google.com/group/cascading-user/browse_thread/thread/bcc0bebc72959a3/9997a916a8d09b4b
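
To make the two-stage grouping above concrete, here is a rough in-memory sketch in plain Python. It is not Cascading or Hadoop code; it only simulates the two group-by steps. The function name, the sample data, and the assumption that visitor ids are integers are mine, and N_BUCKETS stands in for SOME_MAGICAL_NUMBER_N.

```python
from collections import defaultdict

# Hypothetical stand-in for SOME_MAGICAL_NUMBER_N; tune it experimentally.
N_BUCKETS = 4


def unique_viewers_two_stage(tuples, n_buckets=N_BUCKETS):
    """Count unique visitors per channel in two grouping stages.

    `tuples` is an iterable of (channel_id, visitor_id) pairs.
    Visitor ids are assumed to be integers so that v % n_buckets works.
    """
    # Stage 1: group by (c, h) with h = v % n_buckets and collect the
    # distinct visitors seen in each group. In a real job each (c, h)
    # group could be handled by a different reducer.
    buckets = defaultdict(set)
    for c, v in tuples:
        buckets[(c, v % n_buckets)].add(v)

    # One partial count per (c, h) group.
    partial_counts = {key: len(visitors) for key, visitors in buckets.items()}

    # Stage 2: group by c and sum the partial counts. This is safe because
    # a given visitor always maps to the same h, so no visitor is counted
    # in two groups of the same channel.
    totals = defaultdict(int)
    for (c, _h), count in partial_counts.items():
        totals[c] += count
    return dict(totals)


# Made-up sample data: (channel_id, visitor_id)
views = [
    ("news", 1), ("news", 2), ("news", 1), ("news", 6),
    ("sports", 2), ("sports", 2),
]
result = unique_viewers_two_stage(views)
```

With this sample, the work for "news" is split across two (c, h) groups instead of one, which is the point of the extra key: a very popular channel no longer funnels all of its tuples through a single reducer.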
--
WBR, Mikhail Yakshin