Hey all
We just pushed 2.5.4-wip-101, which includes a planner fix for a case where a variation on HashJoin+Merge+GroupBy could cause duplicate data to be sent to the GroupBy when running in Hadoop mode on a cluster.
Note that a Merge+GroupBy is redundant (a GroupBy can merge multiple incoming pipes itself), so this isn't a common case, but it can show up in frameworks that render Cascading assemblies. FWIW, there is no performance penalty for doing a Merge+GroupBy, though.
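To make the redundancy point concrete, here is a rough plain-Java model. It is not the Cascading API and not the test assembly; the class and the `merge`/`groupBy` helpers are hypothetical stand-ins, and tuples are just "key:value" strings. It is only meant to show that grouping a merged stream gives the same groups as grouping the branches directly, and what a duplicated branch would look like at the GroupBy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model only: lists of "key:value" strings stand in for tuple streams.
public class MergeGroupBySketch {

    // Stand-in for a Merge: concatenates all branches into one stream.
    @SafeVarargs
    static List<String> merge(List<String>... branches) {
        List<String> merged = new ArrayList<>();
        for (List<String> branch : branches) {
            merged.addAll(branch);
        }
        return merged;
    }

    // Stand-in for a GroupBy over one or more branches: groups tuples by key.
    @SafeVarargs
    static Map<String, List<String>> groupBy(List<String>... branches) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (List<String> branch : branches) {
            for (String tuple : branch) {
                String key = tuple.substring(0, tuple.indexOf(':'));
                groups.computeIfAbsent(key, k -> new ArrayList<>()).add(tuple);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> joined = Arrays.asList("a:1", "b:2"); // pretend HashJoin output
        List<String> other = Arrays.asList("a:3");         // some other branch

        // Merge followed by GroupBy, versus GroupBy over both branches directly.
        System.out.println(groupBy(merge(joined, other)));
        System.out.println(groupBy(joined, other));

        // If one branch were fed in twice, the GroupBy would see duplicates.
        System.out.println(groupBy(merge(joined, joined, other)));
    }
}
```

The first two printed maps are the same, which is all "redundant" means here. The third shows the kind of duplicated input to the GroupBy that the planner fix is meant to prevent.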
You can see the test here for the assembly:
Full commit is here:
https://github.com/cwensel/cascading/commit/e8d271fc816ac46fe8e84dfdcf76fe73e26508a6
Anyway, please give it a whirl to make sure we haven't introduced any problems. Given positive feedback, I'll push out a 2.5.4 in a day or so.
ckw