possible data duplication bug


Chris K Wensel

Apr 21, 2014, 6:37:45 PM
to cascadi...@googlegroups.com

Hey all

We just pushed 2.5.4-wip-101, which includes a planner fix for a case where a variation on HashJoin+Merge+GroupBy could send duplicate data to the GroupBy when running in Hadoop mode on a cluster.

Note that a Merge+GroupBy is redundant (a GroupBy can merge multiple incoming pipes itself), so this isn't a common case, but it can show up in frameworks that render Cascading assemblies. fwiw, there is no performance penalty for doing a Merge+GroupBy.
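
If you want a quick way to spot the symptom in your own output, one option is to compare the sum of the per-group counts against the number of records you expected to feed into the GroupBy. The sketch below is plain Java that only models that check; it does not use the Cascading API, and the class and method names (`DuplicateCheck`, `countByKey`, `hasDuplicates`) are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: models the sanity check only, not the Cascading planner.
class DuplicateCheck {

    // Count how many records arrive for each grouping key.
    static Map<String, Long> countByKey(List<String> keys) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // True when the grouped counts add up to more than the expected record count.
    static boolean hasDuplicates(Map<String, Long> counts, long expectedRecords) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        return total > expectedRecords;
    }

    public static void main(String[] args) {
        // Assumed sample data: keys as seen by the grouping step.
        List<String> seenByGroupBy = Arrays.asList("a", "b", "a", "a", "b", "a");
        // Assumed number of records that were actually fed in upstream.
        long expectedRecords = 3;

        Map<String, Long> counts = countByKey(seenByGroupBy);
        System.out.println(counts + " duplicates=" + hasDuplicates(counts, expectedRecords));
    }
}
```

This only catches duplication that changes the total count, so treat it as a smoke test rather than proof that a flow is correct.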

You can see the test here for the assembly:

Full commit is here:
https://github.com/cwensel/cascading/commit/e8d271fc816ac46fe8e84dfdcf76fe73e26508a6

Anyway, please give it a whirl to make sure we haven't introduced any problems. I'll push out a 2.5.4 in a day or so with positive feedback.

ckw

Chris K Wensel

Apr 24, 2014, 10:09:36 PM
to cascadi...@googlegroups.com
fyi, planning on pushing 2.5.4 tomorrow. please let me know if you've noticed any issues.

fwiw, we have tested Scalding develop branch and it looks good.

ckw
