Cascading 3.2 WIP

Chris K Wensel

Oct 13, 2016, 1:20:52 PM
to cascadi...@googlegroups.com
Hey all

Just a heads-up on changes in WIP 3.2 (wip-10 should publish later today).


Specifically:

Fixed issue on Apache Hadoop MapReduce where a c.t.Tap used in both accumulated and streamed roles within a single
step, but having distinct roles in different pipelines, could not be successfully planned and executed.

Fixed issue where a triangle of split through a c.p.HashJoin into a c.p.Merge on Apache Tez would fail
during planning.

Fixed issue where a triangle of split and joins on Apache Hadoop MapReduce would fail during planning.

Fixed issue where trivial pipe assemblies that reduced to assemblies with multiple edges between two elements would
fail, e.g. merging a file with itself.

In summary, we have better support for very small assemblies that do little more than fork, join, and merge data.

Counterintuitively, Cascading 3 put a lot of focus on handling very large and complex assemblies (DAGs of pipes and taps) in a timely manner, but extremely small yet complex assemblies were underserved to a degree, especially if the same file was used as a source through multiple branches.

The challenge with these assemblies was that there were multiple edges between two elements (a pipe or a tap) without any intermediate pipe (like an Each/Identity function) for the planner rules to match and capture in order for the rule to fire.

So two changes have been made internally. 

All ElementGraph objects (the set of classes that represent the logical and physical DAGs internally) are now multigraphs, so we don’t lose the additional edges during graph contractions (an operation that happens during rule matching to hide pipes/taps that are irrelevant to the rule being matched).
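
As a rough illustration of why this matters, here is a minimal, self-contained Java sketch of contracting two branch pipes out of a "merge a file with itself" graph. It is not Cascading's ElementGraph code: the class, the method names, and the vertex labels ("source", "lhs", "rhs", "merge") are all hypothetical. A Set stands in for a simple graph and a List for a multigraph.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;

// Illustrative only: NOT Cascading's ElementGraph. All names are hypothetical.
final class ContractionSketch {

    // A directed edge with value equality, so a Set collapses parallel edges.
    static final class Edge {
        final String from;
        final String to;

        Edge(String from, String to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Edge)) return false;
            Edge other = (Edge) o;
            return from.equals(other.from) && to.equals(other.to);
        }

        @Override
        public int hashCode() {
            return Objects.hash(from, to);
        }
    }

    // Contract (hide) one vertex: each predecessor is wired to each successor.
    // The 'out' collection decides whether parallel edges survive.
    static Collection<Edge> contract(Collection<Edge> edges, String hidden, Collection<Edge> out) {
        List<Edge> incoming = new ArrayList<>();
        List<Edge> outgoing = new ArrayList<>();
        for (Edge e : edges) {
            if (e.to.equals(hidden)) incoming.add(e);
            else if (e.from.equals(hidden)) outgoing.add(e);
            else out.add(e);
        }
        for (Edge in : incoming)
            for (Edge o : outgoing)
                out.add(new Edge(in.from, o.to));
        return out;
    }

    // List keeps duplicates (multigraph); Set drops them (simple graph).
    static Collection<Edge> newStore(boolean multigraph) {
        if (multigraph) return new ArrayList<Edge>();
        return new HashSet<Edge>();
    }

    // One source feeding two branches that are merged, i.e. a file merged with itself.
    static int edgesAfterContraction(boolean multigraph) {
        List<Edge> edges = List.of(
            new Edge("source", "lhs"), new Edge("lhs", "merge"),
            new Edge("source", "rhs"), new Edge("rhs", "merge"));
        Collection<Edge> hideLhs = contract(edges, "lhs", newStore(multigraph));
        Collection<Edge> hideBoth = contract(hideLhs, "rhs", newStore(multigraph));
        return hideBoth.size();
    }

    public static void main(String[] args) {
        System.out.println("simple graph edges: " + edgesAfterContraction(false));
        System.out.println("multigraph edges:   " + edgesAfterContraction(true));
    }
}
```

In this toy model the simple graph ends up with a single source-to-merge edge, while the multigraph keeps both, which is the information a planner would need to know that the merge still has two inputs.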

Rudimentary support for edge capture/exclusion has been added. That is, the isomorphism matching algorithm we use internally (based on VF2) only captures vertices, but now, in some cases, we not only match edges, the matched edges can also be excluded from the resulting transformed graph. There are limitations, but it's sufficient so far.

So, for tight assemblies where a source is accumulated on a HashJoin but also fed into a Merge, we have a situation where the input feeds two different code paths (what we call pipelines). Suffice it to say, without the ability to exclude an edge (the edge from the source to the Merge), we can't create a valid pipeline in which the source in question feeds the accumulated side of the HashJoin in its entirety, versus a split of the file feeding the Merge in a different pipeline.
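
To make the triangle concrete, here is a small, self-contained Java sketch. It assumes a much simplified model in which a match captures a set of vertices and the resulting sub-graph is whatever edges run between them, minus any explicitly excluded edges. This is not the planner's actual rule or matching code; the class, method, and vertex names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy model of edge exclusion, NOT Cascading's planner.
final class TriangleSketch {

    // Keep every edge whose endpoints were both captured, unless it is excluded.
    static List<String> induced(List<String[]> edges, Set<String> captured, Set<String> excluded) {
        List<String> kept = new ArrayList<>();
        for (String[] e : edges) {
            String name = e[0] + "->" + e[1];
            if (captured.contains(e[0]) && captured.contains(e[1]) && !excluded.contains(name))
                kept.add(name);
        }
        return kept;
    }

    // The triangle: 'source' feeds the HashJoin and also feeds the Merge directly.
    static List<String> pipelineEdges(boolean excludeSourceToMerge) {
        List<String[]> edges = List.of(
            new String[] {"source", "HashJoin"},  // accumulated side of the join
            new String[] {"other", "HashJoin"},   // streamed side of the join
            new String[] {"source", "Merge"},     // same source, different pipeline
            new String[] {"HashJoin", "Merge"});
        Set<String> captured = Set.of("source", "HashJoin", "Merge");
        Set<String> excluded = excludeSourceToMerge ? Set.of("source->Merge") : Set.<String>of();
        return induced(edges, captured, excluded);
    }

    public static void main(String[] args) {
        System.out.println("vertex capture only: " + pipelineEdges(false));
        System.out.println("with edge exclusion: " + pipelineEdges(true));
    }
}
```

With vertex capture alone, the source-to-Merge edge is dragged into the sub-graph along with the accumulated path. Excluding it leaves only the source-to-HashJoin and HashJoin-to-Merge edges in this toy model.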

Check out the commits to see the new and updated tests we now support.


ckw

