user controlled partitioning in cascading


Oscar Boykin

May 22, 2015, 9:34:13 PM
to cascadi...@googlegroups.com
What are the options for people who want to do a grouping, then control what filename each key group is written to?

Some people here have basically reverse-engineered the hashing and the way the files are currently created, so that they can do total sorts and have an external system load the files knowing which files contain which ranges of the keys.
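
For readers unfamiliar with what is being reverse-engineered, here is a minimal sketch of the kind of mapping involved. It mirrors the formula used by Hadoop's default HashPartitioner (sign bit masked off, then modulo the number of reducers) and assumes the usual `part-NNNNN` output naming. Cascading may hash grouping keys differently, so this illustrates the idea rather than describing Cascading's actual behavior; the class and method names are hypothetical.

```java
final class HashPartitionSketch {
    // Same arithmetic as Hadoop's default HashPartitioner: clear the sign bit
    // so the result is non-negative, then take the remainder by the partition count.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Assumed naming convention: "part-" plus a zero-padded five-digit partition number.
    static String fileNameFor(int partition) {
        return String.format("part-%05d", partition);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        for (String key : new String[] {"a", "b", "c"}) {
            int partition = partitionFor(key, numPartitions);
            System.out.println(key + " -> " + fileNameFor(partition));
        }
    }
}
```

Any external system that hard-codes a mapping like this depends on the hash function, the partition count, and the file naming all staying the same, which is exactly the fragility described above.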

I don't think there is an actual API for this now in cascading, but how difficult would it be to add one? I'm afraid of us getting stuck on an old version, since critical pipelines might assume a certain partitioning that is then broken by future versions of cascading.
--
Oscar Boykin :: @posco :: http://twitter.com/posco

Chris K Wensel

May 23, 2015, 3:17:09 AM
to cascadi...@googlegroups.com
One option is specifying a DecoratorTap sub-class that manages the intermediate files. PartitionTap does some work with managing filenames, by virtue of managing folder/directory names.

thus writing files based on partitions, and using a predicate to only read those that matter would be interesting. unfortunately in 3.0 i wasn’t able to collapse PartitionTap with DecoratorTap.
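
As a rough illustration of "writing files based on partitions" and reading back only those that matter, the sketch below assigns keys to range partitions using sorted split points and then selects only the directories whose range can overlap a requested key interval. It is plain Java and does not use the Cascading API; the class name, the `range-NNNN` directory naming, and the split points are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RangePartitionSketch {
    // Sorted split points; split point i is the inclusive lower bound of partition i + 1.
    private final String[] splitPoints;

    RangePartitionSketch(String... splitPoints) {
        this.splitPoints = splitPoints.clone();
        Arrays.sort(this.splitPoints);
    }

    // Binary search for the partition whose key range contains the key.
    int partitionFor(String key) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // Exact hit on a split point: the key starts the next partition.
        // Otherwise binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    // Hypothetical directory naming, one directory per range partition.
    String directoryFor(int partition) {
        return String.format("range-%04d", partition);
    }

    // The "predicate": only directories that may hold keys in [low, high].
    // Assumes low <= high.
    List<String> directoriesFor(String low, String high) {
        List<String> dirs = new ArrayList<>();
        for (int p = partitionFor(low); p <= partitionFor(high); p++) {
            dirs.add(directoryFor(p));
        }
        return dirs;
    }

    public static void main(String[] args) {
        RangePartitionSketch sketch = new RangePartitionSketch("g", "n", "t");
        System.out.println(sketch.directoriesFor("h", "p"));
    }
}
```

Because the split points are explicit, an external loader can know which directory holds which key range without depending on a hash function.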

(a parquet tap would allow for column projection, this might be more interesting)

hadoop provides little control over file names, and Cascading spends much time overcoming this to at least provide control over folder names.

subsequently, a HashedTap extending DecoratorTap could be implemented to do this. (as of 2.7 you can have the planner wrap intermediate taps with a DecoratorTap for intra flow taps, see DistCacheTap)

that said, more control over this could come from leveraging the Cascading 3 planner to push down predicates to the intermediate taps, though intermediate taps only come into play when using MapReduce.

but that seems moot in the face of actually not writing intermediate data to hdfs between hash partitioned pairs of work.

I would suggest giving Cascading 3 and the Tez platform support a run through.

or time can be spent whipping incremental improvements out of a dead horse.

ckw
