Multiple outputs from a single MapReduce job


Matthieu Martin

Jan 10, 2012, 1:09:19 PM

to cascadi...@googlegroups.com

I'm coming from the world of Pig and wondering whether Cascading has anything like Pig's "multi-query execution" (see: https://pig.apache.org/docs/r0.7.0/piglatin_ref1.html#Multi-Query+Execution). In particular, I'm interested in creating multiple output files from a single MapReduce job (note: I'm not asking about creating multiple output files from a single Flow or Cascade). I've also heard this called creating "side files": since Hadoop, by default, writes all of a job's output to one directory, one needs "side files" to store distinct output sets from a single job.

For those of you who are familiar with Pig, the page linked above uses the following Pig script to illustrate "multi-query execution":

A = LOAD ...
...
SPLIT A' INTO B IF ..., C IF ...
...
STORE B' ...
STORE C' ...

This example is followed by a few notes on the optimizations Pig performs. From that list, I'm particularly interested in the following features of "multi-query execution":

2. Makes the split non-blocking and allows processing to continue. ...
3. Allows multiple outputs from a job. This way some results can be stored as a side-effect of the main job. ...

I hope that makes sense. And thanks in advance for your responses.

Matt Martin
Think Big Analytics
matt....@thinkbiganalytics.com

Chris K Wensel

Jan 10, 2012, 1:35:06 PM

to cascadi...@googlegroups.com

Yes on both counts.

Branching a pipe assembly down multiple paths is non-blocking. All branches will be submitted concurrently, in the fewest number of MR jobs. But keep in mind that Cascading is a physical planner, not a logical one, so you need to assemble the flows in the optimal way yourself; we won't rewrite the flow logically.
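For readers coming from Pig's SPLIT, the idea of a non-blocking branch can be sketched in plain Java. This is only a dependency-free model, not Cascading or Pig API: the class `SinglePassSplit`, the method `split`, and the two branch conditions are hypothetical names chosen for illustration. It shows one pass over the input feeding every branch, where a record may land in several branches or in none.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Plain-Java model of a non-blocking split; not Cascading or Pig API. */
public class SinglePassSplit {

    /**
     * Reads the input once and offers every record to every branch.
     * A record can land in several branches, or in none.
     */
    public static <T> Map<String, List<T>> split(
            Iterable<T> input, Map<String, Predicate<T>> branches) {
        Map<String, List<T>> outputs = new LinkedHashMap<>();
        for (String name : branches.keySet()) {
            outputs.put(name, new ArrayList<>());
        }
        for (T record : input) { // single pass over the data
            for (Map.Entry<String, Predicate<T>> branch : branches.entrySet()) {
                if (branch.getValue().test(record)) {
                    outputs.get(branch.getKey()).add(record);
                }
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, Predicate<Integer>> branches = new LinkedHashMap<>();
        branches.put("B", n -> n < 5);      // hypothetical condition for branch B
        branches.put("C", n -> n % 2 == 0); // hypothetical condition for branch C
        System.out.println(split(Arrays.asList(1, 2, 6, 7), branches));
    }
}
```

No branch waits on another here, which is the property the reply describes; how many MapReduce jobs a real flow needs is still decided by the planner and by how the assembly is built.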

As for multiple outputs, see the TemplateTap. It will let one stream write to an unlimited number of locations based on the values in the stream. Or use the MultiSinkTap if you just want to write the same data to multiple locations in different formats.
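To make "locations based on the values in the stream" concrete, here is a small plain-Java model of value-based routing. It does not use Cascading's TemplateTap API. The class `TemplateRouting`, the method `route`, the `output` base path, and the sample records are all hypothetical, and the `String.format`-style path template is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of value-based output routing; not Cascading's TemplateTap API. */
public class TemplateRouting {

    /**
     * Groups records by the path produced when a format-style template
     * is filled in with each record's own field values.
     */
    public static Map<String, List<String>> route(
            String basePath, String pathTemplate, List<String[]> records) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String[] fields : records) {
            // The record's values decide which "side file" it is written to.
            String path = basePath + "/" + String.format(pathTemplate, (Object[]) fields);
            files.computeIfAbsent(path, p -> new ArrayList<>()).add(String.join("\t", fields));
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"us", "2012", "a"},
                new String[] {"fr", "2012", "b"},
                new String[] {"us", "2012", "c"});
        // "%s-%s" consumes the first two fields; the formatter ignores the third.
        route("output", "%s-%s", records)
                .forEach((path, lines) -> System.out.println(path + " <- " + lines));
    }
}
```

The number of output locations is not fixed up front; it grows with the distinct values seen in the data, which is what separates this pattern from writing the same records to a fixed set of sinks.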

ckw

--

Chris K Wensel

ch...@concurrentinc.com

http://concurrentinc.com
