Dumping pipe to disk

yosib

Dec 12, 2011, 4:15:24 AM
to cascading-user, yo...@amobee.com
Hi,

I have the following scenario:

I am creating a Pipe that applies an Each with a Function. From this Pipe I
create several new Pipes, each with its own dedicated sink Tap, and then
build a single Flow from all of them.

The common part of all the Pipes (the one with the Function) is the most
time-consuming part of the entire process. The Function reads all the data,
parses it, and emits a new stream of Tuples containing only the data
relevant to the rest of the Pipes.

It would be great if Cascading ran the Function once over all the data,
dumped the result to disk (a tmp dir), and then ran all the sub-Pipes on
that output. What actually happens is that the Function is called again
for each sub-Pipe, which makes the entire process very slow.
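
To put rough numbers on that, here is a toy model in plain Java. It is not Cascading code, and `SharedStepDemo` and its methods are made up purely for illustration. It only counts how often a shared step runs when every branch re-evaluates it, compared with running it once and letting the branches read the stored result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: stands in for "one expensive Function feeding several sub-Pipes".
class SharedStepDemo {
    // Counts how many times the expensive step is invoked.
    static final AtomicInteger parseCalls = new AtomicInteger();

    // Placeholder for the costly parsing done in the shared Function.
    static String expensiveParse(String raw) {
        parseCalls.incrementAndGet();
        return raw.trim().toLowerCase();
    }

    // Every branch re-runs the shared step on every record.
    static int recomputePerBranch(List<String> input, int branches) {
        parseCalls.set(0);
        for (int b = 0; b < branches; b++) {
            for (String raw : input) {
                expensiveParse(raw);
            }
        }
        return parseCalls.get();
    }

    // The shared step runs once; each branch reads the stored output.
    static int materializeOnce(List<String> input, int branches) {
        parseCalls.set(0);
        List<String> parsed = new ArrayList<>();
        for (String raw : input) {
            parsed.add(expensiveParse(raw));
        }
        int consumed = 0;
        for (int b = 0; b < branches; b++) {
            for (String record : parsed) {
                consumed += record.length(); // branch-specific work would go here
            }
        }
        return parseCalls.get();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" A ", " B ", " C ", " D ");
        System.out.println(recomputePerBranch(input, 3)); // prints 12
        System.out.println(materializeOnce(input, 3));    // prints 4
    }
}
```

With 4 records and 3 branches the shared step runs 12 times in the first case and 4 times in the second, so the extra cost grows with the number of sub-Pipes.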

I know I can force Cascading to flush the Function results to disk by
creating a separate Flow for the common part and then making all the
other Flows work on the result of the first Flow.

I would like to know if there is a simpler way to do this. Can I give
Cascading a hint to dump the common part to disk and continue from there?


Yosi

Ken Krugler

Dec 12, 2011, 9:42:11 AM
to cascadi...@googlegroups.com
Implement the isSafe() method in your function, and have it return false.

-- Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
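
For readers landing here later, a minimal sketch of what that suggestion might look like. The class name `ParseRecords` and the trivial parsing body are hypothetical placeholders, and the raw (non-generic) types are used only so the same source should fit both older and newer Cascading releases. `BaseOperation` reports itself as safe by default, so the override is the only part that matters. Declaring the operation unsafe tells the planner it must not see the same record more than once, which in this scenario should lead it to write the Function's output to a temporary location before the branches split. Check the planner output for your version rather than taking that on faith.

```java
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

// Hypothetical shared Function; only the isSafe() override reflects the advice above.
public class ParseRecords extends BaseOperation implements Function {

    public ParseRecords(Fields fieldDeclaration) {
        super(1, fieldDeclaration); // expects one argument field
    }

    // Declare the operation unsafe to run more than once per record.
    @Override
    public boolean isSafe() {
        return false;
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        // Placeholder for the expensive parsing described in the question.
        String raw = functionCall.getArguments().getTuple().getString(0);
        functionCall.getOutputCollector().add(new Tuple(raw.trim()));
    }
}
```

It would be used like any other Function, for example `new Each(pipe, new Fields("line"), new ParseRecords(new Fields("parsed")))`, where the field names are again just illustrative.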



