I have a query which uses the same subquery several times. It looks
something like
S3 SrcData => QueryA => QueryB => QueryC => etc.
^
||
S3 SrcData => QueryA
QueryA is being called twice and it would be nice to be able to
"cache" the result in a temp file in HDFS instead of running the query
twice. This matters all the more because QueryA is a filter job on the
SrcData, so the cached result would be much smaller.
I found a few threads on the Cascading list which say to implement
the isSafe() function on Operation to return false:
http://groups.google.com/group/cascading-user/browse_thread/thread/59e3463093c1eebb#
http://groups.google.com/group/cascading-user/browse_thread/thread/cd283dadc6f76bbe/d09111e95b6a8852?lnk=gst&q=%22Dumping+pipe+to+disk#d09111e95b6a8852
Is there any way to access and implement this from Cascalog? Thanks
Andy
https://docs.google.com/drawings/d/1JClgjaMOimujRtOThMoSZCihcvG6k_8FWdkAe-QS7NU/edit
On Dec 16, 1:07 pm, Andrew Xue <and...@lumoslabs.com> wrote: