dumping temporary values between subqueries


Andrew Xue

Dec 16, 2011, 4:07:14 PM
to cascalog-user
Hi

I have a query which uses the same subquery several times. It looks
something like

S3 SrcData => QueryA => QueryB => QueryC => etc.
^
||
S3 SrcData => QueryA


QueryA is being called twice and it would be nice to be able to
"cache" the result in a temp file in HDFS instead of running the query
twice. This is especially true because QueryA is a filter job on the
SrcData and the cached result would be much smaller.

I found a few threads on the cascading list which say to implement
the isSafe() function on Operation to return false:

http://groups.google.com/group/cascading-user/browse_thread/thread/59e3463093c1eebb#
http://groups.google.com/group/cascading-user/browse_thread/thread/cd283dadc6f76bbe/d09111e95b6a8852?lnk=gst&q=%22Dumping+pipe+to+disk#d09111e95b6a8852

Is there any way to access and implement this from Cascalog? Thanks

Andy

Andrew Xue

Dec 16, 2011, 4:11:36 PM
to cascalog-user
ok, the ascii picture didn't work so here is a google doc drawing

https://docs.google.com/drawings/d/1JClgjaMOimujRtOThMoSZCihcvG6k_8FWdkAe-QS7NU/edit

nathanmarz

Dec 21, 2011, 5:20:23 PM
to cascalog-user
Cascading will do this automatically as long as the subquery includes
a reduce step... if it's map-only (e.g. just a filter) it will redo
query A. I opened up an issue to expose the isSafe method for
subqueries so you can force this optimization: https://github.com/nathanmarz/cascalog/issues/38

You can probably force this optimization now by implementing a regular
Cascading filter that never removes a tuple (its isRemove always
returns false) and that overrides isSafe to return false. So something
like:

(<- [?foo ?bar]
    (source ?foo ?bar)
    (my-filter ?foo)
    ((IdentityUnsafe.) ?foo)
    (:distinct false))

where IdentityUnsafe is your Cascading filter implementation.


Sam Ritchie

Dec 21, 2011, 6:56:25 PM
to cascal...@googlegroups.com
Hey, you'll probably find cascalog.checkpoint in cascalog-contrib helpful. I discuss an example here: http://sritchie.github.com/2011/11/15/introducing-cascalogcontrib.html. Something like:

(workflow ["/tmp/checkpoints"]
  queryA ([:tmp-dirs query-a-data]
          (?- (hfs-seqfile query-a-data)
              (query-a s3-path)))
  queryB ([:deps queryA :tmp-dirs query-b-data]
          (?- (hfs-seqfile query-b-data)
              (query-b query-a-data)))
  queryC ([:deps [queryA queryB] :tmp-dirs query-c-data]
          (?- (hfs-seqfile query-c-data)
              (query-c query-a-data query-b-data))))
and so on and so forth.
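
For readers following along: the workflow above leaves query-a and query-b undefined, so here is a minimal sketch of how they might look. Everything in it is an assumption made for illustration: the namespace name, the wanted? predicate, the text-line source format, the count in query-b, and the run-pipeline wrapper are all hypothetical, and it assumes the workflow macro is exposed by the cascalog.checkpoint namespace. The point it tries to show is only that QueryA's output is written once to a temp directory and that later steps read that directory instead of re-running the filter.

```clojure
(ns example.pipeline
  ;; Assumption: cascalog and cascalog-checkpoint are on the classpath, and
  ;; the workflow macro is exposed by the cascalog.checkpoint namespace.
  (:use cascalog.api
        [cascalog.checkpoint :only (workflow)])
  (:require [cascalog.ops :as c]))

;; Hypothetical map-only filter standing in for QueryA's real predicate.
(defn wanted? [line]
  (not (empty? line)))

(defn query-a
  "Filters the raw source. This is the step we want to run only once."
  [src-path]
  (let [src (hfs-textline src-path)]
    (<- [?line]
        (src ?line)
        (wanted? ?line)
        (:distinct false))))

(defn query-b
  "Reads QueryA's checkpointed output rather than re-running the filter."
  [query-a-path]
  (let [a (hfs-seqfile query-a-path)]
    (<- [?line ?count]
        (a ?line)
        (c/count ?count))))

(defn run-pipeline
  [s3-path out-path]
  (workflow ["/tmp/checkpoints"]
    ;; QueryA's result lands in a temp dir managed by the workflow.
    queryA ([:tmp-dirs query-a-data]
            (?- (hfs-seqfile query-a-data)
                (query-a s3-path)))
    ;; QueryB depends on queryA and consumes the temp dir as its source.
    queryB ([:deps queryA]
            (?- (hfs-textline out-path)
                (query-b query-a-data)))))
```

Any further step that needs QueryA's rows would take query-a-data the same way, which is what removes the second pass over the S3 source.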
--
Sam Ritchie, Twitter Inc
703.662.1337
@sritchie09
(Too brief? Here's why! http://emailcharter.org)



Andrew Xue

Dec 25, 2011, 5:39:36 PM
to cascalog-user
hey sam -- this looks great, but having some trouble with getting
checkpoint working -- have you seen this error before?

ClassCastException java.lang.String cannot be cast to clojure.lang.IFn cascalog.contrib.checkpoint/exec-workflow!/iter--210--214/fn--215 (checkpoint.clj:88)

i get this error with the workflow i was testing as well as when i cut
and pasted in the checkpoint_test.clj code and tried to (run-test!)




Sam Ritchie

Dec 25, 2011, 5:54:22 PM
to cascal...@googlegroups.com
Andy, try using http://clojars.org/cascalog-checkpoint, or

[cascalog-checkpoint "0.1.0"]

instead of the global cascalog-contrib; I changed the blog post, but I haven't been able to figure out how to take the cascalog-contrib jar off of clojars.
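
As a rough illustration of where that dependency vector goes, a Leiningen project.clj might look like the fragment below. The project name and version are placeholders, not something from this thread, and your existing cascalog dependency stays whatever it already is.

```clojure
;; project.clj (Leiningen). Project name and version are placeholders.
(defproject my-cascalog-job "0.1.0-SNAPSHOT"
  :dependencies [;; ... keep your existing cascalog dependency here ...
                 [cascalog-checkpoint "0.1.0"]])
```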

Andrew Xue

Dec 25, 2011, 6:45:06 PM
to cascalog-user
cool that works -- this is seriously awesome, thanks!


Andrew Xue

Dec 25, 2011, 11:11:12 PM
to cascalog-user
can you nest workflows?

for example something like:

(defn inner-workflow
  [input-path output-path]
  (workflow ["tmp/inner-workflow"]
    step 1 ...
    step 2 ... etc
    ))

(defn outer-workflow
  [input-path output-path]
  (workflow ["tmp/outer-workflow"]
    step 1 ([:tmp-dirs inner-workflow-staging]
            (inner-workflow input-path inner-workflow-staging))
    step 2 ... etc
    ))

nathanmarz

Dec 28, 2011, 3:34:10 AM
to cascalog-user
You could... but it's not really recommended. For example, if you were
to delete "/tmp/outer-workflow" but not "/tmp/inner-workflow", you'd
run into problems. If the workflow had previously only run part way
through inner-workflow, inner-workflow will end up emitting stale
results if outer-workflow is rerun.
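
The failure mode described above can be sketched with a small in-memory model. This is not the cascalog-checkpoint implementation; it only assumes, for illustration, that a workflow records a token per finished step under its checkpoint directory, skips steps that already have a token, and clears its tokens once it completes. The step names, the run-step! helper, and the "monday"/"tuesday" inputs are all hypothetical.

```clojure
;; Simplified in-memory model of step checkpointing (an assumption for
;; illustration, not the real library).
(def checkpoints (atom {}))  ; checkpoint dir -> set of finished step names
(def outputs (atom {}))      ; step name -> value the step last produced

(defn run-step!
  "Runs f unless step is already recorded as finished under dir."
  [dir step f]
  (when-not (contains? (get @checkpoints dir #{}) step)
    (swap! outputs assoc step (f))
    (swap! checkpoints update-in [dir] (fnil conj #{}) step)))

(defn inner-workflow!
  [input fail?]
  (let [dir "/tmp/inner-workflow"]
    (run-step! dir :inner-1 #(str "filtered:" input))
    ;; Simulate a crash between the two inner steps.
    (when fail? (throw (ex-info "inner step 2 failed" {})))
    (run-step! dir :inner-2 #(str "joined:" (:inner-1 @outputs)))
    ;; A successful run clears its own tokens.
    (swap! checkpoints dissoc dir)
    (:inner-2 @outputs)))

(defn outer-workflow!
  [input fail?]
  (let [dir "/tmp/outer-workflow"]
    (run-step! dir :outer-1 #(inner-workflow! input fail?))
    (swap! checkpoints dissoc dir)
    (:outer-1 @outputs)))

;; First run on Monday's input dies partway through the inner workflow,
;; leaving a token for :inner-1 under /tmp/inner-workflow.
(try (outer-workflow! "monday" true)
     (catch Exception _ nil))

;; Someone clears only the outer checkpoint directory before the rerun.
(swap! checkpoints dissoc "/tmp/outer-workflow")

;; Rerun on Tuesday's input: :inner-1 is skipped, so Monday's data leaks in.
(def stale-result (outer-workflow! "tuesday" false))
(println stale-result) ; prints joined:filtered:monday
```

In this model the rerun never sees "tuesday" at all, because the leftover inner token makes the first inner step look finished. Keeping every step in one flat workflow, with one checkpoint directory, avoids having two sets of tokens that can fall out of sync.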

Andrew Xue

Jan 2, 2012, 6:05:41 PM
to cascalog-user
hey sam -- there seems to be an issue with checkpoint using
u/collectify

i am using cascalog-1.8.5-SNAPSHOT and util.clj no longer has this
function

Andrew Xue

Jan 2, 2012, 6:08:07 PM
to cascalog-user
ah ok, i guess cascalog is using the collectify in jackknife now

Sam Ritchie

Jan 2, 2012, 6:44:20 PM
to cascal...@googlegroups.com
Andy, I found that we were re-using a number of functions between ElephantDB, Storm and Cascalog and decided to pull them out into a separate library. Once I write a few more tests I'll announce it formally on the list. Hopefully you and the rest of the gang will find some of the pieces useful in your own projects.

Cheers,
Sam