dumping temporary values between subqueries

Andrew Xue

Dec 16, 2011, 4:07:14 PM

to cascalog-user

Hi

I have a query which uses the same subquery several times. It looks
something like

S3 SrcData => QueryA => QueryB => QueryC => etc.
^
||
S3 SrcData => QueryA

QueryA is being called twice and it would be nice to be able to
"cache" the result in a temp file in HDFS instead of running the query
twice. This is especially true because QueryA is a filter job on the
SrcData and the cached result would be much smaller.

I found a few threads in the cascading list which say to implement
the isSafe() function on Operation to return false:

http://groups.google.com/group/cascading-user/browse_thread/thread/59e3463093c1eebb#
http://groups.google.com/group/cascading-user/browse_thread/thread/cd283dadc6f76bbe/d09111e95b6a8852?lnk=gst&q=%22Dumping+pipe+to+disk#d09111e95b6a8852

Is there any way to access and implement this from Cascalog? Thanks

Andy

Andrew Xue

Dec 16, 2011, 4:11:36 PM

to cascalog-user

ok, the ascii picture didn't work so here is a google doc drawing

https://docs.google.com/drawings/d/1JClgjaMOimujRtOThMoSZCihcvG6k_8FWdkAe-QS7NU/edit


nathanmarz

Dec 21, 2011, 5:20:23 PM

to cascalog-user

Cascading will do this automatically as long as the subquery includes
a reduce step... if it's map-only (e.g. just a filter) it will redo
query A. I opened up an issue to expose the isSafe method for
subqueries so you can force this optimization: https://github.com/nathanmarz/cascalog/issues/38

You can probably force this optimization now by implementing a regular
Cascading filter that always returns false (so it never removes a
tuple) and whose isSafe method returns false. So something like:

(<- [?foo ?bar]
    (source ?foo ?bar)
    (my-filter ?foo)
    ((IdentityUnsafe.) ?foo)
    (:distinct false))

where IdentityUnsafe is your Cascading filter implementation.
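
For illustration, here is a minimal sketch of what such an IdentityUnsafe class could look like when written in Clojure with gen-class. The namespace name myproject.IdentityUnsafe is hypothetical, the namespace would need to be AOT-compiled so the class exists when the query is built, and the sketch assumes the Cascading BaseOperation/Filter API (isRemove and isSafe). It has not been checked against a specific Cascading or Cascalog version.

```clojure
;; Sketch only: myproject.IdentityUnsafe is a hypothetical name.
;; gen-class generates a Java class named after this namespace.
(ns myproject.IdentityUnsafe
  (:gen-class
   :extends cascading.operation.BaseOperation
   :implements [cascading.operation.Filter]))

;; Filter.isRemove: returning false keeps every tuple,
;; so the filter passes its input through unchanged.
(defn -isRemove [this flow-process filter-call]
  false)

;; Operation.isSafe: returning false declares that this operation
;; should not see the same record more than once, which is the flag
;; the thread suggests setting to keep Cascading from re-running
;; the upstream branch.
(defn -isSafe [this]
  false)
```

With the class compiled and imported, (IdentityUnsafe.) could then be used as a predicate in the way the query above shows.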


Sam Ritchie

Dec 21, 2011, 6:56:25 PM

to cascal...@googlegroups.com

Hey, you'll probably find cascalog.checkpoint in cascalog-contrib helpful. I discuss an example here: http://sritchie.github.com/2011/11/15/introducing-cascalogcontrib.html. Something like

(workflow ["/tmp/checkpoints"]
          queryA ([:tmp-dirs query-a-data]
                  (?- (hfs-seqfile query-a-data)
                      (query-a s3-path)))
          queryB ([:deps queryA :tmp-dirs query-b-data]
                  (?- (hfs-seqfile query-b-data)
                      (query-b query-a-data)))
          queryC ([:deps [queryA queryB] :tmp-dirs query-c-data]
                  (?- (hfs-seqfile query-c-data)
                      (query-c query-a-data query-b-data))))

and so on and so forth.

--
Sam Ritchie, Twitter Inc
703.662.1337
@sritchie09
(Too brief? Here's why! http://emailcharter.org)


Andrew Xue

Dec 25, 2011, 5:39:36 PM

to cascalog-user

hey sam -- this looks great, but having some trouble with getting
checkpoint working -- have you seen this error before?

ClassCastException java.lang.String cannot be cast to clojure.lang.IFn
cascalog.contrib.checkpoint/exec-workflow!/iter--210--214/fn--215 (checkpoint.clj:88)

i get this error with the workflow i was testing as well as when i cut
and pasted in the checkpoint_test.clj code and tried to (run-test!)


Sam Ritchie

Dec 25, 2011, 5:54:22 PM

to cascal...@googlegroups.com

Andy, try using http://clojars.org/cascalog-checkpoint, or

[cascalog-checkpoint "0.1.0"]

instead of the global cascalog-contrib; I changed the blog post, but I haven't been able to figure out how to take the cascalog-contrib jar off of clojars.
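
As a rough sketch, with Leiningen that dependency would sit in project.clj along these lines. The project name and version are placeholders, and whatever Cascalog and Hadoop dependencies the project already declares are assumed to stay alongside it.

```clojure
;; project.clj (sketch; my-project and its version are placeholders)
(defproject my-project "0.1.0-SNAPSHOT"
  :dependencies [;; ...existing Cascalog / Hadoop dependencies stay here...
                 [cascalog-checkpoint "0.1.0"]])
```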


Andrew Xue

Dec 25, 2011, 6:45:06 PM

to cascalog-user

cool that works -- this is seriously awesome, thanks!


Andrew Xue

Dec 25, 2011, 11:11:12 PM

to cascalog-user

can you nest workflows?

for example something like:

(defn inner-workflow
  [input-path output-path]
  (workflow ["tmp/inner-workflow"]
            step 1 ...
            step 2 ... etc
            ))

(defn outer-workflow
  [input-path output-path]
  (workflow ["tmp/outer-workflow"]
            step 1 ([:tmp-dirs inner-workflow-staging]
                    (inner-workflow input-path inner-workflow-staging))
            step 2 ... etc
            ))

nathanmarz

Dec 28, 2011, 3:34:10 AM

to cascalog-user

You could... but it's not really recommended. For example, if you were
to delete "/tmp/outer-workflow" but not "/tmp/inner-workflow", you'd
run into problems. If the workflow had previously only run part way
through inner-workflow, inner-workflow will end up emitting stale
results if outer-workflow is rerun.
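
One way to sidestep the stale-checkpoint problem described here might be to flatten the inner steps into a single workflow, so there is only one checkpoint directory to clear. The following is a sketch under assumptions, not something from cascalog-checkpoint itself: the namespace cascalog.checkpoint, the query functions, the step names and the paths are all hypothetical, and the step syntax simply mirrors the workflow example earlier in the thread.

```clojure
;; Sketch only: names and paths are hypothetical; the cascalog.checkpoint
;; namespace is assumed from the cascalog-checkpoint jar.
(ns example.flat-workflow
  (:use cascalog.api
        [cascalog.checkpoint :only (workflow)]))

;; Hypothetical stand-in for the work the inner workflow used to do.
(defn copy-lines [path]
  (let [src (hfs-textline path)]
    (<- [?line] (src ?line))))

;; Hypothetical stand-in for the outer workflow's follow-up step.
(defn distinct-lines [path]
  (let [src (hfs-seqfile path)]
    (<- [?line] (src ?line) (:distinct true))))

(defn flat-workflow
  [input-path output-path]
  ;; A single checkpoint directory covers every step.
  (workflow ["/tmp/flat-workflow"]
            stage-one ([:tmp-dirs staging]
                       (?- (hfs-seqfile staging)
                           (copy-lines input-path)))
            stage-two ([:deps stage-one]
                       (?- (hfs-textline output-path)
                           (distinct-lines staging)))))
```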

Andrew Xue

Jan 2, 2012, 6:05:41 PM

to cascalog-user

hey sam -- there seems to be an issue with checkpoint using
u/collectify

i am using cascalog-1.8.5-SNAPSHOT and util.clj no longer has this
function

Andrew Xue

Jan 2, 2012, 6:08:07 PM

to cascalog-user

ah ok, i guess cascalog is using the collectify in jackknife now

Sam Ritchie

Jan 2, 2012, 6:44:20 PM

to cascal...@googlegroups.com

Andy, I found that we were re-using a number of functions between ElephantDB, Storm and Cascalog and decided to pull them out into a separate library. Once I write a few more tests I'll announce it formally on the list. Hopefully you and the rest of the gang will find some of the pieces useful in your own projects.