Cascading Traps


mlimotte

Jul 19, 2010, 2:29:46 PM
to cascalog-user
Hi Nathan,

Cascading has a nice feature where a few malformed input records can
be "trapped" without stopping the entire flow
(http://www.cascading.org/documentation/features/failure-traps.html).
This allows you to run a big job and not have it die because of a little
dirty data, while still capturing the bad records to a file so you can
debug them later on.

As far as I can tell, Cascalog does not support this as of now. Any
pointers on alternatives, or on how I can quickly hack it in for my flow?

Marc

nathanmarz

Jul 19, 2010, 3:19:06 PM
to cascalog-user
I knew someone was going to ask about this eventually :)

I'll have to think more about the right way to add this to Cascalog,
as this is definitely a feature I'd like to support. For now the
alternative is to use a custom operation to wrap the code that might
throw an exception in a try...catch. From there you can deal with the
bad record manually (for example, by writing it to Scribe or a queue).
That's not a great alternative; I'll see if I can come up with something
better soon.
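
A minimal sketch of the try...catch workaround described above, in plain Clojure with no Cascalog dependency. The names parse-long-or-nil and bad-records are hypothetical, and the atom is only a local stand-in for a real sink such as Scribe or a queue; it would not collect records across Hadoop tasks.

```clojure
;; Hypothetical local sink for bad input. A real job would write to an
;; external system (Scribe, a queue), since an atom is per-JVM only.
(def bad-records (atom []))

(defn parse-long-or-nil
  "Parses s as a long. On malformed input, records s in bad-records
  and returns nil instead of throwing."
  [s]
  (try
    (Long/parseLong s)
    (catch NumberFormatException _
      (swap! bad-records conj s)
      nil)))
```

If such a function is used as an operation whose output is bound to a non-nullable variable (one starting with ?), a nil result should cause the record to be dropped rather than fail the flow, since Cascalog filters nulls out of ? variables.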




mlimotte

Jul 19, 2010, 6:01:24 PM
to cascalog-user
Ok, will be great when this is available.

nathanmarz

Jul 19, 2010, 6:34:12 PM
to cascalog-user
Here's what I'm thinking for design. Let me know what you think.

There will be two ways to add traps to a query or set of queries. The
first is with "with-trap", i.e.:

(with-trap (hfs-textline "/tmp/mytrap")
  (?<- ...)
  (?- ...))

This will trap any error occurring in any query within the scope of
the form into the given tap.

The second method will allow you to set traps in a more fine grained
way. We will do this by naming subqueries and then attaching traps to
those subqueries using a map. So, for example:

(with-trap-map {"error-subquery" (hfs-textline "/tmp/mytrap")}
  (let [sq (<- [?f1 ?f2]
               (source ?f1)
               (possible-error-op ?f1 :> ?f2)
               (possible-error-filter ?f1)
               (:name "error-subquery"))]
    (?<- (hfs-textline "/tmp/results") [?f3] (sq _ ?f2) (* 2 ?f2 :> ?f3))))

In this case, any errors in possible-error-op and possible-error-filter
will be trapped to /tmp/mytrap, while any errors in the * operation
will cause the flow to fail.

Marc Limotte

Jul 19, 2010, 7:33:28 PM
to cascal...@googlegroups.com
I like the global option.  Nice and simple. 

I wonder if the fine-grained method can be done without requiring the developer to come up with an arbitrary string name. I was thinking of something like the following (I'm new to Clojure, so I hope this makes sense...):


(let [sq (<- [?f1 ?f2]
                  (source ?f1)
                  (:trap (hfs-textline "/tmp/mytrap"))
                  (possible-error-op ?f1 :> ?f2)
                  (possible-error-filter ?f1))]

  (?<- (hfs-textline "/tmp/results") [?f3] (sq _ ?f2) (* 2 ?f2 :> ?f3)))

Then you can generate the name under the covers with (uuid)

Or, if you want to use the same trap with multiple sub-queries, but not globally:

(def my-trap (hfs-textline "/tmp/mytrap"))
(let [sq (<- [?f1 ?f2]
                  (source ?f1)
                  (:trap my-trap)
                  (possible-error-op ?f1 :> ?f2)
                  (possible-error-filter ?f1))
       qry2 (<- [...] (source...) (:trap my-trap) (preds)..)]
  (?<- (hfs-textline "/tmp/results") [?f3] (sq _ ?f2) (qry2 ...) (* 2 ?f2 :> ?f3)))

Marc

Jim Blomo

Jul 19, 2010, 7:54:08 PM
to cascal...@googlegroups.com
Funny, I am dealing with this today, too.

On Mon, Jul 19, 2010 at 3:34 PM, nathanmarz <natha...@gmail.com> wrote:

> Here's what I'm thinking for design. Let me know what you think.
>
> There will be two ways to add traps to a query or set of queries. The
> first is with "with-trap", i.e.:
>
> (with-trap (hfs-textline "/tmp/mytrap")
> (?<- ...)
> (?- ...)
> )

This seems straightforward and matches my understanding of the with-*
semantics. I haven't used traps in normal Cascading code, so I'm
curious: would these traps keep a hfs file open? Or open a file in a
directory for each expression?

> (with-trap-map {"error-subquery" (hfs-textline "/tmp/mytrap")}
> (let [sq (<- [?f1 ?f2] (source ?f1) (possible-error-op ?f1 :> ?f2)
> (possible-error-filter ?f1) (:name "error-subquery"))]
> (?<- (hfs-textline "/tmp/results") [?f3] (sq _ ?f2) (* 2 ?f2 :> ?f3)))

This seems pretty awkward as it requires linking up failure cases
between what could be a large number of lines. Again, I'm not sure
how this would work implementation wise, but I think something like

(<- [?f1 ?f2]
    (source ?f1)
    (possible-error-op ?f1 :> ?f2)
    (:trap (hfs-textline "/tmp/mytrap")))

might work better. Then you have the trap embedded in the query.
Having this usage also seems like it would make the with-trap easier
to implement (just inject the (:trap ) clause in each expression).

Jim

nathanmarz

Jul 20, 2010, 4:26:44 AM
to cascalog-user
Thanks for the feedback, guys. I don't know the details of how
Cascading implements traps; hopefully Chris Wensel can chime in.

I just pushed the :trap implementation to GitHub. I didn't implement
with-trap.

Some caveats about traps:

1. You can't use the same tap in multiple :trap clauses in a single
query yet. I ran into some Cascading issues and am working with Chris
Wensel on this.
2. The ordering of fields within tuples written to a trap is undefined
(but will be consistent across all tuples in a single run). The actual
fields in the tuple will be whatever was in the flow at the time of
the error, which depends on how the query planner orders operations
and can be unpredictable.

If you just want to protect a single operation and do want strong
ordering and predictability in the fields, you can do so with a simple
subquery like:

(<- [?f1 ?f2 ?f3]
    (mytap ?f1 ?f2 ?f3) (failing-op ?f1)
    (:distinct false) (:trap mytrap))

-Nathan


Marc Limotte

Jul 20, 2010, 5:21:31 AM
to cascal...@googlegroups.com
Great.  Nice turn-around time.  This functionality should be sufficient for now.

Chris K Wensel

Jul 20, 2010, 4:10:27 PM
to cascal...@googlegroups.com
> I haven't used traps in normal Cascading code, so I'm
> curious: would these traps keep a hfs file open? Or open a file in a
> directory for each expression?


It works as you would expect.

There is one trap directory for each trap, and one part file for each mapper/reducer.

The part file stays open once opened, if ever, so it can accept additional tuples.

Hadoop MR does not support appends, so we couldn't close and reopen it even if we wanted to.

If you want a trap for each expression, you need to name each pipe individually and bind a trap to it.

Traps are not a device for filtering. They exist only to capture exceptional, unanticipated cases when you don't want the job to stop in the face of them.
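
To illustrate the distinction above: input problems you can anticipate are better handled by an explicit filter than by a trap. A minimal sketch in plain Clojure, where numeric-string? is a hypothetical predicate and the query usage shown in the comment assumes a Cascalog version that accepts plain functions as filter predicates:

```clojure
(defn numeric-string?
  "Returns true only when s is a non-empty string made up entirely of
  digits. Nil and non-string values yield false rather than throwing."
  [s]
  (boolean (and (string? s) (re-matches #"\d+" s))))

;; Hypothetical use as a filter inside a query, ahead of the operation
;; that would otherwise throw on bad input:
;;   (<- [?f1 ?f2] (source ?f1) (numeric-string? ?f1) (parse-op ?f1 :> ?f2))
```

With the anticipated case filtered out this way, a :trap only has to catch the failures nobody predicted.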

ckw

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

sourab...@corp.247customer.com

May 31, 2013, 6:06:00 AM
to cascal...@googlegroups.com
Can anyone provide a pointer on how the same can be used in JCascalog?

Thanks
Sourabh

sourab...@corp.247customer.com

Jun 6, 2013, 7:28:44 AM
to cascal...@googlegroups.com
I am using JCascalog and I am new to this technology. Can someone kindly point me to an example of a trap in JCascalog? I would really appreciate your help.

Thanks
Sourabh

David Kincaid

Jun 6, 2013, 8:48:59 AM
to cascal...@googlegroups.com
To add on to an example from the Wiki:

Api.execute(
  Api.hfsTextline("/tmp/myresults"),
  new Subquery("?count")
    .predicate(Api.hfsTextline("src/java/jcascalog/example"), "_")
    .predicate(new Count(), "?count")
    // The trap option is chained on the Subquery, inside the execute(...) call.
    .predicate(Option.Trap, Api.hfsTextline("/tmp/error-trap")));

sourab...@corp.247customer.com

Jun 6, 2013, 9:51:19 AM
to cascal...@googlegroups.com
Thanks a lot. This is exactly what I was looking for.



Andy Xue

Apr 4, 2014, 3:35:13 PM
to cascal...@googlegroups.com
hmm i am having trouble getting the trap to work ... did this make it into cascalog 2.x? i tried both

(:trap (stdout))

and

(:trap (lfs-textline "mypath"))

Soren Macbeth

Apr 4, 2014, 4:45:51 PM
to cascal...@googlegroups.com
they work for me, although I've never tried stdout or lfs-textline





--
http://about.me/soren

dan young

Apr 4, 2014, 6:17:47 PM
to cascal...@googlegroups.com

Same here...seems to work for me too...
