Cascalog 2.0 preview

Nathan Marz

Apr 20, 2012, 4:21:50 PM
to cascal...@googlegroups.com
Cascalog 2.0 is going to be a non-backwards compatible release of Cascalog. The goal is to fix the current problems with the API, small and large. 

The feature/serfn branch on Github fixes the biggest problem with Cascalog's current API, which is custom operations. It adds serializable functions (from https://github.com/nathanmarz/serializable-fn) into Cascalog which allows us to simplify how custom operations work while making them much more powerful. A preview build of feature/serfn is available from Clojars under the version "2.0.0-SNAPSHOT".

The new API lets you create anonymous functions to use in your queries, allowing things like:

(?<- (stdout) [?word ?count]
  (sentence ?sentence)
  ((mapcatop [s] (.split s " ")) ?sentence :> ?word)
  (c/count ?count))
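
As a rough local model of what the query above computes, here is the same logic in plain Clojure with no Cascalog or Hadoop involved. The `sentences` data and the `split-words` and `word-counts` names are made up for illustration; `mapcat` stands in for the mapcatop and `frequencies` for `c/count` grouped by `?word`.

```clojure
;; Hypothetical sample input standing in for the `sentence` source.
(def sentences ["the quick fox" "the lazy dog"])

;; Same body as the mapcatop in the query: one sentence in, many words out.
(defn split-words [^String s]
  (seq (.split s " ")))

;; mapcat plays the role of mapcatop; frequencies plays the role of
;; c/count grouped by ?word.
(def word-counts
  (frequencies (mapcat split-words sentences)))

(println word-counts)
```

This only mirrors the result on an in-memory collection; the point of the query form is that the anonymous op gets shipped to the tasks.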

The closures of your custom ops are captured as well, so you can do things like:

(defn mk-adder [amt]
  (mapop [v] (+ v amt)))

(?<- (stdout) [?v]
  (integer ?i)
  ((mk-adder 6) ?i :> ?v))
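
The closure capture here is ordinary Clojure closure behaviour. A minimal sketch with a plain `fn` in place of `mapop` (the names `mk-adder-fn`, `add-six` and `results` are illustrative, not part of Cascalog):

```clojure
;; Plain-Clojure analogue of mk-adder: the returned function closes over amt.
(defn mk-adder-fn [amt]
  (fn [v] (+ v amt)))

;; amt is fixed at 6 when the function is created.
(def add-six (mk-adder-fn 6))

;; Applying it to each value mirrors what a map operation does per tuple.
(def results (mapv add-six [1 2 3]))

(println results)
```

What the new API adds on top of this is that the captured value travels with the op to the tasks, which is why it has to be serializable.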

Finally, it also adds the notion of "prepared" ops, which are higher-order functions that are called on the task with the Cascading "FlowProcess" and "OperationCall" and return the operation to execute. This replaces the need for "stateful" ops and gives operations access to the JobConf and other metadata about the job. For instance:

(prepmapop [flow-process opcall]
  (fn [v] (+ v 3)))
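
The shape of a prepared op can be modelled in plain Clojure as a function that runs once with some context and returns the per-tuple function. This is only a sketch of the pattern, not how prepmapop is implemented; `prepared-scaler`, `task-context` and the `:factor` key are hypothetical, and a plain map stands in for the FlowProcess/OperationCall pair.

```clojure
;; Hypothetical stand-in for the per-task context: just a map here.
(defn prepared-scaler
  "Called once with the task context; returns the operation to run per value."
  [task-context]
  ;; One-time setup: read whatever the operation needs from the context.
  (let [factor (:factor task-context)]
    ;; The returned function is what gets applied to every value.
    (fn [v] (* v factor))))

;; "Prepare" once, then apply the resulting operation many times.
(def op (prepared-scaler {:factor 3}))
(def scaled (mapv op [1 2 3]))

(println scaled)
```

The setup in the outer function runs once, which is what makes this a replacement for keeping state in a "stateful" op.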

I've opened up an issue on Github to discuss other ways the Cascalog API can be improved: https://github.com/nathanmarz/cascalog/issues/70

One of the bigger changes already listed there is changing the default of the :distinct option to be false rather than true. Please comment there with other problems with the current API and changes that should be made in Cascalog to alleviate those problems. Remember, this is going to be a non-backwards-compatible release so feel free to suggest non-backwards-compatible ideas.

I'm really excited about 2.0 and look forward to hearing your feedback.

-Nathan

Bertrand Dechoux

Apr 21, 2012, 5:15:50 AM
to cascalog-user
> The new API lets you create anonymous functions to use in your queries

From what I understood, in previous versions of cascalog,
non-anonymous functions (i.e. 'named' functions) needed to be contained in
the jar that was sent across the MapReduce cluster, i.e. those functions
had to be compiled. Does your statement mean that this constraint does
not hold anymore (at least for anonymous functions)? Could you
elaborate a bit on this point for a clojure rookie?

Thanks in advance

Bertrand

Andrew Xue

Apr 21, 2012, 11:40:09 PM
to cascalog-user
+1 for :distinct being false by default

Nathan Marz

Apr 22, 2012, 3:56:48 AM
to cascal...@googlegroups.com
Before, the functions you used needed to be attached to a var somewhere. To pass a function around dynamically, you would have to pass around the var.

Now, you can define a custom operation anywhere and pass it around any which way. So you don't need to use the var form if you're passing it around dynamically. Using custom ops is the same as using functions in any other Clojure code (with the constraint that the parts of the closure that you use must be serializable).
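
To illustrate why a var is easy to pass between processes while a regular anonymous function is not, here is a plain-Clojure sketch. It shows the general idea only and is not a description of how Cascalog or serializable-fn work internally; `var->symbol` and the other names are made up for the example.

```clojure
;; A var has a stable, fully qualified name, so it can be sent as text
;; and looked up again on the receiving side.
(defn var->symbol [v]
  (let [m (meta v)]
    (symbol (str (ns-name (:ns m))) (str (:name m)))))

;; "Ship" the var as a string, then resolve it back to the same var.
(def shipped (pr-str (var->symbol #'clojure.core/inc)))
(def resolved (resolve (read-string shipped)))
(def by-name-result (resolved 41))

;; A regular anonymous function has no such name. Its printed form is an
;; opaque object reference that cannot be read back into a function.
(def anon-printed (pr-str (fn [x] (+ x 1))))

(println shipped by-name-result)
(println anon-printed)
```

The receiving side can only resolve the name if the code defining that var is loaded there too, which matches the old requirement that such functions live in the jar.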

The REPL experience isn't perfect yet, but it's fixable and there's an issue open for it here: https://github.com/nathanmarz/cascalog/issues/73. Right now any functions your custom ops use need to be available on the tasks, so if a function is not in the closure of your op then it needs to be attached to a var. For instance, the following will not work at the REPL:

(defn foo [x] (str x "!!!"))

(?<- (hfs-textline "/tmp/out") [?out]
  ((hfs-textline "/tmp/in") ?line)
  ((mapop [l] (foo l)) ?line :> ?out))

It doesn't work because "foo" does not exist on the tasks. However, we can make a Cascalog REPL that captures everything defined at the REPL and redefines it on all the tasks. This will make the REPL experience on a Hadoop cluster identical to the local experience. Note that running the following at a REPL does work now (this is the complete query for word count), because it's not dependent on any other functions defined at the REPL:

(?<- (hfs-textline "/tmp/out") [?word ?count]
  ((hfs-textline "/tmp/in") ?line)
  ((mapcatop [l] (.split l " ")) ?line :> ?word)
  (c/count ?count))

The mapcatop, of course, does not exist in the jar and is shipped dynamically to the tasks.

-Nathan

--
Twitter: @nathanmarz
http://nathanmarz.com

Bertrand Dechoux

Apr 22, 2012, 8:06:21 AM
to cascalog-user
Great. Thanks a lot for the explanation.

Bertrand

kovas boguta

Jun 12, 2012, 10:56:50 PM
to cascal...@googlegroups.com
I started doing some real work with this, finally. Very useful stuff.

I'm hitting some kind of bug though: Everything needs to be op'ified,
otherwise I get:

RuntimeException Cannot serialize regular functions that are not bound
to vars serializable.fn/serialize-find (fn.clj:75)

So for example (c/count ?count) gives that error; I need to supply a
whole (aggregateop ..) expr to get it to work. Same thing goes for all
the other cascalog operations, and using built-in clojure vars as
operations directly (which is a super useful feature).

Any ideas on what could be causing this?

Thanks!

Nathan Marz

Jun 13, 2012, 12:57:37 AM
to cascal...@googlegroups.com
Can you send some code that reproduces this? I'm unable to do so at the repl.

kovas boguta

Jun 13, 2012, 8:41:08 PM
to cascal...@googlegroups.com
Looks like a heisenbug.

Though probably related to the fact that the jvm was getting switched
out under me by ops..

kovas boguta

Jun 18, 2012, 9:07:03 PM
to cascal...@googlegroups.com
Is it necessary to do (use 'serializable.fn) when using this at the
command line?

Doing that seems to fix some of my examples.

There seems to be something state-dependent going on behind the
scenes... if I haven't done (use 'serializable.fn) , then some
examples will work, but once it gives the "Cannot serialize regular
functions" for an input, it will now also give that exception for
previously valid inputs.

Andy Xue

Jun 23, 2012, 1:30:15 PM
to cascal...@googlegroups.com
hey nathan -- one other thing i have noticed is that i normally want to do something like this

(* 2 !my-var :> !my-var)

right now my understanding is that this triggers a filter job that accepts everything where the double of the input = the input. instead it would be nice if it simply applied the function and re-bound the field to the output.

the reason is that this would make it much easier for field names to be passed through transparently through chains of queries. currently there appears to be no way to do this without explicitly changing either the input or output var somehow (like appending an "_in" / "_out" string)

Nathan Marz

Jun 24, 2012, 1:16:12 AM
to cascal...@googlegroups.com
Yea, this behavior isn't going to be changed. Cascalog vars being immutable is important for keeping the semantics simple.

Sam Ritchie

Jun 25, 2012, 4:43:34 PM
to cascal...@googlegroups.com
You can't do this because ordering of the predicates doesn't matter in datalog. Predicates are logical constraints, not to be thought of as function applications. These two queries are identical -- 

(<- [?x]
     (src ?x)
     (* 2 ?x :> ?x)
     (square ?x :> ?square))

(<- [?x]
     (src ?x)
     (square ?x :> ?square)
     (* 2 ?x :> ?x))

Changing that *2 predicate to reassign ?x would break datalog.
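
A plain-Clojure model of the two readings may help. This is not Cascalog code; `xs`, `same-var`, `fresh-var` and the `?doubled` variable mentioned in the comments are illustrative names only.

```clojure
;; Hypothetical values for ?x.
(def xs [0 1 2 3])

;; Model of (* 2 ?x :> ?x): the output must equal the already-bound ?x,
;; so the predicate acts as a filter keeping only values where 2x = x.
(def same-var
  (filterv (fn [x] (= (* 2 x) x)) xs))

;; Model of (* 2 ?x :> ?doubled): a fresh variable receives the result
;; and every value of ?x is kept.
(def fresh-var
  (mapv (fn [x] {:x x :doubled (* 2 x)}) xs))

(println same-var)
(println fresh-var)
```

Neither reading depends on where the predicate sits among the others, which is the property that reassignment would give up.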
--
Sam Ritchie, Twitter Inc
@sritchie09

(Too brief? Here's why! http://emailcharter.org)

Soren Macbeth

Jan 11, 2013, 4:23:58 PM
to cascal...@googlegroups.com
I'd like to pick development of cascalog 2.0 back up and revive this conversation. Given the new stuff in cascading 2.x, what would we like to see as part of cascalog 2.0?

Paul Lam

Jan 12, 2013, 3:50:08 AM
to cascal...@googlegroups.com
Thank you for initiating this, Soren.

An improved logic solver would be nice, so that we don't need to write sub-queries, for example. We can take a look at core.logic and see what we can borrow. Some suggestions in this thread, like anonymous functions, would be useful too. And there's that long list of issues we can pick at.

I'd like to help out to get development going again. Do you have a branch that I can fork?

Soren Macbeth

Jan 12, 2013, 2:14:24 PM
to cascal...@googlegroups.com
Hi Paul,

My branch that I've just merged the main repo develop branch into (bringing it up to date with cascading 2.x) is the serfn branch at https://github.com/sorenmacbeth/cascalog

Improving the logic solver using core.logic sounds like a great thing to add. 
--
http://about.me/soren

Soren Macbeth

Jan 13, 2013, 3:31:16 PM
to cascal...@googlegroups.com
I've merged my branch into the official serfn branch, so let's all fork off of that. 
--
http://about.me/soren

Jeroen van Dijk

Jan 14, 2013, 4:14:18 AM
to cascal...@googlegroups.com
Thanks for taking the initiative Soren. 

Like Paul said, a smarter solver could be nice. Say I have the following example:

    (defn tap-accessor [tap]
      (<- [?a ?b]
          (tap ?t)
          (get-a ?t :> ?a)
          (get-b ?t :> ?b)))

    ;; input-tap is a placeholder for whatever source tap is passed in
    (?<- output-tap [?a] ((tap-accessor input-tap) ?a _))

In the current Cascalog version, #'get-b is still called even though its result is discarded. A smarter solver could probably handle this more efficiently by not calling #'get-b.

Something a core.logic solver could also help with is providing better error messages, maybe even going as far as 'did you mean this' when you have written a query that is unsolvable, or suggesting ways to write a query more efficiently.

Another thing that would be nice is adding tools/hooks for profiling. Maybe we could add certain information to the log output that could later be analysed, or, when everything runs in a single Java process, we could put this information in an atom. I often find myself guessing or applying 'best' practices in order to make a query more efficient.

Other Ideas
------------------
One of the things that has bothered me the most is the Cascading Taps part of Cascalog. For example, I am a user of the maple library which is very inefficient for large Postgres databases. To fix this though I need to rewrite a large part of this library in Java. It would be nice if we could provide Clojure abstractions that would make the process of creating Cascading Taps easier, while still being efficient. This is probably more for Cascalog-contrib, but I would still be interested in what others think of this: e.g. does it make sense?

Maybe not a feature, but still important: what about documentation? I think it would help the Cascalog community if there were a place where we could put tutorials, from getting started to using Cascalog Checkpoint, from building your own tap to deploying on EMR. Currently all this information is either absent or scattered across Nathan's blog, the Wiki, this mailing list, and random blog posts.

Another wild idea that could be interesting is adding support for nREPL so you don't have to SSH into an (EMR) machine. I'm not even sure this is a real benefit; just throwing out some ideas.

Jeroen

Paco Nathan

Jan 14, 2013, 11:28:17 AM
to cascal...@googlegroups.com
Would be excellent to add support for choosing flow planning on different topologies, from within the language, and also to select different flow planners within the context of the same app. For example, being able to have one app which is part batch, part local, part IMDG, etc. There are plenty of use cases for that emerging.

Cascading has work in progress to make it easier to build out support for new topologies -- it'd be great to plan ahead for that in Cascalog.

Soren Macbeth

Jan 14, 2013, 2:12:06 PM
to cascal...@googlegroups.com
These all sound like great ideas. I'm especially keen on starting to improve the solver using core.logic as the first place to begin. I haven't had a chance to play around with core.logic much yet, so help here would be awesome.