Description of def macros

Robert Malko

Nov 8, 2010, 1:58:10 PM

to cascalog-user

Hi Cascaloggers,

I just wanted to say that writing Clojure/Cascalog is very, very fun,
but I'm having a hard time grasping what some of the included macros
do (defmapcatop, deffilterop, defmapop, defaggregateop, defbufferop,
defparallelagg).

I've read all the blog posts, the source, and the Google Group posts,
but I still can't really deduce what the point of these is and when I
should use them.

For instance, why would I define a filterop if I can just use a plain
clojure function for filtering?

Any help to further explain these macros would really take my Cascalog
experience to the next level.

Thanks Nathan!

nathanmarz

Nov 8, 2010, 5:56:23 PM

to cascalog-user

Been meaning to write this up... I'll put a more detailed explanation
on the wiki at some point, let me know what doesn't make sense below.

All those "def" macros define custom operations with differing
semantics. Let's use this "test" dataset as an example:

["a" 1]
["b" 2]
["a" 3]

defmapop: Define a custom operation which adds fields to a tuple.

(defmapop add-2-fields [x] [1 2])

(<- [?a ?b ?c] (test _ ?a) (add-2-fields ?a :> ?b ?c))

Results:
[1 1 2]
[2 1 2]
[3 1 2]

deffilterop: Define a custom operation which only keeps tuples for
which this operation returns true.

(deffilterop is2 [x] (= x 2))

(<- [?a ?b] (test ?a ?b) (is2 ?b))

Results:
["b" 2]

defmapcatop: Define a custom operation which creates *multiple*
tuples.

(defmapcatop twomoretuples [x] [[(inc x)] [(+ 2 x)]])

(<- [?a ?b ?c] (test ?a ?b) (twomoretuples ?b :> ?c))

Results:
["a" 1 2]
["a" 1 3]
["b" 2 3]
["b" 2 4]
["a" 3 4]
["a" 3 5]
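
For intuition, the three per-tuple operations above behave roughly like map, filter, and mapcat over the tuples. Here is a plain Clojure sketch of that analogy. It does not use Cascalog at all; `test-tuples` and the `*-result` names are made up for illustration, and the operation bodies are copied from the examples above as ordinary functions.

```clojure
;; Plain Clojure analogy only; nothing here calls Cascalog.
(def test-tuples [["a" 1] ["b" 2] ["a" 3]])

;; defmapop analogy: every input tuple yields exactly one output tuple,
;; with the returned fields appended.
(defn add-2-fields [_x] [1 2])

(def mapop-result
  (mapv (fn [[_ a]] (into [a] (add-2-fields a))) test-tuples))

;; deffilterop analogy: keep only tuples for which the body is truthy.
(defn is2 [x] (= x 2))

(def filterop-result
  (filterv (fn [[_ b]] (is2 b)) test-tuples))

;; defmapcatop analogy: the body returns a seq of tuples, and each one
;; becomes its own output row next to the input fields.
(defn twomoretuples [x] [[(inc x)] [(+ 2 x)]])

(def mapcatop-result
  (vec (mapcat (fn [[a b]]
                 (for [out (twomoretuples b)]
                   (into [a b] out)))
               test-tuples)))
```

The three results match the "Results" listings above: `[[1 1 2] [2 1 2] [3 1 2]]`, `[["b" 2]]`, and the six rows shown for `twomoretuples`.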

defbufferop: Defines an aggregator which receives all the tuples for
the group in a single seq. Buffers cannot be used with any other
buffers/aggregators in a query. Buffers operate reduce-side.

(defbufferop dosum [tuples] (reduce + (map first tuples)))

(<- [?a ?sum] (test ?a ?b) (dosum ?b :> ?sum))

Results:
["a" 4]
["b" 2]

defaggregateop: Defines an aggregator which must be written in a more
restricted way. Aggregators *can* be used with other aggregators in a
query (i.e., you can do a count and sum of a group at the same time).
Aggregators operate reduce-side. Aggregators consist of code for
"initializing", "aggregating", and "extracting a result". Aggregators
return a seq of tuples.

(defaggregateop dosum
([] 0)
([state val] (+ state val))
([state] [state]))

defparallelagg: Defines an even more restricted aggregator that is
defined using two functions. These aggregators are more efficient as
they make use of map-side combiner optimizations. parallelaggs can be
composed with other parallelaggs/regular aggregators. However, when
composed with regular aggregators the entire computation is moved
reduce-side.

(defparallelagg dosum :init-var #'identity :combine-var #'+)
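
To see why two functions are enough for a combiner-friendly aggregator, here is a plain Clojure sketch. It is only an analogy, not Cascalog's implementation: `run-parallel-agg` and `chunks` are hypothetical names, with `chunks` standing in for the slices of one group that different map tasks might see.

```clojure
;; Plain Clojure analogy of a map-side combiner; not Cascalog itself.
(defn run-parallel-agg
  "init-fn turns one input value into a partial result; combine-fn
  merges two partial results. Each chunk is reduced on its own first
  (the 'map side'), then the partial results are merged."
  [init-fn combine-fn chunks]
  (let [partials (for [chunk chunks :when (seq chunk)]
                   (reduce combine-fn (map init-fn chunk)))]
    (reduce combine-fn partials)))

;; Group "a" from the test dataset holds the values 1 and 3.
(def sum-split  (run-parallel-agg identity + [[1] [3]])) ; two tasks
(def sum-whole  (run-parallel-agg identity + [[1 3]]))   ; one task

;; A count works the same way: every value starts as 1.
(def count-split (run-parallel-agg (constantly 1) + [[1] [3]]))
```

Both sums come out as 4 regardless of how the group is split, which only holds because `+` gives the same answer in any grouping and order. That is the restriction a combine function has to satisfy.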

Vanilla Clojure functions can also be used as operations. When given
no output vars they work as filterops; when given output vars they
work as mapops. The drawback of regular Clojure functions is
that they can't be inserted dynamically into a query. For example:

(defn mk-query [op] (<- [?a ?b] (test _ ?a) (op ?a :> ?b)))

The "op" passed to that function must be defined using one of
Cascalog's "def" macros and can't be a regular Clojure function. This
is because Cascalog uses the var name of functions to distribute the
operation across the cluster.
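
The role of the var name can be illustrated in plain Clojure. This is an analogy under assumptions, not a description of Cascalog internals: `plus-two`, `var->sym`, `shipped-name`, and `resolved-op` are hypothetical names. The point is that a var can be reduced to a symbol, which is plain data, and looked up again later, whereas a bare function value passed in as a local like `op` carries no such name.

```clojure
;; Plain Clojure analogy; how Cascalog actually ships operations is
;; not shown here.
(defn plus-two [x] (+ x 2))

;; A var knows its namespace and name, so it can be turned into a
;; fully qualified symbol.
(defn var->sym [v]
  (let [m (meta v)]
    (symbol (str (ns-name (:ns m))) (str (:name m)))))

(def shipped-name (var->sym #'plus-two))

;; "Receiving" side: resolve the symbol back into the var and call it.
(def resolved-op (resolve shipped-name))

(resolved-op 40) ; => 42
```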

Hope that helps! Let me know if you have more questions.

-Nathan

Robert Malko

Nov 9, 2010, 3:52:19 PM

to cascalog-user

Hi Nathan,

Thanks for this. Can you please elaborate on defaggregateop? You
said it was more restricted and then gave an example but I'm not sure
what the output is and if it has to follow a certain form. Other than
that, everything else makes sense.

Best

nathanmarz

Nov 9, 2010, 4:40:49 PM

to cascalog-user

No problem. So if we look again at the "dosum" example:

(defaggregateop dosum
([] 0)
([state val] (+ state val))
([state] [state]))

An aggregateop accumulates some state over the course of the
aggregation. The code body with no parameters sets the initial value
of the state.

The code body with >1 parameter is the "accumulation" function. It
takes a state value and a tuple in the aggregation and returns a new
state value. In this case, it receives a 1-tuple containing the next
value in the grouping to add to the sum. If the aggregator took in
2-tuples as input, that piece of code would take 3 parameters (1 for
the state, 2 for the tuple).

The code body with 1 parameter is the "return" function. It takes in
the fully accumulated state and returns a seq of tuples as output.
In this case it just returns the state as-is. "[state]" is equivalent
to saying "[[state]]": 1-tuples can be written as just values,
without the []. "[state state]" would be the same as saying
"[[state] [state]]" (returning 2 tuples as output).
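
The way the three bodies fit together can be sketched in plain Clojure. This is only an analogy: `run-aggregator` is a hypothetical driver, `sum-products` is a made-up second aggregator, and both are written as ordinary multi-arity functions rather than with defaggregateop.

```clojure
;; Plain Clojure analogy of how the three bodies are driven.
(defn dosum
  ([] 0)                          ; initial state
  ([state val] (+ state val))     ; state + one 1-tuple -> new state
  ([state] [state]))              ; final state -> seq of output tuples

(defn run-aggregator
  "Folds the tuples of one group through agg: start from (agg), feed
  each tuple's fields after the state, then extract the result."
  [agg tuples]
  (let [final-state (reduce (fn [state tuple] (apply agg state tuple))
                            (agg)
                            tuples)]
    (agg final-state)))

;; Group "a" of the test dataset, as 1-tuples.
(def dosum-result (run-aggregator dosum [[1] [3]]))

;; With 2-tuples as input, the accumulation body takes 3 parameters.
(defn sum-products
  ([] 0)
  ([state x y] (+ state (* x y)))
  ([state] [state]))

(def products-result (run-aggregator sum-products [[1 2] [3 4]]))
```

Here `dosum-result` is `[4]` and `products-result` is `[14]`.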

Also, I noticed that my buffer version has a typo. It should be:

(defbufferop dosum [tuples] [(reduce + (map first tuples))])
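
For comparison with the aggregator, here is a plain Clojure sketch of what a buffer sees. It is an analogy only: `run-buffer` is a hypothetical driver that groups by the first field and hands the body all of a group's tuples at once, and the body is the corrected `dosum` written as an ordinary function.

```clojure
;; Plain Clojure analogy; nothing here calls Cascalog.
(def test-tuples [["a" 1] ["b" 2] ["a" 3]])

;; Same body as the corrected buffer: it receives a seq of 1-tuples and
;; returns a seq of output tuples (a 1-tuple written as a bare value).
(defn dosum [tuples] [(reduce + (map first tuples))])

(defn run-buffer
  "Groups tuples by their first field, passes each group's remaining
  field to buffer-fn as a seq of 1-tuples, and emits one row per
  output value."
  [buffer-fn tuples]
  (vec (for [[k group] (sort-by key (group-by first tuples))
             out (buffer-fn (map (fn [[_ b]] [b]) group))]
         [k out])))

(def buffer-result (run-buffer dosum test-tuples))
```

`buffer-result` is `[["a" 4] ["b" 2]]`, matching the results given for the buffer query earlier in the thread.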

Hope that helps,
Nathan
