Been meaning to write this up... I'll put a more detailed explanation
on the wiki at some point; let me know what doesn't make sense below.
All those "def" macros define custom operations with differing
semantics. Let's use this "test" dataset as an example:
["a" 1]
["b" 2]
["a" 3]
defmapop: Define a custom operation which adds fields to a tuple.
(defmapop add-2-fields [x] [1 2])
(<- [?a ?b ?c] (test _ ?a) (add-2-fields ?a :> ?b ?c))
Results:
[1 1 2]
[2 1 2]
[3 1 2]
deffilterop: Define a custom operation which only keeps tuples for
which this operation returns true.
(deffilterop is2 [x] (= x 2))
(<- [?a ?b] (test ?a ?b) (is2 ?b))
Results:
["b" 2]
defmapcatop: Define a custom operation which creates *multiple*
tuples.
(defmapcatop twomoretuples [x] [[(inc x)] [(+ 2 x)]])
(<- [?a ?b ?c] (test ?a ?b) (twomoretuples ?b :> ?c))
Results:
["a" 1 2]
["a" 1 3]
["b" 2 3]
["b" 2 4]
["a" 3 4]
["a" 3 5]
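To make the "multiple tuples" behaviour concrete, here is a rough plain-Clojure sketch that needs no Cascalog at all. It only imitates the semantics: `test-data` and `apply-mapcat` are names made up for this illustration, and this is not how Cascalog runs the operation internally.

```clojure
;; Plain Clojure only. `test-data` mirrors the "test" dataset from above.
(def test-data [["a" 1] ["b" 2] ["a" 3]])

;; Same body as the defmapcatop: returns a seq of output tuples.
(defn twomoretuples [x]
  [[(inc x)] [(+ 2 x)]])

;; Hypothetical helper: for every input tuple, emit one output tuple
;; per tuple returned by `op`, appending the new field(s).
(defn apply-mapcat [op tuples]
  (vec (for [[a b] tuples
             out   (op b)]
         (into [a b] out))))

(def mapcat-results (apply-mapcat twomoretuples test-data))
```

Each input tuple fans out into two output tuples, which is where the six result rows come from.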
defbufferop: Defines an aggregator which receives all the tuples for
the group in a single seq. Buffers cannot be used with any other
buffers/aggregators in a query. Buffers operate reduce-side.
(defbufferop dosum [tuples] [(reduce + (map first tuples))])
(<- [?a ?sum] (test ?a ?b) (dosum ?b :> ?sum))
Results:
["a" 4]
["b" 2]
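Roughly what "receives all the tuples for the group in a single seq" means, sketched in plain Clojure. The grouping helper `apply-buffer` is hypothetical, and the sketch assumes the buffer hands back a seq of result tuples; the real grouping happens reduce-side inside the framework.

```clojure
;; Plain Clojure only. `test-data` mirrors the "test" dataset from above.
(def test-data [["a" 1] ["b" 2] ["a" 3]])

;; Buffer body: sees the whole group at once, returns a seq of result tuples.
(defn dosum-buffer [tuples]
  [(reduce + (map first tuples))])

;; Hypothetical helper: group on the first field, then call the buffer
;; once per group with the remaining fields of each tuple.
(defn apply-buffer [buffer tuples]
  (into (sorted-map)
        (for [[k group] (group-by first tuples)]
          [k (buffer (map rest group))])))

(def buffer-results (apply-buffer dosum-buffer test-data))
```

The buffer is called exactly once per group, so it can do anything it likes with the full seq, which is also why it can't share a query with other aggregators.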
defaggregateop: Defines an aggregator which must be written in a more
restricted way. Aggregators *can* be used with other aggregators in a
query (i.e., you can do a count and sum of a group at same time).
Aggregators operate reduce-side. Aggregators consist of code for
"initializing", "aggregating", and "extracting a result". Aggregators
return a seq of tuples.
(defaggregateop dosum
  ([] 0)
  ([state val] (+ state val))
  ([state] [state]))
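The three arities map onto the "initializing", "aggregating", and "extracting a result" steps. A plain-Clojure sketch of how they would be driven over one group (the driver `run-aggregator` is a made-up name for illustration, not part of Cascalog):

```clojure
;; The three arities from the defaggregateop, as an ordinary function.
(defn dosum-agg
  ([] 0)                        ; initialize the state
  ([state val] (+ state val))   ; fold one value into the state
  ([state] [state]))            ; extract: return a seq of result tuples

;; Hypothetical driver: init, aggregate every value, then extract.
(defn run-aggregator [agg vals]
  (agg (reduce agg (agg) vals)))

(def agg-a (run-aggregator dosum-agg [1 3]))  ; group "a"
(def agg-b (run-aggregator dosum-agg [2]))    ; group "b"
```

Because the aggregator only ever sees one value at a time, several of them can walk the same group together, which a buffer can't do.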
defparallelagg: Defines an even more restricted aggregator that is
defined using two functions. These aggregators are more efficient as
they make use of map-side combiner optimizations. parallelaggs can be
composed with other parallelaggs/regular aggregators. However, when
composed with regular aggregators the entire computation is moved
reduce-side.
(defparallelagg dosum :init-var #'identity :combine-var #'+)
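A sketch of why the two-function form allows map-side combining, in plain Clojure. The partitioning here is invented for illustration (`combine-partition` and `parallel-sum` are hypothetical names), and it assumes the combine function gives the same answer regardless of how values are grouped, as + does.

```clojure
(def init-fn identity)   ; plays the role of :init-var, applied to each value
(def combine-fn +)       ; plays the role of :combine-var, merges two partials

;; Hypothetical map-side step: combine the values seen by one mapper.
(defn combine-partition [vals]
  (reduce combine-fn (map init-fn vals)))

;; Hypothetical reduce-side step: combine the partial results.
(defn parallel-sum [partitions]
  (reduce combine-fn (map combine-partition partitions)))

;; Group "a" has values 1 and 3; the split across mappers doesn't matter.
(def split-result  (parallel-sum [[1] [3]]))
(def single-result (parallel-sum [[1 3]]))
```

Only the partial results have to cross the network, rather than every tuple in the group.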
Vanilla Clojure functions can also be used as operations. When given
no output vars they work as filterops, when given output vars they
work as mapops. The drawback of using a regular Clojure function is
that it can't be inserted dynamically into a query. For example:
(defn mk-query [op] (<- [?a ?b] (test _ ?a) (op ?a :> ?b)))
The "op" passed to that function must be defined using one of
Cascalog's "def" macros and can't be a regular Clojure function. This
is b/c Cascalog uses the var name of functions to distribute the
operation across the cluster.
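To illustrate the var-name point in plain Clojure: a var carries a namespace and a name that can be written down as a symbol and resolved again later, while a bare function value has no such name attached. This is only a sketch of the general idea; `add-one` and `qualified-name` are made up here, and it is not Cascalog's actual serialization code.

```clojure
;; An ordinary named function, for illustration only.
(defn add-one [x] (inc x))

;; Hypothetical helper: turn a var into a namespace-qualified symbol,
;; i.e. something that could be shipped to another JVM as plain data.
(defn qualified-name [v]
  (let [m (meta v)]
    (symbol (str (:ns m)) (str (:name m)))))

(def shipped (qualified-name #'add-one))

;; On the "other side", look the var up by name and call it.
(def looked-up (resolve shipped))
(def round-trip-result (looked-up 1))
```

A function received as a plain argument (like "op" in mk-query) gives the macro nothing comparable to look up.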
Hope that helps! Let me know if you have more questions.
-Nathan