New features: select-tap-fields and select by position, useful for "wide" taps

nathanmarz

Jun 22, 2010, 5:22:49 PM
to cascalog-user
A number of people have expressed difficulty with taps that produce a
lot of fields as there was no way to select a subset of fields without
declaring all the unwanted fields with _.

I pushed a couple new features to Cascalog that should resolve these
issues.

The first is a new function in the API called "select-tap-fields".
select-tap-fields takes in a tap and a vector of fields, and it
produces a subquery that emits the fields you requested in the order
you specified. For example:

(let [sq (select-tap-fields widetap ["field3" "field20"])]
  (?<- (stdout) [?a ?b] (sq ?a ?b)))

You could also write this as:

(?<- (stdout) [?a ?b]
     ((select-tap-fields widetap ["field3" "field20"]) ?a ?b))
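
To make the "in the order you specified" part concrete, here is a plain-Clojure model of the projection. It is only a sketch of the behavior described above, not Cascalog's implementation: project-fields, field-names and tuples are hypothetical names made up for illustration, and nothing here touches a real tap.

```clojure
;; Plain-Clojure model only; project-fields is a hypothetical helper,
;; not part of Cascalog.
(defn project-fields
  "For each tuple, returns the values of the `wanted` fields in the order
  requested. `field-names` gives the name of each tuple position."
  [field-names tuples wanted]
  (let [index (zipmap field-names (range))] ; field name -> position
    (for [t tuples]
      (mapv #(nth t (index %)) wanted))))

;; Hypothetical three-field source with two tuples.
(def field-names ["field1" "field2" "field3"])
(def tuples [[1 2 3] [4 5 6]])

(project-fields field-names tuples ["field3" "field1"])
;; => ([3 1] [6 4])
```

Note that the output follows the order of the requested field vector, not the order of the fields in the source.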

The second feature lets you select the output vars of any predicate by
position using the :#> selector. :#> is an alternative to :> or :>>.
It takes the total number of output fields followed by a map from
zero-based position to var. For example, let's say "widetap" has 13
fields. To select the 4th field as ?a and the 10th field as !b, you
would write:

(<- [?a !b] (widetap :#> 13 {3 ?a, 9 !b}))

You can use :#> for the output of functions or aggregators too. You
can also map positions to constants; they're not limited to
variables.
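
Since the position numbers are easy to get off by one, here is a small plain-Clojure model of what the position map in the widetap example means. This is a sketch under assumptions, not how Cascalog implements :#>; select-by-position and row are hypothetical names introduced for illustration only.

```clojure
;; Plain-Clojure model of a position map; this is NOT Cascalog's implementation.
;; select-by-position is a hypothetical helper introduced for illustration.
(defn select-by-position
  "Given a tuple, its declared arity, and a map of zero-based position -> name,
  returns a map of name -> value found at that position."
  [tuple arity pos->name]
  (assert (= arity (count tuple)) "tuple must have the declared number of fields")
  (into {} (for [[pos nm] pos->name]
             [nm (nth tuple pos)])))

;; A 13-field tuple whose values are "f1" .. "f13" (named 1-based for readability).
(def row (mapv #(str "f" %) (range 1 14)))

(select-by-position row 13 {3 '?a, 9 '!b})
;; => {?a "f4", !b "f10"}
```

Position 3 picks up the 4th field and position 9 the 10th, matching the widetap example.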

mlimotte

Jul 13, 2010, 10:16:44 PM
to cascalog-user
Hi.

I'd like to use these widetap features. After a (re-parse), I have 33
fields, so I'd like to use :#> to specify just a couple of variables.
I tried the following code:

(ns metamx.openx_agg
  (:use cascalog.api)
  (:require [cascalog [workflow :as w] [predicate :as p] [vars :as v]
             [ops :as c]])
  (:gen-class))

(defn openx-data [dir]
  (let [raw (hfs-textline dir)]
    (<- [?current-time ?visitor-id]
        (raw ?line)
        (c/re-parse [(re-pattern "[^\001]")] ?line
                    :#> 33 {0 ?current-time, 29 ?visitor-id}))))

(defn compute-agg [output-tap openx-dir]
  (let [openx (openx-data openx-dir)]
    (?<- output-tap [?a ?b] (openx ?a ?b))))

(defn -main [openx-dir output-dir]
  (compute-agg (stdout) openx-dir))


But the IDE complains immediately: "expected left paren, symbol or
literal ::" (highlighting the :#> operator).

And if I try to build, it complains with:
Exception in thread "main" java.lang.Exception: Unable to resolve
symbol: ?current-time in this context (____.clj:8)

Am I using it wrong?

Marc

nathanmarz

Jul 13, 2010, 10:40:32 PM
to cascalog-user
I'm able to compile it fine. Are you using an older version of
Cascalog? Just today I updated the version in the demo from
1.0.1-SNAPSHOT to 1.1.0-SNAPSHOT.

Hope that helps,
Nathan

mlimotte

Jul 14, 2010, 4:40:13 PM
to cascalog-user
Thanks. It's working now, although La Clojure (the IntelliJ Clojure
IDE) doesn't recognize :#> as valid. It even flags it as an error
inside cascalog/predicate.clj. But I can still compile with lein.

Marc