Parsing files from Amazon S3 buckets

Avram

Jun 29, 2010, 6:07:03 PM

to cascalog-user

I would like to replicate the news feed example
(http://nathanmarz.com/blog/cascalog-news-feed/) using data residing in
S3 buckets.

In Hive, I can point a table location at 's3://somebucket/'. Can I do
the same in Cascalog?

Also, I am running into an issue with re-parse when I run the example
code from that post:

Clojure 1.1.0
user=> (use 'cascalog.playground) (bootstrap)
nil
nil
user=>
user=> (defn follows-data [dir]
         (let [source (hfs-textline dir)]
           (<- [?p ?p2]
               (source ?line)
               (re-parse [#"[^\s]+"] ?line :> ?p ?p2)
               (:distinct false))))
java.lang.Exception: Unable to resolve symbol: re-parse in this context (NO_SOURCE_FILE:5)
user=> (doc reparse)
java.lang.Exception: Unable to resolve var: reparse in this context (NO_SOURCE_FILE:7)

Thanks in advance,

~A

nathanmarz

Jun 29, 2010, 11:26:02 PM

to cascalog-user

Yes, you can point hfs-* taps at S3. If your access key and secret key
are specified in your Hadoop configuration, the exact syntax you
described will work. Otherwise, you can embed them in the URL string.
Check the Hfs documentation from Cascading for the details:
http://www.cascading.org/javadoc/cascading/tap/Hfs.html
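
As a rough illustration of the second option, here is a minimal sketch of building such a URL in Clojure. The helper name s3-url and the placeholder credentials are hypothetical. The s3://ID:SECRET@bucket/path form and the fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey property names reflect my understanding of Hadoop's S3 filesystem support, so check them against your Hadoop version.

```clojure
;; Hypothetical helper (not part of Cascalog or Hadoop): builds an S3 URL
;; with the credentials embedded, in the form s3://ID:SECRET@bucket/path.
(defn s3-url [access-key secret-key bucket path]
  (str "s3://" access-key ":" secret-key "@" bucket "/" path))

;; Placeholder credentials, for illustration only.
(s3-url "MY_ACCESS_KEY" "MY_SECRET_KEY" "somebucket" "folder/part-*")
;; => "s3://MY_ACCESS_KEY:MY_SECRET_KEY@somebucket/folder/part-*"

;; With Cascalog on the classpath and (bootstrap) already run, the string
;; is passed to a tap like any other path. If the keys are set in the
;; Hadoop configuration instead (assumed property names:
;; fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey), the plain
;; "s3://somebucket/..." form should be enough.
(comment
  (hfs-textline (s3-url "MY_ACCESS_KEY" "MY_SECRET_KEY"
                        "somebucket" "folder/part-*")))
```

Keeping the keys in the Hadoop configuration is probably the safer choice, since a URL with embedded credentials tends to end up in logs and REPL output.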

re-parse is defined in the cascalog.ops namespace, and 'bootstrap'
from cascalog.playground requires that namespace under the alias "c".
So to use re-parse, do this:

(c/re-parse [#"[^\s]+"] ?line :> ?p ?p2)
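
To see what the pattern pulls out of a line, the regex can be tried at a plain Clojure REPL with re-seq. This is only a sketch: it assumes re-parse emits the successive regex matches as its output fields, and the sample line is made up.

```clojure
;; Made-up sample line: two names separated by whitespace (a tab here).
(def sample-line "alice\tbob")

;; Successive matches of "one or more non-whitespace characters".
(re-seq #"[^\s]+" sample-line)
;; => ("alice" "bob")

;; The query from the first message with the "c" alias added.
;; Needs Cascalog on the classpath and (bootstrap) from cascalog.playground.
(comment
  (defn follows-data [dir]
    (let [source (hfs-textline dir)]
      (<- [?p ?p2]
          (source ?line)
          (c/re-parse [#"[^\s]+"] ?line :> ?p ?p2)
          (:distinct false)))))
```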

Avram

Jul 2, 2010, 3:14:35 PM

to cascalog-user

Thanks for this. I'm still not quite there, though. My data is
tab-delimited, and I ultimately only care about the 1st, 2nd, 5th, and
7th fields. There are 50-100 columns, and I'm unsure how to use
destructuring to pick those out.
Also, I'm unclear on how to invoke the function with the right kind of
quoting around the s3:// path.

So far, I have this (basically unaltered from your example of the
follows-data function)...

(defn get-my-data [dir]
  (let [source (hfs-textline dir)]
    (<- [?p ?p2]
        (source ?line)
        (c/re-parse [#"[^\s]+"] ?line :> ?p ?p2)
        (:distinct false))))

(get-my-data "s3://path.to.my.bucket/folder/part-*")

user=> (get-my-data "s3://path.to.my.bucket/folder/part-*")
{:type :generator, :id "34a7407c-b28f-4987-a594-ad4415b86d42",
 :ground? true,
 :sourcemap {"892d9ba0-28bb-4808-a772-5de1099921a4"
 #<Hfs Hfs["TextLine[['line']->[ALL]]"]["s3://path.to.my.bucket/folder/part-*"]"]>},
 :pipe #<Each Each(892d9ba0-28bb-4808-a772-5de1099921a4)
 [Identity[decl:ARGS]]>, :outfields ["?p__gen8" "?p2__gen9"]}
user=>

What am I doing wrong?

Many thanks,
~A

nathanmarz

Jul 2, 2010, 11:43:08 PM

to cascalog-user

<- returns a subquery. What you're seeing printed at the REPL is the
"compiled" version of the subquery. To execute the subquery and write
its results to an output tap, use the ?- function, e.g.
(?- (hfs-textline "/tmp/results") myquery)

As for getting only the 1st, 2nd, 5th, and 7th fields, you'll want to
make use of the :#> output selector. For example, if "my-data" has 100
fields, here's how you can write a predicate to select only those
fields you're interested in:

(my-data :#> 100 {0 ?a, 1 ?b, 4 ?c, 6 ?d})

:#> expects two arguments, the first being the number of fields
emitted by the subquery and the second being a map from zero-based
position to variable name.
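
The positional lookup can be mimicked in plain Clojure to check the indexing. This is only an analogy for what the selector map means, not how Cascalog implements :#>. The helper select-positions and the sample record are hypothetical, as is the 100-field my-data generator in the commented query.

```clojure
;; Hypothetical helper: given a record as a vector of fields and a map
;; from zero-based position to name, returns a map from name to value.
(defn select-positions [fields pos->name]
  (into {} (map (fn [[pos nm]] [nm (nth fields pos)]) pos->name)))

;; Made-up seven-field record; "f5" is the 5th field, at position 4.
(def record ["f1" "f2" "f3" "f4" "f5" "f6" "f7"])

(select-positions record {0 '?a, 1 '?b, 4 '?c, 6 '?d})
;; returns a map equal to {?a "f1", ?b "f2", ?c "f5", ?d "f7"}

;; The same selection as a full Cascalog query. Needs Cascalog on the
;; classpath, (bootstrap), and a generator my-data emitting 100 fields.
(comment
  (<- [?a ?b ?c ?d]
      (my-data :#> 100 {0 ?a, 1 ?b, 4 ?c, 6 ?d})))
```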

re-parse won't work, though, if the number of columns differs from
record to record (you need to declare the exact number of fields at
compile time). In that case, you'll want to write some custom parsing
logic using defmapop. Let me know if you need help with that.
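
One possible shape for that custom parsing, as a minimal sketch rather than a definitive implementation: a plain Clojure function splits each line on tabs and returns the 1st, 2nd, 5th, and 7th fields, and defmapop wraps it for use in a query. The names parse-fields and parse-line are hypothetical, and the sample lines are made up.

```clojure
;; Hypothetical helper: splits a tab-delimited line and returns a vector
;; of the fields at zero-based positions 0, 1, 4 and 6. The -1 limit
;; keeps trailing empty columns; positions past the end of a short line
;; come back as nil.
(defn parse-fields [line]
  (let [cols (vec (.split line "\t" -1))]
    (vec (map #(get cols %) [0 1 4 6]))))

(parse-fields "u1\tu2\tc\td\te\tf\tg")
;; => ["u1" "u2" "e" "g"]

(parse-fields "u1\tu2")
;; => ["u1" "u2" nil nil]

;; Wiring it into Cascalog. Needs Cascalog on the classpath and
;; (bootstrap) from cascalog.playground. Note that ?-variables are
;; non-nullable, so records with a nil in one of the four positions are
;; dropped; !-variables would keep them.
(comment
  (defmapop parse-line [line]
    (parse-fields line))

  (defn get-my-data [dir]
    (let [source (hfs-textline dir)]
      (<- [?a ?b ?c ?d]
          (source ?line)
          (parse-line ?line :> ?a ?b ?c ?d)
          (:distinct false))))

  (?- (hfs-textline "/tmp/results")
      (get-my-data "s3://path.to.my.bucket/folder/part-*")))
```

Because the operation always emits exactly four fields, the field count is fixed at compile time even when the number of columns in the input varies.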
