hfs-textline to take multiple input files

thejus

Aug 21, 2012, 7:23:28 AM
to cascal...@googlegroups.com
Hi Guys,

Here is my problem:
I have a directory which contains dated files.
Every day I need to run a cascalog job on a subset of these files.
I can copy this subset into a temporary directory, run the job and delete it later.
But these files are huge and hence take a considerable amount of time to copy.

As far as I can tell, hfs-textline takes only one path.
Is there an hfs-textline-like function that creates a tap for multiple files,
or a function that can combine two taps?

--
Thejus

Sam Ritchie

Aug 21, 2012, 11:11:37 PM
to cascal...@googlegroups.com
Hey,

You can use :source-pattern like so:

(hfs-textline "/root/path" :source-pattern "/2011/*/01/*")

to get all data for the first of every month, for example.

-- 
Sam Ritchie, Twitter Inc
@sritchie

(Too brief? Here's why! http://emailcharter.org)

thejus

Aug 23, 2012, 2:26:56 AM
to cascal...@googlegroups.com
Hi Sam,

This is exactly what I was looking for.
Thanks.

--
Thejus

Stefan Hübner

Aug 23, 2012, 3:47:21 AM
to cascal...@googlegroups.com
Sam Ritchie <sritc...@gmail.com> writes:

> Hey,
>
> you can use :source-pattern like so:
>
> (hfs-textline "/root/path" :source-pattern "/2011/*/01/*")
>
> to get all data in the first of every month, for example.

Actually, adding the glob pattern to the path works just as well.
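
For reference, here is a minimal sketch of the two forms side by side. It assumes cascalog.api is on the classpath and that the files live under the example root "/root/path" used above. The namespace and function names are made up for illustration, and nothing here has been run against a real cluster.

```clojure
(ns example.multi-input
  (:use cascalog.api))

;; Form 1: root path plus :source-pattern, as in Sam's reply.
(defn first-of-month-tap []
  (hfs-textline "/root/path" :source-pattern "/2011/*/01/*"))

;; Form 2: the same glob appended directly to the path.
(defn first-of-month-glob-tap []
  (hfs-textline "/root/path/2011/*/01/*"))

;; Either tap can be used as a generator. This query prints every line read.
(defn print-lines [tap]
  (?<- (stdout) [?line] (tap ?line)))
```

Calling (print-lines (first-of-month-glob-tap)) should read the same files as the :source-pattern version, provided the glob expands the same way on your filesystem.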

Erik Andrejko

Aug 23, 2012, 11:59:26 AM
to cascal...@googlegroups.com
Hi Sam,

Does this also work with s3n paths?

Erik

Sam Ritchie

Aug 25, 2012, 9:01:03 PM
to cascal...@googlegroups.com
Yup, just make sure the path is prefixed with s3n://.

sourab...@corp.247customer.com

May 10, 2013, 8:35:37 AM
to cascal...@googlegroups.com

Any idea how I can use :source-pattern in JCascalog?
In Cascalog we can pass it like this: (hfs-textline "/root/path" :source-pattern "/2011/*/01/*")

In JCascalog we use Api.hfsTextline(path). Is there any way to pass the pattern as an argument to Api.hfsTextline instead of the complete path?

Thanks
Sourabh

David Kincaid

May 10, 2013, 9:34:58 AM
to cascal...@googlegroups.com
I don't see it in Api.java, but I think you could pretty easily create it yourself. Something like this:

public static Object hfsTextline(String path, String pattern) {
    return Util.bootSimpleFn("cascalog.api", "hfs-textline")
               .invoke(path, Keyword.intern("source-pattern"), pattern);
}

(Util is cascalog.Util and Keyword is clojure.lang.Keyword)

That is completely untested, but that's the general idea I think.

Revanth Revoori

Jul 16, 2014, 4:30:22 AM
to cascal...@googlegroups.com
Hi David,

I have a similar problem. I need to include all the folders matching a pattern (032014/7, 042014/7, 052014/7) in a tap so that it picks up all the files inside those folders. I also have to include a specific Pail structure. Any help would be appreciated.

Andrés Corrada-Emmanuel

Sep 11, 2014, 5:42:55 PM
to cascal...@googlegroups.com
For the benefit of others, I'll share my hack for tapping arbitrary collections of files that may not follow a simple pattern. The key is the line that maps over the paths and pulls the :sink out of each hfs-textline tap.


;; Assumes (:import [cascading.tap MultiSourceTap]) in the ns form.
;; t-paths/make-calendar-hour-paths and hour-path-has-files? are helpers not shown here.
(defn make-hours-multi-tap
  "Makes a single data tap out of multiple calendar hour subdirectories.
  Returns nil when none of the hour paths contain files."
  [base-path start-hour end-hour path-pattern]
  (let [hour-paths (t-paths/make-calendar-hour-paths base-path start-hour end-hour)
        ;; Not all hour paths may be present, so filter out the ones that are
        ;; non-existent or empty.
        hour-paths-with-files (filter hour-path-has-files? hour-paths)]
    (when (seq hour-paths-with-files)
      ;; There are files with data that we can tap.
      (let [hour-taps (map #(:sink (hfs-textline (str % path-pattern)))
                           hour-paths-with-files)]
        (MultiSourceTap. (into-array cascading.tap.Tap hour-taps))))))
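
The hour-path-has-files? helper is not shown above. One possible version, sketched here purely as an assumption about what it might do, asks the Hadoop FileSystem API whether the directory exists and holds at least one non-empty file. The namespace name is invented, and the real helper may well apply different rules (for example, ignoring particular file names).

```clojure
(ns example.hour-paths
  (:import [org.apache.hadoop.conf Configuration]
           [org.apache.hadoop.fs FileStatus FileSystem Path]))

;; Hypothetical stand-in for the helper used in make-hours-multi-tap.
;; Returns true only if path-str exists and directly contains at least
;; one entry with a non-zero length.
(defn hour-path-has-files?
  [path-str]
  (let [path (Path. ^String path-str)
        ^FileSystem fs (.getFileSystem path (Configuration.))]
    (and (.exists fs path)
         (boolean (some (fn [^FileStatus status] (pos? (.getLen status)))
                        (.listStatus fs path))))))
```

Because the filesystem is resolved from the path itself, the same check should work for local, HDFS, or s3n paths as long as the matching Hadoop configuration is available.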

Beau Fabry

Sep 17, 2014, 11:18:27 PM
to cascal...@googlegroups.com
Thanks for sharing this Andrés! We were hacking something up with globbing prior to my seeing this.

Matthew Wooller

Mar 5, 2015, 6:13:59 AM
to cascal...@googlegroups.com
Thank you. I had been trying to pull something together for a couple of hours, then found this. As usual, I was overcomplicating things; this is nice and straightforward.

Cheers!

M.