Loading multiple files


ulrik.s...@jayway.com

Jan 5, 2016, 5:03:07 AM
to PigPen Support
I have lots of files and would like to avoid concat'ing them all from lots of loading jobs. I saw in an earlier discussion this comment:

"If you know that all of the files are in the same format, change find-inputs to generate a string like this: {/path/file1,/path/file2,/path/file3} and pass that to a single load-clj command."

Is this how I load multiple files? I can't get it to work, at least not using load-json.
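For context, the glob string from that quote can be built programmatically. Here's a minimal sketch of such a helper (`glob-path` is a hypothetical name, not part of the PigPen API):

```clojure
(require '[clojure.string :as string])

;; Join several input paths into a single Pig-style glob:
;; ["a" "b" "c"] -> "{a,b,c}"
(defn glob-path [paths]
  (str "{" (string/join "," paths) "}"))

;; (pig/load-json (glob-path ["/path/file1" "/path/file2"]))
;; is then equivalent to (pig/load-json "{/path/file1,/path/file2}")
```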

Matt Bossenbroek

Jan 5, 2016, 11:37:08 AM
to ulrik.s...@jayway.com, PigPen Support
Unfortunately, that is the state of the art. It's a limitation of the host system.

Here are some references on what that syntax should look like:


In my experience, globbing like this is much more efficient than loading each file individually and concatenating the results.

If that doesn't work, what's the error message you're getting?

-Matt

--
You received this message because you are subscribed to the Google Groups "PigPen Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pigpen-suppor...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ulrik.s...@jayway.com

Jan 6, 2016, 9:42:45 AM
to PigPen Support, ulrik.s...@jayway.com
It would be great if the globbing thing worked. I don't get an error, just no output.

```
(->> (pig/load-json "etc/test.jsonl.gz")
     (pig/dump)
     count)
=> 10

(->> (pig/load-json "etc/test2.jsonl.gz")
     (pig/dump)
     count)
=> 10

(->> (pig/load-json "{etc/test.jsonl.gz,etc/test2.jsonl.gz}")
     (pig/dump)
     count)
=> 0
```

But I might have misunderstood the meaning of "change find-inputs to generate a string like this: {/path/file1,/path/file2,/path/file3} and pass that to a single load-clj command".

Matt Bossenbroek

Jan 6, 2016, 11:41:14 AM
to ulrik.s...@jayway.com, PigPen Support
Oooooh - you're using it locally! Do you have an actual use case for globbing in the repl, or did you just want to test that it works?

I've never implemented globbing locally because the local mode is generally just for testing prior to running on an actual cluster (where globbing is supported). I rarely use actual files in the repl & prefer just to use pig/return for unit tests.
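For what it's worth, a unit test with pig/return looks roughly like this (a sketch, assuming pigpen.core is required as `pig`; the data and transformation are illustrative):

```clojure
(require '[pigpen.core :as pig])

;; Inject literal data instead of reading files - handy for testing
;; in local mode, where file globbing isn't implemented.
(->> (pig/return [{:a 1} {:a 2} {:a 3}])
     (pig/map :a)
     (pig/dump))
```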

Are all of the files in the same directory (with nothing else)? If so, you should be able to ask it to load a directory and it should have the same effect. If you're interested, here's the code that does the file listing for a given path: https://github.com/Netflix/PigPen/blob/e7c6c56df87e98ecde4970a3f85717d03c79c981/pigpen-core/src/main/clojure/pigpen/extensions/io.clj#L24

If you have your files organized such that you can load a whole folder, that should work on the cluster as well.


Let me know if that will work for you.

-Matt

ulrik.s...@jayway.com

Jan 6, 2016, 2:01:27 PM
to PigPen Support, ulrik.s...@jayway.com
It was only to test that it works. I figured if it doesn't work locally, there's no point in going to the cluster. :)

I tried pointing to a single directory instead, and that actually works locally. Thanks for that tip.

```
(->> (pig/load-json "etc/test/")
     (pig/dump)
     count)
=> 20
```

ulrik.s...@jayway.com

Jan 6, 2016, 4:09:56 PM
to PigPen Support, ulrik.s...@jayway.com
I have problems on the EMR cluster when specifying an S3 folder. In the script, I have:

LOAD 's3://mybucket/20150317/'

The logs say:

2016-01-06 20:56:41,495 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (JobControl): listStatus s3://mybucket/20150317 with recursive false
2016-01-06 20:56:41,704 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (JobControl): Total input paths to process : 15

The folder contains 15 files, so that's correct (even though the trailing slash is gone at this point).

Yet it fails with this message:

Input(s):
Failed to read data from "s3://mybucket/20150317"

I have a feeling it tries to read the S3 "folder" itself, but I can't be sure. Any other ideas?

Matt Bossenbroek

Jan 6, 2016, 4:20:05 PM
to ulrik.s...@jayway.com, PigPen Support
Could you try these variations?

's3://mybucket/20150317'
's3://mybucket/20150317/*'

I remember pig being finicky about that - let me see if I can find a reference...

-Matt

ulrik.s...@jayway.com

Jan 6, 2016, 6:22:01 PM
to PigPen Support, ulrik.s...@jayway.com
No luck with any of those. Both say that there are 15 input paths to process, and they both fail:

> 's3://mybucket/20150317'

Input(s):
Failed to read data from "s3://udl-prod/messages/20150317"

> 's3://mybucket/20150317/*'

Input(s):
Failed to read data from "s3://udl-prod/messages/20150317/*"

In all cases, the output also fails. Is that an obvious consequence of input failure, or is there something wrong with the output configuration?

Output(s):
Failed to produce result in "hdfs://ip-10-126-39-199.eu-west-1.compute.internal:8020/user/hadoop/output.clj"

The output step is coded like this:

...
(pig/store-clj "output.clj"))

The script is called like this:

2016-01-06T23:04:25.954Z INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar pig-script --run-pig-script --args -f s3://mybucket/my-script.pig -p OUTPUT=s3://mybucket/output/'

I have created the S3 folder given in OUTPUT above. I don't actually use OUTPUT anywhere. Is it implicit, or do I need to refer to it somewhere?

Matt Bossenbroek

Jan 6, 2016, 6:32:50 PM
to ulrik.s...@jayway.com, PigPen Support
> No luck with any of those.

That sucks - I couldn't find anything that I remember seeing in the past either.


> In all cases, the output also fails. Is that an obvious consequence of input failure, or is there something wrong with the output configuration?

IIRC, yes.


> I don't actually use OUTPUT anywhere. Is it implicit, or do I need to refer to it somewhere?

That could be it. I would include the full output path in the store command, like this:

(pig/store-clj "s3://mybucket/output/output.clj")

It's not going to consume any parameters you supply unless you use them explicitly, like this:

(pig/store-clj "$OUTPUToutput.clj")

Then it'll show up in the script like that & pig will do the parameter substitution as normal.


Is it launching any hadoop jobs, or does it fail before that stage? It's often confusing which part failed & pig will just say 'it failed' without relaying the real cause.

If you're starting fresh, another option might be to try the cascading route… let me know if you have more questions on that.

-Matt

ulrik.s...@jayway.com

Jan 6, 2016, 7:28:59 PM
to PigPen Support, ulrik.s...@jayway.com
I logged in to the master and used grunt to get quicker feedback.

I tried the full path to the output file, but no difference:

Input(s):
Failed to read data from "s3://mybucket/20150317"

Output(s):
Failed to produce result in "s3://mybucket/output/output.clj"

It seems to actually start a hadoop job:

16/01/07 00:04:48 INFO mapReduceLayer.MapReduceLauncher: Running jobs are [job_1452109769870_0014]
...
16/01/07 00:04:58 INFO mapReduceLayer.MapReduceLauncher: job job_1452109769870_0014 has failed! Stop running all dependent jobs

Matt Bossenbroek

Jan 6, 2016, 7:47:56 PM
to ulrik.s...@jayway.com, PigPen Support
In that case, the real error is likely on hadoop. Try looking through the hadoop job tracker to find the logs from the job it launched. That'll be the useful error.

-Matt

Matt Bossenbroek

Jan 26, 2016, 1:12:22 PM
to ulrik.s...@jayway.com, PigPen Support
Were you able to track down the cause of the error? Was it related to the path globbing?

-Matt

Ulrik Sandberg

Jan 26, 2016, 2:28:55 PM
to Matt Bossenbroek, PigPen Support
No, I never found the reason why it didn't work. I switched to Cascalog and it worked right away. I could point out either folders or files on S3 without any problems.

Matt Bossenbroek

Jan 26, 2016, 2:35:25 PM
to Ulrik Sandberg, PigPen Support
Just curious, did you try the pigpen/cascading route at all? I'm guessing the cascading backend would handle globbing similarly for both pigpen & cascalog.

-Matt

Ulrik Sandberg

Jan 26, 2016, 4:40:54 PM
to Matt Bossenbroek, PigPen Support
No, I didn't. Sorry.