AWS EMR-4.3.0, Cascalog and S3 paths (EMRFS)


Scott Burton

Aug 16, 2016, 6:07:22 PM
to cascalog-user
Hi all;

I'm just catching on to Cascalog now, so I don't really know where the bodies are buried yet.

I'm trying to access files on Amazon S3 while running Cascalog uberjars on an AWS EMR 4.3.0 cluster; I'd like to read the files using `hfs-textline` or `hfs-delimited` and ideally sink my output to S3 as well.

I'm able to access files on S3 from the EMR version of Hive with ordinary S3 paths ("s3://bucket/path/to/file"), so I'm reasonably certain EMRFS works on this cluster. However, when running my Cascalog job jar on the cluster, I get the following:

> hadoop jar /home/hadoop/my-job-0.2.0-standalone.jar clojure.main s3://my-bucket/path/to/file.csv ~/out
Exception in thread "main" java.io.FileNotFoundException: s3://my-bucket/path/to/file.csv (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at clojure.lang.Compiler.loadFile(Compiler.java:7314)
at clojure.main$load_script.invokeStatic(main.clj:275)
at clojure.main$load_script.invoke(main.clj:268)
at clojure.main$script_opt.invokeStatic(main.clj:337)
at clojure.main$script_opt.invoke(main.clj:330)
at clojure.main$main.invokeStatic(main.clj:421)
at clojure.main$main.doInvoke(main.clj:384)
at clojure.lang.RestFn.invoke(RestFn.java:421)
at clojure.lang.Var.invoke(Var.java:383)
at clojure.lang.AFn.applyToHelper(AFn.java:156)
at clojure.lang.Var.applyTo(Var.java:700)
at clojure.main.main(main.java:37)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


`FileNotFoundException` makes it look like we're trying and failing to use HDFS? Maybe?

Running in the REPL gives me a different error:

IllegalArgumentException AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).  org.apache.hadoop.fs.s3.S3Credentials.initialize (S3Credentials.java:66)


I'm not certain whether the REPL is running a bundled Hadoop or EMR's Hadoop, but this looks like a Hadoop-sourced S3 error rather than an EMRFS one.

Last thing, here's my project.clj:

(defproject refpath-count "0.2.0"
  :description "FIXME: write description"
  :license {:name "Eclipse Public License"}
  :repositories {"conjars" "http://conjars.org/repo"}
  :dependencies [
      [org.clojure/clojure "1.8.0"]
      [cascalog/cascalog-core "3.0.0"]
      [cascalog/cascalog-more-taps "3.0.0"]
    ]
  :profiles { :dev 
      {
        :dependencies [
          [org.apache.hadoop/hadoop-core "1.2.1"]
          [cascalog/midje-cascalog "3.0.0"]
        ]
      }
      :plugins [
        [lein-midje "3.0.1"]
      ]
      :provided {
        :dependencies [
          [org.apache.hadoop/hadoop-core "1.2.1"]
        ]
      }
  }
  :jvm-opts ["-Xms768m" "-Xmx768m"]
  :main nil
  :aot [main.core]
  )


Any help would be greatly appreciated. Thanks!!

Igor Postelnik

Aug 16, 2016, 6:27:59 PM
to cascalog-user
There was an issue in several versions of lein where :main nil wasn't respected.

-Igor

Scott Burton

Aug 16, 2016, 8:01:09 PM
to cascalog-user
Awesome info, thanks Igor.

Do you think that's the source of the problem? I'm able to execute the main method I want by passing it as an argument to the jar, like so:

hadoop jar /home/hadoop/my-job-0.2.0-standalone.jar clojure.main s3://my-bucket/path/to/file.csv ~/out

I'm suspicious of the :provided dependencies section. Should this `hadoop-core "1.2.1"` match EMR's version of Hadoop?

Gareth Rogers

Aug 17, 2016, 3:48:11 AM
to cascalog-user
To answer the question about local vs. in the cluster: when running in the cluster, the cluster itself has an IAM role and is set up so that tools using the S3 access libraries can find and use those credentials. That's not typically the case when running in your REPL; locally you need to specify your credentials yourself. You can do this in the S3 URI, e.g. s3://<access_key>:<secret_key>@bucket/path/to/file.

From the error message it looks like you could also set these as Hadoop properties, but I've never tried that. From a quick search it looks like you can set fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey (or the fs.s3n.* equivalents) in your job conf (http://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation). I think that just means passing the map {"fs.s3.awsAccessKeyId" "<access_key>" "fs.s3.awsSecretAccessKey" "<secret_access_key>"} to with-job-conf (http://nathanmarz.github.io/cascalog/cascalog.api.html#var-with-job-conf). I'm a bit rusty on Cascalog though and don't have a tested example to hand.
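
Roughly, and untested, I'd expect something like this, using cascalog.api's with-job-conf and hfs-textline (the fs.s3n.* property names come from your error message and may differ by Hadoop version and by whether you use s3 or s3n):

(require '[cascalog.api :refer [with-job-conf ?- stdout hfs-textline]])

;; Untested sketch: bind the S3 credential properties for the queries in the
;; body only, then try reading a file through an s3n:// path.
(with-job-conf {"fs.s3n.awsAccessKeyId"     "<access_key>"
                "fs.s3n.awsSecretAccessKey" "<secret_key>"}
  (?- (stdout)
      (hfs-textline "s3n://bucket/path/to/file")))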

It's better if the :provided Hadoop dependency matches the one on the EMR cluster; that way your local tests run against the same version of Hadoop that's running live.

You haven't gone through what you've tried, so here are a few simple checks I do when I've hit this error in the past. For me it has always turned out to be a typo, and from what you've said it does feel like the file doesn't exist. Have you checked, with a copy and paste, that the file does exist? e.g. aws s3 ls s3://bucket/path/to/file. Above I can see you've switched between using .csv and not in the file name, but I guess that's just an inconsistency in your redactions.

Could you paste some of the Cascalog code you're using to open the file?

Thanks
Gareth

Scott Burton

Aug 17, 2016, 2:29:37 PM
to cascalog-user
Hey Gareth;

Thanks for clearing that up about cluster vs. REPL mode - that was my intuition, but I wasn't certain. I am using an IAM role that grants read access to files in my S3 buckets, and I'm able to access them using paths beginning with "s3://" in this EMR instance's version of Hive, so I believe the IAM role is working.

Also, I am able to read the file at this path with the bundled `aws` CLI, list it with `aws s3 ls s3://bucket/path`, and copy the file into HDFS using `hadoop fs -cp s3://bucket/path dest`.

Here's the main function that reads the file paths passed as arguments:

; Main function; this is defined in project.clj
(defn -main
  [in out & args]
  (?- (hfs-delimited out :sinkmode :replace :delimiter ",")
      ; (stdout)
      (<- [?refpath ?count]
          ((refpath-counts (user-consideration-paths (hfs-delimited in :delimiter ","))) :> ?refpath ?count))))

`refpath-counts` and `user-consideration-paths` are queries which accept Cascalog generators as arguments. This much works in local mode using local files as input.
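
For shape only (this isn't the real refpath-counts), a parameterized Cascalog query of that kind generally looks something like this word-count-style sketch, assuming the post-2.0 namespace layout where the built-in ops live in cascalog.logic.ops:

(require '[cascalog.api :refer [<-]]
         '[cascalog.logic.ops :as c])

;; Hypothetical sketch: takes any generator of single-field ?refpath tuples
;; and emits [?refpath ?count] pairs using the built-in count aggregator.
(defn count-per-refpath [src]
  (<- [?refpath ?count]
      (src ?refpath)
      (c/count ?count)))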

The uberjar also works on EMR when fed files out of HDFS (I've tried copying them in from S3, as above), using relative paths on the distributed FS and no scheme (e.g. "./my-src.csv"). So as far as I can tell right now, the only thing not working is files with "s3://" as the scheme, opened by `hfs-delimited`.

Thanks again for taking a look.

Gareth Rogers

Aug 18, 2016, 6:56:21 AM
to cascalog-user
Hi Scott

Then that's a bit mysterious! It doesn't feel like something you're obviously doing wrong.

A couple of simple things you could try: switch from s3 to s3n, change the EMR release version, and see if you can explicitly set the AWS access key ID and secret access key via the Hadoop job conf for both s3 and s3n. There was some detail in my previous post about how to do that, but again I've not done it myself.

I've never done this but I bet you can get a Clojure REPL running on the EMR master node. From there you could try using Amazonica to query S3 from within Clojure and also write some stupid Cascalog code like:

(defn -main []
  (?- (stdout)
      (hfs-delimited "s3://your-bucket/your-file/" :delimiter ",")))

Taking care not to flood stdout with your file contents ;)

Does hfs-textline work, and can you write to S3?
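
A write smoke test could be equally dumb; something like this (untested, with a placeholder bucket and paths) would at least tell you whether sinking to S3 works at all:

(require '[cascalog.api :refer [?- hfs-textline]])

;; Untested sketch: copy a small file (local or HDFS) straight to an S3 sink.
(?- (hfs-textline "s3://your-bucket/smoke-test-out" :sinkmode :replace)
    (hfs-textline "/some/small/input"))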

Scott Burton

Aug 18, 2016, 2:56:57 PM
to cascalog-user
Gareth;

Your suggestion to change the release version prompted me to do two things: (a) launch the latest EMR release (I'd been on 4.3 rather than 5.0) and (b) write a smoke test using a stupid-simple example.

I'll post a link to the smoke-test repo shortly, but TL;DR: the damn thing works. I suspect the problem was the :aot setting in my project.clj; with :aot [main.core] and :main nil, the arguments I passed to hadoop ("hadoop jar [jarfile] [main-method] *[args]") were probably parsed incorrectly, leading to a false negative. The following project.clj seems to work, and s3:// paths are now parsed correctly:

(defproject refpath-count "0.2.1"
  :description "FIXME: write description"
  :license {:name "Eclipse Public License"}
  :repositories {"conjars" "http://conjars.org/repo"}
  :dependencies [
      [org.clojure/clojure "1.8.0"]
      [cascalog/cascalog-core "3.0.0"]
      [cascalog/cascalog-more-taps "3.0.0"]
    ]
  :profiles { 
    :dev {
      :dependencies [
        [org.apache.hadoop/hadoop-core "1.2.1"]
        [cascalog/midje-cascalog "3.0.0"]
      ]
    }
    :plugins [
      [lein-midje "3.0.1"]
    ]
    :provided {
      :dependencies [
        [org.apache.hadoop/hadoop-core "1.2.1"]
      ]
    }
    :uberjar {:aot :all}
  }
  :jvm-opts ["-Xms768m" "-Xmx768m"]
  :main "main.core"
  )


Frustrating but probably worth the effort. Many thanks for helping me track it down!

Gareth Rogers

Aug 19, 2016, 3:23:25 AM
to cascalog-user
Excellent, glad you found the problem :)

Now for the next way of breaking things:
:profiles {:provided {:dependencies [[org.apache.hadoop/hadoop-client "2.7.2"]
                                     [org.apache.hadoop/hadoop-common "2.7.2"]]}}

This should just switch the version of Hadoop you use when running in your REPL (and probably other places; I can't quite remember the rules for :provided) to the one you're running against in EMR, hence no version mismatches. The split of hadoop-core into hadoop-client and hadoop-common was, I think, the only difficulty in the switch.

Gareth

Scott Burton

Aug 19, 2016, 2:00:55 PM
to cascalog-user
Good stuff Gareth, I'll give that a try and see if EMRFS works in a REPL. Thanks again for all your help!

As promised, here's an absurdly simple example app that exercises paths with hfs-textline and thus, theoretically, works with s3:// paths:
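
Roughly, the whole app boils down to something like this (the namespace and paths below are placeholders, and this is a sketch rather than the repo's exact code):

(ns main.core
  (:require [cascalog.api :refer [?- hfs-textline]])
  (:gen-class))

;; Sketch only: read any hfs-textline-compatible path (local, HDFS, or s3://)
;; and write the lines back out to the output path.
(defn -main [in out & args]
  (?- (hfs-textline out :sinkmode :replace)
      (hfs-textline in)))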
