I'm attempting to use the EMR Runner to enrich and shred records generated by the scala-kinesis-collector and persisted to S3 with the scala-s3-sink.
Notes:
- This is a repeated run, so I am using --skip staging; the error is identical on every run.
- I have tried this with both s3:// and s3n:// URIs and get the same error either way.
- The AWS creds that I'm passing in have full admin permissions (a quick sanity check of them is sketched just after this list).
- The EMR roles are configured correctly.
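For reference, a quick sanity check I could run against those keys would be something like the following (key values and bucket name are placeholders, redacted the same way as in the config further down); it would at least confirm the keys can reach the bucket outside of EMR:

# Hypothetical sanity check with the AWS CLI: confirm the key pair from the
# EmrEtlRunner config can list the raw bucket directly (values are placeholders).
AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXX \
AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
aws s3 ls s3://xxxx-dev2-snowplow-raw/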
When I run emr-etl-runner I get the following error:
$ bundle exec bin/snowplow-emr-etl-runner --config /home/ubuntu/snowplow-emr-runner-config.yml --skip staging
D, [2015-04-24T17:49:33.500168 #14300] DEBUG -- : Initializing EMR jobflow
D, [2015-04-24T17:49:34.661986 #14300] DEBUG -- : EMR jobflow j-YRU3U2Z7Z5AH started, waiting for jobflow to complete...
F, [2015-04-24T17:57:35.534551 #14300] FATAL -- :
Snowplow ETL: SHUTTING_DOWN [Shut down as step failed] ~ elapsed time n/a [2015-04-24 17:52:55 UTC - ]
- 1. Elasticity Scalding Step: Enrich Raw Events: FAILED ~ 00:02:40 [2015-04-24 17:53:50 UTC - 2015-04-24 17:56:30 UTC]
- 2. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
- 3. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
- 4. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:299:in `run'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `block in common_method_added'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:60:in `run'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `block in common_method_added'
bin/snowplow-emr-etl-runner:39:in `<main>'
When I look into log_bucket/j-YRU3U2Z7Z5AH/task-attempts/job_xxx_0002/attempt.../syslog, I see the following:
2015-04-24 17:55:06,102 WARN org.apache.hadoop.conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
2015-04-24 17:55:07,009 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2015-04-24 17:55:07,720 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2015-04-24 17:55:07,915 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2015-04-24 17:55:07,989 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2015-04-24 17:55:07,989 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hadoop for UID 105 from the native implementation
2015-04-24 17:55:07,993 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy3.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapred.FileOutputCommitter.getTempTaskOutputPath(FileOutputCommitter.java:234)
at org.apache.hadoop.mapred.Task.initialize(Task.java:522)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
2015-04-24 17:55:08,001 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
2015-04-24 17:55:08,002 INFO org.apache.hadoop.mapred.Child: Error cleaning up
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy3.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapred.DirectFileOutputCommitter.isDirectWrite(DirectFileOutputCommitter.java:128)
at org.apache.hadoop.mapred.DirectFileOutputCommitter.isDirectWrite(DirectFileOutputCommitter.java:116)
at org.apache.hadoop.mapred.DirectFileOutputCommitter.abortTask(DirectFileOutputCommitter.java:80)
at org.apache.hadoop.mapred.OutputCommitter.abortTask(OutputCommitter.java:233)
at org.apache.hadoop.mapred.Task.taskCleanup(Task.java:1037)
at org.apache.hadoop.mapred.Child$5.run(Child.java:287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:284)
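My reading of that exception is that the map task cannot see any S3 credentials in its Hadoop configuration, even though they are set in the runner config. As a purely hypothetical check (the property names are taken straight from the error message; key values and bucket are placeholders), something like this from the EMR master node should show whether s3n access works once the keys are passed explicitly:

# Hypothetical check from the EMR master node: pass the exact properties the
# error names and see whether a plain s3n listing succeeds (values are placeholders).
hadoop fs \
  -Dfs.s3n.awsAccessKeyId=XXXXXXXXXXXXXXXXXX \
  -Dfs.s3n.awsSecretAccessKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  -ls s3n://xxxx-dev2-snowplow-raw/

The exception also mentions embedding the key pair directly in the s3n URL (s3n://KEY:SECRET@bucket/path), but I'd rather not put credentials into bucket paths, so I haven't tried that.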
Finally, my configuration file:
:logging:
  :level: DEBUG # You can optionally switch to INFO for production
:aws:
  :access_key_id: XXXXXXXXXXXXXXXXXX
  :secret_access_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
:s3:
  :region: us-east-1
  :buckets:
    :assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    :log: s3://xxxx-dev2-snowplow-etl/logs
    :raw:
      :in: s3://xxxx-dev2-snowplow-raw
      :processing: s3://xxxx-dev2-snowplow-etl/processing
      :archive: s3://xxxx-dev2-snowplow-etl/archive # e.g. s3://my-archive-bucket/raw
    :enriched:
      :good: s3://xxxx-dev2-snowplow-enriched/good # e.g. s3://my-out-bucket/enriched/good
      :bad: s3://xxxx-dev2-snowplow-enriched/bad # e.g. s3://my-out-bucket/enriched/bad
      :errors: # Leave blank unless :continue_on_unexpected_error: set to true below
    :shredded:
      :good: s3://xxxx-dev2-snowplow-shredded/good # e.g. s3://my-out-bucket/shredded/good
      :bad: s3://xxxx-dev2-snowplow-shredded/bad # e.g. s3://my-out-bucket/shredded/bad
      :errors: # Leave blank unless :continue_on_unexpected_error: set to true below
:emr:
  :region: us-east-1 # Always set this
  :jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
  :service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
  :placement: us-east-1a # Set this if not running in VPC. Leave blank otherwise
  :ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
  :ec2_key_name: xxxxx-dev
  :bootstrap: [] # Set this to specify custom boostrap actions. Leave empty otherwise
  :software:
    :hbase: # To launch on cluster, provide version, "0.92.0", keep quotes
    :lingual: # To launch on cluster, provide version, "1.1", keep quotes
  # Adjust your Hadoop cluster below
  :jobflow:
    :master_instance_type: c4.large
    :core_instance_count: 1
    :core_instance_type: c4.large
    :task_instance_count: 0 # Increase to use spot instances
    :task_instance_type: m1.small
    :task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
:etl:
  :job_name: Snowplow ETL # Give your job a name
  :versions:
    :hadoop_enrich: 0.14.1 # Version of the Hadoop Enrichment process
    :hadoop_shred: 0.4.0 # Version of the Hadoop Shredding process
  :collector_format: thrift # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
  :continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
:iglu:
  :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
  :data:
    :cache_size: 500
    :repositories:
      - :name: "Iglu Central"
        :priority: 0
        :vendor_prefixes:
          - com.snowplowanalytics
        :connection:
          :http:
Thank you all for this great product; I'm sure I'm making a simple mistake!