EMR Fails with "AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties..."


Michael Thomas

Apr 24, 2015, 2:14:37 PM
to snowpl...@googlegroups.com
I'm attempting to use the EMR Runner to enrich and shred records generated by the scala-kinesis-collector and persisted to S3 with the scala-s3-sink.

Notes
 - this is on a second run so I am using skip staging
 - I have tried this with both s3://.. and s3n:// URIs and get the same error either way
 - The AWS creds that I'm passing in have full admin permissions 
 - The EMR roles are configured correctly

When I run emr-etl-runner I get the following error (as noted above, this is a repeated run with --skip staging, and the error is identical every time):

$ bundle exec bin/snowplow-emr-etl-runner --config /home/ubuntu/snowplow-emr-runner-config.yml --skip staging
D, [2015-04-24T17:49:33.500168 #14300] DEBUG -- : Initializing EMR jobflow
D, [2015-04-24T17:49:34.661986 #14300] DEBUG -- : EMR jobflow j-YRU3U2Z7Z5AH started, waiting for jobflow to complete...

F, [2015-04-24T17:57:35.534551 #14300] FATAL -- :

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-YRU3U2Z7Z5AH failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: SHUTTING_DOWN [Shut down as step failed] ~ elapsed time n/a [2015-04-24 17:52:55 UTC - ]
 - 1. Elasticity Scalding Step: Enrich Raw Events: FAILED ~ 00:02:40 [2015-04-24 17:53:50 UTC - 2015-04-24 17:56:30 UTC]
 - 2. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
 - 3. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 4. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):
    /home/ubuntu/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:299:in `run'
    /home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `block in common_method_added'
    /home/ubuntu/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:60:in `run'
    /home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `block in common_method_added'
    bin/snowplow-emr-etl-runner:39:in `<main>'

When I look into log_bucket/j-YRU3U2Z7Z5AH/task-attempts/job_xxx_0002/attempt.../syslog I see the following:

2015-04-24 17:55:06,102 WARN org.apache.hadoop.conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
2015-04-24 17:55:07,009 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2015-04-24 17:55:07,720 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2015-04-24 17:55:07,915 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2015-04-24 17:55:07,989 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2015-04-24 17:55:07,989 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hadoop for UID 105 from the native implementation
2015-04-24 17:55:07,993 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy3.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapred.FileOutputCommitter.getTempTaskOutputPath(FileOutputCommitter.java:234)
at org.apache.hadoop.mapred.Task.initialize(Task.java:522)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
2015-04-24 17:55:08,001 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
2015-04-24 17:55:08,002 INFO org.apache.hadoop.mapred.Child: Error cleaning up
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy3.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapred.DirectFileOutputCommitter.isDirectWrite(DirectFileOutputCommitter.java:128)
at org.apache.hadoop.mapred.DirectFileOutputCommitter.isDirectWrite(DirectFileOutputCommitter.java:116)
at org.apache.hadoop.mapred.DirectFileOutputCommitter.abortTask(DirectFileOutputCommitter.java:80)
at org.apache.hadoop.mapred.OutputCommitter.abortTask(OutputCommitter.java:233)
at org.apache.hadoop.mapred.Task.taskCleanup(Task.java:1037)
at org.apache.hadoop.mapred.Child$5.run(Child.java:287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:284)

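For reference, the fs.s3n.* properties named in the exception are ordinary Hadoop configuration settings (the deprecation warning above suggests core-site.xml is where overrides go). A minimal sketch with placeholder values, just to show what the error is asking for - I haven't set these by hand, since the keys in config.yml below should be getting passed through by the runner:

  <configuration>
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>
  </configuration>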

Finally, my configuration file:
:logging:
  :level: DEBUG # You can optionally switch to INFO for production
:aws:
  :access_key_id: XXXXXXXXXXXXXXXXXX
  :secret_access_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
:s3:
  :region: us-east-1
  :buckets:
    :assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    :log: s3://xxxx-dev2-snowplow-etl/logs
    :raw:
      :in: s3://xxxx-dev2-snowplow-raw
      :processing: s3://xxxx-dev2-snowplow-etl/processing
      :archive: s3://xxxx-dev2-snowplow-etl/archive    # e.g. s3://my-archive-bucket/raw
    :enriched:
      :good: s3://xxxx-dev2-snowplow-enriched/good       # e.g. s3://my-out-bucket/enriched/good
      :bad: s3://xxxx-dev2-snowplow-enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
      :errors:    # Leave blank unless :continue_on_unexpected_error: set to true below
    :shredded:
      :good: s3://xxxx-dev2-snowplow-shredded/good       # e.g. s3://my-out-bucket/shredded/good
      :bad: s3://xxxx-dev2-snowplow-shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
      :errors:     # Leave blank unless :continue_on_unexpected_error: set to true below
:emr:
  :region: us-east-1        # Always set this
  :jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
  :service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
  :placement: us-east-1a     # Set this if not running in VPC. Leave blank otherwise
  :ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
  :ec2_key_name: xxxxx-dev
  :bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
  :software:
    :hbase:                # To launch on cluster, provide version, "0.92.0", keep quotes
    :lingual:              # To launch on cluster, provide version, "1.1", keep quotes
  # Adjust your Hadoop cluster below
  :jobflow:
    :master_instance_type: c4.large
    :core_instance_count: 1
    :core_instance_type: c4.large
    :task_instance_count: 0 # Increase to use spot instances
    :task_instance_type: m1.small
    :task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
:etl:
  :job_name: Snowplow ETL # Give your job a name
  :versions:
    :hadoop_enrich: 0.14.1 # Version of the Hadoop Enrichment process
    :hadoop_shred: 0.4.0 # Version of the Hadoop Shredding process
  :collector_format: thrift # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
  :continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
:iglu:
  :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
  :data:
    :cache_size: 500
    :repositories:
      - :name: "Iglu Central"
        :priority: 0
        :vendor_prefixes:
          - com.snowplowanalytics
        :connection:
          :http:
            :uri: http://iglucentral.com

Thank you all for this great product; I'm sure I'm making a simple mistake!

Alex Dean

Apr 24, 2015, 2:35:27 PM
to snowpl...@googlegroups.com
Hi Michael,

Could you try with:

  :buckets:
    :assets: s3://snowplow-hosted-assets

but all other buckets s3n://

See if that works?

A





Michael Thomas

Apr 24, 2015, 3:05:04 PM
to snowpl...@googlegroups.com
Thanks for the quick response -- that's actually one configuration I tried:

:logging:
  :level: DEBUG # You can optionally switch to INFO for production
:aws:
  :access_key_id: xx
  :secret_access_key: xx
:s3:
  :region: us-east-1
  :buckets:
    :assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    :log: s3n://xx-dev2-snowplow-etl/logs
    :raw:
      :in: s3n://xx-dev2-snowplow-raw
      :processing: s3n://xx-dev2-snowplow-etl/processing
      :archive: s3n://xx-dev2-snowplow-etl/archive    # e.g. s3://my-archive-bucket/raw
    :enriched:
      :good: s3n://xx-dev2-snowplow-enriched/good       # e.g. s3://my-out-bucket/enriched/good
      :bad: s3n://xx-dev2-snowplow-enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
      :errors:    # Leave blank unless :continue_on_unexpected_error: set to true below
    :shredded:
      :good: s3n://xx-dev2-snowplow-shredded/good       # e.g. s3://my-out-bucket/shredded/good
      :bad: s3n://xx-dev2-snowplow-shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
      :errors:     # Leave blank unless :continue_on_unexpected_error: set to true below
:emr:
  :region: us-east-1        # Always set this
  :jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
  :service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
  :placement: us-east-1a     # Set this if not running in VPC. Leave blank otherwise
  :ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
  :ec2_key_name: xx-dev
  :bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
  :software:
    :hbase:                # To launch on cluster, provide version, "0.92.0", keep quotes
    :lingual:              # To launch on cluster, provide version, "1.1", keep quotes
  # Adjust your Hadoop cluster below
  :jobflow:
    :master_instance_type: m1.large
    :core_instance_count: 1
    :core_instance_type: m1.large
    :task_instance_count: 0 # Increase to use spot instances
    :task_instance_type: m1.small
    :task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
:etl:
  :job_name: Snowplow ETL # Give your job a name
  :versions:
    :hadoop_enrich: 0.14.1 # Version of the Hadoop Enrichment process
    :hadoop_shred: 0.4.0 # Version of the Hadoop Shredding process
  :collector_format: thrift # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
  :continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
:iglu:
  :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
  :data:
    :cache_size: 500
    :repositories:
      - :name: "Iglu Central"
        :priority: 0
        :vendor_prefixes:
          - com.snowplowanalytics
        :connection:
          :http:
            :uri: http://iglucentral.com

Alex Dean

Apr 24, 2015, 3:10:03 PM
to snowpl...@googlegroups.com
Hmm, odd.

Can you install the AWS CLI tool, set up a profile with the same AWS permissions as in your config.yml, and just aws s3 ls each of the buckets in your config.yml?
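Something like this, for example (the profile name is just a placeholder; use the bucket names from your config.yml):

  aws configure --profile snowplow-check   # enter the same access key ID and secret key as in config.yml
  aws s3 ls s3://xxxx-dev2-snowplow-etl/ --profile snowplow-check
  aws s3 ls s3://xxxx-dev2-snowplow-raw/ --profile snowplow-check
  aws s3 ls s3://xxxx-dev2-snowplow-enriched/ --profile snowplow-check
  aws s3 ls s3://xxxx-dev2-snowplow-shredded/ --profile snowplow-check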

A

Michael Thomas

Apr 24, 2015, 3:31:11 PM
to snowpl...@googlegroups.com
Everything seems to be good there:

(env)ubuntu@ip-10-0-3-177:~$ aws configure
AWS Access Key ID [None]: XXX (same)
AWS Secret Access Key [None]: xxx (same)
Default region name [None]: us-east-1
Default output format [None]: json
(env)ubuntu@ip-10-0-3-177:~$ aws s3 cp test.txt s3://xxx-dev2-snowplow-etl/processing
upload: ./test.txt to s3://xxx-dev2-snowplow-etl/processing
(env)ubuntu@ip-10-0-3-177:~$ aws s3 cp test.txt s3://xxx-dev2-snowplow-etl/processing/
upload: ./test.txt to s3://xxx-dev2-snowplow-etl/processing/test.txt
(env)ubuntu@ip-10-0-3-177:~$ aws s3 cp test.txt s3://xxx-dev2-snowplow-etl/archive/
upload: ./test.txt to s3://xxx-dev2-snowplow-etl/archive/test.txt
(env)ubuntu@ip-10-0-3-177:~$ aws s3 cp test.txt s3://xxx-dev2-snowplow-enriched/good
upload: ./test.txt to s3://xxx-dev2-snowplow-enriched/good
(env)ubuntu@ip-10-0-3-177:~$ aws s3 cp test.txt s3://xxx-dev2-snowplow-enriched/bad
upload: ./test.txt to s3://xxx-dev2-snowplow-enriched/bad
(env)ubuntu@ip-10-0-3-177:~$ aws s3 cp test.txt s3://xxx-dev2-snowplow-shredded/bad
upload: ./test.txt to s3://xxx-dev2-snowplow-shredded/bad
(env)ubuntu@ip-10-0-3-177:~$ aws s3 cp test.txt s3://xxx-dev2-snowplow-shredded/good
upload: ./test.txt to s3://xxx-dev2-snowplow-shredded/good
(env)ubuntu@ip-10-0-3-177:~$

Alex Dean

Apr 24, 2015, 3:39:16 PM
to snowpl...@googlegroups.com
Can you go into:

IAM > Roles > EMR_EC2_DefaultRole

and show us the policy document for EMR_EC2_DefaultRole?

And do the same with EMR_DefaultRole?
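If it's easier, the same policy documents can also be pulled with the AWS CLI (role names assumed to be the defaults created by aws emr create-default-roles; the policies may be inline or attached depending on how the roles were created):

  aws iam list-role-policies --role-name EMR_EC2_DefaultRole
  aws iam get-role-policy --role-name EMR_EC2_DefaultRole --policy-name <name-from-previous-command>
  aws iam list-attached-role-policies --role-name EMR_EC2_DefaultRole
  # repeat for EMR_DefaultRole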

A

Michael Thomas

Apr 24, 2015, 3:50:19 PM
to snowpl...@googlegroups.com
EMR_DefaultRole
{
  "Version": "2012-10-17", 
  "Statement": [
    {
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress", 
        "ec2:CancelSpotInstanceRequests", 
        "ec2:CreateSecurityGroup", 
        "ec2:CreateTags", 
        "ec2:Describe*", 
        "ec2:DeleteTags", 
        "ec2:ModifyImageAttribute", 
        "ec2:ModifyInstanceAttribute", 
        "ec2:RequestSpotInstances", 
        "ec2:RunInstances", 
        "ec2:TerminateInstances", 
        "iam:PassRole", 
        "iam:ListRolePolicies", 
        "iam:GetRole", 
        "iam:GetRolePolicy", 
        "iam:ListInstanceProfiles", 
        "s3:Get*", 
        "s3:List*", 
        "s3:CreateBucket", 
        "sdb:BatchPutAttributes", 
        "sdb:Select"
      ], 
      "Resource": "*", 
      "Effect": "Allow"
    }
  ]
}
EMR_EC2_DefaultRole
{
  "Statement": [
    {
      "Action": [
        "cloudwatch:*", 
        "dynamodb:*", 
        "ec2:Describe*", 
        "elasticmapreduce:Describe*", 
        "rds:Describe*", 
        "s3:*", 
        "sdb:*", 
        "sns:*", 
        "sqs:*"
      ], 
      "Resource": [
        "*"
      ], 
      "Effect": "Allow"
    }
  ]
}



Alex Dean

Apr 25, 2015, 7:50:59 AM
to snowpl...@googlegroups.com
Hi Michael,

Those look exactly the same as the ones we are using. I have to say I'm a bit stumped - it's not s3:// vs s3n://, it's not S3 permissions, it's not your EMR roles, and you're not using a VPC.

Has anybody in the community seen this before?

Cheers,

Alex

Anthony Bui

Apr 28, 2015, 5:03:53 PM
to snowpl...@googlegroups.com
:+1: 

Also seeing this issue when running EMR after setting up snowplow-stream-collector-0.3.0 and snowplow-lzo-s3-sink-0.1.0. (I tested snowplow-kinesis-enrich-0.3.0 as well and can confirm that sink is working fine.) It fails on the "enrich" step.

```
ubuntu@ip-172-31-44-131:/usr/apps/snowplow$ ./snowplow-runner-and-loader.sh
...
D, [2015-04-28T18:00:47.567877 #24963] DEBUG -- : Waiting a minute to allow S3 to settle (eventual consistency)
D, [2015-04-28T18:01:47.568897 #24963] DEBUG -- : Initializing EMR jobflow
D, [2015-04-28T18:01:49.098633 #24963] DEBUG -- : EMR jobflow j-377DU1VTS1Z3Z started, waiting for jobflow to complete...
F, [2015-04-28T18:15:50.410639 #24963] FATAL -- :

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-377DU1VTS1Z3Z failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: SHUTTING_DOWN [Shut down as step failed] ~ elapsed time n/a [2015-04-28 18:09:13 UTC - ]
 - 1. Elasticity Setup Hadoop Debugging: COMPLETED ~ 00:00:35 [2015-04-28 18:10:09 UTC - 2015-04-28 18:10:44 UTC]
 - 2. Elasticity Scalding Step: Enrich Raw Events: FAILED ~ 00:07:53 [2015-04-28 18:10:44 UTC - 2015-04-28 18:18:38 UTC]
 - 3. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 4. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
 - 5. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):
    /usr/apps/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:299:in `run'
    /usr/apps/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /usr/apps/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /usr/apps/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `block in common_method_added'
    /usr/apps/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:60:in `run'
    /usr/apps/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /usr/apps/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /usr/apps/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `block in common_method_added'
    /usr/apps/snowplow/3-enrich/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in `<main>'
```

```
2015-04-28 18:15:08,000 WARN org.apache.hadoop.conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
2015-04-28 18:15:15,601 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2015-04-28 18:15:22,164 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2015-04-28 18:15:24,596 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2015-04-28 18:15:25,304 INFO org.apache.hadoop.mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@54b83d10
2015-04-28 18:15:37,660 INFO cascading.tap.hadoop.io.MultiInputSplit: current split input path: s3://brainfall-sp/lzo-raw/processing/2015-04-28-49550064561695896617344931859091525558367182535746650114-49550064561695896617344931859098779113284871204148084738.lzo
2015-04-28 18:15:37,976 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2015-04-28 18:15:37,980 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 049362b7cf53ff5f739d6b1532457f2c6cd495e8]
2015-04-28 18:15:38,023 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available
2015-04-28 18:15:38,023 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library loaded
2015-04-28 18:15:38,189 WARN cascading.tap.hadoop.io.MultiInputFormat: unable to get record reader, but not retrying
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
	at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
	at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at org.apache.hadoop.fs.s3native.$Proxy7.initialize(Unknown Source)
	at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
	at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.determineFileFormat(MultiInputFormat.java:179)
	at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.createRecordReader(MultiInputFormat.java:88)
	at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.<init>(DeprecatedInputFormatWrapper.java:251)
	at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getRecordReader(DeprecatedInputFormatWrapper.java:118)
	at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
	at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
	at cascading.util.Util.retry(Util.java:762)
	at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:197)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
2015-04-28 18:15:38,410 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2015-04-28 18:15:38,801 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2015-04-28 18:15:38,802 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hadoop for UID 105 from the native implementation
2015-04-28 18:15:38,808 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
	at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
	at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at org.apache.hadoop.fs.s3native.$Proxy7.initialize(Unknown Source)
	at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
	at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.determineFileFormat(MultiInputFormat.java:179)
	at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.createRecordReader(MultiInputFormat.java:88)
	at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.<init>(DeprecatedInputFormatWrapper.java:251)
	at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getRecordReader(DeprecatedInputFormatWrapper.java:118)
	at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
	at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
	at cascading.util.Util.retry(Util.java:762)
	at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:197)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
2015-04-28 18:15:38,841 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
```

Still debugging myself atm, but let me know if I can provide anything else.
ab

Alex Dean

Apr 28, 2015, 6:40:52 PM
to snowpl...@googlegroups.com
Michael, Anthony - I just had a thought. How is your MaxMind geoip enrichment configured?

A

Anthony Bui

Apr 29, 2015, 1:09:16 PM
to snowpl...@googlegroups.com
Hey Alex - I attempted running EMR without enrichments but am still getting the same failure.

```
# snowplow-runner-and-loader.sh

RUNNER_PATH=/usr/apps/snowplow/3-enrich/emr-etl-runner
LZO_RUNNER_CONFIG=/usr/apps/snowplow/config.lzo.yml

export BUNDLE_GEMFILE=${RUNNER_PATH}/Gemfile
bundle exec ${RUNNER_PATH}/bin/snowplow-emr-etl-runner --config ${LZO_RUNNER_CONFIG} --debug --skip staging
```

Anyway, here's my ip_lookups enrichment config file in case it helps:

```
# config/enrichments/ip_lookups.json
{
        "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/1-0-0",

        "data": {

                "name": "ip_lookups",
                "vendor": "com.snowplowanalytics.snowplow",
                "enabled": true,
                "parameters": {
                        "geo": {
                                "database": "GeoLiteCity.dat",
                                "uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind"
                        }
                }
        }
}
```

Cheers
ab

...

Alex Dean

Apr 29, 2015, 1:38:41 PM
to snowpl...@googlegroups.com
Hi Anthony,

I might be mistaken, but have you been running Snowplow successfully before? Have you got an existing setup which is running successfully alongside this one?

A


Anthony Bui

Apr 29, 2015, 3:07:12 PM
to snowpl...@googlegroups.com
Hi Alex -

Yes, we currently have a batch-based CloudFront pipeline processing daily into Redshift. That is still running successfully, so no issues there. =P

And now we are trying to test out the real-time pipeline with hopes of replacing our batch-based process. We've currently got the elasticsearch-sink working and viewable via Kibana. Looks great! The lzo-s3-sink also appears to be working (LZO raw event logs get dumped into S3). We're just stuck on this last hurdle, trying to get raw events from the lzo-s3-sink processed and loaded into Redshift.

(As far as I know, there isn't a way to load Kinesis enriched events into Redshift at the moment.)

Appreciate the help!
ab

Alex Dean

Apr 30, 2015, 1:27:21 PM
to snowpl...@googlegroups.com
Hi Anthony,

You are right - there is no way of loading the Kinesis enriched events into Redshift at the moment, so yes, the "lambda" approach of having a Kinesis flow + EMR flow is the way to go for now.

Some good-ish news: we have managed to reproduce the issue with lzo-s3-sink + EMR: https://github.com/snowplow/snowplow/issues/1647

Anthony, Michael, please +1 that issue to get GitHub updates as we work through it...

A