I'm attempting to use the EMR Runner to enrich and shred records generated by the scala-kinesis-collector and persisted to S3 with the scala-s3-sink.
Notes:
- This is a repeated run, so I am using --skip staging; the error is identical on every run.
- I have tried this with both s3:// and s3n:// URIs and get the same error either way.
- The AWS creds that I'm passing in have full admin permissions (a quick sanity check of them is sketched just after this list).
- The EMR roles are configured correctly.
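For reference, a quick sanity check I could run against those keys would be something like the following (key values and bucket name are placeholders, redacted the same way as in the config further down); it would at least confirm the keys can reach the bucket outside of EMR:

# Hypothetical sanity check with the AWS CLI: confirm the key pair from the
# EmrEtlRunner config can list the raw bucket directly (values are placeholders).
AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXX \
AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
aws s3 ls s3://xxxx-dev2-snowplow-raw/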
When I run emr-etl-runner I get the following error:
$ bundle exec bin/snowplow-emr-etl-runner --config /home/ubuntu/snowplow-emr-runner-config.yml --skip staging
D, [2015-04-24T17:49:33.500168 #14300] DEBUG -- : Initializing EMR jobflow
D, [2015-04-24T17:49:34.661986 #14300] DEBUG -- : EMR jobflow j-YRU3U2Z7Z5AH started, waiting for jobflow to complete...
F, [2015-04-24T17:57:35.534551 #14300] FATAL -- :
Snowplow ETL: SHUTTING_DOWN [Shut down as step failed] ~ elapsed time n/a [2015-04-24 17:52:55 UTC - ]
- 1. Elasticity Scalding Step: Enrich Raw Events: FAILED ~ 00:02:40 [2015-04-24 17:53:50 UTC - 2015-04-24 17:56:30 UTC]
- 2. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
- 3. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
- 4. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:299:in `run'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `block in common_method_added'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:60:in `run'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
/home/ubuntu/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `block in common_method_added'
bin/snowplow-emr-etl-runner:39:in `<main>'
When I look into log_bucket/j-YRU3U2Z7Z5AH/task-attempts/job_xxx_0002/attempt.../syslog, I see the following:
2015-04-24 17:55:06,102 WARN org.apache.hadoop.conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
2015-04-24 17:55:07,009 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2015-04-24 17:55:07,720 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2015-04-24 17:55:07,915 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2015-04-24 17:55:07,989 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2015-04-24 17:55:07,989 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hadoop for UID 105 from the native implementation
2015-04-24 17:55:07,993 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy3.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapred.FileOutputCommitter.getTempTaskOutputPath(FileOutputCommitter.java:234)
at org.apache.hadoop.mapred.Task.initialize(Task.java:522)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
2015-04-24 17:55:08,001 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
2015-04-24 17:55:08,002 INFO org.apache.hadoop.mapred.Child: Error cleaning up
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy3.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapred.DirectFileOutputCommitter.isDirectWrite(DirectFileOutputCommitter.java:128)
at org.apache.hadoop.mapred.DirectFileOutputCommitter.isDirectWrite(DirectFileOutputCommitter.java:116)
at org.apache.hadoop.mapred.DirectFileOutputCommitter.abortTask(DirectFileOutputCommitter.java:80)
at org.apache.hadoop.mapred.OutputCommitter.abortTask(OutputCommitter.java:233)
at org.apache.hadoop.mapred.Task.taskCleanup(Task.java:1037)
at org.apache.hadoop.mapred.Child$5.run(Child.java:287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:284)
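My reading of that exception is that the map task cannot see any S3 credentials in its Hadoop configuration, even though they are set in the runner config. As a purely hypothetical check (the property names are taken straight from the error message; key values and bucket are placeholders), something like this from the EMR master node should show whether s3n access works once the keys are passed explicitly:

# Hypothetical check from the EMR master node: pass the exact properties the
# error names and see whether a plain s3n listing succeeds (values are placeholders).
hadoop fs \
  -Dfs.s3n.awsAccessKeyId=XXXXXXXXXXXXXXXXXX \
  -Dfs.s3n.awsSecretAccessKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  -ls s3n://xxxx-dev2-snowplow-raw/

The exception also mentions embedding the key pair directly in the s3n URL (s3n://KEY:SECRET@bucket/path), but I'd rather not put credentials into bucket paths, so I haven't tried that.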
Finally, my configuration file:
:logging:
  :level: DEBUG # You can optionally switch to INFO for production
:aws:
  :access_key_id: XXXXXXXXXXXXXXXXXX
  :secret_access_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
:s3:
  :region: us-east-1
  :buckets:
    :assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    :log: s3://xxxx-dev2-snowplow-etl/logs
    :raw:
      :in: s3://xxxx-dev2-snowplow-raw
      :processing: s3://xxxx-dev2-snowplow-etl/processing
      :archive: s3://xxxx-dev2-snowplow-etl/archive # e.g. s3://my-archive-bucket/raw
    :enriched:
      :good: s3://xxxx-dev2-snowplow-enriched/good # e.g. s3://my-out-bucket/enriched/good
      :bad: s3://xxxx-dev2-snowplow-enriched/bad # e.g. s3://my-out-bucket/enriched/bad
      :errors: # Leave blank unless :continue_on_unexpected_error: set to true below
    :shredded:
      :good: s3://xxxx-dev2-snowplow-shredded/good # e.g. s3://my-out-bucket/shredded/good
      :bad: s3://xxxx-dev2-snowplow-shredded/bad # e.g. s3://my-out-bucket/shredded/bad
      :errors: # Leave blank unless :continue_on_unexpected_error: set to true below
:emr:
  :region: us-east-1 # Always set this
  :jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
  :service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
  :placement: us-east-1a # Set this if not running in VPC. Leave blank otherwise
  :ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
  :ec2_key_name: xxxxx-dev
  :bootstrap: [] # Set this to specify custom boostrap actions. Leave empty otherwise
  :software:
    :hbase: # To launch on cluster, provide version, "0.92.0", keep quotes
    :lingual: # To launch on cluster, provide version, "1.1", keep quotes
  # Adjust your Hadoop cluster below
  :jobflow:
    :master_instance_type: c4.large
    :core_instance_count: 1
    :core_instance_type: c4.large
    :task_instance_count: 0 # Increase to use spot instances
    :task_instance_type: m1.small
    :task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
:etl:
  :job_name: Snowplow ETL # Give your job a name
  :versions:
    :hadoop_enrich: 0.14.1 # Version of the Hadoop Enrichment process
    :hadoop_shred: 0.4.0 # Version of the Hadoop Shredding process
  :collector_format: thrift # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
  :continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
:iglu:
  :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
  :data:
    :cache_size: 500
    :repositories:
      - :name: "Iglu Central"
        :priority: 0
        :vendor_prefixes:
          - com.snowplowanalytics
        :connection:
          :http:
Thank you all for this great product; I'm sure I'm making a simple mistake!