EmrEtl fails while copying data at Raw S3 -> HDFS step.

Oguzhan Yayla

Aug 6, 2015, 8:56:18 AM

to Snowplow

Hi,

EmrEtlRunner fails at the Raw S3 -> HDFS step with an error message: Data files not archived.

[Elasticity S3DistCp Step: Raw S3 -> HDFS: FAILED ~ 00:05:56 [2015-08-06 12:33:36 +0000 - 2015-08-06 12:39:32 +0000]

Couldn't really figure out what's going wrong, so any help is appreciated. Thanks a lot in advance.

Here is the traceback from the EMR logs:

Error: java.lang.RuntimeException: Reducer task failed to copy 663 files: s3://lolo-snowplow-archive/processing/E1HT7595MTEQ1H.2015-06-11-07.f1ef91f2.gz etc
    at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.cleanup(CopyFilesReducer.java:75)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)

Error: java.lang.RuntimeException: Reducer task failed to copy 829 files: s3://lolo-snowplow-archive/processing/E1HT7595MTEQ1H.2015-06-13-22.7b288dae.gz etc
    at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.cleanup(CopyFilesReducer.java:75)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)

Error: java.lang.RuntimeException: Reducer task failed to copy 528 files: s3://lolo-snowplow-archive/processing/E1HT7595MTEQ1H.2015-06-12-09.d5438eff.gz etc
    at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.cleanup(CopyFilesReducer.java:75)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)

Cheers

Alex Dean

Aug 6, 2015, 9:48:42 AM

to Snowplow

Hey Oguzhan,

Have you made any recent changes to your config? Could you share an anonymized version of it?

Thanks,

Alex

Oguzhan Yayla

Aug 6, 2015, 10:06:40 AM

to Snowplow

Hi Alex,

I changed the instance type from m1.small to m1.medium and removed hbase and lingual because we do not have custom steps in the jobflow. Here is the config YAML:

:s3:
  :region: eu-west-1
  :buckets:
    :assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    :log: s3://bucket-archive/emr-logs
    :raw:
      :in: s3://bucket-cf-logs
      :processing: s3://bucket-archive/processing
      :archive: s3://bucket-archive/raw
    :enriched:
      :good: s3://bucket-out/enriched/good
      :bad: s3://bucket-out/enriched/bad
      :errors: s3://bucket-out/enriched/error
    :shredded:
      :good: s3://bucket-out/shredded/good
      :bad: s3://bucket-out/shredded/bad
      :errors: s3://bucket-out/shredded/error
:emr:
  :ami_version: 3.6.0 # Don't change this
  :region: eu-west-1
  :jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
  :service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
  :placement:
  :ec2_subnet_id: subnet-8XXXX # Set this if running in VPC. Leave blank otherwise
  :ec2_key_name: my-key-name
  :bootstrap: [] # Set this to specify custom boostrap actions. Leave empty otherwise
  :software:
    :hbase: # set if you need them.
    :lingual: # set if you need them.
  # Adjust your Hadoop cluster below
  :jobflow:
    :master_instance_type: m1.medium
    :core_instance_count: 2
    :core_instance_type: m1.medium
    :task_instance_count: 0 # Increase to use spot instances
    :task_instance_type: m1.medium
    :task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
:etl:
  :job_name: Wahanda SP ETL # Give your job a name
  :versions:
    :hadoop_enrich: 1.0.0 # Version of the Hadoop Enrichment process
    :hadoop_shred: 0.4.0 # Version of the Hadoop Shredding process
  :collector_format: cloudfront # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
  :continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
:iglu:
  :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
  :data:
    :cache_size: 500
    :repositories:
      - :name: "Iglu Central"
        :priority: 0
        :vendor_prefixes:
          - com.snowplowanalytics
        :connection:
          :http:
            :uri: http://iglucentral.com

Thanks !

Alex Dean

Aug 6, 2015, 11:58:37 AM

to Snowplow

Can you paste your last working version?

Oguzhan Yayla

Aug 7, 2015, 4:30:59 AM

to Snowplow

Hi again,

I'm using the latest version of Snowplow and snowplow-emr-etl-runner 0.16.0

Cheers

Alex Dean

Aug 7, 2015, 4:49:29 AM

to Snowplow

Hey Oguzhan,

We have a very similar job to yours that is working fine (CloudFront, AMI 3.6.0 etc). The only difference I can see is that yours is in a VPC, whereas ours is not.

At the moment I'm leaning towards thinking this is some kind of VPC- and/or IAM-related issue. This is the only S3DistCp step in the pipeline which reads from S3, so potentially it is something around the interplay of AMI 3.x.x, S3DistCp, S3, VPCs and IAM.

So I'd take another look at your VPC setup and IAM permissions.

A

Oguzhan Yayla

Aug 7, 2015, 7:28:31 AM

to Snowplow

Hi Alex,

To test this, I used aws-cli on the EC2 instance where Snowplow is installed. I used the same key and secret (from config.yml) for aws-cli and moved some files to an S3 bucket, and it worked, so I assume there is no IAM or VPC related problem. What else can I check? A different AMI version maybe?

Thank you so much.

Alex Dean

Aug 7, 2015, 7:40:25 AM

to Snowplow

Hi Oguzhan,

That was a good test to do. I think the next test is to spin up an EMR cluster (set it to not terminate on failure), then SSH into the master node and attempt the same exercise with the AWS CLI.

This is a test of whether you have the S3 access you need from the VPC which is running the job (vs from the EC2 box which is orchestrating the job).

A

Oguzhan Yayla

Aug 7, 2015, 10:27:29 AM

to Snowplow

Hi Alex,

I set it not to terminate on failure, but it still does.

I tried setting :continue_on_unexpected_error: to true, but it still didn't move on to the next step. Then I changed @action_on_failure to continue here: https://github.com/snowplow/snowplow/blob/627ab0b67eb30d3082bd03242e1a542cba75c1a0/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/scalding_step.rb#L27 but it's still failing. So I can't SSH into the master node to test this. Is there anything else I can do to move on to the next steps in case of failure?

Thanks a lot

Cheers

Alex Dean

Aug 7, 2015, 12:35:46 PM

to Snowplow

I think you want:

jobflow.action_on_failure                 = 'CONTINUE'
jobflow.keep_job_flow_alive_when_no_steps = true

A

Oguzhan Yayla

Aug 10, 2015, 8:40:51 AM

to Snowplow

Hi Alex,

Sorry for asking, but could you please tell me where the 'jobflow.keep_job_flow_alive_when_no_steps' setting is?

Thanks a lot

Alex Dean

Aug 10, 2015, 8:45:05 AM

to Snowplow

You'll have to edit the EmrEtlRunner to add that setting.

A

Oguzhan Yayla

Aug 11, 2015, 5:33:49 AM

to Snowplow

Hi Alex,

I added these two settings to snowplow-emr-etl-runner/emr_job.rb, as shown in https://github.com/rslifka/elasticity but it still terminates, I'm afraid. Do I need to change this line too? https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb#L373

Thanks a lot !

Alex Dean

Aug 11, 2015, 5:41:28 AM

to Snowplow

When you say "it still terminates", do you mean the cluster or EmrEtlRunner? It doesn't matter if EmrEtlRunner terminates; as long as the cluster is still running (check the UI), you can SSH in.

If you can't get a cluster to survive long enough for testing through EmrEtlRunner, then just use the $ aws emr command with command-line arguments that are as similar as possible.

A
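
For anyone following along, here is a minimal sketch of what "as similar as possible" could look like: a small Python helper that assembles an `aws emr create-cluster` call from the values in the config posted earlier in this thread. The helper name, the cluster name `s3distcp-debug` and the dict keys are made up for illustration. The flags are standard AWS CLI options, but check them against your installed CLI version. `--no-auto-terminate` is the part that keeps the cluster up so you can SSH in.

```python
import shlex


def build_create_cluster_command(cfg):
    """Return an `aws emr create-cluster` argument list mirroring an EmrEtlRunner-style config.

    `cfg` is a plain dict with hypothetical key names chosen for this sketch.
    """
    instance_groups = [
        "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=%s" % cfg["master_instance_type"],
        "InstanceGroupType=CORE,InstanceCount=%d,InstanceType=%s"
        % (cfg["core_instance_count"], cfg["core_instance_type"]),
    ]
    ec2_attributes = ",".join([
        "KeyName=%s" % cfg["ec2_key_name"],
        "SubnetId=%s" % cfg["ec2_subnet_id"],
        "InstanceProfile=%s" % cfg["jobflow_role"],
    ])
    return [
        "aws", "emr", "create-cluster",
        "--name", cfg["name"],
        "--region", cfg["region"],
        "--ami-version", cfg["ami_version"],
        "--service-role", cfg["service_role"],
        "--ec2-attributes", ec2_attributes,
        "--log-uri", cfg["log_uri"],
        "--instance-groups", *instance_groups,
        # Keep the cluster alive after launch so it can be inspected over SSH.
        "--no-auto-terminate",
    ]


# Values copied from the anonymized config in this thread; the name is an assumption.
debug_config = {
    "name": "s3distcp-debug",
    "region": "eu-west-1",
    "ami_version": "3.6.0",
    "service_role": "EMR_DefaultRole",
    "jobflow_role": "EMR_EC2_DefaultRole",
    "ec2_key_name": "my-key-name",
    "ec2_subnet_id": "subnet-8XXXX",
    "log_uri": "s3://bucket-archive/emr-logs",
    "master_instance_type": "m1.medium",
    "core_instance_count": 2,
    "core_instance_type": "m1.medium",
}

command = build_create_cluster_command(debug_config)
# Print a copy-pasteable shell command rather than running anything.
print(" ".join(shlex.quote(arg) for arg in command))
```

This only prints the command. Paste it into a shell (or pass the list to `subprocess.run`) once the values match your own setup, and remember to terminate the cluster by hand when you are done.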

Oguzhan Yayla

Aug 11, 2015, 9:49:44 AM

to Snowplow

Hi,

First of all, thanks a lot for all the quick replies. We're blocked by this problem and trying to solve it ASAP.

So I launched a cluster with the same VPC, subnet, keys, AMI version, instance profile and EMR role, then SSH'ed into a node and copied/moved/removed a file in the same S3 bucket, and it all worked. What would be the next step?

Best

Alex Dean

Aug 11, 2015, 9:55:31 AM

to Snowplow

You can try running with --skip s3distcp - that should unblock you though the job will be much slower.

The last thing to check is that you can perform the same S3 operations from a task instance (i.e. slave) as well as from the master. If you can, then next step is to file a bug with AWS about S3DistCp on your particular setup. Be as specific as you can.

A

Oguzhan Yayla

Aug 13, 2015, 6:28:41 AM

to Snowplow

Hi again,

I ran with --skip s3distcp. As you mentioned, it was much slower, and it only had two steps: Elasticity Scalding Step: Enrich Raw Events and Elasticity Scalding Step: Shred Enriched Events.

It failed after running Elasticity Scalding Step: Enrich Raw Events for around 14h, and there are not many logs. Here is the only error log I could find:

...

2015-08-12 12:53:23,836 INFO [main] com.amazonaws.latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 9F56F42991799322), S3 Extended Request ID: 8B93RR4HbvioAL9wxLUdGZFIDlw9xUNaXhcxkV9D5h4+wGUKEyZOPgOo9fwaz63I], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[9F56F42991799322], ServiceEndpoint=[https://wh-snowplow-out.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.624], HttpRequestTime=[15.356], HttpClientReceiveResponseTime=[12.032], RequestSigningTime=[0.676]

The wh-snowplow-out bucket does exist, btw. I have enclosed this log file and the syslog. On the other hand, there is enriched data in s3://x-snowplow-out/enriched/good/run=2015-08-13.. so I'm really confused and can't identify the problem.

Any help is appreciated, thank you so much in advance !

syslog-emr.webarchive

syslog2.webarchive

Alex Dean

Aug 13, 2015, 6:45:02 AM

to snowpl...@googlegroups.com

It sounds like Hadoop Enrich is really not enjoying writing your event volumes direct to S3.

I recommend raising the S3DistCp issue you're seeing with AWS support.

Cheers,

Alex

Oguzhan Yayla

Aug 19, 2015, 10:58:00 AM

to Snowplow

Hi Alex,

Solved the problem with S3DistCp. The amount of data (number of files) was too much for S3DistCp/S3. Processing the data in smaller chunks solved it.

Thanks a lot.

Cheers

Oguzhan
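
The thread does not say how the data was split, so the following is only one possible approach, sketched in Python: group the CloudFront log files by the day embedded in their names (the `<distribution>.<YYYY-MM-DD-HH>.<hash>.gz` pattern seen in the traceback above) and pack whole days into batches with a cap on the file count. The function name `batches_by_day` and the `max_files` cap are assumptions for illustration, not EmrEtlRunner settings.

```python
import re
from collections import defaultdict

# CloudFront access log names as they appear in the traceback in this thread:
#   <distribution-id>.<YYYY-MM-DD-HH>.<hash>.gz
LOG_NAME = re.compile(
    r"^(?P<dist>[A-Z0-9]+)\.(?P<day>\d{4}-\d{2}-\d{2})-(?P<hour>\d{2})\.(?P<hash>[0-9a-f]+)\.gz$"
)


def batches_by_day(keys, max_files):
    """Split log keys into chronological batches of at most `max_files` files.

    Whole days are kept together where possible; a single day that is larger
    than `max_files` is cut into consecutive slices.
    """
    if max_files < 1:
        raise ValueError("max_files must be at least 1")

    by_day = defaultdict(list)
    for key in keys:
        name = key.rsplit("/", 1)[-1]  # drop any bucket/prefix part
        match = LOG_NAME.match(name)
        if match is None:
            raise ValueError("unexpected log file name: %s" % key)
        by_day[match.group("day")].append(key)

    batches, current = [], []
    for day in sorted(by_day):
        day_keys = sorted(by_day[day])
        # Close the current batch if this day would push it over the cap.
        if current and len(current) + len(day_keys) > max_files:
            batches.append(current)
            current = []
        if len(day_keys) > max_files:
            # One day alone exceeds the cap: split it into slices.
            for start in range(0, len(day_keys), max_files):
                batches.append(day_keys[start:start + max_files])
        else:
            current.extend(day_keys)
    if current:
        batches.append(current)
    return batches


# File names taken from the traceback earlier in the thread.
example_keys = [
    "E1HT7595MTEQ1H.2015-06-11-07.f1ef91f2.gz",
    "E1HT7595MTEQ1H.2015-06-13-22.7b288dae.gz",
    "E1HT7595MTEQ1H.2015-06-12-09.d5438eff.gz",
]

for number, batch in enumerate(batches_by_day(example_keys, max_files=2), start=1):
    print("batch %d: %d file(s)" % (number, len(batch)))
```

Each batch could then be moved into the :in: bucket and processed with its own EmrEtlRunner run, with the remaining files kept somewhere else until their turn. A sensible value for `max_files` depends on your cluster and has to be found by experiment.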

Alex Dean

Aug 19, 2015, 10:59:00 AM

to Snowplow

Thanks for sharing the solution!

A

Gabor Ratky

Dec 11, 2015, 4:31:32 AM

to Snowplow

Hi,

Resurrecting the thread to add some more information. We started bumping into similar errors and failed EmrEtlRunner runs after we changed our configuration from 5 m1.medium instances (2 core + 3 spot task) to a single m3.xlarge core instance. It seems that with fewer reducer slots (because there are fewer instances), s3distcp can fail more frequently.

Unfortunately we've also seen staging (sluice: logs -> processing) failures, so some aspect of S3 might be the root cause. We're on an old version of EmrEtlRunner and use s3n:// URIs, which could also be the root cause.

We'll upgrade soon, but until then we'll play around with the configuration to see if we can avoid the issue.

Any additional suggestions or input are welcome :)

Best,

Gabor

Alex Dean

Dec 11, 2015, 7:40:00 AM

to Snowplow

Thanks Gabor,

Let us know how you get on after upgrade. Generally we find S3DistCp and Boto to be more reliable and faster than Sluice; we have plans to replace Sluice with S3DistCp in places and manifests in others.

A

Haiyu Zhen

Jan 18, 2016, 3:29:09 PM

to Snowplow

Hi Oguzhan,

I am new to EMR and Snowplow, and I have the same issue now.

I am wondering how to "process the data in smaller chunks". Do I change settings on the AWS side or in the EmrEtlRunner config file?