EMR fails due to an internal error.


Oguzhan Yayla

Jan 9, 2015, 12:04:08 PM
to snowpl...@googlegroups.com
Hi,

Just updated to the latest version. I'm not sure if it's related to the update, but EmrEtlRunner fails with an EmrExecutionError.
In fact, the process doesn't even start.

EMR reports "Starting Provisioning Amazon EC2 capacity", then cancels all the tasks in the queue and terminates with the error:
"Failed to start the job flow due to an internal error".

Unfortunately it doesn't log anything (the S3 bucket is empty), so I can't give more details about the problem. Any idea why it throws an internal error,
or what I can do to dig into the problem further?

Thanks
--
Oguzhan 


Alex Dean

Jan 9, 2015, 12:10:25 PM
to snowpl...@googlegroups.com
Hi Oguzhan,

Sometimes AWS does throw an internal error - since it occurred before the job steps kicked off, it's not going to be related to your update.

If you find it happening regularly, it's worth checking with your AWS Support contact.
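When the internal error really is transient, the simplest workaround is to retry the launch with backoff before escalating to AWS Support. The sketch below is purely illustrative: `start_jobflow` is a hypothetical callable standing in for whatever kicks off the jobflow (e.g. invoking EmrEtlRunner or an SDK call); nothing here is part of the Snowplow codebase.

```python
import random
import time


def launch_with_retry(start_jobflow, max_attempts=3, base_delay=60):
    """Retry a jobflow launch on transient failures.

    `start_jobflow` is any callable that raises on failure and
    returns a jobflow id on success (hypothetical stand-in).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return start_jobflow()
        except RuntimeError as err:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Exponential backoff with multiplicative jitter
            delay = base_delay * 2 ** (attempt - 1) * (1 + random.random())
            print(f"Attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            time.sleep(delay)
```

If every launch fails the same way, retrying won't help and the cause is environmental (as it turned out to be in this thread), so treat repeated identical failures as a signal to stop retrying and investigate.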

Cheers,

Alex




--
Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean

Oguzhan Yayla

Jan 27, 2015, 7:28:13 AM
to snowpl...@googlegroups.com
Hi Alex,

We still couldn't solve this problem.
It gets stuck in the 'Provisioning EC2 Instances' state and fails after ~31 minutes with the error message "Terminated with errors: Failed to start the job flow due to an internal error."
I've checked that the EC2 instances are launched and in a running state without any problem, so it doesn't look like a provisioning issue. I've set a log bucket, but it doesn't log anything there either, and there's nothing meaningful in syslog. I was able to run the jobs successfully before, so I can't really figure out what's wrong. Here is the config file:

Thanks a lot in advance

:logging:
  :level: DEBUG # You can optionally switch to INFO for production
:aws:
  :access_key_id: XXX
  :secret_access_key: XXX
:s3:
  :region: eu-west-1
  :buckets:
    :assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    :log: s3://xxx-snowplow-cf-logs/logs
    :raw:
      :in: s3://xxx-snowplow-cf-logs
      :processing: s3://xxx-snowplow-archive/processing
      :archive: s3://xxx-snowplow-archive/raw
    :enriched:
      :good: s3://xxx-snowplow-out/enriched/good # e.g. s3://my-out-bucket/enriched/good
      :bad: s3://xxx-snowplow-out/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
      :errors: s3://xxx-snowplow-out/enriched/error # Leave blank unless :continue_on_unexpected_error: set to true below
    :shredded:
      :good: s3://xxx-snowplow-out/shredded/good # e.g. s3://my-out-bucket/shredded/good
      :bad: s3://xxx-snowplow-out/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
      :errors: s3://xxx-snowplow-out/shredded/error # Leave blank unless :continue_on_unexpected_error: set to true below
:emr:
  :ami_version: 2.4.2 # Choose as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html
  :region: eu-west-1 # Always set this
  :placement:
  :ec2_subnet_id: subnet-xxxxx # Set this if running in VPC. Leave blank otherwise
  :ec2_key_name: xx-sp-xx
  :software:
    :hbase: "0.92.0" # To launch on cluster, provide version, "0.92.0", keep quotes
    :lingual: "1.1" # To launch on cluster, provide version, "1.1", keep quotes
  # Adjust your Hadoop cluster below
  :jobflow:
    :master_instance_type: m1.small
    :core_instance_count: 2
    :core_instance_type: m1.small
    :task_instance_count: 0 # Increase to use spot instances
    :task_instance_type: m1.small
    :task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
:etl:
  :job_name: Snowplow XXX ETL # Give your job a name
  :versions:
    :hadoop_enrich: 0.11.0 # Version of the Hadoop Enrichment process
    :hadoop_shred: 0.3.0 # Version of the Hadoop Shredding process
  :collector_format: cloudfront # Or 'clj-tomcat' for the Clojure Collector
  :continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
:iglu:
  :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
  :data:
    :cache_size: 500
    :repositories:
      - :name: "Iglu Central"
        :priority: 0
        :vendor_prefixes:
          - com.snowplowanalytics
        :connection:
          :http:
            :uri: http://iglucentral.com

Alex Dean

Jan 27, 2015, 8:26:10 AM
to snowpl...@googlegroups.com
Hi Oguzhan,

Your config looks fine... Can you confirm that when you look inside the EMR dashboard, all jobflow steps for the given job are set to CANCELLED? What percentage of your jobflows is failing like this?

Cheers,

Alex

Oguzhan Yayla

Jan 27, 2015, 8:44:03 AM
to snowpl...@googlegroups.com
Hi,

All of them are failing. I've attached a screenshot of the EMR dashboard; they're all cancelled.
Here is the traceback:


D, [2015-01-27T12:59:16.064267 #13652] DEBUG -- : Initializing EMR jobflow
D, [2015-01-27T12:59:17.828148 #13652] DEBUG -- : EMR jobflow j-HOVKA8NM6B8N started, waiting for jobflow to complete...
F, [2015-01-27T13:31:20.449837 #13652] FATAL -- :

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-HOVKA8NM6B8N failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow Wahanda ETL: FAILED [Failed to start the job flow due to an internal error] ~ elapsed time n/a [ - 2015-01-27 13:30:27 UTC]
 - 1. Start HBase 0.92.0: CANCELLED ~ elapsed time n/a [ - ]
 - 2. Elasticity S3DistCp Step: Raw S3 -> HDFS: CANCELLED ~ elapsed time n/a [ - ]
 - 3. Elasticity Scalding Step: Enrich Raw Events: CANCELLED ~ elapsed time n/a [ - ]
 - 4. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 5. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
 - 6. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):
    /root/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:282:in `run'
    /root/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call'
    /root/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call_with'
    /root/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/decorators.rb:157:in `run'
    /root/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:60:in `run'
    /root/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call'
    /root/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call_with'
    /root/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/decorators.rb:157:in `run'
    bin/snowplow-emr-etl-runner:39:in `<main>'

I've checked the Troubleshooting-jobs-on-Elastic-MapReduce page, but since I don't have any logs, it didn't help much unfortunately.

Thanks in advance
Screen Shot 2015-01-27 at 1.39.57 PM.png
Screen Shot 2015-01-27 at 1.40.19 PM.png

Alex Dean

Jan 27, 2015, 8:51:54 AM
to snowpl...@googlegroups.com
Hmm,

That's not normal. What does your Amazon support representative say?

A

Oguzhan Yayla

Jan 28, 2015, 6:13:39 AM
to snowpl...@googlegroups.com
Hi again,

The problem was with the NATted routing. I fixed it by switching to a public subnet, but I'm not sure about the side effects.
Are there any side effects of using a public subnet that I should be careful about?


Cheers

Alex Dean

Mar 20, 2015, 1:10:41 PM
to snowpl...@googlegroups.com
Hey Oguzhan,

Sorry for the delay in getting back to you on this one! You are right - we need to use a public subnet with Snowplow, because the EMR jobflow reads from and writes to Amazon S3, which requires internet access.

More info here:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-vpc-subnet.html
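A quick way to sanity-check whether a VPC subnet can reach the internet is to look at its route table: a public subnet has a default (0.0.0.0/0) route pointing at an internet gateway (`igw-...`), while a private subnet routes through a NAT instead. The helper below is a hypothetical sketch, not AWS tooling; the dictionary shape mirrors what the EC2 DescribeRouteTables API returns, and the sample tables and IDs are made up for illustration.

```python
def has_internet_route(route_table):
    """Return True if a route table (shaped like an entry from the
    EC2 DescribeRouteTables response) has a default route to an
    internet gateway, i.e. the subnet is public."""
    for route in route_table.get("Routes", []):
        if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and route.get("GatewayId", "").startswith("igw-")):
            return True
    return False


# Sample data in the DescribeRouteTables shape (IDs are made up).
# In a real check you would fetch the table for your subnet first,
# e.g. with the AWS CLI: aws ec2 describe-route-tables \
#   --filters Name=association.subnet-id,Values=subnet-xxxxx
public_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc1234"},
]}
private_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc1234"},
]}

print(has_internet_route(public_table))   # public subnet: default route via IGW
print(has_internet_route(private_table))  # private subnet: default route via NAT
```

In the private-subnet case, EMR's S3 traffic only works if the NAT itself is healthy and correctly routed, which is exactly the failure mode reported earlier in this thread.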

Cheers,

Alex