EMR internal error

Gareth Rogers

Jan 7, 2014, 5:37:19 AM
to snowpl...@googlegroups.com
I'm trying to run the Snowplow EMR enricher and I am persistently getting the response "Terminated with errors. Failed to start the job flow due to an internal error". The jobs create no logs in S3 nor in the EMR console (that I can find) and I've got very little to go on. I think there is a problem provisioning the EC2 instances.

I ran a successful processing job on 16th December, then tried to run another on the 18th (before I broke for a long Christmas break :)) but got this internal error failure. Enthusiastically back from my hols, I've run 6 jobs that all terminated with these unhelpful error messages. So far I've tried changing the :emr: :placement: to the different eu-west-1 availability zones (a, b and c); changing the processing bucket to run on a single log file (as the staging of the logs was successful before Christmas, I'm now running with the --skip staging option); and changing the output directories (ok, bad rows and error rows) to new directories. None of this has helped.

Does anyone have any suggestions on what I can do to get more information about what has gone wrong?

Alex Dean

Jan 7, 2014, 8:22:23 AM
to snowpl...@googlegroups.com
Hi Gareth,

Sorry to hear you are still having issues. If you believe the EC2 instances aren't being provisioned - could it be that your EC2 key was created in the wrong region?

:ec2_key_name: ADD HERE
This EC2 key must be created in the same region as the placement you are running EMR in.

Is that it - let me know!

Alex


--
You received this message because you are subscribed to the Google Groups "Snowplow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to snowplow-use...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Gareth Rogers

Jan 7, 2014, 8:42:07 AM
to snowpl...@googlegroups.com
The EC2 key exists in eu-west-1 where the jobs are running and I've successfully run EMR jobs using this key.

Alex Dean

Jan 7, 2014, 8:42:55 AM
to snowpl...@googlegroups.com
Hmm. Have you ever successfully run the Snowplow ETL process, or not? i.e. is it an intermittent error or a continuous one?

A

Gareth Rogers

Jan 7, 2014, 9:22:30 AM
to snowpl...@googlegroups.com
Yes, I have successfully run jobs in the past. However, the last 7 I've tried to run (one in December, the rest yesterday) have all failed. It's become a continuous problem but I don't know what I've done to break it. As far as I'm aware I hadn't changed anything when it first failed!

Alex Dean

Jan 7, 2014, 9:31:34 AM
to snowpl...@googlegroups.com
Hi Gareth,

It's a puzzle! We have jobs running fine in EC2 eu-west-1 right now. Have you gone through this guide:

https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce

How confident are you it's a provisioning issue?

A

Gareth Rogers

Jan 7, 2014, 9:40:07 AM
to snowpl...@googlegroups.com
I've attached the description of the job retrieved using the EMR CLI (./elastic-mapreduce --describe --jobflow). It has the line:

  "LastStateChangeReason": "Error provisioning instances",

which is the only hint I've got it's a provisioning problem.

I'm not sure where else to look as no logs have been produced, nor what else I can do to get more debugging information.

Thanks
emr-job-description.json

Alex Dean

Jan 7, 2014, 10:25:05 AM
to snowpl...@googlegroups.com
Hi Gareth,

I agree - it looks like a provisioning problem. If you haven't changed the configuration since it stopped working, it may well be something on the Amazon side (e.g. you are already running the maximum number of instances currently allowed on your account). Could you raise a ticket in your Amazon support account and see what they say?

Thanks,

Alex

Gareth Rogers

Jan 13, 2014, 9:55:37 AM
to snowpl...@googlegroups.com
After a couple more experiments, a break and a bit more experimentation I've started to get somewhere.

I found that I could not even run the WordCount tutorial, which I have at least managed to get running again. I did this by creating a new VPC; I'm not sure why the existing (default) one no longer works.

At the moment I've started a new Snowplow EMR job which looks like it is going to fail in the same way. I will try updating Snowplow to see if that helps although I'm waiting for the current job to time out.

Looking at the Snowplow jobs they don't appear to have a Subnet ID.

Does this extra information help to diagnose what might be wrong?

Alex Dean

Jan 13, 2014, 12:14:00 PM
to snowpl...@googlegroups.com
Hi Gareth,

I'm not aware that our EMR orchestration library (Elasticity) supports running EMR on a VPC. Can you try running Snowplow in a non-VPC environment and let us know if that works?

Cheers,

Alex
Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean

Gareth Rogers

Jan 13, 2014, 12:23:35 PM
to snowpl...@googlegroups.com
How do I do that?

Craig Thomas

Mar 20, 2014, 2:02:19 PM
to snowpl...@googlegroups.com
Hi Alex,

My name is Craig and I am taking over getting this working from Gareth.
I have done some investigation and found that the issue is that on the 4th of December Amazon implemented their new VPC system, and EMR now maps to this. There is now a default VPC, with default subnets and routing, that does not support communication between the master and the slaves or, really, with the outside world.
Text from Console log of instance:

get_instance_data.rb: Failed to get extra instance data. Sleeping for 18
2014-03-20T17:06:33+00:00 - Running: wget -U EMR-wget -S -T 10 -t 2 -O /mnt/var/lib/instance-controller/extraInstanceData.json.tmp --no-check-certificate 'https://aws157-instance-data-1-prod-eu-west-1.s3.amazonaws.com/j-1EBDNLVFG1DXH/ig-2HGQY12WJHLIT.json?Expires=1395852861&AWSAccessKeyId=AKIAIX7OBXE4SL5HQTOQ&Signature=KQuQ0TtINKrSI3AM9vqjcfpGTiI%3D'
--2014-03-20 17:06:33--  https://aws157-instance-data-1-prod-eu-west-1.s3.amazonaws.com/j-1EBDNLVFG1DXH/ig-2HGQY12WJHLIT.json?Expires=1395852861&AWSAccessKeyId=AKIAIX7OBXE4SL5HQTOQ&Signature=KQuQ0TtINKrSI3AM9vqjcfpGTiI%3D
Resolving aws157-instance-data-1-prod-eu-west-1.s3.amazonaws.com... 54.239.36.17
Connecting to aws157-instance-data-1-prod-eu-west-1.s3.amazonaws.com|54.239.36.17|:443... failed: Connection timed out.
Retrying.
--2014-03-20 17:06:44--  (try: 2)  https://aws157-instance-data-1-prod-eu-west-1.s3.amazonaws.com/j-1EBDNLVFG1DXH/ig-2HGQY12WJHLIT.json?Expires=1395852861&AWSAccessKeyId=AKIAIX7OBXE4SL5HQTOQ&Signature=KQuQ0TtINKrSI3AM9vqjcfpGTiI%3D
Connecting to aws157-instance-data-1-prod-eu-west-1.s3.amazonaws.com|54.239.36.17|:443... failed: Connection timed out.
Giving up.

2014-03-20T17:06:54+00:00 - Running: traceroute -T -n -m 10 aws157-instance-data-1-prod-eu-west-1.s3.amazonaws.com
traceroute to aws157-instance-data-1-prod-eu-west-1.s3.amazonaws.com (176.32.109.105), 10 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *

2014-03-20T17:07:04+00:00 - Running: traceroute -T -n -m 10 s3.amazonaws.com
traceroute to s3.amazonaws.com (205.251.243.1), 10 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *

However, cloning the jobs and setting them to the non-default VPC causes everything to work fine. Unfortunately there is no way to change the default VPC used; the documentation states that you now have to provide the VPC in the configuration at the time the jobs are created, or the default will be used (Ref: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/default-vpc.html). I will look into the details of how this is done tomorrow, but I am hoping that you have come across this issue by now, as Amazon has upgraded people's accounts.
The other alternative would be to try and change the default VPC network details to allow the instances to communicate with each other.

Craig

Alex Dean

Mar 21, 2014, 4:47:26 AM
to snowpl...@googlegroups.com
Hi Craig,

Thanks for the detailed explanation. This isn't something we've heard of before - I guess you guys are the first ones to have your EMR operations affected in this way.

The only configuration option in our EMR library that looks relevant is:
jobflow.ec2_subnet_id                     = nil
(https://github.com/rslifka/elasticity#2---specifying-options)

I would suggest editing your copy of EmrEtlRunner to have a play around with this setting and see if you can get the job running. If you can, then we can add this as a configuration option to the EmrEtlRunner's config.yaml. If that doesn't work, then I'll try and get you in front of Rob (Elasticity's author) so you can explain the problem to him...

Cheers,

Alex



Craig Thomas

Mar 25, 2014, 11:47:34 AM
to snowpl...@googlegroups.com
Hi Alex,

At the moment I'm struggling to get this to work. I have made changes to that variable but they do not get used by my run-workflow.sh. I suspect that the gem needs to be rebuilt; however, I am not sure how to do this.
Are you able to shed light on the commands needed?

Craig

Alex Dean

Mar 25, 2014, 1:10:55 PM
to snowpl...@googlegroups.com
Hi Craig,


I have made changes to that variable but they do not get used by my run-workflow.sh. I suspect that the gem needs to be rebuilt

Not sure I follow - what is run-workflow.sh and what gem are you referring to? Sorry if my prior email was unclear - I was suggesting manually editing your installed copy of EmrEtlRunner, specifically here:

https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_jobs.rb#L46

and adding in:
jobflow.ec2_subnet_id                     = 'blah'

Does that make sense - did you try that and if so what happened?
A



Simon Rumble

Mar 25, 2014, 8:32:38 PM
to snowpl...@googlegroups.com
Hi Alex. Looks like I'm having the same problem, trying to launch within a VPC. In the AWS console I get:
Terminated with errors. Failed to start the job: No default VPC for this user

I edited 3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_jobs.rb
adding:
        @jobflow.ec2_subnet_id = 'subnet-b42222c0'

in the # Configure block of options.

Then tried to start my EMR job again, same problem.

Alex Dean

Mar 26, 2014, 5:06:39 AM
to snowpl...@googlegroups.com
Hi Simon,

Right - so we know the issue is systemic now. Can I confirm - from your email, it looks like the issue you raised in snowplow/snowplow will not resolve your issue:

https://github.com/snowplow/snowplow/issues/581

Can you confirm?

In the meantime I have raised an issue with Elasticity, the EMR library that we use at Snowplow:

https://github.com/snowplow/snowplow/issues/581

I'd encourage anyone who is having issues to comment on that ticket to help Rob get up to speed quickly.

Thanks,

Alex

Rob S.

Mar 26, 2014, 6:21:35 AM
to snowpl...@googlegroups.com
Hi everyone,

I released an updated version of Elasticity to RubyGems just now to fix this issue.  Turns out I had embedded the subnet ID in the wrong part of the XML jobflow structure.

That being said, VPC setup can be tricky business on Amazon and it's not my area of expertise.  If there are any further issues, please let me know although looking through the API guide and the resulting XML jobflow status from EMR directly, I see no other VPC-related touchpoints.

Hope this helps!

Rob

Craig Thomas

Mar 26, 2014, 6:46:40 AM
to snowpl...@googlegroups.com
Hi Rob,

Thanks for the quick update. A thought crosses my mind: with VPC you have to specify both network and subnet. Do you know if the network will automatically be selected based on the subnet ID, or will you only be able to select subnets from the default network (VPC)?
In the latter case our problem would still exist.

In any which case we are testing as we speak and will have the answer shortly.

Thanks again,

Craig

Craig Thomas

Mar 26, 2014, 7:18:56 AM
to snowpl...@googlegroups.com
Hi Alex,

Is there an easy way for us to update Elasticity to version 3.0.1, which has this fix in it? Would this just be a case of editing the local copy of the Gemfile from
# ErmEtlRunner is a Ruby app (not a RubyGem)
# built with Bundler, so we add in the
# RubyGems it requires here.
gem "elasticity", "~> 2.6"
to

# ErmEtlRunner is a Ruby app (not a RubyGem)
# built with Bundler, so we add in the
# RubyGems it requires here.
gem "elasticity", "~> 3.0.1"
And rerun:
bundle install --deployment

 


--
Craig Thomas
Senior Development Operations Engineer
Metail
- - -
Email: cr...@metail.com
Web: www.metail.com
Skype: abercat
Address: 16 Millers Yard, Mill Lane, Cambridge. CB2 1RQ


Alex Dean

Mar 26, 2014, 7:28:50 AM
to snowpl...@googlegroups.com
Exactly that - can you give that a try and let us know if it fixes the problem?

A

Craig Thomas

Mar 26, 2014, 8:03:04 AM
to snowpl...@googlegroups.com
Hi both,

We managed to get this to work but we needed to make changes to the code.
Subnet and Placement can not be specified at the same time since subnet implies placement.

Current Elasticity code as of 3.0.1 does not support 'nil' variable being passed to the placement field.
So to get this working and to test we did the following.

Promote Elasticity ec2_subnet_id to snowplow config with the following:
@jobflow.ec2_subnet_id = config[:emr][:ec2_subnet_id]

We made changes to this file:
Snowplow Path: snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/elasticity-3.0.1/lib/elasticity/job_flow.rb
Elasticity 3.0.1 Path: elasticity-3.0.1/lib/elasticity/job_flow.rb

def jobflow_preamble
  preamble = {
    :name => @name,
    :ami_version => @ami_version,
    :visible_to_all_users => @visible_to_all_users,
    :instances => {
      :keep_job_flow_alive_when_no_steps => @keep_job_flow_alive_when_no_steps,
      :hadoop_version => @hadoop_version,
      :instance_groups => jobflow_instance_groups,
      # Placement is commented out because it cannot be sent together with a subnet ID.
      # :placement => {
      #   :availability_zone => @placement
      # }
    }
  }

ToDo:
SnowPlow - Please can you promote the variable to Config.
Elasticity - Can you add a check to job_flow.rb to allow a nil value to be passed as placement, and to check that either placement or subnet is defined, but not both.
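
The check requested above could look roughly like the sketch below. It is only an illustration under stated assumptions: instances_config is a hypothetical helper, not a method in Elasticity, and the :ec2_subnet_id key name inside the :instances hash is assumed rather than taken from the library's source.

```ruby
# Hypothetical helper, not part of Elasticity: builds the placement-related
# part of the :instances hash from either an availability zone or a subnet.
def instances_config(placement: nil, ec2_subnet_id: nil)
  # A subnet already implies an availability zone, so both together is an error.
  if placement && ec2_subnet_id
    raise ArgumentError, 'Specify either placement or ec2_subnet_id, not both'
  end
  if placement.nil? && ec2_subnet_id.nil?
    raise ArgumentError, 'Specify one of placement or ec2_subnet_id'
  end

  instances = {}
  if ec2_subnet_id
    # Assumed key name for the subnet; placement is left out entirely.
    instances[:ec2_subnet_id] = ec2_subnet_id
  else
    instances[:placement] = { :availability_zone => placement }
  end
  instances
end
```

With a check like this, a nil placement is fine as long as a subnet ID is given, and passing both fails early on the client instead of surfacing as an opaque EMR error.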

Regards,

Alex Dean

Mar 26, 2014, 8:08:30 AM
to snowpl...@googlegroups.com
Hi Craig,

Thanks - that's super-helpful. We will get that actioned on our side...

A

Simon Rumble

Mar 26, 2014, 7:41:07 PM
to snowpl...@googlegroups.com
On 26 March 2014 23:03, Craig Thomas <cr...@metail.co.uk> wrote:
We managed to get this to work but we needed to make changes to the code.
Subnet and Placement can not be specified at the same time since subnet implies placement.

Okay I've managed to get this working here too. Thanks everyone for the amazing, and fast, work on this. All I had to do was sleep ;)

To be clear for people who don't really do Ruby like me:
@jobflow.ec2_subnet_id = config[:emr][:ec2_subnet_id]

Translates to a line like this:
  :ec2_subnet_id: <your subnet id>

Going under the:
:emr: block of configuration in the config.yml.
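
To make the mapping between the YAML line and the Ruby line concrete, here is a minimal sketch of how a symbol-keyed config of this shape can be read. The YAML excerpt is hypothetical (only keys mentioned in this thread, with placeholder values), and this is not EmrEtlRunner's actual loading code.

```ruby
require 'yaml'

# Hypothetical excerpt of a config.yml; the key name value is a placeholder.
CONFIG_YAML = <<~YAML
  :emr:
    :ec2_key_name: my-key
    :ec2_subnet_id: subnet-b42222c0
YAML

# Keys written as ":emr:" are parsed as Ruby symbols. safe_load rejects
# symbols by default, so Symbol has to be permitted explicitly.
config = YAML.safe_load(CONFIG_YAML, permitted_classes: [Symbol])

# The same lookup that appears on the right-hand side of the Ruby line above.
subnet_id = config[:emr][:ec2_subnet_id]
puts subnet_id
```

If the :ec2_subnet_id: line is missing from the :emr: block, the lookup returns nil, which matches the library default quoted earlier in the thread.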

I also had to do "bundle install --no-deployment" because of some Ruby locking thing. No idea what that means or what I've now irreparably broken ;)

It seems to have stepped me through that particular error but now I'm encountering a new one:
2014-03-26 23:12:53,006 ERROR org.apache.hadoop.security.UserGroupInformation (IPC Server handler 11 on 9000): PriviledgedActionException as:hadoop cause:java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
I have a strong suspicion that this has something to do with the VPC routing table, but I'm a bit stumped on where to look or how to test network connectivity between the nodes in the cluster...
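
One way to test connectivity from a node is a small TCP probe like the sketch below, run over ssh on the master or a slave. It is a rough diagnostic, not an official tool: port_reachable? is a hypothetical helper, and the targets to probe are assumptions (port 9000 appears in the Hadoop error above, and S3 on port 443 in the earlier console log).

```ruby
require 'socket'

# Returns true if a TCP connection to host:port succeeds within
# timeout_seconds; false if it is refused, unresolvable or times out.
def port_reachable?(host, port, timeout_seconds = 5)
  Socket.tcp(host, port, connect_timeout: timeout_seconds) { |_sock| true }
rescue SystemCallError, SocketError, IOError
  false
end

# Probe every host:port pair given on the command line
# (hostnames or IPv4 addresses).
ARGV.each do |target|
  host, _, port = target.rpartition(':')
  status = port_reachable?(host, port.to_i) ? 'reachable' : 'UNREACHABLE'
  puts "#{target}: #{status}"
end
```

Saved under a name of your choosing, it could be run as, for example, ruby check_ports.rb s3.amazonaws.com:443 followed by the master's private address and port 9000. A timeout on both would point at routing or security groups rather than at Hadoop itself.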

Craig Thomas

Mar 27, 2014, 6:46:11 AM
to snowpl...@googlegroups.com
Hi Simon,

For reference rather than doing 'bundle install --no-deployment',  I edited the snowplow Gemfile and Gemfile.lock to update the elasticity to version '3.0.2'.
In Gemfile.lock elasticity features twice! Once in the version to use section at the top, and once at the bottom of the file where the dependencies are.

You can then use 'bundle install --deployment' as normal.
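
To find both places in Gemfile.lock that need editing, a short script along these lines can help. The lockfile text here is a hypothetical, heavily trimmed excerpt, and lines_naming_gem is an illustrative helper rather than anything Bundler provides.

```ruby
# Hypothetical Gemfile.lock excerpt; a real lockfile lists many more gems.
LOCKFILE = <<~LOCK
  GEM
    remote: https://rubygems.org/
    specs:
      elasticity (3.0.2)

  DEPENDENCIES
    elasticity (~> 3.0.2)
LOCK

# Returns [line_number, text] pairs for every line that starts with the gem name.
def lines_naming_gem(lockfile_text, gem_name)
  pattern = /^\s*#{Regexp.escape(gem_name)}\b/
  lockfile_text.each_line.with_index(1)
               .select { |line, _number| line =~ pattern }
               .map { |line, number| [number, line.strip] }
end

lines_naming_gem(LOCKFILE, 'elasticity').each do |number, text|
  puts "line #{number}: #{text}"
end
```

Against a real file, File.read('Gemfile.lock') would replace the LOCKFILE constant. Both reported lines need the new version before bundle install --deployment will accept the lockfile.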

Craig

Alex Dean

Mar 31, 2014, 11:48:11 AM
to snowpl...@googlegroups.com
Hi folks,

Just to say that this is now fixed in Snowplow in the next release, coming soon:

https://github.com/snowplow/snowplow/tree/feature/0.9.1/3-enrich/emr-etl-runner

A



Peter Vandenberk

Sep 30, 2014, 10:29:23 AM
to snowpl...@googlegroups.com, Dani Sola, Paul McAdam
Hi folks,

Apologies for resurrecting this old thread, but we have exactly the same issue after upgrading from Snowplow version 0.9.3 to 0.9.8 recently.

We didn't make any changes to our VPC and/or our subnets, the only difference is the Snowplow upgrade.

We have been successfully running EMR clusters **inside** our VPC for months now, but after upgrading to 0.9.8 that has stopped working.

If we change our Snowplow EMR config so that the EMR cluster runs **outside** our VPC everything is running perfectly fine, but running it **inside** our VPC results in the error message that gave rise to this thread:

Terminated with errors. Failed to start the job flow due to an internal error

... without any more detail (no further error messages, nothing in the logs)

Anyone experiencing something similar, especially after upgrading to more recent versions of Snowplow?

Thanks,

Peter Vandenberk
Principal Developer
Simply Business

Craig Thomas

Sep 30, 2014, 10:33:10 AM
to snowpl...@googlegroups.com, Dani Sola, Paul McAdam
The best route is to make sure your VPC allows connections in from your office, then ssh into the nodes and look through the log files.
To aid in this I tend to just clone the failed jobs in EMR with appropriate debug flags.

Craig

--
Craig Thomas
Senior Development Operations Engineer
Metail
- - -
Email: cr...@metail.com
Web: www.metail.com
Skype: abercat
Address: 50 St Andrews Street, Cambridge, CB2 3AH



Alex Dean

Sep 30, 2014, 12:38:33 PM
to snowpl...@googlegroups.com, Dani Sola, Paul McAdam
Thanks for sharing Craig!

The only relevant change I can think of between 0.9.3 and 0.9.8 is this bug fix:

https://github.com/snowplow/snowplow/blob/master/CHANGELOG#L43

Cheers,

Alex

Peter Vandenberk

Sep 30, 2014, 1:08:42 PM
to snowpl...@googlegroups.com, Dani Sola, Paul McAdam
Hi Alex,

Thanks for replying, and for linking to the bug fix... that explains **a lot** actually!!! :smile:

In our 0.9.3 config, we were setting ":ec2_subnet_id", assuming that that would create the cluster in a subnet inside our VPC, but we now know that - because of the bug in elasticity - the setting was being ignored and the cluster created outside of our VPC... so my earlier statement was incorrect: we assumed we were running inside the VPC, but we weren't.

So when we upgraded to 0.9.8, the ":ec2_subnet_id" setting all of a sudden **did** create the cluster in a subnet inside our VPC, but that didn't work - as it now turns out after further investigation - because the target subnet didn't have an IGW set up   :-(

tl;dr - it's all good now!  :smile:

Peter

Alex Dean

Sep 30, 2014, 1:12:00 PM
to snowpl...@googlegroups.com, Dani Sola, Paul McAdam
Glad that sorted it!

In actual fact, the bug was in Snowplow's EmrEtlRunner, not Elasticity - it was just very kindly spotted for us by Elasticity author Rob Slifka. Beers on me when I see him in SF next week :-)

A