Managing Spark Instances


Sean Casey

Mar 30, 2016, 5:38:18 PM
to Sparkour
Hi there,

I randomly found your guide to managing Spark instances on EC2 at:


However, I am unable to launch a Spark cluster, as my IAM Role does not have the proper permissions.

Do you know (or have an IAM policy document) for the permissions required to launch a cluster?

Thanks!

Brian Uri!

Mar 30, 2016, 5:57:34 PM
to Sparkour
Hi Sean,

The spark-ec2 script authenticates with the AWS secret keys set in the environment where the script is run, not with the IAM roles assigned to the cluster through the script. Roles only come into play when the cluster tries to access other AWS services.

What is the error message you're seeing and when does it occur? I'll be glad to help.

BU

Sean Casey

Mar 30, 2016, 11:04:20 PM
to spar...@googlegroups.com
Hi Brian,

Thanks for the quick response! In that case, do you know what IAM permissions should be applied to the AWS credentials that the script uses? Ideally, I would like to avoid using my AWS root account or Administrator account for security reasons (we try to apply the principle of least privilege to all of our scripts).

Error message available here: http://pastebin.com/WWKGQcm5

Cheers,

Sean

--
You received this message because you are subscribed to a topic in the Google Groups "Sparkour" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/sparkour/yD1bYGTzWiQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to sparkour+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brian Uri!

Mar 31, 2016, 6:57:30 AM
to Sparkour
Hi Sean,

Maintaining least privilege is going to require a few more script execution attempts to sound out all of the required Actions, but here's how you can get started. For reference, I'm stepping through the scripts at:

https://github.com/amplab/spark-ec2/blob/branch-1.5/spark_ec2.py
https://github.com/boto/boto/blob/develop/boto/ec2/connection.py
https://github.com/boto/boto/blob/develop/boto/ec2/securitygroup.py

The complete list of EC2 Actions is available at:

http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_Operations.html
  1. The IAM user who owns your secret keys definitely needs to have these Actions allowed to get past the error you're currently seeing:
    • AuthorizeSecurityGroupEgress
    • AuthorizeSecurityGroupIngress
  2. If you expect to use the spark_ec2 script to also tear down your cluster in the future, include these:
    • RevokeSecurityGroupEgress
    • RevokeSecurityGroupIngress
  3. Normally, I would say that you need these Actions to create/delete the security group, but it looks like you've gotten past that step, so I'm assuming you created an SG in advance.
    • CreateSecurityGroup
    • DeleteSecurityGroup
  4. Once you have gotten past the security groups step, there will no doubt be additional errors related to Instances, KeyPairs, Images, Tags, and Volumes, depending on the pieces you're using to assemble the cluster. You can trace the error into the scripts to see what EC2 Resource is failing and then search the API Operations page for relevant Actions.
  5. Without seeing the next error message, I'm guessing that you'll eventually need one or more of these:
    • DescribeInstances
    • RunInstances
    • StartInstances
    • StopInstances
    • TerminateInstances
    • DescribeKeyPairs
    • DescribeImages
    • CreateTags
    • DescribeTags
    • DescribeVolumes
    • CreateVolume
    • AttachVolume
    • DetachVolume
    • DeleteVolume
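Pulling the Actions above together, a starter policy might look like the sketch below. This is an unverified starting point built only from the list in this message, not a known-complete policy; expect to add further Actions (discovered through trial runs) before the script succeeds.

```python
import json

# Minimal least-privilege policy sketch assembled from the Actions listed
# above. The Resource is left as "*" because spark-ec2 creates resources
# whose IDs aren't known in advance.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:RevokeSecurityGroupIngress",
            "ec2:CreateSecurityGroup",
            "ec2:DeleteSecurityGroup",
            "ec2:DescribeInstances",
            "ec2:RunInstances",
            "ec2:StartInstances",
            "ec2:StopInstances",
            "ec2:TerminateInstances",
            "ec2:DescribeKeyPairs",
            "ec2:DescribeImages",
            "ec2:CreateTags",
            "ec2:DescribeTags",
            "ec2:DescribeVolumes",
            "ec2:CreateVolume",
            "ec2:AttachVolume",
            "ec2:DetachVolume",
            "ec2:DeleteVolume",
        ],
        "Resource": "*",
    }],
}

# Emit the JSON document to paste into the IAM console.
print(json.dumps(policy, indent=2))
```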
Feel free to post a new stack trace if you get stumped and I can take a look.

Regards,
BU




Brian Uri!

Mar 31, 2016, 7:03:13 AM
to Sparkour
I've also opened a ticket to update the recipe with more explicit information on required script permissions:
https://ddmsence.atlassian.net/browse/SPARKOUR-7

If you'd like to share your final working policy, I'll be glad to incorporate it!

Regards,
BU

Sean Casey

Mar 31, 2016, 2:30:22 PM
to spar...@googlegroups.com
Hi Brian,

Thanks for the feedback! I've updated my policy with the following permissions:


It looks like there's one critical addition of iam:PassRole that's required in the policy.

I'm getting further along in attempting to launch a cluster now, but have run into the following exception:

Launching Spark cluster...
Setting up security groups...
Creating security group test3-master
Creating security group test3-slaves
Searching for existing cluster test3 in region us-east-1...
Spark AMI: ami-35b1885c
Launching instances...
Launched 1 slave in us-east-1d, regid = r-c3441d11
Launched master in us-east-1d, regid = r-29441dfb
Waiting for AWS to propagate instance metadata...
Traceback (most recent call last):
  File "/opt/spark/spark_ec2.py", line 1526, in <module>
    main()
  File "/opt/spark/spark_ec2.py", line 1518, in main
    real_main()
  File "/opt/spark/spark_ec2.py", line 1347, in real_main
    (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
  File "/opt/spark/spark_ec2.py", line 735, in launch_cluster
    map(str.strip, tag.split(':', 1)) for tag in opts.additional_tags.split(',')
ValueError: dictionary update sequence element #0 has length 1; 2 is required


Command being used to launch:

export AWS_ACCESS_KEY_ID="MYKEYHERE"
export AWS_SECRET_ACCESS_KEY="MYSECRETKEYHERE"

/opt/spark/spark-ec2 \
    --key-pair=spark \
    --identity-file=/path/to/ssh.pem \
    --region=us-east-1 \
    --vpc-id=vpc-myvpcid \
    --subnet-id=subnet-mysubnetid \
    --zone=us-east-1d \
    --authorized-address=10.0.0.0/16 \
    --slaves=1 \
    --instance-type=r3.large \
    --spark-version=1.6.0 \
    --hadoop-major-version=1 \
    --instance-profile-name=sparkcluster \
    launch test3

Any ideas?

Thanks!


Sean Casey

Mar 31, 2016, 2:31:43 PM
to spar...@googlegroups.com
Just to clarify: it looks like it's launching the cluster, but unable to get metadata back from the instances?

Brian Uri!

Mar 31, 2016, 2:42:29 PM
to Sparkour
It looks like no metadata has been requested at that point -- it's just trying to corral any tags you've specified into a dictionary so they can be added to the instance.

I know you didn't specify any tags, so I don't know what the issue could be there. Try specifying a tag to see if that works around the issue:

--additional-tags tagName:tagValue,tagName2:tagValue2
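For reference, the error comes straight out of how launch_cluster parses that flag: it splits the string on commas, then splits each entry on the first colon to build a dictionary. An entry with no colon yields a one-element sequence, which is exactly the ValueError in your traceback. A small sketch mirroring that parsing (parse_additional_tags is my name for it, not the script's):

```python
def parse_additional_tags(additional_tags):
    # Mirrors the tag parsing in spark_ec2.py's launch_cluster:
    # split entries on ',', then split each entry on the first ':'.
    tags = {}
    if additional_tags.strip():
        tags = dict(
            map(str.strip, tag.split(':', 1))
            for tag in additional_tags.split(',')
        )
    return tags

# Correct syntax produces a tag dictionary:
print(parse_additional_tags('tagName:tagValue, tagName2:tagValue2'))

# A malformed entry such as 'tagName=tagValue' has no colon, so the split
# returns a single element and dict() raises:
# ValueError: dictionary update sequence element #0 has length 1; 2 is required
```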

Regards,
BU



Sean Casey

Mar 31, 2016, 4:10:16 PM
to spar...@googlegroups.com
Hi Brian,

Sorry, missed a line in the original command I sent you... I had specified a tag in the command, but had used the incorrect syntax. 

After updating the command with the correct tagging syntax, I ran into another error which was caused by needing another IAM permission:

ec2:DescribeInstanceStatus

Looks like we're almost there.... the cluster fires up now, but the script has trouble connecting via SSH afterwards:

Launching Spark cluster...
Setting up security groups...
Searching for existing cluster test in region us-east-1...
Spark AMI: ami-35b1885c
Launching instances...
Launched 1 slave in us-east-1d, regid = r-79653cab
Launched master in us-east-1d, regid = r-fd653c2f
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state............

Warning: SSH connection error. (This could be temporary.)
Host:
SSH return code: 255
SSH output: ssh: Could not resolve hostname : Name or service not known

[Repeat SSH connection error over and over]


Cheers,

Sean




Brian Uri!

Mar 31, 2016, 4:27:01 PM
to Sparkour
Sean,

Keep waiting and letting it fail until you see "Cluster is now in 'ssh-ready' state" in the logs. The script retries the SSH connection until the instance is available. The first time I ran the script, this took over 10 minutes.

Regards,
BU


Sean Casey

Mar 31, 2016, 5:55:40 PM
to spar...@googlegroups.com
Alrighty, figured out the first problem: had to update my NAT instance's security group to allow traffic from the Spark cluster sec group.

We're getting closer now.... the script still can't SSH in (left it there for 45 mins, still no luck):

Warning: SSH connection error. (This could be temporary.)
Host:
SSH return code: 255
SSH output: ssh: Could not resolve hostname : Name or service not known

Here's the relevant output from the master's /var/log/cloud-init.log:

And from /var/log/yum.log:


Cheers,

Sean



Brian Uri!

Mar 31, 2016, 7:08:03 PM
to Sparkour
I don't have any insights on what might be wrong at that stage. Some sanity checks to consider:
  • Are the master / worker instances listed as Running with passing Status Checks in the EC2 dashboard?
  • Can you SSH into the machines on your own from the same machine where you ran the spark-ec2 script via:
    • "Public DNS" name in the EC2 dashboard (this is what the spark-ec2 script uses unless you used the private-ips parameter)
    • "Public IP" in the EC2 dashboard (if this works, it could suggest that DNS is not configured properly via the DHCP Option Sets in the VPC dashboard)
    • If your launch environment is in the same VPC, you could try the private IPs as well
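The empty hostname in your error ("Could not resolve hostname :") is a clue here. A rough sketch of the hostname selection in spark_ec2.py (the Instance class below is a hypothetical stand-in for a boto EC2 instance object, and get_dns_name is simplified from the script): the default path hands ssh the public DNS name, which is an empty string for an instance in a private subnet.

```python
class Instance:
    """Hypothetical stand-in for a boto EC2 instance object."""
    def __init__(self, public_dns_name='', private_ip_address=''):
        self.public_dns_name = public_dns_name
        self.private_ip_address = private_ip_address

def get_dns_name(instance, private_ips=False):
    # Simplified from spark_ec2.py: SSH to the public DNS name by
    # default; with the --private-ips flag, use the private IP instead.
    return instance.private_ip_address if private_ips else instance.public_dns_name

# An instance launched into a private subnet typically has no public DNS
# name, so the default path hands ssh an empty hostname, producing
# "ssh: Could not resolve hostname : Name or service not known".
master = Instance(public_dns_name='', private_ip_address='10.0.1.5')
print(repr(get_dns_name(master)))              # empty string
print(repr(get_dns_name(master, private_ips=True)))  # '10.0.1.5'
```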

Regards,

BU




Sean Casey

Apr 1, 2016, 2:32:55 PM
to spar...@googlegroups.com
Brian, you're a genius... the private-ips flag is exactly what I needed. The cluster is up and running!

I'll update the ticket with the final version of the IAM policy. Thank you for all of your help here; it's greatly appreciated... how do I tip you a beer?

Cheers,

Sean


Brian Uri!

Apr 2, 2016, 8:51:56 AM
to Sparkour
Glad to help! Traveling at the moment, but I'll update that recipe on Monday. Thanks for posting the IAM policy.

BU
