AWS S3 with IAM Assume Role Session access - not on EMR or EC2

eeps...@marketshare.com

Mar 11, 2016, 5:22:50 PM
to cascading-user
We've got access to data in S3 that we want to use as input to a Cascading flow.

The data are not ours, so access is granted via an IAM Role. 

We cannot access the data with a simple awsAccessKeyId/awsSecretAccessKey combination. Our constraints:
  • We need role-based access to S3 (for example, via STSAssumeRoleSessionCredentialsProvider or similar).
  • We are not on EC2 or EMR.
Has anyone done something similar and gotten it to work?
Or does anyone have pointers on what to try?

Ken Krugler

Mar 11, 2016, 5:55:30 PM
to cascadi...@googlegroups.com
I assume you've followed the steps required to configure the cluster for IAM roles, as per:


As an aside, normally you don't want to read directly from S3 in a workflow - it often leads to job failures when you've got lots of data.

So in our workflows we first use an embedded distcp job (via the DistCp class) to copy files into HDFS.
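
The staging step described above can be sketched roughly as follows. This is not the embedded DistCp class usage mentioned in the message: to stay free of Hadoop dependencies it shells out to the `hadoop distcp` command line instead, and it assumes `hadoop` is on the PATH. The class name `DistCpStaging` and the paths are hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Sketch: copy S3 input into HDFS before the flow reads it. */
public class DistCpStaging {

    /** Builds the command line for a plain source-to-target distcp copy. */
    public static List<String> command(String s3Source, String hdfsTarget) {
        return Arrays.asList("hadoop", "distcp", s3Source, hdfsTarget);
    }

    /** Runs the copy and returns the process exit code (0 means success). */
    public static int run(String s3Source, String hdfsTarget)
            throws IOException, InterruptedException {
        Process copy = new ProcessBuilder(command(s3Source, hdfsTarget))
                .inheritIO() // show distcp progress in the caller's console
                .start();
        return copy.waitFor();
    }

    public static void main(String[] args) {
        // Hypothetical bucket and HDFS path; only prints the command here.
        System.out.println(String.join(" ",
                command("s3a://example-bucket/input", "hdfs:///staging/input")));
    }
}
```

The flow's source tap would then point at the HDFS target path rather than at S3, and the flow would only start once `run` returns 0.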

-- Ken


From: eeps...@marketshare.com

Sent: March 11, 2016 2:22:50pm PST

To: cascading-user

Subject: AWS S3 with IAM Assume Role Session access - not on EMR or EC2


--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/16a1824e-560e-413f-8df1-26e5bb93f22d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







eeps...@marketshare.com

Mar 11, 2016, 5:58:37 PM
to cascading-user
We are not running on EMR.

Ken Krugler

Mar 11, 2016, 6:15:29 PM
to cascadi...@googlegroups.com
Including the original email for context...

So you're running in your own cluster somewhere, and pulling from S3, right?

What I've read online is that "...The s3a filesystem adds it [support for IAM roles]; this is ready for production use in Hadoop 2.7.1+ (implicitly HDP 2.3; CDH 5.4 has cherrypicked the relevant patches)."

But I haven't seen any details on exactly how to specify this.
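
For readers looking for those details, here is a minimal sketch of what the s3a settings might look like. It rests on an assumption that goes beyond this thread: the property names and the `AssumedRoleCredentialProvider` class come from the hadoop-aws module in Hadoop 3.1 and later, and are not available in the Hadoop 2.7.x releases mentioned above. The class name `S3aAssumedRoleSettings`, the role ARN, and the session name are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch: s3a settings for assumed-role access, as plain key/value pairs.
 * Assumption: Hadoop 3.1+ with hadoop-aws on the classpath.
 */
public class S3aAssumedRoleSettings {

    public static Map<String, String> build(String roleArn, String sessionName) {
        if (roleArn == null || !roleArn.startsWith("arn:")) {
            throw new IllegalArgumentException("expected a role ARN, got: " + roleArn);
        }
        Map<String, String> settings = new LinkedHashMap<>();
        // Have s3a obtain temporary credentials by assuming the role via STS.
        settings.put("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider");
        settings.put("fs.s3a.assumed.role.arn", roleArn);
        settings.put("fs.s3a.assumed.role.session.name", sessionName);
        // Provider for the long-lived identity that is allowed to call AssumeRole;
        // this one reads fs.s3a.access.key and fs.s3a.secret.key.
        settings.put("fs.s3a.assumed.role.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");
        return settings;
    }

    public static void main(String[] args) {
        // Hypothetical ARN and session name, for illustration only.
        build("arn:aws:iam::123456789012:role/example-read-role", "cascading-flow")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These entries would be copied into the Hadoop configuration (or the properties handed to the flow connector), together with the access key and secret key of the identity permitted to assume the role. Off EC2 and EMR there is no instance profile to fall back on, so that base identity has to be supplied explicitly.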

-- Ken


From: eeps...@marketshare.com

Sent: March 11, 2016 2:58:37pm PST

To: cascading-user

Subject: Re: AWS S3 with IAM Assume Role Session access - not on EMR or EC2


We are not running on EMR.


sarma.t...@tubemogul.com

Mar 11, 2016, 9:56:43 PM
to cascadi...@googlegroups.com

Hey Ken,

I did not want to hijack the other thread but was wondering if this aside was from your experience or if there are other such tips documented somewhere.


As an aside, normally you don't want to read directly from S3 in a workflow - it often leads to job failures when you've got lots of data.

So in our workflows we first use an embedded distcp job (via the DistCp class) to copy files into HDFS.

-- Ken

Ken Krugler

Mar 12, 2016, 10:41:12 AM
to cascadi...@googlegroups.com


From: sarma.t...@tubemogul.com

Sent: March 11, 2016 6:56:32pm PST

To: cascadi...@googlegroups.com

Subject: RE: AWS S3 with IAM Assume Role Session access - not on EMR or EC2


Hey Ken,

I did not want to hijack the other thread but was wondering if this aside was from your experience or if there are other such tips documented somewhere.


That has just been my personal experience - other people may have different perspectives.

-- Ken


As an aside, normally you don't want to read directly from S3 in a workflow - it often leads to job failures when you've got lots of data.

So in our workflows we first use an embedded distcp job (via the DistCp class) to copy files into HDFS.

-- Ken
