Read from and write to S3 with different credentials on EMR


redshift-etl-user

Oct 26, 2013, 3:00:09 AM
to cascadi...@googlegroups.com
I need to read from an S3 bucket with a certain key and secret, and write to a different bucket with a different key and secret, using EMR. The issue with initializing an Hfs tap with a secret key containing a slash has been discussed before, but since I'm dealing with two sets of credentials, using the Hadoop configuration variables won't work (I can only specify one set). Is there a way to do this without making assumptions about whether the secret keys contain slashes?

As a side note, this restriction on secret keys with slashes seems unnecessary in Hfs, since Hadoop's NativeS3FileSystem is able to handle them. Any thoughts on that?

Thanks!

Chris K Wensel

Oct 26, 2013, 5:03:29 AM
to cascadi...@googlegroups.com

the limitation comes (if i remember) from having the credentials as part of the S3 URL. the Java URL parser, or something else in the stack, doesn't like slashes there.

to overcome the issue, you need to make the value a property. 

a very useful feature on Tap is the ability to set arbitrary properties, via the #getStepConfigDef() method. 

in the case of NativeS3FileSystem, you would put the S3 credentials on the Hfs tap via the #getStepConfigDef(), and the FileSystem should see them.
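For reference, the credentials NativeS3FileSystem looks for (on s3n:// paths) are ordinary Hadoop configuration properties, which can carry slashes freely since no URI parsing is involved. Setting them per-tap via #getStepConfigDef() amounts to setting the equivalent of this configuration fragment (these are the standard s3n property names; the values are placeholders):

```xml
<!-- Standard NativeS3FileSystem credential keys; set these on the tap
     via #getStepConfigDef() rather than embedding them in the URL. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>aSecret/With/Slashes</value>
</property>
```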

ckw

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
For more options, visit https://groups.google.com/groups/opt_out.


Christian Romming

Oct 26, 2013, 1:47:40 PM
to cascadi...@googlegroups.com
Thanks, Chris. 

Dug into this a bit more - it's the call to uri.getAuthority(), which is supposed to return everything after the scheme up to and including the port number, that gets confused by the slash (an illegal character there according to the URI spec). It only returns everything up to the slash in the secret key, resulting in an invalid hostname. Both Hfs and NativeS3FileSystem have this problem.
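This is easy to reproduce with java.net.URI alone; a minimal, self-contained sketch (the credentials here are obviously fake):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class S3UriAuthority {
    // Returns the authority that java.net.URI extracts from an s3n:// URL
    // with embedded credentials, or null if parsing fails outright.
    public static String authorityOf(String url) {
        try {
            return new URI(url).getAuthority();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Secret without a slash: the full authority survives.
        System.out.println(authorityOf("s3n://ID:SECRET@bucket/path"));
        // Secret with a slash: the authority component ends at the first '/',
        // so the bucket name never makes it into the hostname.
        System.out.println(authorityOf("s3n://ID:SEC/RET@bucket/path"));
    }
}
```

Without the slash the authority comes back whole ("ID:SECRET@bucket"); with it, getAuthority() stops at "ID:SEC" and the rest of the secret plus the bucket become part of the path.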

Good to know about #getStepConfigDef(), but unfortunately in this case the reading and writing are done in the same step, so I think I'm out of luck. Looks like I'll have to align the bucket permissions somehow to get around this.

Thanks again.



Alex Dean

Oct 27, 2013, 6:19:42 AM
to cascadi...@googlegroups.com
I haven't had my coffee yet so could be missing something, but couldn't you switch:

S3 -> your job -> another S3

to:

S3 -> S3DistCp -> local HDFS ~> your job -> another S3

or:

S3 -> your job ~> local HDFS -> S3DistCp -> another S3

Then, I think, you only have to use #getStepConfigDef() once in your job?

Using S3DistCp and reading/writing from local HDFS is generally much faster anyway than reading/writing S3 directly.
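Sketched as EMR steps, the first variant might look roughly like this (untested; the jar path is the classic EMR AMI location and varies by version, and the bucket names are placeholders):

```shell
# Step 1: copy input from the source bucket into the cluster's HDFS,
# using the source account's credentials.
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src  s3n://source-bucket/input/ \
  --dest hdfs:///input/

# Step 2: run the Cascading job against hdfs:///input/, writing to
# hdfs:///output/ - S3 credentials are only needed for the final write.

# Step 3: push results to the destination bucket with its own credentials.
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src  hdfs:///output/ \
  --dest s3n://dest-bucket/output/
```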

Links:


Hope it's helpful (and sorry for the spam - I sent a prior version of this email directly to the author by mistake)

A

redshift-etl-user

Oct 29, 2013, 12:21:10 AM
to cascadi...@googlegroups.com
Thanks, Alex - that's a neat idea.

Related question: You say this approach is "much faster" - is there evidence that using S3DistCp is also faster overall when the input isn't a bunch of small files?

Thanks again.

Ken Krugler

Oct 29, 2013, 7:50:05 AM
to cascadi...@googlegroups.com
On Oct 29, 2013, at 12:21am, redshift-etl-user <redshi...@gmail.com> wrote:

Thanks, Alex - that's a neat idea.

Related question: You say this approach is "much faster" - is there evidence that using S3DistCp is also faster overall when the input isn't a bunch of small files?

That's been our experience (as well as more reliable than trying to directly read from S3 in a job).

Note though that S3DistCp has a few differences from regular DistCp, which has made it awkward for us to use as a drop-in replacement.

-- Ken


--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






redshift-etl-user

Nov 5, 2013, 3:48:20 PM
to cascadi...@googlegroups.com
S3DistCp definitely seems to be performing better than reading directly from S3, at least for many small files. 

Found a race condition that sometimes causes its reducers to fail, though - you can work around it by setting the property "s3DistCp.copyfiles.mapper.numWorkers" to 1. There's a performance penalty, since that turns off multithreaded downloads.
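Presumably the property gets passed as a generic option on the S3DistCp step, something like the following (assuming S3DistCp accepts -D options via ToolRunner - worth verifying against your AMI; paths and buckets are placeholders):

```shell
# Force single-threaded copies in each mapper to avoid the race,
# at the cost of download parallelism.
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  -D s3DistCp.copyfiles.mapper.numWorkers=1 \
  --src s3n://source-bucket/input/ --dest hdfs:///input/
```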

Do you use S3DistCp only when reading or when writing as well? 



Alex Dean

Nov 7, 2013, 4:05:00 AM
to cascadi...@googlegroups.com
So far, we just use S3DistCp for reading from S3 (not writing back to it).

If there's a race condition bug with writing, do raise a bug with the AWS team. They fix bugs in S3DistCp within a day or two (vs feature requests, which take months).

A

redshift-etl-user

Nov 7, 2013, 4:19:54 AM
to cascadi...@googlegroups.com
Yeah - I've told them about it. 2 days and counting :)



Ken Krugler

Nov 7, 2013, 4:07:50 AM
to cascadi...@googlegroups.com
On Nov 5, 2013, at 8:48pm, redshift-etl-user <redshi...@gmail.com> wrote:

S3DistCp definitely seems to be performing better than reading directly from S3, at least for many small files. 

Found a race condition that sometimes causes its reducers to fail, though - you can work around it by setting the property "s3DistCp.copyfiles.mapper.numWorkers" to 1. There's a performance penalty, since that turns off multithreaded downloads.

Has the above race condition issue been reported on the AWS forums? 


Do you use S3DistCp only when reading or when writing as well? 

We use (embedded) DistCp to pull from S3 into HDFS at the start of most workflows, and then reverse that process to move results back into S3 when we're done.

Directly accessing S3 from a job has thus far been too unreliable, at least when dealing with very large datasets. It works most of the time, but for a daily job that's not good enough.

-- Ken



redshift-etl-user

Nov 7, 2013, 9:07:17 PM
to cascadi...@googlegroups.com
Good to know, Ken. Yes, I've reported the problem to AWS.


Jack Spayed

Feb 10, 2014, 7:42:08 PM
to cascadi...@googlegroups.com
Would you mind sharing your solution? I have the same issue - I have access to a foreign S3 bucket on a different account via an access key/secret key, and need to copy the contents to a bucket on my account.

Unfortunately I'm literally just learning EMR/Hadoop and having a beast of a time digging through ALL the relevant documentation.

What I tried (and failed):
1. Created a role with the foreign account having full control over my EMR (http://docs.aws.amazon.com/IAM/latest/UserGuide/DelegatingAccess.html)
2. Added the foreign AccessKey/SecretKey to EMR's credentials.json
3. Executed S3DistCp with --src,s3://foreignbucket --dest,s3://mybucket

It didn't work - it complained about the foreign account (arn:xxxxxxxxx) not having permission to perform actions in my EMR...

I'm stuck...

I was able to get (what I think is) the foreign account number through EMR erroring out about permissions, prior to creating a delegated role.

Any newbie level solution?

redshift-etl-user

Feb 10, 2014, 8:13:09 PM
to cascadi...@googlegroups.com
Jack,

It's unclear to me whether you're having issues with S3 or EMR access permissions. If it's your EMR cluster, you should use your own credentials to submit jobs to it. There are AWS credential configuration options for the job itself where you can set the "foreign" credentials.

The solution Alex proposed is to run a job that copies from S3 to EMR's HDFS first. Maybe try that with a bucket you control first, and then add the "foreign" credentials once you know that part is working?

Cheers.


Jack Spayed

Feb 10, 2014, 8:19:59 PM
to cascadi...@googlegroups.com


There are AWS credentials config options for the job itself where you can set the "foreign" credentials

^^^^ This is where I'm sure I'll strike gold. Unfortunately my google-fu isn't pointing me in the right direction.

I will continue to look at the S3 -> S3DistCp -> local HDFS ~> your job -> mybucket approach - hope it's not too complicated for me to learn as a day-one user.

Let me know if you can point me toward that AWS credentials doc.

redshift-etl-user

Feb 10, 2014, 8:41:55 PM
to cascadi...@googlegroups.com