AWS S3 CP


Christian Lund

Sep 30, 2016, 2:43:05 AM
to Common Crawl
I'm aware this is perhaps not the proper group for this question, but since Common Crawl uses AWS, I thought it might be relevant nonetheless.

On an EC2 instance I tried copying crawl data from the S3 bucket, but I got the following error:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/1471982290442.1/wat/CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.wat.gz /var/www/html/commoncrawl/CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.wat.gz

Unable to locate credentials
Completed 1 part(s) with ... file(s) remaining

It works fine with wget, but I wanted to see whether "aws s3 cp" offers any performance advantage.

Any feedback is welcome and appreciated.

Sebastian Nagel

Sep 30, 2016, 3:12:39 AM
to common...@googlegroups.com
Hi Christian,

If the default role (inherited, e.g., from the EC2 instance you're logged in to) does not support it,
- either configure the AWS command-line interface to use a profile which has the permissions:
    aws --profile username s3 ...
- or skip request signing:
    aws --no-sign-request s3 ...
The latter works for publicly readable S3 objects (including Common Crawl data).
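
Applied to the file from the original question, the unsigned variant could be wrapped roughly as below. This is only a sketch: the helper name cc_fetch and the variables CC_BUCKET and CC_KEY are made up for illustration, while the bucket, key and destination directory are taken from the command quoted above.

```shell
# Hypothetical helper (cc_fetch, CC_BUCKET, CC_KEY are illustrative names).
# Bucket and object key are the ones from the original question.
CC_BUCKET="s3://commoncrawl"
CC_KEY="crawl-data/CC-MAIN-2016-36/segments/1471982290442.1/wat/CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.wat.gz"

# cc_fetch KEY DEST_DIR
# Copies one publicly readable object without signing the request,
# so no credentials or profile need to be configured.
cc_fetch() {
  aws --no-sign-request s3 cp "${CC_BUCKET}/$1" "$2/$(basename "$1")"
}

# Usage (destination directory must already exist):
#   cc_fetch "$CC_KEY" /var/www/html/commoncrawl
```

If you do have a profile with the right permissions, the same call should work with "--profile username" in place of "--no-sign-request".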

Best,
Sebastian

Christian Lund

Oct 3, 2016, 12:00:49 PM
to Common Crawl
Hi Sebastian,

Thanks a lot, "--no-sign-request" did the trick.

Not sure if this will reduce my AWS network usage (compared to wget), but it appears to be faster.

-Christian
