Costs involved in extracting URL list for 2014 crawl


am...@socho-inc.com

Mar 24, 2014, 4:47:17 AM
to common...@googlegroups.com

We would like to get a list of all URLs that are part of the 2014 crawl (available at s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10). The example code at https://github.com/commoncrawl/example-warc-java/blob/master/src/main/java/org/commoncrawl/examples/java_warc/ReadS3Bucket.java shows how this can be done.
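For context, the first step that code performs is enumerating the objects under the crawl prefix. Below is a minimal sketch of just that listing step, assuming the AWS SDK for Java v1 (the class name ListCrawlKeys is ours, not from the example); getting the actual page URLs would additionally require opening each object and parsing the WARC records, as the linked example does:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class ListCrawlKeys {
    public static void main(String[] args) {
        // Uses the default credential chain; the bucket itself is public.
        AmazonS3 s3 = new AmazonS3Client();

        ListObjectsRequest request = new ListObjectsRequest()
                .withBucketName("aws-publicdatasets")
                .withPrefix("common-crawl/crawl-data/CC-MAIN-2014-10/");

        // S3 returns listings in pages (up to 1000 keys each),
        // so keep fetching until the listing is no longer truncated.
        ObjectListing listing = s3.listObjects(request);
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                System.out.println(summary.getKey() + "\t" + summary.getSize());
            }
            if (!listing.isTruncated()) {
                break;
            }
            listing = s3.listNextBatchOfObjects(listing);
        }
    }
}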

We have an EC2 reserved instance in the US East (N. Virginia) region, and we want to run a modified version of the ReadS3Bucket code on it. Since this code iterates through all the files, we want to be sure that there are no S3 data transfer costs. The S3 FAQ states that "There is no Data Transfer charge for data transferred between Amazon EC2 and Amazon S3 within the same Region or for data transferred between the Amazon EC2 Northern Virginia Region and the Amazon S3 US Standard Region". But since we are not sure which AWS region the Common Crawl data is stored in, we can't be certain that there will be no S3 data transfer costs involved.
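Rather than guessing the region, it may be possible to ask S3 for the bucket's location directly. A sketch, again assuming the AWS SDK for Java v1 (note that s3:GetBucketLocation may only be callable by principals the bucket owner has authorized, so this call could fail with Access Denied):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class CheckBucketRegion {
    public static void main(String[] args) {
        AmazonS3 s3 = new AmazonS3Client();
        // In SDK v1, a return value of "US" denotes the classic
        // US Standard (us-east-1) region.
        String location = s3.getBucketLocation("aws-publicdatasets");
        System.out.println("Bucket region: " + location);
    }
}

If this returns "US", that would correspond to the US Standard region named in the FAQ wording quoted above.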

My question is: if we access the CC data from an EC2 instance in the US East region, will we be charged for data transfer?