New path to Common Crawl Corpus on AWS

Sara Crouse

May 23, 2016, 9:51:18 PM

to Common Crawl

Common Crawl Community,

We have been working with partner Amazon Web Services to improve and streamline the way that Common Crawl data is stored on AWS.

For users of the data, this means that the path to any data in the corpus, whether accessed over HTTPS or S3, has changed because the data has been moved to a new bucket (location) on AWS S3. Going forward, all Common Crawl data is accessible under https://commoncrawl.s3.amazonaws.com/ or s3://commoncrawl/.

For the next few weeks, the entire corpus will be available at *both* the old and new locations. During this time, all links on the Common Crawl website that point to datasets in the corpus will be updated to point to the new location.

This group will receive a reminder of this change and notification when the paths to the previous location are no longer active.

The first new dataset shared at the new location is the April crawl (s3://commoncrawl/crawl-data/CC-MAIN-2016-18/). Details on the April 2016 crawl archive are posted on the Common Crawl blog. (Please note that the April crawl is not available at the old location.)
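
For anyone updating scripts, here is a minimal sketch of how the two forms of the new location relate to each other. It assumes only the two base paths given above; the function names `s3_to_https` and `https_to_s3` are hypothetical helpers for illustration, not part of any Common Crawl tooling.

```python
# Base paths of the new location, as given in this announcement.
HTTPS_BASE = "https://commoncrawl.s3.amazonaws.com/"
S3_BASE = "s3://commoncrawl/"


def s3_to_https(s3_uri: str) -> str:
    """Convert an s3://commoncrawl/... URI to its HTTPS equivalent (hypothetical helper)."""
    if not s3_uri.startswith(S3_BASE):
        raise ValueError(f"not under {S3_BASE}: {s3_uri}")
    # Keep the object key unchanged and swap only the base.
    return HTTPS_BASE + s3_uri[len(S3_BASE):]


def https_to_s3(url: str) -> str:
    """Convert an https://commoncrawl.s3.amazonaws.com/... URL to its S3 URI (hypothetical helper)."""
    if not url.startswith(HTTPS_BASE):
        raise ValueError(f"not under {HTTPS_BASE}: {url}")
    return S3_BASE + url[len(HTTPS_BASE):]


if __name__ == "__main__":
    april_crawl = "s3://commoncrawl/crawl-data/CC-MAIN-2016-18/"
    print(s3_to_https(april_crawl))
```

Only the base of the path differs between the two forms; everything after the bucket name (for example crawl-data/CC-MAIN-2016-18/) stays the same.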

Best regards,

Sara
