S3 download cost from CommonCrawl?


Phil Creston

Oct 12, 2022, 4:43:13 PM
to Common Crawl
Does anyone have an estimate for how much it costs to download a month's CC archive to local (hard drive, not S3) storage via the AWS CLI from an authenticated account?

Being an Open Data Project, is it free (and therefore, the move to use AWS' authentication is just to prevent abuse) or will I end up being hit with a $,000 bill if I start downloading?

The unauthenticated wget download rate of 1 MB/s is problematic, but I also don't want to end up bankrupting myself.

Thank you!

Sebastian Nagel

Oct 28, 2022, 6:57:36 AM
to common...@googlegroups.com
Hi Phil,

(sorry for the late response, the mail somehow went out of sight)

Access to Common Crawl data is still free, regardless of the access
scheme: anonymous via CloudFront or authenticated via S3. The move away
from unauthenticated S3 access was necessary to manage the growing usage
volume, see [1]. Yes, this may include abuse, or let's call it overuse,
since no resource is unlimited, and certainly not network bandwidth.
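
As a rough illustration of the two access schemes, the sketch below maps
a relative object path to an anonymous HTTPS URL and to an S3 URI. The
host data.commoncrawl.org and the bucket name commoncrawl are assumptions
here (see [1] for the authoritative endpoints), and the helper names and
the example path are hypothetical.

```python
# Assumption: anonymous (CloudFront) access goes through this HTTPS host.
CLOUDFRONT_BASE = "https://data.commoncrawl.org/"
# Assumption: authenticated access uses this S3 bucket.
S3_BASE = "s3://commoncrawl/"


def to_cloudfront_url(path: str) -> str:
    """Build the anonymous HTTPS URL for a relative object path."""
    return CLOUDFRONT_BASE + path.lstrip("/")


def to_s3_uri(path: str) -> str:
    """Build the S3 URI for the same relative object path."""
    return S3_BASE + path.lstrip("/")


if __name__ == "__main__":
    # Hypothetical relative path, for illustration only.
    example = "crawl-data/EXAMPLE-CRAWL/warc.paths.gz"
    print(to_cloudfront_url(example))
    print(to_s3_uri(example))
```

Either form should point at the same object; only the transport and the
need for AWS credentials differ.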

> download a month's CC archive to local (hard drive, not S3)

Well, if it's about the WARC files, it's more about a stack of hard
disks. (:
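
To get a feel for the scale, here is a back-of-the-envelope sketch of
transfer time at a fixed rate. The 1 MB/s figure is the rate mentioned in
the question; the 1 TB size is a made-up placeholder, not the actual size
of a crawl, and download_days is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def download_days(total_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to transfer total_bytes at a constant rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return total_bytes / rate_bytes_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    placeholder_size = 10**12  # 1 TB, hypothetical; plug in the real size
    rate = 10**6               # 1 MB/s, the rate quoted in the question
    print(f"{download_days(placeholder_size, rate):.1f} days")
```

Plugging in the real total size and the measured rate gives a quick check
on whether a full local copy is practical.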

Whenever possible, we recommend that you run your computing workload in
the same region (us-east-1) where the Common Crawl dataset is hosted.
Of course, we understand that there are valid use cases for processing
parts of or even entire crawl datasets on your own hardware.

Best,
Sebastian


[1]
https://commoncrawl.org/2022/03/introducing-cloudfront-access-to-common-crawl-data/