Hi Phil,
(Sorry for the late response; your mail somehow slipped out of sight.)
Access to Common Crawl data is still free, regardless of the access
scheme: anonymous via CloudFront or authenticated via S3. The move away
from unauthenticated S3 access was necessary to manage the growing usage
volume, see [1]. Yes, that volume may include abuse, or let's call it
overuse, since no resource is unlimited, and network bandwidth certainly
is not.
> download a month's CC archive to local (hard drive, not S3)
Well, if this is about the WARC files, you'd need a whole stack of hard
disks rather than a single one. (:
Whenever possible, we recommend that you run your computing workload in
the same AWS region (us-east-1) where the Common Crawl dataset is hosted.
Of course, we understand that there are valid use cases for processing
parts of a crawl, or even entire crawl datasets, on your own hardware.
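For the local-download case, here is a minimal Python sketch of how one
file key could map to the two access schemes. The endpoint prefixes
(https://data.commoncrawl.org/ and s3://commoncrawl/), the
warc.paths.gz listing layout and the crawl ID used below are my
assumptions for illustration and are not stated in this mail, so please
check them against the current documentation. The helper names are
hypothetical, and the sketch does no network I/O itself.

```python
import gzip
import io

# Assumed endpoint prefixes (not stated in this mail; verify before use).
CLOUDFRONT_PREFIX = "https://data.commoncrawl.org/"  # anonymous HTTPS access
S3_PREFIX = "s3://commoncrawl/"                      # authenticated S3 access


def paths_listing_key(crawl_id: str, kind: str = "warc") -> str:
    """Key of the gzipped listing of all files of one kind in a crawl (assumed layout)."""
    return f"crawl-data/{crawl_id}/{kind}.paths.gz"


def to_urls(key: str) -> dict:
    """Map one file key to its HTTPS (CloudFront) and S3 locations."""
    key = key.lstrip("/")
    return {"https": CLOUDFRONT_PREFIX + key, "s3": S3_PREFIX + key}


def read_paths(gz_bytes: bytes) -> list:
    """Decompress a *.paths.gz payload into a list of file keys, one per line."""
    with gzip.open(io.BytesIO(gz_bytes), "rt") as fh:
        return [line.strip() for line in fh if line.strip()]


if __name__ == "__main__":
    # "CC-MAIN-2022-05" is only an example crawl ID.
    listing = to_urls(paths_listing_key("CC-MAIN-2022-05"))
    print(listing["https"])
    print(listing["s3"])
```

After fetching the listing, read_paths() gives the individual file keys,
and to_urls() turns each of them into a download location for whichever
scheme you use. Please fetch files sequentially or with modest
parallelism, given the bandwidth point above.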
Best,
Sebastian
[1]
https://commoncrawl.org/2022/03/introducing-cloudfront-access-to-common-crawl-data/