S3 Charges For Outbound Or Not

Vansh Devgan

Aug 25, 2025, 8:12:49 PM

to Common Crawl

Hi Team,

I’d like to confirm something regarding the use of Common Crawl data. Since this dataset is part of AWS Open Data and accessible via authenticated S3, will I incur any charges if I download and process several months of data on a bare-metal server hosted in a different region or with another cloud provider?

My understanding is that because the dataset is hosted under the Open Data program (and not in our own S3 buckets), there should be no additional charges. Could you please confirm if that’s correct? I am thinking to use S3 method as it can pull data faster to my machines.

Thanks,
Vansh Devgan

Tom Morris

Aug 25, 2025, 11:15:39 PM

to common...@googlegroups.com

The safest way to make sure that you're not charged is to not be authenticated when you try to access the data. If it works, you're golden. If it doesn't, you're probably going to get charged if you decide to authenticate.

Tom

Samuel Shadrach

Aug 26, 2025, 5:32:56 AM

to common...@googlegroups.com

False?

Samuel Shadrach

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/common-crawl/CAE9vqEGnMB2CovRWW7hv86%3DF85DNeh6NE%2BSrex%3DLEx777jRELA%40mail.gmail.com.

Greg Lindahl

Aug 26, 2025, 5:40:14 AM

to common...@googlegroups.com

> I am thinking to use S3 method as it can pull data faster to my machines.

If you are outside AWS, you must use https://data.commoncrawl.org. Yes, there's a rate limit. That's intentional. Our infrastructure page https://status.commoncrawl.org/ has advice about how to speed up HTTPS access.
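
For readers fetching from outside AWS, here is a minimal sketch of pulling a file through the HTTPS endpoint mentioned above while backing off when the rate limit is hit. It is illustrative only and not an official Common Crawl client: the helper names (to_https_url, backoff_delays, fetch), the s3://commoncrawl/ prefix mapping, the choice of 429 and 503 as "retryable" status codes, and the delay values are all assumptions.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"  # HTTPS endpoint named in this thread
S3_PREFIX = "s3://commoncrawl/"             # assumption: bucket prefix to strip


def to_https_url(path: str) -> str:
    """Map an s3://commoncrawl/ URI or a bare object key to the HTTPS endpoint."""
    if path.startswith(S3_PREFIX):
        path = path[len(S3_PREFIX):]
    return BASE_URL + path.lstrip("/")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def fetch(path: str, retries: int = 5,
          opener=urllib.request.urlopen, sleep=time.sleep) -> bytes:
    """Download one object, sleeping and retrying on throttling responses."""
    url = to_https_url(path)
    for delay in backoff_delays(retries):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Assumption: treat 429 and 503 as rate-limit signals; re-raise the rest.
            if err.code not in (429, 503):
                raise
            sleep(delay)
    # Final attempt after the schedule is exhausted; any error propagates.
    with opener(url) as resp:
        return resp.read()
```

A call such as fetch("crawl-data/<crawl-id>/warc.paths.gz") would return the raw bytes of that listing, where <crawl-id> is a placeholder for whichever crawl you need. Keeping requests sequential and sleeping between retries is the conservative choice here, since the rate limit is intentional.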

Tom: You gave incorrect advice. If you try unauthenticated S3 access, you'll find that it triggers an error. Sebastian wrote a blog post about this change when it was made in March 2022: https://commoncrawl.org/blog/introducing-cloudfront-access-to-common-crawl-data

-- greg

