Guidance on Making HTTPS Range Requests to Common Crawl


aziz

Jun 20, 2025, 1:27:05 PM
to Common Crawl
Hello everyone,
I hope you're all doing well.

I am working on a project that requires extracting data from Common Crawl, and based on my estimate, I'll need to download 20-50M WARC records from each new crawl.

However, due to a very limited budget, I am unable to use AWS for this. As an alternative, I am considering HTTPS range requests.
To avoid downloading huge amounts of data, I looked into the index to determine where the records I need are located, and found that they are scattered across all 90,000 WARC files. Because of this, using HTTPS range requests with the offset and length values from the index seems like the most practical approach.
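
For readers following along, here is a minimal sketch of one such range request in Python. It assumes the WARC files are served over HTTPS from https://data.commoncrawl.org/ and that each record is stored as its own gzip member, so the slice can be decompressed on its own. The function names `range_header` and `fetch_record` are hypothetical, and the sketch does no throttling or retrying, which a real client would need.

```python
import gzip
import urllib.request

# Assumption: public HTTPS endpoint that serves the WARC files.
BASE_URL = "https://data.commoncrawl.org/"


def range_header(offset: int, length: int) -> str:
    """Build the HTTP Range header value for one record.

    HTTP byte ranges are inclusive on both ends, so the last byte
    requested is offset + length - 1.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(filename: str, offset: int, length: int,
                 timeout: float = 30.0) -> bytes:
    """Fetch and decompress a single WARC record (hypothetical helper).

    filename, offset, and length are the values reported by the index
    for that record.
    """
    request = urllib.request.Request(
        BASE_URL + filename,
        headers={"Range": range_header(offset, length)},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 Partial Content means the server honored the range.
        if response.status != 206:
            raise RuntimeError(f"expected 206, got {response.status}")
        compressed = response.read()
    # Assumption: the slice is one complete gzip member.
    return gzip.decompress(compressed)
```

Calling `fetch_record` with the `filename`, `offset`, and `length` fields of an index entry would return the raw WARC record bytes (WARC headers, HTTP headers, and payload), so only the bytes of that record are transferred rather than the whole file.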

My question is:
Would it be too much to make HTTPS range requests for 20-50M records over the course of a few days? If so, what is a polite and acceptable request rate?

I appreciate any guidance, and thank you for providing such an incredible resource.
Best regards,
Aziz  

Jen English

Jun 20, 2025, 5:58:36 PM
to Common Crawl
Aziz,

You can find our guidance for HTTPS requests to the index in our FAQ: https://commoncrawl.org/faq

Please see the details in the section further down the page: "Why am I getting connection errors or 5xx responses from index.commoncrawl.org?"

Let us know if you have any follow-up questions or need clarification on anything there.

Best, 
Jen English, Common Crawl