I hope you're all doing well.
I am working on a project that requires extracting data from Common Crawl, and based on my estimation, I'll need to download 20-50M WARC records from each new crawl.
However, due to a very limited budget, I am unable to use AWS for this. As an alternative, I am considering HTTPS range requests.
To avoid downloading a huge amount of data, I looked into the index
to determine where the records I need are located, and found that they are scattered across all 90,000 WARC files. Because of this, using HTTPS range requests with the offset and length values from the index seems like the most practical approach.
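For reference, the per-record fetch I have in mind looks roughly like this (a minimal sketch; the download host and the example function names are my own assumptions, and the offset/length values would come from the index):

```python
import gzip
import urllib.request

# Assumed download host for crawl data (not verified for every crawl).
CC_PREFIX = "https://data.commoncrawl.org/"

def byte_range(offset, length):
    # HTTP Range headers are inclusive on both ends,
    # so a record of `length` bytes ends at offset + length - 1.
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(warc_path, offset, length):
    # warc_path, offset, and length are the fields from an index lookup.
    req = urllib.request.Request(
        CC_PREFIX + warc_path,
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        compressed = resp.read()
    # Each WARC record is stored as an independent gzip member,
    # so the fetched slice decompresses on its own.
    return gzip.decompress(compressed)
```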
My question is:
would it be too much to make HTTPS range requests for 20-50M records
over the course of a few days?
If so, what would be a polite and acceptable request rate?
I appreciate any guidance, and thank you for providing such an incredible resource.
Best regards,
Aziz