WARC filtering and Rate Limits.

GuyMark

Nov 2, 2025, 1:07:46 AM
to Common Crawl
Apologies if this has already been answered, but the official FAQ didn't seem to mention it and I cannot find the info anywhere else.

I would like to know the following, please:

1. What is considered a polite but fast download speed? I do not want to be a nuisance, but I don't want to "sip" data if I am welcome to go at a much faster rate. I am currently self-limiting to 75 Mbps, but at this rate, by the time I finally download the last WARC, the data will be almost three months out of date.

2. Is there a requested maximum number of threads? I am currently using 2.

3. Is there a way to download only the WARCs for pages that have English content OR specific TLDs?

4. Is there a way to download ONLY the WARCs that contain NEW pages, i.e. pages that were not indexed in the previous crawl?

5. Is there a way to "trickle download", so that rather than waiting for a brand new set to be released, we can have a "trickle update"? Presumably this would "average out the load" for Common Crawl and also allow us to keep content moderately fresh.
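
For context on point 1, here is a rough back-of-the-envelope sketch of how much data a sustained 75 Mbps moves over the period mentioned. The 90-day figure is an assumption standing in for "almost three months", the function name is purely illustrative, and the result is only the volume implied by those two numbers, not an official size for any crawl.

```python
def terabytes_transferred(rate_mbps: float, days: float) -> float:
    """Decimal terabytes moved at a sustained rate (megabits per second) over `days`."""
    bytes_per_second = rate_mbps * 1_000_000 / 8  # megabits/s -> bytes/s
    seconds = days * 24 * 60 * 60                 # days -> seconds
    return bytes_per_second * seconds / 1e12      # bytes -> terabytes (10^12 bytes)


if __name__ == "__main__":
    # Assumption: "almost three months" is taken as 90 days.
    print(f"{terabytes_transferred(75, 90):.1f} TB")
```

Since the volume scales linearly with the rate, doubling the rate would halve the time needed for the same amount of data, which is why the answer to point 1 matters so much here.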

Apologies if this is covered elsewhere. If it is and you have time, could you please give me a link to the right page?

Thank you.

Mark