GuyMark
Nov 2, 2025, 1:07:46 AM
to Common Crawl
Apologies if this has already been answered, but the official FAQ didn't seem to mention it and I cannot find the information anywhere else.
I would like to know the following, please:
1. What is considered a polite but fast download speed? I do not want to be a nuisance, but I don't want to "sip" data if I am welcome to go at a much faster rate. I am currently self-limiting to 75 Mbps, but at this rate, by the time I finally download the last WARC, the data will be almost three months out of date.
2. Is there a requested maximum number of threads? I am currently using 2.
3. Is there a way to download only the WARCs for pages that have English content OR specific TLDs?
4. Is there a way to download only the WARCs that contain NEW pages, i.e. pages that were not indexed in the previous crawl?
5. Is there a way to "trickle download", so that rather than waiting for a brand new set to be released, we can have a "trickle update"? Presumably this would "average out the load" for Common Crawl and also allow us to keep content moderately fresh.
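For context on point 1, here is a rough sketch of the arithmetic. Only the 75 Mbps rate comes from above; the function names and the 90-day window (standing in for "almost three months") are illustrative assumptions, and the result is simply what that rate moves in that time, not an official crawl size.

```python
def bytes_per_day(mbps: float) -> float:
    """Bytes transferred in 24 hours at a sustained rate in megabits per second."""
    bits_per_second = mbps * 1_000_000  # decimal megabits, as ISPs quote them
    return bits_per_second / 8 * 86_400  # 8 bits per byte, 86,400 seconds per day


def days_to_download(total_bytes: float, mbps: float) -> float:
    """Days needed to fetch total_bytes at a sustained rate of mbps."""
    return total_bytes / bytes_per_day(mbps)


if __name__ == "__main__":
    rate = 75  # Mbps, the self-imposed limit mentioned in point 1
    per_day_tb = bytes_per_day(rate) / 1e12
    print(f"{rate} Mbps sustained is about {per_day_tb:.2f} TB per day")
    # 90 days is an assumed stand-in for "almost three months"
    print(f"Over 90 days that is about {per_day_tb * 90:.1f} TB")
```

Doubling the rate halves the time, so knowing what rate is acceptable makes a large difference to how stale the data is by the time the download finishes.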
Apologies if this is covered elsewhere. If it is and you have time, could you please give me a link to the right page?
Thank you.
Mark