Hi Rob,
> Would the best thing be to download all the robot.txt files then drop
> the other domains names?
That's clearly the best option.
.com sites currently make up about 45% of the robots.txt captures
(WARC records), see [1,2].
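If you go that route, a rough sketch in Python (assuming the requests
and warcio packages; the crawl ID below is only an example, pick the
crawl you need and use its robotstxt.paths.gz listing):

import gzip
from urllib.parse import urlsplit

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"
CRAWL = "CC-MAIN-2025-18"   # example crawl ID, adjust as needed

# per-crawl listing of the robots.txt WARC files
listing = requests.get(BASE + "crawl-data/" + CRAWL + "/robotstxt.paths.gz")
paths = gzip.decompress(listing.content).decode().splitlines()

keep = (".com", ".co.uk")

for path in paths:                    # a few hundred WARC files per crawl
    resp = requests.get(BASE + path, stream=True)
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        host = urlsplit(url).hostname or ""
        if host.endswith(keep):
            robots_txt = record.content_stream().read()
            # ... store robots_txt keyed by host or url
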
If it were only about .co.uk (less than 2.5%), the answer would be
different:
1. determine the list of robots.txt captures using the columnar
URL index, together with the WARC record locations (filename, offset,
length)
2. download only the WARC records you need via HTTP range requests
(see the sketch below)
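For completeness, a rough sketch of both steps (the Athena table and
column names follow the cc-index-table setup; the crawl ID is again
only an example):

import io

import requests
from warcio.archiveiterator import ArchiveIterator

# Step 1: locate the records with a query over the columnar index, e.g.
#
#   SELECT url, warc_filename, warc_record_offset, warc_record_length
#   FROM "ccindex"."ccindex"
#   WHERE crawl = 'CC-MAIN-2025-18'        -- example crawl ID
#     AND subset = 'robotstxt'
#     AND url_host_name LIKE '%.co.uk'
#
# Step 2: fetch each record individually via an HTTP range request.
def fetch_robots_txt(warc_filename, offset, length):
    resp = requests.get(
        "https://data.commoncrawl.org/" + warc_filename,
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    )
    resp.raise_for_status()
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read()
    return None
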
This procedure is described in [3]. But the per-record overhead would
be far too high if you skip "only" about 50% of the records.
Best,
Sebastian
[1]
https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/latestcrawl.html
[2]
https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/percentage.html
[3]
https://github.com/commoncrawl/robotstxt-experiments
On 5/20/25 15:56, 'Rob Mackin' via Common Crawl wrote:
> Hello, I hope everyone is well.
>
> Just wondering how I download just the robot.txt files from all .com
> and .co.uk domains.
>
> Would the best thing be to download all the robot.txt files then drop
> the other domains names?
>
> Thanks
> Rob
>