Re: [cc] Downloading the robots.txt files for all .com and .co.uk domains

Sebastian Nagel

May 20, 2025, 10:46:44 AM
to common...@googlegroups.com
Hi Rob,

> Would the best thing be to download all the robots.txt files and then
> drop the other domain names?

That's clearly the best option.

.com sites currently make up about 45% of the robots.txt captures (WARC
records), see [1,2].

If it were only about .co.uk (less than 2.5%), the answer would be
different:
1. determine the list of robots.txt captures using the columnar URL
   index, together with the WARC record locations (filename, offset,
   length) -- see the query sketch below
2. download only the WARC records you need via HTTP range requests --
   see the fetch sketch further below
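
A minimal sketch of step 1 (not an official recipe, and the crawl id is
only an example): scan one crawl's robots.txt partition of the columnar
index with DuckDB from Python. In practice Athena or Spark is the usual
way to query the index, and S3 access may require AWS credentials.

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")  # bucket region

# Example crawl id -- pick the crawl(s) you actually need.
index_glob = ("s3://commoncrawl/cc-index/table/cc-main/warc/"
              "crawl=CC-MAIN-2025-18/subset=robotstxt/*.parquet")

# WARC record locations (filename, offset, length) of all .co.uk captures.
locations = con.execute(f"""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('{index_glob}')
    WHERE url_host_registered_domain LIKE '%.co.uk'
""").fetchall()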

This procedure is described in [3], but the overhead would be far too
high if you skip "only" about 50% of the records.
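
For completeness, a rough sketch of step 2: fetch a single record with an
HTTP range request against data.commoncrawl.org. The filename, offset and
length come from the index query above; the path shown here is a made-up
placeholder.

import gzip
import urllib.request

warc_filename = "crawl-data/CC-MAIN-2025-18/segments/.../robotstxt/....warc.gz"
offset, length = 3441, 612  # warc_record_offset, warc_record_length

req = urllib.request.Request("https://data.commoncrawl.org/" + warc_filename)
# Each WARC record is its own gzip member, so fetching exactly these bytes
# yields one complete, independently decompressible record.
req.add_header("Range", f"bytes={offset}-{offset + length - 1}")
with urllib.request.urlopen(req) as resp:
    record_gz = resp.read()

# WARC headers + HTTP headers + robots.txt body of that single capture.
print(gzip.decompress(record_gz).decode("utf-8", errors="replace"))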


Best,
Sebastian


[1] https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/latestcrawl.html
[2] https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/percentage.html
[3] https://github.com/commoncrawl/robotstxt-experiments


On 5/20/25 15:56, 'Rob Mackin' via Common Crawl wrote:
> Hello, I hope everyone is well.
>
> Just wondering how I download just the robots.txt files from all .com
> and .co.uk domains.
>
> Would the best thing be to download all the robots.txt files and then
> drop the other domain names?
>
> Thanks
> Rob
>