Downloading the robots.txt files for all .com and .co.uk domains


Rob Mackin

May 20, 2025, 10:31:16 AM
to Common Crawl
Hello, I hope everyone is well. 

Just wondering how to download just the robots.txt files from all .com and .co.uk domains.

Would the best approach be to download all the robots.txt files and then drop the other domain names?

Thanks 
Rob

Sebastian Nagel

May 20, 2025, 10:46:44 AM
to common...@googlegroups.com
Hi Rob,

> Would the best approach be to download all the robots.txt files and
> then drop the other domain names?

That's clearly the best option.

.com sites currently make up about 45% of the robots.txt captures (WARC
records); see [1, 2].
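
For illustration, a minimal sketch of that route in Python (assuming the
warcio and requests libraries; the crawl ID is just an example, and every
crawl lists its robots.txt WARC files in robotstxt.paths.gz):

    import gzip
    from urllib.parse import urlparse

    import requests
    from warcio.archiveiterator import ArchiveIterator

    CRAWL = 'CC-MAIN-2025-18'  # example crawl, pick the one you need
    BASE = 'https://data.commoncrawl.org/'
    KEEP = ('.com', '.co.uk')

    # the robots.txt WARC files of a crawl are listed in robotstxt.paths.gz
    listing = requests.get(BASE + 'crawl-data/' + CRAWL + '/robotstxt.paths.gz',
                           timeout=60).content
    paths = gzip.decompress(listing).decode().splitlines()

    for path in paths[:1]:  # demo: only the first of several hundred files
        with requests.get(BASE + path, stream=True, timeout=300) as resp:
            for record in ArchiveIterator(resp.raw):
                if record.rec_type != 'response':
                    continue
                url = record.rec_headers.get_header('WARC-Target-URI')
                host = urlparse(url).hostname or ''
                if host.endswith(KEEP):
                    robots_txt = record.content_stream().read()
                    # keep robots_txt for this url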

If it were only about .co.uk (less than 2.5% of the captures), the answer
would be different:
1. determine the list of robots.txt captures, together with the WARC
record locations (filename, offset, length), using the columnar URL index
2. download only the WARC records you need via HTTP range requests

This procedure is described in [3]; a rough sketch follows below. But the
overhead would be far too high if you skip "only" about 50% of the records.
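
For illustration, a minimal sketch of that route (not the code from [3]),
assuming DuckDB with the httpfs extension, anonymous S3 access to the
public bucket, and the column names of the cc-index table; the crawl ID is
again just an example:

    from io import BytesIO

    import duckdb
    import requests
    from warcio.archiveiterator import ArchiveIterator

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")

    # robots.txt captures of one crawl, restricted to .co.uk hosts
    rows = con.execute("""
        SELECT url, warc_filename, warc_record_offset, warc_record_length
        FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-18/subset=robotstxt/*.parquet')
        WHERE url_host_name LIKE '%.co.uk'
        LIMIT 10  -- demo only, drop the limit for the full subset
    """).fetchall()

    for url, filename, offset, length in rows:
        resp = requests.get('https://data.commoncrawl.org/' + filename,
                            headers={'Range': f'bytes={offset}-{offset + length - 1}'},
                            timeout=60)
        # each range is a self-contained gzipped WARC record
        for record in ArchiveIterator(BytesIO(resp.content)):
            print(url, len(record.content_stream().read()))

The same table can also be queried with Athena or Spark; see the
cc-index-table repository for details.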


Best,
Sebastian


[1] https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/latestcrawl.html
[2] https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/percentage.html
[3] https://github.com/commoncrawl/robotstxt-experiments

Rob Mackin

May 20, 2025, 12:04:28 PM
to Common Crawl
Hello Sebastian, 

Great, thanks for getting back to me.

I will let you know how I get on. 

Thanks
Rob
