Hi Rob,
> Would the best thing be to download all the robot.txt files then drop
> the other domains names?
That's clearly the best option.
.com sites currently make up about 45% of the robots.txt captures
(WARC records), see [1,2].
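If you go that route, a rough sketch in Python (assuming the requests
and warcio packages; the crawl ID below is only an example, pick the
crawl you need and use its robotstxt.paths.gz listing):

import gzip
from urllib.parse import urlsplit

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"
CRAWL = "CC-MAIN-2025-18"   # example crawl ID, adjust as needed

# per-crawl listing of the robots.txt WARC files
listing = requests.get(BASE + "crawl-data/" + CRAWL + "/robotstxt.paths.gz")
paths = gzip.decompress(listing.content).decode().splitlines()

keep = (".com", ".co.uk")

for path in paths:                    # a few hundred WARC files per crawl
    resp = requests.get(BASE + path, stream=True)
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        host = urlsplit(url).hostname or ""
        if host.endswith(keep):
            robots_txt = record.content_stream().read()
            # ... store robots_txt keyed by host or url
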
If it were only about .co.uk (less than 2.5%), the answer would be
different:
1. determine the list of robots.txt captures using the columnar
URL index, together with the WARC record locations (filename, offset,
length)
2. download only the WARC records you need via HTTP range requests
(see the sketch below)
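For completeness, a rough sketch of both steps (the Athena table and
column names follow the cc-index-table setup; the crawl ID is again
only an example):

import io

import requests
from warcio.archiveiterator import ArchiveIterator

# Step 1: locate the records with a query over the columnar index, e.g.
#
#   SELECT url, warc_filename, warc_record_offset, warc_record_length
#   FROM "ccindex"."ccindex"
#   WHERE crawl = 'CC-MAIN-2025-18'        -- example crawl ID
#     AND subset = 'robotstxt'
#     AND url_host_name LIKE '%.co.uk'
#
# Step 2: fetch each record individually via an HTTP range request.
def fetch_robots_txt(warc_filename, offset, length):
    resp = requests.get(
        "https://data.commoncrawl.org/" + warc_filename,
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    )
    resp.raise_for_status()
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read()
    return None
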
This procedure is described in [3]. But the per-record overhead would
be far too high if you skip "only" about 50% of the records.
Best,
Sebastian
[1]
https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/latestcrawl.html
[2]
https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/percentage.html
[3]
https://github.com/commoncrawl/robotstxt-experiments
On 5/20/25 15:56, 'Rob Mackin' via Common Crawl wrote:
> Hello, I hope everyone is well.
>
> Just wondering how I download just the robot.txt files from all .com
> and .co.uk domains.
>
> Would the best thing be to download all the robot.txt files then drop
> the other domains names?
>
> Thanks
> Rob
>