Downloading the robots.txt files for all .com and .co.uk domains
Rob Mackin
May 20, 2025, 10:31:16 AM
to Common Crawl
Hello, I hope everyone is well.
Just wondering how I can download only the robots.txt files from all .com and .co.uk domains.
Would the best thing be to download all the robots.txt files and then drop the other domain names?
Thanks
Rob
Sebastian Nagel
May 20, 2025, 10:46:44 AM
to common...@googlegroups.com
Hi Rob,
> Would the best thing be to download all the robots.txt files and then
> drop the other domain names?
That's clearly the best option.
.com sites currently make up about 45% of the robots.txt captures (WARC
records); see [1,2].
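For illustration, a minimal sketch of that bulk approach in Python. It
assumes the per-crawl robotstxt.paths.gz file listing and the warcio
library; the crawl ID is only an example:

```python
import gzip
from urllib.parse import urlsplit

import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL = "CC-MAIN-2025-18"  # example crawl ID -- substitute the one you need
BASE = "https://data.commoncrawl.org/"

# Per-crawl listing of the WARC files that hold the robots.txt captures.
paths_url = f"{BASE}crawl-data/{CRAWL}/robotstxt.paths.gz"
resp = requests.get(paths_url)
resp.raise_for_status()
warc_paths = gzip.decompress(resp.content).decode().splitlines()

def wanted(url):
    """Keep only hosts under .com or .co.uk."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".com") or host.endswith(".co.uk")

# Stream one WARC file (loop over warc_paths for the whole crawl).
with requests.get(BASE + warc_paths[0], stream=True) as r:
    r.raise_for_status()
    for record in ArchiveIterator(r.raw):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        if url and wanted(url):
            robots_body = record.content_stream().read()
            # ... store robots_body, e.g. keyed by host
```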
If it were only about .co.uk (less than 2.5%), the answer would be
different:
1. determine the list of robots.txt captures using the columnar
URL index, together with the WARC record locations (filename, offset,
length)
2. download only the WARC records you need via HTTP range requests
This procedure is described in [3]. But the overhead would be far too
high if you skip "only" 50% of the records.
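For completeness, a rough sketch of both steps above. The SQL is
illustrative only; it assumes the columnar index exposed as the "ccindex"
Athena table, with column names taken from the published schema, plus the
warcio library:

```python
import io

import requests
from warcio.archiveiterator import ArchiveIterator

# Step 1 -- query the columnar URL index restricted to the robots.txt
# subset; illustrative SQL for the Athena table:
#
#   SELECT url, warc_filename, warc_record_offset, warc_record_length
#   FROM "ccindex"."ccindex"
#   WHERE crawl = 'CC-MAIN-2025-18'   -- example crawl
#     AND subset = 'robotstxt'
#     AND url_host_registered_domain LIKE '%.co.uk'

# Step 2 -- fetch a single WARC record via an HTTP range request.
def fetch_record(filename, offset, length):
    url = "https://data.commoncrawl.org/" + filename
    byte_range = f"bytes={offset}-{offset + length - 1}"
    resp = requests.get(url, headers={"Range": byte_range})
    resp.raise_for_status()
    # The returned slice is one complete gzipped WARC record.
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        return record.content_stream().read()
```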