Why does the robotstxt crawl contain some records for non-'robots' URIs?

Henry S. Thompson

May 5, 2025, 11:53:30 AM
to common...@googlegroups.com
[If you aren't interested in obscure details about the way the crawls
are collected, look away now!]

A modest percentage of the request/response pairs in the robotstxt
WARC files have WARC-Target-URIs which don't look like
".../robots.txt". Where do these come from?
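
(If you want to reproduce this, something along the following lines,
using warcio on a local copy of one of the robotstxt WARC files, is
enough to list them. It's an untested sketch, not the code I actually
used; the filename is just the one discussed below.)

from warcio.archiveiterator import ArchiveIterator

# List response records in a robotstxt WARC whose target URI is not a
# robots.txt
WARC = "CC-MAIN-20190817203056-20190817225056-00346.warc.gz"  # local copy

with open(WARC, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI") or ""
        if not uri.split("?", 1)[0].endswith("/robots.txt"):
            print(uri)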

In a small handful of cases, the request/response pair looks (to me)
like a perfectly normal URI yielding a perfectly normal HTML page.
So, why is it in the robotstxt part of the crawl instead of the
warc part? For example, in
crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/robotstxt/CC-MAIN-20190817203056-20190817225056-00346.warc.gz,
we find:

WARC/1.0
WARC-Type: request
...
WARC-Target-URI: https://results.gothiacup.se/2019/start

GET /2019/start HTTP/1.1
...
Host: results.gothiacup.se


WARC/1.0
WARC-Type: response
...
Content-Length: 55214
Content-Type: application/http; msgtype=response
...
WARC-Target-URI: https://results.gothiacup.se/2019/start
...
WARC-Identified-Payload-Type: text/html

HTTP/1.1 200 OK
...

<html>
<head>
...
<title>Search - Gothia Cup 2019 Results</title>
...
</head>
<body>
...
<h1>...
Gothia Cup 2019
</h1>

...

Ah, more research answered my own question. If a request for a
robots.txt results in a 302, not only does that request appear in the
robotstxt section, _not_ the crawldiagnostics section, but the
resulting request/response pair for the 302 Location URI, even if it's
not a robots.txt file, also appears in the robotstxt section.
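
(Given one of those non-robots.txt URIs, finding the record which
redirected to it is then a matter of looking for a 3xx robots.txt
response whose Location header matches. A rough warcio sketch,
untested; note that, as in this case, the redirect and its target can
land in different segments' WARCs, so the list of files to scan may be
longer than you'd hope:)

from warcio.archiveiterator import ArchiveIterator

def find_redirect_source(warc_paths, target_uri):
    # Scan robotstxt WARCs for a 3xx response whose Location is target_uri
    for path in warc_paths:
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response" or record.http_headers is None:
                    continue
                if not record.http_headers.get_statuscode().startswith("3"):
                    continue
                if record.http_headers.get_header("Location") == target_uri:
                    return path, record.rec_headers.get_header("WARC-Target-URI")
    return None

# e.g. find_redirect_source(paths, "https://results.gothiacup.se/2019/start")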

So in the above case, the original legitimate request which redirected
to the above is found in
crawl-data/CC-MAIN-2019-35/segments/1566027313501.0/robotstxt/CC-MAIN-20190817222907-20190818004907-00108.warc.gz:

WARC/1.0
WARC-Type: request
...
WARC-Target-URI: https://results.gothiacup.se/robots.txt

GET /robots.txt HTTP/1.1
...
Host: results.gothiacup.se


WARC/1.0
WARC-Type: response
...
WARC-Target-URI: https://results.gothiacup.se/robots.txt
...
WARC-Identified-Payload-Type: text/html

HTTP/1.1 302 Moved Temporarily
...
Access-Control-Allow-Origin: https://static.cupmanager.net
Content-Type: text/html; preliminary=true; charset=UTF-8
Vary: X-Forwarded-Proto
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Expires: Sat, 06 May 1995 12:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Location: https://results.gothiacup.se/2019/start
Content-Length: 0

Which is broken in ... many ways...

A quick check of just one robotstxt WARC file found at least 351
redirections out of 2095 request/response pairs:

253 HTTP/1.1 301 Moved Permanently
66 HTTP/1.1 302 Found
16 HTTP/1.1 302 Moved Temporarily
3 HTTP/1.1 302 Redirect
3 HTTP/1.1 303 See Other
2 HTTP/1.1 301 Moved
2 HTTP/1.1 307 Temporary Redirect
1 HTTP/1.1 301 Found
1 HTTP/1.1 301 Moved permanently
1 HTTP/1.1 301 MOVED PERMANENTLY
1 HTTP/1.1 302 Found
1 HTTP/1.1 302 Move Temporary
1 HTTP/1.1 302 Object Moved

Of these, the supplied Location header was _not_ obviously a
robots.txt file in 63 cases, of which only 7 did not have some form of
robots.txt WARC-Target-URI. That's at least 15% of the redirections
being, well, pretty unlikely to be correct.
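
(In case anyone wants to repeat the exercise on other files, a rough
warcio version of the tally. Not the script I actually ran, and the
"obviously a robots.txt" test here is just a substring check on the
Location value:)

from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def tally(path):
    # Count the HTTP status lines of the response records in one
    # robotstxt WARC, plus the 3xx responses whose Location doesn't
    # mention robots.txt at all.
    statuses = Counter()
    odd_locations = 0
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            statuses[record.http_headers.statusline] += 1
            if record.http_headers.get_statuscode().startswith("3"):
                loc = record.http_headers.get_header("Location") or ""
                if "robots.txt" not in loc:
                    odd_locations += 1
    return statuses, odd_locations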

The ways website managers find to mess things up never cease to
amaze me.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
e-mail: h...@inf.ed.ac.uk
URL: https://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

Sebastian Nagel

May 5, 2025, 1:33:40 PM
to common...@googlegroups.com
Hi Henry,

good observation! Thanks for sharing it!

Just as an addition: in August 2021 [1] the rules for robots.txt
archiving were improved: an HTML page is no longer archived.

However, the crawler still follows a redirected robots.txt - since the
end of 2023 even up to 5 levels of redirection, as required by RFC
9309. But the redirect target is only archived if it actually is a
robots.txt. If the redirect target turns out to be an HTML file, this
means that there are no robots rules and crawling is allowed.
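
Roughly this policy, sketched in Python for illustration (this is not
the crawler's actual code, and the content-type check standing in for
"is this really a robots.txt?" is a simplification):

import requests

def fetch_robots(host):
    session = requests.Session()
    session.max_redirects = 5   # follow up to five redirects (RFC 9309)
    try:
        r = session.get(f"https://{host}/robots.txt", timeout=30)
    except requests.TooManyRedirects:
        return None             # give up on the redirect chain
    ctype = r.headers.get("Content-Type", "")
    if r.status_code == 200 and not ctype.startswith("text/html"):
        return r.text           # looks like robots rules: archive and parse
    return ""                   # HTML target: no rules, crawling is allowed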

Resolving the robots.txt redirects using the URL index isn't an easy
task if it's about more than just a few sites; see [2].
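
For one or a few sites it is manageable, though: look the robots.txt
URL up in the CDX index, then fetch just that record's byte range from
the robotstxt WARC and read the Location header of the archived
response. A sketch, assuming the index API at index.commoncrawl.org
plus requests and warcio (not the code from the notebook in [2]):

import io
import json

import requests
from warcio.archiveiterator import ArchiveIterator

INDEX = "https://index.commoncrawl.org/CC-MAIN-2019-35-index"
DATA = "https://data.commoncrawl.org/"

def robots_redirect_location(url):
    # Look the robots.txt URL up in the URL index ...
    resp = requests.get(INDEX, params={"url": url, "output": "json"},
                        timeout=60)
    if resp.status_code != 200:
        return None
    for line in resp.text.splitlines():
        hit = json.loads(line)
        if not hit.get("status", "").startswith("3"):
            continue
        # ... then fetch only that record's bytes from the robotstxt
        # WARC and read the Location header of the archived response.
        start = int(hit["offset"])
        end = start + int(hit["length"]) - 1
        rec = requests.get(DATA + hit["filename"],
                           headers={"Range": f"bytes={start}-{end}"},
                           timeout=60)
        for record in ArchiveIterator(io.BytesIO(rec.content)):
            return record.http_headers.get_header("Location")
    return None

# e.g. robots_redirect_location("https://results.gothiacup.se/robots.txt")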

Best,
Sebastian

[1] https://commoncrawl.org/blog/july-august-2021-crawl-archive-available
[2]
https://github.com/commoncrawl/robotstxt-experiments/blob/main/src/jupyter/data-preparation-top-k-sample.ipynb

Henry S. Thompson

May 6, 2025, 9:45:02 AM
to common...@googlegroups.com
[personal reply]
Sebastian Nagel writes:

> Resolving the robots.txt redirects using the URL index isn't an easy
> task if it's about more than just a few sites; see [2].

Resolving _any_ redirects is tricky. Some years ago I spent rather
longer than I had planned trying to compute some statistics on the
number of redirects, which meant trying to work _backwards_, taking
multiple passes through crawldiagnostics, following chains and
stitching things together. In the end I gave it up as too
computationally expensive, since I didn't have any free compute
resources at that time.
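
(For what it's worth, the stitching step itself is simple enough in
outline. A toy, in-memory sketch, ignoring the backwards bookkeeping
and the multiple passes that the real data volumes force on you:)

from collections import Counter

def chain_lengths(redirects):
    # redirects: dict mapping each redirecting URL to its Location target
    lengths = Counter()
    heads = set(redirects) - set(redirects.values())  # nothing redirects here
    for url in heads:
        seen, hops = {url}, 0
        while url in redirects:
            url = redirects[url]
            if url in seen:     # redirect loop
                break
            seen.add(url)
            hops += 1
        lengths[hops] += 1
    return lengths

# chain_lengths({"http://a/": "http://b/", "http://b/": "http://c/"})
# -> Counter({2: 1})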