[If you aren't interested in obscure details about the way the crawls
are collected, look away now!]
A modest percentage of the request/response pairs in the robotstxt
WARC files have WARC-Target-URIs which don't look like
".../robots.txt". Where do these come from?
In a small handful of cases, the request/response pair looks (to me)
like a perfectly normal URI yielding a perfectly normal HTML page.
So, why is it in the robotstxt part of the crawl instead of the
warc part? For example, in
crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/robotstxt/CC-MAIN-20190817203056-20190817225056-00346.warc.gz,
we find:
WARC/1.0
WARC-Type: request
...
WARC-Target-URI: https://results.gothiacup.se/2019/start
GET /2019/start HTTP/1.1
...
Host: results.gothiacup.se
WARC/1.0
WARC-Type: response
...
Content-Length: 55214
Content-Type: application/http; msgtype=response
...
WARC-Target-URI: https://results.gothiacup.se/2019/start
...
WARC-Identified-Payload-Type: text/html
HTTP/1.1 200 OK
...
<html>
<head>
...
<title>Search - Gothia Cup 2019 Results</title>
...
</head>
<body>
...
<h1>...
Gothia Cup 2019
</h1>
...
Ah, more research answered my own question. If a request for a
robots.txt results in a 302, not only does that request/response pair
appear in the robotstxt section, _not_ the crawldiagnostics section,
but the subsequent request/response pair for the 302 Location URI,
even when it's not a robots.txt file, also appears in the robotstxt
section.
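This behaviour is easy to check for yourself. Here's a minimal
stdlib-only sketch of the header logic involved: it flags response
records in a robotstxt WARC whose WARC-Target-URI doesn't end in
/robots.txt. (Real crawl files are gzip-compressed multi-record
streams; a library such as warcio is the usual tool, and the inline
sample here is a cut-down illustration, not actual crawl data.)

```python
def non_robots_targets(warc_text):
    """Return WARC-Target-URIs of response records that don't end in
    /robots.txt -- candidates for the redirect-following behaviour."""
    targets = []
    # Naive record split on the version line; fine for this sketch.
    for record in warc_text.split("WARC/1.0"):
        headers = dict(
            line.split(": ", 1)
            for line in record.strip().splitlines()
            if ": " in line
        )
        if headers.get("WARC-Type") != "response":
            continue
        uri = headers.get("WARC-Target-URI", "")
        if uri and not uri.endswith("/robots.txt"):
            targets.append(uri)
    return targets

sample = """WARC/1.0
WARC-Type: response
WARC-Target-URI: https://results.gothiacup.se/robots.txt

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://results.gothiacup.se/2019/start
"""
print(non_robots_targets(sample))
# -> ['https://results.gothiacup.se/2019/start']
```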
So in the above case, the original legitimate request which redirected
to the above is found in
crawl-data/CC-MAIN-2019-35/segments/1566027313501.0/robotstxt/CC-MAIN-20190817222907-20190818004907-00108.warc.gz:
WARC/1.0
WARC-Type: request
...
WARC-Target-URI: https://results.gothiacup.se/robots.txt
GET /robots.txt HTTP/1.1
...
Host: results.gothiacup.se
WARC/1.0
WARC-Type: response
...
WARC-Target-URI: https://results.gothiacup.se/robots.txt
...
WARC-Identified-Payload-Type: text/html
HTTP/1.1 302 Moved Temporarily
...
Access-Control-Allow-Origin: https://static.cupmanager.net
Content-Type: text/html; preliminary=true; charset=UTF-8
Vary: X-Forwarded-Proto
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Expires: Sat, 06 May 1995 12:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Location: https://results.gothiacup.se/2019/start
Content-Length: 0
Which is broken in ... many ways...
A quick check of just one robotstxt WARC file found at least 351
redirections out of 2095 request/response pairs:
253 HTTP/1.1 301 Moved Permanently
66 HTTP/1.1 302 Found
16 HTTP/1.1 302 Moved Temporarily
3 HTTP/1.1 302 Redirect
3 HTTP/1.1 303 See Other
2 HTTP/1.1 301 Moved
2 HTTP/1.1 307 Temporary Redirect
1 HTTP/1.1 301 Found
1 HTTP/1.1 301 Moved permanently
1 HTTP/1.1 301 MOVED PERMANENTLY
1 HTTP/1.1 302 Found
1 HTTP/1.1 302 Move Temporary
1 HTTP/1.1 302 Object Moved
In 63 of those cases the supplied Location header was _not_ obviously
a robots.txt file, and of those 63 only 7 did not have some form of
robots.txt WARC-Target-URI. That's at least 15% of the redirections
being, well, pretty unlikely to be correct.
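The tally above is just a count of distinct HTTP status lines. A
sketch of the arithmetic, assuming you've already pulled the status
lines out of one WARC file (the example lines below are invented for
illustration, not the actual crawl data):

```python
from collections import Counter

def tally_redirects(status_lines):
    """Count distinct HTTP status lines and total the 3xx ones."""
    counts = Counter(line.strip() for line in status_lines)
    redirects = sum(
        n for line, n in counts.items()
        if line.split()[1].startswith("3")  # 301, 302, 303, 307, ...
    )
    return counts, redirects

lines = [
    "HTTP/1.1 301 Moved Permanently",
    "HTTP/1.1 302 Found",
    "HTTP/1.1 200 OK",
    "HTTP/1.1 302 Found",
]
counts, redirects = tally_redirects(lines)
print(redirects)  # -> 3
```

Note that because the reason phrases vary ("302 Found", "302 Moved
Temporarily", "302 Redirect", ...), counting distinct status lines
splits the same status code across several rows, exactly as in the
tally above.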
The ways which website managers find to mess things up never cease to
amaze me.
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
e-mail: h...@inf.ed.ac.uk
URL: https://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]