In the WARC file from the 2018-10 crawl I've been working with, there
are 431 application/pdf responses. Of these, 91 have been truncated,
which can be detected by the fact that their message body contains
exactly 1048576 bytes (1 MiB).
These 91 are all and only the responses whose message body, as
contained in the WARC file, is not valid PDF.
So far so good.
But only 82 of the 91 have a
  WARC-Truncated: length
header in the WARC response prolog. The other 9 have no header
suggesting anything has gone wrong.
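
For anyone wanting to repeat the check, here is a minimal Python sketch
of the classification described above. It assumes the WARC headers and
the message body length have already been pulled out of each record by
whatever WARC reader is in use. The function name and the three labels
are hypothetical, chosen only for illustration. Note that a PDF which
really is exactly 1048576 bytes long would be reported as "silent", so
the result is a hint, not proof.

```python
from typing import Mapping

# Body size observed on every truncated PDF response in this WARC file.
CAP_BYTES = 1_048_576


def classify_pdf_response(body_len: int, warc_headers: Mapping[str, str]) -> str:
    """Label one application/pdf response (hypothetical helper).

    'flagged'  : the record carries a WARC-Truncated header
    'silent'   : no such header, but the body is exactly CAP_BYTES long
    'complete' : neither of the above
    """
    # Header names are compared case-insensitively.
    names = {name.lower() for name in warc_headers}
    if "warc-truncated" in names:
        return "flagged"
    if body_len == CAP_BYTES:
        return "silent"
    return "complete"
```

Counting the "silent" results over the application/pdf responses is the
kind of tally that gave the 9 above.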
Note that the 156 overall occurrences of truncation among the 54,829
responses in this WARC file are distributed as follows:
  107 WARC-Truncated: length
   49 WARC-Truncated: disconnect
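
A breakdown like this can be produced with a simple counter over the
WARC headers of each record. The sketch below is only an illustration:
it assumes each record's headers are available as a name-to-value
mapping, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def tally_truncation_reasons(records: Iterable[Mapping[str, str]]) -> Counter:
    """Count WARC-Truncated header values across a sequence of header mappings."""
    counts: Counter = Counter()
    for headers in records:
        for name, value in headers.items():
            # Header names are compared case-insensitively.
            if name.lower() == "warc-truncated":
                counts[value.strip()] += 1
    return counts
```

Records without the header contribute nothing, so the total of the
counts is the number of flagged truncations.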
Looking at the headers for the 9 odd cases reveals nothing obvious
that's true of all of them. 3 of them have pairs similar to the
following:
  X-Crawler-Content-Length: 2539581
  Content-Length: 1048576
but the other six have neither.
4 of them have
  X-Crawler-Content-Encoding: gzip
but the others have no sign of compression. (The overall ratio of
compressed message bodies is 44,274/54,829 == 81%).
Most of the fetchTimeMs values are under 1000; the longest I saw was
just over 2000.
This is obviously not a big deal, but I thought that roughly 10% (9 of
91) of the PDF truncations being silent was worth noting. I haven't
looked at the text/html
responses, but could do so, Sebastian, if you thought it might be
useful.
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]