Henry S. Thompson writes:
> Sebastian Nagel writes:
>> ...
>> Can you give a concrete example?
>
> Will do, in a subsequent message.
>
>>> Would it be possible to detect connection loss (which I'm guessing
>>> must be the explanation here) and include in the post-response
>>> metadata section some indication of this?
>>
>> This is tracked since August 2018 and indicated as
>> WARC-Truncated: disconnect
>> Before (since 2013) only the "length" and "time" (10 min.) limit
>> have been flagged in the WARC files. I would need to dig into the
>> code to see how disconnects have been handled.
>
> I'll rerun my tests on a post-2018-08 file and see what I find.
OK, I've now extracted comparable information from a WARC file from
2018-10:
Compared to the one I looked at from 2015-10, it's
* Smaller (1.10 vs 1.20 GB compressed (92%), 4.54 vs 5.42 GB raw (84%))
* Shorter (54829 responses vs. 71016 (77%))
The 1MB limit has affected 107 responses (cf. 72 in 2015-10, 149%), and
a further 49 were truncated because of a disconnect.
After further investigation, I see that in fact I had substantially
under-reported the cases of WARC-Truncated in 2015-10, which should have
been:
27224 WARC-Truncated: length
6 WARC-Truncated: time
I think the explanation is that in 2015-10, WARC-Truncated: length
reflects two quite different situations:
1) The 1MB limit kicked in (72 cases);
2) There was no Content-Length header in the HTTP response, and the
connection simply closed after the body (the other 27152 cases). My
reporting software only logged cases where the actual response
differed in length from that computed from the Content-Length _in_
the WARC response header (the "further 35" reported above). The
similarity of the relevant ratios (35 out of 107 = 33%, 49 out of
156 = 31%) suggests that, as hypothesised, these were all cases of
premature disconnect.
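
To make the distinction above concrete, here is a minimal sketch of how the cases could be told apart per response record. It is not my actual reporting software: the function name, the category labels and the assumption that the limit is exactly 1,048,576 bytes are all illustrative.

```python
# Hypothetical sketch: categorise one response record from three facts
# about it. Names, labels and the exact limit value are assumptions.
ASSUMED_LIMIT = 1_048_576  # assumed value of the "1MB" limit, in bytes


def classify_response(truncated, declared_length, actual_length,
                      limit=ASSUMED_LIMIT):
    """Return a label for one response record.

    truncated       -- value of the WARC-Truncated header, or None
    declared_length -- Content-Length as an int, or None if absent
    actual_length   -- number of body bytes actually stored
    """
    if truncated in ("disconnect", "time"):
        return truncated
    if truncated == "length":
        if actual_length >= limit:
            return "limit"                # situation 1: the size limit kicked in
        if declared_length is None:
            return "no-content-length"    # situation 2: body ended at close
        return "length-other"
    if declared_length is not None and declared_length != actual_length:
        return "unflagged-mismatch"       # mismatch without a Truncated header
    return "complete"


print(classify_response("length", 2_500_000, 1_048_576))
print(classify_response("length", None, 40_000))
print(classify_response(None, 90_000, 61_234))
```

Under those assumptions the three calls fall into the "limit", "no-content-length" and "unflagged-mismatch" categories respectively.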
So, net-net, the post-2018-08 reporting changes give a more accurate
picture of response length and its origins and nature than theretofore.
Returning to the potential impact of changing to a 4MB limit, here
are numbers and histograms for the 2018-10 file, comparable to the data
in my earlier message for 2015-10:
The 1MB limit affects 103* responses: if they were not truncated this
would add 608,828,559 bytes (about twice as much as in the 2015-10 case
in absolute terms), adding 12.5% (more like three times as much in
proportional terms).
The (pre-truncation) Content-Length distribution for the bodies of those
103 responses looks like this:
(80 truncated files, length histogram with 15 bins of width 200,000):
1,100,000 11 ***********
1,300,000 12 ************
1,500,000 13 *************
1,700,000 10 **********
1,900,000 7 *******
2,100,000 4 ****
2,300,000 6 ******
2,500,000 6 ******
2,700,000 2 **
2,900,000 1 *
3,100,000 1 *
3,300,000 3 ***
3,500,000 1 *
3,700,000 2 **
3,900,000 1 *
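
For anyone wanting to produce comparable histograms from their own list of Content-Length values, something along these lines should do. This is a sketch rather than the script I actually used, and the function name and parameters are made up for illustration.

```python
# Hypothetical sketch: fixed-width text histogram in the layout used in
# this message (bin centre, count, one star per file).
def text_histogram(lengths, width, start):
    """Return one line per bin for the values in `lengths` >= start."""
    values = [n for n in lengths if n >= start]
    if not values:
        return []
    n_bins = (max(values) - start) // width + 1
    counts = [0] * n_bins
    for n in values:
        counts[(n - start) // width] += 1
    lines = []
    for i, count in enumerate(counts):
        centre = start + i * width + width // 2
        # rstrip() drops the trailing space left when a bin is empty
        lines.append(f"{centre:,} {count} {'*' * count}".rstrip())
    return lines


sample = [1_050_000, 1_150_000, 1_250_000, 1_650_000]  # made-up lengths
for line in text_histogram(sample, width=200_000, start=1_000_000):
    print(line)
```

The sample lengths are invented purely to show the layout; empty bins are printed with a zero count and no stars, as in the tables here.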
The remaining 23 cover a much larger range than the corresponding 19
from 2015-10:
(23 larger truncated files, length histogram with 21 bins of width 10,000,000):
5,000,000 13 *************
15,000,000 4 ****
25,000,000 2 **
35,000,000 0
45,000,000 2 **
55,000,000 0
65,000,000 0
75,000,000 0
85,000,000 1 *
95,000,000 0
105,000,000 0
115,000,000 0
125,000,000 0
135,000,000 0
145,000,000 0
155,000,000 0
165,000,000 0
175,000,000 0
185,000,000 0
195,000,000 0
205,000,000 1 *
If we look only at the application/pdf files, we get this
(82 truncated pdf files, length histogram with 19 bins of width 1,000,000):
1,500,000 41 *****************************************
2,500,000 18 ******************
3,500,000 5 *****
4,500,000 3 ***
5,500,000 4 ****
6,500,000 1 *
7,500,000 2 **
8,500,000 2 **
9,500,000 0
10,500,000 1 *
11,500,000 0
12,500,000 0
13,500,000 0
14,500,000 0
15,500,000 1 *
16,500,000 0
17,500,000 0
18,500,000 0
19,500,000 1 *
>20,000,000 3 (23MB, 26MB, 41MB)
Increasing the 1MB limit to 4MB would let 64 of these 82 through, along
with 16 of other media types.
The truncated lengths of the 103 files currently being clipped, now
exactly 103MB, account for about 2.2% of the total of 4,876,729,301.
If the cutoff was moved from 1MB to 4MB, the lengths of those 103 files
in the WARC would increase by about 133MB, adding 2.9% to the total.
This compares to 104MB, 1.9% of the total, for my earlier sample from
2015-10.
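
The 133MB and 2.9% figures can be roughly cross-checked from the first histogram alone. The sketch below is an approximation under stated assumptions: every file is taken to sit at its bin centre, "1MB" and "4MB" are taken to mean 1,048,576 and 4,194,304 bytes, and each of the 23 larger files is assumed to gain exactly the 3MB difference. The variable and function names are illustrative only.

```python
# Hypothetical cross-check of the 4MB estimate from the binned data.
MIB = 1_048_576                  # assumed meaning of "1MB"
TOTAL_RAW_BYTES = 4_876_729_301  # total quoted in this message

# Bin centre -> count, copied from the 80-file histogram.
small_bins = {
    1_100_000: 11, 1_300_000: 12, 1_500_000: 13, 1_700_000: 10,
    1_900_000: 7, 2_100_000: 4, 2_300_000: 6, 2_500_000: 6,
    2_700_000: 2, 2_900_000: 1, 3_100_000: 1, 3_300_000: 3,
    3_500_000: 1, 3_700_000: 2, 3_900_000: 1,
}
N_OVER_NEW_LIMIT = 23            # files still clipped at 4MB


def extra_bytes(bins, n_over, old_limit=1 * MIB, new_limit=4 * MIB):
    """Approximate bytes added by raising the limit from old to new."""
    # Files under the new limit are stored whole: they gain length - old_limit.
    under = sum(count * (centre - old_limit) for centre, count in bins.items())
    # Files over the new limit are clipped again: they gain the difference.
    over = n_over * (new_limit - old_limit)
    return under + over


extra = extra_bytes(small_bins, N_OVER_NEW_LIMIT)
extra_mib = extra / MIB
extra_pct = 100 * extra / TOTAL_RAW_BYTES
print(f"about {extra_mib:.0f}MB extra, {extra_pct:.1f}% of the total")
```

With the bin-centre approximation this comes out at about 133MB and 2.9%, in line with the figures above.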
So, assuming the more recent WARC file I looked at is not wildly
misleading for the month's crawl as a whole, this gives us a reasonable
estimate of the impact of shifting the cutoff to 4MB.
ht
*In 103 cases my reporting software detected a length mismatch. These
coincided with a WARC-Truncated: length header in 98 cases, the other 5
did not have such a header. I'll leave details of those 5 cases, as
well as the 9 cases (107 - 98) where there _was_ a Truncated header but
no length mismatch, to another email.