Minor points wrt truncation of large responses


Henry S. Thompson

Jul 11, 2019, 12:50:50 PM
to common...@googlegroups.com
In the WARC file from the 2018-10 crawl I've been working with there are
431 application/pdf responses. Of these 91 have been truncated, which
can be detected by the fact that their message body contains exactly
1048576 bytes (1 megabyte).
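That check can be sketched as follows (a hypothetical Python helper, not the code actually used; the `%%EOF`-trailer test is one cheap assumption about how "not valid PDF" can be detected):

```python
PDF_MAGIC = b"%PDF-"
TRUNCATION_SIZE = 1048576  # 1 MiB: the apparent content-length cap

def looks_silently_truncated(body: bytes) -> bool:
    """Heuristic: a PDF payload that is exactly 1 MiB long and lacks
    the %%EOF trailer near its end was almost certainly cut off."""
    return (
        len(body) == TRUNCATION_SIZE
        and body.startswith(PDF_MAGIC)
        and b"%%EOF" not in body[-1024:]
    )
```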

These 91 are all and only the responses whose message body, as contained
in the WARC file, is not valid PDF.

So far so good.

But, only 82 of the 91 have a

WARC-Truncated: length

header in the WARC response prolog. The other 9 have no header
suggesting anything has gone wrong.

Note that the 156 overall occurrences of truncation among the 54,829
responses in this WARC file are distributed as follows:

107 WARC-Truncated: length
49 WARC-Truncated: disconnect
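A tally like the one above can be produced with a small sketch over the records' WARC header fields (written here against plain header dicts to stay library-neutral; with warcio one could feed it `dict(r.rec_headers.headers) for r in ArchiveIterator(f) if r.rec_type == 'response'`):

```python
from collections import Counter

def tally_truncations(header_dicts):
    """Count WARC-Truncated reasons (e.g. 'length', 'disconnect')
    across response records; records without the header count as 'none'."""
    return Counter(h.get("WARC-Truncated", "none") for h in header_dicts)
```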

Looking at the headers for the 9 odd cases reveals nothing obvious
that's true of all of them. 3 of them have pairs similar to the
following:

X-Crawler-Content-Length: 2539581
Content-Length: 1048576

but the other six have neither.

4 of them have

X-Crawler-Content-Encoding: gzip

but the others have no sign of compression. (The overall ratio of
compressed message bodies is 44,274/54,829 == 81%).

Most of the fetchTimeMs values are under 1000; the longest I saw was
just over 2000.

This is obviously not a big deal, but I thought that 10% of truncations
being silent was worth noting. I haven't looked at the text/html
responses, but could do so, Sebastian, if you thought it might be
useful.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Sebastian Nagel

Jul 17, 2019, 7:59:05 AM
to common...@googlegroups.com
Hi Henry,

thanks for sharing the observations. I'll have a look into it.

> In the WARC file from the 2018-10 crawl I've been working with

Could you share the full name or s3 path of the WARC file?
That would help to reproduce your findings. Thanks!

> but the other six have neither.
>
> 4 of them have
>
> X-Crawler-Content-Encoding: gzip

With chunked Transfer-Encoding there is also no Content-Length header.
Maybe we should always add a Content-Length header, even if there
wasn't one originally?

> but I thought that 10% of truncations
> being silent was worth noting.

Thanks! Of course, there may be a bug. I'll check it; truncations are
already tracked at the protocol level [1].


Thanks,
Sebastian


[1]
https://github.com/commoncrawl/nutch/blob/cc/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java

Henry S. Thompson

Jul 17, 2019, 11:37:40 AM
to common...@googlegroups.com
Sebastian Nagel writes:


>> In the WARC file from the 2018-10 crawl I've been working with
>
> Could you share the full name or s3 path of the WARC file?
> That would help to reproduce your findings. Thanks!

CC-MAIN-20181016221847-20181017003347-00114.warc.gz

> Also with chunked Transfer-Encoding there is no Content-Length header.
> Maybe we should always add the Content-Length header, even if there
> wasn't one originally?

That would simplify things a bit: as it stands I'm always computing the
expected payload length by subtracting the length of the HTTP headers in
the WARC response record from the record's _own_ Content-Length header.
Since presumably it would be an X-Crawler-Content-Length header, there
would be no possible confusion as to where it came from.
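The subtraction described here can be sketched as follows (a hypothetical helper, assuming the WARC record's Content-Length covers the whole stored HTTP message: status line, headers, the blank-line separator, and the body):

```python
def expected_payload_length(http_block: bytes, warc_content_length: int) -> int:
    """Expected body length = WARC record Content-Length minus the
    HTTP header block (including the CRLFCRLF separator)."""
    header_end = http_block.find(b"\r\n\r\n") + 4  # include the separator
    return warc_content_length - header_end
```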

Sebastian Nagel

Jul 18, 2019, 6:51:13 AM
to common...@googlegroups.com
Hi Henry,

thanks! I can confirm that not all truncations are flagged.
There are a couple of other points which could be improved.
The issues are tracked here:
https://github.com/commoncrawl/nutch/issues/10


> an X-Crawler-Content-Length header

No, it would mean always adding the "Content-Length" header.

"X-Crawler-Content-Length" is the original "Content-Length" which needs to
be replaced, see [1,2].

The point is that WARC files may store the HTTP body unchanged, using the
original Content-Encoding or Transfer-Encoding. Common Crawl doesn't, as
this would add extra work for users to dechunk or decompress the payload.
Indeed, there are WARC readers which use the HTTP headers to read the
content and then fail if the Content-Length, Content-Encoding or
Transfer-Encoding do not match.
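The rewriting convention described here can be sketched as follows (an illustrative helper, not Nutch's actual implementation: original transport headers are preserved under X-Crawler-* names, and Content-Length is set to the decoded, dechunked payload size):

```python
def rewrite_headers(headers: dict, decoded_body_len: int) -> dict:
    """Preserve the original Content-Length / Content-Encoding /
    Transfer-Encoding under X-Crawler-* names, then record the
    decoded payload size as the new Content-Length, so WARC readers
    that trust the HTTP headers read the stored body correctly."""
    out = {}
    for name, value in headers.items():
        if name in ("Content-Length", "Content-Encoding", "Transfer-Encoding"):
            out["X-Crawler-" + name] = value
        else:
            out[name] = value
    out["Content-Length"] = str(decoded_body_len)
    return out
```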

Best,
Sebastian

[1] http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/
[2] https://groups.google.com/forum/#!topic/openwayback-dev/vabOfUZhXAs

Henry S. Thompson

Jul 18, 2019, 9:28:34 AM
to common...@googlegroups.com
Sebastian Nagel writes:

> thanks! I can confirm that not all truncations are flagged.
> There are a couple of other points which could be improved.
> The issues are tracked here:
> https://github.com/commoncrawl/nutch/issues/10

Thank you!

>> an X-Crawler-Content-Length header
>
> No, it would mean always adding the "Content-Length" header.
>
> "X-Crawler-Content-Length" is the original "Content-Length" which needs to
> be replaced, see [1,2].

Right.

> The point is that WARC files may store the HTTP body unchanged, using the
> original Content-Encoding or Transfer-Encoding. Common Crawl doesn't, as
> this would add extra work for users to dechunk or decompress the payload.
> Indeed, there are WARC readers which use the HTTP headers to read the
> content and then fail if the Content-Length, Content-Encoding or
> Transfer-Encoding do not match.

Understood.