1 MB limit question/observation

Henry S. Thompson

Jul 5, 2019, 6:53:57 AM
to common-crawl
Is that limit (only mentioned in passing in the August 2018 release
notice) still in force?

I'm interested in using CC as a way of crawling PDF on the Web, and
about 20% of the PDFs (28 out of 131) in my preliminary sample (from a
single file from the 2015 crawl) have been truncated and so are
more-or-less unreadable.

Background numbers: The file I worked with contains 71016 responses
and is 5,818,889,895 bytes in length. The 1MB limit affects 72
responses: if they were not truncated this would add 284,884,035 bytes
or about 4.9%.
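
For anyone wanting to reproduce this kind of count, something along the
following lines should do the job (an illustrative sketch using warcio,
not my actual script; the function name and the 1MB-wide bucketing are
mine, and whether the original length appears as Content-Length or as
X-Crawler-Content-Length may depend on the crawl vintage):

from collections import Counter
from warcio.archiveiterator import ArchiveIterator

BIN = 1_000_000  # 1 MB-wide histogram bins, to match the tables below

def truncation_histogram(path):
    # Tally truncated response records by their declared
    # (pre-truncation) Content-Length.
    hist = Counter()
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            if record.rec_headers.get_header('WARC-Truncated') is None:
                continue
            declared = (record.http_headers.get_header('X-Crawler-Content-Length')
                        or record.http_headers.get_header('Content-Length'))
            if declared is not None:
                hist[int(declared) // BIN] += 1
    return hist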

The (pre-truncation) Content-Length distribution for those 72 files
looks like this:

(72 truncated files, length histogram with 12 bins of width 1,000,000):

1500000 31 *******************************
2500000 13 *************
3500000 9 *********
4500000 2 **
5500000 3 ***
6500000 3 ***
7500000 2 **
8500000 2 **
9500000 0
10500000 1 *
11500000 1 *
12500000 2 **
>13000000 3

If we look only at the application/pdf files, we get this

(28 truncated pdf files, length histogram with 12 bins of width 1,000,000):

1500000 7 *******
2500000 6 ******
3500000 5 *****
4500000 1 *
5500000 0
6500000 2 **
7500000 1 *
8500000 2 **
9500000 0
10500000 1 *
11500000 1 *
12500000 1 *
>13000000 1

Increasing the length limit to 4MB would let 18 out of those 28
through, along with 35 of other media types.

The truncated lengths of the 72 files currently being clipped,
approximately 72MB, account for about 1.3% of the total of
5,818,889,895.

If the cutoff was moved from 1MB to 4MB, the truncated lengths would
now sum to 184,209,889, an increase of about 104MB (1.8%) to the overall file
length.

Might I request an experiment for the August run which increases the cutoff to 4MB?

Thanks,

ht

[Footnote: Along with the 72 files discussed above, a further 35 have a

WARC-Truncated: length

header in the response prolog, and a length mismatch between the
Content-Length given there (which is less than 1MB) and in the
response itself, _without_ any sign that the difference is due to
compression. Inspecting a few of these does indeed suggest that they
are _not_ complete.

There's no obvious pattern to the fetch times as reported for these
cases.

Would it be possible to detect connection loss (which I'm guessing
must be the explanation here) and include in the post-response
metadata section some indication of this?

There are also 209 responses with empty bodies, but they have Content-Length: 0 and are legitimate, if odd, I guess.
]
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Sebastian Nagel

Jul 5, 2019, 9:34:59 AM
to common...@googlegroups.com
Hi Henry,

yes, the 1 MB limit is still in force and has been since 2014
(maybe even longer, but I would need to verify that). Since August
2018, WARC records have been flagged if the payload is truncated.


> Might I request an experiment for the August run which increases the cutoff to 4MB?

Yes. Thanks for the suggestion and for the statistics!
I'll check how this might affect other formats as well,
esp. video or audio streams accidentally captured.

We need to keep an eye on the storage required. And the point is that
PDFs and images are quite costly here.

E.g. the 19 million PDF files in the May 2019 crawl, although
only 0.7% of all captures, account for 7 TiB or 14% of
the total WARC storage. You'll find a detailed calculation in [1].


> more-or-less unreadable.

There is a chance that the PDF is "linear" or "web optimized", which
would allow the first pages to be read.


> [Footnote: Along with the 72 files discussed above, a further 35 have a
>
> WARC-Truncated: length
>
> header in the response prolog, and a length mismatch between the
> Content-Length given there (which is less than 1MB) and in the
> response itself

Can you give a concrete example?

Here is one truncated capture from the January 2019 crawl:

WARC/1.0
WARC-Type: response
WARC-Date: 2019-01-22T13:21:23Z
WARC-Record-ID: <urn:uuid:35771d06-8372-441a-a8cc-bc1004511359>
Content-Length: 1049171
...
WARC-Truncated: length
WARC-Identified-Payload-Type: application/pdf

HTTP/1.1 200 OK
Date: Tue, 22 Jan 2019 13:21:21 GMT
...
X-Crawler-Content-Length: 1562599
Content-Length: 1048576
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: application/force-download

%PDF-1.4...

I've run "warcio check" on the WARC file (and "warcio test")
to verify that the Content-Length is correct.
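
For captures like this one, a rough way to spot clipped payloads is to
compare the stored payload size with the X-Crawler-Content-Length header
seen above, which here carries the length originally reported by the
server (an illustrative sketch only, function name made up; not a
complete tool):

from warcio.archiveiterator import ArchiveIterator

def clipped_responses(path):
    # Yield (URL, stored size, original size) for responses whose stored
    # payload is shorter than the server-reported length.
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            original = record.http_headers.get_header('X-Crawler-Content-Length')
            if original is None:
                continue
            stored = len(record.content_stream().read())
            if stored < int(original):
                yield (record.rec_headers.get_header('WARC-Target-URI'),
                       stored, int(original))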


> Would it be possible to detect connection loss (which I'm guessing
> must be the explanation here) and include in the post-response
> metadata section some indication of this?

This has been tracked since August 2018 and is indicated as
WARC-Truncated: disconnect
Before that (since 2013), only the "length" and "time" (10 min.) limits
were flagged in the WARC files. I would need to dig into the
code to see how disconnects were handled.

Best,
Sebastian


[1]
http://netpreserve.org/ga2019/wp-content/uploads/2019/07/IIPCWAC2019-SEBASTIAN_NAGEL-Accessing_WARC_files_via_SQL-poster.pdf

Henry S. Thompson

Jul 5, 2019, 12:18:39 PM
to common...@googlegroups.com
Sebastian Nagel writes:

> HST writes...
>> Might I request an experiment for the August run which increases the
>> cutoff to 4MB?
>
> Yes. Thanks for the suggestion and for the statistics!

Thanks!

> I'll check how this might affect other formats as well,
> esp. video or audio streams accidentally captured.

There are some really big ones, but they're _well_ over 4MB.

> We need to keep an eye on the storage required. And the point is that
> PDFs and images are quite costly here.
>
> E.g. the 19 million PDF files in the May 2019 crawl, although
> only 0.7% of all captures, account for 7 TiB or 14% of
> the total WARC storage. You'll find a detailed calculation in [1].

Understood. See below.

>> more-or-less unreadable.
>
> There is a chance that the PDF is "linear" or "web optimized", which
> would allow the first pages to be read.

None in my sample :-(


>> [Footnote: Along with the 72 files discussed above, a further 35 have a
>>
>> WARC-Truncated: length
>>
>> header in the response prolog, and a length mismatch between the
>> Content-Length given there (which is less than 1MB) and in the
>> response itself
>
> Can you give a concrete example?

Will do, in a subsequent message.

>> Would it be possible to detect connection loss (which I'm guessing
>> must be the explanation here) and include in the post-response
>> metadata section some indication of this?
>
> This has been tracked since August 2018 and is indicated as
> WARC-Truncated: disconnect
> Before that (since 2013), only the "length" and "time" (10 min.) limits
> were flagged in the WARC files. I would need to dig into the
> code to see how disconnects were handled.

I'll rerun my tests on a post-2018-08 file and see what I find.

Stand by ...

ht

Henry S. Thompson

Jul 10, 2019, 10:36:24 AM
to common...@googlegroups.com
Henry S. Thompson writes:

> Sebastian Nagel writes:
>> ...
>> Can you give a concrete example?
>
> Will do, in a subsequent message.
>
>>> Would it be possible to detect connection loss (which I'm guessing
>>> must be the explanation here) and include in the post-response
>>> metadata section some indication of this?
>>
>> This has been tracked since August 2018 and is indicated as
>> WARC-Truncated: disconnect
>> Before that (since 2013), only the "length" and "time" (10 min.) limits
>> were flagged in the WARC files. I would need to dig into the
>> code to see how disconnects were handled.
>
> I'll rerun my tests on a post-2018-08 file and see what I find.

OK, I've now extracted comparable information from a WARC file from
2018-10:

Compared to the one I looked at from 2015-10, it's

* Smaller (1.10 vs 1.20 GB compressed (92%), 4.54 vs 5.42 GB raw (84%))
* Shorter (54829 responses vs. 71016 (77%))

The 1MB limit has affected 107 responses (cf. 72 in 2015-10, 149%), and a
further 49 responses were truncated because of a disconnect.

After further investigation, I see that in fact I had substantially
under-reported the cases of WARC-Truncated in 2015-10, which should have
been:

27224 WARC-Truncated: length
6 WARC-Truncated: time

I think the explanation is that in 2015-10, WARC-Truncated: length
reflects two quite different situations:
1) The 1MB limit kicked in (72 cases);

2) There was no Content-Length header in the HTTP response, and the
   connection simply closed after the body (the other 27152 cases). My
   reporting software only logged cases where the actual response
   length differed from the length computed from the Content-Length
   _in_ the WARC response header (the "further 35" reported above).
   The similarity of the relevant ratios (35 out of 107 = 33%, 49 out
   of 156 = 31%) suggests that, as hypothesised, these were all cases
   of premature disconnect.

So, net-net, the post-2018-08 reporting changes give a more accurate
picture of response length and its origins and nature than theretofore.

Returning to the potential impact of changing to a 4MB limit, here
are numbers and histograms for the 2018-10 file, comparable to the data
in my earlier message for 2015-10:

The 1MB limit affects 103* responses: if they were not truncated this
would add 608,828,559 bytes (about twice as much as in the 2015-10 case
in absolute terms), adding 12.5% (more like three times as much in
proportional terms).

The (pre-truncation) Content-Length distribution for the bodies of those
103 responses looks like this:

(80 truncated files, length histogram with 15 bins of width 200,000):

1,100,000 11 ***********
1,300,000 12 ************
1,500,000 13 *************
1,700,000 10 **********
1,900,000 7 *******
2,100,000 4 ****
2,300,000 6 ******
2,500,000 6 ******
2,700,000 2 **
2,900,000 1 *
3,100,000 1 *
3,300,000 3 ***
3,500,000 1 *
3,700,000 2 **
3,900,000 1 *

The remaining 23 cover a much larger range than the corresponding 19
from 2015-10:

(23 larger truncated files, length histogram with 21 bins of width 10,000,000):

5,000,000 13 *************
15,000,000 4 ****
25,000,000 2 **
35,000,000 0
45,000,000 2 **
55,000,000 0
65,000,000 0
75,000,000 0
85,000,000 1 *
95,000,000 0
105,000,000 0
115,000,000 0
125,000,000 0
135,000,000 0
145,000,000 0
155,000,000 0
165,000,000 0
175,000,000 0
185,000,000 0
195,000,000 0
205,000,000 1 *

If we look only at the application/pdf files, we get this

(82 truncated pdf files, length histogram with 19 bins of width 1,000,000):

1,500,000 41 *****************************************
2,500,000 18 ******************
3,500,000 5 *****
4,500,000 3 ***
5,500,000 4 ****
6,500,000 1 *
7,500,000 2 **
8,500,000 2 **
9,500,000 0
10,500,000 1 *
11,500,000 0
12,500,000 0
13,500,000 0
14,500,000 0
15,500,000 1 *
16,500,000 0
17,500,000 0
18,500,000 0
19,500,000 1 *
>20,000,000 3 (23MB, 26MB, 41MB)

Increasing the 1MB limit to 4MB would let 64 of these 82 through, along
with 16 of other media types.

The truncated lengths of the 103 files currently being clipped, now
exactly 103MB, account for about 2.2% of the total of 4,876,729,301.

If the cutoff was moved from 1MB to 4MB, the lengths of those 103 files
in the WARC would increase by about 133MB, adding 2.9% to the total.
This compares to 104MB, 1.9% of the total, for my earlier sample from
2015-10.

So, assuming the more recent WARC file I looked at is not wildly
misleading for the month's crawl as a whole, this gives us a reasonable
estimate of the impact of shifting the cutoff to 4MB.

ht

*In 103 cases my reporting software detected a length mismatch. These
coincided with a WARC-Truncated: length header in 98 cases; the other 5
did not have such a header. I'll leave details of those 5 cases, as
well as the 9 cases (107 - 98) where there _was_ a Truncated header but
no length mismatch, to another email.

Sebastian Nagel

Jul 19, 2019, 11:47:34 AM
to common...@googlegroups.com
Hi Henry,

thanks again for the detailed calculations.

I've started to reproduce them, see
https://github.com/commoncrawl/cc-pyspark/pull/9
https://github.com/sebastian-nagel/cc-notebooks-var/blob/master/warc-truncation-stats.ipynb
That should also make it easier to look for further issues regarding the
marking of truncated records.


> Increasing the 1MB limit to 4MB would let 64 of these 82 through, along
> with 16 of other media types.
>
> The truncated lengths of the 103 files currently being clipped, now
> exactly 103MB, account for about 2.2% of the total of 4,876,729,301.
>
> If the cutoff was moved from 1MB to 4MB, the lengths of those 103 files
> in the WARC would increase by about 133MB, adding 2.9% to the total.

Adding 133 MB to 4.5 GB doesn't sound like a lot. However, the relevant metric
is how many MB we would add to the gzip-compressed WARC file:
- HTML pages occupy 10-20% of their original size when compressed
- PDFs about 90%
- images, videos, zip archives, etc. just stay at their original size

If the 90% compression ratio also holds for the now longer PDF documents,
this would add more than 100 MB per WARC file just for those 64 PDF files.
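
As a back-of-the-envelope check (the ratio is an assumption carried over
from the current PDF records, not a measurement on the longer documents):

added_uncompressed_mb = 133   # extra payload if the cutoff were 4 MB
pdf_compression_ratio = 0.90  # PDFs keep roughly 90% of their size under gzip

added_compressed_mb = added_uncompressed_mb * pdf_compression_ratio
print(f"~{added_compressed_mb:.0f} MB extra per compressed WARC file")
# prints: ~120 MB extra per compressed WARC file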

However, a 100 MB WARC file can store thousands of HTML pages.
Common Crawl's primary mission was to provide a broad collection
of web pages, and I still believe that's what most users are
interested in.

In short: we will not raise the content limit for now.
We may in the future, but then in a smaller step, mostly to catch
up with the increased average size of HTML pages.

But after a couple of discussions we came up with this proposal:
what about a data set separate from the main crawls, dedicated
to PDFs only (maybe including office documents and similar)?
Are you or anybody else interested? The challenge is less about
crawling and packaging the content (same as usual) than about how
to select a representative sample. Let me know what you think.
Thanks!

In addition, I'll try to fix all the issues around the marking
of truncated records and possibly also add this information to the
URL indexes.


> After further investigation, I see that in fact I had substantially
> under-reported the cases of WARC-Truncated in 2015-10, which should have
> been:
>
> 27224 WARC-Truncated: length
> 6 WARC-Truncated: time

After a closer look into the HTTP protocol implementations used
from 2013 until July 2018 [1]:
- most of the records marked as `WARC-Truncated: length` are
false positives, i.e. not truncated at all. This happens if no
`Content-Length` is sent in the HTTP response which seems to
apply to 30-40% of all responses in 2015
- with compressed content encoding: truncations are marked only if
the compressed size already exceeds the content limit

I'm sorry about this, but the only way to look for real truncations by length
is to check for a payload length of exactly 1 MiB.
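
In code, that check is something along these lines (a rough sketch only,
function name made up; it counts response records whose stored payload is
exactly 2^20 bytes, the crawler's content limit):

from warcio.archiveiterator import ArchiveIterator

LIMIT = 1024 * 1024  # 1 MiB content limit used by the crawler

def count_length_truncations(path):
    # Count responses whose stored payload hits the limit exactly.
    n = 0
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                payload = record.content_stream().read()
                if len(payload) == LIMIT:
                    n += 1
    return n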


Best,
Sebastian


[1]
https://github.com/commoncrawl/nutch/blob/cc-1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

Tom Morris

Jul 19, 2019, 3:22:45 PM
to common...@googlegroups.com
On Fri, Jul 19, 2019 at 11:47 AM Sebastian Nagel <seba...@commoncrawl.org> wrote:

> In short: we will not raise the content limit for now.
> We may in the future, but then in a smaller step, mostly to catch
> up with the increased average size of HTML pages.
>
> But after a couple of discussions we came up with this proposal:
> what about a data set separate from the main crawls, dedicated
> to PDFs only (maybe including office documents and similar)?
> Are you or anybody else interested? The challenge is less about
> crawling and packaging the content (same as usual) than about how
> to select a representative sample.

It's probably worth pulling this question out into its own thread if you're interested in getting wider feedback on it. It's kind of buried in the current thread.

Tom 

Henry S. Thompson

Aug 7, 2019, 5:59:45 AM
to common...@googlegroups.com
Sebastian Nagel writes:

> ...
>
> In short: we will not raise the content limit for now.
> We may in the future, but then in a smaller step, mostly to catch
> up with the increased average size of HTML pages.

Understood.

> But after a couple of discussions we came up with this proposal:
> what about a data set separate from the main crawls, dedicated
> to PDFs only (maybe including office documents and similar)?
> Are you or anybody else interested? The challenge is less about
> crawling and packaging the content (same as usual) than about how
> to select a representative sample. Let me know what you think.

I think this is a great idea; it would suit my needs very well.
Sampling is indeed the question... I assume that at the moment the
percentage of non-HTML in the crawl is pretty much the same as the
percentage in the seed list?

> In addition, I'll try to fix all the issues around the marking
> of truncated records and possibly also add this information to the
> URL indexes.

Great.

>> After further investigation, I see that in fact I had substantially
>> under-reported the cases of WARC-Truncated in 2015-10, which should have
>> been:
>>
>> 27224 WARC-Truncated: length
>> 6 WARC-Truncated: time

> After a closer look into the HTTP protocol implementations used
> from 2013 until July 2018 [1]:
> - most of the records marked as `WARC-Truncated: length` are
> false positives, i.e. not truncated at all. This happens if no
> `Content-Length` is sent in the HTTP response which seems to
> apply to 30-40% of all responses in 2015
> - with compressed content encoding: truncations are marked only if
> the compressed size already exceeds the content limit
>
> I'm sorry about this, but the only way to look for real truncations by length
> is to check for a payload length of exactly 1 MiB.

Understood.

ht

Sebastian Nagel

Oct 8, 2019, 3:44:38 AM
to common...@googlegroups.com
Hi Henry, hi Tom,

> It's probably worth pulling this question out into its own thread if you're interested
> in getting wider feedback on it. It's kind of buried in the current thread.

I'll soon start a new thread. Sorry for the delay.

> I think this is a great idea, it would suit my needs very well.

Perfect.

> Sampling is indeed the question...

At least, there is enough data to take the sample from - more than 100 million URLs
of PDF documents are known.
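
Drawing a uniform random sample of, say, a few million URLs from that set
is the easy part, e.g. with reservoir sampling over the URL list (a sketch
below; the function name and seed are illustrative). The harder question
is what "representative" should mean: by domain, language, document size,
and so on.

import random

def reservoir_sample(urls, k, seed=42):
    # Keep a uniform random sample of k items from a stream of unknown
    # length without holding the whole list in memory (Algorithm R).
    rng = random.Random(seed)
    sample = []
    for i, url in enumerate(urls):
        if i < k:
            sample.append(url)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = url
    return sample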

> I assume that at the moment the percentage of non-HTML in the crawl is pretty much the
> same as the percentage in the seed list?

Yes, it should be approximately the same. However, for seed-list URLs that have
not been visited before, we do not know the MIME type.


But let's move the discussion to the new thread!

Best,
Sebastian