content-length?


Tim Allison

Aug 31, 2021, 3:45:22 PM
to Common Crawl
This is embarrassingly basic...

I'm trying to get the Content-Length value from the original HTTP headers for truncated files.  I thought that if I did something like the following with jwarc:

Optional<String> httpContentLengthString =
        ((WarcResponse) record).headers().first("Content-Length");

long httpContentLength = -1;
if (httpContentLengthString.isPresent()) {
    try {
        httpContentLength = Long.parseLong(httpContentLengthString.get());
    } catch (NumberFormatException e) {
        // ignore unparsable values
    }
}

I'd get what the HTTP header returned as the content-length.  However, in the few files I've manually spot-checked, the number stored in this header field in the WARC is not the Content-Length I see when I HEAD the site.


The Content-Length when I HEAD the site is: 2,581,792

However, the value stored in the WARC in CC-MAIN-2021-31 is: 1048966

The PDF file is actually 2,581,792 bytes.

Is this expected behavior?  Or is the CC crawler requesting only the first MB, so it is seeing "content-length: 1,048,966"?

Is there any way to extract from the index files or the WARC files what the header said the full length of the body would be?

Again, many, many thanks!

  Best,
     Tim

The WARC file is stored in: crawl-data/CC-MAIN-2021-31/segments/1627046151641.83/warc/CC-MAIN-20210725080735-20210725110735-00148.warc.gz

offset: 635585666
length: 961412

Tim Allison

Aug 31, 2021, 4:02:00 PM
to Common Crawl
Yep...embarrassing... 

 ((WarcResponse) record).http().headers().first("Content-Length");

Please disregard...
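[Editor's note] For anyone hitting the same thing: record.headers() gives the WARC record headers, while record.http().headers() gives the archived HTTP response headers. The defensive parse step from the snippet above can be pulled out into a small stand-alone helper (a sketch; the class and method names here are mine, not jwarc's):

```java
import java.util.Optional;

public class HeaderLength {
    // Parse an HTTP Content-Length value defensively:
    // return -1 when the header is absent or not a valid long.
    static long parseContentLength(Optional<String> value) {
        if (value.isEmpty()) {
            return -1L;
        }
        try {
            return Long.parseLong(value.get().trim());
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseContentLength(Optional.of("1048576"))); // 1048576
        System.out.println(parseContentLength(Optional.empty()));       // -1
    }
}
```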

Tim Allison

Aug 31, 2021, 4:04:52 PM
to Common Crawl
To confirm, though: for the truncated files, the HTTP header returned 1048576, which I'm guessing reflects what the CC crawler requested, not the size of the full payload...

Sebastian Nagel

Aug 31, 2021, 4:28:04 PM
to common...@googlegroups.com
Hi Tim,

yes, the content is truncated to 1 MiB. There's a long discussion about this at
https://groups.google.com/g/common-crawl/c/JJW6fv1rUQw/m/xUz7E7__BgAJ

If a page/document is truncated, it's flagged in the WARC record header (since Nov 2019):
  WARC-Truncated: length
There are other possible truncation reasons: time (timeout) and disconnect (network error).
The truncation is also marked in the indexes.

The original "Content-Length" HTTP header is rewritten to
  X-Crawler-Content-Length: 2581792
and a new one with the truncated length is added:
  Content-Length: 1048576
so that WARC parsers don't choke on a wrong payload length.
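[Editor's note] Based on that rewriting scheme, recovering the original length from the archived headers can be sketched like this (the header-map shape and helper name are my own, for illustration; with jwarc you would fetch each value via response.http().headers().first(...)):

```java
import java.util.Map;

public class OriginalLength {
    // Common Crawl moves the origin server's Content-Length into
    // X-Crawler-Content-Length and stores the (possibly truncated)
    // payload length in Content-Length. Prefer the crawler header
    // when present; fall back to Content-Length; return -1 otherwise.
    static long originalContentLength(Map<String, String> headers) {
        String value = headers.getOrDefault("X-Crawler-Content-Length",
                headers.get("Content-Length"));
        if (value == null) {
            return -1L;
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = Map.of(
                "X-Crawler-Content-Length", "2581792",
                "Content-Length", "1048576");
        System.out.println(originalContentLength(headers)); // 2581792
    }
}
```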

Best,
Sebastian

Tim Allison

Sep 2, 2021, 1:48:51 PM
to Common Crawl
Thank you!