This is embarrassingly basic...
I'm trying to get the Content-Length value from the original HTTP headers for truncated files. I thought that if I did something like the following with jwarc:
Optional<String> httpContentLengthString = ((WarcResponse) record).headers()
        .first("Content-Length");
long httpContentLength = -1;
if (httpContentLengthString.isPresent()) {
    try {
        httpContentLength = Long.parseLong(httpContentLengthString.get());
    } catch (NumberFormatException e) {
        // leave httpContentLength at -1 if the header value isn't numeric
    }
}
I'd get back whatever the HTTP header reported as the Content-Length. However, in the few files I've manually spot-checked, the number stored in this header field in the WARC is not the Content-Length I see when I HEAD the site.
The Content-Length reported by a HEAD request is: 2,581,792
However, the value stored in the WARC in CC-MAIN-2021-31 is: 1,048,966
The PDF file itself is actually 2,581,792 bytes.
Is this expected behavior? Or is the CC crawler requesting only the first MB or so, so it is seeing "Content-Length: 1048966"?
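(For what it's worth, the stored value is only a few hundred bytes over 1 MiB = 2^20 bytes, which is part of what makes me suspect a roughly 1 MB fetch limit. A quick sanity check on the numbers:)

```java
public class TruncationCheck {
    public static void main(String[] args) {
        long oneMiB = 1L << 20;            // 1,048,576 bytes
        long warcValue = 1_048_966L;       // Content-Length stored in the WARC
        long headValue = 2_581_792L;       // Content-Length from a live HEAD

        System.out.println(warcValue - oneMiB);    // bytes over the 1 MiB mark: 390
        System.out.println(headValue - warcValue); // bytes missing if the body was truncated
    }
}
```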
Is there any way to extract from the index files or the WARC files what the header said the full length of the body would be?
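(For reference, here's the parsing I'm doing, factored into a small self-contained helper so it's easy to test; the class and method names are just mine, not jwarc's:)

```java
import java.util.Optional;

public class HeaderUtil {
    // Parse an Optional Content-Length header value; return -1 if absent or malformed.
    static long parseContentLength(Optional<String> value) {
        if (!value.isPresent()) {
            return -1;
        }
        try {
            return Long.parseLong(value.get().trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseContentLength(Optional.of("1048966"))); // 1048966
        System.out.println(parseContentLength(Optional.empty()));       // -1
        System.out.println(parseContentLength(Optional.of("oops")));    // -1
    }
}
```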
Again, many, many thanks!
Best,
Tim
The WARC file is stored in: crawl-data/CC-MAIN-2021-31/segments/1627046151641.83/warc/CC-MAIN-20210725080735-20210725110735-00148.warc.gz
offset: 635585666
length: 961412
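(In case it's useful for reproducing, this is how I'm building the byte-range request for that record. HTTP Range ends are inclusive, so the last byte is offset + length - 1; the data.commoncrawl.org download host is my assumption here:)

```java
public class RangeRequest {
    public static void main(String[] args) {
        String path = "crawl-data/CC-MAIN-2021-31/segments/1627046151641.83/warc/"
                + "CC-MAIN-20210725080735-20210725110735-00148.warc.gz";
        long offset = 635_585_666L; // byte offset of the record within the .warc.gz
        long length = 961_412L;     // compressed length of the record

        // HTTP byte ranges are inclusive on both ends.
        String range = "bytes=" + offset + "-" + (offset + length - 1);

        System.out.println("https://data.commoncrawl.org/" + path); // host is my assumption
        System.out.println("Range: " + range);
    }
}
```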