Format error in WARC files of August 2018 crawl


Sebastian Nagel

Sep 17, 2018, 11:07:20 AM
to Common Crawl
Hi all,

the WARC files of the last monthly crawl (August 2018, CC-MAIN-2018-34)
contain a redundant empty line between the HTTP headers and the payload
of WARC response records. This extra line may cause the following
problems when processing the WARC files:

- because WARC readers/parsers assume only a single empty line,
the extracted payload content starts with "\r\n".

While leading newlines are usually ignored by HTML processors,
document parsers for binary formats (PDF, office documents, etc.)
are likely to fail.

- the length of the payload in the optional HTTP "Content-Length" header
is off by 2. This may also cause WARC processors to fail.
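
A possible workaround on the reader side, as a minimal sketch: split the
raw HTTP block of a response record at the first empty line and drop one
additional leading "\r\n" from the payload. The function name
`split_http_block` and the flag `drop_extra_crlf` are hypothetical and
not part of any WARC library. The sketch assumes the block is available
as bytes and that it comes from the affected crawl (CC-MAIN-2018-34).

```python
from typing import Tuple

CRLF = b"\r\n"


def split_http_block(block: bytes, drop_extra_crlf: bool = True) -> Tuple[bytes, bytes]:
    """Split a raw HTTP response block into (headers, payload).

    Hypothetical helper: if drop_extra_crlf is set, one redundant CRLF
    directly after the header/payload separator is removed.
    """
    # The first empty line (CRLF CRLF) ends the HTTP headers.
    headers, sep, payload = block.partition(CRLF + CRLF)
    if not sep:
        raise ValueError("no empty line between HTTP headers and payload")
    # Assumption: the record stems from the affected crawl, so a leading
    # CRLF in the payload is the redundant empty line, not real content.
    if drop_extra_crlf and payload.startswith(CRLF):
        payload = payload[len(CRLF):]
    return headers, payload


if __name__ == "__main__":
    raw = b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n\r\n%PDF-1.4"
    headers, payload = split_http_block(raw)
    print(payload[:8])
```

Note that this would also strip a "\r\n" that legitimately starts a
payload, so it should only be applied to files of the August 2018 crawl.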

Luckily, Greg Lindahl detected the bug right before the start of the
September crawl. It is fixed for the upcoming crawl; more information
can be found at https://github.com/commoncrawl/nutch/issues/5

As of today, we have not decided whether we will fix the WARC files
of the August crawl later.

Apologies for this bug!

Best,
Sebastian