Common Crawl saves gzipped bodies in extracted form


jr...@lastline.com

Feb 10, 2016, 6:01:20 AM
to Common Crawl

Hi Common Crawlers,


I have downloaded a WARC file from the Common Crawl data. This file contains several gzipped responses that are stored as plaintext (i.e., decompressed, without the gzip encoding).

I used warctools from the Internet Archive to extract the responses from the WARC file. However, this tool expects the Content-Length header to match the actual length of the body in the WARC (see the issue on GitHub). The Internet Archive warctools is a more up-to-date version of the Hanzo warctools recommended on the Common Crawl website under "Processing the file format" (an outdated link, by the way).

After reading the official WARC draft, I could not work out how gzipped content is supposed to be stored. In any case, multiple WARC parsers will probably have an issue with this.
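
To illustrate, here is a rough stdlib-only sketch (independent of warctools) of the mismatch; it assumes the WARC file itself has already been un-gzipped, and the file name is a placeholder:

def iter_warc_records(path):
    # Walk the records of an uncompressed WARC file.
    with open(path, 'rb') as f:
        while True:
            line = f.readline()
            if not line:
                return  # end of file
            if line.strip() == b'':
                continue  # skip the blank lines separating records
            if not line.startswith(b'WARC/'):
                raise ValueError('expected WARC version line, got %r' % line)
            headers = {}
            while True:
                line = f.readline()
                if line.strip() == b'':
                    break  # a blank line ends the WARC header block
                name, _, value = line.partition(b':')
                headers[name.strip().lower()] = value.strip()
            # The record-level Content-Length is authoritative for the block.
            block = f.read(int(headers[b'content-length']))
            yield headers, block

def http_length_mismatch(block):
    # Compare the embedded HTTP Content-Length with the stored body length.
    head, sep, body = block.partition(b'\r\n\r\n')
    if not sep:
        return None
    for line in head.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-length':
            declared = int(value.strip())
            if declared != len(body):
                return declared, len(body)
    return None

for headers, block in iter_warc_records('example.warc'):  # placeholder name
    if headers.get(b'warc-type') == b'response':
        mismatch = http_length_mismatch(block)
        if mismatch:
            print('HTTP header says %d bytes, stored body is %d' % mismatch)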

It would be nice to know whether you consider this a bug and plan to fix it, and whether it is a major issue affecting most WARC files in the Common Crawl data or only a small part.


Regards


Joris

Julien Nioche

Feb 10, 2016, 9:39:28 AM
to common...@googlegroups.com
Hi Joris


The crawler stored the original response headers, including the Content-Length, even though Nutch stores the content decompressed. A fix has been committed and will be used in the next release of CC. Thanks for reporting it!
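
Roughly speaking (an illustration only, not the actual patch), the fix amounts to rewriting the stored HTTP headers so they describe the decompressed body before each record is written:

def fix_response_headers(head, body):
    # Illustrative only; not the actual Nutch patch.
    # head: raw HTTP status line + headers (bytes), body: decompressed bytes.
    # Drop the headers that described the on-the-wire form and set a
    # Content-Length that matches what is actually stored in the WARC.
    dropped = (b'content-length', b'content-encoding', b'transfer-encoding')
    out = []
    for i, line in enumerate(head.split(b'\r\n')):
        name = line.partition(b':')[0].strip().lower()
        if i > 0 and name in dropped:
            continue
        out.append(line)
    out.append(b'Content-Length: ' + str(len(body)).encode('ascii'))
    return b'\r\n'.join(out)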

Julien


Tom Morris

Feb 10, 2016, 10:27:49 AM
to common...@googlegroups.com
On Wed, Feb 10, 2016 at 6:01 AM, <jr...@lastline.com> wrote:

> ... and whether it is a major issue affecting most WARC files in the Common Crawl data or only a small part.

Looks like this affects crawls since Nov. 2013.

Tom 

Tom Morris

Feb 10, 2016, 11:40:58 AM
to common...@googlegroups.com
Very speedy fix! I'm wondering if the polarity is correct, though. This comment on the warctools repo seems to imply that, rather than making the header match the decompressed body, the compressed body should instead be stored in the WARC. I haven't looked at the spec, but that makes sense to me from a historical-fidelity point of view, although it will make for more work for consumers.

Tom

Julien Nioche

Feb 10, 2016, 11:45:34 AM
to common...@googlegroups.com
Hi Tom

> the compressed body should be stored in the WARC

That's not how Nutch deals with the content; it decompresses it prior to storing it. That could be changed, of course, but only after checking that it has no impact on anything else. In the meantime, the patch I committed earlier preserves the existing behavior.



Tom Morris

Feb 10, 2016, 11:54:15 AM
to common...@googlegroups.com
On Wed, Feb 10, 2016 at 11:45 AM, Julien Nioche <lists.dig...@gmail.com> wrote:

>> the compressed body should be stored in the WARC
>
> That's not how Nutch deals with the content; it decompresses it prior to storing it. That could be changed, of course, but only after checking that it has no impact on anything else. In the meantime, the patch I committed earlier preserves the existing behavior.

Actually, after checking the current draft of the spec (http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf), which I should have done instead of relying on the commenter in the warctools issue, it's clear that the body is to be stored with the transfer-encoding removed, so the current behavior is correct.
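
In other words, a consumer should take the body as everything after the blank line that ends the embedded HTTP headers, bounded by the WARC record's own Content-Length rather than the HTTP one. A rough sketch:

def response_body(block):
    # block: the full HTTP response block of a WARC response record.
    # The payload is stored with the transfer encoding removed, so ignore
    # the embedded Content-Length and split on the header/body gap.
    head, _, body = block.partition(b'\r\n\r\n')
    return body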

Tom

jr...@lastline.com

Feb 11, 2016, 5:11:10 AM
to Common Crawl
Thanks for the fast help.
There is also an issue on GitHub where the exact specification regarding Transfer-Encoding is being discussed.