Hi Common Crawlers,
I have downloaded a WARC file from the Common Crawl data. It contains several responses that were served gzip-compressed but are stored as plaintext (without the gzip encoding).
I used warctools from the Internet Archive to extract the responses from the WARC file. However, this tool expects the Content-Length field to match the actual length of the body in the WARC (see the issue on GitHub). This warctools is a more up-to-date version of the hanzo warctools, which is recommended on the Common Crawl website under "Processing the file format" (outdated link, by the way).
After reading the official WARC draft, I could not find out how gzipped content is supposed to be stored. However, several WARC parsers will probably have an issue with this.
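To make the mismatch concrete, here is a minimal Python sketch that checks a stored HTTP response for it. It is not the warctools code; the helper names split_http_response and check_payload are hypothetical, and the sample payload is built by hand on the assumption that the headers still describe the gzipped bytes while the body is stored decompressed.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def split_http_response(payload: bytes):
    """Split a raw HTTP response into (lower-cased header dict, body bytes)."""
    head, _, body = payload.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("ascii")] = value.strip().decode("latin-1")
    return headers, body


def check_payload(payload: bytes) -> dict:
    """Compare what the HTTP headers claim with what is actually stored."""
    headers, body = split_http_response(payload)
    declared = headers.get("content-length")
    declared_length = int(declared) if declared is not None and declared.isdigit() else None
    return {
        "declared_length": declared_length,
        "actual_length": len(body),
        "length_mismatch": declared_length is not None and declared_length != len(body),
        "claims_gzip": "gzip" in headers.get("content-encoding", "").lower(),
        "body_is_gzip": body.startswith(GZIP_MAGIC),
    }


if __name__ == "__main__":
    text = b"<html>" + b"hello " * 100 + b"</html>"
    compressed = gzip.compress(text)
    # Assumed situation: the headers describe the gzipped bytes,
    # but the body that follows is the decompressed text.
    payload = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Encoding: gzip\r\n"
        b"Content-Length: " + str(len(compressed)).encode("ascii") + b"\r\n"
        b"\r\n" + text
    )
    print(check_payload(payload))
```

A parser that trusts Content-Length would read too few bytes from such a record, and one that trusts Content-Encoding would try to gunzip plaintext.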
It would be nice to know whether you consider this a bug and plan to fix it, and whether it is a major issue that affects most WARC files in the Common Crawl data or only a small part.
Regards
Joris
--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at https://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.
> ... and whether this is a major issue which concerns most WARC files of the Common Crawl data or only a small part.
> the compressed body should be stored in the WARC
That's not how Nutch deals with the content: it decompresses it prior to storing it. This could be changed, of course, but only after checking that it has no impact on anything else. In the meantime, the patch I committed earlier preserves the existing behavior.