We have a number of WARC files that the Hanzo tools reject with a
gzip error, but that other tools read without complaint. Here is one of
them (97M; unfortunately I don't have any smaller examples):
http://tmpaidata002.us.archive.org/ARCHIVEIT-1299-20090422221545-00420-crawling10.us.archive.org.warc.gz
$ warcvalidator -v -t /tmp/ -f \
    ARCHIVEIT-1299-20090422221545-00420-crawling10.us.archive.org.warc.gz
error when uncompressing data at offset 89617045 (Gzip error number: -5)
> debug point: caller<lib/private/wfile.c:WFile_nextRecordGzipCompressed:641>"unable to read gzipped record"
invalid
But "zcat ... > /dev/null" does not complain for me, and neither does
the Heritrix WARC reader, invoked like so:
$ HERITRIX_OUT=/dev/stdout CLASS_MAIN=org.archive.io.warc.WARCReader \
    $HERITRIX_HOME/bin/foreground_heritrix --strict \
    ARCHIVEIT-1299-20090422221545-00420-crawling10.us.archive.org.warc.gz \
    > /dev/null
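
To narrow down what is unusual at that offset, it may help to walk the file one gzip member at a time and see where the members start and end. The sketch below is only a diagnostic aid and makes no claim about how warcvalidator or the Heritrix reader process the file; the function name scan_gzip_members is made up for illustration. If the reported "Gzip error number" is a raw zlib return code, then -5 would be Z_BUF_ERROR, but that is an assumption about the tool's output.

```python
import zlib


def scan_gzip_members(path, chunk_size=1 << 20):
    """Walk a file of concatenated gzip members (hypothetical helper).

    Returns (members, error): members is a list of
    (offset, compressed_length, uncompressed_length) tuples, and error is
    None or (offset_of_failing_member, message).
    """
    members = []
    with open(path, "rb") as f:
        offset = 0
        buf = f.read(chunk_size)
        while buf:
            # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
            d = zlib.decompressobj(16 + zlib.MAX_WBITS)
            consumed = 0
            size = 0
            try:
                while not d.eof:
                    if not buf:
                        buf = f.read(chunk_size)
                        if not buf:
                            raise zlib.error("truncated gzip member")
                    size += len(d.decompress(buf))
                    # Bytes past the end of this member land in unused_data.
                    consumed += len(buf) - len(d.unused_data)
                    buf = d.unused_data
            except zlib.error as exc:
                return members, (offset, str(exc))
            members.append((offset, consumed, size))
            offset += consumed
            if not buf:
                buf = f.read(chunk_size)
    return members, None
```

Calling scan_gzip_members() on the file above and comparing the member boundaries around offset 89617045 with the offset in the warcvalidator message should show whether that offset is a member start, lies inside a member, or is followed by bytes that are not a valid gzip header.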