gzip error on valid(?) warcs


Noah Levitt

Jul 28, 2009, 7:08:31 PM
to warc-...@googlegroups.com, nle...@archive.org
Hello hanzo warc-tools,

We have a number of warc files that the hanzo tools choke on with a
gzip error, but that work with other tools. Here is one of them (97M,
unfortunately I don't have any small examples):
http://tmpaidata002.us.archive.org/ARCHIVEIT-1299-20090422221545-00420-crawling10.us.archive.org.warc.gz

$ warcvalidator -v -t /tmp/ -f \
    ARCHIVEIT-1299-20090422221545-00420-crawling10.us.archive.org.warc.gz
error when uncompressing data at offset 89617045 (Gzip error number: -5)
> debug point: caller<lib/private/wfile.c:WFile_nextRecordGzipCompressed:641>"unable to read gzipped record"
invalid

But "zcat ... > /dev/null" does not complain for me, nor does the
heritrix warc reader, called like so:

$ HERITRIX_OUT=/dev/stdout CLASS_MAIN=org.archive.io.warc.WARCReader \
    $HERITRIX_HOME/bin/foreground_heritrix --strict \
    ARCHIVEIT-1299-20090422221545-00420-crawling10.us.archive.org.warc.gz > /dev/null
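
In case it helps narrow things down: error number -5 is zlib's Z_BUF_ERROR, so
it may be useful to see how the gzip member boundaries line up with the
reported offset (89617045). Below is a minimal Python sketch that walks the
file one gzip member at a time and reports where each member starts. It
assumes the WARC is written as one gzip member per record with no padding
between members; the function name scan_gzip_members is made up for
illustration, and this is not how warcvalidator itself reads the file.

```python
import zlib


def scan_gzip_members(path, chunk_size=64 * 1024):
    """Yield (start_offset, uncompressed_length) for each gzip member in path.

    Hypothetical helper: assumes members follow each other directly,
    with no padding bytes in between.
    """
    with open(path, "rb") as f:
        offset = 0      # file offset where the current member starts
        pending = b""   # bytes already read that belong to the next member
        while True:
            if not pending:
                pending = f.read(chunk_size)
                if not pending:
                    return  # clean end of file on a member boundary
            # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
            d = zlib.decompressobj(16 + zlib.MAX_WBITS)
            consumed = 0
            length = 0
            data = pending
            while True:
                try:
                    length += len(d.decompress(data))
                except zlib.error as e:
                    raise ValueError(
                        f"bad gzip member starting at offset {offset}: {e}"
                    ) from e
                if d.eof:
                    # Leftover bytes belong to the next member.
                    consumed += len(data) - len(d.unused_data)
                    pending = d.unused_data
                    break
                consumed += len(data)
                data = f.read(chunk_size)
                if not data:
                    raise ValueError(
                        f"truncated gzip member starting at offset {offset}"
                    )
            yield offset, length
            offset += consumed
```

Looping over scan_gzip_members("ARCHIVEIT-1299-20090422221545-00420-crawling10.us.archive.org.warc.gz")
and printing the offsets near 89617045 would show whether that offset is a
member boundary or falls inside a member.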

Noah
nle...@archive.org
