Hi List,
After some debugging, everything seems to be OK in "warc-tools", and
rerunning Steve's experiments on my machine went very well. I didn't
notice any difference between the Heritrix "arcreader" output and the
warc-tools "app/wrecordbody" output.
I suggest you follow the steps below so that we all share the same
understanding of the reported issue.
First of all, please ensure that you're using the latest version of
"warc-tools":
$ svn checkout http://warc-tools.googlecode.com/svn/trunk/ warc-tools-read-only
$ cd warc-tools-read-only && make
Also, install Java (1.5.x or above) and download Heritrix (I'm using
version 1.14.1) to get the "bin/arcreader" utility ready to run.
Next, download Steve's suspicious ARC file from here:
$ wget http://home.us.archive.org/~steve/data/arcs/IAH-20080909203837-00001-takomaki.local.arc.gz
So, let's go:
(1) Rename the ARC file (a shorter name fits better within this
email):
$ mv IAH-20080909203837-00001-takomaki.local.arc.gz takomaki.local.arc.gz
(2) List the last 2 ARC records (URL, offset, length) and extract them
with Heritrix arcreader:
$ /heritrix-1.14.1/bin/arcreader takomaki.local.arc.gz | tail -2 | awk '{print $3 " " $7 " " $8}'
http://crawler.archive.org/xref/org/archive/queue/QueueCat.html 3377662 8044
http://crawler.archive.org/xref/org/archive/crawler/frontier/CostAssignmentPolicy.html 3380345 4366
$ /heritrix-1.14.1/bin/arcreader -f dump -o 3377662 takomaki.local.arc.gz > /tmp/QueueCat.html-arcreader.html
$ /heritrix-1.14.1/bin/arcreader -f dump -o 3380345 takomaki.local.arc.gz > /tmp/CostAssignmentPolicy-arcreader.html
$ ls -la /tmp/*arcreader.html
-rw-r--r-- 1 younes wheel 4366 Oct 3 14:40 /tmp/CostAssignmentPolicy-arcreader.html
-rw-r--r-- 1 younes wheel 8044 Oct 3 14:41 /tmp/QueueCat.html-arcreader.html
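If you want to confirm the sizes without reading the ls output by eye,
here is a minimal sketch. Note that check_size is a hypothetical helper
name of mine, not part of Heritrix or warc-tools; it only compares the
byte count of a file against the length shown in the third column of
the listing above.

```shell
# check_size FILE EXPECTED_BYTES  (hypothetical helper, not a warc-tools command)
# Succeeds when FILE is exactly EXPECTED_BYTES long.
check_size() {
    # wc -c may pad its output with spaces on some systems, so strip them
    actual=$(wc -c < "$1" | tr -d ' ')
    if [ "$actual" -eq "$2" ]; then
        echo "OK   $1 ($actual bytes)"
    else
        echo "FAIL $1 (expected $2, got $actual)"
        return 1
    fi
}
```

For example, "check_size /tmp/QueueCat.html-arcreader.html 8044" should
report OK when the extracted body has the length given in the listing.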
(3) Convert the ARC file to WARC with the warc-tools "arc2warc" utility:
$ ./app/arc2warc -a takomaki.local.arc.gz -f takomaki.local.warc
$ ./app/arc2warc -a takomaki.local.arc.gz -f takomaki.local.warc.gz -c
$ ls -la takomaki.local.warc*
-rw-r--r-- 1 younes wheel 17762980 Oct 3 14:52 takomaki.local.warc
-rw-r--r-- 1 younes wheel 3511535 Oct 3 13:36 takomaki.local.warc.gz
(4) Get the offsets of the last 2 WARC records from the uncompressed
and compressed WARC files (created in step 3):
$ ./app/warcdump -f takomaki.local.warc 2>/dev/null | grep "WARC/" | grep "." | awk '{print $2}' | tail -2
17749955
17758295
$ ./app/warcdump -f takomaki.local.warc.gz 2>/dev/null | grep "WARC/" | grep "." | awk '{print $2}' | tail -2
3506908
3509709
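The same filter chain can be wrapped in a small function so that it is
easy to rerun on both files. This is only a sketch: last_offsets and
the WARCDUMP variable are names I made up for illustration, and it
relies on what the pipeline above already assumes, namely that warcdump
prints the record offset as the second field of each line containing
"WARC/".

```shell
# Path to warcdump; override WARCDUMP if your build lives elsewhere.
WARCDUMP=${WARCDUMP:-./app/warcdump}

# last_offsets WARC_FILE N  (hypothetical helper)
# Print the offsets of the last N records of WARC_FILE.
last_offsets() {
    "$WARCDUMP" -f "$1" 2>/dev/null |
        grep "WARC/" |          # keep only the record header lines
        grep "." |              # drop empty lines (kept from the original command)
        awk '{print $2}' |      # second field is assumed to be the offset
        tail -n "$2"
}
```

With this in place, "last_offsets takomaki.local.warc 2" and
"last_offsets takomaki.local.warc.gz 2" should print the same two pairs
of offsets as shown above.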
(5) Extract the last 2 WARC records with the warc-tools "wrecordbody"
command:
$ ./app/wrecordbody -f takomaki.local.warc -o 17749955 > /tmp/QueueCat.html-warctools.html
$ ./app/wrecordbody -f takomaki.local.warc.gz -o 3506908 > /tmp/QueueCat.html-warctools-gz.html
$ ./app/wrecordbody -f takomaki.local.warc -o 17758295 > /tmp/CostAssignmentPolicy.html-warctools.html
$ ./app/wrecordbody -f takomaki.local.warc.gz -o 3509709 > /tmp/CostAssignmentPolicy.html-warctools-gz.html
$ ls -la /tmp/*warctools*.html
-rw-r--r-- 1 younes wheel 4366 Oct 3 15:26 /tmp/CostAssignmentPolicy-warctools-gz.html
-rw-r--r-- 1 younes wheel 4366 Oct 3 15:26 /tmp/CostAssignmentPolicy-warctools.html
-rw-r--r-- 1 younes wheel 8044 Oct 3 15:25 /tmp/QueueCat.html-warctools-gz.html
-rw-r--r-- 1 younes wheel 8044 Oct 3 15:24 /tmp/QueueCat.html-warctools.html
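To avoid retyping the four extraction commands, something like the
following sketch could be used. extract_body and the WRECORDBODY
variable are hypothetical names for illustration; the only real command
involved is "wrecordbody -f FILE -o OFFSET", exactly as used above.

```shell
# Path to wrecordbody; override WRECORDBODY if your build lives elsewhere.
WRECORDBODY=${WRECORDBODY:-./app/wrecordbody}

# extract_body WARC_FILE OFFSET OUT_FILE  (hypothetical helper)
# Write the body of the record at OFFSET to OUT_FILE and report its size.
extract_body() {
    "$WRECORDBODY" -f "$1" -o "$2" > "$3" || return 1
    # strip the padding that some wc implementations add
    echo "$3: $(wc -c < "$3" | tr -d ' ') bytes"
}
```

For instance, "extract_body takomaki.local.warc 17749955
/tmp/QueueCat.html-warctools.html" runs the first command of this step
and prints the resulting file size in one go.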
(6) Compare the body contents (Heritrix vs warc-tools):
$ ls -la /tmp/QueueCat*.html
-rw-r--r-- 1 younes wheel 8044 Oct 3 14:41 /tmp/QueueCat.html-arcreader.html
-rw-r--r-- 1 younes wheel 8044 Oct 3 14:58 /tmp/QueueCat.html-warctools-gz.html
-rw-r--r-- 1 younes wheel 8044 Oct 3 15:11 /tmp/QueueCat.html-warctools.html
$ sha1sum /tmp/QueueCat.html*.html
37a1ff5a8b993f93ad63515a9e7a96b7ee05f62a /tmp/QueueCat.html-arcreader.html
37a1ff5a8b993f93ad63515a9e7a96b7ee05f62a /tmp/QueueCat.html-warctools-gz.html
37a1ff5a8b993f93ad63515a9e7a96b7ee05f62a /tmp/QueueCat.html-warctools.html
$ ls -la /tmp/CostAssignmentPolicy*.html
-rw-r--r-- 1 younes wheel 4366 Oct 3 14:40 /tmp/CostAssignmentPolicy-arcreader.html
-rw-r--r-- 1 younes wheel 4366 Oct 3 14:58 /tmp/CostAssignmentPolicy.html-warctools-gz.html
-rw-r--r-- 1 younes wheel 4366 Oct 3 14:55 /tmp/CostAssignmentPolicy.html-warctools.html
$ sha1sum /tmp/CostAssignmentPolicy*.html
06fc131afaa8c82cb51481cc35b5f7089598f8f4 /tmp/CostAssignmentPolicy-arcreader.html
06fc131afaa8c82cb51481cc35b5f7089598f8f4 /tmp/CostAssignmentPolicy.html-warctools-gz.html
06fc131afaa8c82cb51481cc35b5f7089598f8f4 /tmp/CostAssignmentPolicy.html-warctools.html
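If you prefer a yes/no answer over comparing digests by eye, here is a
minimal sketch. sha1_of and same_digest are hypothetical helper names;
the fallback to "shasum -a 1" is my assumption for machines where
sha1sum is not installed.

```shell
# sha1_of FILE  (hypothetical helper): print only the hex SHA-1 of FILE.
sha1_of() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum "$1" | awk '{print $1}'
    else
        # assumption: shasum is available where sha1sum is not
        shasum -a 1 "$1" | awk '{print $1}'
    fi
}

# same_digest FILE...  (hypothetical helper)
# Succeed only if every FILE has the same SHA-1 digest as the first one.
same_digest() {
    first=$(sha1_of "$1")
    # an empty digest means the first file could not be read
    [ -n "$first" ] || return 2
    for f in "$@"; do
        if [ "$(sha1_of "$f")" != "$first" ]; then
            echo "MISMATCH: $f differs from $1"
            return 1
        fi
    done
    echo "all $# files share SHA-1 $first"
}
```

Running "same_digest /tmp/QueueCat*.html" and "same_digest
/tmp/CostAssignmentPolicy*.html" would then report a mismatch if
arcreader and wrecordbody ever disagreed on a record body.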
As you can see, the sizes and SHA-1 digests are identical in every case.
Hope this helps.
cheers
Younès
I'll try to redo Steve's steps on my box (OS X Leopard).