wrecordbody output truncated

siznax

Oct 1, 2008, 10:59:33 AM
to warc-tools
hi All,

a colleague pointed out to me that wrecordbody may be giving truncated
output from Heritrix ARC files. i converted one of our ARC files to a
warc using warc-tools arc2warc, and can confirm similar results,
although the record lengths agree with the original arc record
lengths.

$HERITRIX/bin/arcreader ${arc} | tail -2
... QueueCat.html text/html - - 3377662 8044
... CostAssignmentPolicy.html text/html - - 3380345 4366

$HERITRIX/bin/arcreader ${arc} -f dump -o 3377662 > \
QueueCat-arcreader.html

$WARCTOOLS/arc2warc -a ${arc} -f ${warc}
$WARCTOOLS/arc2warc -a ${arc} -f ${warcGz} -c

$WARCTOOLS/warcdump -f ${warc} | grep uuid | tr -s " " | tail -2
17749955 ... 8044 ...
17758295 ... 4366 ...

$WARCTOOLS/wrecordbody -f ${warc} -o 17749955 > \
QueueCat-wrecordbody.html

$WARCTOOLS/warcdump -f ${warcGz} | grep uuid | tr -s " " | tail -2
3505385 ... 8044 ...
3508185 ... 4366 ...

$WARCTOOLS/wrecordbody -f ${warcGz} -o 3505385 > \
QueueCat-wrecordbody_gz.html

$ ls -l QueueCat-* | cut -d " " -f 8,12
8044 QueueCat-arcreader.html
7680 QueueCat-wrecordbody.html
7680 QueueCat-wrecordbody_gz.html

although, the converted warc indeed appears to contain full records.

i haven't seen any traffic on this possible issue. can someone make me
a member of the google code project, so i can open an issue?

also, i wouldn't mind helping to remove some spam from the discussion
list, if you want to make me an admin on this group as well.

/st...@archive.org

WARC

Oct 1, 2008, 12:14:12 PM
to warc-...@googlegroups.com
Hi Steve,

We are converting hundreds of ARC files without noticing any problem.
Could you put your problematic ARC file somewhere so we can test it?

cheers
Younès

On Oct 1, 2008, at 16:59, siznax wrote:

st...@archive.org

Oct 1, 2008, 1:29:48 PM
to warc-...@googlegroups.com
hello Younès,

thanks for responding. i've put two ARC files that,
when converted with arc2warc, both get truncated
output from wrecordbody here:

http://home.us.archive.org/~steve/data/arcs/


/st...@archive.org

goj...@gmail.com

Oct 1, 2008, 5:45:34 PM
to warc-tools
I don't think it's the conversion that's the issue; that was just a
convenient way to reproduce with only warc-tools software and long-
established ARCs.

Dominic at IA reports the same thing happens with WARCs written by
Heritrix. So it's really the wrecordbody utility that is suspect.

The truncated size reported in Steve's test, 7680, is 15*512, and the
amount of missing data is <512. So I'm guessing there's a missed flush
on a 512-byte buffer somewhere...

- Gordon
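
Gordon's guess can be illustrated with a small sketch. This is not the warc-tools code (which is C): `copy_full_blocks_only` and `copy_all` are hypothetical names, and the bug shown, a copy loop that writes only completely filled 512-byte buffers and never flushes the final partial one, is just one possible way such a truncation could arise.

```python
import io

BLOCK = 512  # buffer size suggested by the 7680 = 15 * 512 observation


def copy_full_blocks_only(src, dst, block=BLOCK):
    """Hypothetical buggy copy: the final partial buffer is never written."""
    while True:
        chunk = src.read(block)
        if len(chunk) < block:
            break  # bug: a short (non-empty) last chunk is dropped here
        dst.write(chunk)


def copy_all(src, dst, block=BLOCK):
    """Correct copy: every chunk, including a short final one, is written."""
    while True:
        chunk = src.read(block)
        if not chunk:
            break
        dst.write(chunk)


body = bytes(8044)  # stand-in for the 8044-byte QueueCat.html record body

buggy, good = io.BytesIO(), io.BytesIO()
copy_full_blocks_only(io.BytesIO(body), buggy)
copy_all(io.BytesIO(body), good)

print(len(buggy.getvalue()), len(good.getvalue()))  # 7680 8044
```

With an 8044-byte input the buggy loop writes 15 full blocks and drops the remaining 364 bytes, which matches the sizes in Steve's test.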

WARC

Oct 1, 2008, 5:57:57 PM
to warc-...@googlegroups.com
Hi Gordon,

We'll check that immediately. More on that by next Monday.
Thanks for reporting, guys.

cheers
Younès

On Oct 1, 2008, at 23:45, goj...@archive.org wrote:

st...@archive.org

Oct 1, 2008, 6:11:55 PM
to warc-...@googlegroups.com
another data point seems to support Gordon's theory:

arcreader
FlipFileInputStream.html text/html - - 28388195 1596
FlipFileOutputStream.html text/html - - 28389141 1596

arcreader -f dump -o 28388195 > test2_arcreader.html

arc2warc -a arc -f warc

warcdump -f warc
133055730 1940 WARC/0.18 1596 ...
133057670 1941 WARC/0.18 1596 ...

wrecordbody -f warc -o 133055730 > test2_wrecordbody.html

ls -l test2*
1596 test2_arcreader.html
1536 test2_wrecordbody.html


1536 = 512*3


/st...@archive.org
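
Both data points reported so far can be checked against the block-truncation theory with a few lines of arithmetic. This is only a sketch: `fits_block_truncation` is a hypothetical helper, and the 512-byte block size is Gordon's guess, not something read from the warc-tools source.

```python
def fits_block_truncation(expected, observed, block=512):
    """True if `observed` equals `expected` rounded down to a whole number
    of blocks and some data is actually missing."""
    missing = expected - observed
    return observed == (expected // block) * block and 0 < missing < block


# (record length from arcreader/warcdump, size written by wrecordbody)
observations = [(8044, 7680), (1596, 1536)]

for expected, observed in observations:
    print(expected, observed, expected - observed,
          fits_block_truncation(expected, observed))
```

Both pairs satisfy the check: 364 and 60 bytes are missing respectively, each less than one 512-byte block.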

WARC

Oct 1, 2008, 6:17:50 PM
to warc-...@googlegroups.com
Thanks Steve ;-)


On Oct 2, 2008, at 00:11, st...@archive.org wrote:

WARC

Oct 3, 2008, 9:36:52 AM
to warc-...@googlegroups.com, siznax, Gordon Mohr
Hi List,

After some debugging, everything seems to be OK in "warc-tools", and Steve's experiments went fine on my machine. I didn't notice any difference between the Heritrix "arcreader" outputs and the warc-tools "app/wrecordbody" outputs.

I suggest you follow the steps below so we can check that everybody understands the reported issue.

First of all, please ensure that you're using the latest version of "warc-tools":
$ svn checkout http://warc-tools.googlecode.com/svn/trunk/   warc-tools-read-only
$ cd warc-tools-read-only && make

Also, install Java (1.5.x or above) and download Heritrix (I'm using version 1.14.1) to get the "bin/arcreader" utility ready to run.

Next, download Steve's suspicious ARC file from here:
$ wget http://home.us.archive.org/~steve/data/arcs/IAH-20080909203837-00001-takomaki.local.arc.gz

So, let's go:

(1) Rename the ARC file (a shorter name fits better within this email):
$ mv IAH-20080909203837-00001-takomaki.local.arc.gz   takomaki.local.arc.gz

(2) Dump the last 2 ARC records and extract them with Heritrix arcreader:
$ /heritrix-1.14.1/bin/arcreader takomaki.local.arc.gz | tail -2 | awk '{print $3 " " $7 " " $8}'
http://crawler.archive.org/xref/org/archive/queue/QueueCat.html 3377662 8044
http://crawler.archive.org/xref/org/archive/crawler/frontier/CostAssignmentPolicy.html 3380345 4366

$ /heritrix-1.14.1/bin/arcreader -f dump -o 3377662 takomaki.local.arc.gz > /tmp/QueueCat-arcreader.html

$ /heritrix-1.14.1/bin/arcreader -f dump -o 3380345 takomaki.local.arc.gz > /tmp/CostAssignmentPolicy-arcreader.html

$ ls -la /tmp/*arcreader.html
-rw-r--r--  1 younes  wheel  4366 Oct  3 14:40 /tmp/CostAssignmentPolicy-arcreader.html
-rw-r--r--  1 younes  wheel  8044 Oct  3 14:41 /tmp/QueueCat-arcreader.html


(3) Convert the ARC file to WARC with the warc-tools "arc2warc" utility:
$ ./app/arc2warc -a takomaki.local.arc.gz -f takomaki.local.warc
$ ./app/arc2warc -a takomaki.local.arc.gz -f takomaki.local.warc.gz -c

$ ls -la takomaki.local.warc*
-rw-r--r--  1 younes  wheel  17762980 Oct  3 14:52 takomaki.local.warc
-rw-r--r--  1 younes  wheel   3511535 Oct  3 13:36 takomaki.local.warc.gz

(4) Get the offsets of the last 2 WARC records from the uncompressed and compressed WARC files (created in step 3):
$ ./app/warcdump -f takomaki.local.warc 2>/dev/null | grep "WARC/" | grep "." | awk '{print $2}' | tail -2
17749955
17758295

$ ./app/warcdump -f takomaki.local.warc.gz 2>/dev/null | grep "WARC/" | grep "." | awk '{print $2}' | tail -2
3506908
3509709

(5) Extract the last 2 WARC records with the warc-tools "wrecordbody" command:

$ ./app/wrecordbody -f takomaki.local.warc -o 17749955  > /tmp/QueueCat-warctools.html
$ ./app/wrecordbody -f takomaki.local.warc.gz -o 3506908 > /tmp/QueueCat-warctools-gz.html

$ ./app/wrecordbody -f takomaki.local.warc -o 17758295 > /tmp/CostAssignmentPolicy-warctools.html
$ ./app/wrecordbody -f takomaki.local.warc.gz -o 3509709 > /tmp/CostAssignmentPolicy-warctools-gz.html


$ ls -la /tmp/*warctools*.html
-rw-r--r--  1 younes  wheel  4366 Oct  3 15:26 /tmp/CostAssignmentPolicy-warctools-gz.html
-rw-r--r--  1 younes  wheel  4366 Oct  3 15:26 /tmp/CostAssignmentPolicy-warctools.html
-rw-r--r--  1 younes  wheel  8044 Oct  3 15:25 /tmp/QueueCat-warctools-gz.html
-rw-r--r--  1 younes  wheel  8044 Oct  3 15:24 /tmp/QueueCat-warctools.html



(6) Compare the body contents (Heritrix vs warc-tools):
$ ls -la /tmp/QueueCat*.html
-rw-r--r--  1 younes  wheel  8044 Oct  3 14:41 /tmp/QueueCat-arcreader.html
-rw-r--r--  1 younes  wheel  8044 Oct  3 14:58 /tmp/QueueCat-warctools-gz.html
-rw-r--r--  1 younes  wheel  8044 Oct  3 15:11 /tmp/QueueCat-warctools.html

$ sha1sum /tmp/QueueCat*.html
37a1ff5a8b993f93ad63515a9e7a96b7ee05f62a  /tmp/QueueCat-arcreader.html
37a1ff5a8b993f93ad63515a9e7a96b7ee05f62a  /tmp/QueueCat-warctools-gz.html
37a1ff5a8b993f93ad63515a9e7a96b7ee05f62a  /tmp/QueueCat-warctools.html


$ ls -la /tmp/CostAssignmentPolicy*.html
-rw-r--r--  1 younes  wheel  4366 Oct  3 14:40 /tmp/CostAssignmentPolicy-arcreader.html
-rw-r--r--  1 younes  wheel  4366 Oct  3 14:58 /tmp/CostAssignmentPolicy-warctools-gz.html
-rw-r--r--  1 younes  wheel  4366 Oct  3 14:55 /tmp/CostAssignmentPolicy-warctools.html

$ sha1sum /tmp/CostAssignmentPolicy*.html
06fc131afaa8c82cb51481cc35b5f7089598f8f4  /tmp/CostAssignmentPolicy-arcreader.html
06fc131afaa8c82cb51481cc35b5f7089598f8f4  /tmp/CostAssignmentPolicy-warctools-gz.html
06fc131afaa8c82cb51481cc35b5f7089598f8f4  /tmp/CostAssignmentPolicy-warctools.html


Everything is fine, as you can see.
Hope this helps.

cheers
Younès

I'll try to redo Steve's steps on my box (OS X Leopard).
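
The checksum comparison in step (6) can also be scripted. The following is a minimal sketch under stated assumptions: `sha1_of` and `all_identical` are hypothetical helper names, and the temporary files only stand in for the extracted record bodies so that the example runs without Heritrix or warc-tools.

```python
import hashlib
import os
import tempfile


def sha1_of(path):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths):
    """True if every file in `paths` has the same SHA-1 digest."""
    return len({sha1_of(p) for p in paths}) == 1


# Stand-in data: two full 8044-byte bodies and one truncated to 7680 bytes.
with tempfile.TemporaryDirectory() as tmp:
    full_a = os.path.join(tmp, "a.html")
    full_b = os.path.join(tmp, "b.html")
    short = os.path.join(tmp, "short.html")
    for path, size in ((full_a, 8044), (full_b, 8044), (short, 7680)):
        with open(path, "wb") as fh:
            fh.write(b"x" * size)
    same_full = all_identical([full_a, full_b])
    same_mixed = all_identical([full_a, short])

print(same_full, same_mixed)  # True False
```

Against the real files from the steps above, the call would be something like `all_identical(glob.glob("/tmp/QueueCat*.html"))`.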


Alex Osborne

Oct 3, 2008, 11:12:44 PM
to siznax, voidp...@gmail.com, warc-...@googlegroups.com, Gordon Mohr
WARC wrote:
> After some debugging, everything seems to be OK in the "warc-tools"
> and Steve experimentations
> on my machine went very well. I didn't notice any problem between
> Heritrix "arcdreader" outputs and warc-tools "app/wrecordbody" outputs.
Younès, it looks like you fixed it in revision 195 (2008-09-30), so
actually before Steve reported it, which is probably where the confusion
comes from. That's a paradoxical negative two days fix time. ;-)

Steve, try updating and see if it fixes it for you too. Here's what I
get on Linux x86_64:

% svn up -r194 | grep Upd && make &> /dev/null && ./app/wrecordbody -f takomaki.local.warc -o 17749955 2> /dev/null | wc -c
Updated to revision 194.
7680

% svn up -r195 | grep Upd && make &> /dev/null && ./app/wrecordbody -f takomaki.local.warc -o 17749955 2> /dev/null | wc -c
Updated to revision 195.
8044

Cheers,

Alex
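
The `| wc -c` check Alex runs against each revision can be wrapped in a small script for repeated use. This is a sketch, not part of warc-tools: `output_size` and `is_truncated` are hypothetical helpers, and the stand-in command below merely emits 7680 bytes so that the example runs without a warc-tools build.

```python
import subprocess
import sys


def output_size(cmd):
    """Run `cmd` (a list of arguments) and return the byte count of stdout."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, check=True)
    return len(result.stdout)


def is_truncated(cmd, expected_length):
    """True if the command produced fewer bytes than the record length."""
    return output_size(cmd) < expected_length


# Stand-in command that writes 7680 bytes to stdout.
fake = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 7680)"]
print(output_size(fake), is_truncated(fake, 8044))  # 7680 True
```

With a real build, `cmd` would be the invocation from the transcript above, for example `["./app/wrecordbody", "-f", "takomaki.local.warc", "-o", "17749955"]`, with 8044 as the expected length.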

WARC

Oct 4, 2008, 6:03:59 AM
to warc-...@googlegroups.com, siznax, Gordon Mohr
Hi Alex,

Superb!

Please always use the latest revision ;-)

cheers
Younès

On Oct 4, 2008, at 05:12, Alex Osborne wrote: