warcvalidator slow

siznax

Sep 14, 2009, 11:46:19 AM
to warc-tools
hi WARC Tools,

warcvalidator seems to take an inordinate amount of time
to validate our warc files, which ultimately do turn out
to be valid.

WARC Tools r245
warcvalidator.c: r211 | voidptrptr | 2008-11-07

on an unloaded crawler with dual 2.6GHz cpus and 4GB memory,
it took about 5 hours to process a 1GB WARC (the new standard
size) and 15 minutes to validate a 100MB WARC. gzip takes about
20 seconds to unpack a 1GB WARC.

OS: Ubuntu 5.10 "Breezy Badger"
kernel: Linux 2.6.16.1 #1 SMP May11 2006 x86_64 GNU/Linux
cpu: 2 x AMD 64GB 2605.873MHz
mem: 4015252k total

during processing, the cpus were mostly idle, about 90% of
memory was in use, and disk activity was low. in top, with a
sample delay of 0.1 seconds, i could see that warcvalidator
appears briefly and then goes away for a few seconds, then
appears briefly again.

time zcat WARC_1GB > /dev/null
real 0m20.781s
user 0m19.620s
sys 0m1.150s

time warcvalidator -f WARC_100MB
real 15m10.865s
user 0m8.510s
sys 0m8.770s

time warcvalidator -f WARC_1GB
real 298m58.218s (4.98 hrs)
user 1m21.760s
sys 1m35.540s
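
One way to read the timings above is to compare CPU time (user + sys) against wall-clock time (real). The following is only a minimal sketch, assuming the `time` output format shown in this message; `parse_time`, `cpu_fraction`, and `RUNS` are hypothetical names introduced for illustration and are not part of WARC Tools.

```python
import re


def parse_time(value):
    """Convert a `time`-style duration such as '15m10.865s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def cpu_fraction(real, user, sys_time):
    """Fraction of wall-clock time the process actually spent on a CPU."""
    return (parse_time(user) + parse_time(sys_time)) / parse_time(real)


# (real, user, sys) copied from the three runs reported in this message.
RUNS = {
    "zcat WARC_1GB": ("0m20.781s", "0m19.620s", "0m1.150s"),
    "warcvalidator -f WARC_100MB": ("15m10.865s", "0m8.510s", "0m8.770s"),
    "warcvalidator -f WARC_1GB": ("298m58.218s", "1m21.760s", "1m35.540s"),
}

for name, (real, user, sys_time) in RUNS.items():
    fraction = cpu_fraction(real, user, sys_time)
    print(f"{name}: {fraction:.1%} of wall-clock time on CPU")
```

With these figures, zcat spends nearly all of its elapsed time on the CPU, while warcvalidator spends roughly 2% (100MB) and 1% (1GB) there. That matches the mostly idle cpus seen in top: almost all of the elapsed time is spent waiting rather than computing.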

any idea what warcvalidator is doing?


/st...@archive.org