hi WARC Tools,
warcvalidator seems to take an inordinate amount of time
to validate our WARC files, which ultimately do turn out
to be valid.
WARC Tools r245
warcvalidator.c: r211 | voidptrptr | 2008-11-07
on an unloaded crawler with dual 2.6GHz cpus and 4GB memory,
it took about 5 hours to validate a 1GB WARC (the new standard
size) and 15 minutes to validate a 100MB WARC. for comparison,
gzip (via zcat) takes about 20 seconds to unpack the same 1GB WARC.
OS: Ubuntu 5.10 "Breezy Badger"
kernel: Linux 2.6.16.1 #1 SMP May11 2006 x86_64 GNU/Linux
cpu: 2 x AMD 64GB 2605.873MHz
mem: 4015252k total
during processing, the cpus were mostly idle, about 90% of
memory was in use, and disk activity was low. in top, with a
sample delay of 0.1 seconds, i could see that warcvalidator
appears briefly and then goes away for a few seconds, then
appears briefly again.
time zcat WARC_1GB > /dev/null
real 0m20.781s
user 0m19.620s
sys 0m1.150s
time warcvalidator -f WARC_100MB
real 15m10.865s
user 0m8.510s
sys 0m8.770s
time warcvalidator -f WARC_1GB
real 298m58.218s (4.98 hrs)
user 1m21.760s
sys 1m35.540s
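to put the numbers above in perspective, here is a rough python sketch (not part of WARC Tools; the helper names to_seconds and cpu_share are made up for illustration) that works out how much of the wall-clock time each run actually spent on a cpu, i.e. (user + sys) / real:

```python
import re


def to_seconds(stamp):
    """Convert a shell `time` value such as '298m58.218s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_share(real, user, system):
    """Fraction of wall-clock time spent on a cpu: (user + sys) / real."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)


# (real, user, sys) copied from the three runs in this message
runs = {
    "zcat WARC_1GB": ("0m20.781s", "0m19.620s", "0m1.150s"),
    "warcvalidator -f WARC_100MB": ("15m10.865s", "0m8.510s", "0m8.770s"),
    "warcvalidator -f WARC_1GB": ("298m58.218s", "1m21.760s", "1m35.540s"),
}

for name, (real, user, system) in runs.items():
    share = cpu_share(real, user, system)
    print(f"{name}: {share:.1%} of wall-clock time on cpu")
```

zcat comes out at nearly 100%, while the two warcvalidator runs come out at roughly 2% and 1%. so for almost the whole run warcvalidator is waiting on something rather than computing, which matches what i see in top.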
any idea what warcvalidator is doing?
/
st...@archive.org