I'd like to understand:
1) How the index of the index files (cluster.idx) is built;
2) How the compression of the WARC files is done so that individual
entries can be extracted and uncompressed.
1) When I ran my own mini-crawl to fill in the large PDF pages that
had been truncated, using nutch-cc, it produced cdx-....gz files, but
a) They were not sorted, but rather, perfectly reasonably,
corresponded 1-to-1 with the warc files;
b) No cluster.idx was produced, as far as I can tell, which also
makes sense, because the cdx files as produced were not sorted.
So, what tool does the refactor/sort and produces cluster.idx?
2) How does the crawler produce the warc files in such a way that you
can index into them? My understanding is that simply concatenating
separately compressed files would introduce some number of nulls
between the files, or is that only true in the case of compressed
tar files?
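To make question 2 concrete, here is a minimal Python sketch of the
behaviour I am asking about. It assumes each record is compressed as
its own gzip member and the members are written back to back, with an
(offset, length) pair kept per record. Whether the crawler does
exactly this is my assumption; the helper names write_records and
read_record and the toy record contents are made up for illustration.
The gzip format itself does allow members to be concatenated with no
padding, whereas tar pads its entries with nulls to 512-byte blocks.

```python
import gzip


def write_records(records):
    """Compress each record as its own gzip member and concatenate them."""
    blob = bytearray()
    index = []
    for rec in records:
        member = gzip.compress(rec)
        # Remember where this member starts and how long it is.
        index.append((len(blob), len(member)))
        blob += member
    return bytes(blob), index


def read_record(blob, offset, length):
    """Decompress a single record using only its byte range."""
    return gzip.decompress(blob[offset:offset + length])


# Toy stand-ins for WARC records.
records = [
    b"WARC/1.0\r\n\r\nfirst",
    b"WARC/1.0\r\n\r\nsecond",
    b"WARC/1.0\r\n\r\nthird",
]
blob, index = write_records(records)
second = read_record(blob, *index[1])
```

In this sketch the member lengths add up to the length of the whole
blob, so nothing is inserted between members, and the blob as a whole
is still a valid multi-member gzip stream.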
Thanks,
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]