Some questions about index construction

Henry S. Thompson

Mar 19, 2021, 10:32:37 AM
to common...@googlegroups.com
I'd like to understand
1) How the index of index files is built;
2) How the compression of the warc files is done so that individual
entries can be extracted and uncompressed.

1) When I ran my own mini-crawl with nutch-cc to fill in the large PDF
pages that had been truncated, it produced cdx-....gz files, but
a) they were not sorted; rather, perfectly reasonably, they
corresponded 1-to-1 with the warc files;
b) I don't think any cluster.idx was produced, which also makes
sense, since the cdx files as produced were not sorted.
So, what tool does the re-sorting and merging, and produces cluster.idx?

2) How does the crawler produce the warc files in such a way that you
can index into them? My understanding is that simple concatenation
of separately compressed files would introduce some number of null
bytes between files, or is that only true of compressed tar files?

Thanks,

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Sebastian Nagel

Mar 19, 2021, 10:55:35 AM
to common...@googlegroups.com
Hi Henry,

> 1) How the index of index files is built;

> So, what tool does the refactor/sort and produces cluster.idx?

See [1] https://github.com/commoncrawl/webarchive-indexing/
- this is a MapReduce job designed to build a so-called zipnum index
for billions of records; it's based on PyWB [2] (a sketch of how the
resulting index is read follows below)
- for smaller projects you might consider using PyWB directly, see [3]
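
To make the structure concrete, here is a rough sketch in Python of how
a lookup against a zipnum index proceeds. The cluster.idx field layout
in the comments is my reading of the Common Crawl index files, so treat
it as an assumption and verify it against your own data:

    import bisect
    import gzip
    import os

    # Sketch of a zipnum lookup. Assumed cluster.idx line format
    # (verify against your files):
    #   "<SURT key> <timestamp>\t<shard>\t<offset>\t<length>\t<cluster id>"
    # Each line points at one gzip member inside a cdx-NNNNN.gz shard,
    # holding a few thousand sorted CDX lines.

    def load_cluster_idx(path):
        keys, blocks = [], []
        with open(path, encoding='utf-8') as f:
            for line in f:
                key, shard, offset, length, _ = line.rstrip('\n').split('\t')
                keys.append(key)
                blocks.append((shard, int(offset), int(length)))
        return keys, blocks

    def lookup(keys, blocks, surt_key, shard_dir):
        # Binary-search for the last block whose first key sorts at or
        # before the search key; matches could in principle continue
        # into the next block, which this sketch ignores.
        i = max(bisect.bisect_right(keys, surt_key) - 1, 0)
        shard, offset, length = blocks[i]
        with open(os.path.join(shard_dir, shard), 'rb') as f:
            f.seek(offset)
            block = gzip.decompress(f.read(length))  # one gzip member
        return [ln for ln in block.decode('utf-8').splitlines()
                if ln.startswith(surt_key)]

The point is that only cluster.idx needs to be kept searchable;
everything else is fetched one compressed block at a time, e.g. via
HTTP range requests.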


> 2) How the compression of the warc files is done so that individual
> entries can be extracted and uncompressed.

> simple concatenation of separately compressed files would introduce
> some number of nulls between each file

Gzip allows multiple compressed members to be concatenated into a single
file without any separator, see [4]. Both the WARC standard [5] and the
zipnum index [6] make use of this feature. The blog post [7] explains
the basic ideas.
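
This is easy to demonstrate in Python; the records below are toy
stand-ins for WARC records, but the mechanics are exactly those used
for .warc.gz files:

    import gzip

    records = [b'record one\n', b'record two\n', b'record three\n']

    # Compress each record as its own gzip member and concatenate,
    # remembering where each member starts and how long it is.
    blob = b''
    offsets = []
    for rec in records:
        member = gzip.compress(rec)
        offsets.append((len(blob), len(member)))
        blob += member

    # The concatenation is itself a valid multi-member gzip stream:
    assert gzip.decompress(blob) == b''.join(records)

    # Any single record can be extracted without touching the rest,
    # which is how an (offset, length) pair from the CDX index is used:
    offset, length = offsets[1]
    print(gzip.decompress(blob[offset:offset + length]))  # b'record two\n'

This per-record ("record-at-time") compression is what makes it
possible to serve a single capture out of a multi-gigabyte warc.gz
with one range request.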

Best,
Sebastian


[1] https://github.com/commoncrawl/webarchive-indexing/
[2] https://pywb.readthedocs.io/en/latest/
[3] https://pywb.readthedocs.io/en/latest/manual/usage.html#using-existing-web-archive-collections
[4] http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
[5] https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-at-time-compression
[6] https://pywb.readthedocs.io/en/latest/manual/indexing.html?highlight=zipnum#zipnum-sharded-index
[7] https://rushter.com/blog/gzip-indexing/

Henry S. Thompson

Mar 19, 2021, 12:01:25 PM
to common...@googlegroups.com
Sebastian Nagel writes:

>> 1) How the index of index files is built;
>
>> So, what tool does the re-sorting and merging, and produces cluster.idx?
>
> See [1] https://github.com/commoncrawl/webarchive-indexing/
> ...
>> 2) How the compression of the warc files is done so that individual
>> entries can be extracted and uncompressed.
> ...
> Gzip allows multiple compressed members to be concatenated into a single
> file without any separator, see [4]. Both the WARC standard [5] and the
> zipnum index [6] make use of this feature. The blog post [7] explains
> the basic ideas.
>
Perfect, thank you!