Python code to use cluster.idx locally


Henry S. Thompson

Jun 16, 2021, 5:47:23 AM
to common...@googlegroups.com
I've spent an hour searching, and can't actually find a more-or-less
self-contained Python package for looking up a domain in local copies
of the index files (cluster.idx and cdx-......gz) for a given CC
month.

That functionality is of course all there in pywb, but finding the
entry points I need for my purposes seemed pretty challenging...

Any suggestions of where to look would be welcome,

Thanks,

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Sebastian Nagel

Jun 16, 2021, 6:16:17 AM
to common...@googlegroups.com
Hi Henry,

the PyWB project includes everything you need, see
https://pywb.readthedocs.io/en/latest/
https://github.com/webrecorder/pywb/

Basically all you need is to adapt the config.yaml
accordingly, e.g.

shard_index_loc:
    match: '.*(collections/[^/]+/)'
    # local cdx-*.gz files are also found in collections/
    replace: '\1'

For comparison see the config.yaml of
https://github.com/commoncrawl/cc-index-server
which configures the index server to read the cdx-*.gz files from S3.

Best,
Sebastian

Henry S. Thompson

Jun 16, 2021, 7:58:55 AM
to common...@googlegroups.com
Sebastian Nagel writes:

> the PyWB project includes everything you need, see
> https://pywb.readthedocs.io/en/latest/
> https://github.com/webrecorder/pywb/
>
> Basically all you need is to adapt the config.yaml
> accordingly, e.g.
>
> shard_index_loc:
>     match: '.*(collections/[^/]+/)'
>     # local cdx-*.gz files are also found in collections/
>     replace: '\1'
>
> For comparison see the config.yaml of
> https://github.com/commoncrawl/cc-index-server
> which configures the index server to read the cdx-*.gz files from S3.

Thanks, that's very helpful, but not enough --- AFAICS, the pywb
documentation doesn't describe a Python API that I can use. I don't
need/want to run a server, I just want to look up domains in my local
copy of the index.

Sebastian Nagel

Jun 16, 2021, 8:21:41 AM
to common...@googlegroups.com
Hi Henry,

> Thanks, that's very helpful, but not enough --- AFAICS, the pywb
> documentation doesn't describe a Python API that I can use. I don't
> need/want to run a server, I just want to look up domains in my local
> copy of the index.

Got it. You might look into the class "ZipNumIndexSource" in
https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/index/zipnum.py

Best,
Sebastian
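
For readers who want the lookup without running a server, here is a minimal standalone sketch of the ZipNum idea. It does not use pywb at all. It assumes that each cluster.idx line is tab-separated as "<SURT key> <timestamp>", cdx file name, byte offset, byte length, sequence number, and that each (offset, length) range in a cdx-*.gz file is an independent gzip member. The function names load_cluster_index, lookup_prefix and surt_prefix are hypothetical, and surt_prefix is a simplification (real SURT canonicalization also handles "www.", ports and so on).

```python
import bisect
import gzip
from pathlib import Path


def surt_prefix(domain):
    """Simplified SURT key prefix for a host, e.g. ed.ac.uk -> uk,ac,ed)/"""
    return ",".join(reversed(domain.lower().split("."))) + ")/"


def load_cluster_index(idx_path):
    """Parse cluster.idx into a list of (key, cdx_file, offset, length).

    Assumes the file is already sorted by key, as the index lines are.
    """
    entries = []
    with open(idx_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            entries.append((fields[0], fields[1], int(fields[2]), int(fields[3])))
    return entries


def lookup_prefix(index_dir, entries, prefix):
    """Yield every CDX line whose key starts with prefix."""
    keys = [e[0] for e in entries]
    # Each idx key is the first key of its block, so the block just
    # before the first key >= prefix may still contain matches.
    start = max(bisect.bisect_left(keys, prefix) - 1, 0)
    for key, cdx_file, offset, length in entries[start:]:
        if key > prefix and not key.startswith(prefix):
            break  # this block starts past the last possible match
        with open(Path(index_dir) / cdx_file, "rb") as f:
            f.seek(offset)
            raw = f.read(length)
        # Decompress only this block, not the whole cdx file.
        for line in gzip.decompress(raw).decode("utf-8").splitlines():
            if line.startswith(prefix):
                yield line
```

Usage would look roughly like this, with the directory name being a placeholder for wherever the local copies live:

    entries = load_cluster_index("indexes/cluster.idx")
    for line in lookup_prefix("indexes", entries, surt_prefix("ed.ac.uk")):
        print(line)

Loading cluster.idx once and reusing the entries list keeps repeated lookups down to a binary search plus a few block reads. To include subdomains as well, one could search on the prefix without the trailing ")/" and filter the results.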