cdx_toolkit, a cdx client in Python


Greg Lindahl

Mar 12, 2018, 1:20:46 PM
to common...@googlegroups.com
I was looking at some code I'd written to query data from various CDX
indices and then do analysis on it, and realized that there was a CDX
index client hiding inside, one that did clever things such as
efficiently knitting together Common Crawl's monthly indices into a
single, virtual index. I've hacked it up to beta quality and am
interested in comments on the interfaces:

https://github.com/cocrawler/cdx_toolkit
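
For anyone wondering what "a single, virtual index" means in practice,
here is a rough sketch of the idea. It is not cdx_toolkit's code or its
API: MonthlyIndex and virtual_query are hypothetical names, and the
in-memory records stand in for real index servers. The assumption is
that each monthly index covers a known time range and can be queried
lazily, so months outside the requested range, or past the limit, are
never contacted.

```python
from itertools import islice


class MonthlyIndex:
    """Hypothetical stand-in for one monthly CDX index (not a real API)."""

    def __init__(self, name, start, end, records):
        self.name = name
        self.start = start      # first timestamp covered, 14-digit string
        self.end = end          # last timestamp covered, 14-digit string
        self.records = records  # list of dicts with 'url' and 'timestamp'
        self.queries = 0        # how many times this index was actually hit

    def query(self, url):
        # Generator: the counter only moves once iteration begins.
        self.queries += 1
        for rec in self.records:
            if rec["url"] == url:
                yield rec


def virtual_query(indices, url, from_ts=None, to_ts=None, limit=None):
    """Present many monthly indices as one lazy stream of records."""

    def generate():
        # Visit the newest month first.
        for index in sorted(indices, key=lambda i: i.start, reverse=True):
            # Skip months that cannot overlap the requested time range.
            if from_ts is not None and index.end < from_ts:
                continue
            if to_ts is not None and index.start > to_ts:
                continue
            for rec in index.query(url):
                ts = rec["timestamp"]
                if from_ts is not None and ts < from_ts:
                    continue
                if to_ts is not None and ts > to_ts:
                    continue
                yield rec

    # islice stops pulling records once the limit is reached, so older
    # months are never queried if newer ones already satisfied the request.
    return islice(generate(), limit)
```

Because everything is a generator, a caller asking for a handful of
recent captures only pays for the most recent month or two.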

In addition to a Python 3 API, it also supports command-line tools that
can generate JSONL and CSV output. And you can combine data from
the Internet Archive and Common Crawl.
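
To make the two output formats concrete, here is a minimal sketch using
only the standard library. It does not reproduce the command-line
tools: to_jsonl and to_csv are hypothetical helpers, and the field
names (url, status, timestamp) are assumed for illustration.

```python
import csv
import io
import json


def to_jsonl(records):
    """One JSON object per line, keys sorted for stable output."""
    return "".join(json.dumps(rec, sort_keys=True) + "\n" for rec in records)


def to_csv(records, fields):
    """CSV with a header row; keys not listed in `fields` are dropped."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return buf.getvalue()


# Illustrative record, not real index data.
records = [
    {"url": "example.com/", "status": "200", "timestamp": "20180205120000"},
]
print(to_jsonl(records), end="")
print(to_csv(records, ["url", "status"]), end="")
```

JSONL keeps every field of each record, while CSV needs the column
list chosen up front, which is why the CSV helper takes `fields`.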

-- greg

