I was looking at some code I'd written to query data from various CDX
indices and then do analysis on it, and realized that there was a CDX
index client hiding inside, which did clever things such as
efficiently knitting together Common Crawl's monthly indices into a
single, virtual index. I've hacked it up to beta quality and am
interested in comments on the interfaces:
https://github.com/cocrawler/cdx_toolkit
In addition to a Python 3 API, it also includes command-line tools that
can generate JSONL and CSV output. And you can combine data from
the Internet Archive and Common Crawl.
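
To make the "single, virtual index" idea concrete, here is a minimal
sketch of one way to knit per-month indices together. It is not the
actual cdx_toolkit code or API: the function name virtual_index, the
monthly_indices mapping, and the record fields are all made up for
illustration, and small in-memory lists stand in for real index queries.

```python
from itertools import islice


def virtual_index(monthly_indices, url, limit=None):
    """Lazily yield captures of `url`, newest monthly index first.

    `monthly_indices` is a hypothetical mapping of index name -> records;
    a real client would query a remote index server here instead.
    """
    def records():
        # Names like '2018-05' sort chronologically as plain strings,
        # so a reverse sort walks the months from newest to oldest.
        for name in sorted(monthly_indices, reverse=True):
            for rec in monthly_indices[name]:
                if rec["url"] == url:
                    # Tag each record with the index it came from.
                    yield {**rec, "index": name}

    # islice stops early, so older months are never touched once
    # `limit` records have been produced.
    return islice(records(), limit)


# Toy data standing in for two monthly indices (assumed shape).
monthly_indices = {
    "2018-04": [{"url": "example.com/", "timestamp": "20180410120000"}],
    "2018-05": [
        {"url": "example.com/", "timestamp": "20180520120000"},
        {"url": "example.org/", "timestamp": "20180521120000"},
    ],
}

captures = list(virtual_index(monthly_indices, "example.com/"))
```

Because the generator is lazy, asking for only a few captures means the
older months are never read, which is the sort of saving a virtual index
can offer over querying every monthly index up front.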
-- greg