cdx_toolkit, a cdx client in Python

Greg Lindahl

Mar 12, 2018, 1:20:46 PM

to common...@googlegroups.com

I was looking at some code I'd written to query data from various CDX
indices and then analyze it, and realized that there was a CDX
index client hiding inside, one that does clever things such as
efficiently knitting together Common Crawl's monthly indices into a
single, virtual index. I've hacked it up to beta quality and am
interested in comments on the interfaces:

https://github.com/cocrawler/cdx_toolkit
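
To make the "virtual index" idea concrete, here is a rough sketch of one way
several per-month indices could be presented as a single one. This is not the
code in the repo: the INDICES data, the record fields, and the names parse_ts
and virtual_index are all made up for illustration, and real crawl windows
don't line up with calendar months as neatly as they do here. The point is
only that indices whose time window can't overlap the query get skipped, and
the rest are read newest first.

```python
from datetime import datetime

# Hypothetical stand-ins for monthly indices: (name, start, end, records).
# Real indices would be queried over HTTP; these are in-memory lists.
INDICES = [
    ("2018-01", datetime(2018, 1, 1), datetime(2018, 2, 1), [
        {"timestamp": "20180110120000", "url": "example.com/"},
        {"timestamp": "20180120120000", "url": "example.com/"},
        {"timestamp": "20180121000000", "url": "other.com/"},
    ]),
    ("2018-02", datetime(2018, 2, 1), datetime(2018, 3, 1), [
        {"timestamp": "20180205000000", "url": "example.com/"},
    ]),
    ("2018-03", datetime(2018, 3, 1), datetime(2018, 4, 1), [
        {"timestamp": "20180310000000", "url": "example.com/"},
    ]),
]


def parse_ts(ts):
    """Parse a 14-digit CDX-style timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def virtual_index(indices, url, from_ts, to_ts):
    """Yield captures of url within [from_ts, to_ts], newest index first."""
    lo, hi = parse_ts(from_ts), parse_ts(to_ts)
    # Skip any index whose window cannot overlap the requested range.
    relevant = [ix for ix in indices if ix[1] <= hi and ix[2] > lo]
    relevant.sort(key=lambda ix: ix[1], reverse=True)
    for name, _start, _end, records in relevant:
        for rec in records:
            if rec["url"] == url and lo <= parse_ts(rec["timestamp"]) <= hi:
                # Tag each capture with the index it came from.
                yield dict(rec, index=name)


if __name__ == "__main__":
    for capture in virtual_index(INDICES, "example.com/",
                                 "20180115000000", "20180220000000"):
        print(capture)
```

With the sample data above, the 2018-03 index is never touched, since its
window starts after the end of the requested range.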

In addition to a Python 3 API, it also includes command-line tools that
can generate JSONL and CSV output. You can also combine data from
the Internet Archive and Common Crawl.
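
As a sketch of what combining sources and emitting JSONL or CSV amounts to,
the snippet below merges two lists of capture records and serializes them
with the standard library only. It does not use cdx_toolkit or its
command-line tools; the record fields, the sample data, and the names
combine, to_jsonl and to_csv are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical capture records, as two different sources might return them.
IA_RECORDS = [
    {"timestamp": "20170101000000", "url": "example.com/", "status": "200"},
]
CC_RECORDS = [
    {"timestamp": "20180205000000", "url": "example.com/", "status": "200"},
    {"timestamp": "20160615000000", "url": "example.com/", "status": "301"},
]


def combine(*sources):
    """Merge (name, records) pairs into one list sorted by timestamp."""
    merged = [dict(rec, source=name) for name, recs in sources for rec in recs]
    # 14-digit timestamps sort correctly as plain strings.
    merged.sort(key=lambda r: r["timestamp"])
    return merged


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


def to_csv(records, fields):
    """Serialize records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


if __name__ == "__main__":
    captures = combine(("ia", IA_RECORDS), ("cc", CC_RECORDS))
    print(to_jsonl(captures), end="")
    print(to_csv(captures, ["timestamp", "url", "status", "source"]), end="")
```

Tagging each record with its source keeps the merged output traceable back
to the archive it came from.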

-- greg
