Hi,
To follow up on the release of the CommonCrawl Index API (
http://index.commoncrawl.org), I also wanted to announce a simple command-line tool for querying this index.
The tool is a Python script that should make it even easier to use the API.
The script makes it easier to download all the result pages of a query by running several downloads in parallel.
For example, to download all URLs in the *.io TLD from the Feb 2015 index (announced today), one might run:
./cdx-index-client.py '*.io' --coll=CC-MAIN-2015-11
(Note the quotes around '*.io', which keep the shell from expanding the wildcard.)
To check how many pages will be downloaded (a good idea) before starting the full download:
./cdx-index-client.py '*.io' --coll=CC-MAIN-2015-11 --show-num-pages
For larger queries, this may of course take a bit of time.
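For the curious, --show-num-pages presumably boils down to a single query against the index API. Here is a minimal sketch in Python, assuming each crawl's index is served pywb-style at http://index.commoncrawl.org/<collection>-index and accepts a showNumPages parameter (the query_url helper below is just for illustration, not part of the script):

```python
from urllib.parse import urlencode

def query_url(collection, url_pattern, **extra):
    # Assumed endpoint layout: one pywb-style CDX endpoint per crawl,
    # at http://index.commoncrawl.org/<collection>-index
    params = {"url": url_pattern, "output": "json"}
    params.update(extra)
    return "http://index.commoncrawl.org/%s-index?%s" % (
        collection, urlencode(params))

# --show-num-pages likely maps to a query along these lines:
print(query_url("CC-MAIN-2015-11", "*.io", showNumPages="true"))
```

Fetching that URL should return the page count for the query, which is what the flag reports.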
It's also possible to adjust the parallelization (--processes) and page size (--page-size).
Run with -h to see a full list of options.
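If you would rather script the paged, parallel download yourself, a rough sketch follows. The endpoint layout and the page/output parameters are assumptions based on a pywb-style CDX server, and page_url, fetch_page, and fetch_all are hypothetical helpers, not the script's actual internals:

```python
import json
from multiprocessing import Pool
from urllib.parse import quote
from urllib.request import urlopen

def page_url(collection, url_pattern, page):
    """URL for one page of results from a collection's index (assumed layout)."""
    return ("http://index.commoncrawl.org/%s-index?url=%s&output=json&page=%d"
            % (collection, quote(url_pattern), page))

def fetch_page(args):
    """Download one page of results; output is one JSON record per line."""
    collection, url_pattern, page = args
    with urlopen(page_url(collection, url_pattern, page)) as resp:
        return [json.loads(line) for line in resp.read().splitlines()]

def fetch_all(collection, url_pattern, num_pages, processes=4):
    """Fetch every result page in parallel, analogous to --processes."""
    with Pool(processes) as pool:
        jobs = [(collection, url_pattern, p) for p in range(num_pages)]
        for records in pool.imap_unordered(fetch_page, jobs):
            yield from records
```

The worker pool size plays the same role as the script's --processes option; num_pages would come from a prior showNumPages query.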
If there is interest, it is also possible to build a MapReduce version of this tool that could run across multiple machines.
I look forward to hearing any feedback, bug reports, or suggestions you may have from using this tool or the index!
Thanks,
Ilya