Announcing: command-line client for index server


Ilya Kreymer

Mar 30, 2015, 9:54:00 PM
to common...@googlegroups.com
Hi,

To follow up on the release of the CommonCrawl Index API (http://index.commoncrawl.org), I also wanted to announce a simple command-line tool for querying this index:


This tool is a Python script that should make it even easier to use the API.

The script makes it easier to download all the pages of an index query by performing several downloads in parallel.

For example, to download all URLs in the *.io TLD from the Feb 2015 index (announced today), one might do the following:

./cdx-index-client.py *.io --coll=CC-MAIN-2015-11

The --coll argument specifies which collection to query (available collections are listed on http://index.commoncrawl.org).

To check how many pages will be downloaded before starting (a good idea):

./cdx-index-client.py *.io --coll=CC-MAIN-2015-11 --show-num-pages
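For anyone who wants to do the same page-count check directly against the API, here is a rough sketch. It is not taken from the script: the `<collection>-index` endpoint path, the `showNumPages` and `pageSize` query parameters, and the `pages` field in the JSON response are my assumptions about how the index server behaves, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode

INDEX_SERVER = "http://index.commoncrawl.org"  # server named in this post


def num_pages_url(url_pattern, coll, page_size=None):
    """Build a page-count query URL (hypothetical helper).

    Assumes the server exposes each collection at /<coll>-index and
    accepts showNumPages / pageSize as query parameters.
    """
    params = {"url": url_pattern, "showNumPages": "true"}
    if page_size is not None:
        params["pageSize"] = page_size
    return "{}/{}-index?{}".format(INDEX_SERVER, coll, urlencode(params))


def parse_num_pages(body):
    """Extract the page count from the response body.

    Assumes a JSON object with a "pages" field.
    """
    return int(json.loads(body)["pages"])
```

The URL from `num_pages_url("*.io", "CC-MAIN-2015-11")` can be fetched with any HTTP client and the body passed to `parse_num_pages`.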

For larger queries, this may of course take a bit of time.

It's also possible to adjust the parallelization (--processes) and page size (--page-size).
Run with -h to see a full list of options.
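To give a feel for the parallel download idea, here is a minimal sketch. It is not the script's actual code: it uses a thread pool rather than separate processes, and `fetch_page` is a hypothetical callable you would supply (for example, one that requests page `i` of a query and returns the response text).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_pages(fetch_page, num_pages, workers=4):
    """Call fetch_page(i) for every page index using a pool of threads.

    fetch_page is a caller-supplied function (hypothetical here);
    results are returned in page order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order even if pages finish out of order
        return list(pool.map(fetch_page, range(num_pages)))
```

Passing the fetch function in keeps the sketch independent of any particular HTTP library, and `workers` plays roughly the role that --processes plays in the tool.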

If there is interest, it is also possible to build a MapReduce version of this tool that could run across multiple machines.

I look forward to hearing any feedback/bugs/suggestions you may have from using this tool or the index!

Thanks,
Ilya

