Dear users,
We're happy that our URL index server is popular and heavily used.
However, it's only a single server, and we cannot scale it up.
We think our time and hardware are better spent on improving the crawler
and the data.
Please try not to overload the URL index server! And please avoid
1. bulk downloads, e.g., *all .com results over all monthly crawl archives*.
It's ok to perform bulk queries, but please try not to fetch terabytes
of data via the index server! Below are instructions on how to download the
index files directly.
2. fetching the list of available monthly indexes too often. The content of
http://index.commoncrawl.org/collinfo.json
changes once per month. No need to fetch it multiple times per second.
Please keep it cached!
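A minimal way to do that is sketched below (the local file name and the
one-day refresh interval are just examples, any sensible caching works):

  # re-download collinfo.json at most once a day, otherwise reuse the local copy
  if [ ! -f collinfo.json ] || [ -n "$(find collinfo.json -mmin +1440)" ]; then
      wget -q -O collinfo.json http://index.commoncrawl.org/collinfo.json
  fi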
How to download index files:
The overview page on
http://index.commoncrawl.org/
links to a list of index files for each monthly index, e.g.
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/cc-index.paths.gz
Download it, decompress it, and fetch the files in the list by adding the prefix
https://commoncrawl.s3.amazonaws.com/
or, when accessing them via S3, the prefix
s3://commoncrawl/
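For example, a minimal sketch that grabs the file list of the example monthly
index above and turns it into download URLs (downloading everything in the
list is a lot of data, so you may want to filter it first):

  # download and decompress the file list of the CC-MAIN-2018-05 index
  wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/cc-index.paths.gz
  gzip -d cc-index.paths.gz

  # prepend the HTTPS prefix to every listed path and fetch the files
  sed 's|^|https://commoncrawl.s3.amazonaws.com/|' cc-index.paths > cc-index.urls
  wget -i cc-index.urls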
Want to fetch index files for a single top-level domain (here .fr)?
- the file list contains a cluster.idx file
cc-index/collections/CC-MAIN-2018-05/indexes/cluster.idx
- fetch it, e.g.:
  wget https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2018-05/indexes/cluster.idx
- the first field in the cluster.idx contains the SURT representation of the URL,
with the reversed host/domain name:
fr,01-portable)/pal-et-si-internet-nexistait-pas.htm
- it's easy to list the cdx files containing all results from the .fr TLD:
grep '^fr,' cluster.idx | cut -f2 | uniq
cdx-00193.gz
cdx-00194.gz
cdx-00195.gz
cdx-00196.gz
That's only 4 files! I'm sure you're able to find the full path/URL
in the file list (there's a small example right after this list). If not, I'm happy to help.
- .com results make up more than 50% of the index:
grep '^com,' cluster.idx | cut -f2 | uniq | wc -l
155
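To actually download the four .fr cdx files from above, here is a small
sketch (it assumes the cdx files sit next to cluster.idx in the indexes/
directory, which you can verify in the decompressed cc-index.paths file):

  # list the cdx files covering the .fr TLD and fetch them via HTTPS
  for f in $(grep '^fr,' cluster.idx | cut -f2 | uniq); do
      wget "https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2018-05/indexes/$f"
  done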
Please fetch the index files directly. It's also much faster:
you can get all .com URLs from a monthly index in about one hour.
I'll add these instructions to the overview page (or link them from there) soon.
Thanks,
Sebastian