Overloading index.commoncrawl.org and bulk index downloads


Sebastian Nagel

Feb 5, 2018, 5:10:19 AM
to common...@googlegroups.com
Dear users,

we're happy that our URL index server is popular and heavily used.
However, it's only a single server and we cannot scale it up.
We think time and hardware are better spent improving the crawler
and the data.

Please try not to overload the URL index server! And please avoid

1. bulk downloads, e.g., *all .com results over all monthly crawl archives*.
It's OK to perform bulk queries, but please try not to fetch terabytes
of data via the index server! Below are instructions on how to download the
index files directly.

2. fetching the list of available monthly indexes too often. The content of
http://index.commoncrawl.org/collinfo.json
changes only once per month. There is no need to fetch it multiple times per second.
Please keep it cached!
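
For example, a simple once-a-day cache is enough (just a sketch):

  # refetch collinfo.json only if the local copy is missing or older than one day
  if [ -z "$(find collinfo.json -mtime -1 2>/dev/null)" ]; then
    wget -q -O collinfo.json http://index.commoncrawl.org/collinfo.json
  fi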


How to download index files:

The overview page on
http://index.commoncrawl.org/
links to a list of index files for each monthly index, e.g.
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/cc-index.paths.gz
Download it, decompress it, and fetch the files in the list by adding the prefix
https://commoncrawl.s3.amazonaws.com/
or, when accessing them via S3, the prefix
s3://commoncrawl/
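
For example, a minimal wget-based sketch (using the CC-MAIN-2018-05 paths file
from above; a full monthly index is roughly 250 GB):

  wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/cc-index.paths.gz
  gzip -d cc-index.paths.gz
  # prepend the HTTPS prefix to every listed path and fetch the files
  sed 's|^|https://commoncrawl.s3.amazonaws.com/|' cc-index.paths > cc-index.urls
  wget -i cc-index.urls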

Want to fetch index files for a single top-level domain (here .fr)?

- the file list contains a cluster.idx file
cc-index/collections/CC-MAIN-2018-05/indexes/cluster.idx
- fetch it, e.g.:
wget https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2018-05/indexes/cluster.idx
- the first field in the cluster.idx contains the SURT representation of the URL,
with the reversed host/domain name:
fr,01-portable)/pal-et-si-internet-nexistait-pas.htm
- it's easy to list the cdx files containing all results from the .fr TLD:
grep '^fr,' cluster.idx | cut -f2 | uniq
cdx-00193.gz
cdx-00194.gz
cdx-00195.gz
cdx-00196.gz
That's only 4 files! I'm sure you're able to find the full path/URL
in the file list (see the sketch below). If not, I'm happy to help.
- .com results make up more than 50% of the index:
grep '^com,' cluster.idx | cut -f2 | uniq | wc -l
155
Please fetch the index files directly. That's much faster,
and you can get all .com URLs from a monthly index in about one hour.
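
Putting it together, a small sketch to download just those four .fr shards
(assuming the cdx-*.gz files sit next to cluster.idx in the same indexes/
directory; check the file list to be sure):

  # turn the shard names into full download URLs and fetch them
  grep '^fr,' cluster.idx | cut -f2 | uniq \
    | sed 's|^|https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2018-05/indexes/|' \
    | xargs -n1 wget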

I'll add these instructions to the overview page (or link to them there) soon.

Thanks,
Sebastian

Tom Morris

Feb 6, 2018, 2:17:55 AM
to common...@googlegroups.com
It's good to have a specific, focused message like this, but this has all been said before (and should be obvious without having to state it).

Perhaps technical solutions could be pursued in addition to social ones:

- 5xx errors when exceeding reasonable thresholds
- traffic "shaping" which slows throughput down progressively the more aggressive the client gets.
- <insert your idea here>

Tom



Sebastian Nagel

Feb 6, 2018, 3:01:13 AM
to common...@googlegroups.com
Hi Tom,


> - 5xx errors when exceeding reasonable thresholds

Yes, of course, but as a user I would expect the CDX client to handle
a "503 Slow Down" and throttle itself. Restarting a bulk query that
failed because of an unhandled 5xx only adds load.
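
For example, even plain curl can approximate this (a sketch, not a statement
about any particular CDX client; the query URL is just an example):

  # curl treats HTTP 503 as a transient error and retries after a delay
  curl -s --retry 5 --retry-delay 30 \
    'http://index.commoncrawl.org/CC-MAIN-2018-05-index?url=example.com&output=json' \
    -o example.com.json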


> - traffic "shaping" which slows throughput down progressively the more aggressive the client gets.

I've already set that up, but the per-IP "quota" is quite generous and
still allows running bulk queries.


In any case, I think any effort is better spent on "cooperative" development:
improving the data, making it more accessible, etc.
Setting up more sophisticated limits, or, on the other end, working around
limits with proxies and the like, is wasted time.


Best,
Sebastian


On 02/06/2018 08:17 AM, Tom Morris wrote:
> It's good to have a specific, focused message like this, but this has all been said before (and
> should be obvious without having to state it).
>
> Perhaps technical solutions could be pursued in addition to social ones:
>
> - 5xx errors when exceeding reasonable thresholds
> - traffic "shaping" which slows throughput down progressively the more aggressive the client gets.
> - <insert your idea here>
>
> Tom
>
> On Mon, Feb 5, 2018 at 5:10 AM, Sebastian Nagel <seba...@commoncrawl.org
> <mailto:seba...@commoncrawl.org>> wrote:
>
> Dear users,
>
> we're happy that our URL index server is popular and heavily used.
> However, it's only a single server and we cannot scale it up.
> We think time and hardware are better spent to improve the crawler
> and data.
>
> Please try not to overload the URL index server! And please avoid
>
> 1. bulk downloads, e.g., *all .com results over all monthly crawl archives*.
>    It's ok, to perform bulk queries, but please try not to fetch Terabytes
>    of data via the index server! Below are instructions how to download the
>    index files directly.
>
> 2. fetching the list of available monthly indexes too often. The content of
>      http://index.commoncrawl.org/collinfo.json <http://index.commoncrawl.org/collinfo.json>
>    is changed once per month. No need to fetch it multiple times per second.
>    Please keep it cached!
>
>
> How to download index files:
>
> The overview page on
>      http://index.commoncrawl.org/
> links to a list of index files for each monthly index, e.g.
>      https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/cc-index.paths.gz
> <https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/cc-index.paths.gz>
> Download it, decompress it, and fetch the files in the list by adding the prefix
>      https://commoncrawl.s3.amazonaws.com/ <https://commoncrawl.s3.amazonaws.com/>
> common-crawl...@googlegroups.com <mailto:common-crawl%2Bunsu...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> <https://groups.google.com/group/common-crawl>.
> For more options, visit https://groups.google.com/d/optout <https://groups.google.com/d/optout>.
>
>
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.

oliver...@gmail.com

Aug 6, 2018, 8:42:53 AM
to Common Crawl
I have just tried your proposal. It worked really well, thanks. But there seems to be an issue regarding completeness.
The idx files do not seem to contain all URLs. Using the web interface I find some URLs which I can't find in the corresponding idx file.

Example:
http://index.commoncrawl.org/CC-MAIN-2018-30-index?url=wuerzburg.de&output=json

When searching for "de,wuerzburg" in the CC-MAIN-2018-30.idx file I do not get a result.

Do you have an idea?

Sebastian Nagel

Aug 6, 2018, 9:19:26 AM
to common...@googlegroups.com
Hi Oliver,

the list of index files mentioned above (cc-index.paths.gz)
contains 300 files named cdx-*.gz; the records for the
domain wuerzburg.de are contained in one of these 300
files (about 250 GB in total). You can use the
cluster.idx to determine which one, but the cluster.idx
itself does not contain all URLs.

A description of the index format is found here:
https://github.com/webrecorder/pywb/wiki/CDX-Index-Format#zipnum-sharded-cdx
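
For example, with the CC-MAIN-2018-30 cluster.idx downloaded as described above,
a rough way to find the shard covering wuerzburg.de (just a sketch; the search
key is the SURT form of the domain):

  # last cluster.idx entry whose SURT key sorts at or before 'de,wuerzburg)'
  awk -F'\t' '$1 <= "de,wuerzburg)" { shard = $2 } $1 > "de,wuerzburg)" { exit } END { print shard }' cluster.idx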

Best,
Sebastian


oliver...@gmail.com

Aug 7, 2018, 3:57:26 AM
to Common Crawl
Hi Sebastian,

thanks for your quick reply! Does that mean that even if I check all existing cluster.idx
files for a specific domain, there may still be data in one of the cdx-*.gz files while the domain is not mentioned in any of the cluster.idx files?

My example:
I have merged all existing cluster.idx files into one big file and searched there for a specific domain name, but couldn't find it. Afterwards I checked the web API with the same domain name and voilà: there are entries.
I thought at least the domain name should be available via this merged cluster.idx file.

Best regards,
Oliver

Sebastian Nagel

Aug 7, 2018, 4:51:13 AM
to common...@googlegroups.com
Hi Oliver,

as described in

https://github.com/webrecorder/pywb/wiki/CDX-Index-Format#zipnum-sharded-cdx

the cluster.idx files contain only one out of every 3000 records. The cdx-*.gz files are
sorted by SURT key (reversed domain), so only domains with at least 3000 captures
are guaranteed to appear in the cluster.idx files. Recent monthly crawls
cover about 30 million domains, far more than the 1.5 million lines in the
cluster.idx.
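
Even if a small domain never appears in cluster.idx, the file still tells you
where to look: the last entry sorting at or before the domain's SURT key points
to the compressed block that would hold its records. A rough sketch (I'm
assuming the tab-separated fields after the key are shard file, offset, length;
please double-check against the format description linked above):

  KEY='de,wuerzburg)'
  PREFIX=https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2018-30/indexes
  # last cluster.idx entry whose key sorts at or before the search key
  LINE=$(awk -F'\t' -v k="$KEY" '$1 <= k { l = $0 } $1 > k { exit } END { print l }' cluster.idx)
  SHARD=$(echo "$LINE" | cut -f2)
  OFFSET=$(echo "$LINE" | cut -f3)
  LENGTH=$(echo "$LINE" | cut -f4)
  # each block is an independent gzip member, so a ranged GET decompresses on its own
  curl -s -r "$OFFSET-$((OFFSET + LENGTH - 1))" "$PREFIX/$SHARD" | zcat | grep "^$KEY"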


If you only need lists of domain names, please have a look at

- the columnar index
http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

- the crawl statistics
https://github.com/commoncrawl/cc-crawl-statistics
Domain counts (successful fetches only) are contained in 10 files (a quick peek is shown below), e.g.:
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2018-30/count/part-00000.bz2
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2018-30/count/part-00001.bz2
...
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2018-30/count/part-00009.bz2

- the domain-level webgraphs
http://commoncrawl.org/2018/05/webgraphs-feb-mar-apr-2018/
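
For a quick peek at one of the count files mentioned above (just a sketch; I'll
leave the record format aside here):

  # fetch one part file and inspect the first lines
  curl -s https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2018-30/count/part-00000.bz2 \
    | bzcat | head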

Best,
Sebastian