Rate Limits

Nigel Vickers

14 Jan 2017, 10:14:40
to Common Crawl
We have our own database of .de domains. Working against the September 2016 Common Crawl we find a number of discrepancies. This is to be expected, since our data dates back, in some cases, to the early 1990s. We have written an application which will try to reconcile our data with CC's WARC records, and we wish to start pre-production tests. We call the index with a URL, parse the JSON response, and download the WARC. We shall rate-limit to under 10 requests/sec. Are there any restrictions or formalities to be observed?
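In outline, the loop looks like the following minimal sketch (illustrative only, not our actual application; the CC-MAIN-2016-40-index endpoint and the filename/offset/length fields follow the public index API, the domain list is a stand-in):

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "time"
)

// indexRecord holds the fields needed to locate a WARC record on S3.
type indexRecord struct {
    Filename string `json:"filename"`
    Offset   string `json:"offset"`
    Length   string `json:"length"`
    Status   string `json:"status"`
}

func main() {
    domains := []string{"example.de"} // stand-in for our domain list

    // One tick every 100 ms keeps us safely under 10 requests/sec.
    limiter := time.Tick(100 * time.Millisecond)

    for _, d := range domains {
        <-limiter
        q := url.Values{"url": {d}, "output": {"json"}}
        resp, err := http.Get("https://index.commoncrawl.org/CC-MAIN-2016-40-index?" + q.Encode())
        if err != nil {
            fmt.Println("query failed:", err)
            continue
        }
        // The index server answers with one JSON object per line.
        sc := bufio.NewScanner(resp.Body)
        for sc.Scan() {
            var rec indexRecord
            if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
                continue
            }
            // filename/offset/length locate the WARC record for download.
            fmt.Println(rec.Filename, rec.Offset, rec.Length)
        }
        resp.Body.Close()
    }
}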

Nigel Vickers 

Sebastian Nagel

16 Jan 2017, 12:24:03
to common...@googlegroups.com
Hi Nigel,

yes, that's OK. However, the server index.commoncrawl.org is usually quite loaded, so you are
unlikely to reach 10 responses/sec. If you have a longer list of queries (100,000 or more), it
might be worth:

- to set up an index server on your own AWS EC2 instance [1]
- or to fetch and process the index files directly [2] (see the sketch below)
- we also provide aggregated host/domain counts [3,4]
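For option [2], a minimal sketch in Go, assuming the collection keeps its gzipped cdx shards under an indexes/ prefix with names like cdx-00000.gz (that layout is an assumption here): stream one shard over HTTPS and keep only the .de entries.

package main

import (
    "bufio"
    "compress/gzip"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // One of the (assumed) numbered gzipped cdx shards of the collection.
    shard := "https://commoncrawl.s3.amazonaws.com/cc-index/collections/" +
        "CC-MAIN-2016-40/indexes/cdx-00000.gz"
    resp, err := http.Get(shard)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The shards are concatenated gzip members; Go's reader handles that.
    gz, err := gzip.NewReader(resp.Body)
    if err != nil {
        panic(err)
    }
    // Index lines are sorted by SURT key, e.g. "de,example)/ 20160923..."
    sc := bufio.NewScanner(gz)
    sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // cdx lines can be long
    for sc.Scan() {
        if strings.HasPrefix(sc.Text(), "de,") {
            fmt.Println(sc.Text())
        }
    }
}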

Ultimately it's your decision, depending on how long you are willing to wait until all your
requests are processed.

Thanks,
Sebastian


[1] https://github.com/commoncrawl/cc-index-server/
[2] s3://commoncrawl/cc-index/collections/CC-MAIN-2016-40/ (September 2016 crawl)
[3] s3://commoncrawl/crawl-analysis/CC-MAIN-2016-40/count/
[4] https://github.com/commoncrawl/cc-crawl-statistics/ (code to generate counts)

Nigel Vickers

28 Jan 2017, 14:05:08
to Common Crawl


Hello Sebastian,

Thanks for your suggestions.

On Monday, 16 January 2017 18:24:03 UTC+1, Sebastian Nagel wrote:
> yes, that's OK. However, the server index.commoncrawl.org is usually quite loaded, so you are
> unlikely to reach 10 responses/sec. If you have a longer list of queries (100,000 or more), it
> might be worth:

Your assessment was correct. We were down to round trips of up to 13 secs/request.

Because we work under heavy compliance/certification constraints, an AWS EC2 instance is currently not permissible.

We switched to -50.

We have a WARC downloader written in Go to access commoncrawl.s3.amazonaws.com, which takes a JSON index item as its input. We wrapped the cdx-index-client Python script to download index pages on demand. This initially worked quite well: with 3 instances sharding on NumPages, 6 calls per second including the WARC download were possible. But the concept proved to be very brittle. We began to have a number of problems with the downloaded index files (short files with no error reported, missing fields, one file with nearly 20% of its items empty), and for a number of hours yesterday the index server was not available.

I have decided to write our own index client in Go using the index API.
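In outline it could look like the following sketch (illustrative, not the final client; showNumPages and page are the paging parameters of the index server's pywb-style CDX API):

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

const endpoint = "https://index.commoncrawl.org/CC-MAIN-2016-40-index"

func main() {
    query := "url=*.example.de&output=json" // hypothetical domain query

    // First ask how many result pages the query spans ...
    var meta struct {
        Pages int `json:"pages"`
    }
    resp, err := http.Get(endpoint + "?" + query + "&showNumPages=true")
    if err != nil {
        panic(err)
    }
    if err := json.NewDecoder(resp.Body).Decode(&meta); err != nil {
        panic(err)
    }
    resp.Body.Close()

    // ... then fetch each page; the page range can be sharded across workers.
    for page := 0; page < meta.Pages; page++ {
        r, err := http.Get(fmt.Sprintf("%s?%s&page=%d", endpoint, query, page))
        if err != nil {
            fmt.Println("page", page, "failed:", err)
            continue
        }
        body, _ := io.ReadAll(r.Body)
        r.Body.Close()
        fmt.Printf("page %d: %d bytes of cdx JSON lines\n", page, len(body))
    }
}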

We have about 460,000 .de domains in our own legacy database, mainly commercial ones. About 280,000 of these pass our net test and are considered active, and some 90,000 of those were not present in the September 2016 crawl. This situation is naturally fluid. We are considering obtaining permission to donate these to CC, which would allow us to shut down our own crawlers. For this to work we have to be able to run our "keyworders" against the WARC data in S3. Our first run will be against 2 million items downloaded last week. If we can demonstrate we can cross-reference to our own data, we may have a workable concept.


rgds Nigel Vickers

Sebastian Nagel

29 Jan 2017, 16:32:26
to common...@googlegroups.com
Hi Nigel,

> Your assessment was correct. We were down to round trips of up to 13 secs/request.

Yeah, the index server is heavily loaded. I hope to get it moved to a more powerful
EC2 instance this spring.

> We have a WARC downloader written in Go to access commoncrawl.s3.amazonaws.com, which takes a
> JSON index item as its input.

You want to fetch a web page (WARC record) using filename, offset, and length given in the cdx file?
It's possible without using the index server, by just accessing AWS S3, see
https://groups.google.com/d/msg/common-crawl/8vnQnUA-0-0/TAb1LeNWFgAJ
Curl can also be used instead of the AWS CLI:
curl -s -r375611148-$((375611148+17240-1)) \
"https://commoncrawl.s3.amazonaws.com/crawl-data/..." | gzip -dc

> We are considering obtaining
> permission to donate these to CC.

Great!

> If we can demonstrate we can cross-reference to our own
> data, we may have a workable concept.

Let us know. And if you have a good idea of how to simplify and speed up the task of sub-sampling
CC data, let us know that as well. It's a frequent but non-trivial problem.

Best,
Sebastian
