Apology for the Rate Limit / 503


Jasper Yu

May 1, 2023, 3:00:38 AM
to Common Crawl
Hi All,

We are sorry for causing this situation. We need to download all files from the site and filter out imgur links for the Archive Team Imgur project.
To meet the deadline before imgur deletes those images, we have been downloading as fast as possible at high bandwidth (> 40 Gbps continuously; I personally download at 20 Gbps). This caused 503 errors for about 45 minutes every 2 hours over the last few days, and access is now globally rate limited, potentially to kb/s.

Please accept our apology.

Splash672

May 1, 2023, 5:29:07 PM
to Common Crawl
When will it be back to normal, and are you still downloading?

Jasper Yu

May 1, 2023, 6:06:38 PM
to Common Crawl
Hi,

We stopped all download processes once the rate limit was applied.

Thanks

Juah

May 2, 2023, 7:05:33 AM
to Common Crawl
Holy crawl - 40 Gbps - that is serious bandwidth. 

Tom Shi

May 3, 2023, 8:52:19 AM
to Common Crawl
When will you finish your plan?

xz liu

May 3, 2023, 11:53:40 AM
to Common Crawl
It seems the rate is limited to 128 kb/s for everyone? When will it get back to normal? o.0

Juah

May 3, 2023, 11:58:46 AM
to Common Crawl
Don't tell me you are downloading whenever you need to look at the data - this is crazy. The requests are now officially unusable.

Arthur Strong

May 4, 2023, 3:59:11 AM
to Common Crawl
From Hetzner and Linode, right now (May 4, 07:58:59 AM UTC): 128 kb/s.

Greg Lindahl

May 5, 2023, 3:26:12 PM
to common...@googlegroups.com
Hi, all.

When you see this kind of rate limit, it's being enforced by Amazon's
CloudFront service for accesses to the AWS Open Data bucket, due to
high usage. The limit typically gets lifted after usage goes back to
normal.

-- greg
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/fe1b0767-0292-4a41-a312-1bcf0c933814n%40googlegroups.com.

David Mackey

May 5, 2023, 7:53:01 PM
to Common Crawl
I'm guessing there isn't a way to have CloudFront set per-IP bandwidth rate limits? It seems unfortunate that a single user can take it down for everyone. I'm unable to pull down a single WARC file atm and am getting:

<Error>
<Code>SlowDown</Code>
<Message>Please reduce your request rate.</Message>
<RequestId>87P2P6Z162AW9XYY</RequestId>
<HostId>ne4N7mOYSPgP+iGGaZOD3fCaMgAuueCMCKA315dpoyqGhCM5PPLtFnTBRXKigKo7Ki14xXJIluo=</HostId>
</Error>
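
For anyone scripting downloads, a response like this can be detected before deciding whether to retry. The following is only a minimal sketch: the helper name `is_slowdown` is hypothetical, and it assumes the error body has the same XML shape as the response quoted above.

```python
import xml.etree.ElementTree as ET


def is_slowdown(body: str) -> bool:
    """Return True if an XML error body carries the SlowDown code.

    Assumes the body looks like the <Error><Code>...</Code></Error>
    response quoted in this thread; anything unparseable is treated
    as "not a SlowDown error".
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    return root.tag == "Error" and root.findtext("Code") == "SlowDown"
```

A caller could use this to tell throttling apart from other failures (for example a missing file) and pause only in the throttling case.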

Greg Lindahl

May 8, 2023, 4:52:26 PM
to common...@googlegroups.com
CloudFront is a very distributed system and it is what it is.

I've started monitoring data.commoncrawl.org and saw only one 503 in the
past 48 hours. But that's just the San Francisco Bay Area endpoint.

index.commoncrawl.org was working very poorly over the weekend. The
columnar index is probably working fine.

-- greg

Volo

May 11, 2023, 8:09:56 AM
to Common Crawl
I've just discovered the Common Crawl Index Server, but nothing is working for me. Whenever I try to open or query a search page (e.g. https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org&output=json&limit=1) it fails with a "504 Gateway Time-out" response. Is this also caused by the bandwidth limit mentioned by the OP, or is it a different issue?

Sebastian Nagel

May 11, 2023, 9:38:48 AM
to common...@googlegroups.com
Hi,

> Is this also caused by the bandwidth limit mentioned by the OP, or
> it's a different issue?

It's related: the CDX index server also needs to access data on
s3://commoncrawl/, and it cannot serve as many requests as it should
when access to S3 is slow. This includes an adaptively slower request
rate in response to "HTTP 503 Slow Down" status codes.

Unfortunately, some users are still sending 100k or more requests per
hour, while the server can successfully respond to just a few
thousand.
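
To illustrate what an adaptive request rate can look like in general, here is a minimal sketch of an additive-increase / multiplicative-decrease controller. It is not the CDX server's actual implementation, which is not shown in this thread; the class name `AdaptiveRate` and all default values are assumptions chosen for illustration.

```python
class AdaptiveRate:
    """Additive-increase / multiplicative-decrease request rate.

    Rates are in requests per second. All defaults are illustrative
    assumptions, not values used by any Common Crawl service.
    """

    def __init__(self, start=1.0, floor=0.05, ceiling=1.0, step=0.05, factor=0.5):
        self.rate = start
        self.floor = floor      # never go slower than this
        self.ceiling = ceiling  # never go faster than this
        self.step = step        # added after each successful response
        self.factor = factor    # multiplied in after each throttled response

    def record(self, status: int) -> float:
        """Update the rate from an HTTP status code and return the new rate."""
        if status in (503, 504):
            self.rate = max(self.floor, self.rate * self.factor)
        elif 200 <= status < 300:
            self.rate = min(self.ceiling, self.rate + self.step)
        return self.rate

    def delay(self) -> float:
        """Seconds to wait before the next request at the current rate."""
        return 1.0 / self.rate
```

Each throttled response halves the rate, while successes recover it only slowly, so a client backs off quickly and ramps up cautiously.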

Sorry about this. We know that the CDX server needs to be fixed and
we'll try to - but it may take some time.

Best,
Sebastian


David Mackey

May 11, 2023, 3:37:38 PM
to Common Crawl
Hi,

I understand CloudFront is a highly distributed system, but I'm still confused by the situation.

Amazon has its Web Application Firewall (WAF), which supports rate-limiting rules and can sit in front of CloudFront.

Is the Common Crawl CloudFront distribution not behind the WAF?

I apologize if I'm retreading well-covered ground.

Sincerely,
Dave

William Roe

May 14, 2023, 9:37:40 AM
to Common Crawl
Hi, 

I reduced my index search to 1 thread, and I am still getting a 504 error on every request.
What I thought I read was to limit index queries to 1 request/sec and implement retries. Is that correct? In the past, 504 errors were only periodic. What changed?

V/R

-Bill
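
The approach described above (single-threaded requests, no faster than one per second, with retries) could be sketched roughly as follows. Whether these are the recommended limits is not confirmed in this thread. The function name `fetch_with_retries`, the `fetch` callable, and the default values are all hypothetical; `fetch` stands in for whatever HTTP client is in use and is assumed to return a `(status, body)` pair.

```python
import time

# Status codes treated as "try again later" in this sketch.
RETRYABLE = {503, 504}


def fetch_with_retries(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-retryable status.

    Waits base_delay * 2**attempt seconds between attempts (1s, 2s, 4s, ...
    with the defaults), so retries never exceed one request per second.
    Returns the last (status, body) pair seen.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt + 1 < max_attempts:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. A client would still need its own pause of at least a second between successive successful queries.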