503 SlowDown


Colin Dellow

Dec 8, 2022, 11:11:18 AM
to Common Crawl
I'm seeing a very high rate of SlowDown responses. Unfortunately, it's such that the service is unusable -- even the first step of fetching the list of paths fails.

These steps failed, for example:

$ aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2022-27/wat.paths.gz .
download failed: s3://commoncrawl/crawl-data/CC-MAIN-2022-27/wat.paths.gz to ./wat.paths.gz An error occurred (SlowDown) when calling the GetObject operation (reached max retries: 4): Please reduce your request rate.

$ wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-27/wat.paths.gz
2022-12-08 16:07:20 ERROR 503: Service Unavailable.

$ wget 'https://index.commoncrawl.org/CC-MAIN-2022-40-index?url=https%3A%2F%2Fattributz.github.io%2F&output=json'
2022-12-08 16:08:25 ERROR 500: Internal Error: An error occurred (SlowDown) when calling the GetObject operation (reached max retries: 4): Please reduce your request rate.

If I persist, I can get the file after 5-10 attempts. But this makes me doubt it would be worth then making the follow-up requests for the actual data.

Is there something I can change? I was hoping that an authenticated S3 request could carve out some small amount of bandwidth, but it seems to be subject to the same limits as unauthenticated requests.

Ekku Jokinen

Dec 8, 2022, 12:35:44 PM
to Common Crawl
I'm also experiencing these, and I remember having the same issue some time back as well. I'm assuming it starts to throttle when the system is under heavy pressure and will just take some time to recover. If that's the case, it would be nice to know whether there is an estimate of how long recovery takes!

Sebastian Nagel

Dec 8, 2022, 12:50:24 PM
to common...@googlegroups.com
Hi Colin, hi Ekku,

> Is there something I can change? I was hoping that an authenticated
> S3 request could carve out some small amount of bandwidth, but it
> seems like it's subject to the same limits as the unauthenticated
> things.

In the end, all requests (via CloudFront or via the S3 API,
authenticated or not) are served from the same bucket.

If the requests are sent from the AWS cloud in us-east-1, using the
S3 API should always be the better choice.
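
If going through the S3 API, it may also help to raise the CLI's retry
budget. These are standard AWS CLI v2 / SDK settings, not Common
Crawl-specific advice, and whether the "adaptive" client-side rate
limiting actually helps against this particular throttling is an
assumption on my part:

```shell
# Standard AWS CLI v2 retry settings (not Common Crawl-specific).
# "adaptive" adds client-side rate limiting on top of plain retries;
# AWS_MAX_ATTEMPTS raises the total attempt count (default is lower).
export AWS_RETRY_MODE=adaptive
export AWS_MAX_ATTEMPTS=10

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2022-27/wat.paths.gz .
```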

> under heavy pressure and will just take some time to recoupe. If this
> is the case, would be nice to know if there was some estimate of time
> it will take to recover!

Since the introduction of CloudFront-backed access in March 2022,
repeated 503s have been observed only infrequently and temporarily
(lasting no more than a few hours). So, maybe wait one day and try
again. As Colin mentioned, retrying a few times should also succeed;
that could be a solution for a single but urgent download, e.g. the
path listings.
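
For bulk downloads, the usual pattern is a small wrapper that retries
with jittered exponential backoff. A minimal sketch (the constants are
guesses, not tested recommendations; with boto3 you could instead pass
Config(retries={'max_attempts': 10, 'mode': 'adaptive'}) to the client):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=8, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying on any exception with full-jitter backoff.

    Sketch for riding out intermittent SlowDown/503 responses; the
    default values are guesses, not recommendations.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Full jitter: sleep 0..min(max_delay, base * 2^attempt) seconds.
            time.sleep(random.uniform(
                0, min(max_delay, base_delay * 2 ** attempt)))

# Hypothetical usage with a boto3 S3 client `s3`:
#   retry_with_backoff(lambda: s3.download_file(
#       "commoncrawl", "crawl-data/CC-MAIN-2022-27/wat.paths.gz",
#       "wat.paths.gz"))
```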

Best,
Sebastian
