503 error while downloading data using wget

Soyeon Kim

Apr 25, 2023, 8:58:57 AM

to Common Crawl

Hello everyone,

In advance, thank you for all your efforts to sustain & share the crucial data for AI research.

I am trying to download Common Crawl data using wget (e.g., wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-17/segments/1618038056869.3/warc/CC-MAIN-20210410105831-20210410135831-00329.warc.gz).

Since there are a lot of warc.gz files in each index, I split the list into batches (e.g., 20 or 90) and download them concurrently using wget in background mode.

The downloads go well at first, but after about 20 iterations a 503 error comes up.

I read the S3 documentation, which mentions a limit of 5,500 requests, but I think that only applies when I use boto3 or the aws s3 CLI command. (Is that right?)

My request rate through wget is much lower than 5,500 per second, yet I still get the 503 error. Is throttling harsher when using wget? Is there a recommended way to download WARC files reliably?

Thank you in advance

Sebastian Nagel

Apr 25, 2023, 9:37:46 AM

to common...@googlegroups.com

Hi,

we are aware that currently many requests fail
with a "HTTP 503 Slow Down".

Please, simply slow down your request rate if
you see such responses repeatedly.
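
A minimal sketch of what "slowing down" could look like in a download script, assuming Python instead of wget. The function names (`backoff_delays`, `download`), the retry count, and the delay values are illustrative assumptions, not settings recommended by Common Crawl. The idea is to wait longer after each 503 response before trying the same file again:

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(max_retries=6, base=1.0, cap=60.0):
    """Upper bound (seconds) of the wait before each retry: base * 2**n, capped."""
    return [min(cap, base * 2 ** n) for n in range(max_retries)]


def download(url, dest, max_retries=6,
             opener=urllib.request.urlopen, sleep=time.sleep):
    """Download url to dest, backing off on HTTP 503.

    Returns the number of retries that were needed. Any other HTTP error,
    or a 503 after the last retry, is re-raised to the caller.
    """
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            with opener(url) as resp, open(dest, "wb") as out:
                # Stream in 1 MiB chunks so large WARC files are not held in memory.
                while chunk := resp.read(1 << 20):
                    out.write(chunk)
            return attempt
        except urllib.error.HTTPError as err:
            if err.code != 503 or attempt == max_retries:
                raise
            # Random jitter keeps parallel workers from retrying in lockstep.
            sleep(random.uniform(delays[attempt] / 2, delays[attempt]))
```

Calling `download(url, "file.warc.gz")` for one file at a time, or from only a few parallel workers, keeps the overall request rate low. If 503 responses still show up repeatedly, reducing the number of concurrent downloads is the more direct fix.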

Also: use the S3 API only from the AWS region "us-east-1"
(Northern Virginia).

For further details, see
https://commoncrawl.org/access-the-data/

Thanks!

Best,
Sebastian

On 4/25/23 14:58, Soyeon Kim wrote:
> Hello everyone,
>
> In advance, thank you for all your efforts to sustain & share the
> crucial data for AI research.
> I try to download commoncrawl data using wget(e.g,
> wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-17/segments/1618038056869.3/warc/CC-MAIN-20210410105831-20210410135831-00329.warc.gz)
>
> Since there are a lot of warc.gz file in each index, I split the list
> such as 20 or 90, then download them concurrently using wget with
> background mode.
> The download process goes well but after like 20 iterations, 503 error
> comes up.
> I read s3 document metioning about 5,500 request
> <https://docs.aws.amazon.com/athena/latest/ug/performance-tuning-s3-throttling.html>
> limit but I think it is only related when I use boto3 or aws s3 cli
> command.(is it right?)
