Since there are a lot of warc.gz files in each index, I split the list into chunks of, say, 20 or 90, then download them concurrently using wget in background mode.
The downloads go well, but after about 20 iterations a 503 error comes up.
I read the S3 documentation mentioning a limit of 5,500 requests per second, but I think that only applies when using boto3 or the aws s3 CLI. (Is that right?)
Considering that my request rate through wget is much lower than 5,500 per second but I still face 503 errors, is there harsher throttling when using wget? Is there any recommendation for downloading WARC files reliably?
Thank you in advance
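
[For reference, a rough Python sketch of the setup described above. This is purely illustrative: the list file name "warc.paths", the concurrency of 20, and the helper names are assumptions, not from the original post.]

import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "https://data.commoncrawl.org/"

def download(path):
    # Save each WARC file under its own file name in the current directory.
    urllib.request.urlretrieve(BASE + path, path.rsplit("/", 1)[-1])

# warc.paths holds one WARC path per line, as in a crawl's warc.paths.gz listing.
with open("warc.paths") as f:
    paths = [line.strip() for line in f if line.strip()]

# Run at most 20 downloads at a time, mirroring "split the list into
# chunks of 20 and run wget in the background for each chunk".
with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(download, paths))

[A thread pool caps concurrency at a fixed number rather than launching every download at once; that cap is the knob to turn down when 503 responses start appearing.]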
Sebastian Nagel
Apr 25, 2023, 9:37:46 AM
to common...@googlegroups.com
Hi,
we are aware that currently many requests fail
with an "HTTP 503 Slow Down" response.
Please simply slow down your request rate if
you see such responses repeatedly.
Also: use the S3 API only from the AWS region "us-east-1"
(Northern Virginia).
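
[A minimal sketch of the "slow down and retry" advice above, assuming plain HTTP downloads from data.commoncrawl.org; the retry count and delays are illustrative values, not official recommendations.]

import time
import urllib.request
from urllib.error import HTTPError

def fetch_with_backoff(url, dest, max_retries=5, base_delay=5.0):
    # Retry on "503 Slow Down", waiting twice as long after each failure.
    for attempt in range(max_retries):
        try:
            urllib.request.urlretrieve(url, dest)
            return
        except HTTPError as e:
            if e.code != 503:
                raise  # other errors are not throttling; surface them
            time.sleep(base_delay * 2 ** attempt)  # 5s, 10s, 20s, ...
    raise RuntimeError(f"still throttled after {max_retries} attempts: {url}")

fetch_with_backoff(
    "https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-17/segments/"
    "1618038056869.3/warc/CC-MAIN-20210410105831-20210410135831-00329.warc.gz",
    "CC-MAIN-20210410105831-20210410135831-00329.warc.gz",
)

[Exponential backoff is the usual response to a 503: each retry waits longer, which automatically lowers the overall request rate.]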
On 4/25/23 14:58, Soyeon Kim wrote:
> Hello everyone,
>
> First of all, thank you for all your efforts to sustain and share this
> crucial data for AI research.
> I am trying to download Common Crawl data using wget (e.g.,
> wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-17/segments/1618038056869.3/warc/CC-MAIN-20210410105831-20210410135831-00329.warc.gz)
>
> Since there are a lot of warc.gz files in each index, I split the list
> into chunks of, say, 20 or 90, then download them concurrently using
> wget in background mode.
> The downloads go well, but after about 20 iterations a 503 error
> comes up.
> I read the S3 documentation mentioning a limit of 5,500 requests