News Archive rate reduction

kasper...@gmail.com

Sep 20, 2022, 6:12:24 PM

to Common Crawl

Hi everyone,

I was recently downloading last month's news crawl via authenticated S3. About halfway through, I got a rate reduction error:

"botocore.exceptions.ClientError: An error occurred (SlowDown) when calling the GetObject operation (reached max retries: 4): Please reduce your request rate."

I was downloading at a rate of about 1 file per minute. What would be an appropriate rate?

Greg Lindahl

Sep 20, 2022, 9:31:35 PM

to common...@googlegroups.com

On Tue, Sep 20, 2022 at 03:12:24PM -0700, kasper...@gmail.com wrote:

> "botocore.exceptions.ClientError: An error occurred (SlowDown) when calling
> the GetObject operation (reached max retries: 4): Please reduce your
> request rate."

The usual advice is to temporarily slow down when you see this kind of
error. If you're requesting 1 file per minute, and you've seen 4
retries already, sleeping for several minutes would be good.

Even better is to explicitly wait 1 minute before retrying at all.

-- greg

Michael M Behrendt

Sep 21, 2022, 2:01:32 AM

to common...@googlegroups.com

It’s a bit unclear to me whether there is still a way to download from Common Crawl at high throughput (i.e. >>1 MB/sec, up to many GB/sec), as discussed in this thread, from outside of AWS and without incurring additional costs on any end.

Can someone please help clarify?

Many apologies if this was already answered somewhere else and I missed the pointer.

kasper...@gmail.com

Sep 21, 2022, 10:46:54 AM

to Common Crawl

I was just using authenticated S3 as described here:

https://groups.google.com/g/common-crawl/c/atjkwHO6WwQ/m/Ptsx9LgaAwAJ

I can wrap it into a try statement, but I was wondering if there is a rate that avoids these errors.

Greg Lindahl

Sep 21, 2022, 12:37:46 PM

to common...@googlegroups.com

The ideal rate depends on how many other people are also downloading.
That's why I am recommending explicitly waiting when you see a 503
(SlowDown).
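
For anyone finding this thread later, here is a minimal sketch of the "wait explicitly, then retry" advice in Python. It is not an official Common Crawl or AWS recipe. The function names `is_slowdown` and `call_with_wait`, the 60-second default wait, and the limit of 5 attempts are assumptions chosen for illustration. The only thing it relies on from botocore is that a `ClientError` carries the error code in `exc.response["Error"]["Code"]`.

```python
import time


def is_slowdown(exc):
    """Return True if exc looks like a botocore ClientError with code SlowDown."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") == "SlowDown"


def call_with_wait(fetch, wait_seconds=60, max_attempts=5, sleep=time.sleep):
    """Call fetch(); on a SlowDown error, wait and try again.

    fetch        -- zero-argument callable that performs one S3 request
    wait_seconds -- fixed pause before each retry (assumed value, tune as needed)
    max_attempts -- total number of calls before giving up and re-raising
    sleep        -- injectable so the function can be tested without waiting
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            # Anything other than SlowDown, or the final attempt, is re-raised.
            if not is_slowdown(exc) or attempt == max_attempts:
                raise
            sleep(wait_seconds)
```

With a boto3 S3 client `s3` already set up for authenticated access, a single download could then be wrapped as `call_with_wait(lambda: s3.get_object(Bucket=bucket, Key=key))`, where `bucket` and `key` are placeholders for your own values. Since the right rate depends on how many other people are downloading, a fixed wait is only a starting point; increasing `wait_seconds` after repeated SlowDown responses is a reasonable variation.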
