News Archive rate reduction


kasper...@gmail.com

Sep 20, 2022, 6:12:24 PM
to Common Crawl

Hi everyone,

I was recently downloading last month's news crawl via authenticated S3. About halfway through, I got a rate reduction error:

"botocore.exceptions.ClientError: An error occurred (SlowDown) when calling the GetObject operation (reached max retries: 4): Please reduce your request rate."

I was downloading at a rate of about 1 file per minute. What would be an appropriate rate?

Greg Lindahl

Sep 20, 2022, 9:31:35 PM
to common...@googlegroups.com
On Tue, Sep 20, 2022 at 03:12:24PM -0700, kasper...@gmail.com wrote:

> "botocore.exceptions.ClientError: An error occurred (SlowDown) when calling
> the GetObject operation (reached max retries: 4): Please reduce your
> request rate."

The usual advice is to temporarily slow down when you see this kind of
error. If you're requesting 1 file per minute, and you've seen 4
retries already, sleeping for several minutes would be good.

Even better is to explicitly wait 1 minute before retrying at all.
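
A minimal sketch of that fixed wait, in Python since the error above comes from botocore. The names `fetch_with_wait`, `fetch`, and `is_retryable` are hypothetical and not part of boto3; `fetch` stands for whatever call downloads one file, and the 60-second default simply mirrors the "wait 1 minute" advice.

```python
import time


def fetch_with_wait(fetch, is_retryable, wait_seconds=60.0, max_attempts=5,
                    sleep=time.sleep):
    """Call fetch(); on a retryable error, wait a fixed time and try again.

    fetch         -- zero-argument callable that downloads one file (assumption)
    is_retryable  -- callable taking the exception, True if we should wait
    wait_seconds  -- explicit pause before each retry
    max_attempts  -- total number of tries before giving up
    sleep         -- injectable so the pause can be replaced in tests
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not rate limits.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            sleep(wait_seconds)
```

With boto3, `fetch` could be something like a lambda wrapping the client's `download_file` call, and `is_retryable` a check for the SlowDown error; both are left to the caller here.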

-- greg


Michael M Behrendt

Sep 21, 2022, 2:01:32 AM
to common...@googlegroups.com

It's a bit unclear to me whether there is still a way to download from Common Crawl at high throughput (i.e. >>1 MB/sec, up to many GB/sec) from outside of AWS, and without incurring additional costs on any end.

Can someone please help clarify?

Apologies if this was already answered somewhere else and I missed the pointer.

kasper...@gmail.com

Sep 21, 2022, 10:46:54 AM
to Common Crawl
I was just using authenticated S3 as described here:
I can wrap the download in a try statement, but I was wondering whether there is a rate that avoids these errors.

Greg Lindahl

Sep 21, 2022, 12:37:46 PM
to common...@googlegroups.com
The ideal rate depends on how many other people are also downloading.
That's why I am recommending explicitly waiting when you see a 503
(SlowDown).
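
For spotting that condition in code, here is a rough sketch. It assumes the exception is a botocore `ClientError`, whose `response` attribute is a dict carrying `Error.Code` and `ResponseMetadata.HTTPStatusCode`; the helper name `is_slowdown` is hypothetical. It uses duck typing so it does not import botocore itself.

```python
def is_slowdown(exc):
    """Return True if exc looks like an S3 503 (SlowDown) response.

    Relies on the dict layout of botocore's ClientError.response;
    any exception without that attribute is treated as "not a slowdown".
    """
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code")
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return code == "SlowDown" or status == 503
```

A download loop could call this in its `except` clause and pause (for example `time.sleep(60)`) before retrying only when it returns True, re-raising everything else.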