Why yes, there is a 503 problem


Greg Lindahl

Jan 16, 2022, 8:24:53 PM
to common...@googlegroups.com
Common Crawl's data is in public buckets at Amazon AWS, thanks to a
generous donation of resources by Amazon to this non-profit project.

It does indeed seem that all(?) accesses to these buckets are currently
getting 503s. The rate limit is supposed to be extremely high (5500
GETs/sec), so I am guessing there's a policy issue involved.

Common Crawl's lone employee lives in the CET timezone and it's 2am
there. I suspect he'll be working on this problem on Monday morning in
his timezone. It might involve coordinating with AWS employees, and
Monday so happens to be a public holiday in the US (Martin Luther King
Jr Day).

So: please, hang in there, it will be fixed, but not necessarily very
quickly.

As an advisor to (and user of) Common Crawl, I share everyone's
frustration with Common Crawl being down like this!

-- greg


Sebastian Nagel

Jan 17, 2022, 4:37:33 AM
to common...@googlegroups.com
Hi Greg, hi everybody,

thanks for all the notices. I can confirm that there are issues
and that most but not all requests to s3://commoncrawl/ receive
an "HTTP 503 Slow Down". As far as I can see, the issue affects all
kinds of services, including our URL indexes (index.commoncrawl.org)
and the columnar index queried by Amazon Athena.

We're trying to get this fixed. But as Greg pointed out this
may take some time.

Note: to reduce the load on the bucket, a request for
https://index.commoncrawl.org/collinfo.json temporarily returns
an empty list.

Best,
Sebastian

Greg Lindahl

Jan 18, 2022, 2:27:47 PM
to common...@googlegroups.com
I see that Sebastian put the full collinfo.json file back, and right
now my test suite fully runs.

I got a direct email and a bugzilla bug for cdx_toolkit related to the
outage; nice to see people using my client code! I raise a
ValueError() if there are no crawls in collinfo.json, and that seems to
have been successful at informing people.
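
A check of this kind could look roughly like the sketch below. It is not cdx_toolkit's actual code: the function name `require_crawls` and the error text are made up for illustration, and it only assumes that the body of collinfo.json decodes to a JSON list.

```python
import json


def require_crawls(collinfo_text):
    """Parse the body of collinfo.json and fail loudly if it lists no crawls.

    Hypothetical helper for illustration only, not cdx_toolkit's real code.
    """
    crawls = json.loads(collinfo_text)
    if not crawls:
        # An empty list is what the index server returned during the outage.
        raise ValueError("no crawls listed in collinfo.json; the index may be down")
    return crawls
```

Raising early like this turns a confusing "no results" into an explicit error that points at the index rather than at the user's query.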

-- greg

Sebastian Nagel

Jan 19, 2022, 5:01:52 PM
to common...@googlegroups.com
Hi everybody,

> I see that Sebastian put the full collinfo.json file back

On Monday evening (GMT) the situation began to improve, with most
requests succeeding. I've restored collinfo.json and re-enabled
all workers on the index server.

The reason for the many "503 Slow Down" responses was an extraordinarily
high number of requests and a large egress volume from Saturday morning
(GMT) until Monday afternoon. We'll get support from Amazon's Open Data
Set team to keep spikes in data usage from causing slowdowns again.

Best,
Sebastian

Alan Gibson

Feb 7, 2022, 3:38:34 PM
to Common Crawl
Hi all, 

Thanks to all of you for your efforts on this. Unfortunately I'm getting 503 and a Slow Down message for every request. Has anyone else reported this problem being back?

Regards,

Alan Gibson

kasper...@gmail.com

Feb 7, 2022, 6:05:51 PM
to Common Crawl
Same for me. It's been on and off today.

Ammar Ammar

Feb 8, 2022, 8:44:08 AM
to Common Crawl
Hi all,
I am also having the same problem. Yesterday and today, most of my requests have been failing with a 503 error.

Sincerely,
Ammar

Ozgur Turel

Feb 8, 2022, 8:49:41 AM
to common...@googlegroups.com
Hello to all,

I am having the same issues while accessing WAT and WET files over HTTP. What is the recommended rate and method for downloading these files?

Regards,
Ozgur.

Suchin Gururangan

Feb 9, 2022, 3:45:39 AM
to Common Crawl
Same issues here! 

Max Kesin

Feb 9, 2022, 8:28:53 AM
to Common Crawl
Ditto, Feb 9th, 503 for every request I've tried.
Are there any viable mirrors?

Rod MacDougall

Feb 9, 2022, 10:46:14 AM
to Common Crawl
Hi all,

I have been encountering the same issue for pretty much all of today and at the end of yesterday too, even though I had only made a handful of requests. Error message (from index.commoncrawl.org):

{"message": "Internal Error: 503 Server Error: Slow Down for url: http://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2022-05/indexes/cdx-00294.gz"}

Gregor Kaczor

Feb 9, 2022, 11:05:19 AM
to Common Crawl
Same issues here. Common Crawl files are practically inaccessible.

Alan Gibson

Feb 11, 2022, 8:28:34 AM
to Common Crawl
Hi all,

FYI: It's working fine for me when I use the AWS cli. Here's an example of downloading a segment:

aws s3api get-object --range 'bytes=30680420-30681660' --bucket 'commoncrawl' --key 'crawl-data/CC-MAIN-2022-05/segments/1642320301217.83/warc/CC-MAIN-20220119003144-20220119033144-00465.warc.gz' out.gz

If you're not already doing so, I strongly recommend using byte ranges if you're only pulling selected urls.
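
A minimal sketch of building such a byte range, assuming you already know a record's offset and length (for example from an index lookup). The helper name `byte_range_header` is made up for illustration.

```python
def byte_range_header(offset, length):
    """Return an HTTP Range header value covering `length` bytes from `offset`.

    HTTP byte ranges are inclusive on both ends, so the last byte is
    offset + length - 1.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


# The range from the command above: 1241 bytes starting at offset 30680420.
print(byte_range_header(30680420, 1241))  # bytes=30680420-30681660
```

The same string works as the value of --range for the AWS CLI or as a Range header on a plain HTTP request.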

Regards,

Alan Gibson

Alan Gibson

Feb 11, 2022, 8:30:03 AM
to Common Crawl
The CC is so incredibly large (well into the petabyte range) that AFAIK it's only available via S3. There are very few other systems that could hold it, let alone deliver it for free.

kasper...@gmail.com

Feb 11, 2022, 6:39:16 PM
to Common Crawl
I still seem to be getting 503 errors. I tried a download on the news dataset. Anyone else have the same problem?

Max Kesin

Feb 13, 2022, 8:38:44 AM
to Common Crawl
same here

Alan Gibson

Feb 15, 2022, 1:48:26 PM
to Common Crawl
Does anyone know how many requests you should be able to do per X time? I'm downloading individual archived pages, so I'm doing queries in quick succession. I can do anywhere between 2 and 8 before I get a 503. 
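
For clients that hit a 503 after a few quick requests, one common client-side mitigation is to retry with exponential backoff. The sketch below is not an official recommendation from Common Crawl: the `SlowDown` exception, the `fetch` callable, and the retry and delay values are all assumptions chosen for illustration.

```python
import random
import time


class SlowDown(Exception):
    """Hypothetical exception a fetch function raises on an HTTP 503."""


def fetch_with_backoff(fetch, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially on SlowDown.

    The delay doubles after each failure and gets random jitter so that many
    clients do not retry in lockstep.
    """
    for attempt in range(max_tries):
        try:
            return fetch()
        except SlowDown:
            if attempt == max_tries - 1:
                raise  # out of retries, let the caller see the failure
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

Here `fetch` would wrap whatever request you make (an HTTP GET or an S3 call) and raise `SlowDown` when the response status is 503. Backoff only reduces pressure from your own client; it cannot fix throttling that affects the whole bucket.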

Colin Dellow

Feb 15, 2022, 1:59:50 PM
to Common Crawl
In February 2020, I was able to do 17,000 requests/second from a single a1.4xlarge instance in the us-east-1 region.

kasper...@gmail.com

Feb 15, 2022, 2:59:03 PM
to Common Crawl
Has there been any update on this? The news dataset does not seem to be accessible at all to me. Queries against the index also fail most of the time.

Markus Weston

Feb 15, 2022, 6:10:50 PM
to Common Crawl
Yes, I just tried to access the index warc.paths.gz on two different releases, but I'm getting 503 errors. I don't think I've accessed anything on S3 all day. I'm sure I'm not blocked because of high access, though maybe someone sharing my network is?

Sebastian Nagel

Feb 16, 2022, 3:49:13 AM
to common...@googlegroups.com
Hi Markus, hi everybody,

we're aware of the elevated rate of 503s and are working on a solution
together with the Open Data Set team at Amazon. We'll keep you updated!

Thanks to everybody reporting the issues. Much appreciated!

> I'm sure I'm not blocked because of high access, though
> maybe someone sharing my network is?

To confirm again: the 503s are not "targeted" at any individual
users. It's about the bucket s3://commoncrawl/ in general.
So, you're doing nothing wrong.

Thanks for your patience!

Best,
Sebastian

David Pennington

Feb 16, 2022, 9:48:30 AM
to common...@googlegroups.com
Thank you Sebastian, I appreciate your help.

Do you have any recommendations on how best to copy one of the indexes? Does pulling from inside AWS still work? Should we set up a proxy on an EC2 instance? I assume copying to our own S3 bucket might also be a way to work around this.

GetObjectTagging is not allowed on the objects, so the following didn't work for me.

aws s3 sync s3://commoncrawl/crawl-data/CC-MAIN-2022-05/ s3://uniqueNameSpaceHere/CC-MAIN-2022-05

Sebastian Nagel

Feb 16, 2022, 10:45:30 AM
to common...@googlegroups.com
Hi David,

> Does pulling from inside AWS still work?

The 503s also happen if you're requesting the data from inside AWS,
and also if the data is accessed via services such as Amazon Athena.

> GetObjectTagging is not allowed on the objects

Could you report the details of the error?

GetObjectTagging should be allowed, at least, it's not configured
otherwise. I've also successfully tried
  aws --profile "non-commoncrawl-user" s3api \
    get-object-tagging --bucket commoncrawl \
    --key crawl-data/CC-MAIN-2022-05/segments/1642320306346.64/wet/CC-MAIN-20220128212503-20220129002503-00719.warc.wet.gz
and also "s3 sync" (but only on the wet/ folder of a single segment).


Otherwise: if using a recent version of the AWS CLI you could try the option
aws s3 sync --copy-props none ...
see
https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html


Best,
Sebastian

Jason Duke

Feb 18, 2022, 2:48:40 AM
to common...@googlegroups.com
Hi all.

I'm seeing the 503 problem too, but while using Common Crawl data via AWS's Athena service.

I presume it's the same issue as discussed above, or should it be exempt from the throttling?
-- 
Jason Duke


Sebastian Nagel

Feb 18, 2022, 3:01:46 AM
to common...@googlegroups.com
Hi Jason,

> I presume it's the same issue as discussed above, or should it be
> exempt from the throttling?

The 503s affect all users regardless of location or the
service used. Over the last few days the situation has improved;
I was able to query the columnar index via Athena. However,
we're still working on a final solution.

Thanks for your patience.

Best,
Sebastian


Alan Gibson

Feb 25, 2022, 5:27:13 AM
to Common Crawl
The 503 problem looks to me to be fixed. I've done 60K byte-range requests over the last 24 hours or so with no failures.

kasper...@gmail.com

Mar 21, 2022, 6:25:55 PM
to Common Crawl
I seem to be getting 503 errors quite regularly again when downloading the news crawl. Anyone else have the same problem?

Sebastian Nagel

Mar 24, 2022, 6:29:57 PM
to common...@googlegroups.com
Hi,

could you share a few details about the access method and the location
you're accessing the news data from?

Just for your information: a new way to access the data via CloudFront
has been implemented (thanks to the AWS Open Data Set team!), please see
https://commoncrawl.org/2022/03/introducing-cloudfront-access-to-common-crawl-data/
https://commoncrawl.org/access-the-data/

Please note that on April 4th we'll enforce usage of the new access
scheme!

For the news crawl data, this means that only authenticated AWS users
can continue to use the S3 API to *list the WARC files* written by the
news crawler. Other users need to use the provided WARC file listings on
https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html
or on the yearly landing pages linked there.
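
For code that still holds s3://commoncrawl/ paths, a minimal sketch of rewriting them to the HTTPS endpoint follows. It assumes that object keys map one-to-one onto paths under https://data.commoncrawl.org/, which is inferred from the listing URL above rather than stated here; the helper name `to_https_url` is made up for illustration.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTPS_PREFIX = "https://data.commoncrawl.org/"


def to_https_url(path):
    """Map an s3://commoncrawl/ URI or a bare object key to an HTTPS URL.

    Assumes keys map one-to-one onto paths under data.commoncrawl.org.
    """
    if path.startswith(S3_PREFIX):
        path = path[len(S3_PREFIX):]
    return HTTPS_PREFIX + path.lstrip("/")


print(to_https_url("s3://commoncrawl/crawl-data/CC-NEWS/index.html"))
```

See the access instructions linked above for the authoritative description of the new scheme.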

Best,
Sebastian

Karan Joshi

Mar 25, 2022, 2:33:51 PM
to Common Crawl

Hi Sebastian/all, 

I am currently getting 503 errors as well while accessing WARC files for CC-MAIN-2022-05. This happens with both access methods: the S3 API and HTTP requests to https://data.commoncrawl.org/. Is there a recommended request rate?

Thanks,

Karan

Sebastian Nagel

Mar 26, 2022, 5:18:09 PM
to common...@googlegroups.com
Hi Karan,

thanks for implementing the new access scheme in your code and trying
both schemes. I can reproduce the issue and can confirm that many (about
50% for my tests) requests for data of CC-MAIN-2022-05 fail. I'll
continue to monitor for ongoing issues. In the worst case, we would have
to enforce the new access schemes one week ahead of the scheduled time.

Best,
Sebastian
