No new CC-NEWS files are listed since 2023-10-23 15:36:50


Nikolay Kushin

Oct 24, 2023, 3:56:15 AM
to Common Crawl
Hi there!

First of all, thank you to the Common Crawl team for the great work providing access to the Common Crawl dataset! I think every member of the community appreciates the effort the team puts in to keep everything running smoothly and reliably.

Now the problem: as stated in the subject, new CC-NEWS archives stopped appearing yesterday; the last one listed is from 2023-10-23 15:36:50. Is there any reason to worry?

Is there a status page or something similar that shows the current state of the crawler? That would help avoid spamming the forum with questions like this.

Best regards,
Nikolay

Julien Nioche

Oct 25, 2023, 4:25:09 AM
to common...@googlegroups.com
Hi Nikolay

The news crawler has been paused because we need to do some maintenance work on it. We do not have a status page on the site, but that would be a nice thing to add in the future.
There is no clear ETA for when the news archives will resume; I will send a message to the mailing list when they do.

Julien 

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/a8b4d682-eb29-4a83-aea6-8cd4531fa761n%40googlegroups.com.


Nikolay Kushin

Oct 25, 2023, 4:34:28 AM
to Common Crawl
Hi Julien,

Thank you for the reply! I really appreciate it!

Nikolay

Julien Nioche

Oct 27, 2023, 12:47:09 PM
to common...@googlegroups.com
Dear community, 

The news crawler is back in action, and you should see the WARC files appear in the S3 bucket at the usual location. There will be more maintenance work in the short to medium term, but I will notify the list when that happens.

Thanks!

Julien

Roi Krakovski

Oct 28, 2023, 3:47:21 AM
to Common Crawl
Hi, 
I am still getting this error:
<Error>
<Code>SlowDown</Code>
<Message>Please reduce your request rate.</Message>
<RequestId>PYFT3XWT5WWX86P0</RequestId>
<HostId>jkBQyV0yC2JxaRIeFOEi0LCO307o7jdRq851uMHcNLA5SEqXIdzlJreeDl94q+f/AkZ+NdXi7Ds=</HostId>
</Error>

when trying to load a file:
https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/10/CC-NEWS-20231028060853-00054.warc.gz

Any idea why?
Thanks!
Roi

Greg Lindahl

Oct 28, 2023, 12:08:08 PM
to common...@googlegroups.com
Roi,

Our Amazon S3 bucket has been overloaded for about the last week and a
half, throwing a lot of errors, mostly 503 (slow down).

Here's how you can patiently download the file you want, retrying
(with backoff) until you get a 200:

wget -t 0 --retry-on-http-error=503 https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/10/CC-NEWS-20231028060853-00054.warc.gz

-- greg
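
[Editor's note] For readers who fetch these files from a script rather than with wget, the same "retry with backoff until you get a 200" idea can be sketched in Python. This is only an illustration, not something posted in the thread: the function names, the delay values (1 s doubling up to 60 s), the jitter, and the attempt limit are all assumptions. It also reads the whole response into memory, which is unsuitable for large WARC files; a real script would stream to disk.

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(base=1.0, cap=60.0):
    """Yield exponentially growing delays in seconds, capped at `cap`.

    The base and cap values are illustrative assumptions.
    """
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_attempts=10, retry_statuses=(503,)):
    """Return the response body, retrying while the server answers 503.

    `opener` and `sleep` are injectable so the retry logic can be
    exercised without touching the network (hypothetical parameters).
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on any other status, or once attempts are exhausted.
            if err.code not in retry_statuses or attempt == max_attempts:
                raise
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(next(delays) * random.uniform(0.5, 1.0))
```

Calling `fetch_with_backoff(url)` with the file URL from the message above would keep retrying on 503 and return the bytes once the server answers successfully. Unlike `wget -t 0`, this sketch stops after `max_attempts` tries and re-raises the last error.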
