Can't download archives

John Walter

Jul 20, 2022, 4:57:53 AM
to Common Crawl
I can't download archives; I get the message "Please reduce your request rate."
This is with the June/July 2022 crawl archive.
What can I do?
Thanks.

Sebastian Nagel

Jul 20, 2022, 5:30:15 AM
to common...@googlegroups.com
Hi John,

could you share more details and context about the access method and the
location you're accessing the data from?

- which file formats (WARC, WAT, WET files, etc.)?
- from which IP (range), alternatively the location?
- running how many concurrent requests, parallel processes or threads?
- which access method or the requested URL leading to the error?

If possible, please share some log snippets showing the error.

In case you cannot publicly share the details in this discussion group,
you may contact us directly via in...@commoncrawl.org - Thanks!

Best,
Sebastian

John Walter

Jul 20, 2022, 5:35:41 AM
to Common Crawl
https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2022-27/indexes/cdx-00000.gz

For example, I can't download this archive. Some archives I can download, but some I can't.

I tried this from many proxies/VPNs, and everywhere I get the same message:

<Error>
<Code>SlowDown</Code>
<Message>Please reduce your request rate.</Message>
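A `SlowDown` response like the one above is commonly handled on the client side by retrying with exponential backoff. The following is a minimal sketch, not an official Common Crawl client: it assumes the error arrives as an HTTP 503 whose body contains `<Code>SlowDown</Code>`, and the helper names (`backoff_delays`, `is_slowdown`, `fetch_with_backoff`) are hypothetical. Backoff only helps when the client's own request rate is the cause; it cannot work around a server-side problem.

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(retries, base=1.0, cap=60.0):
    """Yield exponentially growing delays in seconds, capped at `cap`."""
    for attempt in range(retries):
        yield min(cap, base * (2 ** attempt))


def is_slowdown(status, body):
    """Assumption: SlowDown is an HTTP 503 whose body carries the error code."""
    return status == 503 and b"<Code>SlowDown</Code>" in body


def fetch_with_backoff(url, retries=5, opener=urllib.request.urlopen,
                       sleep=time.sleep):
    """Fetch `url`, retrying up to `retries` times on SlowDown responses."""
    # The trailing None marks the final attempt, after which errors propagate.
    for delay in list(backoff_delays(retries)) + [None]:
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            body = err.read()
            if delay is None or not is_slowdown(err.code, body):
                raise
            # Add jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Called as `fetch_with_backoff(url)` with the default of five retries, this waits roughly 1, 2, 4, 8 and 16 seconds (plus jitter) between attempts before giving up. Any other HTTP error is re-raised immediately.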

Sebastian Nagel

Jul 20, 2022, 6:29:47 AM
to common...@googlegroups.com
Hi John,

I can reproduce the errors for the prefixes

cc-index/
(at least many collections there)

crawl-data/CC-MAIN-2022-27/
(but also occasionally for other crawl archives)

We'll do all that is necessary to get the problem resolved.

Thanks for reporting!

Note: the URL index (index.commoncrawl.org) is also affected
and fails to read from the prefix cc-index/. For now, a request for
https://index.commoncrawl.org/collinfo.json will temporarily return
an empty list.
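A client that depends on the URL index could check for this empty-list state before issuing queries. The sketch below is only illustrative: it assumes that collinfo.json is a JSON list and treats an empty list as "index temporarily unavailable". The function names are hypothetical.

```python
import json
import urllib.request

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def parse_collinfo(raw):
    """Decode collinfo.json bytes and return the list of collections."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list in collinfo.json")
    return data


def index_available(raw):
    """True if at least one collection is listed, False for an empty list."""
    return len(parse_collinfo(raw)) > 0


def check_index(url=COLLINFO_URL, timeout=30):
    """Fetch collinfo.json and report whether any collections are listed."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return index_available(resp.read())
```

Calling `check_index()` performs one HTTP request. If it returns `False`, waiting and trying again later is likely more useful than sending index queries.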

Best,
Sebastian

Sebastian Nagel

Jul 20, 2022, 9:37:15 AM
to common...@googlegroups.com
Hi everybody,

For about an hour now, everything seems to be back to normal.

The URL index (index.commoncrawl.org) is also fully
functional again.

Best,
Sebastian

John Walter

Jul 20, 2022, 4:20:33 PM
to Common Crawl
Sorry, my friend, but I still have this problem...