Common Crawl Index Down

Eddie Johnson

Jul 17, 2016, 7:34:03 PM
to Common Crawl
The Common Crawl Index is returning a 504 error.  Is the index still being maintained, or is it no longer supported?  

Btw, I'm a big fan of Common Crawl.  Thanks for the great free resource :)

Sylvain Zimmer

Jul 17, 2016, 7:46:40 PM
to common...@googlegroups.com
Hi!

The latest crawl is working properly on my end.

Maybe you are still using the old URLs to access it? You can learn about the new URLs here:
https://groups.google.com/forum/#!topic/common-crawl/L4-Sxz_wkTg

Cheers,


--
Sylvain Zimmer

blog: sylvinus.org
mobile: +33 6 64 67 61 71 / +1 646 266 1588

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at https://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.

Sebastian Nagel

Jul 18, 2016, 6:32:24 AM
to common...@googlegroups.com
Hi Eddie, hi Sylvain,

In case Eddie's question is about the Common Crawl index servers
(and not about the location of the index files on AWS S3) ...

The Common Crawl index server at
   http://index.commoncrawl.org/
is still maintained and regularly updated to cover the monthly published
crawl archives. The server crashed today at 00:50 UTC but was properly
restarted and available again 30 seconds later, according to the logs.
The server is currently under heavy load: 550,000 requests in the
9 hours since the crash. That's why it may be temporarily unavailable.
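
For occasional lookups against the index server, a query URL can be sketched as follows. This is a minimal sketch, not the server's documented interface: the `<crawl-id>-index` endpoint layout and the `url` / `output` parameters are assumptions based on the CDX-style API the server exposes, and `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Host taken from the message above; the path layout below is an assumption.
INDEX_SERVER = "http://index.commoncrawl.org"


def build_query_url(crawl_id, url_pattern, output="json"):
    """Build a CDX-style query URL for one crawl (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": output})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


if __name__ == "__main__":
    # Prints the URL only; no request is sent to the server.
    print(build_query_url("CC-MAIN-2016-26", "example.com"))
```

Given the load described above, a client doing many such lookups should throttle itself or use the index files directly.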

For bulk querying, it's recommended to access the index files directly at
  s3://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-00xxx.gz
This path is for the June crawl ("CC-MAIN-2016-26"). There are 300 index files, so you need
to replace "xxx" with 000 through 299. There is also an offset index to the index files:
  s3://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cluster.idx
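
The naming scheme above can be expanded into the full list of 300 paths with a short script. This is a sketch under stated assumptions: `index_file_paths` and `to_http_url` are hypothetical helper names, and the HTTP form assumes the bucket is reachable at `commoncrawl.s3.amazonaws.com`, as in the URL posted later in this thread.

```python
# Prefix copied from the S3 paths quoted in the message above.
S3_PREFIX = "s3://commoncrawl/cc-index/collections"


def index_file_paths(crawl_id="CC-MAIN-2016-26", count=300):
    """Return the S3 paths cdx-00000.gz ... cdx-00299.gz for one crawl."""
    return [
        f"{S3_PREFIX}/{crawl_id}/indexes/cdx-{i:05d}.gz"
        for i in range(count)
    ]


def to_http_url(s3_path):
    """Rewrite s3://commoncrawl/<key> to the assumed HTTP form of the bucket."""
    key = s3_path[len("s3://commoncrawl/"):]
    return f"http://commoncrawl.s3.amazonaws.com/{key}"


if __name__ == "__main__":
    paths = index_file_paths()
    print(len(paths), paths[0], paths[-1])
    print(to_http_url(paths[0]))
```

Note that the file number is zero-padded to five digits, so the first file is `cdx-00000.gz`.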

The old index server
  http://urlsearch.commoncrawl.org/
is currently down. We haven't yet decided whether to fix it
or to shut it down for good to save the time required for its maintenance.

Best,
Sebastian

Sebastian Nagel

Jul 18, 2016, 8:52:33 AM
to common...@googlegroups.com
Hi again,

Correction: the index server was indeed down over the weekend and was
fixed and restarted by one of our volunteers last night (UTC). Thanks!

Thanks, Eddie, for reporting the problem!

Sebastian

Spider99

Sep 18, 2016, 11:10:41 AM
to Common Crawl

http://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-000.gz

I am trying this URL to download the index files, but I am not able to download them. Kindly help. New to CC.

Spider99

Sep 18, 2016, 11:30:33 AM
to Common Crawl
Hi,

I was able to download the indexes. Now I want to set up an index server locally, like the one cdx-index-client talks to. How can I do that? Kindly help me with this. Thanks.

Sebastian Nagel

Sep 19, 2016, 5:11:02 AM
to common...@googlegroups.com
Hi,

The code of the index server can be found here:
https://github.com/commoncrawl/cc-index-server
(forked, with a few minor modifications, from Ilya Kreymer's
https://github.com/ikreymer/cc-index-server)

It's easy to set it up locally or on a small AWS EC2 instance.

Please open a new thread for new questions, with a subject line
describing the actual problem. That will help others find
the questions and answers in the future.

Thanks,
Sebastian

On 09/18/2016 05:30 PM, Spider99 wrote:
> Hi,
>
> I was able to download indexes, now i want to create a index server like cdx-index-client locally
> how can i do that kindly help me on this. Thanks
>
> On 18 July 2016 at 01:46, Sylvain Zimmer <syl...@sylvainzimmer.com> wrote:
>
> Hi!
>
> The latest crawl is working properly on my end.
>
> Maybe you are still using the old URLs to access it? You can learn about the new URLs here:
> https://groups.google.com/forum/#!topic/common-crawl/L4-Sxz_wkTg

Spider99

Sep 19, 2016, 1:05:46 PM
to Common Crawl
Thanks, Sebastian. Solved.