Common Crawl Index Down

Eddie Johnson

Jul 17, 2016, 7:34:03 PM

to Common Crawl

The Common Crawl Index is returning a 504 error. Is the index still being maintained, or is it no longer supported?

Btw, I'm a big fan of Common Crawl. Thanks for the great free resource :)

Sylvain Zimmer

Jul 17, 2016, 7:46:40 PM

to common...@googlegroups.com

Hi!

The latest crawl is working properly on my end.

Maybe you are still using the old URLs to access it? You can learn about the new URLs here:

https://groups.google.com/forum/#!topic/common-crawl/L4-Sxz_wkTg

Cheers,

--
Sylvain Zimmer

blog: sylvinus.org
mobile: +33 6 64 67 61 71 / +1 646 266 1588


Sebastian Nagel

Jul 18, 2016, 6:32:24 AM

to common...@googlegroups.com

Hi Eddie, hi Sylvain,

In case Eddie's question is about the Common Crawl index servers
(and not about the location of the index files on AWS S3) ...

The Common Crawl index server at

http://index.commoncrawl.org/

is still maintained and regularly updated to cover the monthly published
crawl archives. The server crashed today at 00:50 UTC but was properly
restarted and available again 30 seconds later, according to the logs.

The server is currently under heavy load: 550,000 requests in the
9 hours since it crashed. That's why it may be temporarily unavailable.
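For clients that hit a 504 while the server is overloaded, one common approach is to retry with exponential backoff rather than re-sending requests immediately. The following is only a minimal sketch of that idea, not something the index server requires: `retry_with_backoff` and the `fetch` callable (assumed to return a `(status, body)` pair) are hypothetical names introduced here for illustration.

```python
import time


def retry_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-5xx status or attempts run out.

    fetch is a hypothetical zero-argument callable returning (status, body).
    The delay doubles after every failed attempt: 1s, 2s, 4s, ...
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = fetch()
        if status < 500:
            # Success or a client error: retrying would not help.
            return status, body
        if attempt < max_attempts - 1:
            # Server-side error such as 504: wait before the next try.
            sleep(base_delay * 2 ** attempt)
    # All attempts failed; hand the last response back to the caller.
    return status, body
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting; in normal use the default `time.sleep` applies.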

For bulk querying, it's recommended to access the index files directly at
s3://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-00xxx.gz

This path is for the June crawl ("CC-MAIN-2016-26"). There are 300 index files, so
replace "xxx" with 000 through 299. There is also an offset index to the index files:
s3://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cluster.idx
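As a rough illustration of the naming scheme described above, the 300 index file locations can be generated rather than typed by hand. This is only a sketch: the function names are made up for this example, and the crawl ID, file count, and path layout are taken directly from the message above and may differ for other crawls.

```python
# Values quoted in the message above; other crawls may use different ones.
CRAWL_ID = "CC-MAIN-2016-26"
NUM_INDEX_FILES = 300
S3_PREFIX = "s3://commoncrawl/cc-index/collections"


def index_paths(crawl_id=CRAWL_ID, count=NUM_INDEX_FILES, prefix=S3_PREFIX):
    """Return the paths cdx-00000.gz .. cdx-00299.gz for one crawl."""
    base = f"{prefix}/{crawl_id}/indexes"
    # "cdx-00xxx" with xxx in 000..299 is a five-digit, zero-padded number.
    return [f"{base}/cdx-{i:05d}.gz" for i in range(count)]


def cluster_index_path(crawl_id=CRAWL_ID, prefix=S3_PREFIX):
    """Return the path of the offset index (cluster.idx) for one crawl."""
    return f"{prefix}/{crawl_id}/indexes/cluster.idx"


if __name__ == "__main__":
    paths = index_paths()
    print(len(paths))
    print(paths[0])
    print(paths[-1])
    print(cluster_index_path())
```

The resulting list can be handed to whatever S3 client or download tool is already in use.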

The old index server
http://urlsearch.commoncrawl.org/

is currently down. We haven't decided yet whether to fix it
or shut it down for good to save the time required for its maintenance.

Best,

Sebastian

Sebastian Nagel

Jul 18, 2016, 8:52:33 AM

to common...@googlegroups.com

Hi again,

Correction: the index server was indeed down over the weekend and was
fixed and restarted by one of our volunteers last night (UTC). Thanks!

Thanks, Eddie, for reporting the problem!

Sebastian

Spider99

Sep 18, 2016, 11:10:41 AM

to Common Crawl

http://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-000.gz

I am trying this URL to download the index files, but I am not able to download them. Kindly help. New to CC.

Spider99

Sep 18, 2016, 11:30:33 AM

to Common Crawl

Hi,

I was able to download the indexes. Now I want to set up an index server locally, like cdx-index-client. How can I do that? Kindly help me with this. Thanks.

Sebastian Nagel

Sep 19, 2016, 5:11:02 AM

to common...@googlegroups.com

Hi,

The code of the index server can be found here:
https://github.com/commoncrawl/cc-index-server
(forked, with a few minor modifications, from Ilya Kreymer's
https://github.com/ikreymer/cc-index-server)

It's easy to set it up locally or on a small AWS EC2 instance.

Please open a new thread for new questions, with a subject line
describing the actual problem. That will help others find
the questions and answers in the future.

Thanks,
Sebastian


Spider99

Sep 19, 2016, 1:05:46 PM

to Common Crawl

Thanks, Sebastian. Solved.
