Hi,
the code of the index server can be found here:
https://github.com/commoncrawl/cc-index-server
(forked, with a few minor modifications, from Ilya Kreymer's
https://github.com/ikreymer/cc-index-server).
It's easy to set up locally or on a small AWS EC2 instance.
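As a rough sketch of what querying such a local instance could look like, assuming it exposes the same CDX query interface as index.commoncrawl.org (a per-crawl "<crawl>-index" endpoint taking "url" and "output" parameters). The host, port, and the helper name build_query_url are placeholders for illustration, not taken from the repository:

```python
from urllib.parse import urlencode


def build_query_url(server, crawl, url_pattern, output="json"):
    """Build a CDX query URL for one crawl collection.

    `server` is a hypothetical base URL of a locally running index
    server (e.g. "http://localhost:8080"); adjust it to your setup.
    """
    # urlencode percent-escapes the URL pattern so it is safe in a query string
    query = urlencode({"url": url_pattern, "output": output})
    return f"{server.rstrip('/')}/{crawl}-index?{query}"


if __name__ == "__main__":
    # Assumed local address; the real port depends on how the server is started.
    print(build_query_url("http://localhost:8080", "CC-MAIN-2016-26", "commoncrawl.org/*"))
```

The resulting URL can be fetched with any HTTP client. With output=json, the public server returns one JSON record per line, and a local copy would be expected to behave the same way.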
Please open a new thread for new questions, with a subject
describing the actual problem. That will help others find
the questions and answers in the future.
Thanks,
Sebastian
On 09/18/2016 05:30 PM, Spider99 wrote:
> Hi,
>
> I was able to download the indexes. Now I want to create an index server like cdx-index-client locally.
> How can I do that? Kindly help me with this. Thanks
>
> On Monday, July 18, 2016 at 4:02:24 PM UTC+5:30, Sebastian Nagel wrote:
>
> Hi Eddie, hi Sylvain,
>
> in case Eddie's question is about the Common Crawl index servers
> (and not about the location of the index files on AWS S3) ...
>
> The Common Crawl index server at
> http://index.commoncrawl.org/
> is still maintained and regularly updated to cover the monthly published
> crawl archives. The server crashed today at 00:50 UTC but was properly
> restarted and available again 30 seconds later, according to the logs.
> The server is currently under heavy load: 550,000 requests in the
> 9 hours since the crash. That's why it may be temporarily unavailable.
>
> For bulk querying, it's recommended to access the index files directly at
> s3://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-00xxx.gz
> This example is for the June crawl ("CC-MAIN-2016-26"). There are 300 index files; replace
> "xxx" with 000 to 299. There is also an offset index to the index files:
> s3://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cluster.idx
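The file naming above can be sketched in a few lines of Python. The helper names are made up for illustration. The cluster.idx line layout assumed in parse_cluster_line (URL key and timestamp, then tab-separated file name, offset, length, and sequence number) is the usual ZipNum convention and is not stated in this thread, so verify it against a real line before relying on it:

```python
CRAWL = "CC-MAIN-2016-26"
BASE = "s3://commoncrawl/cc-index/collections/{crawl}/indexes/"


def index_paths(crawl=CRAWL, num_files=300):
    """Return the S3 paths cdx-00000.gz ... cdx-00299.gz for one crawl."""
    base = BASE.format(crawl=crawl)
    return [f"{base}cdx-{i:05d}.gz" for i in range(num_files)]


def cluster_idx_path(crawl=CRAWL):
    """Return the S3 path of the offset index for one crawl."""
    return BASE.format(crawl=crawl) + "cluster.idx"


def parse_cluster_line(line):
    """Parse one cluster.idx line (ASSUMED layout, see the note above)."""
    key, filename, offset, length, seq = line.rstrip("\n").split("\t")
    urlkey, timestamp = key.split(" ", 1)
    return {
        "urlkey": urlkey,
        "timestamp": timestamp,
        "filename": filename,    # which cdx-*.gz file holds the block
        "offset": int(offset),   # byte offset of the compressed block
        "length": int(length),   # byte length of the compressed block
        "seq": int(seq),
    }


if __name__ == "__main__":
    paths = index_paths()
    print(len(paths), paths[0], paths[-1])
    print(cluster_idx_path())
```

If the assumed layout holds, the offset and length of an entry identify a byte range in the named cdx file, so a lookup only needs to fetch that range rather than the whole file.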
>
> The old index server
> http://urlsearch.commoncrawl.org/
> is currently down. We haven't decided yet whether to fix it
> or shut it down for good to save the time required for its maintenance.
>
> Best,
> Sebastian
>
> blog: sylvinus.org <http://sylvinus.org>
> On Sun, Jul 17, 2016 at 7:34 PM, Eddie Johnson <e...@ed-johnson.com> wrote:
>
> The Common Crawl Index is returning a 504 error. Is the index still being maintained,
> or is it no longer supported?
>
> Btw, I'm a big fan of Common Crawl. Thanks for the great free resource :)
>
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl"
> group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com.
> To post to this group, send email to common...@googlegroups.com.
> <https://groups.google.com/group/common-crawl>.
> <https://groups.google.com/d/optout>.
>