Broken link to the (not columnar) index?

Henry S. Thompson

Apr 17, 2024, 12:25:44 PM
to common-crawl
Several pages on the site (see below) contain this link:

https://index.commoncrawl.org/

which currently results in

Common Crawl Index Server Error
None

At least the following contain the above link:

https://commoncrawl.org/overview
https://commoncrawl.org/blog/announcing-the-common-crawl-index

Is this link rot or a server bug?

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
e-mail: h...@inf.ed.ac.uk
URL: https://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

Tom Morris

Apr 17, 2024, 10:52:20 PM
to common...@googlegroups.com
> Is this link rot or a server bug?

I think that's the correct URL, but it looks like the Index Server is
currently broken (at least the home page).

It's still possible to access the API directly if you build your own
URL, e.g. the example from the README
https://index.commoncrawl.org/CC-MAIN-2015-06-index?url=commoncrawl.org
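
For anyone who wants to script this, here is a minimal sketch of building such a URL by hand. It only assembles the string and does not contact the server. The function name build_cdx_url and the INDEX_HOST constant are made up for illustration; the URL pattern is copied from the README example above, and any extra query parameters you pass are assumptions to check against the index server documentation.

```python
from urllib.parse import urlencode

# Host taken from the link discussed in this thread.
INDEX_HOST = "https://index.commoncrawl.org"


def build_cdx_url(crawl_id: str, url: str, **params: str) -> str:
    """Assemble an index query URL of the form <host>/<crawl_id>-index?url=...

    crawl_id is a crawl label such as "CC-MAIN-2015-06".
    Extra keyword arguments are appended as query parameters unchanged;
    whether the server accepts them is not checked here.
    """
    query = urlencode({"url": url, **params})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


if __name__ == "__main__":
    # Reproduces the README example quoted above.
    print(build_cdx_url("CC-MAIN-2015-06", "commoncrawl.org"))
```

Fetching the resulting URL with your HTTP client of choice should then bypass the broken home page, as long as the API itself keeps responding.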

Tom

Greg Lindahl

Apr 18, 2024, 2:37:50 PM
to common...@googlegroups.com
The deal with the CDX index is that while the associated website is
broken, the API mostly works. Most people use it via the API, so we
only get a complaint every few months.

I believe there's an N^2 algorithm in the code that is getting worse
over time, as the number of crawls rises. I haven't had time to find
it.

-- greg