index.commoncrawl.org


Greg Lindahl

May 11, 2024, 9:00:12 PM
to Common Crawl
index.commoncrawl.org has not been working well for a long time, due to
a combination of a potential N^2 software bug and heavy usage. I
recently turned on fail2ban in the hope of throttling some of the IP
addresses that hit it many times per second. If you see it behaving
worse than before, please let me know.

Ashim Mahara (RIT Student)

May 11, 2024, 9:11:06 PM
to common...@googlegroups.com
It is simple enough to install and host a private index server. 

Download the collections and host using pywb2. 

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/CABQM%2BAw-C_TgfS1Uf5gtdr9UwBCK%2BK_ZxoUqG_bGK2vdiCL4cQ%40mail.gmail.com.

Greg Lindahl

May 11, 2024, 9:12:18 PM
to common...@googlegroups.com
It's simple enough if you're a heavy user of the cdx index -- for
casual users, we'd like our cdx server to continue to work!