@Greg, @Jay, thank you both for your valuable input! (@Bob did you accidentally post in the wrong thread? This is completely unrelated, no? Sorry if not)
It is a bit hard to get an overview of existing CC-based search implementations, but Chat Noir looks good, and it seems to be backed by powerful hardware too.
About my title-based search engine idea outlined above: I ran a few tests and built a demo application at
https://link-archive.org (ready for testing). It's based on just one of the ~56k WAT files from a 2017 crawl.
On my machine, processing a single WAT file takes
- 17 seconds for downloading
- 12 seconds for unzipping
- 1 second for extracting all URLs and titles (I wrote a small Go script for that since I didn't need sophisticated JSON parsing - rough sketch below)
and inserting the values into the DB takes another 1-5 seconds. I simply used SQLite, which has pretty usable full-text search capabilities. For scaling up, a database server is probably better, but I don't see any advantage in spinning up Elasticsearch for something as basic as word matching over two columns.

After these tests I still think that computing time (processing all existing dumps plus ongoing maintenance) and storage space pose the greatest hurdle. I estimate that even the smallest possible setup will take over 10 TB. A bit too much for my taste, so I'll probably put this on hold. Maybe something to pursue at university again some day... which is unfortunate, because I would actually use this a lot myself.
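For illustration, a stripped-down version of that extract-and-insert step could look roughly like the sketch below. This is not my actual script, just the general idea: the file names, the table layout, the "WARC-Target-URI:" / "Title" string matching and the mattn/go-sqlite3 driver are placeholders and assumptions.

```go
// Minimal sketch: pull URL + title out of a gzipped WAT file and push them
// into an SQLite FTS5 table. Field names and paths are assumptions, not the
// demo site's actual code.
package main

import (
	"bufio"
	"compress/gzip"
	"database/sql"
	"log"
	"os"
	"strings"

	_ "github.com/mattn/go-sqlite3" // may need `go build -tags sqlite_fts5` for FTS5 support
)

func main() {
	f, err := os.Open("example.wat.gz") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}

	db, err := sql.Open("sqlite3", "index.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// FTS5 virtual table: full-text search over url and title.
	if _, err := db.Exec(`CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title)`); err != nil {
		log.Fatal(err)
	}

	tx, err := db.Begin() // one transaction per WAT file keeps inserts fast
	if err != nil {
		log.Fatal(err)
	}
	ins, err := tx.Prepare(`INSERT INTO pages(url, title) VALUES(?, ?)`)
	if err != nil {
		log.Fatal(err)
	}

	sc := bufio.NewScanner(gz)
	sc.Buffer(make([]byte, 1024*1024), 32*1024*1024) // WAT JSON records are long single lines

	var currentURL string
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.HasPrefix(line, "WARC-Target-URI:"):
			currentURL = strings.TrimSpace(strings.TrimPrefix(line, "WARC-Target-URI:"))
		case strings.Contains(line, `"Title":"`):
			// Crude extraction, no real JSON parsing; titles with escaped quotes get truncated.
			rest := line[strings.Index(line, `"Title":"`)+len(`"Title":"`):]
			if end := strings.Index(rest, `"`); end >= 0 && currentURL != "" {
				if _, err := ins.Exec(currentURL, rest[:end]); err != nil {
					log.Fatal(err)
				}
			}
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```

Wrapping all inserts for one WAT file in a single transaction is what keeps the SQLite part down to a few seconds; with per-row commits the insert step would dominate.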
Page rank:
Good idea, though it adds a whole new layer of work. Maybe a super basic ranking based on external backlink count (no magic - rough sketch below) would be feasible. That way you would treat the web the way it used to work 15 years ago: no walled gardens, and blogs interlinking with each other. Treat it the way you want it to be, right?! The more a blog post is linked to from unrelated sites, the more likely it is to be relevant and/or good. Still pretty exploitable, though.
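To make "no magic" concrete: something along these lines would already do it. The edge format (source page URL, linked URL) is purely hypothetical; it just counts distinct external hosts linking to each target.

```go
// Very basic "backlink count" sketch: for each target URL, count how many
// distinct *external* hosts link to it. No damping, no iteration, no magic.
package main

import (
	"fmt"
	"net/url"
)

type edge struct{ from, to string }

func backlinkCounts(edges []edge) map[string]int {
	seen := map[string]map[string]bool{} // target URL -> set of linking hosts
	for _, e := range edges {
		src, err1 := url.Parse(e.from)
		dst, err2 := url.Parse(e.to)
		if err1 != nil || err2 != nil {
			continue
		}
		// Internal links (same host) do not count as backlinks.
		if src.Hostname() == dst.Hostname() {
			continue
		}
		if seen[e.to] == nil {
			seen[e.to] = map[string]bool{}
		}
		seen[e.to][src.Hostname()] = true
	}
	counts := make(map[string]int, len(seen))
	for target, hosts := range seen {
		counts[target] = len(hosts)
	}
	return counts
}

func main() {
	edges := []edge{
		{"https://blog-a.example/post", "https://blog-b.example/article"},
		{"https://blog-c.example/links", "https://blog-b.example/article"},
		{"https://blog-b.example/home", "https://blog-b.example/article"}, // internal, ignored
	}
	for target, n := range backlinkCounts(edges) {
		fmt.Println(n, target)
	}
}
```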
It is interesting that you say anchor text has played a major role in Google's PageRank. I'm afraid I don't really understand how you mean to replicate that.
The demo page currently has no ranking mechanism at all. Simply sorting by crawl count would be another idea (sketch below): the more often a page has been crawled, the older and more stable it probably is, and the more stable a site is, the more reliable and the less likely to contain spam it probably is.
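In SQLite terms that could be as simple as storing the crawl count next to the FTS columns and sorting on it. Rough sketch, reusing the placeholder schema from above with an extra UNINDEXED crawl_count column (again, not the demo site's actual schema):

```go
// Sketch of "rank by crawl count": FTS5 match on url/title, ordered by a
// stored-but-not-indexed crawl_count column. Table and column names are
// placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // may need the sqlite_fts5 build tag
)

func main() {
	db, err := sql.Open("sqlite3", "index.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// crawl_count is stored alongside the FTS columns but not full-text indexed.
	if _, err := db.Exec(`CREATE VIRTUAL TABLE IF NOT EXISTS pages
		USING fts5(url, title, crawl_count UNINDEXED)`); err != nil {
		log.Fatal(err)
	}

	rows, err := db.Query(
		`SELECT url, title FROM pages
		 WHERE pages MATCH ?
		 ORDER BY CAST(crawl_count AS INTEGER) DESC
		 LIMIT 20`, "static site generator")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var u, t string
		if err := rows.Scan(&u, &t); err != nil {
			log.Fatal(err)
		}
		fmt.Println(u, "-", t)
	}
}
```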
> Maybe you can differentiate yourself if you can find a better way (then current one) of identifying "spammy" webpages and exclude them from the index.
Honestly, I am not much interested in pursuing this. There is no point in reinventing the wheel. Integrating common filter lists like
https://github.com/uBlockOrigin/uAssets could go a long way for this site's purposes, and maaaybe a barebones, community-maintained manual domain blacklist (rough sketch below).
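For the manual blacklist part, something as dumb as this would probably be enough to start with. Note that the uAssets lists use adblock filter syntax, so integrating them properly would need an actual parser; the sketch below just assumes a plain one-domain-per-line text file.

```go
// Barebones domain-blacklist filter: loads a plain text list (one domain per
// line) and blocks a host and all of its subdomains. File name is a placeholder.
package main

import (
	"bufio"
	"fmt"
	"net/url"
	"os"
	"strings"
)

type blacklist map[string]bool

func loadBlacklist(path string) (blacklist, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	bl := blacklist{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		d := strings.TrimSpace(strings.ToLower(sc.Text()))
		if d != "" && !strings.HasPrefix(d, "#") { // skip blanks and comments
			bl[d] = true
		}
	}
	return bl, sc.Err()
}

// Blocked reports whether the URL's host or any parent domain is blacklisted.
func (bl blacklist) Blocked(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	for host != "" {
		if bl[host] {
			return true
		}
		i := strings.Index(host, ".")
		if i < 0 {
			break
		}
		host = host[i+1:]
	}
	return false
}

func main() {
	bl, err := loadBlacklist("blocked-domains.txt") // placeholder file name
	if err != nil {
		fmt.Println("no blacklist loaded:", err)
		return
	}
	fmt.Println(bl.Blocked("https://ads.spammy.example/landing"))
}
```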
> I forgot to mention that if you are going to iterate through WAT files, then you might as well create a page level web graph.
Cool stuff, and the domain-level web graph is already worth integrating! Other than that, this seems like a massive task, although just emitting the raw link edges while scanning the WAT files looks manageable (rough sketch below).
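For reference, that edge-emitting step could look roughly like this. The nested JSON field names (Envelope / Payload-Metadata / HTTP-Response-Metadata / HTML-Metadata / Links) are from memory and should be double checked against a real WAT record.

```go
// Sketch: turn WAT records into page-level link edges (source page URL,
// link target URL), one edge per output line.
package main

import (
	"bufio"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"strings"
)

// Only the parts of the WAT JSON we care about (field names assumed).
type watRecord struct {
	Envelope struct {
		WARCHeaderMetadata struct {
			WARCTargetURI string `json:"WARC-Target-URI"`
		} `json:"WARC-Header-Metadata"`
		PayloadMetadata struct {
			HTTPResponseMetadata struct {
				HTMLMetadata struct {
					Links []struct {
						URL string `json:"url"`
					} `json:"Links"`
				} `json:"HTML-Metadata"`
			} `json:"HTTP-Response-Metadata"`
		} `json:"Payload-Metadata"`
	} `json:"Envelope"`
}

func main() {
	f, err := os.Open("example.wat.gz") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}

	sc := bufio.NewScanner(gz)
	sc.Buffer(make([]byte, 1024*1024), 32*1024*1024) // JSON payloads are long single lines

	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "{") { // only look at the JSON payload lines
			continue
		}
		var rec watRecord
		if err := json.Unmarshal([]byte(line), &rec); err != nil {
			continue // skip JSON we don't understand
		}
		src := rec.Envelope.WARCHeaderMetadata.WARCTargetURI
		for _, l := range rec.Envelope.PayloadMetadata.HTTPResponseMetadata.HTMLMetadata.Links {
			if src != "" && strings.HasPrefix(l.URL, "http") { // absolute links only
				fmt.Println(src, l.URL)
			}
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Those (source, target) lines are exactly the input the backlink-count sketch above would need.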
As a final thought: how useful do you estimate the entirety of the CC crawls to be in terms of search coverage / completeness? Is combining all of them (links + URLs only) even an idea worth pursuing at all?