General purpose search engine, title-based


Philip Waritschlager

Dec 5, 2021, 1:13:56 PM12/5/21
to Common Crawl
Hi! :)

Has anyone ever built a general internet search engine based on CC? It does not seem so. Why not?

Motivation:
Not having found anything remotely close, I was interested in building such an open-source search engine based on CC URLs and (unique) titles *only* - in other words, omitting all other metadata and HTML contents. HTML titles have always been a very reliable and accurate summary, spam aside. I was thinking of a dumb, non-NLP, strictly word-match based site/link/title index that does not rank pages but can sort by crawl date, match accuracy and the like. I think this could be a fairly usable alternative to conventional search engines when your search terms are very specific. It would also serve as a searchable archive of titles for deleted and dead links (again, nothing like it seems to exist??), which you could then look up in ancient dumps or via archive.org.

Constraints. Please correct me if anything is wrong:
As far as I understand, CC only crawls (large) chunks of the web, so there are natural gaps. It might not be a perfect fit, but it's the only choice there is. (CC is a fantastic organization, by the way!) Also, each crawl is different: to build a common search database, one would need to scan through all existing crawls and combine them. The older a crawl hit, the more likely the page no longer exists. It is also impossible to scan through multiple crawls at once, *except* with the columnar index. And finally, website titles are only accessible in the raw WARC or (preferably) WAT files.

So iterating over all WAT files for *one* recent crawl would take more than 1000 hours, which is presumably parallelizable on AWS for about $10-30. And if you wanted to iterate through all existing datasets, we're talking expenses of more than $1000. Oof!
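
For a sense of scale, a quick sketch of how one could count the WAT files of a single crawl from its wat.paths.gz listing (the crawl id and bucket URL here are just examples):

# Sketch: count the WAT files of one crawl via its wat.paths.gz listing.
# The crawl id and bucket URL are examples only; adjust as needed.
import gzip
import urllib.request

CRAWL = "CC-MAIN-2021-43"
url = f"https://commoncrawl.s3.amazonaws.com/crawl-data/{CRAWL}/wat.paths.gz"

with urllib.request.urlopen(url) as resp:
    paths = gzip.decompress(resp.read()).decode().splitlines()

print(len(paths), "WAT files in", CRAWL)
print("first file:", paths[0])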

So due to the expense, time and maintenance required, this just does not really seem like a feasible idea, at least not without commercial involvement :-/ It's a shame, as the final clean, deduplicated database would probably only be 10-50 GB in size.

Hope I'm making some sense, regards,
Philip (phil294)

Jay Patel

Dec 6, 2021, 4:18:43 AM12/6/21
to common...@googlegroups.com
Actually, a search engine based on the Common Crawl dataset has been built at least a few times.

I don't want to go through my archives and cite all the papers on it here, but one of the more recent ones is Elastic ChatNoir (https://github.com/chatnoir-eu). You can check out their search engine at chatnoir.eu.

Iterating over WAT (or even WARC) files, even for half a dozen crawls, is honestly not a big deal cost-wise. We do it all the time in our batch processes.

The real constraint is what happens with all the data. Will you index it in something like Elasticsearch? That's expensive and the real bottleneck. I mean, if you aren't in academia and aren't counting on ad revenue, then I don't see how a free open-source search engine becomes economically feasible.

Maybe you can differentiate yourself if you can find a better way (than the current one) to identify "spammy" webpages and exclude them from the index. But then again, hasn't all that been done already in the early 2010s, when Google came out with the Panda update?

There are battle-hardened veterans of the alternative search engine space on this mailing list, like former Blekko people (Greg Lindahl?); let them give their two cents.

Thanks,

Jay.



Greg Lindahl

Dec 6, 2021, 7:52:34 PM12/6/21
to common...@googlegroups.com
Philip,

I see that Jay Patel has already given you a good answer to your question.

In addition, I'd like to point out that much of the early success of Google
and PageRank came from a focus on incoming anchortext, not the matrix algebra
thing. These days titles are probably better thanks to SEO, which also
probably makes the anchortext more uniform.

If you're more interested in saving money, you can iterate over everything
in the public bucket from outside of Amazon. It will be slow but free.

I love to see experiments like this and chatnoir.eu; it is exactly what
Common Crawl was intended to facilitate.

-- greg

Jay Patel

Dec 6, 2021, 10:37:33 PM12/6/21
to common...@googlegroups.com
Phillip,

I forgot to mention that if you are going to iterate through WAT files, then you might as well create a page-level web graph.

As you might know, backlinks are super important for SEO folks, and there are a bunch of commercially available backlink databases (Ahrefs, Semrush, Moz).

Common Crawl does have host- and domain-level webgraphs, but no page-level webgraph so far (see this discussion: https://groups.google.com/g/common-crawl/c/-CnBtVFR1mg/m/SDifX9qJAQAJ).

So if you create a page-level webgraph and its transpose (a.k.a. a backlinks database) that is completely open source, I can imagine you will also find enough supporters/donors to keep the effort going, which would take care of the AWS costs for WAT processing.

Of course, the other good idea is, as Greg suggested, to download each WAT file to a local server and process it there; this is completely free, although you will be constrained by your bandwidth.

Jay.


Bob van Luijt

Dec 6, 2021, 10:44:31 PM12/6/21
to common...@googlegroups.com
We just released this dataset: 

I'm very interested in doing the same for the CC.

If anybody is interested in collaborating please let me know.


SeMI Technologies
website - Twitter - LinkedIn - P.O. Box 95263, 1090HG Amsterdam

Greg Lindahl

Dec 7, 2021, 11:50:54 AM12/7/21
to common...@googlegroups.com
Bob,

This is extremely interesting! I have chatted with several semantic
search companies in the past, and I always recommend that they do
a Wikipedia demo.

From a Common Crawl point of view, the dataset is much larger and is
intentionally a sample of the web. One idea is to take all of the
off-site incoming anchortext and index that. So, as an example, if you
found this link outside of pbm.com:

<a href="https://www.pbm.com/~lindahl/">Greg Lindahl's Homepage</a>

You would index the text [Greg Lindahl's Homepage]. Even if my
homepage wasn't crawled, if linking to it was popular enough (either
naturally or for SEO reasons), you'd have a good result for it.
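
A rough sketch of what that extraction could look like over a single WAT
file, using the warcio library - the JSON field names and the input file
name are my assumptions about the WAT layout, so double-check them against
a real record:

# Sketch: collect off-site anchortext from one WAT file, keyed by target URL.
import json
from collections import defaultdict
from urllib.parse import urljoin, urlparse

from warcio.archiveiterator import ArchiveIterator

anchors = defaultdict(set)

with open("example.warc.wat.gz", "rb") as stream:   # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        meta = json.loads(record.content_stream().read())
        envelope = meta.get("Envelope", {})
        page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
        links = (envelope.get("Payload-Metadata", {})
                         .get("HTTP-Response-Metadata", {})
                         .get("HTML-Metadata", {})
                         .get("Links", []))
        if not page_url:
            continue
        src_host = urlparse(page_url).netloc
        for link in links:
            target = urljoin(page_url, link.get("url", ""))
            text = (link.get("text") or "").strip()
            # Keep only anchors that carry text and point off-site.
            if text and urlparse(target).netloc not in ("", src_host):
                anchors[target].add(text)

# anchors now maps a target URL to the set of off-site anchor texts, e.g.
# {"https://www.pbm.com/~lindahl/": {"Greg Lindahl's Homepage"}}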

And if you processed Common Crawl in the past, I think it would be an
interesting way to search the historical web. Archive.org has such an
engine for the Wayback Machine.

I'm looking forward to hearing some additional ideas about using CC
data with this engine!

-- greg

phil294

Dec 11, 2021, 2:14:40 PM12/11/21
to Common Crawl
@Greg, @Jay, thank you both for your valuable input! (@Bob, did you accidentally post in the wrong thread? This seems completely unrelated, no? Sorry if not.)

It is a bit hard to get an overview of CC-based implementations, but ChatNoir looks good. And it seems to be backed by powerful hardware, too.

About my title-based search engine idea outlined above: I made a few tests and a demo application at https://link-archive.org (ready for testing). It's based on just one of the 56k WAT files from a 2017 crawl.
On my machine, processing a single WAT file takes
  • 17 seconds for downloading
  • 12 seconds for unzipping
  • 1 second for processing (extracting all URLs and titles - I wrote a small Go script for that, as I didn't need sophisticated JSON parsing)
and inserting the values into the DB takes another 1-5 seconds. I simply used SQLite; it has pretty usable full-text search capabilities. For scaling up, a database server is probably better, but I don't see any advantage in spinning up Elasticsearch for something as basic as word matching over two columns. That is why, after these tests, I still think that computing time (all existing dumps plus maintenance) and storage space pose the greatest hurdle. I estimate that even the smallest possible setup will take over 10 TB. A bit too much for my taste, so I'll probably put this on hold. Maybe something to pursue at university again some day... It is unfortunate, because I would actually want to use this a lot myself.
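
For reference, a condensed Python sketch of that pipeline (the real extraction is the small Go script mentioned above; the WAT field names here are my assumptions and the input file name is a placeholder):

# Sketch: stream one WAT file into a SQLite FTS5 table of (url, title).
import json
import sqlite3

from warcio.archiveiterator import ArchiveIterator

db = sqlite3.connect("titles.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title)")

rows = []
with open("example.warc.wat.gz", "rb") as stream:   # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        meta = json.loads(record.content_stream().read())
        envelope = meta.get("Envelope", {})
        url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
        title = (envelope.get("Payload-Metadata", {})
                         .get("HTTP-Response-Metadata", {})
                         .get("HTML-Metadata", {})
                         .get("Head", {})
                         .get("Title"))
        if url and title:
            rows.append((url, title))

db.executemany("INSERT INTO pages (url, title) VALUES (?, ?)", rows)
db.commit()

# Plain word-match search over titles:
for url, title in db.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? LIMIT 10",
        ("common crawl",)):
    print(url, "-", title)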

Page rank:
Good idea; it adds another whole layer of work. Maybe a super basic page rank based on external backlink count (no magic) would be feasible. This way, you would treat the web the way it used to work 15 years ago: no walled gardens, and blogs interlinking with each other. Treat it the way you want it to be, right?! The more a blog post is linked to from unrelated sites, the more likely it is to be relevant and/or good. But that's still pretty exploitable.
It is interesting that you say anchortext played a major role for Google's PageRank. I'm afraid I don't really understand how you mean to replicate that.

The demo page currently has no ranking mechanism at all. Simply sorting by crawl count would be another idea: the more often a page has been crawled, the older and more stable it probably is, and the more stable a site, the more reliable and less likely to contain spam it is.
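
If such counts were ever collected, the ranking itself could stay plain SQL. A sketch, assuming a hypothetical page_stats(url, backlinks, crawl_count) table next to the FTS index - nothing of the sort exists in the demo yet:

# Sketch: rank title matches by external backlink count, then crawl count.
# `page_stats` is hypothetical; the demo has no such table or ranking.
import sqlite3

db = sqlite3.connect("titles.db")
results = db.execute("""
    SELECT pages.url, pages.title
    FROM pages
    JOIN page_stats s ON s.url = pages.url
    WHERE pages MATCH ?
    ORDER BY s.backlinks DESC, s.crawl_count DESC
    LIMIT 20
""", ("common crawl",)).fetchall()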


> Maybe you can differentiate yourself if you can find a better way (than the current one) to identify "spammy" webpages and exclude them from the index.

Honestly, I am not much interested in pursuing this. There is no point in reinventing the wheel. Integrating common filter lists like https://github.com/uBlockOrigin/uAssets could go a long way for this site's purposes, and maaaybe a manual, barebones domain-blacklisting community effort.


> I forgot to mention that if you are going to iterate through WAT files, then you might as well create a page-level web graph.

Cool stuff, and the domain-level webgraph is already worth integrating! Other than that, this seems like a massive task.

As a final thought, how useful do you estimate the entirety of the CC crawls to be in terms of search coverage / completeness? Is combining all of them even an idea worth pursuing at all? (links+urls only)

phil294

Jan 15, 2022, 1:59:22 PM1/15/22
to Common Crawl
> you can iterate over everything in the public bucket from outside of Amazon. It will be slow but free.

Just FYI - I have been doing that for a while now, but after ~10 days of querying, about 80% through the 2021-43 dataset, I am seeing 503 SLOW DOWN responses. Even when accessing the resource https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2021-43/segments/1634323587854.13/wat/CC-MAIN-20211026072759-20211026102759-00464.warc.wat.gz from another IP, the file is now most of the time (seemingly randomly) not available. I have reduced my download batch size from 4 to 1, added a 30-second delay, and will try again tomorrow.
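
The retry logic I am trying now looks roughly like this (a sketch; the delay and retry count are just my current guesses, not anything recommended by Common Crawl):

# Sketch: fetch a file from the public bucket, backing off on 503 Slow Down.
import time
import urllib.error
import urllib.request

def fetch(url, retries=5, delay=30):
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 503:                       # throttled: wait and retry
                time.sleep(delay * (attempt + 1))
                continue
            raise
    raise RuntimeError(f"still throttled after {retries} attempts: {url}")

data = fetch("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2021-43/"
             "segments/1634323587854.13/wat/"
             "CC-MAIN-20211026072759-20211026102759-00464.warc.wat.gz")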

Common Screens

Jan 8, 2023, 2:59:49 PM1/8/23
to Common Crawl
We are building one as a fun project; it may also turn commercial. Check it out at https://visualsearch.org. Indexing is still in progress, the search results are not yet split by country and language, and we plan to add harmonic centrality ranking once enough indexing is done.

phil294

Jan 8, 2023, 5:43:01 PM1/8/23
to Common Crawl

Hello,

I'm not sure if you are really referring to the title-based search engine I wrote about one year ago, but just to make sure: I have actually finished that task, and it's available at https://link-archive.org/, based on the October 2021 crawl, with *no* ranking at all; I have no plans for further development currently. Apologies, I should have mentioned this in the thread.

Apparently, though, you are building quite a different beast. It looks neat! I like that you include site screenshots in the results; this definitely enhances the search experience.

Philip
