distributed search engine

Stan Srednyak

Jul 16, 2022, 5:38:53 PM
to Common Crawl
hi CC,

Thanks for the amazing work you have been doing over the years!


We started the Distributed Search Engine (DSE) project (https://rorur.com) some time ago, and we are happy to announce the next iteration of the code.

The project's aims are:

1) design a system that allows users with modest computational resources to join a network of nodes that performs the ordinary operations of a web2 search engine: crawling the web, indexing pages, computing their rank according to some algorithm, and serving users' search requests. The system should also provide a means for advertising companies to advertise, such that: a) they can write code to target the audience they are interested in, and b) the advertising revenue is transferred in a fair, programmable way to the end users and to the maintainers of the nodes in the network

2) make search open source

3) make it easy for anyone to participate in the design and operation of the system

4) make ranking algorithms open and transparent, and give users a choice among such algorithms

5) allow easy and streamlined benchmarking

6) design a fair advertising system in which consumers of the ads a) receive a fraction of the advertising revenue, and b) retain full ownership of their click and attention patterns, turning them into a commodity they can trade directly with ad companies (a toy revenue-split example follows this list)
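To give a flavour of aim 6, here is a toy Python sketch of the intended revenue flow. It is not part of the released code, and the split fractions and names are made-up placeholders, not project parameters:

# Hypothetical illustration of aim 6: splitting ad revenue between the
# viewer of the ad and the node operators that served the query.
# The fractions below are placeholders, not project parameters.

def split_ad_revenue(revenue, node_operators, viewer_share=0.3, operator_share=0.6):
    """Return a payout map: the viewer gets a fixed fraction, node operators
    share another fraction equally, and the remainder funds the network."""
    payouts = {"viewer": revenue * viewer_share}
    per_node = revenue * operator_share / len(node_operators)
    for node in node_operators:
        payouts[node] = per_node
    payouts["network_fund"] = revenue - sum(payouts.values())
    return payouts

print(split_ad_revenue(1.00, ["node-a", "node-b"]))
# viewer 0.30, each node 0.30, ~0.10 left for the network fund

In the real system such a split would be expressed as a programmable contract rather than hard-coded fractions; the sketch only shows the accounting idea.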

We have done substantial development along these lines. In particular, we demonstrated an architecture that:

1) is capable of meeting the 0.5-second latency threshold for the delivery of search query results

2) is scalable both in the number of pages on the web and in the number of participating nodes (logarithmic in the web size)

3) is resilient to node failures, malicious behavior, collusion between malicious nodes, and other types of adversarial behavior

4) is blockchain-based: the operation of the system is recorded in a system of ledgers (blockchains) that allows for streamlined proofs of computational correctness and for the delivery of revenue according to the specified contracts.

The architecture details are described in the whitepaper.
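As a toy illustration of point 4 above (not the actual DSE ledger format, which is specified in the whitepaper), here is a minimal hash-chained log of crawl records in Python; any node can recompute the hashes to check that recorded work has not been rewritten after the fact:

# Minimal sketch of a hash-chained ledger of crawl records.
# Assumptions only; the real system's record schema and proof scheme differ.

import hashlib, json

def append_record(chain, record):
    """Append a record whose hash commits to the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"prev": prev_hash, "record": record}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})
    return chain

def verify(chain):
    """Recompute every hash and check the prev-links."""
    prev_hash = "0" * 64
    for entry in chain:
        body = {"prev": entry["prev"], "record": entry["record"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != digest:
            return False
        prev_hash = entry["hash"]
    return True

chain = []
append_record(chain, {"url": "https://example.com/", "status": 200})
append_record(chain, {"url": "https://example.org/", "status": 404})
print(verify(chain))  # True

The point of the chain is only tamper-evidence; proofs of computational correctness and revenue delivery are layered on top, as described in the whitepaper.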

The deployment of the system will proceed in several stages:

1) crawl the namespace, i.e., the list of URLs

1.5) crawl the web graph and compute a popularity rank (a toy ranking sketch follows this list)

2) a full-fledged engine that performs text analysis, indexing, and ranking, and is capable of serving search transactions

3) implementation of a competitive rank market

4) a financial and accounting system, together with the advertising system

(Stage 1.5 is not necessary; we may omit it.)
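For stage 1.5 we fix only the goal, not the algorithm. As one concrete example of a "popularity rank", here is a toy power-iteration PageRank over a tiny hand-made web graph; it is illustrative only, and the network is free to use other ranking algorithms:

# Toy power-iteration PageRank. The graph below is made up for illustration.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping url -> list of outgoing urls."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for src, outs in links.items():
            if not outs:                      # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[src] / len(pages)
            else:
                for dst in outs:
                    new[dst] += damping * rank[src] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))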

At the moment we are releasing the code for stage 1: https://github.com/cnn-rnn/ouroboros
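For readers who just want the flavour of stage 1, here is a rough single-node Python sketch of a namespace crawl: fetch a page, keep the discovered URLs, discard the text. This is not code from the ouroboros repository, and it omits the politeness controls (robots.txt, rate limiting) and the distribution across nodes that a real crawler needs:

# Toy single-node namespace crawl: collect URLs, do not store page text.
# Standard library only; purely illustrative.

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from collections import deque

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_namespace(seed, max_pages=50):
    """Return the set of URLs reachable from seed; page text is discarded."""
    seen, frontier = {seed}, deque([seed])
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            parser = LinkParser()
            parser.feed(html)
        except Exception:
            continue
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

# print(len(crawl_namespace("https://example.com/")))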

We are also running an AWS cluster to do this crawl (we can get the web graph from CC, but it is an important test to do this independently).

You are welcome to join the network from your own hardware. It is absolutely free, and there are no obligations on node maintainers: you can always leave the network or shut down your node (the system is resilient to node failures). At the moment there is no financial component; we are just testing the crawler. The revenue system will come in future forks.

There are certain minimum hardware requirements. We designed the system to have a minimal RAM footprint; however, for best performance we recommend at least 2 GB of RAM. We have had a lot of success running it on .large (and larger, such as .xlarge, ...) AWS instances.

We are currently running a cluster of 10 c6gn.large machines, see https://rorur.com. Each machine does on average 70 pages/second (note that we do not store or index text at the moment; with indexing and ranking running in parallel with the crawl, operation is a few times slower). This amounts to ~5*10^6 pages/day/machine. With a total of ~5*10^10 pages on the web, the complete web crawl comes to ~10^4 machine-days. With our very modest cluster we cannot hope to achieve this in a reasonable time. We hope that open-source and data-transparency enthusiasts will join us in this effort to build a network capable of full-scale web search.

Note that we are currently using rather small machines. On machines with more cores the speed is better; e.g., we saw ~300 pages/second on 8-vCPU machines. We estimate that the minimal network of typical 8-core machines needed to perform a 1-day web crawl is ~1000 machines. We have only tested the system on machines with SSDs, and we do not know whether it can be run on machines with HDDs (this matters because most operations are done on disk, with little RAM footprint). We also usually set IOPS to the maximum of 16,000.
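The capacity estimate above, spelled out (all inputs are the rounded figures from this paragraph; exact numbers will vary with hardware and crawl settings):

# Back-of-the-envelope crawl capacity, using the figures quoted above.

PAGES_PER_SEC_PER_MACHINE = 70                  # observed on c6gn.large, crawl only
pages_per_day = PAGES_PER_SEC_PER_MACHINE * 86_400   # ~6.0e6; the post rounds to ~5e6
WEB_SIZE = 5e10                                  # assumed total pages on the web
machine_days = WEB_SIZE / 5e6                    # ~1e4 machine-days for a full crawl

# 1-day crawl on larger machines (~300 pages/s observed on 8 vCPUs):
pages_per_day_8core = 300 * 86_400               # ~2.6e7 pages/day/machine
machines_needed = WEB_SIZE / pages_per_day_8core # of order 10^3 machines,
                                                 # in the ballpark of the ~1000 estimate
print(round(pages_per_day), round(machine_days), round(machines_needed))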

We would appreciate your comments, suggestions, or criticism concerning this project, our current architecture, code design, etc.

best,
Stan Srednyak

