Optimal Hardware for Large Scale Crawling


Erik G.

Nov 26, 2020, 3:44:35 PM
to frontera
Hi all,

We are evaluating Frontera as a candidate for large-scale crawling as part of an education program focused on data engineering. We are planning to crawl a sizable portion of the public web in Switzerland.

I would be very grateful for any indication regarding optimal hardware for running Frontera at this scale (e.g. whether a single server with lots of cores and RAM is preferable to a distributed setup).

Due to our focus on teaching, we do not plan to opt for cloud-based solutions.

Thanks a lot in advance,

Erik

Alexander Sibiryakov

Nov 27, 2020, 8:40:21 AM
to Erik G., frontera
Hi Erik,

From an architecture standpoint, nothing prevents it from running on a single server. The network connection could be a limiting factor.

It depends on the throughput you need. A single spider (fetching process) can provide around 1200 pages/minute. These pages would then have to be parsed and processed by workers. Maybe it is worth starting from the resources you already have and then figuring out what Frontera configuration would best leverage them.
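As a rough sizing sketch only: taking the ~1200 pages/minute per spider figure above and assuming linear scaling across spider processes (ignoring politeness delays, retries and per-host rate limits), a back-of-the-envelope calculation like the following can give a first estimate of how many fetching processes you would need. The target page count below is purely illustrative; replace it with your own estimate for the Swiss web.

    # Back-of-the-envelope crawl sizing.
    # Assumptions: ~1200 pages/minute per spider (figure from above),
    # linear scaling, no politeness/retry overhead accounted for.
    PAGES_PER_SPIDER_PER_MIN = 1200

    def spiders_needed(target_pages, days):
        """Number of spider (fetching) processes to crawl target_pages in `days` days."""
        pages_per_spider = PAGES_PER_SPIDER_PER_MIN * 60 * 24 * days
        return -(-target_pages // pages_per_spider)  # ceiling division

    # Illustrative example: 100 million pages in 30 days -> roughly 2 spider processes.
    print(spiders_needed(100_000_000, 30))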

See also
