Optimal Hardware for Large Scale Crawling


Erik G.

Nov 26, 2020, 3:44:35 PM
to frontera
Hi all,

We are evaluating Frontera as a candidate for large-scale crawling as part of an education program focused on data engineering. We are planning to crawl a sizable portion of the public web in Switzerland.

I would be very grateful for any indication regarding optimal hardware for running Frontera at this scale (e.g. whether a single server with lots of cores and RAM is preferable to a distributed setup).

Due to our focus on teaching, we do not plan to opt for cloud-based solutions.

Thanks a lot in advance,

Erik

Alexander Sibiryakov

Nov 27, 2020, 8:40:21 AM
to Erik G., frontera
Hi Erik,

From an architecture standpoint, nothing prevents it from running on a single server. The network connection could be a limiting factor.

It depends on the throughput you need. A single spider (fetching process) can provide around 1200 pages/minute. These pages would then have to be parsed and processed by workers. Maybe it is worth starting from the resources you already have and then figuring out what Frontera configuration would best leverage them.
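As a rough sizing sketch only: taking the ~1200 pages/minute per spider figure above and assuming linear scaling across spider processes (ignoring politeness delays, retries and per-host rate limits), a back-of-the-envelope calculation like the following can give a first estimate of how many fetching processes you would need. The target page count below is purely illustrative; replace it with your own estimate for the Swiss web.

    # Back-of-the-envelope crawl sizing.
    # Assumptions: ~1200 pages/minute per spider (figure from above),
    # linear scaling, no politeness/retry overhead accounted for.
    PAGES_PER_SPIDER_PER_MIN = 1200

    def spiders_needed(target_pages, days):
        """Number of spider (fetching) processes to crawl target_pages in `days` days."""
        pages_per_spider = PAGES_PER_SPIDER_PER_MIN * 60 * 24 * days
        return -(-target_pages // pages_per_spider)  # ceiling division

    # Illustrative example: 100 million pages in 30 days -> roughly 2 spider processes.
    print(spiders_needed(100_000_000, 30))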

See also
