Optimal Hardware

Erik Graf

Nov 26, 2020, 3:48:37 PM
to DigitalPebble
Hi all,

We are planning to crawl a good portion of the Swiss public Web as part of our teaching efforts.

I was wondering what the best setup would be for on-site hardware, in terms of the trade-off between complexity, scalability, and performance.
Is it generally preferable to have one server with lots of cores and RAM, or a distributed setup?
And what would be a suitable setup for attempting petabyte-scale crawls with StormCrawler?

Thanks a lot in advance.

Best,

Erik

DigitalPebble

Nov 27, 2020, 7:09:28 AM
to DigitalPebble
Hi Erik, 

That sounds very interesting. Are you planning to share the resulting dataset in one way or another?  

To answer your question: I'd go for a distributed cluster of average servers rather than a single huge one, with Elasticsearch installed on all the nodes alongside Storm. Something like the specs of an EC2 m5.2xlarge instance (8 vCPUs, 32 GB of RAM) would do; the one thing that really matters is having SSD drives, as they make quite a difference to Elasticsearch's performance.
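
As a rough starting point (numbers I'm making up rather than something I've benchmarked), on an m5.2xlarge-class node you could give Storm two worker slots and leave the rest of the memory to Elasticsearch and the OS page cache, e.g. in storm.yaml:

    # 2 worker slots per node, 6 GB heap each,
    # leaving ~20 GB for Elasticsearch and the page cache
    supervisor.slots.ports:
        - 6700
        - 6701
    worker.childopts: "-Xmx6g"

Tune these once you see what the fetchers and Elasticsearch actually consume.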

Going distributed is also probably more interesting from a teaching perspective. It makes the crawl more robust against hardware failure, and depending on the configuration of your network, having more machines can also be more efficient.

What are you planning to do with the documents once crawled? Index them for search? Store only the extracted metadata? Archive the entire pages in WARC format?
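
If you do go down the WARC route, StormCrawler has a WARC module; wiring it into the topology looks roughly like this (a sketch based on the com.digitalpebble.stormcrawler.warc classes; the "fetch" bolt id and the /data/warc path are placeholders, and you'd want to check the exact API against the version you use):

    import org.apache.storm.hdfs.bolt.format.FileNameFormat;
    import org.apache.storm.topology.TopologyBuilder;
    import com.digitalpebble.stormcrawler.warc.WARCFileNameFormat;
    import com.digitalpebble.stormcrawler.warc.WARCHdfsBolt;

    // inside the method that builds the topology
    TopologyBuilder builder = new TopologyBuilder();
    // ... spout, fetcher ("fetch"), parser, indexer declared here ...

    // rotated, gzipped WARC files written under the given path
    FileNameFormat format = new WARCFileNameFormat().withPath("/data/warc");
    WARCHdfsBolt warc = new WARCHdfsBolt();
    warc.withFileNameFormat(format);

    // subscribe to the fetcher so every fetched page gets archived
    builder.setBolt("warc", warc).localOrShuffleGrouping("fetch");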

Kind regards

Julien


