Good Morning,
I am migrating a JPPF-based Crawler4J fork (in this special case a focused crawler leveraging several ML algorithms) to the StormCrawler SDK for research purposes.
The reason for this migration is twofold: (1) to use standard software instead of home-grown solutions ;) and (2) to compare the performance of StormCrawler with Apache Nutch (and then with our previous results from the JPPF-based Crawler4J implementation).
For our use case, a continuous crawl with unlimited depth is mandatory. For this reason, we decided to base our customization on the Elasticsearch components provided in the SC external directory.
However, documentation about performance tuning for SC is hard to find in the Wiki or in the issue tracker.
Some technical facts for our setup:
- We have plenty of bandwidth for this experiment:
  - The University's core router is connected to the Internet with 40 GbE
  - Our server infrastructure (Cisco UCS blades running VMware ESXi) is also connected with 40 GbE
- We have a Storm cluster (exclusively for this work) consisting of 24 virtual machines running Ubuntu 16.04 LTS:
  - 3 Nimbus nodes + ZooKeeper
  - 21 Supervisor nodes (4 vCores, 4 GB RAM each)
- At the moment, one Elasticsearch instance on Ubuntu 16.04:
  - 32 GB RAM, 4 vCores + storage
So I have the following questions (maybe you have some hints or ideas, given your experience with SC and your detailed knowledge as its author):
- I assume that the status index will grow a lot over time. At the moment we are using the AggregationSpout with bucket sorting based on a (previously computed) priority value.
  - Is a single ES instance the bottleneck here?
  - Would it be better to use more than one ES instance?
- Is there any SC documentation available on the configuration parameters of the ES extension? If not, I will have to read the code in detail :)
- Are there any other hidden pitfalls besides configuring the parallelism hints for the SC bolts such as Fetcher, Parser, etc.? I already found that the number of ES spouts needs to equal the number of ES shards :)
- As I am quite new to Storm, SC and ES: can you personally recommend a good source of information on the configuration parameters?
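
To make the first question a bit more concrete, here is a rough sketch of what I mean by bucket sorting on a priority value. This is plain Python for illustration only, not SC or ES code: the function `select_urls`, the field names `host` and `priority`, and the two limits are hypothetical, and I am assuming that one bucket corresponds to one host and that a higher priority means "fetch earlier".

```python
from collections import defaultdict


def select_urls(status_docs, max_buckets, max_urls_per_bucket):
    """Pick the best URLs per bucket (hypothetical illustration).

    status_docs: iterable of dicts with the keys "url", "host", "priority".
    Returns a flat list of URLs: at most max_urls_per_bucket from each of
    at most max_buckets hosts.
    """
    # Group the status documents into buckets, one bucket per host.
    buckets = defaultdict(list)
    for doc in status_docs:
        buckets[doc["host"]].append(doc)

    # Rank the buckets by the best priority they contain, highest first.
    ranked = sorted(
        buckets.values(),
        key=lambda docs: max(d["priority"] for d in docs),
        reverse=True,
    )

    selected = []
    for docs in ranked[:max_buckets]:
        # Within a bucket, emit the highest-priority URLs first.
        docs.sort(key=lambda d: d["priority"], reverse=True)
        selected.extend(d["url"] for d in docs[:max_urls_per_bucket])
    return selected
```

With every spout running a query of this shape against its shard, my worry is that the cost of each query grows with the size of the status index.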
I am open to any answers, questions or related discussion :)
Thanks,
Richard