[StormCrawler] [ElasticSearch] Configuration Documentation / Technical Questions


fear...@gmail.com

unread,
Jan 20, 2018, 3:37:43 AM
to DigitalPebble
Good Morning,

I am migrating a JPPF-based Crawler4J fork (in this special case a focused crawler leveraging several ML algorithms) towards the StormCrawler SDK for research purposes.

The reason for this migration is twofold: (1) to use standard software instead of homegrown solutions ;) and (2) to compare the performance of StormCrawler vs. Apache Nutch (and then against our previous results with the JPPF-based Crawler4J implementation).

For our use case, a continuous crawl with unlimited depth is mandatory. For this reason, we decided to start our customization based on the Elasticsearch components provided in the SC external directory.

However, documentation about performance fine-tuning of SC is hard to find in the Wiki or in the issue tracker.

Some technical facts for our setup:
  • We have plenty of bandwidth for this experiment:
    • The University's core router is connected with 40 GbE to the Internet
    • Our server infrastructure (Cisco UCS blades based on VMware / ESXi) is also connected with 40 GbE
  • We have a Storm cluster (exclusively for this work) consisting of 24 virtual machines running Ubuntu 16.04 LTS
    • 3 Nimbus nodes + ZooKeeper
    • 21 Supervisor nodes (4 vCores, 4 GB RAM)
  • At the moment one Elasticsearch instance on Ubuntu 16.04
    • 32 GB RAM, 4 vCores + storage

So I have the following questions (maybe you have some hints / ideas based on your experience with SC and your detailed knowledge as its author):

  • I assume that the status index will grow a lot over time - at the moment we are using the AggregationSpout with bucket sorting based on a (previously computed) priority value.
    • Is a single ES instance the bottleneck here? 
    • Would it be better to use more than one ES instance?
    • Is there any available SC documentation on the configuration parameters of the ES extension? If not, I will have to read the code in detail :)
  • Are there any other hidden pitfalls besides configuring the parallelism hints for the SC bolts (Fetcher, Parser, etc.)? I already found that the number of ES spouts needs to equal the number of ES shards :)
  • As I am quite new to Storm, SC and ES: can you personally recommend a good source of information on configuration parameters?
I am open to any answers / questions or related discussion :)

Thanks,
Richard

DigitalPebble

unread,
Jan 22, 2018, 5:07:17 AM
to DigitalPebble
Hi Richard, 

Thanks for getting in touch, great to hear that you're using StormCrawler. Please find my comments below:

On 20 January 2018 at 08:37, <fear...@gmail.com> wrote:
  • I assume that the status index will grow a lot over time - at the moment we are using the AggregationSpout with bucket sorting based on a (previously computed) priority value.
    • Is a single ES instance the bottleneck here? 
Definitely! As you pointed out, the status index will grow rapidly, which will lead to longer query times for the spouts and longer merging of segments by ES. ES will always be the bottleneck.
    • Would it be better to use more than one ES instance?

Ideally, you'd have one instance of ES running on each machine of your Storm cluster, but you can of course keep both clusters separate if you wish. Don't forget to set a number of shards in the ES init script >= the number of ES instances, and also set the number of spouts to the same value.
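As an illustrative sketch (the index name, shard count and replica count are assumptions - adapt them to your ES_IndexInit script): if you planned to run 10 ES instances, the status index could be created with settings like the following, and the spout feeding off it would then get a parallelism hint of 10 in the topology.

```json
{
  "settings": {
    "index": {
      "number_of_shards": 10,
      "number_of_replicas": 0
    }
  }
}
```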
    • Is there any available SC documentation on the configuration parameters of the ES extension? If not, I will have to read the code in detail :)
There are comments on the ES configs in https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/es-conf.yaml but no docs on the Wiki summarising them all.

Let me know if you need explanations on any specific value.
 
  • Are there any other hidden pitfalls besides configuring the parallelism hints for the SC bolts (Fetcher, Parser, etc.)? I already found that the number of ES spouts needs to equal the number of ES shards :)
I wouldn't call them pitfalls, but here are a few tips => use one Fetcher per worker and adjust the number of Parser instances based on your observations in the Storm UI - as a rule of thumb, about 4 Parser instances per Fetcher instance, i.e. if you have 42 workers, have 42 Fetchers and 168 Parsers.
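The rule of thumb above can be sketched as a tiny helper (illustrative only: the function name and the 4:1 parser/fetcher ratio are assumptions taken from the example, not an official formula - tune against what the Storm UI shows you):

```python
# Rule-of-thumb sizing for StormCrawler parallelism hints.
# Assumption: one Fetcher per worker, ~4 Parsers per Fetcher.

def parallelism_hints(num_workers: int, parsers_per_fetcher: int = 4):
    fetchers = num_workers                    # one Fetcher instance per worker
    parsers = fetchers * parsers_per_fetcher  # a handful of Parsers per Fetcher
    return fetchers, parsers

print(parallelism_hints(42))  # (42, 168)
```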

Also set es.status.sample to true to optimise the performance of the AggregationSpouts.
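For reference, a minimal sketch of the corresponding lines in es-conf.yaml (es.status.sample comes from the advice above; the bucket values are illustrative assumptions to be tuned against your priority-based sorting):

```yaml
es.status.sample: true
es.status.max.buckets: 50
es.status.max.urls.per.bucket: 2
```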
 
  • As I am quite new to Storm, SC and ES: can you personally recommend a good source of information on configuration parameters?
Storm
Make sure your config gives the workers plenty of RAM, e.g. at least 2 GB each.
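A minimal sketch of the corresponding storm.yaml entry (the 2 GB heap value follows the advice above; adjust it to your hardware):

```yaml
# give each worker JVM at least 2 GB of heap
worker.childopts: "-Xmx2g"
```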

Elastic 

In particular, use SSDs if you can + loads of RAM + turn swap off.
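A minimal sketch of the swap-related setting, assuming a standard elasticsearch.yml (this is general Elasticsearch guidance, not SC-specific):

```yaml
# lock the JVM heap in RAM so it cannot be swapped out
bootstrap.memory_lock: true
```

Combine this with disabling swap at the OS level (e.g. `sudo swapoff -a`) and an ES heap of roughly half the machine's RAM, leaving the rest for the filesystem cache.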

SC

There is https://github.com/DigitalPebble/stormcrawlerfight which has some configs for a single-node SC instance based on ES; you can use it as a source of inspiration.

 
I am open to any answers / questions or related discussion :)

Likewise, please share any questions or feedback

Julien 

 


--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebble+unsubscribe@googlegroups.com.
To post to this group, send email to digita...@googlegroups.com.
Visit this group at https://groups.google.com/group/digitalpebble.
For more options, visit https://groups.google.com/d/optout.




fear...@gmail.com

unread,
Jan 23, 2018, 8:45:11 AM
to DigitalPebble
Hi Julien,

Thanks a lot for your fast and detailed answer.

I will come back to this thread after I have done some distributed testing on our infrastructure, or if I have any other questions.

Best,
Richard

fear...@gmail.com

unread,
Jun 9, 2018, 4:12:26 AM
to DigitalPebble
Hi Julien,

as a follow up:

I started our (focused) StormCrawler (enhanced with Spring, some custom TLD / URL filters, classifier ParseFilters and a priority computation mechanism) on Wednesday on the production system, following your suggestions above.
So far, it is working really, really smoothly :) - big thanks for your work on this awesome crawler SDK!

I am also watching the GitHub repository and I think that I will join the ongoing discussions soon :)

Best,
Richard


DigitalPebble

unread,
Jun 9, 2018, 6:01:30 AM
to DigitalPebble
Hi Richard

That's great to hear - thanks for the feedback, and I'm looking forward to having you as an active member of the community. One way to help the project is to increase its visibility so that more people give it a try: if you have slides you can share, or a blog post describing your experiences with it, that would be a great way to spread the word.


If you have any figures about performance etc., these are also always of interest to users.

Have a good weekend

Julien
