[StormCrawler] [ElasticSearch] Configuration Documentation / Technical Questions


fear...@gmail.com

unread,
Jan 20, 2018, 3:37:43 AM
to DigitalPebble
Good Morning,

I am migrating a JPPF-based Crawler4J fork (in this special case a focused crawler leveraging several ML algorithms) towards the StormCrawler SDK for research purposes.

The reason for this migration is twofold: (1) to use standard software instead of homegrown solutions ;) and (2) to compare the performance of StormCrawler vs. Apache Nutch (and then against our previous results with the JPPF-based Crawler4J implementation).

For our use case, a continuous crawl with unlimited depth is mandatory. For this reason, we decided to start our customization based on the Elasticsearch components provided in the SC external directory.

However, documentation about performance fine-tuning of SC is hard to find in the Wiki or in the issue tracker.

Some technical facts for our setup:
  • We have plenty of bandwidth for this experiment:
    • The University's core router is connected with 40 GbE to the Internet
    • Our server infrastructure (Cisco UCS blades based on VMware / ESXi) is also connected with 40 GbE
  • We have a Storm cluster (exclusively for this work) consisting of 24 virtual machines running Ubuntu 16.04 LTS
    • 3 Nimbus nodes + ZooKeeper
    • 21 Supervisor nodes (4 vCores, 4 GB RAM)
  • At the moment one Elasticsearch instance on Ubuntu 16.04
    • 32 GB RAM, 4 vCores + storage

So I have the following questions (maybe you have some hints / ideas based on your experience with SC and your detailed knowledge as its author):

  • I assume that the status index will grow a lot over time - at the moment we are using the AggregationSpout with bucket sorting based on a (previously computed) priority value.
    • Is a single ES instance the bottleneck here? 
    • Would it be better to use more than one ES instance?
    • Is there any available SC documentation on the configuration parameters of the ES extension? If not, I will have to read the code in detail :)
  • Are there any other hidden pitfalls besides configuring the parallelism hints for the SC bolts (Fetcher, Parser, etc.)? I already found that the number of ES spouts needs to equal the number of ES shards :)
  • As I am quite new to Storm, SC and ES: can you personally recommend a good source of information on configuration parameters?
I am open to any answers / questions or related discussion :)

Thanks,
Richard

DigitalPebble

unread,
Jan 22, 2018, 5:07:17 AM
to DigitalPebble
Hi Richard, 

Thanks for getting in touch, great to hear that you're using StormCrawler. Please find my comments below:

On 20 January 2018 at 08:37, <fear...@gmail.com> wrote:
  • I assume that the status index will grow a lot over time - at the moment we are using the AggregationSpout with bucket sorting based on a (previously computed) priority value.
    • Is a single ES instance the bottleneck here? 
Definitely! As you pointed out, the status index will grow rapidly, which will lead to longer query times for the spouts and longer merging of segments by ES. ES will always be the bottleneck.
    • Would it be better to use more than one ES instance?

Ideally, you'd have one instance of ES running on each machine of your Storm cluster, but you can of course keep both clusters separate if you wish. Don't forget to set a number of shards in the ES init script >= the number of ES instances, and also set the number of spouts to the same value.
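As an illustrative sketch (the index name, shard count and replica count are assumptions - adapt them to your ES_IndexInit script): if you planned to run 10 ES instances, the status index could be created with settings like the following, and the spout feeding off it would then get a parallelism hint of 10 in the topology.

```json
{
  "settings": {
    "index": {
      "number_of_shards": 10,
      "number_of_replicas": 0
    }
  }
}
```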
    • Is there any available SC documentation on the configuration parameters of the ES extension? If not, I will have to read the code in detail :)
There are comments on the ES configs in https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/es-conf.yaml but no docs on the Wiki summarising them all.

Let me know if you need explanations on any specific value.
 
  • Are there any other hidden pitfalls besides configuring the parallelism hints for the SC bolts (Fetcher, Parser, etc.)? I already found that the number of ES spouts needs to equal the number of ES shards :)
I wouldn't call them pitfalls, but here are a few tips => use one Fetcher per worker and adjust the number of Parser instances based on your observations in the Storm UI - as a rule of thumb, about 4 Parser instances per Fetcher instance, i.e. if you have 42 workers, have 42 Fetchers and 168 Parsers.
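The rule of thumb above can be sketched as a tiny helper (illustrative only: the function name and the 4:1 parser/fetcher ratio are assumptions taken from the example, not an official formula - tune against what the Storm UI shows you):

```python
# Rule-of-thumb sizing for StormCrawler parallelism hints.
# Assumption: one Fetcher per worker, ~4 Parsers per Fetcher.

def parallelism_hints(num_workers: int, parsers_per_fetcher: int = 4):
    fetchers = num_workers                    # one Fetcher instance per worker
    parsers = fetchers * parsers_per_fetcher  # a handful of Parsers per Fetcher
    return fetchers, parsers

print(parallelism_hints(42))  # (42, 168)
```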

Also set es.status.sample to true to optimise the performance of the AggregationSpouts.
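For reference, a minimal sketch of the corresponding lines in es-conf.yaml (es.status.sample comes from the advice above; the bucket values are illustrative assumptions to be tuned against your priority-based sorting):

```yaml
es.status.sample: true
es.status.max.buckets: 50
es.status.max.urls.per.bucket: 2
```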
 
  • As I am quite new to Storm, SC and ES: can you personally recommend a good source of information on configuration parameters?
Storm
Make sure your config gives the workers plenty of RAM, e.g. at least 2 GB each.
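A minimal sketch of the corresponding storm.yaml entry (the 2 GB heap value follows the advice above; adjust it to your hardware):

```yaml
# give each worker JVM at least 2 GB of heap
worker.childopts: "-Xmx2g"
```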

Elastic 

In particular, use SSDs if you can + loads of RAM + turn swap off.
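A minimal sketch of the swap-related setting, assuming a standard elasticsearch.yml (this is general Elasticsearch guidance, not SC-specific):

```yaml
# lock the JVM heap in RAM so it cannot be swapped out
bootstrap.memory_lock: true
```

Combine this with disabling swap at the OS level (e.g. `sudo swapoff -a`) and an ES heap of roughly half the machine's RAM, leaving the rest for the filesystem cache.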

SC

There is https://github.com/DigitalPebble/stormcrawlerfight which has some configs for a single-node SC instance based on ES; you can use it as a source of inspiration.

 
I am open to any answers / questions or related discussion :)

Likewise, please share any questions or feedback

Julien 

 


--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebble+unsubscribe@googlegroups.com.
To post to this group, send email to digita...@googlegroups.com.
Visit this group at https://groups.google.com/group/digitalpebble.
For more options, visit https://groups.google.com/d/optout.




fear...@gmail.com

unread,
Jan 23, 2018, 8:45:11 AM
to DigitalPebble
Hi Julien,

Thanks a lot for your fast and detailed answer.

I will come back to this thread after I have done some distributed testing on our infrastructure, or if I have any other questions.

Best,
Richard

fear...@gmail.com

unread,
Jun 9, 2018, 4:12:26 AM
to DigitalPebble
Hi Julien,

as a follow up:

I started our (focused) StormCrawler (enhanced with Spring, some custom TLD / URL filters, classifier ParseFilters and a priority computation mechanism) on Wednesday on the production system, following your suggestions above.
So far, it is working really, really smoothly :) - big thanks for your work on this awesome crawler SDK!

I am also watching the GitHub repository and I think that I will join the ongoing discussions soon :)

Best,
Richard


DigitalPebble

unread,
Jun 9, 2018, 6:01:30 AM
to DigitalPebble
Hi Richard

That's great to hear - thanks for the feedback, and I'm looking forward to having you as an active member of the community. One way to help the project is to increase its visibility so that more people give it a try: if you have slides you can share, or a blog post describing your experiences with it, that would be a great way to spread the word.


If you have any figures about performance etc., these are also always of interest to users.

Have a good weekend

Julien
