Elasticsearch shard and index settings

James Gordon

May 11, 2018, 2:37:37 PM
to security-onion
We've been experiencing some slow search times in our distributed Elastic instance. From what I can see, disk and RAM resources don't seem to indicate a bottleneck. Reading through the Security Onion mailing list today, I found a post in a thread (https://groups.google.com/forum/#!topic/security-onion/qOcBAnPnAAU) indicating that for best performance, Elasticsearch shards should be no larger than 50 GB. I noticed that a handful of my sensors have shards that are over 100 GB.
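
For reference, here is roughly how I have been checking shard sizes (a rough sketch, assuming Elasticsearch is listening on the default localhost:9200):

# List shards sorted by on-disk size, largest first
curl -s "localhost:9200/_cat/shards?v&s=store:desc" | head -20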

1. What's the best way to configure Elasticsearch in Security Onion to limit shard size? I did some digging through the Elastic documentation, but as someone who's never configured Elasticsearch outside of Security Onion's easy installation/configuration tools, I'm not sure of the best way to do this. From what I can see, the recommended method is to reduce the timeframe that individual indices cover.

2. Would it be recommended to align this configuration across all sensors, or only limit index time ranges on the sensors with shards larger than 50 GB?

3. I imagine that any configuration change will only affect indices created in the future. I don't mind spending some CPU power to re-shard old indices, but I'm not really sure how. Are there steps available to do this?

Thanks!

James Gordon

Bryant Treacle

May 11, 2018, 5:50:29 PM
to security-onion
James,

I just experienced a similar issue with my distributed deployment. I did a bunch of reading and a lot of trial and error tweaking the number of shards, the shard timeout, and the Kibana timeout. Here is what I learned.

When an Elasticsearch node receives a request from Kibana, it sends that request to each of the shards on the sensors and then waits for a response from all shards before returning the result to Kibana. So a single slow shard can impact the search time, and it may be difficult to identify the bottleneck. I ended up changing to 4 shards per index per sensor (about 20) and really did not notice any significant change in speed. I also use curator to close my indexes after 7 days; if I need to search further back than that, I can reopen them.
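
If it helps, the change itself is just the settings block in the index template (a rough sketch of the kind of values I used; check what your logstash-template.json already has, since your paths and defaults may differ):

"settings": {
  "number_of_shards": 4,
  "number_of_replicas": 0
}

On a single-node sensor the replicas cannot be allocated anywhere anyway, so 0 is the usual choice there.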

Here are a few things that helped. Keep the searches as small and targeted as possible. Avoid submitting a query while another one is still processing. And the hardest one: practice digital patience. If it seems like Elasticsearch is just not responding, you can always flush the indices. From the master you can run something like: salt '*' cmd.run 'curl -XPOST "localhost:9200/<index name>/_flush"' (writing this from memory :) ). It basically resets the shards and clears any searches.

I ended up redeploying my sensor grid with forward-only nodes. It does increase bandwidth utilization (not a big deal for my network), but the searches are much faster. It does, however, require a hefty master node; I would recommend at least 16 CPU cores.
Hope this helps.

James Gordon

May 13, 2018, 10:28:39 AM
to security-onion
Thanks for the information, Bryant!

I increased the shard count per index on the busier sensors to four, and upped it on the less busy sensors, too. I ended up resetting all my Elastic instances (using so-elastic-reset) after increasing the shard count, and today searches have been far more responsive than in the past - which I would also expect with the significantly reduced dataset :)

I'll keep monitoring to see if this resolves the slow searches; otherwise, I'll look into adding dedicated storage nodes.

One thing I tried to do (unsuccessfully) was increase the shard count for only the logstash-bro indices. I followed the steps here https://github.com/Security-Onion-Solutions/security-onion/wiki/Elasticsearch to increase the shard count, but made a separate entry for the logstash-bro indices and removed logstash-bro from the original entry. When Logstash started, I found it was generating four shards for logstash-syslog. Other than Bro, all of my other indices are relatively small. Are there significant disadvantages to sharding already-small indices? I imagine there is some CPU overhead associated with needlessly searching small shards, but if the impact is minimal then it might not be worth worrying about.

Additionally, after running so-elastic-reset, I am receiving a "Courier Fetch: shards failed" error in Kibana. I was able to track this down to replica elastalert_status shards on the master server, and one sensor that is attempting to create replica logstash-syslog shards. I ran so-elastic-reset again on those systems, but they still attempt to create replica shards. Where in Security Onion's Elastic Stack implementation are replica shards configurable, so that I can disable them and delete the replicas?
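
For what it's worth, I believe the replica count on indices that already exist can be dropped directly through the settings API; a rough sketch (the index pattern here is just an example):

# Drop replicas on existing indices matching the pattern
curl -XPUT "localhost:9200/logstash-syslog-*/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'

That would only cover indices that already exist, though; anything created afterwards would still pick up whatever the template specifies, which is why I'm asking where the template setting lives.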

Thanks,

James Gordon

Bryant Treacle

May 14, 2018, 9:04:47 AM
to securit...@googlegroups.com
James,     

It is awesome to hear that your response times have improved. Regarding your attempt to use different shard counts for different indices, here is the approach I would take. Currently the logstash-template.json file has the mappings for all of the logstash-* indices. If you wanted to change the shards on just logstash-bro, you could make a copy of logstash-template.json, rename it, and then point to it in the template portion of the 9000_output_bro.conf file. That should get you the desired outcome. Note: you may want to test it first; logically it should work, but I have never actually split them before.
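
Roughly, the template portion of the elasticsearch output in 9000_output_bro.conf would end up looking something like this (a sketch from memory; the paths, template name, and index pattern are examples, not necessarily what ships with Security Onion):

elasticsearch {
  hosts => ["127.0.0.1"]
  index => "logstash-bro-%{+YYYY.MM.dd}"
  template => "/etc/logstash/logstash-bro-template.json"
  template_name => "logstash-bro"
  template_overwrite => true
}

Your copied template would then carry the higher number_of_shards for just the Bro indices.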

One thing that I am unsure of is that Elasticsearch has its own template for the logstash-* indices. You can see this by typing GET _template/logstash-* in the Kibana Dev Tools. I believe that the 'template_overwrite => true' statement in the 9000_output_bro.conf file will supersede the Elasticsearch template, but again, I have not tested this.

As far as having a lot of shards with small indexes: I did read that it is not preferred because it is not an efficient use of RAM/CPU cycles, but a little over-allocation is okay. The blog post I read used 1,000 shards as an example of overkill; Elastic titled it "Kagillion Shards". Here is the link: https://www.elastic.co/guide/en/elasticsearch/guide/current/kagillion-shards.html

The replica shards are added in the logstash-template.json file for all logstash-* indices.  As for elastalert_status, is it creating a replica on each sensor? 

In previous candidate release versions I toyed with creating my own template for elastalert so that I could manipulate the number of shards and replicas. I believe the Python script for elastalert contains the field mappings, but it uses Elasticsearch's default template, which is 5 primary shards and 1 replica. I created a template using the elastalert mappings and assigned the number of shards myself.
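
The template I sketched out looked roughly like this in the Kibana Dev Tools (the shard and replica counts are just example values, and on older Elasticsearch versions the "index_patterns" field is called "template" instead):

PUT _template/elastalert
{
  "index_patterns": ["elastalert_status*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}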

I hope this helps.
Bryant

James Gordon

May 16, 2018, 8:18:44 AM
to security-onion
So now that Elasticsearch has had some time to ingest data, I'm back to experiencing slow searches. They're still faster than they were prior to increasing the shard count, but still slower than I'd like to see. I've noticed two things that seem indicative of larger problems, but I'm not sure of the best way to troubleshoot them:

1. The reported "query duration" in Kibana is almost as long as the "request duration". My understanding is that the query duration is the time it takes Elasticsearch to ask the question, and the request duration is the time it takes to get a response... it looks like many of my queries are being bottlenecked prior to actually searching for the data (see the slowlog note below the GC logs). Here are the stats from a recent run of the "Total Log Count Over Time" visualization on the Home dashboard, for data over the past 24 hours:

Query Duration: 68880ms
Request Duration: 69041ms
Hits: 1114230676
Index: *:logstash-*
Id: *:logstash-*

2. I've noticed intermittent long garbage collections on several nodes. This happens sporadically, but once a sensor starts experiencing these long GCs, I have to restart Elasticsearch (using so-elasticsearch-restart) to get things back under control.

ES logs as follows:
[2018-05-15T01:37:51,979][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][1144] overhead, spent [34.7s] collecting in the last [35.3s]
[2018-05-15T01:38:21,570][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][old][1145][22] duration [29.2s], collections [1]/[29.5s], total [29.2s]/[9.3m], memory [14.6gb]->[14.7gb]/[15.7gb], all_pools {[young] [1.7gb]->[1.7gb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [12.9gb]->[12.9gb]/[12.9gb]}
[2018-05-15T01:38:22,462][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][1145] overhead, spent [29.2s] collecting in the last [29.5s]
[2018-05-15T01:38:57,821][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][old][1146][23] duration [35.9s], collections [1]/[36.2s], total [35.9s]/[9.9m], memory [14.7gb]->[14.8gb]/[15.7gb], all_pools {[young] [1.7gb]->[1.8gb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [12.9gb]->[12.9gb]/[12.9gb]}
[2018-05-15T01:38:57,822][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][1146] overhead, spent [35.9s] collecting in the last [36.2s]
[2018-05-15T01:39:28,750][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][old][1147][24] duration [30.6s], collections [1]/[30.1s], total [30.6s]/[10.4m], memory [14.8gb]->[14.9gb]/[15.7gb], all_pools {[young] [1.8gb]->[2gb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [12.9gb]->[12.9gb]/[12.9gb]}
[2018-05-15T01:39:28,750][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][1147] overhead, spent [30.6s] collecting in the last [30.1s]
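
To try to narrow down where the query time is actually going (the slowlog note mentioned above), I'm thinking about enabling the index search slowlog on the logstash indices; a rough sketch, with arbitrary thresholds:

curl -XPUT "localhost:9200/logstash-*/_settings" -H 'Content-Type: application/json' -d '
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'

Anything slower than the thresholds should then show up in the per-index slowlog file rather than just as an overall slow dashboard.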


I read in this blog post (https://www.elastic.co/blog/a-heap-of-trouble) that long garbage collections are often a result of heaps being too large. I tested this on one sensor by reducing the heap size by a few GB, and it immediately suffered long garbage collections. Normally it takes a little while after restarting ES for the issue to appear, so I raised the heap back up to 25 GB (as set by sosetup).
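
In case it's useful, this is how I've been keeping an eye on heap pressure across the nodes while testing (assuming the default port; the column names come from the _cat/nodes API):

# Show per-node heap usage
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"

From what I've read, nodes that sit persistently near the top of their heap are the ones most likely to fall into these long GC cycles.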

Any suggestions on what to try or where to look for next troubleshooting steps?

Thanks,

James Gordon

James Gordon

May 29, 2018, 3:27:15 PM
to security-onion
I have a slight update on this issue: I've noticed that the long garbage collections occur every time I load the Security Onion HTTP dashboard. The dashboard loads everything except the 'http-summary' visualization. While trying to load that visualization, several of my sensors start to experience long GCs. As a result, Logstash loses connectivity to ES and logs start to drop, and queries time out until Elasticsearch is either restarted or has had enough time to recover on its own.

Not sure if this behavior helps in troubleshooting or not, but I'm still open to ideas on where to look :)

Thanks,

James Gordon

Wes Lambert

May 30, 2018, 7:55:44 AM
to securit...@googlegroups.com
James,

Have you tried removing the HTTP visualization as a test to see if the dashboard loads? I would also try adjusting the heap some more -- try cutting it in half, and maybe also reducing it a GB at a time from the top and testing from there.

Thanks,
Wes


Joe Lane

Jun 20, 2018, 11:26:36 AM
to security-onion

James, did you ever make any headway on this? I seem to be experiencing the same thing.

Doug Burks

Jun 21, 2018, 6:39:45 AM
to securit...@googlegroups.com
The "HTTP - Summary" visualization pulls a lot of data and is causing performance issues on some installations, so we will likely be removing it from the HTTP dashboard in a future release.  So you might want to go ahead and manually remove it from your HTTP dashboard today to see if that helps your performance.

--
Doug Burks
CEO
Security Onion Solutions, LLC

Audrius J

Jun 21, 2018, 4:01:52 PM
to security-onion
Hi,

One additional option you can try is to split the Bro index into several indices. Know your data and split accordingly: for example, every Bro type is currently sent to logstash-bro-*, but you could send bro_conn to a logstash-bro-conn index, bro_files to a logstash-bro-files index, and so on, and then point your saved searches at the relevant index. Instead of querying all of the indices, a search would then only query the relevant ones, so performance will certainly increase. This probably will not help if you try to load, for example, a month of conn logs, because of the sheer size you will still get a timeout.

The problem we actually face here is one that Elastic would normally solve by adding more hosts to the cluster (horizontal scaling). But because of the sensor design this is not so easy to achieve, since each sensor consists of a single node.
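
As for the split itself, the change would roughly be in the Logstash elasticsearch output, something like this (assuming the Bro log type ends up in a field like [type]; the field name and index pattern are just examples, not what Security Onion ships with):

elasticsearch {
  hosts => ["127.0.0.1"]
  # Routes each Bro type to its own daily index, e.g. logstash-bro-conn-2018.06.21
  index => "logstash-bro-%{[type]}-%{+YYYY.MM.dd}"
}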

I was playing with one of my setups, where the initial Bro index size was about 360 GB.
Using these techniques I was able to improve performance, but it is still not sufficient for me, because queries are still too slow. I will try adding SSDs to check how much that improves things...
If that does not help, the only remaining option (from my point of view) is to send the logs to storage nodes.

Regards,
Audrius
