1. What's the best way to configure Elasticsearch in Security Onion to limit shard size? I did some digging through the Elastic documentation, but as someone who has never configured Elasticsearch outside of Security Onion's easy installation/configuration tools, I'm not sure of the best way to do this. From what I can see, the recommended method is to reduce the timeframe that individual indices cover.
2. Would it be recommended to align this configuration across all sensors, or only limit index time ranges on the sensors with shards larger than 50GB?
3. I imagine that any configuration change will only affect indices created in the future. I don't mind spending some CPU power to re-shard old indices, but I'm not really sure how. Are there documented steps for doing this?
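My guess is that re-sharding existing data means reindexing each old index into a new one that picks up the updated shard settings, something along these lines (the index names are just placeholders, and I don't know whether this is the supported approach on a Security Onion box):

  curl -XPOST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d '
  {
    "source": { "index": "logstash-bro-2018.05.01" },
    "dest":   { "index": "logstash-bro-2018.05.01-resharded" }
  }'

...but I'd appreciate confirmation before I start tearing up indices.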
Thanks!
James Gordon
I just experienced a similar issue with my distributed deployment. I did a bunch of reading and a lot of trial-and-error tweaking of the number of shards, the shard timeout, and the Kibana timeout. Here is what I learned.
When an Elasticsearch node receives a request from Kibana, it sends that request to each of the shards on the sensors and then waits for a response from all shards before returning the response to Kibana. So a single slow shard can impact the search time, and it may be difficult to identify the bottleneck. I ended up changing to 4 shards per index per sensor (about 20). I really did not notice any significant change in speed. I also use curator to close my indexes after 7 days; if I need to search further back than that, I can reopen them.
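For what it's worth, the close/reopen piece doesn't require curator; you can do it ad hoc against any index from the master (the index name below is just an example):

  curl -XPOST "localhost:9200/logstash-bro-2018.05.01/_close"
  curl -XPOST "localhost:9200/logstash-bro-2018.05.01/_open"

Closed indices stay on disk but don't hold heap or get searched, which is the whole point.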
Here are a few things that helped. Keep searches as small and direct as possible. Avoid submitting a query while another one is still processing. And the hardest one: practice digital patience. If it seems like Elasticsearch is just not responding, you can always flush the indices. From the master you can run: salt '*' cmd.run 'curl -X POST "localhost:9200/<index name>/_flush"' (writing this from memory :) ). It basically resets the shards and clears any in-flight searches.
I ended up redeploying my sensor grid with forward-only nodes. It does increase bandwidth utilization (not a big deal for my network), but the searches are much faster. It does, however, require a hefty master node; I would recommend at least 16 CPU cores.
Hope this helps.
I increased the shard count per index on the busier sensors up to four, and upped it on the less busy sensors, too. I ended up resetting all my Elastic instances (using so-elastic-reset) after increasing the shard count, and today searches have been far more responsive than in the past - which I would also expect given the significantly reduced dataset :)
I'll monitor going forward to see if this resolves the search speeds; otherwise, I'll look into adding dedicated storage nodes.
One thing I tried to do (unsuccessfully) was increase the shard count only for the logstash-bro indices. I followed the steps here https://github.com/Security-Onion-Solutions/security-onion/wiki/Elasticsearch to increase the shard count, but made a separate entry for the logstash-bro indices and removed logstash-bro from the original entry. When Logstash started, I found it was generating four shards for logstash-syslog. Other than Bro, all of my other indices are relatively small. Are there significant disadvantages to sharding already-small indices? I imagine there is some CPU overhead associated with needlessly searching small shards, but if the impact is minimal then it might not be worth worrying about.
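To be clear about what I was going for, conceptually it's an additional, higher-order template that matches only the Bro indices, something like this (assuming the legacy _template API; the match key is "template" on 5.x and "index_patterns" on 6.x, and the template name here is made up):

  curl -XPUT "localhost:9200/_template/logstash-bro-shards" -H 'Content-Type: application/json' -d '
  {
    "template": "logstash-bro-*",
    "order": 1,
    "settings": { "index.number_of_shards": 4 }
  }'

I'm just not sure how something like that is supposed to interact with the entries described in the wiki steps.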
Additionally, after running so-elastic-reset, I am receiving "courier fetch" shard failure errors in Kibana. I was able to track this down to replica elastalert_status shards on the master server, and one sensor that is attempting to create replica logstash-syslog shards. I ran through so-elastic-reset on those systems, but they still attempt to create replica shards. Where in Security Onion's Elastic Stack implementation are replica shards configured, so I can disable them and delete the existing replicas?
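I assume I could force the issue on the indices that already exist with something like the following (which, as I understand it, should also drop the existing replica copies), but I'd rather find where the default comes from so the replicas stop being created in the first place:

  curl -XPUT "localhost:9200/elastalert_status/_settings" -H 'Content-Type: application/json' -d '
  { "index": { "number_of_replicas": 0 } }'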
Thanks,
James Gordon
Thanks for the information, Bryant. Search performance still isn't where I'd like it to be, though, and two things stand out:
1. The reported "query duration" in Kibana is almost as long as the "request duration". My understanding is that the query duration is the time it takes Elasticsearch to ask the question, and the request duration is the time it takes to get a response... it looks like many of my queries are being bottlenecked before actually searching for the data. Here are the stats from a recent run of the "Total Log Count Over Time" visualization on the Home dashboard for data over the past 24 hours.
Query Duration 68880ms
Request Duration 69041ms
Hits 1114230676
Index *:logstash-*
Id *:logstash-*
2. I've noticed intermittent long garbage collections on several nodes. This happens sporadically, but once a sensor starts experiencing these long GCs, I have to restart Elasticsearch (using so-elasticsearch-restart) to get things back under control.
ES logs as follows:
[2018-05-15T01:37:51,979][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][1144] overhead, spent [34.7s] collecting in the last [35.3s]
[2018-05-15T01:38:21,570][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][old][1145][22] duration [29.2s], collections [1]/[29.5s], total [29.2s]/[9.3m], memory [14.6gb]->[14.7gb]/[15.7gb], all_pools {[young] [1.7gb]->[1.7gb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [12.9gb]->[12.9gb]/[12.9gb]}
[2018-05-15T01:38:22,462][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][1145] overhead, spent [29.2s] collecting in the last [29.5s]
[2018-05-15T01:38:57,821][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][old][1146][23] duration [35.9s], collections [1]/[36.2s], total [35.9s]/[9.9m], memory [14.7gb]->[14.8gb]/[15.7gb], all_pools {[young] [1.7gb]->[1.8gb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [12.9gb]->[12.9gb]/[12.9gb]}
[2018-05-15T01:38:57,822][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][1146] overhead, spent [35.9s] collecting in the last [36.2s]
[2018-05-15T01:39:28,750][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][old][1147][24] duration [30.6s], collections [1]/[30.1s], total [30.6s]/[10.4m], memory [14.8gb]->[14.9gb]/[15.7gb], all_pools {[young] [1.8gb]->[2gb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [12.9gb]->[12.9gb]/[12.9gb]}
[2018-05-15T01:39:28,750][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][1147] overhead, spent [30.6s] collecting in the last [30.1s]
I read in this blog post ( https://www.elastic.co/blog/a-heap-of-trouble ) that long garbage collections are often a result of heaps being too large. I tested this on one sensor by reducing the heap size by a few GB, and it immediately suffered long garbage collections. Normally it takes a little while after restarting ES for this issue to show up, so I raised the heap back up to 25G (as set by sosetup).
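In case it's useful for troubleshooting, a quick way to watch heap pressure per node is something like:

  curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max"

Based on the old-gen numbers in the logs above (12.9gb used of a 12.9gb old pool), I'd expect the affected node to show heap.percent pinned in the low-to-mid 90s while this is happening.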
Any suggestions on what to try next or where to look for further troubleshooting?
Thanks,
James Gordon
Not sure if this behavior helps with troubleshooting or not, but I'm still open to ideas on where to look next :)
Thanks,
James Gordon
James, did you ever make any headway on this? I seem to be experiencing the same.
One additional option you can try is to split the Bro index into several indices.
Know your data and split accordingly. For example, every Bro type is currently sent to logstash-bro-*. You can instead split bro_conn into a logstash-bro-conn index, bro_files into a logstash-bro-files index, and so on, and then point your saved searches at the relevant index. Instead of querying all of the indices, a search will only query the relevant ones, so performance will definitely improve.
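As a sketch of the difference (the index names here assume the split described above), a conn-focused saved search ends up querying only

  curl -XGET "localhost:9200/logstash-bro-conn-*/_search?pretty" -H 'Content-Type: application/json' -d '
  { "query": { "match_all": {} }, "size": 1 }'

instead of fanning out across every shard behind logstash-bro-*.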
But this probably will not help if you try to load, for example, a month of conn logs; because of the size, you will still get a timeout.
What we are really facing here is a problem that Elasticsearch is meant to solve with additional hosts in a cluster (horizontal scaling). But because of the sensor design this is not so easy to achieve, since each sensor consists of a single node.
I was experimenting with one of my setups, where the initial Bro index size was about 360GB.
By using these techniques I was able to improve performance, but it is still not sufficient for me, because queries are still too slow. I will try adding SSDs to see how much that improves things...
If that does not help, the only remaining option (from my point of view) is to send the logs to storage nodes.
Regards,
Audrius