Here is the initial test, along with my insights.
Test info: no load, 44 GB of pcaps, 4.3 million events, Snort/Bro - 6 instances each, ES 30 GB RAM, LS 8 GB RAM + 12 workers. Bare-metal installation.
DomainStats and Freq Analysis - set to "no".
SysInfo: r730xd Raid60 (4x12TB) + 2x128GB SSD, 128 GB RAM, CPU E5-2630 v4 @2.20GHz, 40 Cores
P.S. "No load" means that no heavy queries were executed during these tests.
Pcaps were replayed using the command "tcpreplay -ieth1 -M$Speed /pcaps/data_ssd/*"
$Speed stands for the different replay speeds used during the tests, e.g. M100, M200, etc.
All pcaps were placed on an SSD drive and replayed from there.
Results:
Test Name | Total Time(s) | EPS (calculated) | Events per min (avg) | Actual pcap replayed speed (Mbps)
M100 | 3489 | 1358 | 81465 | 99
M200 | 1800 | 2475 | 148470 | 192
M300 | 1257 | 3487 | 209200 | 275
M400 | 978 | 4413 | 264764 | 353
M500 | 832 | 5182 | 310937 | 414
M600 | 727 | 5882 | 352944 | 475
M700 | 642 | 6667 | 400000 | 538
M800 | 590 | 7200 | 432000 | 585
M900 | 539 | 7848 | 470885 | 641
M1000 | 508 | 8333 | 500000 | 680
P.S. For each test, the first and last 1-2 minutes of results were removed from the evaluation (to get the most even distribution in the histogram, because a test could start/end in the middle of the first/last minute). The remaining per-minute counts were then averaged and divided by 60 to get EPS.
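For example, for M100 the average was about 81,465 events per minute, which gives 81,465 / 60 ≈ 1,358 EPS, as shown in the table above.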
Please see the attached picture for better understanding; the red border marks what was evaluated.
Log Type Distribution
Log Type | Count | Distribution (%)
bro_conn | 1172894 | 27.2033
bro_files | 1134783 | 26.3194
bro_http | 602526 | 13.9746
bro_dns | 491361 | 11.3963
bro_ssl | 443372 | 10.2833
bro_x509 | 286088 | 6.6353
bro_syslog | 71040 | 1.6477
bro_weird | 60403 | 1.4009
snort | 25392 | 0.5889
bro_ssh | 9669 | 0.2243
bro_smtp | 6775 | 0.1571
bro_snmp | 3239 | 0.0751
bro_software | 3157 | 0.0732
bro_pe | 365 | 0.0085
ossec_archive | 244 | 0.0057
CRON | 191 | 0.0044
bro_notice | 36 | 0.0008
bro_ftp | 32 | 0.0007
ossec | 13 | 0.0003
su | 9 | 0.0002
My remarks
The pcap files were from a real user environment. Results may vary depending on how many flows users produce.
In general, I expected worse results, but with some tuning and additional testing it may show good results. At least I was happy to see that I can't kill the cluster easily :)
Problems:
- Sharding. For a few million logs ES works well. So I ran a script which replays the 44 GB of pcap data 100 times (with the configuration of test M700). After 6-8 hours my bro index grew to ~200 GB, so if you want to visualize data for those 8 hours, your query will time out. To my knowledge, a "searchable shard" should normally be ~15 GB, or 30-50 GB in general.
The good thing is that the query just times out and does not bring your cluster down to a red state. The bad thing is that if you do not redefine your search (e.g. change the time frame), some dashboards may fail to render, because their queries time out.
During this specific test the persistent queue fills up as well (1 GB), which means LS should tell syslog-ng to slow down (this is called back-pressure on the inputs). If syslog-ng doesn't have a buffer, then after Bro rotates its logs not all logs will be sent (at least that is how it should behave), because Logstash can't ingest them all.
- Stats Dashboard - it seems something is wrong here. Based on the data in the stats dashboard, I should expect only 10 EPS per LS pipeline, which is not true. Because of this I could not use it for my tests. So I used a small trick and added a new field called EventDate (the timestamp when the event was processed), and did the date histogram analysis based on this field. I found that this produces accurate results.
In general, Logstash now has a metrics filter that can be used for performance monitoring (https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html).
For every event the following filter was used:
filter {
  mutate {
    add_field => { "EventDate" => "%{@timestamp}" }
  }
}
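For comparison, here is a minimal sketch of the metrics filter mentioned above, adapted from the Logstash documentation (the meter name and the stdout output are only illustrative; you would normally keep the existing outputs):

filter {
  metrics {
    # count every event and periodically emit a synthetic event with rate fields
    meter => "events"
    add_tag => "metric"
  }
}
output {
  # print only the generated metric events, not the original logs
  if "metric" in [tags] {
    stdout {
      codec => line { format => "1m rate: %{[events][rate_1m]}" }
    }
  }
}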
Possible solutions:
- I still haven't had time to dive deep into the entire project, so my suggestions might be wrong, but here they are:
a) Increase the number of shards (not sure if this will help). I will try to test this, but I think it should help...
b) For high-speed installations we could move to an hourly index (see the sketch after this list). But we need to be careful about how many indices the cluster will end up with (https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster).
c) Probably increase the persistent queue to have a good buffer. I don't know whether there are any limitations or problems with that;
d) The Logstash metrics filter would probably be better for evaluating performance.
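Regarding b), a rough sketch of what an hourly index could look like in the Logstash elasticsearch output (the host and index name pattern are placeholders, not the actual Security Onion config):

output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    # hourly index instead of a daily one - illustrative name pattern
    index => "logstash-bro-%{+YYYY.MM.dd.HH}"
  }
}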
Additional recommended improvements:
- I would like to see scripts to restart each Docker container separately.
That's all for now!
Regards,
Audrius
You are more than welcome! This is just a small thing I can do to support this great project!
Now to answer your questions:
1. Were you finding that DomainStats and/or FreqServer were causing performance issues?
The main idea for now was to get performance info about the core components, which is really needed for the new transition.
I think DomainStats and FreqServer are great additions, and I will try to run the same tests with these features turned on.
Based on that data we will be able to compare under the same conditions and see the penalty they introduce.
1a. Our current version of syslog-ng does not have any buffer. So perhaps
we should go ahead and move Bro logs from syslog-ng to being collected
directly by Logstash. Thoughts?
This question is quite tricky. syslog-ng is a very fast and flexible program, and it is perfect if you want to replicate the same information to two or more destinations. I think that in previous versions of syslog-ng only the commercial edition had a buffer, but now the open source edition can have one too (https://www.balabit.com/documents/syslog-ng-ose-latest-guides/en/syslog-ng-ose-guide-admin/html/what-syslog-ng-is.html and https://syslog-ng.org/493-2/).
It is also very reliable, so if the buffer is enabled we will not lose logs during a Logstash crash, log rotation, etc.
So users will be able to safely restart Logstash...
Also, this can always be changed later.
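For illustration, a rough sketch of a syslog-ng OSE destination with a reliable disk buffer (the address, port, and buffer sizes are placeholders, not the actual Security Onion configuration):

# network destination to Logstash with a reliable disk buffer
destination d_logstash {
    network("127.0.0.1" port(6050)
        disk-buffer(
            disk-buf-size(2147483648)
            mem-buf-size(10485760)
            reliable(yes)
        )
    );
};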
2. Are you referring to the Processing Time metrics? These metrics are
based on the logstash_time field, which is calculated in the following
Logstash filters:
1000_preprocess_log_elapsed.conf
8998_postprocess_log_elapsed.conf
I think the way they are calculated may be more representative of the
processing time of a batch of events rather than individual events.
Yes, I am referring to them. The main thing for me is to get information about how the system performs. If they are misleading, we could skip them and save CPU cycles. In the meantime we can think about using the metrics plugin to get better metrics.
3. We've defaulted shards to 1 for the use case of running Evaluation
Mode (NOT running in production). Perhaps for users selecting
Production Mode, we should go back to the Elasticsearch default of 5
shards. Thoughts?
Here are two points:
a) We can go with 5 shards, but we also need to provide info to the user on how they can change that.
b) Or we can use a second approach (see the sketch after this list): check the size of the index (with GET _cat/indices from the Kibana console or using curl) and divide by 30.
If the size is 150 GB, dividing by 30 gives 5, so we can skip increasing the number of shards.
If the size of the index is 210 GB, dividing by 30 gives 7, so we need to increase the number of shards to 7 (or 7+1).
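A rough shell sketch of that check (the index name pattern is just an example):

# list index sizes; look at the pri.store.size column
curl -s 'localhost:9200/_cat/indices/logstash-bro-*?v&h=index,pri.store.size'
# e.g. 210 GB / 30 GB per shard ≈ 7, so set number_of_shards to 7 (or 8) in the index template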
4. The current queue size is probably sufficient for Evaluation Mode, but
we should probably increase for Production Mode. Perhaps make it a
percentage of total disk space. Thoughts?
I think yes, we can increase the persistent queue. Since it is a buffer, we can make it as big as we want; in that case all logs stay in the buffer until Logstash consumes them. It is probably a good idea to make it a percentage of total disk space, but not less than 50 GB. If users need to change it they can always do so, and this information can be provided in the wiki. I will try to test it...
50 GB will probably be able to hold the information for an entire day (keep in mind that this is just the raw message, not the enriched event).
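For reference, the relevant Logstash settings would be roughly as follows (the size is only illustrative):

# logstash.yml
queue.type: persisted
queue.max_bytes: 50gb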
Regards,
Audrius
I second the use of Logstash for the pipeline, as Logstash will throttle the flow as needed. Beats to Logstash is even better for stream management, throttling, and recovery under the hood.
You make great points here; it could be done something like that.
I will try to test this too, as I said in the previous post.
Speaking about JSON, one huge benefit is that we don't need to be afraid of new fields - they are always in place because of the JSON structure.
Also, if a field's value is "-", it simply isn't present in the JSON, so we even save some space.
Performance should also increase (probably).
Of course, the parsers will need to be rewritten, or at least changed.
I can't promise anything, but I will try to make some small changes and do a test run...
Regards,
Audrius
I tried to perform the same tests with the DomainStats and Freq options enabled, and the results are not very good. We have performance issues.
For example, with test M600 the packet replay takes ~730 s, but it takes about 30 minutes for Logstash to finish processing. Also, after the queue is empty I have lost about 30% of the logs.
If I increase the persistent queue, it takes the same amount of time, but at least I do not lose any logs.
Also, it is strange to me that syslog-ng doesn't slow down log shipping to Logstash and the connection is (probably) just rejected when the LS buffer is full...
I tried it with DomainStats enabled and Freq disabled, and vice versa.
Both have similar issues...
I will try to investigate more...
Regards,
Audrius
Thanks for the update!
I will try to perform testing on this next week as well.
I think the best way to upgrade to the new release is just to start with a fresh installation?
Regards,
Audrius
I finally had time to finish the tests.
The results are very similar to the alpha release.
Test info: no load, 44 GB of pcaps, 4.3 million events, Snort/Bro - 6 instances each, ES 30 GB RAM, LS 8 GB RAM + 12 workers. Bare-metal installation.
DomainStats and Freq Analysis - set to "no".
SysInfo: r730xd Raid60 (4x12TB) + 2x128GB SSD, 128 GB RAM, CPU E5-2630 v4 @2.20GHz, 40 Cores
Release: Beta
P.S. "No load" means that no heavy queries were executed during these tests.
Pcaps were replayed using the command "tcpreplay -ieth1 -M$Speed /pcaps/data_ssd/*"
$Speed stands for the different replay speeds used during the tests, e.g. M100, M200, etc.
All pcaps were placed on an SSD drive and replayed from there.
---------------------------------------------------------------------------------------------------------
Test_Name |Total_Time(s) |EPS_(calculated) |Events_per_min_(avg) |Replayed_speed(Mbps) |Queue
---------------------------------------------------------------------------------------------------------
M100 | 3485 | 1359 | 81565 | 99 |ok
M200 | 1804 | 2474 | 148462 | 191 |ok
M300 | 1256 | 3491 | 209460 | 275 |ok
M400 | 981 | 4407 | 264435 | 351 |ok
M500 | 836 | 5141 | 308466 | 413 |ok
M600 | 731 | 5838 | 350273 | 472 |ok
M700 | 638 | 6512 | 390690 | 541 |queue ~ 400k events
M800 | 594 | 7162 | 429720 | 581 |queue ~ 700k events
M900 | 549 | 7721 | 463258 | 628 |queue > 900k events
M1000 | 525 | 8070 | 484222 | 657 |queue ~ 1mln events
---------------------------------------------------------------------------------------------------------
Regards,
Audrius
Over the past several weeks I performed many different tests, and here are my notes.
With the current configuration there are some problems with parsing CSV logs (especially http and ssl), because some strings that exist in the logs cause Logstash to throw an error, and not much can be done about it, since they violate the CSV formatting. These errors are logged, and that decreases performance. If your system has a lot of users whose main activity is surfing the net, you will experience performance degradation.
To test this better (check the worst-case scenario), I changed some configuration files so that the timestamp was assigned by Logstash rather than taken from the Bro field (ts -> timestamp). This way you can see how long it takes to process logs when the LS queue is full.
This modification also helped with some automation, because you can just reuse your existing logs and append them to the current Bro log like this:
cat conn.log >> /nsm/bro/logs/current/conn.log
For this test the conn, dns, http, and ssl logs were used.
I didn't like the results very much, so I decided to take the JSON road.
So the latest finding is that the best performance is achieved if we switch the Bro logs to JSON and ingest them directly via Logstash (with a configuration very similar to 6001_bro_import.conf). Of course, some tweaking of the so-logstash start script, syslog-ng, local.bro, etc. was needed to make this work.
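For reference, one common way to switch Bro 2.x to JSON logs via local.bro is shown below (this is the standard Bro tuning policy; it is not necessarily the exact change I made):

# local.bro
@load tuning/json-logs
# or, equivalently:
# redef LogAscii::use_json = T;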
In this case I could achieve 15k EPS with an almost empty Logstash queue (GeoIP enrichment was included), with the same server configuration used in the previous tests.
By replaying traffic at M1000 I was not able to reach the limits, so I reused the same idea presented above and created some scripts to copy existing logs and pipe them into the current log (a sketch follows below).
Histograms are attached to this post; pay attention to how many logs were processed with the current (beta3) configuration versus with JSON.
Both tests were performed in the same way; only the parsing methods differed, and some additional configs were changed or removed.
Do not pay too much attention to the spikes; they are caused by my dumb script, which copies logs to the current log and then sleeps for 50 s. Some log types contain more entries, which is why the spikes appear. But it also shows that the logs are processed immediately, so there is no queue in LS.
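The copy-and-sleep script was roughly like this sketch (the source path for the saved logs is illustrative; only /nsm/bro/logs/current comes from my setup):

#!/bin/bash
# Append saved Bro logs to the current logs so Logstash re-ingests them,
# then sleep 50 s and repeat.
while true; do
    for log in conn dns http ssl; do
        cat "/pcaps/data_ssd/bro/${log}.log" >> "/nsm/bro/logs/current/${log}.log"
    done
    sleep 50
done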
So I think this approach (JSON) should be taken into consideration.
Also, by using this wiki page https://github.com/Security-Onion-Solutions/security-onion/wiki/Bro-Fields and the LS field-renaming functionality (and removing some config files), I got almost all dashboards working without any change, but some custom parsing was not working properly, so for the test I just removed it...
Of course this test is a very extreme case, but it should allow us to reach 1 Gbps...
Regards,
Audrius
After using this new configuration, /var/log/logstash/logstash.log looks much better, but performance is very similar.
You could also try to improve performance by using the dissect filter. Take a look here: https://www.elastic.co/blog/logstash-dude-wheres-my-chainsaw-i-need-to-dissect-my-logs.
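A minimal, hypothetical dissect sketch (the field layout below is made up for illustration; the real Bro columns would need to be mapped explicitly):

filter {
  dissect {
    # split a delimited message into named fields without the cost of regex parsing
    mapping => {
      "message" => "%{ts} %{src_ip} %{dst_ip} %{rest}"
    }
  }
}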
Audrius